Non-coding RNAs in Complex Diseases

This book offers an in-depth and comprehensive review of the current understanding of regulatory ncRNAs in complex diseases from a bioinformatics perspective. It presents state-of-the-art bioinformatics tools and methods for ncRNAs, from computational detection and functional prediction to their roles in diseases. Computational methods used to investigate uncharacterised ncRNAs in diseases are summarized in seven aspects: DNA variation of ncRNAs in diseases, prioritization of disease-related ncRNAs, dysregulated epigenetic factors that drive ncRNA misexpression (DNA methylation and histone modification), complex crosstalk across ncRNAs, ncRNAs acting as competing regulators that mediate the expression of protein-coding genes, risk pathways mediated by non-coding RNAs, and their contributions to drug target prediction. Commonly used ncRNA data resources are also listed. This book provides important information on current progress in the fast-moving field of bioinformatics for regulatory ncRNAs. It is a timely and useful reference for computational biologists, especially those with an interest in RNA, and for researchers in related areas.

Prof. Xia Li is a Professor and the Dean of the College of Bioinformatics Science and Technology, Harbin Medical University, China. Dr. Yun Xiao, Dr. Juan Xu, Dr. Shangwei Ning and Dr. Yunpeng Zhang are from the College of Bioinformatics Science and Technology, Harbin Medical University, China.




Advances in Experimental Medicine and Biology 1094

Xia Li · Juan Xu · Yun Xiao · Shangwei Ning · Yunpeng Zhang Editors

Non-coding RNAs in Complex Diseases A Bioinformatics Perspective

Advances in Experimental Medicine and Biology Volume 1094 Editorial Boards IRUN R. COHEN, The Weizmann Institute of Science, Rehovot, Israel ABEL LAJTHA, N.S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA JOHN D. LAMBRIS, University of Pennsylvania, Philadelphia, PA, USA RODOLFO PAOLETTI, University of Milan, Milan, Italy NIMA REZAEI, Tehran University of Medical Sciences Children’s Medical Center, Children’s Medical Center Hospital, Tehran, Iran

More information about this series at http://www.springer.com/series/5584


Editors Xia Li College of Bioinformatics Science and Technology Harbin Medical University Harbin, Heilongjiang, China

Juan Xu College of Bioinformatics Science and Technology Harbin Medical University Harbin, China

Yun Xiao College of Bioinformatics Science and Technology Harbin Medical University Harbin, China

Shangwei Ning College of Bioinformatics Science and Technology Harbin Medical University Harbin, China

Yunpeng Zhang College of Bioinformatics Science and Technology Harbin Medical University Harbin, China

ISSN 0065-2598, ISSN 2214-8019 (electronic)
Advances in Experimental Medicine and Biology
ISBN 978-981-13-0718-8, ISBN 978-981-13-0719-5 (eBook)
https://doi.org/10.1007/978-981-13-0719-5
Library of Congress Control Number: 2018952454

© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Noncoding RNAs (ncRNAs), especially microRNAs (miRNAs) and long noncoding RNAs (lncRNAs), have been stars of scientific research over the last decade. This is mainly credited to the development of microarray and sequencing techniques, which led to the discovery that nearly 98% of transcripts are not translated into proteins. This fact prompted us to investigate the intricate roles they play in organisms, ranging from cell proliferation and differentiation to apoptosis and tumorigenesis. Surprisingly, organismal complexity correlates better with the diversity and size of noncoding RNA expression repertoires than with those of protein-coding genes, which leads us to believe that RNA-based regulatory mechanisms might partially explain the evolution of species. It is therefore worthwhile to work out the regulation of noncoding RNAs in both development and disease.

In this book, we focus on the study of noncoding RNAs in diseases. We examine ncRNAs from different perspectives so that the book meets the needs of different readers, ranging from experimenters who expect to pinpoint the ncRNAs of interest to those who would like to see a panorama of the relationships between ncRNAs and other molecules. The book provides details about ncRNA resources, data acquisition, and data preparation. It also includes a variety of tools and common bioinformatics methods for dealing with the wealth of data on expression, modification, and variation.

The first three chapters concern the identification of miRNAs and lncRNAs from original microarray and RNA-seq data, as well as the functional characterization and prioritization of these ncRNAs. Chapters 4, 5, and 6 cover mutations and epigenetic modifications of ncRNAs and how these changes or modifications contribute to diseases. The following three chapters discuss the interactions between different ncRNAs and between ncRNAs and mRNAs. These ncRNAs may mediate the disruption of biological pathways or compete to bind or degrade mRNAs, thus affecting mRNA expression. In Chap. 10, we attempt to predict drug targets through three different approaches. Chapter 11 is a collection of ncRNA resources, which are classified into categories such as disease resources and variant resources. Finally, to explore the interaction between ncRNAs and proteins, Chap. 12 presents both experimental and computational methods.

This book is the crystallization of the work of researchers who have devoted years to their respective areas. It draws on information about ncRNAs gathered over the last decade, which makes it especially suitable for readers who decide to dedicate themselves to the study of ncRNAs. Although the authors have worked out some of the complex regulatory roles that ncRNAs play, much remains to be understood about the intricate relationships among RNAs, proteins, and other molecules, as well as how these molecules and relationships shape who we are. We hope that all readers benefit from the book.

Harbin, Heilongjiang, China

Xia Li

Contents

1 Non-coding RNA Resources
  Shangwei Ning and Xia Li

2 Systematic Identification of Non-coding RNAs
  Yun Xiao, Jing Hu, and Wenkang Yin

3 Functional Characterization of Non-coding RNAs Through Genomic Data Fusion
  Yun Xiao, Min Yan, Chunyu Deng, and Hongying Zhao

4 Genomic-Scale Prioritization of Disease-Related Non-coding RNAs
  Peng Wang and Xia Li

5 Genome-Wide Mapping of SNPs in Non-coding RNAs
  Shangwei Ning and Yunpeng Zhang

6 Profiling DNA Methylation Patterns of Non-coding RNAs (ncRNAs) in Human Disease
  Hui Zhi, Yongsheng Li, and Li Wang

7 Aberrant Epigenetic Modifications of Non-coding RNAs in Human Disease
  Yun Xiao, Jinyuan Xu, and Wenkang Yin

8 Computationally Modeling ncRNA-ncRNA Crosstalk
  Juan Xu, Jing Bai, and Jun Xiao

9 Computational Inferring of Risk Subpathways Mediated by Dysfunctional Non-coding RNAs
  Yanjun Xu, Yunpeng Zhang, and Xia Li

10 Computational Identification of Cross-Talking ceRNAs
  Yongsheng Li, Caiqin Huo, Xiaoyu Lin, and Juan Xu

11 Prediction of Non-coding RNAs as Drug Targets
  Wei Jiang, Yingli Lv, and Shuyuan Wang

12 Methods for Identification of Protein-RNA Interaction
  Juan Xu, Zishan Wang, Xiyun Jin, Lili Li, and Tao Pan

Contributors

Jing Bai, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Chunyu Deng, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Jing Hu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Caiqin Huo, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Wei Jiang, Department of Biomedical Engineering, College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China; College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Xiyun Jin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Lili Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Xia Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
Yongsheng Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Xiaoyu Lin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yingli Lv, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Shangwei Ning, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Tao Pan, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Li Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Peng Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Shuyuan Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Zishan Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Jun Xiao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yun Xiao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Jinyuan Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Juan Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yanjun Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Min Yan, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Wenkang Yin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yunpeng Zhang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Hongying Zhao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Hui Zhi, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China

1 Non-coding RNA Resources

Shangwei Ning and Xia Li

Abstract

Non-coding RNAs (ncRNAs) are functional RNA molecules that are not translated into proteins. A large body of evidence now suggests that ncRNAs are associated with various disease processes, including cancer. Following the rapid development of next-generation sequencing (NGS) technologies, a large amount of ncRNA data has accumulated in a short period of time, and many ncRNA-related databases have been developed. Here we systematically review these specialized ncRNA-related databases. We expect them to serve as useful resources for researchers who wish to further investigate the associations between ncRNAs and human diseases.

Keywords

miRNA · lncRNA · Database · Web server · Disease

1.1 Introduction

Noncoding RNAs (ncRNAs) are a class of functional RNA molecules that do not code for proteins; they include microRNAs (miRNAs), long noncoding RNAs (lncRNAs), small interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), and piwi-interacting RNAs (piRNAs), among others [1]. Overwhelming evidence indicates that various ncRNAs are implicated in human disease processes [2]. For example, miRNAs (~22 nt small ncRNA molecules that repress the expression of their mRNA targets) are important players in various diseases [3]. In recent years, lncRNAs (>200 nt RNA molecules) have been found to play roles in development, evolution and disease [4]. For example, the lncRNA HOTAIR serves as an endogenous "sponge" of miR-141, resulting in significant reduction of glioma growth [5]. Numerous resources specific to miRNAs and lncRNAs have been developed to depict the considerable diversity of ncRNAs and their involvement in important biological processes (Table 1.1). These resources cover diverse characteristics of non-coding RNAs. The primary resources mainly collect or integrate basic annotation and functional information on ncRNA transcripts, such as miRBase [6], NONCODE [7], LNCipedia [8], and LNCat [9]. Another type of resource lists the functions and roles of lncRNAs that participate in disease, such as lncRNAdb [10], LncRNADisease [11] and Lnc2Cancer [12]. The remaining resources explore the regulatory mechanisms by which lncRNAs interact with other functional elements, such as genetic variants, including LincSNP [13], lncRNASNP [14], and LncVar [15]. Other databases focus on RNA editing sites, such as LNCediting [16]. In addition, miRNA-related databases include DIANA-LncBase [17] and ChIPBase [18], and lncRNA function-related databases include LncRNA2Target [19] and LncReg [20]. These databases contribute to further research on the regulatory mechanisms and functions of ncRNAs.

Table 1.1 Non-coding RNA resources

| Database name | Content | Description | Website |
|---|---|---|---|
| ncRNA annotation resources | | | |
| miRBase | miRNA | miRNA sequence information, annotation and predicted target information | http://microrna.sanger.ac.uk/ |
| NONCODE | lncRNAs | Comprehensive annotation of non-coding RNAs, especially lncRNAs | http://www.noncode.org/ |
| LNCipedia | lncRNAs | Basic transcript information, secondary structure information, protein coding potential and miRNA binding sites | http://www.lncipedia.org |
| LNCat | lncRNAs | LncRNA structures, visualization of different resources from multiple angles and download of different combinations of lncRNA annotations | http://biocc.hrbmu.edu.cn/LNCat/ |
| ncRNA-related disease resources | | | |
| HMDD | miRNA | miRNA-disease association data from genetics, epigenetics, circulating miRNAs and miRNA-target interactions | http://cmbi.bjmu.edu.cn/hmdd |
| miR2Disease | miRNA | A comprehensive resource of experimentally verified miRNA-disease relationships | http://www.miR2Disease.org |
| LncRNADisease | lncRNA | Experimentally supported and predicted lncRNA-disease associations; lncRNA interacting partners at various molecular levels | http://cmbi.bjmu.edu.cn/lncrnadisease |
| Lnc2Cancer | lncRNA | Manually curated cancer-associated lncRNAs with experimental support | http://www.bio-bigdata.net/lnc2cancer |
| ncRNA-related variants resources | | | |
| dPORE-miRNA | miRNA | Integrated information on promoter regions of human miRNA genes, SNPs, and predicted TFBSs | http://cbrc.kaust.edu.sa/dpore/ |
| miRdSNP | miRNA | Disease-associated SNPs on the 3′ UTRs of human genes, manually curated from PubMed | http://mirdsnp.ccr.buffalo.edu |
| miRNASNP | miRNA | SNPs in pre-miRNAs of human and other species, and target gain and loss by SNPs in miRNA seed regions or 3′ UTRs of target mRNAs | http://bioinfo.life.hust.edu.cn/miRNASNP2/ |
| MirSNP | miRNA | A collection of human SNPs in predicted miRNA-mRNA binding sites | http://cmbi.bjmu.edu.cn/mirsnp |
| PolymiRTS | miRNA | Genetic polymorphisms in miRNA seed regions and miRNA target sites | http://compbio.uthsc.edu/miRSNP/ |
| MicroSNiPer | miRNA | A collection of SNPs in putative microRNA targets | http://cbdb.nimh.nih.gov/microsniper |
| lncRNASNP | lncRNA | SNPs in lncRNAs and their potential impacts on lncRNA structure and function in human and mouse | http://bioinfo.life.hust.edu.cn/lncRNASNP/ |
| LincSNP | lncRNA | Disease-associated SNPs in human lncRNAs and their transcription factor binding sites (TFBSs) | http://bioinfo.hrbmu.edu.cn/LincSNP |
| Other ncRNA resources | | | |
| DIANA-LncBase | lncRNA | Experimentally supported and in silico predicted miRNA Recognition Elements (MREs) on lncRNAs | www.microrna.gr/LncBase |
| ChIPBase | Gene, ncRNA | Transcriptional regulatory relationships between transcription factors (TFs) and genes from ChIP-seq data | http://rna.sysu.edu.cn/chipbase/ |
| LncRNA2Target | lncRNA | A curated database of lncRNA-to-target gene information | http://www.lncrna2target.org |

S. Ning · X. Li (*), College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China

© Springer Nature Singapore Pte Ltd. 2018. X. Li et al. (eds.), Non-coding RNAs in Complex Diseases, Advances in Experimental Medicine and Biology 1094, https://doi.org/10.1007/978-981-13-0719-5_1

1.2 ncRNA Annotation Resources

1.2.1 miRBase

miRBase aims to provide an integrated resource for comprehensive miRNA sequence information, annotation and predicted target information [6]. The miRBase Registry serves as an independent arbiter of miRNA gene nomenclature, assigning names prior to the publication of novel miRNA sequences. miRBase Sequences is the primary online repository for miRNA sequence data and annotation. miRBase Targets is a database of predicted miRNA target genes. miRBase is available at http://microrna.sanger.ac.uk/.

1.2.2 NONCODE

NONCODE is an interactive database that contains a comprehensive collection and annotation of non-coding RNAs, especially lncRNAs [7]. In its 2016 update, NONCODE covers 16 species, and the number of lncRNAs has increased to 527,336, including 167,150 human and 130,558 mouse lncRNAs. NONCODE has also added several important new features, including: (i) conservation annotation; (ii) the associations between lncRNAs and diseases; and (iii) an interface for selecting high-quality datasets based on predicted scores, literature support and long-read sequencing support. NONCODE is available at http://www.noncode.org/.


1.2.3 LNCipedia

LNCipedia is a database of human lncRNA transcripts and genes obtained from diverse sources [21]. In addition to basic transcript information and gene structure, several further annotations are shown for each entry, such as secondary structure information, protein coding potential and miRNA binding sites. Available articles on specific lncRNAs are linked, and users or authors can submit articles through a web interface. Protein coding potential is assessed by two different prediction algorithms: Coding Potential Calculator and HMMER. Users can query and download lncRNA sequences and structures based on different search criteria. LNCipedia is available at http://www.lncipedia.org.

1.2.4 LNCat

LNCat is a user-friendly database that provides a genome browser of lncRNA structures, visualization of different resources from multiple angles and download of different combinations of lncRNA annotations, and supports rapid exploration, comparison and integration of lncRNA annotation resources [9]. LNCat contains a comprehensive comparison of numerous lncRNA annotations, which can facilitate the understanding of lncRNAs in human disease. LNCat is freely available at http://biocc.hrbmu.edu.cn/LNCat/.

1.3 ncRNA-Associated Disease Resources

1.3.1 HMDD

HMDD is a collection of human miRNA-disease associations, all of which are supported by experimental evidence [22]. In the HMDD database, miRNA-disease associations are annotated in detail and are drawn from genetics, epigenetics, circulating miRNAs and miRNA-target interactions. HMDD also provides data derived from the miRNA-disease associations, including the disease spectrum width of miRNAs and the miRNA spectrum width of human diseases. Users can download all the data in HMDD and submit new data to the database. HMDD is freely accessible at http://cmbi.bjmu.edu.cn/hmdd.
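The "spectrum width" idea can be illustrated with a short sketch. It assumes a simplified definition (the number of distinct diseases linked to a miRNA divided by the number of diseases in the table, and symmetrically for diseases) and uses made-up association pairs. Neither the definition nor the data is taken from HMDD itself, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical (miRNA, disease) pairs for illustration only; not HMDD data.
ASSOCIATIONS = [
    ("miR-A", "disease-1"),
    ("miR-A", "disease-2"),
    ("miR-B", "disease-1"),
    ("miR-A", "disease-1"),  # repeated evidence for the same pair is counted once
]


def spectrum_widths(pairs):
    """Return (disease spectrum width per miRNA, miRNA spectrum width per disease).

    Assumed definition: fraction of all diseases (or miRNAs) in `pairs`
    that are linked to the given miRNA (or disease).
    """
    diseases_by_mirna = defaultdict(set)
    mirnas_by_disease = defaultdict(set)
    for mirna, disease in pairs:
        diseases_by_mirna[mirna].add(disease)
        mirnas_by_disease[disease].add(mirna)
    n_diseases = len(mirnas_by_disease)
    n_mirnas = len(diseases_by_mirna)
    dsw = {m: len(d) / n_diseases for m, d in diseases_by_mirna.items()}
    msw = {d: len(m) / n_mirnas for d, m in mirnas_by_disease.items()}
    return dsw, msw


dsw, msw = spectrum_widths(ASSOCIATIONS)
```

In this toy table, miR-A is linked to both diseases and therefore has the widest disease spectrum.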

1.3.2 miR2Disease

miR2Disease is a manually curated database that aims to provide a comprehensive resource of miRNA dysregulation in a variety of human diseases [23]. Each entry in miR2Disease contains detailed information on a miRNA-disease association, including the miRNA ID, the disease name, a brief description of the miRNA-disease relationship, the expression pattern of the miRNA, the detection method for miRNA expression, experimentally verified target gene(s) of the miRNA and a literature reference. miR2Disease is available at http://www.miR2Disease.org.

1.3.3 LncRNADisease

LncRNADisease is a database of lncRNA-disease associations that contains experimentally supported associations [11]. LncRNADisease also curates lncRNA interacting partners at various molecular levels, including protein, RNA, miRNA and DNA. Each lncRNA-disease association includes genomic information, sequences, references and species. The authors also designed a bioinformatic strategy to predict new lncRNA-disease relationships, and integrated both the method and the predicted disease associations of human lncRNAs into the database. LncRNADisease is freely available at http://cmbi.bjmu.edu.cn/lncrnadisease.

1.3.4 Lnc2Cancer

Lnc2Cancer is a manually curated database of cancer-associated lncRNAs with experimental support that aims to provide a high-quality, integrated resource for exploring lncRNA deregulation in human cancers [12]. Each association includes the lncRNA and cancer name, the lncRNA expression pattern, experimental techniques, a brief functional description, the original reference and additional annotation information. Lnc2Cancer allows users to conveniently browse, retrieve and download data. Lnc2Cancer is publicly accessible at http://www.bio-bigdata.net/lnc2cancer.

1.4 ncRNA-Related Variants Resources

1.4.1 dPORE-miRNA

dPORE-miRNA (Dragon Database of Polymorphic Regulation of miRNA genes) is a data resource that includes information on the promoter regions of human miRNA genes, SNPs, and predicted TFBSs in the promoter regions [24]. The web interface allows users to explore the effect of SNPs on the transcriptional regulation of miRNA genes. The focus is on SNPs that affect TFBSs or lead to the creation of a TFBS. Conversely, only TFBSs that are themselves affected by SNPs are included. dPORE-miRNA is publicly accessible at http://cbrc.kaust.edu.sa/dpore/.

1.4.2 miRdSNP

miRdSNP (http://mirdsnp.ccr.buffalo.edu) provides a unique database of disease-associated SNPs (dSNPs) on the 3′ UTRs of human genes, manually curated from PubMed [25]. miRdSNP lists genes targeted by miRNAs with experimental support. It also includes miRNA target sites predicted by TargetScan and PicTar, as well as potential miRNA target sites newly created by dSNPs. A web interface and search tools are provided for studying the proximity of miRNA binding sites to dSNPs in relation to human diseases.

1.4.3 miRNASNP

miRNASNP (http://bioinfo.life.hust.edu.cn/miRNASNP2/) is designed to provide a resource of miRNA-related SNPs, which includes SNPs in pre-miRNAs of human and other species, and target gain and loss caused by SNPs in miRNA seed regions or 3′ UTRs of target mRNAs [26]. The latest version of this database is miRNASNP v2, which increases the number of potentially functional SNPs to 365,994.

1.4.4 MirSNP

MirSNP (http://cmbi.bjmu.edu.cn/mirsnp) is a publicly accessible online database that collects human SNPs in identified miRNA-mRNA binding sites [27]. It contains 414,510 SNPs that might affect miRNA-mRNA binding. These SNPs are annotated to predict whether a SNP within the target site would decrease/break or enhance/create a miRNA-mRNA binding site.
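The break/create classification can be sketched with plain seed matching. The snippet below is a deliberately simplified illustration, not the method used by MirSNP or any other database: it assumes a target site is a perfect match to the reverse complement of miRNA nucleotides 2-8 and ignores binding energy, site context and conservation. The function names and the example sequences are chosen for illustration only.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Illustrative sequences only (RNA alphabet, written 5'->3').
MIRNA = "UGAGGUAGUAGGUUGUAUAGUU"
UTR = "AAACUACCUCAAA"


def seed_site(mirna):
    """Return the mRNA motif complementary to miRNA nucleotides 2-8."""
    seed = mirna[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def snp_effect(utr, pos, alt, mirna):
    """Classify a SNP at 0-based position `pos` of `utr` by its effect on seed sites."""
    site = seed_site(mirna)
    mutated = utr[:pos] + alt + utr[pos + 1:]
    # Only windows that overlap the SNP position can change.
    lo = max(0, pos - len(site) + 1)
    hi = pos + len(site)
    before = site in utr[lo:hi]
    after = site in mutated[lo:hi]
    if before and not after:
        return "disrupts site"
    if after and not before:
        return "creates site"
    return "no change"


effect = snp_effect(UTR, 5, "G", MIRNA)
```

Here the substitution at position 5 falls inside the only seed match in the example UTR, so the site is lost; reversing the alleles would be reported as a created site.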

1.4.5 PolymiRTS

PolymiRTS (Polymorphism in microRNAs and their TargetSites, http://compbio.uthsc.edu/miRSNP/) is an integrated online tool for analyzing the functional effects of genetic polymorphisms located in miRNA seed regions and miRNA target sites [28]. The latest version of this database, PolymiRTS Database 3.0, stores additional miRNA-mRNA interactions obtained from CLASH (cross-linking, ligation and sequencing of hybrids) experiments.

1.4.6 MicroSNiPer

MicroSNiPer (http://cbdb.nimh.nih.gov/microsniper) is an online application used to predict the effects of a SNP on putative microRNA targets [29]. This web-based tool queries the 3′ UTR and predicts whether a SNP located in the target locus will disrupt/eliminate or enhance/create a microRNA binding site. MicroSNiPer computes these sites and evaluates the influence of SNPs in real time. Its strengths include a friendly interface, flexible use, and a simple and intuitive graphical representation of the results.

1.4.7 lncRNASNP

lncRNASNP (http://bioinfo.life.hust.edu.cn/lncRNASNP/) aims to offer a useful database of SNPs in lncRNAs [14]. lncRNASNP systematically identified SNPs in lncRNAs and estimated their potential impacts on lncRNA structure and function in human and mouse. A large number of SNPs with the potential to affect miRNA-lncRNA interactions were identified. Previous experimental evidence, the conservation of miRNA-lncRNA interactions, and miRNA expression data from TCGA were integrated to predict the associations between miRNA-lncRNA interactions and SNPs located in binding sites. The database offers a friendly interface for users to query and browse the SNP, lncRNA and miRNA sections.

1.4.8 LincSNP

LincSNP (http://bioinfo.hrbmu.edu.cn/LincSNP) is a database that specifically stores and annotates disease-related single nucleotide polymorphisms (SNPs) located in human long non-coding RNAs (lncRNAs) and their transcription factor binding sites (TFBSs) [13]. This platform collects five kinds of datasets, including (i) disease-related SNPs in human lncRNAs; (ii) disease-related SNPs in lncRNA TFBSs; (iii) LD-SNPs from the 1000 Genomes Project; and (iv) experimentally corroborated SNP-lncRNA-disease associations. LincSNP provides users with a newly designed, simple and intuitive interface to query and download all the data.
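Placing SNPs inside lncRNAs is, at its core, a coordinate-overlap problem. The sketch below shows one simple way to do it with a linear scan over intervals. The coordinates, identifiers and function name are hypothetical, and a real pipeline would more likely use an indexed interval structure and a genome annotation file rather than in-memory tuples.

```python
from collections import namedtuple

# 1-based, inclusive genomic interval.
Interval = namedtuple("Interval", "chrom start end name")

# Hypothetical coordinates and identifiers, for illustration only.
LNCRNAS = [
    Interval("chr1", 1000, 5000, "lnc-A"),
    Interval("chr1", 4500, 9000, "lnc-B"),
    Interval("chr2", 200, 800, "lnc-C"),
]
SNPS = [("rs-x1", "chr1", 4700), ("rs-x2", "chr1", 9500), ("rs-x3", "chr2", 200)]


def map_snps_to_lncrnas(snps, lncrnas):
    """Return {snp_id: [names of lncRNAs whose interval contains the SNP]}."""
    hits = {}
    for snp_id, chrom, pos in snps:
        hits[snp_id] = [
            iv.name
            for iv in lncrnas
            if iv.chrom == chrom and iv.start <= pos <= iv.end
        ]
    return hits


hits = map_snps_to_lncrnas(SNPS, LNCRNAS)
```

A SNP can fall in more than one overlapping lncRNA, which is why each SNP maps to a list rather than a single name.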

1.5 Other ncRNA Resources

1.5.1 DIANA-LncBase

DIANA-LncBase (www.microrna.gr/LncBase) is a database that stores experimentally verified and in silico predicted miRNA Recognition Elements (MREs) on lncRNAs [17]. This web resource collects miRNA-lncRNA interactions from both low- and high-throughput data, acquired from manually curated publications and the analysis of AGO CLIP-Seq libraries. LncBase also hosts in silico predicted miRNA targets on lncRNAs, identified with the DIANA-microT algorithm. LncBase collects information about cell type-specific miRNA-lncRNA regulation and enables users to easily identify interactions in particular cell types and tissues for human and mouse.

1.5.2 ChIPBase

ChIPBase identifies a large number of binding motif matrices and their binding sites from ChIP-seq data of DNA-binding proteins, and predicts millions of transcriptional regulatory relationships between transcription factors (TFs) and genes [18]. ChIPBase includes a 'Regulator' module to predict hundreds of TFs and histone modifications that are involved in or affect the transcription of ncRNAs and protein-coding genes (PCGs). ChIPBase also provides a web-based tool, Co-Expression, to explore the co-expression patterns between DNA-binding proteins and various types of genes by integrating the gene expression profiles of 10,000 tumor samples and 9100 normal tissues and cell lines.
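Co-expression between a DNA-binding protein and a gene is commonly summarized with a correlation coefficient across samples. The sketch below uses Pearson correlation as an assumed measure; the chapter does not state which statistic the Co-Expression tool uses, and the expression values and function name are invented for illustration.

```python
import math

# Invented expression values across four samples, for illustration only.
TF_EXPR = [1.0, 2.0, 3.0, 4.0]
GENE_UP = [2.0, 4.0, 6.0, 8.0]
GENE_DOWN = [8.0, 6.0, 4.0, 2.0]


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (norm_x * norm_y)


r_up = pearson(TF_EXPR, GENE_UP)
r_down = pearson(TF_EXPR, GENE_DOWN)
```

A coefficient near +1 suggests the gene rises with the regulator across samples and a value near -1 suggests the opposite; correlation alone does not establish direct regulation.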

1.5.3 LncRNA2Target

LncRNA2Target (http://www.lncrna2target.org) is a curated database that stores lncRNA-to-target gene relationships [19]. A gene is identified as a target of a lncRNA if it is differentially expressed after knockdown or overexpression of the lncRNA. LncRNA2Target offers a web platform through which users can search for the targets of a particular lncRNA or for the lncRNAs that target a particular gene.
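The target definition above can be sketched as a fold-change filter on expression before and after perturbing the lncRNA. This is a minimal illustration under stated assumptions: the threshold, the pseudocount, the gene names and the function name are all hypothetical, and the sketch applies no statistical test, whereas real differential expression analyses normally do.

```python
import math

# Invented expression values, for illustration only.
CONTROL = {"GENE1": 7.0, "GENE2": 10.0, "GENE3": 3.0}
KNOCKDOWN = {"GENE1": 31.0, "GENE2": 11.0, "GENE3": 0.0}


def candidate_targets(control, perturbed, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes passing the assumed threshold."""
    targets = {}
    for gene, ctrl_value in control.items():
        if gene not in perturbed:
            continue  # skip genes not measured in both conditions
        ratio = (perturbed[gene] + pseudocount) / (ctrl_value + pseudocount)
        log2fc = math.log2(ratio)
        if abs(log2fc) >= min_abs_log2fc:
            targets[gene] = log2fc
    return targets


targets = candidate_targets(CONTROL, KNOCKDOWN)
```

Both up- and down-regulated genes are kept, since a lncRNA may activate or repress its targets; the pseudocount only avoids division by zero for unexpressed genes.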

1.6 Perspectives and Conclusion

In the past few years, many databases have been published to aid researchers in exploring the function of ncRNAs (miRNAs and lncRNAs) in human diseases. These studies and databases emphasize the importance of bioinformatics analysis in the identification of potential disease-associated ncRNAs. In the future, these databases will not only provide a comprehensive resource for experimental research but also present a more global view of ncRNA functions in human diseases. They will serve as a valuable resource for researchers interested in determining the role of miRNAs and lncRNAs in human diseases.

References

1. Guttman M, Rinn JL (2012) Modular regulatory principles of large non-coding RNAs. Nature 482(7385):339–346
2. Esteller M (2011) Non-coding RNAs in human disease. Nat Rev Genet 12(12):861–874
3. Tan CL, Plotkin JL, Veno MT, von Schimmelmann M, Feinberg P, Mann S, Handler A, Kjems J, Surmeier DJ, O'Carroll D et al (2013) MicroRNA-128 governs neuronal excitability and motor behavior in mice. Science 342(6163):1254–1258
4. Briggs JA, Wolvetang EJ, Mattick JS, Rinn JL, Barry G (2015) Mechanisms of long non-coding RNAs in mammalian nervous system development, plasticity, disease, and evolution. Neuron 88(5):861–877
5. Bian EB, Ma CC, He XJ, Wang C, Zong G, Wang HL, Zhao B (2016) Epigenetic modification of miR-141 regulates SKA2 by an endogenous 'sponge' HOTAIR in glioma. Oncotarget
6. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34(Database issue):D140–D144
7. Zhao Y, Li H, Fang S, Kang Y, Wu W, Hao Y, Li Z, Bu D, Sun N, Zhang MQ et al (2016) NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res 44(D1):D203–D208
8. Volders PJ, Verheggen K, Menschaert G, Vandepoele K, Martens L, Vandesompele J, Mestdagh P (2015) An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res 43(Database issue):D174–D180
9. Xu J, Bai J, Zhang X, Lv Y, Gong Y, Liu L, Zhao H, Yu F, Ping Y, Zhang G et al (2017) A comprehensive overview of lncRNA annotation resources. Brief Bioinform 18(2):236–249
10. Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, Gloss BS, Dinger ME (2015) lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res 43(Database issue):D168–D173
11. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q (2013) LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res 41(Database issue):D983–D986
12. Ning S, Zhang J, Wang P, Zhi H, Wang J, Liu Y, Gao Y, Guo M, Yue M, Wang L et al (2016) Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res 44(D1):D980–D985
13. Ning S, Yue M, Wang P, Liu Y, Zhi H, Zhang Y, Zhang J, Gao Y, Guo M, Zhou D et al (2017) LincSNP 2.0: an updated database for linking disease-associated SNPs to human long non-coding RNAs and their TFBSs. Nucleic Acids Res 45(D1):D74–D78
14. Gong J, Liu W, Zhang J, Miao X, Guo AY (2015) lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human and mouse. Nucleic Acids Res 43(Database issue):D181–D186
15. Chen X, Hao Y, Cui Y, Fan Z, He S, Luo J, Chen R (2017) LncVar: a database of genetic variation associated with long non-coding genes. Bioinformatics 33(1):112–118
16. Gong J, Liu C, Liu W, Xiang Y, Diao L, Guo AY, Han L (2017) LNCediting: a database for functional effects of RNA editing in lncRNAs. Nucleic Acids Res 45(D1):D79–D84
17. Paraskevopoulou MD, Vlachos IS, Karagkouni D, Georgakilas G, Kanellos I, Vergoulis T, Zagganas K, Tsanakas P, Floros E, Dalamagas T et al (2016) DIANA-LncBase v2: indexing microRNA targets on non-coding transcripts. Nucleic Acids Res 44(D1):D231–D238
18. Zhou KR, Liu S, Sun WJ, Zheng LL, Zhou H, Yang JH, Qu LH (2017) ChIPBase v2.0: decoding transcriptional regulatory networks of non-coding RNAs and protein-coding genes from ChIP-seq data. Nucleic Acids Res 45(D1):D43–D50
19. Jiang Q, Wang J, Wu X, Ma R, Zhang T, Jin S, Han Z, Tan R, Peng J, Liu G et al (2015) LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression. Nucleic Acids Res 43(Database issue):D193–D196
20. Zhou Z, Shen Y, Khan MR, Li A (2015) LncReg: a reference resource for lncRNA-associated regulatory networks. Database (Oxford) 2015
21. Volders PJ, Helsens K, Wang X, Menten B, Martens L, Gevaert K, Vandesompele J, Mestdagh P (2013) LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Res 41(Database issue):D246–D251
22. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, Cui Q (2014) HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res 42(Database issue):D1070–D1074
23. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y (2009) miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 37(Database issue):D98–D104
24. Schmeier S, Schaefer U, MacPherson CR, Bajic VB (2011) dPORE-miRNA: polymorphic regulation of microRNA genes. PLoS One 6(2):e16657
25. Bruno AE, Li L, Kalabus JL, Pan Y, Yu A, Hu Z (2012) miRdSNP: a database of disease-associated SNPs and microRNA target sites on 3′ UTRs of human genes. BMC Genomics 13:44
26. Gong J, Tong Y, Zhang HM, Wang K, Hu T, Shan G, Sun J, Guo AY (2012) Genome-wide identification of SNPs in microRNA genes and the SNP effects on microRNA target binding and biogenesis. Hum Mutat 33(1):254–263
27.
Liu C, Zhang F, Li T, Lu M, Wang L, Yue W, Zhang D (2012) MirSNP, a database of polymorphisms altering miRNA target sites, identifies miRNA-related SNPs in GWAS SNPs and eQTLs. BMC Genomics 13:661 28. Bhattacharya A, Ziebarth JD, Cui Y (2014) PolymiRTS Database 3.0: linking polymorphisms in microRNAs and their target sites with human diseases and biological pathways. Nucleic Acids Res 42(Database issue):D86–D91 29. Barenboim M, Zoltick BJ, Guo Y, Weinberger DR (2010) MicroSNiPer: a web tool for prediction of SNP effects on putative microRNA targets. Hum Mutat 31(11):1223–1232

2 Systematic Identification of Non-coding RNAs

Yun Xiao, Jing Hu, and Wenkang Yin

Abstract

Non-coding RNAs (ncRNAs) are biologically significant in various ways. They modulate gene expression at the transcriptional and post-transcriptional levels. miRNAs and lncRNAs are two major classes of non-coding RNAs and have been extensively characterized. They are implicated in various biological processes and diseases. Thus, identification of miRNAs and lncRNAs is fundamental to further understanding their roles and dissecting their mechanisms. Here, we give an overview of pipelines for identifying miRNAs and lncRNAs based on next-generation sequencing technologies. We applied the pipelines to identify miRNAs in multiple cell lines and to quantify the expression of mature, precursor and primary miRNAs. In addition, we provide an alternative way to re-annotate lncRNAs from microarray data. We summarize multiple resources and databases for lncRNA annotation and compare their annotation processes and specific parameters. Finally, we used RNA-seq and miRNA-seq data to construct a comprehensive transcriptome containing miRNAs, lncRNAs and protein-coding genes in heart failure.

Keywords

Non-coding RNA · Identification pipeline · Expression quantification · Annotation · Transcriptome

Y. Xiao (*) · J. Hu · W. Yin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China. e-mail: [email protected]; [email protected]

© Springer Nature Singapore Pte Ltd. 2018. X. Li et al. (eds.), Non-coding RNAs in Complex Diseases, Advances in Experimental Medicine and Biology 1094, https://doi.org/10.1007/978-981-13-0719-5_2

2.1 Introduction

Coding genes account for just 1% of the total human transcripts [1]; the rest, non-coding RNAs (ncRNAs), remain a largely unknown component of mammalian genomes. NcRNAs are ribonucleic acid (RNA) molecules that do not encode proteins. Different types of non-coding RNAs are involved in different cellular processes, such as gene expression regulation (miRNAs, piRNAs, lncRNAs), RNA maturation (snRNAs, snoRNAs) and protein synthesis (rRNAs, tRNAs). Among them, miRNAs and lncRNAs are the two most extensively characterized classes, and they are implicated in a variety of biological processes.

MicroRNAs (miRNAs) are small (~22 nucleotides in length) non-coding regulatory genes found in many eukaryotic organisms. They mediate the expression of target genes at the post-transcriptional level and serve as important regulators of various developmental processes and diseases [2]. Rapid advances in high-throughput sequencing allow unprecedentedly sensitive detection of miRNAs with the help of bioinformatic algorithms such as miRDeep2 [3] and miRanalyzer [3].

LncRNAs are defined as non-coding transcripts whose length ranges from 200 bp to more than 10 kb [4, 5]. Recent studies showed that lncRNAs play key roles in many normal biological processes, such as vertebrate development, immune responses and cell differentiation, and that they are also related to complex human diseases [6–8]. LncRNAs can participate in gene regulation in many ways, especially in the epigenetic control of chromatin [8–11]. The best-known example is X-chromosome inactivation through the cis-acting lncRNA XIST [12]. Trans-regulation is another way in which lncRNAs regulate gene expression [13]: Rinn et al. found that HOTAIR acted in trans to repress transcription of the HOXD locus. Despite the interesting findings for a few lncRNAs, it is difficult to generalize them to the large number of lncRNAs. More importantly, the functions of most lncRNAs are largely unknown compared to those of small non-coding RNAs (i.e., microRNAs) [14], which offers opportunities and raises challenges for predicting lncRNA functions.

RNA sequencing (RNA-seq) is a whole-transcriptome sequencing technique that quantifies gene expression over a wide dynamic range. It overcomes the shortcomings of microarray technology and has already been widely used in the study of model organisms and humans. Cabili et al. defined a reference catalog of more than 8000 human long intergenic non-coding RNAs from RNA-seq data [6], most of which had not been previously described. Recent advances in RNA-seq and in computational methods for reconstructing transcriptomes offer an excellent opportunity to annotate and characterize lncRNAs. Indeed, a large number of lncRNAs have been discovered using RNA-seq [6, 15, 16]. Abundant RNA-seq data therefore allow us to comprehensively identify and quantify lncRNAs (as well as protein-coding genes) and enable us to characterize the functions of lncRNAs.

Here, we provide canonical pipelines for identifying miRNAs and lncRNAs using next-generation sequencing data. Applying the pipelines to five human cell lines, we identified miRNAs and quantified the expression levels of mature, precursor and primary miRNAs. Alternatively, lncRNAs can be re-annotated from microarray data. We interrogated multiple resources and databases for lncRNA annotation and summarized their common and specific processes. Finally, we constructed a comprehensive transcriptome composed of miRNAs, lncRNAs and protein-coding genes in heart failure by taking advantage of RNA-seq and miRNA-seq data.

2.2 Methods

2.2.1 Identification of miRNAs

Given a miRNA-seq dataset, miRanalyzer [3] can be used to map sequence reads to miRNA annotations in the miRBase database. Read counts for each miRNA can then be calculated and normalized to the total count of sequence reads as RPM (reads per million mapped reads). Lowly expressed miRNAs should be filtered out.
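The normalization and filtering step can be sketched in a few lines of Python. This is only an illustration: the read counts are invented, and the RPM cutoff used to define "lowly expressed" is an assumption, since the text does not state one.

```python
def reads_per_million(counts):
    """Normalize raw miRNA read counts to RPM (reads per million mapped reads)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {mirna: n * 1_000_000 / total for mirna, n in counts.items()}


def filter_low_expression(rpm, min_rpm=1.0):
    """Drop miRNAs below an RPM cutoff (the cutoff value is an assumption)."""
    return {mirna: value for mirna, value in rpm.items() if value >= min_rpm}


# Hypothetical read counts per mature miRNA in one sample.
counts = {
    "hsa-miR-21-5p": 750_000,
    "hsa-miR-128-3p": 249_999,
    "hsa-miR-141-3p": 1,
}
rpm = reads_per_million(counts)
kept = filter_low_expression(rpm, min_rpm=10.0)
```

With these toy counts the library contains exactly one million reads, so each RPM value equals the raw count and only the miRNA supported by a single read is removed.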

2.2.2 Identification of lncRNAs

Given a microarray dataset, the probe sequences can be obtained from the corresponding manufacturer's website and uniquely mapped to the human genome (hg19) with Bowtie, allowing no mismatches. Probes that map completely within exons of lncRNAs, without overlapping protein-coding genes, are retained to label the corresponding lncRNAs. LncRNAs with fewer than four probes should be filtered out.

Given an RNA-seq dataset, TopHat (version 2.0.13) [17] can be used to map the sequencing reads to the human genome (hg19). Cufflinks (version 2.2.1) [18] can then assemble the uniquely mapped reads into transcripts for each sample, and the assemblies of all samples are merged with Cuffmerge. Besides known lncRNAs, novel lncRNAs can also be extracted: previously unannotated transcripts with length ≥200 bp that lack coding potential, as calculated by CPAT (version 1.2.2) [19], are defined as novel lncRNAs. Fragments per kilobase per million mapped reads (FPKM) for each known and novel lncRNA can be extracted from the Cufflinks output. Alternatively, read counts can be computed using BEDTools (http://code.google.com/p/bedtools).
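The definition of a novel lncRNA and the FPKM measure can be illustrated with a short Python sketch. It does not call Cufflinks or CPAT; it assumes that transcript length, annotation status and a CPAT-style coding probability have already been collected into records. The `Transcript` class, the example identifiers and the coding-probability cutoff of 0.364 are assumptions made for illustration, not values taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    length: int                # spliced transcript length in bp
    annotated: bool            # True if it matches a known annotation
    coding_probability: float  # CPAT-style score between 0 and 1


def is_novel_lncrna(t, min_length=200, max_coding_prob=0.364):
    """Length >= 200 bp, previously unannotated, and lacking coding potential.

    The coding-probability cutoff is an assumed value, not one stated in the text.
    """
    return (t.length >= min_length
            and not t.annotated
            and t.coding_probability < max_coding_prob)


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical assembled transcripts.
transcripts = [
    Transcript("TCONS_1", 1500, False, 0.05),  # novel lncRNA candidate
    Transcript("TCONS_2", 150, False, 0.02),   # too short
    Transcript("TCONS_3", 2200, False, 0.91),  # likely protein-coding
    Transcript("TCONS_4", 900, True, 0.10),    # already annotated
]
novel = [t.transcript_id for t in transcripts if is_novel_lncrna(t)]
example_fpkm = fpkm(fragments=500, length_bp=2000, total_mapped_fragments=20_000_000)
```

In practice Cufflinks reports FPKM directly; the `fpkm` function only spells out the underlying arithmetic, which for the example values gives 12.5.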

2.3 Results

2.3.1 miRNA Transcriptome Detected by Small RNA-seq Annotation

To identify miRNAs in five human cell lines and quantify their expression levels, we applied miRNA-seq datasets and downloaded the original whole-cell [...] >0 in the nucleus, but with reads = 0 in the cytosol. Finally, we defined the total number of mapped reads in the nucleus as the expression level of primary transcripts of intergenic miRNAs.

2.3.2 Re-annotation of Microarray for Revealing lncRNAs

To re-annotate the Affymetrix exon array, we designed a custom pipeline that exploits the many probes on the array that map to thousands of long non-coding RNAs [15, 21]. The probe sequences were downloaded from the manufacturer's website (http://www.affymetrix.com) and uniquely mapped to the human genome (hg19) with Bowtie, allowing no mismatches [22]. Using BEDTools (http://code.google.com/p/bedtools), we kept probes that mapped entirely within exons of lncRNAs without overlapping protein-coding genes. Finally, expression levels were calculated for lncRNA genes covered by at least four probes.
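The probe selection logic can be sketched in plain Python as interval tests, as a stand-in for the BEDTools step. The sketch assumes that probes have already been uniquely aligned and are given as (chromosome, start, end) tuples with half-open coordinates; strand is ignored for brevity, and all identifiers and coordinates are hypothetical.

```python
from collections import defaultdict


def contained(probe, exon):
    """True if the probe interval lies entirely within the exon interval."""
    return (probe[0] == exon[0]
            and probe[1] >= exon[1]
            and probe[2] <= exon[2])


def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def assign_probes(probes, lncrna_exons, coding_genes, min_probes=4):
    """Assign probes to lncRNAs and keep lncRNAs with at least `min_probes`."""
    hits = defaultdict(list)
    for probe_id, interval in probes.items():
        # Discard probes that overlap any protein-coding gene.
        if any(overlaps(interval, gene) for gene in coding_genes):
            continue
        for lnc_id, exons in lncrna_exons.items():
            if any(contained(interval, exon) for exon in exons):
                hits[lnc_id].append(probe_id)
    return {lnc: ids for lnc, ids in hits.items() if len(ids) >= min_probes}


# Hypothetical coordinates for illustration only.
probes = {
    "p1": ("chr1", 100, 125), "p2": ("chr1", 200, 225),
    "p3": ("chr1", 300, 325), "p4": ("chr1", 400, 425),
    "p5": ("chr1", 950, 975),                            # overlaps a coding gene
    "p6": ("chr2", 10, 35), "p7": ("chr2", 50, 75),
}
lncrna_exons = {"lncA": [("chr1", 100, 1000)], "lncB": [("chr2", 0, 500)]}
coding_genes = [("chr1", 900, 2000)]
kept_lncrnas = assign_probes(probes, lncrna_exons, coding_genes)
```

Here lncA retains four probes and is kept, while lncB has only two probes and is dropped. How probe intensities are summarized into a gene-level expression value is not specified in the text and is therefore left out of the sketch.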

2.3.3 Summaries of lncRNA Annotation Resources

We collected 19 publications, corresponding to 21 resources, that applied high-throughput sequencing data to identify lncRNAs by searching PubMed with the keywords “ChIP-seq”, “RNA-seq”, “lncRNA”, “long noncoding RNA” and “long intergenic non-coding RNA”. Each resource is denoted by the name of its first author in capital letters: CABILI [6], IYER [23], MORAN [24], TRIMARCHI [25], KRETZ [26], WHITE [27], KELLEY [28], HANGAUER [29], PARALKAR [30], HE [31], YANG [32], NECSULEA1/2 [33], NE [34], SOWALSKY [35], KHALIL [36], SIGOVA1/2 [37], BELL [38], YAN [39] and DING [40]. We also included three extensively used lncRNA databases, namely GENCODE (V19) [41], LNCipedia (version 2.1) [42] and NONCODE (version 4.0) [43]. In total, 24 lncRNA annotation resources were used for further analysis. These 24 human lncRNA annotation resources contain over 205,000 lncRNAs, were built from between 3 and over 7000 samples, and cover over 50 tissues or cell lines (Fig. 2.1a and Table 2.1).

Fig. 2.1 Statistics of the annotation information among different resources. (a) Pie chart showing the distribution of lncRNA annotation resources across different tissues and cell lines. The 21 resources were categorized into 12 cohorts, and those involving multiple tissues and cell lines were added into the corresponding cohorts. (b) Pie chart of the distribution of RNA-seq based resources using single-end, paired-end, and both single- and paired-end sequencing data

To identify lncRNAs, the majority of resources (such as CABILI and IYER) applied RNA-seq data to establish a transcriptome based on ab initio or de novo assembly. Among these resources, eight used paired-end sequencing, three used both paired-end and single-end sequencing, and five used single-end sequencing only (Fig. 2.1b). The resources applied various filtering strategies, built on five criteria, to identify credible lncRNAs: size selection, coding potential, exon number, expression level and epigenetic signals. Most resources required lncRNA transcripts to be longer than 200 bp, but SIGOVA, YAN and KHALIL used thresholds of 100 bp, 1 kb and 5 kb, respectively. Expression levels were considered by 15 resources, but with different thresholds. Epigenetic signals derived from ChIP-seq data (such as H3K4me3) were used by some resources to screen for active lncRNAs. Notably, 13 resources focused on lncRNAs in general, while 8 resources focused on intergenic lncRNAs (lincRNAs).

CABILI used RNA-seq data from 24 tissues and cell types to identify lincRNAs. Transcripts with a single exon, or with length less than 200 bases, with low abundance ([...] 100 or a known protein-coding domain) were filtered out after mapping and assembling the reads. Potential lincRNAs were then identified from the remaining transcripts that did not overlap known non-lincRNA annotations. HE applied a strategy similar to that of CABILI to find novel lincRNAs in the human prefrontal cortex, except for a different expression threshold (1 RPKM). Using TopHat and Cufflinks, KELLEY assembled a list of lincRNAs based on the same RNA-seq data as CABILI, with the same filtering conditions. KRETZ used Cufflinks to assemble dynamic RNA-seq data from primary human keratinocytes and identified novel lincRNAs that have multiple exons and a total length of 200 bp without overlapping any annotated genes.

DING analyzed RNA-seq data from breast cancer tissues. After filtering mapped reads against RepeatMasker, rRNA and other repeated sequences, lincRNAs were recognized by considering expression abundance (10 reads) and the minimum distance to neighboring genes (1500 bp for upstream and downstream genes). We used de novo assembly

Table 2.1 Summary of 24 lncRNA annotation resources reviewed in this study

Source      Tissue/cell                                          Data type             Samples
CABILI      24 tissues and cell types                            RNA-seq               24
IYER        18 organs                                            RNA-seq               7256
MORAN       Islets and beta-cells                                RNA-seq and ChIP-seq  15
TRIMARCHI   T-ALL cell lines and primary leukemia samples        RNA-seq and ChIP-seq  14
KRETZ       Keratinocytes                                        RNA-seq               3
WHITE       Lung cancer tissues                                  RNA-seq               728
KELLEY      28 tissues and cell lines                            RNA-seq               70
HANGAUER    23 tissues                                           RNA-seq               127
PARALKAR    Erythroblasts                                        RNA-seq               15
HE          Prefrontal cortex                                    RNA-seq               38
YANG        Failing LV samples                                   RNA-seq               16
NECSULEA1   8 organs                                             RNA-seq               185
NECSULEA2   2 organs                                             RNA-seq               53
NE          Monocytes                                            RNA-seq               8
SOWALSKY    Castration-resistant prostate cancer (CRPC) tissues  RNA-seq               8

[The columns Research scope, Method, Read type, Exon number, Length and Expression, and the rows for the remaining resources, are not reproduced here.]
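The resource-specific size selection described in Sect. 2.3.3 can be expressed as a small lookup with a default. The sketch below is a minimal illustration in Python: only the length thresholds (200 bp by default; 100 bp, 1 kb and 5 kb for SIGOVA, YAN and KHALIL) come from the text, while the function names and the optional exon-number and expression arguments are hypothetical and would have to be set per resource.

```python
DEFAULT_MIN_LENGTH_BP = 200
# Length thresholds named in the text for resources that deviate from 200 bp.
MIN_LENGTH_BP = {"SIGOVA": 100, "YAN": 1000, "KHALIL": 5000}


def min_length_for(resource):
    """Return the minimum transcript length (bp) used by a resource."""
    return MIN_LENGTH_BP.get(resource, DEFAULT_MIN_LENGTH_BP)


def passes_filters(resource, length_bp, n_exons=None, min_exons=None,
                   expression=None, min_expression=None):
    """Apply size selection plus optional exon-number and expression criteria.

    Criteria whose threshold is None are skipped, because not every resource
    used every criterion.
    """
    if length_bp <= min_length_for(resource):
        return False
    if min_exons is not None and (n_exons is None or n_exons < min_exons):
        return False
    if min_expression is not None and (expression is None
                                       or expression <= min_expression):
        return False
    return True


# Hypothetical transcripts checked against different resources' criteria.
multi_exon_ok = passes_filters("CABILI", 850, n_exons=2, min_exons=2)
single_exon_rejected = passes_filters("CABILI", 850, n_exons=1, min_exons=2)
khalil_too_short = passes_filters("KHALIL", 3000)
sigova_ok = passes_filters("SIGOVA", 150)
```

Keeping the thresholds in a dictionary makes the differences between resources explicit and lets the same transcript be evaluated under each resource's criteria.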
