Advances in Experimental Medicine and Biology 1094
Xia Li · Juan Xu · Yun Xiao · Shangwei Ning · Yunpeng Zhang, Editors
Non-coding RNAs in Complex Diseases: A Bioinformatics Perspective
Advances in Experimental Medicine and Biology, Volume 1094
Editorial Board
IRUN R. COHEN, The Weizmann Institute of Science, Rehovot, Israel
ABEL LAJTHA, N.S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA
JOHN D. LAMBRIS, University of Pennsylvania, Philadelphia, PA, USA
RODOLFO PAOLETTI, University of Milan, Milan, Italy
NIMA REZAEI, Tehran University of Medical Sciences Children’s Medical Center, Children’s Medical Center Hospital, Tehran, Iran
More information about this series at http://www.springer.com/series/5584
Editors
Xia Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
Juan Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yun Xiao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Shangwei Ning, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yunpeng Zhang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
ISSN 0065-2598, ISSN 2214-8019 (electronic)
Advances in Experimental Medicine and Biology
ISBN 978-981-13-0718-8, ISBN 978-981-13-0719-5 (eBook)
https://doi.org/10.1007/978-981-13-0719-5
Library of Congress Control Number: 2018952454
© Springer Nature Singapore Pte Ltd. 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Noncoding RNAs (ncRNAs), especially microRNAs (miRNAs) and long noncoding RNAs (lncRNAs), have been a focus of scientific research over the last decade. This is mainly credited to the development of microarray and sequencing techniques, which revealed that nearly 98% of transcripts are not translated into proteins. This finding prompted us to investigate the intricate roles these transcripts play in organisms, ranging from cell proliferation and differentiation to apoptosis and tumorigenesis. Surprisingly, organismal complexity correlates better with the diversity and size of noncoding RNA expression repertoires than with those of protein-coding genes, which leads us to believe that RNA-based regulatory mechanisms might partially explain the evolution of species. Efforts to work out the regulation of noncoding RNAs, in both development and disease, are therefore worthwhile.

In this book, we focus on the study of noncoding RNAs in diseases. We examine ncRNAs from different perspectives so that the book meets the needs of different readers, ranging from experimenters who expect to pinpoint the ncRNAs of interest to those who would like to see a panorama of the relationships between ncRNAs and other molecules. The book provides details about ncRNA resources, data acquisition, and data preparation. Also included are a variety of tools and common bioinformatics methods for handling the large volume of data on expression, modification, and variation. The first three chapters concern the identification of miRNAs and lncRNAs from original microarray and RNA-seq data, as well as the functional characterization and prioritization of these ncRNAs. Chapters 4, 5, and 6 address mutations and epigenetic modifications of ncRNAs and how these changes contribute to diseases. The following three chapters discuss the interactions between different ncRNAs and between ncRNAs and mRNAs. These ncRNAs may mediate the disruption of biological pathways or compete to bind or degrade mRNAs, thus affecting mRNA expression. In Chap. 10, we attempt to predict drug targets through three different approaches. Chapter 11 is a collection of ncRNA resources, which are classified into categories such as disease resources and variant resources. Finally, to explore the interaction between ncRNAs and proteins, Chap. 12 offers both experimental and computational methods.

This book is the crystallization of work by researchers who have each been devoted to their areas for years. It draws on information about ncRNAs from the last decade, which makes it especially suitable for readers who decide to dedicate themselves to the study of ncRNAs. Although the authors have worked out some of the complex regulatory roles ncRNAs play, more remains to be worked out regarding the intricate relationships among RNAs, proteins, and other molecules, as well as how these molecules and relationships shape who we are. We hope all readers benefit from the book.

Harbin, Heilongjiang, China
Non-coding RNA Resources
  Shangwei Ning and Xia Li
Systematic Identification of Non-coding RNAs
  Yun Xiao, Jing Hu, and Wenkang Yin
Functional Characterization of Non-coding RNAs Through Genomic Data Fusion
  Yun Xiao, Min Yan, Chunyu Deng, and Hongying Zhao
Genomic-Scale Prioritization of Disease-Related Non-coding RNAs
  Peng Wang and Xia Li
Genome-Wide Mapping of SNPs in Non-coding RNAs
  Shangwei Ning and Yunpeng Zhang
Profiling DNA Methylation Patterns of Non-coding RNAs (ncRNAs) in Human Disease
  Hui Zhi, Yongsheng Li, and Li Wang
Aberrant Epigenetic Modifications of Non-coding RNAs in Human Disease
  Yun Xiao, Jinyuan Xu, and Wenkang Yin
Computationally Modeling ncRNA-ncRNA Crosstalk
  Juan Xu, Jing Bai, and Jun Xiao
Computational Inferring of Risk Subpathways Mediated by Dysfunctional Non-coding RNAs
  Yanjun Xu, Yunpeng Zhang, and Xia Li
Computational Identification of Cross-Talking ceRNAs
  Yongsheng Li, Caiqin Huo, Xiaoyu Lin, and Juan Xu
Prediction of Non-coding RNAs as Drug Targets
  Wei Jiang, Yingli Lv, and Shuyuan Wang
Methods for Identification of Protein-RNA Interaction
  Juan Xu, Zishan Wang, Xiyun Jin, Lili Li, and Tao Pan
Jing Bai, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Chunyu Deng, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Jing Hu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Caiqin Huo, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Wei Jiang, Department of Biomedical Engineering, College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China; College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Xiyun Jin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Lili Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Xia Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
Yongsheng Li, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Xiaoyu Lin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yingli Lv, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Shangwei Ning, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Tao Pan, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Li Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Peng Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Shuyuan Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Zishan Wang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Jun Xiao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yun Xiao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Jinyuan Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Juan Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yanjun Xu, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Min Yan, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Wenkang Yin, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Yunpeng Zhang, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Hongying Zhao, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Hui Zhi, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
Non-coding RNA Resources Shangwei Ning and Xia Li
Non-coding RNAs (ncRNAs) are functional RNA molecules that are not translated into proteins. A large body of evidence now suggests that ncRNAs are associated with various disease processes, including cancer. Following the rapid development of next-generation sequencing (NGS) technologies, a large amount of ncRNA data has accumulated in a short period of time. To date, many ncRNA-related databases have been developed. Here we systematically review these specialized ncRNA-related databases, and we expect that they will serve as useful resources for researchers who investigate and explore the associations between ncRNAs and human diseases.
Keywords
miRNA · lncRNA · Database · Web server · Disease
S. Ning · X. Li (*)
College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
e-mail: [email protected]; [email protected]

© Springer Nature Singapore Pte Ltd. 2018. X. Li et al. (eds.), Non-coding RNAs in Complex Diseases, Advances in Experimental Medicine and Biology 1094, https://doi.org/10.1007/978-981-13-0719-5_1

Noncoding RNAs (ncRNAs) are a class of functional RNA molecules that do not encode proteins and include microRNAs (miRNAs), long noncoding RNAs (lncRNAs), small interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), and piwi-interacting RNAs (piRNAs), among others. Overwhelming evidence has indicated that various ncRNAs are implicated in human disease processes. For example, miRNAs (~22 nt small ncRNA molecules that repress the expression of their mRNA targets) are important molecules in various diseases. In recent years, lncRNAs (>200 nt RNA molecules) have been found to play a role in development, evolution and disease. For example, the lncRNA HOTAIR serves as an endogenous “sponge” of miR-141, resulting in significant reduction of glioma growth. Numerous resources specific for miRNAs and lncRNAs have been developed to depict the considerable diversity of ncRNAs and their involvement in important biological processes (Table 1.1). These resources cover diverse characteristics of non-coding RNAs. The primary resources mainly collect or integrate basic annotation and functional information on ncRNA transcripts, such as miRBase, NONCODE, LNCipedia, and LNCat. Another type of resource lists the functions and roles of lncRNAs that participate in disease, such as lncRNAdb, LncRNADisease, and Lnc2Cancer. The remaining resources explore the regulatory mechanisms by which lncRNAs interact with other functional elements, such as genetic variants, including LincSNP, lncRNASNP, and LncVar. Other databases focus on RNA editing sites, such as LNCediting. In addition, miRNA-related databases include DIANA-LncBase and ChIPBase. LncRNA function-related databases include LncRNA2Target and LncReg. These databases contribute to further research on the regulatory mechanisms and functions of ncRNAs.

Table 1.1 Non-coding RNA resources
Database name | ncRNA type | Content | Website
ncRNA annotation resources
miRBase | miRNA | miRNA sequence information, annotation and predicted target information | http://microrna.sanger.ac.uk/
NONCODE | lncRNA | Comprehensive annotation of non-coding RNAs, especially lncRNAs | http://www.noncode.org/
LNCipedia | lncRNA | Basic transcript information and secondary structure information, protein coding potential and miRNA binding sites | http://www.lncipedia.org
LNCat | lncRNA | LncRNA structures, visualization of different resources from multiple angles and download of different combinations of lncRNA annotation | http://biocc.hrbmu.edu.cn/LNCat/
ncRNA-related disease resources
HMDD | miRNA | miRNA-disease association data from genetics, epigenetics, circulating miRNAs and miRNA–target interactions | http://cmbi.bjmu.edu.cn/hmdd
miR2Disease | miRNA | A comprehensive resource of experimentally verified miRNA-disease relationships | http://www.miR2Disease.org
LncRNADisease | lncRNA | Experimentally supported and predicted lncRNA-disease associations, and lncRNA interacting partners at various molecular levels | http://cmbi.bjmu.edu.cn/lncrnadisease
Lnc2Cancer | lncRNA | Manually curated cancer-associated lncRNAs with experimental support | http://www.bio-bigdata.net/lnc2cancer
ncRNA-related variants resources
dPORE-miRNA | miRNA | Integrated information on promoter regions of human miRNA genes, SNPs, and predicted TFBSs | http://cbrc.kaust.edu.sa/dpore/
miRdSNP | miRNA | Disease-associated SNPs on the 3′ UTRs of human genes, manually curated from PubMed | http://mirdsnp.ccr.buffalo.edu
miRNASNP | miRNA | A resource of SNPs in pre-miRNAs of human and other species, and target gain and loss by SNPs in miRNA seed regions or 3′ UTRs of target mRNAs | http://bioinfo.life.hust.edu.cn/miRNASNP2/
MirSNP | miRNA | A collection of human SNPs in predicted miRNA-mRNA binding sites | http://cmbi.bjmu.edu.cn/mirsnp
PolymiRTS | miRNA | Genetic polymorphisms in miRNA seed regions and miRNA target sites | http://compbio.uthsc.edu/miRSNP/
MicroSNiPer | miRNA | A collection of SNPs in putative microRNA targets | http://cbdb.nimh.nih.gov/microsniper
lncRNASNP | lncRNA | A resource of SNPs in lncRNAs and their potential impacts on lncRNA structure and function in human and mouse | http://bioinfo.life.hust.edu.cn/lncRNASNP/
LincSNP | lncRNA | A database of disease-associated SNPs in human lncRNAs and their transcription factor binding sites (TFBSs) | http://bioinfo.hrbmu.edu.cn/LincSNP
Other ncRNA resources
DIANA-LncBase | lncRNA | A database of experimentally supported and in silico predicted miRNA Recognition Elements (MREs) on lncRNAs | www.microrna.gr/LncBase
ChIPBase | | A database of transcriptional regulatory relationships between transcription factors (TFs) and genes from ChIP-seq data | http://rna.sysu.edu.cn/chipbase/
LncRNA2Target | lncRNA | A curated database of lncRNA-to-target gene information | http://www.lncrna2target.org
ncRNA Annotation Resources
miRBase
One important aim of miRBase is to provide an integrated resource for comprehensive miRNA sequence information, annotation and predicted target information. The miRBase Registry serves as an independent arbiter of miRNA gene nomenclature, assigning names prior to the publication of novel miRNA sequences. miRBase Sequences is the primary online repository for miRNA sequence data and annotation. miRBase Targets is a database of predicted miRNA target genes. miRBase is available at http://microrna.sanger.ac.uk/.
NONCODE is an interactive database that contains a comprehensive collection and annotation of non-coding RNAs, especially lncRNAs. In its 2016 update, NONCODE covers a total of 16 species. The number of lncRNAs in NONCODE has increased to 527,336, including 167,150 human and 130,558 mouse lncRNAs. NONCODE has also added several important new features, including: (i) conservation annotation; (ii) the associations between lncRNAs and diseases; and (iii) an interface to choose high-quality datasets through predicted scores, literature support and long-read sequencing method support. NONCODE is available at http://www.noncode.org/.
LNCipedia is a database of human lncRNA transcripts and genes obtained from diverse sources. In addition to basic transcript information and gene structure, several comprehensive statistics are shown for each entry in the database, such as secondary structure information, protein coding potential and miRNA binding sites. Available articles on specific lncRNAs are linked, and users or authors can submit articles through a web interface. Protein coding potential is assessed by two different prediction algorithms: Coding Potential Calculator and HMMER. Users can query and download lncRNA sequences and structures based on different search criteria. LNCipedia is available at http://www.lncipedia.org.
LNCat is a user-friendly database that provides a genome browser of lncRNA structures, visualization of different resources from multiple angles and download of different combinations of lncRNA annotations, and supports rapid exploration, comparison and integration of lncRNA annotation resources. LNCat contains a comprehensive comparison of numerous lncRNA annotations and can facilitate the understanding of lncRNAs in human disease. LNCat is freely available at http://biocc.hrbmu.edu.cn/LNCat/.
ncRNA-Associated Disease Resources
HMDD
HMDD is a collection of human miRNA and disease associations, all of which are experimentally supported. In the HMDD database, miRNA–disease associations are annotated in detail and include association data from genetics, epigenetics, circulating miRNAs and miRNA–target interactions. HMDD also provides data generated from concepts derived from the miRNA–disease associations, including the disease spectrum width of miRNAs and the miRNA spectrum width of human diseases. Users can download all the data in HMDD and submit new data to the database. HMDD is freely accessible at http://cmbi.bjmu.edu.cn/hmdd.
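The two spectrum-width measures mentioned for HMDD can be illustrated with a minimal sketch. It assumes that the disease spectrum width of a miRNA is the fraction of all diseases in the data set associated with that miRNA, and that the miRNA spectrum width of a disease is defined symmetrically; this definition, the function name `spectrum_widths`, and the toy association list are assumptions for illustration and are not taken from the text above.

```python
from collections import defaultdict


def spectrum_widths(associations):
    """Compute spectrum widths from (miRNA, disease) pairs.

    Assumed definitions (not stated in the chapter):
      DSW(miRNA)   = diseases linked to the miRNA / all diseases in the data
      MSW(disease) = miRNAs linked to the disease / all miRNAs in the data
    """
    diseases_of = defaultdict(set)
    mirnas_of = defaultdict(set)
    # A set removes repeated records of the same miRNA-disease pair.
    for mirna, disease in set(associations):
        diseases_of[mirna].add(disease)
        mirnas_of[disease].add(mirna)
    n_diseases = len(mirnas_of)
    n_mirnas = len(diseases_of)
    dsw = {m: len(ds) / n_diseases for m, ds in diseases_of.items()}
    msw = {d: len(ms) / n_mirnas for d, ms in mirnas_of.items()}
    return dsw, msw


# Toy records, invented purely for illustration.
toy = [
    ("miR-21", "glioma"),
    ("miR-21", "breast cancer"),
    ("miR-141", "glioma"),
    ("miR-21", "glioma"),  # duplicate record, counted once
]
dsw, msw = spectrum_widths(toy)
```

With real data, the association list would come from a downloaded HMDD file; the parsing step is omitted here because the file layout is not described in this chapter.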
miR2Disease is a manually curated database that aims to provide a comprehensive resource of miRNA dysregulation in a variety of human diseases. Each entry in miR2Disease contains detailed information on a miRNA-disease association, including the miRNA ID, the disease name, a brief description of the miRNA-disease relationship, the expression pattern of the miRNA, the detection method for miRNA expression, experimentally verified target gene(s) of the miRNA and a literature reference. miR2Disease is available at http://www.miR2Disease.org.
LncRNADisease is a lncRNA and disease association database that contains experimentally supported lncRNA-disease associations. LncRNADisease also curates lncRNA interacting partners at various molecular levels, including protein, RNA, miRNA and DNA. Each lncRNA-disease association includes genomic information, sequences, references and species. The authors of LncRNADisease designed a bioinformatics method to identify new lncRNA-disease relationships and integrated both the method and the predicted disease associations of human lncRNAs into the database. LncRNADisease is freely available at http://cmbi.bjmu.edu.cn/lncrnadisease.
Lnc2Cancer is a manually curated database of cancer-associated lncRNAs with experimental support that aims to provide a high-quality and integrated resource for exploring lncRNA deregulation in human cancers. Each association includes the lncRNA and cancer names, the lncRNA expression pattern, experimental techniques, a brief functional description, the original reference and additional annotation information. Lnc2Cancer allows users to conveniently browse, retrieve and download data. Lnc2Cancer is publicly accessible at http://www.bio-bigdata.net/lnc2cancer.
ncRNA-Related Variants Resources
dPORE-miRNA
dPORE-miRNA (Dragon Database of Polymorphic Regulation of miRNA genes) is a data resource that includes information on promoter regions of human miRNA genes, SNPs, and predicted TFBSs in the promoter regions. The web interface allows users to explore the effect of SNPs on the transcriptional regulation of miRNA genes. The focus is on SNPs that affect TFBSs or lead to the creation of a TFBS. Conversely, only TFBSs that are themselves affected by SNPs are included. dPORE-miRNA is publicly accessible at http://cbrc.kaust.edu.sa/dpore/.
miRdSNP (http://mirdsnp.ccr.buffalo.edu) is a database of dSNPs (disease-associated SNPs) on the 3′ UTRs of human genes, manually curated from PubMed. miRdSNP annotates genes that are experimentally supported miRNA targets. It also includes miRNA target sites predicted by TargetScan and PicTar, as well as potential miRNA target sites newly generated by dSNPs. A web interface and analysis tools are provided for studying the proximity of miRNA binding sites to dSNPs in relation to human diseases.
miRNASNP (http://bioinfo.life.hust.edu.cn/miRNASNP2/) is designed to provide a resource of miRNA-related SNPs, which includes SNPs in pre-miRNAs of human and other species, and target gain and loss by SNPs in miRNA seed regions or 3′ UTRs of target mRNAs. The latest version of this database is miRNASNP v2, which increases the number of potential functional SNPs to 365,994.
MirSNP (http://cmbi.bjmu.edu.cn/mirsnp) is a publicly accessible online database that collects human SNPs in identified miRNA-mRNA binding sites. It contains 414,510 SNPs that might affect miRNA-mRNA binding. Annotations were added to these SNPs to predict whether a SNP within the target site would decrease/break or enhance/create a miRNA-mRNA binding site.
MicroSNiPer (http://cbdb.nimh.nih.gov/microsniper) is an online application used to predict the effects of a SNP on putative microRNA targets. This web-based tool queries the 3′ UTR and predicts whether a SNP located in the target locus will disrupt/eliminate or enhance/create a microRNA binding site. MicroSNiPer computes these sites and evaluates the influence of SNPs in real time. Its strengths include a friendly interface, flexible use, and a simple and intuitive graphical representation of the results.
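The idea shared by tools of this kind, checking whether a SNP disrupts or creates a miRNA binding site, can be illustrated with a deliberately simplified sketch. It assumes that a binding site is an exact match in the 3′ UTR to the reverse complement of miRNA positions 2 to 8 (the seed); this is a common simplification and is not the actual algorithm of MicroSNiPer, MirSNP or any other database described here. The function names and the short UTR fragments are hypothetical.

```python
# RNA complement table used to derive the target-site sequence.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=2, end=8):
    """Return the mRNA motif (5'->3') complementary to miRNA positions start..end."""
    seed = mirna[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]


def snp_effect(utr, pos, alt, mirna):
    """Classify a single-base substitution at 0-based `pos` of `utr`.

    Returns "disrupt" if a seed match overlapping the SNP is lost,
    "create" if one appears, and "no change" otherwise.
    """
    site = seed_site(mirna)
    mutated = utr[:pos] + alt + utr[pos + 1:]
    # Only windows that overlap the SNP position can gain or lose a match.
    lo = max(0, pos - len(site) + 1)
    hi = pos + len(site)
    before = site in utr[lo:hi]
    after = site in mutated[lo:hi]
    if before and not after:
        return "disrupt"
    if after and not before:
        return "create"
    return "no change"


# Illustrative miRNA sequence and invented UTR fragments.
MIRNA = "UAGCUUAUCAGACUGAUGUUGA"
UTR_WITH_SITE = "GGCAUAAGCUCCA"
UTR_WITHOUT_SITE = "GGCAUACGCUCCA"

loss = snp_effect(UTR_WITH_SITE, 6, "C", MIRNA)
gain = snp_effect(UTR_WITHOUT_SITE, 6, "A", MIRNA)
neutral = snp_effect(UTR_WITH_SITE, 0, "A", MIRNA)
```

A production tool would typically also weigh factors such as site type, binding energy and conservation, none of which are modeled in this sketch.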
lncRNASNP (http://bioinfo.life.hust.edu.cn/lncRNASNP/) aims to offer a useful resource on lncRNA SNPs. lncRNASNP systematically identified SNPs in lncRNAs and estimated their potential impacts on lncRNA structure and function in human and mouse. A large number of SNPs were identified with the potential to affect interactions between miRNAs and lncRNAs. Previous experimental evidence, the conservation of miRNA-lncRNA interactions, and miRNA expression data from TCGA were all integrated to predict the associations between miRNA-lncRNA interactions and SNPs located in binding sites. The database provides a friendly interface for users to query and browse through the SNP, lncRNA and miRNA sections.
PolymiRTS (Polymorphism in microRNAs and their TargetSites, http://compbio.uthsc.edu/miRSNP/) is an integrated online tool for analyzing the functional effects of genetic polymorphisms located in miRNA seed regions and miRNA target sites. The latest version of this database is PolymiRTS Database 3.0, which includes additional miRNA-mRNA interactions obtained from CLASH (cross linking, ligation and sequencing of hybrids) experiments.
LincSNP (http://bioinfo.hrbmu.edu.cn/LincSNP) is a database that specifically stores and annotates disease-related single nucleotide polymorphisms (SNPs) located in human long non-coding RNAs (lncRNAs) and their transcription factor binding sites (TFBSs). This platform collects five kinds of data sets, such as (i) disease-related SNPs in human lncRNAs; (ii) disease-related SNPs in lncRNA TFBSs; (iii) LD-SNPs from the 1000 Genomes Project; and (iv) experimentally corroborated SNP-lncRNA-disease associations. LincSNP provides users with a newly designed, simple and intuitive interface to query and download all the data.
Other ncRNA Resources
DIANA-LncBase
DIANA-LncBase (www.microrna.gr/LncBase) is a database that stores experimentally verified and in silico predicted miRNA Recognition Elements (MREs) on lncRNAs. This web resource collects miRNA-lncRNA interactions from both low- and high-throughput data, derived from manually curated publications and the analysis of AGO CLIP-Seq libraries. LncBase also hosts in silico predicted miRNA targets on lncRNAs, identified with the DIANA-microT algorithm. LncBase collects information about cell type-specific miRNA-lncRNA regulation and enables users to easily identify interactions in particular cell types and tissues for human and mouse.
ChIPBase identified a large number of binding motif matrices and their binding sites from ChIP-seq data of DNA-binding proteins and predicted millions of transcriptional regulatory relationships between transcription factors (TFs) and genes. ChIPBase provides a ‘Regulator’ module to predict hundreds of TFs and histone modifications that are involved in or affect the transcription of ncRNAs and protein-coding genes (PCGs). ChIPBase also provides a web-based tool, Co-Expression, to recognize co-expression patterns between DNA-binding proteins and various types of genes by integrating the gene expression profiles of 10,000 tumor samples and 9,100 normal tissues and cell lines.
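As a rough illustration of what a co-expression analysis between a DNA-binding protein and a gene involves, the sketch below computes a Pearson correlation coefficient across matched samples. The choice of Pearson correlation is an assumption made here for illustration; the text above does not state which measure ChIPBase uses, and the expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Invented expression values of one TF and two genes across four samples.
tf_expr = [1.0, 2.0, 3.0, 4.0]
gene_up = [2.0, 4.0, 6.0, 8.0]    # rises with the TF
gene_down = [8.0, 6.0, 4.0, 2.0]  # falls as the TF rises

r_up = pearson(tf_expr, gene_up)
r_down = pearson(tf_expr, gene_down)
```

Correlation alone does not show regulation; in the setting described above it would be read alongside the ChIP-seq binding evidence.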
LncRNA2Target (http://www.lncrna2target.org) is a curated database that stores lncRNA-to-target gene associations. A gene is identified as a target of a lncRNA if it is differentially expressed after knockdown or overexpression of the lncRNA. LncRNA2Target offers a web platform through which users can search for the targets of a particular lncRNA or for the lncRNAs that target a particular gene.
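The target definition used by LncRNA2Target can be sketched as a simple fold-change filter. The threshold of a twofold change, the pseudocount, the function name `call_targets` and the gene names and values are all assumptions for illustration; published studies usually apply a statistical test on replicated samples rather than a bare fold-change cutoff.

```python
import math


def call_targets(control, perturbed, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes that change after perturbation.

    `control` and `perturbed` map gene names to expression values measured
    before and after lncRNA knockdown or overexpression. Both the cutoff and
    the pseudocount are illustrative defaults, not values from the database.
    """
    targets = {}
    for gene, ctrl_value in control.items():
        if gene not in perturbed:
            continue  # skip genes that were not measured in both conditions
        ratio = (perturbed[gene] + pseudocount) / (ctrl_value + pseudocount)
        log2fc = math.log2(ratio)
        if abs(log2fc) >= min_abs_log2fc:
            targets[gene] = log2fc
    return targets


# Invented expression values for three hypothetical genes.
control = {"GENE_A": 99.0, "GENE_B": 50.0, "GENE_C": 7.0}
knockdown = {"GENE_A": 24.0, "GENE_B": 52.0, "GENE_C": 31.0}

targets = call_targets(control, knockdown)
```

In this toy example GENE_A decreases and GENE_C increases about fourfold after knockdown, so both would be recorded as candidate targets, while GENE_B is unchanged.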
Perspectives and Conclusion
In the past few years, many databases have been published to aid researchers in exploring the function of ncRNAs (miRNAs and lncRNAs) in human diseases. These studies and databases emphasize the importance of bioinformatics analysis in the identification of potential disease-associated ncRNAs. In the future, these databases will not only provide a comprehensive resource for experimental research but also present a more global view of ncRNA functions in human diseases. They will serve as a valuable resource for researchers interested in determining the role of miRNAs and lncRNAs in human diseases.
References
1. Guttman M, Rinn JL (2012) Modular regulatory principles of large non-coding RNAs. Nature 482(7385):339–346
2. Esteller M (2011) Non-coding RNAs in human disease. Nat Rev Genet 12(12):861–874
3. Tan CL, Plotkin JL, Veno MT, von Schimmelmann M, Feinberg P, Mann S, Handler A, Kjems J, Surmeier DJ, O'Carroll D et al (2013) MicroRNA-128 governs neuronal excitability and motor behavior in mice. Science 342(6163):1254–1258
4. Briggs JA, Wolvetang EJ, Mattick JS, Rinn JL, Barry G (2015) Mechanisms of long non-coding RNAs in mammalian nervous system development, plasticity, disease, and evolution. Neuron 88(5):861–877
5. Bian EB, Ma CC, He XJ, Wang C, Zong G, Wang HL, Zhao B (2016) Epigenetic modification of miR-141 regulates SKA2 by an endogenous ‘sponge’ HOTAIR in glioma. Oncotarget
6. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34(Database issue):D140–D144
7. Zhao Y, Li H, Fang S, Kang Y, Wu W, Hao Y, Li Z, Bu D, Sun N, Zhang MQ et al (2016) NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res 44(D1):D203–D208
8. Volders PJ, Verheggen K, Menschaert G, Vandepoele K, Martens L, Vandesompele J, Mestdagh P (2015) An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res 43(Database issue):D174–D180
9. Xu J, Bai J, Zhang X, Lv Y, Gong Y, Liu L, Zhao H, Yu F, Ping Y, Zhang G et al (2017) A comprehensive overview of lncRNA annotation resources. Brief Bioinform 18(2):236–249
10. Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, Gloss BS, Dinger ME (2015) lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res 43(Database issue):D168–D173
11. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q (2013) LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res 41(Database issue):D983–D986
12. Ning S, Zhang J, Wang P, Zhi H, Wang J, Liu Y, Gao Y, Guo M, Yue M, Wang L et al (2016) Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res 44(D1):D980–D985
13. Ning S, Yue M, Wang P, Liu Y, Zhi H, Zhang Y, Zhang J, Gao Y, Guo M, Zhou D et al (2017) LincSNP 2.0: an updated database for linking disease-associated SNPs to human long non-coding RNAs and their TFBSs. Nucleic Acids Res 45(D1):D74–D78
14. Gong J, Liu W, Zhang J, Miao X, Guo AY (2015) lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human and mouse. Nucleic Acids Res 43(Database issue):D181–D186
15. Chen X, Hao Y, Cui Y, Fan Z, He S, Luo J, Chen R (2017) LncVar: a database of genetic variation associated with long non-coding genes. Bioinformatics 33(1):112–118
16. Gong J, Liu C, Liu W, Xiang Y, Diao L, Guo AY, Han L (2017) LNCediting: a database for functional effects of RNA editing in lncRNAs. Nucleic Acids Res 45(D1):D79–D84
17. Paraskevopoulou MD, Vlachos IS, Karagkouni D, Georgakilas G, Kanellos I, Vergoulis T, Zagganas K, Tsanakas P, Floros E, Dalamagas T et al (2016) DIANA-LncBase v2: indexing microRNA targets on non-coding transcripts. Nucleic Acids Res 44(D1):D231–D238
18. Zhou KR, Liu S, Sun WJ, Zheng LL, Zhou H, Yang JH, Qu LH (2017) ChIPBase v2.0: decoding transcriptional regulatory networks of non-coding RNAs and protein-coding genes from ChIP-seq data. Nucleic Acids Res 45(D1):D43–D50
19. Jiang Q, Wang J, Wu X, Ma R, Zhang T, Jin S, Han Z, Tan R, Peng J, Liu G et al (2015) LncRNA2Target: a database for differentially expressed genes after lncRNA knockdown or overexpression. Nucleic Acids Res 43(Database issue):D193–D196
20. Zhou Z, Shen Y, Khan MR, Li A (2015) LncReg: a reference resource for lncRNA-associated regulatory networks. Database (Oxford) 2015
21. Volders PJ, Helsens K, Wang X, Menten B, Martens L, Gevaert K, Vandesompele J, Mestdagh P (2013) LNCipedia: a database for annotated human lncRNA transcript sequences and structures. Nucleic Acids Res 41(Database issue):D246–D251
22. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, Cui Q (2014) HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res 42(Database issue):D1070–D1074
23. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y (2009) miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 37(Database issue):D98–D104
24. Schmeier S, Schaefer U, MacPherson CR, Bajic VB (2011) dPORE-miRNA: polymorphic regulation of microRNA genes. PLoS One 6(2):e16657
25. Bruno AE, Li L, Kalabus JL, Pan Y, Yu A, Hu Z (2012) miRdSNP: a database of disease-associated SNPs and microRNA target sites on 3′ UTRs of human genes. BMC Genomics 13:44
26. Gong J, Tong Y, Zhang HM, Wang K, Hu T, Shan G, Sun J, Guo AY (2012) Genome-wide identification of SNPs in microRNA genes and the SNP effects on microRNA target binding and biogenesis. Hum Mutat 33(1):254–263
27. Liu C, Zhang F, Li T, Lu M, Wang L, Yue W, Zhang D (2012) MirSNP, a database of polymorphisms altering miRNA target sites, identifies miRNA-related SNPs in GWAS SNPs and eQTLs. BMC Genomics 13:661
28. Bhattacharya A, Ziebarth JD, Cui Y (2014) PolymiRTS Database 3.0: linking polymorphisms in microRNAs and their target sites with human diseases and biological pathways. Nucleic Acids Res 42(Database issue):D86–D91
29. Barenboim M, Zoltick BJ, Guo Y, Weinberger DR (2010) MicroSNiPer: a web tool for prediction of SNP effects on putative microRNA targets. Hum Mutat 31(11):1223–1232
Systematic Identification of Non-coding RNAs
Yun Xiao, Jing Hu, and Wenkang Yin
Non-coding RNAs (ncRNAs) are biologically significant in various ways. They modulate gene expression at the transcriptional and post-transcriptional levels. MiRNAs and lncRNAs are two major classes of ncRNAs that have been extensively characterized and are implicated in various biological processes and diseases. Thus, identification of miRNAs and lncRNAs is fundamental to understanding their roles and dissecting their mechanisms. Here, we reviewed pipelines for identifying miRNAs and lncRNAs based on next-generation sequencing technologies. We applied the pipelines to identify miRNAs in multiple cell lines and to quantify the expression of mature, precursor and primary miRNAs. In addition, we provided an alternative way to re-annotate lncRNAs from microarray data. We summarized multiple resources and databases for lncRNA annotation and compared their annotation processes and specific parameters. Finally, we used RNA-seq and miRNA-seq data to construct a comprehensive transcriptome containing miRNAs, lncRNAs and protein-coding genes in heart failure.
Keywords
Non-coding RNA · Identification pipeline · Expression quantification · Annotation · Transcriptome
Y. Xiao (*) · J. Hu · W. Yin
College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
© Springer Nature Singapore Pte Ltd. 2018 X. Li et al. (eds.), Non-coding RNAs in Complex Diseases, Advances in Experimental Medicine and Biology 1094, https://doi.org/10.1007/978-981-13-0719-5_2
Protein-coding genes account for only about 1% of total human transcripts; the rest, non-coding RNAs (ncRNAs), remain a largely unknown component of mammalian genomes. NcRNAs are ribonucleic acid (RNA) molecules that do not encode proteins. Different types of ncRNAs are involved in different cellular processes, such as regulation of gene expression (miRNAs, piRNAs, lncRNAs), RNA maturation (snRNAs, snoRNAs) and protein synthesis (rRNAs, tRNAs). Among them, miRNAs and lncRNAs are the two most extensively characterized classes, and they are implicated in a variety of biological processes. MicroRNAs (miRNAs) are small (~22 nucleotides in length) non-coding regulatory RNAs found in many eukaryotic organisms. They regulate the expression of target genes at the post-transcriptional level and serve as important regulators of development and disease. Rapid advances in high-throughput sequencing allow unprecedentedly sensitive detection of miRNAs with the help of bioinformatic algorithms such as miRDeep2 and miRanalyzer.
LncRNAs are non-coding transcripts whose length ranges from 200 bp to more than 10 kb [4, 5]. Recent studies showed that lncRNAs play key roles in many normal biological processes, such as vertebrate development, immune responses and cell differentiation, and that they are also related to complex human diseases [6–8]. LncRNAs can participate in gene regulation in many ways, especially in the epigenetic control of chromatin [8–11]. The best-known example is X-chromosome inactivation through the cis-acting lncRNA XIST. LncRNAs can also regulate gene expression in trans: Rinn et al. found that HOTAIR acted in trans to repress transcription of the HOXD locus. Despite these interesting findings for a few lncRNAs, it is difficult to generalize them to the vast number of lncRNAs. More importantly, the functions of most lncRNAs remain largely unknown compared with those of small non-coding RNAs (e.g., microRNAs), which offers opportunities and raises challenges for predicting lncRNA functions.
RNA sequencing (RNA-seq) is a whole-transcriptome sequencing technique that quantifies gene expression over a wide dynamic range. It overcomes the shortcomings of microarray technology and has already been widely used in studies of model organisms and humans. Cabili et al. defined a reference catalog of more than 8000 human long intergenic non-coding RNAs from RNA-seq data, most of which had not been previously described. Recent advances in RNA-seq and in computational methods for transcriptome reconstruction offer an excellent opportunity to annotate and characterize lncRNAs. Indeed, a large number of lncRNAs have been discovered using RNA-seq [6, 15, 16]. Abundant RNA-seq data therefore allow us to comprehensively identify and quantify lncRNAs (as well as protein-coding genes) and to characterize the functions of lncRNAs.
Here, we provided canonical pipelines for identifying miRNAs and lncRNAs using next-generation sequencing data. Applying the pipelines to five human cell lines, we identified miRNAs and quantified the expression levels of mature, precursor and primary miRNAs. As an alternative, we showed how lncRNAs can be re-annotated from microarray data. We examined multiple resources and databases for lncRNA annotation and summarized their common and specific processes. Finally, we took advantage of RNA-seq and miRNA-seq data to construct a comprehensive transcriptome composed of miRNAs, lncRNAs and protein-coding genes in heart failure.
Methods
Identification of miRNAs
Given miRNA-seq data, we could use miRanalyzer to map sequence reads to the miRNA annotations in the miRBase database. The read count of each miRNA could then be calculated and normalized to the total count of sequence reads as RPM (reads per million mapped reads). Lowly expressed miRNAs should be filtered out.
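The normalization and filtering step described above can be sketched as follows. This is only a minimal illustration, not the miRanalyzer implementation: the function names, the toy counts (including the made-up miRNA name `toy-miR-low`) and the `min_rpm` cutoff are assumptions, since the text does not state which expression threshold was used.

```python
def reads_per_million(counts, total_reads=None):
    """Convert raw miRNA read counts to RPM.

    counts: dict mapping miRNA name -> raw read count.
    total_reads: total number of mapped reads; if omitted, the sum of
    the supplied counts is used (an assumption for this sketch).
    """
    total = sum(counts.values()) if total_reads is None else total_reads
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {mirna: n * 1_000_000 / total for mirna, n in counts.items()}


def filter_low_expression(rpm, min_rpm=1.0):
    """Keep miRNAs whose RPM reaches min_rpm (hypothetical cutoff)."""
    return {mirna: value for mirna, value in rpm.items() if value >= min_rpm}


# Toy counts for illustration only.
counts = {"hsa-miR-21-5p": 600_000, "hsa-let-7a-5p": 399_999, "toy-miR-low": 1}
rpm = reads_per_million(counts)
kept = filter_low_expression(rpm, min_rpm=10.0)
print(sorted(kept))
```

In practice the cutoff would be chosen from the data, for example by requiring a minimum RPM in a minimum number of samples.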
Identification of lncRNAs
Given microarray data, we could obtain the probe sequences from the corresponding manufacturer's website and then uniquely map them to the human genome (hg19) with Bowtie, allowing no mismatches. Probes mapping completely within exons of lncRNAs and not overlapping protein-coding genes were retained to represent the corresponding lncRNAs. LncRNAs with fewer than four probes should be filtered out. Given RNA-seq data, we could use TopHat (version 2.0.13) to map the sequencing reads to the human genome (hg19). Cufflinks (version 2.2.1) could then be used to assemble the uniquely mapped reads into transcripts for each sample. Subsequently, the assemblies of all samples were merged with Cuffmerge. Besides known lncRNAs, we could also extract novel lncRNAs. Previously unannotated transcripts with length ≥200 bp that lack coding potential, as calculated by CPAT (version 1.2.2), were defined as novel lncRNAs. Fragments per kilobase per million mapped reads (FPKM) for each known and novel lncRNA could be extracted from the Cufflinks output. Alternatively, read counts could be computed using BEDTools (http://code.google.com/p/bedtools).
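The novel-lncRNA criteria above (previously unannotated, length ≥200 bp, no coding potential) can be expressed as a simple filter over assembled transcripts. The sketch below is illustrative rather than the authors' code: the `Transcript` fields, the transcript identifiers and the default coding-probability cutoff of 0.364 (the value the CPAT authors suggest for human data) are assumptions, because the text does not say which cutoff was applied. The `fpkm` helper only restates the textbook definition of the unit; Cufflinks itself estimates FPKM with a statistical model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    length: int          # spliced transcript length in bp
    annotated: bool      # True if it matches a known annotation
    coding_prob: float   # coding probability, e.g. as reported by CPAT


def is_novel_lncrna(t, min_length=200, coding_cutoff=0.364):
    """Apply the three criteria: unannotated, long enough, non-coding.

    coding_cutoff is an assumed value, not one given in the text.
    """
    return (not t.annotated
            and t.length >= min_length
            and t.coding_prob < coding_cutoff)


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1_000_000_000 / (length_bp * total_mapped_fragments)


# Hypothetical assembled transcripts for illustration.
transcripts = [
    Transcript("TCONS_00000001", 1500, False, 0.02),  # passes all criteria
    Transcript("TCONS_00000002", 150, False, 0.01),   # too short
    Transcript("TCONS_00000003", 2200, False, 0.91),  # likely coding
    Transcript("TCONS_00000004", 3000, True, 0.05),   # already annotated
]
novel = [t for t in transcripts if is_novel_lncrna(t)]
print([t.transcript_id for t in novel])
print(fpkm(500, 2000, 25_000_000))
```

Keeping the thresholds as function parameters makes it straightforward to repeat the filter with the alternative size or coding-potential cutoffs used by other annotation resources.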
Results
miRNA Transcriptome Detected by Small RNA-seq Annotation
To identify miRNAs in five human cell lines and quantify their expression levels, we applied miRNA-seq datasets and downloaded the original whole-cell [...] 0 in the nucleus, but with reads = 0 in the cytosol. Finally, we defined the total number of mapped reads in the nucleus as the expression level of the primary transcripts of intergenic miRNAs.
Re-annotation of Microarray for Revealing lncRNAs
To re-annotate the Affymetrix exon array, we designed a custom pipeline that exploits the many probes on the array that fall within thousands of long non-coding RNAs [15, 21]. The probe sequences could be downloaded from the manufacturer's website (http://www.affymetrix.com), and we then used Bowtie to uniquely map them to the human genome (hg19) without mismatches. Using BEDTools (http://code.google.com/p/bedtools), we kept probes that mapped entirely within exons of lncRNAs and did not overlap protein-coding genes. Finally, expression levels were calculated for lncRNA genes covered by at least four probes.
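The probe-selection logic can be illustrated with plain interval arithmetic. The sketch below is a simplified stand-in for the Bowtie and BEDTools steps, not the pipeline itself: it assumes probes have already been uniquely mapped, uses half-open BED-style coordinates, ignores strand, and all function names and toy coordinates are hypothetical.

```python
from collections import defaultdict


def within(probe, exon):
    """True if the probe lies entirely inside the exon (same chromosome)."""
    return probe[0] == exon[0] and exon[1] <= probe[1] and probe[2] <= exon[2]


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def assign_probes(probes, lncrna_exons, coding_genes, min_probes=4):
    """Map lncRNA gene -> sorted probe IDs, keeping genes with enough probes.

    probes: dict probe_id -> (chrom, start, end)
    lncrna_exons: list of (chrom, start, end, gene_id)
    coding_genes: list of (chrom, start, end)
    """
    gene_probes = defaultdict(set)
    for probe_id, probe in probes.items():
        # Discard probes that touch any protein-coding gene.
        if any(overlaps(probe, gene) for gene in coding_genes):
            continue
        # A probe is credited to every lncRNA gene whose exon contains it.
        for chrom, start, end, gene_id in lncrna_exons:
            if within(probe, (chrom, start, end)):
                gene_probes[gene_id].add(probe_id)
    return {gene: sorted(ids) for gene, ids in gene_probes.items()
            if len(ids) >= min_probes}


# Toy annotation: lncA has one exon partly overlapped by a coding gene.
lncrna_exons = [("chr1", 100, 1000, "lncA"), ("chr2", 0, 500, "lncB")]
coding_genes = [("chr1", 900, 2000)]
probes = {
    "p1": ("chr1", 100, 125), "p2": ("chr1", 200, 225),
    "p3": ("chr1", 300, 325), "p4": ("chr1", 400, 425),
    "p5": ("chr1", 950, 975),   # overlaps the coding gene
    "p6": ("chr1", 990, 1015),  # extends past the exon boundary
    "q1": ("chr2", 10, 35), "q2": ("chr2", 50, 75),  # lncB has only 2 probes
}
result = assign_probes(probes, lncrna_exons, coding_genes)
print(result)
```

For a real array with millions of probes, an indexed interval structure or BEDTools itself would be preferable to this quadratic scan.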
Summaries of lncRNA Annotation Resources
Fig. 2.1 Statistics of the annotation information among different resources. (a) Pie chart showing the distribution of lncRNA annotation resources across different tissues and cell lines. The 21 resources were categorized into 12 cohorts, and those involving multiple tissues and cell lines were added into the corresponding cohorts. (b) Pie chart of the distribution of RNA-seq based resources using single-end, paired-end, and both single- and paired-end sequencing data
We collected 19 publications corresponding to 21 resources that applied high-throughput sequencing data to identify lncRNAs by searching PubMed with the keywords "ChIP-seq", "RNA-seq", "lncRNA", "long noncoding RNA" and "long intergenic non-coding RNA". To distinguish these resources, each is denoted by the name of the first author in capital letters: CABILI, IYER, MORAN, TRIMARCHI, KRETZ, WHITE, KELLEY, HANGAUER, PARALKAR, HE, YANG, NECSULEA1/2, NE, SOWALSKY, KHALIL, SIGOVA1/2, BELL, YAN and DING. We also included three extensively used lncRNA databases, namely GENCODE (V19), LNCipedia (version 2.1) and NONCODE (version 4.0). In total, 24 lncRNA annotation resources were used for further analysis. Together, these 24 human lncRNA annotation resources contained over 205,000 lncRNAs, were built from between 3 and more than 7000 samples, and annotated lncRNAs across more than 50 tissues or cell lines (Fig. 2.1a and Table 2.1). To identify lncRNAs, the majority of resources (such as CABILI and IYER) used RNA-seq data to reconstruct the transcriptome by ab initio or de novo assembly. Among these resources, eight used paired-end sequencing, three used both paired-end and single-end sequencing, and five used single-end sequencing only (Fig. 2.1b). To identify credible lncRNAs, the resources applied various filtering strategies covering five criteria: size selection, coding potential, exon number, expression level and epigenetic signals. Most resources required lncRNA transcripts to be longer than 200 bp, whereas SIGOVA, YAN and KHALIL used thresholds of 100 bp, 1 kb and 5 kb, respectively. Fifteen resources considered expression level, but with different thresholds. Some resources used epigenetic signals derived from ChIP-seq data (such as H3K4me3) to screen for active lncRNAs. Notably, 13 resources focused on lncRNAs in general, while 8 focused on intergenic lncRNAs (lincRNAs).
CABILI used RNA-seq data from 24 tissues and cell types to identify lincRNAs. After the reads were mapped and assembled, transcripts with a single exon, with length less than 200 bases, or with low abundance ([...] 100 or a known protein-coding domain) were filtered out. Potential lincRNAs were then identified from the remaining transcripts that did not overlap known non-lincRNA annotations. HE applied a strategy similar to that of CABILI to find novel lincRNAs in the human prefrontal cortex, except for a different expression threshold (1 RPKM). Using TopHat and Cufflinks, KELLEY assembled a list of lincRNAs from the same RNA-seq data as CABILI, with the same filtering conditions. KRETZ used Cufflinks to assemble dynamic RNA-seq data from primary human keratinocytes and identified novel lincRNAs that have multiple exons and a total length of at least 200 bp and do not overlap any annotated genes. DING analyzed RNA-seq data from breast cancer tissues.
Considering expression abundance (10 reads) and the minimum distance to neighboring genes (1500 bp for both upstream and downstream genes), lincRNAs were recognized after filtering the mapped reads against RepeatMasker, rRNA and other repeated sequences. We used de novo assembly
Table 2.1 Summary of 24 lncRNA annotation resources reviewed in this study
Tissues and cell lines: 24 tissues and cell types; islets and beta-cells; T-ALL cell lines and primary leukemia samples; keratinocytes; lung cancer tissues; 28 tissues and cell lines; failing LV samples; castration-resistant prostate cancer (CRPC) tissues
Data types: RNA-seq and ChIP-seq; RNA-seq
Assembly strategies: ab initio assembly; de novo assembly
Sequencing types: paired-end; paired-end and single-end
Coding potential: BlastX, HMMER, PhyloCSF, GetORF; PhyloCSF; ORF
Expression thresholds: 10 reads; 5 RPKM