It is commonly believed that protein-coding messenger ribonucleic acids (mRNAs) are the primary controller of cells, carrying out the necessary functions for life. Our understanding of how a gene functions has been significantly challenged due to the advances in high-throughput sequencing (HTS) technology, which has provided researchers with an expanded view of the complexity of the human genome, allowing for the identification of diverse types of RNAs. Mounting evidence from various studies has shown that the non-coding portion of the genome plays a more significant role in human biology than previously thought. Depending on the version of the annotation, the protein-coding genome regions represent around 1–3% of the human genome. However, the rest of the genome plays an important role in controlling the expression of the coding deoxyribonucleic acid (DNA), and in the spatial organization of the genome. One of the most critical corroborations of this is the distribution of results of genome-wide association studies (GWAS). Based on the latest GWAS Catalog (June 2017), only 3.56% of disease-associated single-nucleotide polymorphisms (SNPs) reside in the protein-coding region and 96.44% lie in non-coding regions equally proportioned between the intergenic regions and intron regions.
The Encyclopedia of DNA Elements (ENCODE) is a large consortium project aimed at mapping all functional elements in the human genome. The initial results of the ENCODE project suggested that as much as 80% of the genome is biologically active and functional. However, this conclusion has drawn sharp criticism. The arguments against ENCODE’s conclusion include the C-value paradox, the observation that transcriptional activity does not necessarily indicate biological functionality, and the estimate that less than 10% of the genome is conserved through purifying selection. A later study narrowed the conserved fraction to 8.2%. The disagreement between ENCODE and the researchers who reject its conclusion lies primarily in the definition of functionality. For example, the ENCODE researchers consider all DNA that is transcribed to RNA, including non-coding RNA, to be functional, whereas some researchers consider only evolutionarily conserved regions to be functional.
Non-coding RNA (ncRNA) is an RNA molecule that is transcribed from DNA but not translated into a protein. The last decade has witnessed a sharp rise of interest in non-coding RNA research (Figure 1). There are multiple classes and sub-classes of ncRNA. ncRNAs are commonly divided by size into small ncRNA (sRNA, less than 200 nucleotides) and long ncRNA (lncRNA, more than 200 nucleotides). Our review reflects on how the recent widespread interest in non-coding RNA is a result of the combination of the maturity of HTS technology and the bioinformatics development supporting that technology.
LncRNAs are a class of long transcribed but not translated RNAs that are longer than 200 nucleotides. lncRNAs are often transcribed by RNA polymerase II, and can be classified as antisense, intronic, intergenic, divergent and enhancer lncRNAs according to their relative genome position (Figure 1). They have many characteristics similar to mRNAs such as 5’ capping, exon–intron splicing, and poly-adenylation, and the primary distinction from mRNAs is the lack of open reading frames (ORFs). LncRNA was discovered through the bioinformatics analysis of transcriptome data. Unlike protein coding mRNAs, lncRNA was traditionally believed to be non-functional. However, many recent studies have shown evidence for the functionality of lncRNA, such as roles in high-order chromosomal dynamics, embryonic stem cell differentiation, telomere biology and subcellular structural organization. The best characterized lncRNAs come from cancer studies. For example, MALAT-1 was associated with poor prognosis in non-small cell lung cancer, and oral squamous cell carcinoma. Similarly, the lncRNA HOTAIR has been shown to promote metastasis in multiple cancers including breast, gastric, colorectal, cervical and liver cancer. LncRNA SNHG1 has been found to regulate NOB1 expression by sponging miR-326 and promotes tumorigenesis in osteosarcoma. Gioia R et al. observed that silencing lncRNA RP11-624C23.1 or RP11-203E8 could provide a selective advantage to leukemic cells by increasing resistance to genotoxic stress, possibly by modulating the DNA damage response (DDR) pathway. Moreover, lncRNA was recently hailed as a possible biomarker for cancer. For instance, lncRNA ZEB1-AS1 predicts unfavorable prognosis in gastric cancer and lncRNA-ATB has potential as a biomarker for the prognosis of hepatocellular carcinoma and as a targeted therapy for afflicted patients.
The interest in lncRNA has grown considerably as evidence of its role in various biological contexts has accumulated in recent years (Figure 2). lncRNAdb is a database that provides comprehensive annotations of eukaryotic lncRNAs manually curated from referenced literature. The current version of lncRNAdb was released in January 2015 and provides information on sequence, genomic context, expression profile, structure, subcellular localization, conservation, and function for 287 eukaryotic lncRNAs. LncRNADisease, another database, which documents previously identified lncRNA-disease associations and predicts novel ones for human and mouse, was published in 2013. LncRNADisease curates experimentally supported lncRNA-disease association data and integrates a tool for predicting potential associated diseases for a novel lncRNA based on its genomic context. In addition, LncRNADisease collects lncRNA interactions at various levels, including protein, RNA, miRNA, and DNA. The current version (July 2017) of LncRNADisease documents ~3000 published lncRNA-disease associations covering 914 lncRNAs and 329 diseases. Another notable lncRNA resource is LNCipedia (v4.1), a database that contains 146,742 annotated human lncRNAs. Other published lncRNA databases worth noting are lncRNome, a comprehensive database of lncRNA in humans; MONOCLdb, which provides the annotations and expression profiles of mouse lncRNAs involved in influenza and SARS-CoV infections; and NRED, a database of long non-coding RNA expression. A list of currently available databases for lncRNA is shown in Table 1.
The identification of lncRNA requires the detection of transcription from unannotated genomic regions. This can be done by a number of techniques, including tiling arrays, serial analysis of gene expression (SAGE), cap analysis of gene expression (CAGE), and the most powerful technique to date, RNA-seq, which has prompted the development of multiple RNA-seq-based pipelines for identifying lncRNAs. Furthermore, chromatin immunoprecipitation (ChIP) technology, either ChIP-chip or ChIP-seq, can also identify novel lncRNAs indirectly by studying genomic regions with protein or histone modifications.
Traditional RNA-seq libraries were built based on the presence of a poly(A) tail, which is present in all mRNAs except histone-encoding mRNAs. It is estimated that 60% of lncRNAs also have poly(A) tails. Thus, poly(A) tail-based RNA capture cannot describe the entire range of lncRNAs. The preferred library preparation method for this purpose is total RNA library construction, which depletes ribosomal RNA (rRNA) and removes small RNAs by size selection. The common total RNA library methods are currently Ribo-Zero and RNase H, each of which has its own strengths and weaknesses. A total RNA library usually requires more sequencing reads because of the multiple RNA species present in the library, and because rRNA depletion is not 100% efficient, a portion of the rRNA usually remains in the library. In the last few years, lncRNA-focused microarray products have become available for purchase, such as Arraystar Inc.’s Human LncRNA Expression Microarray V4.0. However, these products are limited to known lncRNAs. In our opinion, RNA-seq remains the ideal technology for detecting lncRNAs.
The number of definable lncRNAs varies by study. In 2005, a study claimed to have identified over 35,000 lncRNAs. Another study in 2007 estimated that there are four times more lncRNA compared to protein-coding RNA. In the ENCODE lncRNA release (Version 26), 15,787 lncRNAs were identified, further categorizing them into four sub-classes: antisense, large intergenic non-coding RNAs (lincRNA), sense intronic, and processed transcripts. In the latest study in 2017 by Hon et al., 27,919 human lncRNAs with high confidence 5’ end were described.
According to the latest gene annotation (GRCh38) gene transfer format (GTF) file (Homo_sapiens.GRCh38.89.gtf) from Ensembl, the total length of coding RNA is 97,662,789 nucleotides compared to 9,699,539 nucleotides for lincRNA. How lncRNA exerts its influence over disease is not well understood. The lncRNA-disease association may be mediated through the regulation of protein-coding gene expression by a trans-expression quantitative trait locus (eQTL), and these trans-eQTL SNPs could be surrogates for SNPs residing in protein-coding RNA within the same high-linkage disequilibrium haplotype block. Cabili et al. found that many tissue-specific cis-eQTLs are SNPs with known disease or trait associations. Hon et al. found that eQTL-linked lncRNA-mRNA pairs were more co-expressed than random lncRNA-mRNA pairs.
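As a rough illustration of how per-biotype totals like the ones quoted above can be derived from a GTF file, the following Python sketch sums merged exon intervals for each gene biotype. It assumes an Ensembl-style GTF in which exon records carry a `gene_biotype` attribute; the function names and the four toy records are hypothetical and are not taken from Homo_sapiens.GRCh38.89.gtf. The published figures may have been computed with a different counting rule (for example, by transcript rather than by merged exons).

```python
import re
from collections import defaultdict

BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')

def parse_exons(gtf_lines):
    """Yield (biotype, chrom, start, end) for every exon record in GTF lines."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to transcript length
        match = BIOTYPE_RE.search(fields[8])
        if match:
            yield match.group(1), fields[0], int(fields[3]), int(fields[4])

def total_length_by_biotype(gtf_lines):
    """Sum non-overlapping exonic bases per biotype.

    GTF coordinates are 1-based and inclusive. Overlapping exons (for example
    from alternative transcripts) are merged so that no base is counted twice.
    Strand is ignored when merging, which is a simplification.
    """
    intervals = defaultdict(list)
    for biotype, chrom, start, end in parse_exons(gtf_lines):
        intervals[(biotype, chrom)].append((start, end))

    totals = defaultdict(int)
    for (biotype, _chrom), spans in intervals.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:      # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:                         # gap: close the current interval
                totals[biotype] += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        totals[biotype] += cur_end - cur_start + 1
    return dict(totals)

# Toy records (made up for illustration only).
EXAMPLE_GTF = [
    '1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    '1\ttoy\texon\t150\t249\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    '1\ttoy\texon\t1000\t1099\t.\t-\t.\tgene_id "G2"; gene_biotype "lincRNA";',
    '1\ttoy\tgene\t1000\t1099\t.\t-\t.\tgene_id "G2"; gene_biotype "lincRNA";',
]

if __name__ == "__main__":
    print(total_length_by_biotype(EXAMPLE_GTF))
```

For a real annotation, the same function can be fed an open file handle, since it only iterates over lines.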
Even though the majority of lncRNAs are believed to be non-coding, some of them may harbor novel unannotated proteins. In recent years, there have been increasing efforts to predict the coding potential of lncRNA. The first tool created for this purpose was the Coding Potential Calculator (CPC). CPC assesses coding potential by incorporating features such as the length, coverage, and integrity of the predicted open reading frame into a support vector machine (SVM). Getorf, a tool in the European Molecular Biology Open Software Suite (EMBOSS), identifies ORFs by locating start and stop codons. The Coding-Potential Assessment Tool (CPAT) was designed to assess coding potential through a logistic regression model using features similar to those in CPC, such as ORF length and ORF coverage. CPAT uses two additional features, the Fickett score and the hexamer score, where the Fickett score describes the nucleotide composition of the RNA, and the hexamer score indicates the relative degree of hexamer usage bias in a particular sequence. The latest entry in this tool category is slncky, which is designed to discover lncRNAs from an RNA-seq dataset and assess their functional importance. Slncky assesses coding potential by identifying conserved ORFs in syntenic regions across multiple species, with the syntenic regions identified by liftOver.
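To make two of the features mentioned above concrete, the Python sketch below computes ORF length and ORF coverage for a transcript sequence. It is a minimal illustration and not the implementation used by CPC, CPAT, or getorf: it scans only the three forward frames, requires an ATG start and an in-frame stop codon, and the function names are hypothetical. Real tools add further features (such as the Fickett and hexamer scores) and a trained classifier.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    Returns None when no complete ORF is found.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None                   # look for the next ORF in frame
    return best

def orf_features(seq):
    """Return ORF length (nt) and ORF coverage (ORF length / transcript length)."""
    orf = longest_orf(seq)
    if orf is None:
        return {"orf_length": 0, "orf_coverage": 0.0}
    length = orf[1] - orf[0]
    return {"orf_length": length, "orf_coverage": length / len(seq)}

if __name__ == "__main__":
    # Toy transcript: a 12-nt ORF (ATG AAA TTT TAG) inside a 16-nt sequence.
    print(orf_features("CCATGAAATTTTAGGG"))
```

A short ORF that covers only a small fraction of a long transcript is the kind of pattern such classifiers tend to associate with non-coding transcripts, although neither feature is decisive on its own.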
3. Circular RNA (circRNA)
CircRNA is a type of evolutionarily conserved RNA that forms a covalently closed continuous loop. Two mechanisms have been proposed for the formation of circRNAs: direct ligation of the 5’ and 3’ ends of linear RNA, and backsplicing, wherein a downstream 5’ splice site (splice donor) is joined to an upstream 3’ splice site (splice acceptor) (Figure 3). It has been suggested that the formation mechanism is associated with the RNA editing enzyme adenosine deaminase acting on RNA, and that the RNA-binding protein quaking regulates the formation of circRNAs. It has also been suggested that circRNA can be formed from mRNA and from other non-coding RNA in the intergenic regions. CircRNAs do not possess poly(A) tails; thus, total RNA library preparation is more suitable for circRNA detection. There have been multiple reports that circRNA can be detected in non-tissue samples, such as saliva, blood plasma and seminal plasma. Currently, a major recognized function of circRNA is to act as a microRNA (miRNA) sponge. It was found that miRNAs bind circRNAs more readily than mRNAs, allowing circRNA to repress miRNA regulation of mRNA.
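The read-level signature of backsplicing can be sketched in a few lines of Python. A read that spans a back-splice junction aligns in two pieces whose genomic order is reversed relative to their order in the read. The `Segment` record, the function name, and the `max_span` threshold below are hypothetical, and the sketch assumes both pieces are reported in reference orientation (as in SAM records). It is only a simplified filter: published circRNA detection tools apply additional checks, such as splice-site motifs and annotation support.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """One aligned piece of a split read (hypothetical minimal record)."""
    chrom: str
    strand: str      # alignment strand, "+" or "-"
    ref_start: int   # 0-based, inclusive genomic start
    ref_end: int     # 0-based, exclusive genomic end
    read_start: int  # offset of this piece within the read, reference orientation

def is_backsplice_candidate(seg_a, seg_b, max_span=100_000):
    """Return True if two pieces of one read look like a back-splice junction.

    In a collinear (linear) spliced alignment, read offset and genomic position
    increase together. At a back-splice junction the earlier part of the read
    maps downstream of the later part on the same chromosome and strand.
    """
    if seg_a.chrom != seg_b.chrom or seg_a.strand != seg_b.strand:
        return False
    first, second = sorted((seg_a, seg_b), key=lambda s: s.read_start)
    if first.ref_start < second.ref_end:
        return False  # collinear or overlapping: consistent with linear splicing
    # Limit the genomic distance spanned by the putative circle (assumed cutoff).
    return first.ref_end - second.ref_start <= max_span

if __name__ == "__main__":
    head = Segment("chr1", "+", 5000, 5050, 0)    # read start maps downstream
    tail = Segment("chr1", "+", 1000, 1050, 50)   # read end maps upstream
    print(is_backsplice_candidate(head, tail))
```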
CircRNA was originally identified through the analysis of scrambled exons over two decades ago using older, low-throughput technology. CircRNA was considered rare and exclusive to a few viruses, and it evaded detection in mammalian genomes for a long time because it lacks a poly(A) tail. Advances in HTS technology have allowed researchers to scrutinize the mammalian genome in unprecedented detail, leading to the identification of thousands of additional circRNAs. A large portion of the identified circRNAs are derived from protein-coding genes. However, after formation, these circRNAs are not able to undergo translation to proteins, and thus they have been categorized as non-coding RNA. Interestingly, several groups have recently discovered a protein-encoding function for some circRNAs, revealing an unexplored model of gene expression. An increasing number of human disease studies have been devoted to circRNAs (Figure 2). For example, circRNA has been found to be a biomarker for cancer and is associated with neurological disease. The molecular roles and functions of circRNA, and how circRNA dysregulation affects disease, have been reviewed elsewhere.
Novel bioinformatic tools have contributed greatly to the identification of new circRNAs from RNA-seq data. Notable circRNA identification tools are find_circ, MapSplice2, Segemehl, circExplorer, circRNA_finder, CIRI, ACFS, KNIFE, NCLscan, DCC and UROBORUS. Information regarding these tools is listed in Table 2. In addition, CircView, a visualization platform, was specifically developed to visualize circRNAs detected by these tools. Furthermore, unlike other types of ncRNAs, circRNAs are not well annotated. The traditional RNA annotation databases RefSeq and Ensembl do not currently contain circRNA information. Instead, seven independent circRNA databases are available (Table 3).
Pseudogenes are a category of ncRNA that resemble mRNA but are not translated into proteins. Two major mechanisms are currently proposed for the formation of pseudogenes. The first suggests that pseudogenes are products of genomic duplication. Genes in the duplicated regions often retain the original functions of their parent protein-coding genes. High mutation rates have been observed in or near duplicated regions. When mutations such as stop-gains and frameshifts disrupt the original function of the duplicated genes, pseudogenes are formed. The second mechanism is retrotransposition, in which reverse transcription of mRNA produces cDNA that is re-integrated into the genome, forming new pseudogenes. The formation of pseudogenes provides vital clues on how genomic DNA has adapted to evolutionary pressure to ensure survival. Pseudogenes are often regarded as dead or disabled with respect to protein synthesis. However, evolutionary studies have found evidence that a small number of pseudogenes in the human lineage have regained their protein-coding function.
The detection of pseudogenes usually relies on the careful analysis of sequence alignment. A common approach is to determine the homologous sequences to protein coding genes using tools such as FASTA or BLAST. Several more complicated computational approaches have been developed over the years to detect pseudogenes. However, their utilization has been low, probably partially due to the lack of interest in pseudogenes. Despite this lack of enthusiasm, several interesting studies have shown how pseudogenes can potentially affect human health. Most of the proposed pseudogene functions are facilitated through the homology of their sequences. An example is PTENP1 which is the pseudogene of PTEN, a well-characterized tumor suppressor. PTENP1 regulates cellular levels of PTEN by both sense and antisense RNAs which act as decoys for PTEN targeting microRNAs and also exert tumor-suppressive activities. It has been shown that a missense mutation in PTENP1 can eliminate a codon associated with methionine initiation, thus inhibiting the translation of regular PTEN protein. Another example is a pseudogene acting as a miRNA sponge. Studies have shown that BRAFP1 and PTENP1 compete for miRNA binding with their mRNA counterparts. The majority of the pseudogenes have been characterized in RefSeq and Ensembl. They are usually part of the regular RNA-seq analysis. Additional pseudogene resources for non-human species can be found at pseudogene.org. Pseudofam provides family based pseudogene resources.
5. Small RNA
sRNAs are non-coding RNAs with a length of less than 200 nucleotides. The discovery of sRNAs has substantially enriched our understanding of the diverse world of RNAs. There are many species of sRNA. Here we discuss eight species that have been shown to be detectable from HTS data: miRNA, transfer RNA (tRNA), piwi-interacting RNA (piRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), Y RNA (yRNA), signal recognition particle RNA (7SL RNA), and 7SK RNA.
The most studied small RNAs are miRNAs, with more than 10,000 published manuscripts on PubMed (Figure 2). Originally discovered in 1993, miRNAs are single-stranded ncRNAs of 19–25 nucleotides, which modulate translation by binding mRNAs through the seed sequence (up to seven nucleotides). Prior to the introduction of HTS technology, high-throughput miRNA studies were conducted using hybridization-based technology, which limited detection to known and annotated miRNAs. The advancement of HTS has substantially increased the detection throughput of miRNA. More importantly, HTS enables the examination of miRNA at single-nucleotide resolution in addition to the quantification of abundance. By scrutinizing the precise nucleotide sequences of miRNA, researchers have discovered the phenomenon of miRNA isoforms (isomiRs). IsomiRs are miRNAs whose seed sequences are clipped relative to the reference miRNA sequences (Figure 4). The seed sequences of isomiRs and their parent miRNAs can differ by up to two nucleotides, causing a substantial difference in the repertoire of predicted mRNA targets. Because miRNAs are well studied, their annotations are readily available in most annotation databases, such as Ensembl and RefSeq. In addition, more than 50 miRNA and/or miRNA target resources are now available (Table 4 and Table 5). The most commonly used independent miRNA database is miRBase, and the most used miRNA target prediction webserver is TargetScan. Another feature of miRNA that can be characterized by HTS technology is the non-templated addition of nucleotides at the 3’ end of miRNAs. miRNAs generally function as post-transcriptional regulators of protein-coding genes. Thus, miRNAs play important roles in most biological processes, including development, proliferation, differentiation, immune reaction, apoptosis, tumorigenesis, and adaptation to stress.
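The effect of a shifted 5’ end on the seed can be illustrated with a short Python sketch. It assumes the common convention that the seed occupies nucleotides 2 to 8 of the mature miRNA; the function names and the toy precursor string are hypothetical and chosen only for demonstration. The read is located on the precursor by exact matching, so reads carrying non-templated 3’ additions or mismatches are not handled here.

```python
def seed(mature, start=1, length=7):
    """Return the seed of a mature miRNA (nucleotides 2-8, 1-based; assumed convention)."""
    return mature[start:start + length]

def classify_isomir(precursor, ref_start, ref_end, read):
    """Describe how a read differs from the reference mature miRNA.

    `ref_start`/`ref_end` give the reference mature miRNA on the precursor
    (0-based, end exclusive). Returns None if the read is not an exact
    substring of the precursor.
    """
    pos = precursor.find(read)
    if pos == -1:
        return None
    reference = precursor[ref_start:ref_end]
    return {
        "shift_5p": pos - ref_start,               # >0: 5' end trimmed, <0: extended
        "shift_3p": (pos + len(read)) - ref_end,   # >0: 3' end extended, <0: trimmed
        "seed_changed": seed(read) != seed(reference),
    }

# Toy precursor (illustrative only); the reference mature miRNA is positions 4..26.
PRECURSOR = "GGAAUGAGGUAGUAGGUUGUAUAGUUCCAA"
REF_START, REF_END = 4, 26

if __name__ == "__main__":
    canonical = PRECURSOR[REF_START:REF_END]
    shifted_5p = PRECURSOR[REF_START + 1:REF_END]   # one nucleotide lost at the 5' end
    trimmed_3p = PRECURSOR[REF_START:REF_END - 2]   # two nucleotides lost at the 3' end
    for read in (canonical, shifted_5p, trimmed_3p):
        print(read, classify_isomir(PRECURSOR, REF_START, REF_END, read))
```

In this toy case only the 5’-shifted read changes the seed, which is why 5’ isomiRs are the variants expected to alter the predicted target repertoire.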
miRNAs have exhibited potential as biomarkers or therapeutic targets for human diseases, including cancer. For example, overexpression of miR-185 was shown to inhibit autophagy and apoptosis of dopaminergic cells in Parkinson’s disease, potentially via regulation of the AMPK/mTOR signaling pathway. Several miRNAs have been repeatedly reported to be significantly dysregulated in prostate cancer, including the down-regulated miR-143/145 and the up-regulated let-7a, miR-130b, miR-141, and miR-17-5p. In lung cancer, the 5q33 region containing miR-143 and miR-145 is often deleted, which implies decreased expression of both miRNAs.
HTS technology has also facilitated the discovery and identification of a wide range of sRNA species. In the infancy of HTS technology, small RNA sequencing was often referred to as miRNA sequencing, as the goal was mostly to study miRNA. Through meticulous examination of HTS data, researchers became aware that miRNAs are only a fraction of the sRNA-sequencing data. The sequencing libraries for sRNA are constructed with size selection by gel electrophoresis, which is agnostic to sRNA categories: all RNAs of less than 50 nucleotides are selected into the library. The most, or second most, abundant sRNA species is often tRNA (Figure 5A). tRNAs are 76 to 90 nucleotides long and serve as the physical link between mRNA and protein. The tRNAs detected with sRNA sequencing are usually tRNA-derived sRNAs, fragments of the parent tRNA, usually the 33-nucleotide sequence before the anticodon or the 33-nucleotide sequence after the anticodon (Figure 5B,C). The fragmentation of tRNA is caused by cleavage by the RNase III enzyme, producing tRNA-derived halves. Moreover, tRNAs can also be cleaved in a Dicer-dependent manner or, as an in vitro phenomenon, by incubation with MgCl2 or nuclease S1. The 5’ tRNA fragments have been found to inhibit translation initiation by interfering with the cap-binding complex eIF4F. The production of tRNA fragments has been shown to be associated with stress. The detection of tRNA nucleotide variants and their association with diseases has also been increasingly reported. For example, in animals, tRNA fragment abundance has been found to be correlated with the severity of tissue damage in kidneys. These 5’ tRNA fragments may be captured by high-throughput sequencing. In 2015, a novel technique was developed to sequence the entire tRNA by removing the bases with potential modification.
Other notable species of sRNA that can be detected through HTS are piRNA, yRNA, snRNA, snoRNA, 7SL, and 7SK. piRNA is a small RNA with a length of 24 to 32 nucleotides, and is considered by many to be the most abundant species of sRNA. piRNAs form RNA-protein complexes with the piwi proteins. Although the piRNA pathway has been commonly perceived as germline-specific, recent studies have demonstrated that the piRNA pathway has somatic functions and potential associations with cancer. Several studies have suggested that piRNA can be derived from pseudogenes. yRNAs are components of the Ro60 ribonucleoprotein particle. Their primary function is in DNA replication, through interaction with chromatin and initiation proteins. The latest studies have suggested that yRNAs regulate cell death and inflammation in monocytes. snRNA is a species of small RNA confined to the splicing speckles and Cajal bodies of the nucleus in eukaryotic cells. The average length of snRNA is 150 nucleotides. The snRNAs bind with proteins to form small nuclear ribonucleoprotein particles (snRNPs). Some snRNAs are engaged in the formation and function of spliceosomes, where pre-mRNA splicing occurs. The transcription of snRNA is carried out by RNA polymerase II or III. snoRNA is a species of sRNA with a length of 60 to 300 nucleotides. Its primary function is to guide chemical modifications of ribosomal RNA and tRNA. The two main classes of snoRNA are C/D box snoRNA and H/ACA box snoRNA. The primary function of C/D box snoRNA is to regulate methylation and pre-mRNA splicing. The H/ACA box snoRNAs have been associated with pseudouridine, the most abundant modified nucleoside in RNA. Discovered in 1970, 7SL is a species of sRNA and a component of the signal-recognition particle ribonucleoprotein complex. The primary functions of 7SL RNA are to regulate protein translation and post-translational transport. 7SK is a species of sRNA found in metazoans. Its primary role is regulating transcription through the regulation of the positive transcription elongation factor P-TEFb.
Many miRNA processing pipelines have been established for HTS data. The major pipelines include Oasis, Chimira, miRge, and TIGER. Several tools have also been developed to detect isomiRs, such as SeqBuster, isomiRID and DeAnnIso. One tRNA detection tool, tDRMapper, is currently available. For other species of sRNAs, most alignment-based pipelines are sufficient for detection, provided that the annotation is available.
The biology of the human body is a vast and complex system, and we are just beginning to understand the role of non-coding RNA in regulating that system. The advancement of biotechnology contributes to every breakthrough in our understanding of human biology. ncRNAs, once regarded as an insignificant portion of the RNA universe, have become the subject of a rapidly growing array of studies centered on their potential functions and disease associations. This surge of interest is due to the maturity of HTS technology and the development of bioinformatics that allows the interpretation of HTS data.
The interest in ncRNA is reflected by the increasing number of manuscripts published pertaining to ncRNA and the initiation of large consortium projects focused on ncRNA, such as ENCODE. The controversy surrounding whether ncRNAs are functional really comes down to the definition of “functional.” In the 2012 ENCODE publication, all transcribed RNAs were considered to be functional, but some researchers require a stricter definition. Nevertheless, the majority of the human genome should have a purpose, whether it is to synthesize proteins or to serve as a sponge for miRNA, or with lost or undiscovered mechanisms.
The common consensus of the functions of ncRNA is that they regulate gene expression at both transcriptional and epigenetic levels. The exact mechanism of this regulation varies by ncRNA categories; and some may be yet to be discovered. As HTS technology and sequencing library construction methods advance, we come closer to elucidating the entire human RNA spectrum and uncovering the secrets of ncRNAs.