Swine are an important models for human anatomy, nutrition, metabolism and immunology [1–3]. Their organs are anatomically and histologically similar to humans as are their sensory innervation and blood supply. Pigs are naturally susceptible to infection with organisms that are closely related or identical to those species infecting humans including helminths (Ascaris, Taenia, Trichuris, Trichinella, Shistosoma, Strongyloides), bacteria (Campylobacter, Chlamydia, Eschericia coli, Helicobacter, Neisseria, Mycoplasma, Salmonella), protozoans (Toxoplasma) and viri (Coronavirus, Hepatitis E, Influenza, Nipah, Reovirus, Rotavirus) [2, 5, 6]. The last 10 years has seen a boon in the development of genetically modified pig as models for human cardiovascular and lung disease, neurodegenerative and musculoskeletal disorders [7, 8] and cancer. There is also a robust effort to develop pigs as sources for organs and tissues for human xenotransplantation.
Despite these potential strengths as a model, the lack of an annotated database for porcine gene and protein expression data is a limiting factor for translating findings in one species to another. Multiple online databases exist for the storage and retrieval of diverse bovine, rodent or human biomedical data [11–19]. Other databases exist for Zebrafish (ZFIN,), C. elegans (WormBase,), and Drosophila melanogaster (Flybase,). Databases that encompass multispecies analysis such as Homologene and/or that rely on manual annotation such as InnateDb include bovine but not porcine genes. Several porcine genome companion databases exist; however they lack robust manual annotation and are somewhat limited in scope or are infrequently updated [16–19]. Agbase, a large, multispecies functional analysis database allows the user to search 51,489 porcine genes based on 12 criteria including gene and protein names (UniProt) and Gene Ontology (GO) annotations. Furthermore, databases can contain a significant number of errors due to their primary reliance on machine-based annotation. For example, the SUS-BAR database is designed to identify protein orthologs based upon data that includes annotations from the machine-annotated NCBI genome. NCBI has recently begun to include GO annotations into curated entries for non-human and rodent species but most of these are indirect and often based on observations made in other species. As swine are an important model for comparative human studies, there is a critical need to have a centralized, manually-curated source of information for biomedical research. To address these needs, we created the Porcine Translational Research Database.
Genbank (non-redundant, expressed sequences tag, high throughput genomic sequence, trace archive databases and whole genome shotgun contigs databases) was searched by discontiguous Megablast using default settings (word size = 11), using reference sequence accession numbers to human, bovine, mouse or rat genes/proteins of interest. A similar search was conducted in the following databases using the human or bovine reference sequence; NIH Intramural Sequencing Center (NISC) Comparative Vertebrate Sequencing Project; National Center for Biotechnology Information (NCBI), Sus scrofa Genome Assembly releases 102 to 105 and Ensembl v10.2 releases 83 to 89. For genes that were determined to be missing from build 10.2 (Additional file 1) (and for the mis-assembled or duplicated gene artifacts (Additional file 1), we also constructed templates from de novo assemblies derived from Illumina 80 bp reads of the pig alveolar macrophage transcriptome (Dawson, unpublished results) using the de novo assembly algorithm of CLC Genomics Workbench using word size of 20 and a bubble size of 50. When necessary, predicted templates (from bovine or human sequences) were supplemented with porcine expressed sequence tag (EST) assemblies, single ESTs and portions of the published Tibetan (Bioproject # PRJNA291130), Wuzhishan (Bioproject # PRJNA144099), Goettingen (Bioproject # PRJNA291011), Jinhua, Meishan, Bamei, Large White, Berkshire, Hampshire, Pietrain, Landrace, Rongshang and Duroc (Bioproject # PRJNA309108) porcine genomes. ESTs were assembled using CAP3 (http://doua.prabi.fr/software/cap3). RNASeq reads were then mapped to these predicted templates in order to derive the full-length consensus sequence (unambiguous 6X coverage) using CLC Genome Workbench 7.0 (QIAGEN Bioinformatics, Redwood City CA). The following settings were used. Mismatch cost =2, Insertion cost = 3, Deletion cost = 3, similarity fraction = 0.95, length fraction = 0.95. Nucleotide sequences were translated using the ExPASy translate tool (http://web.expasy.org/translate/). A total of 1279 of these sequences have been deposited to the transcriptome shotgun assembly sequence database under Bioproject PRJNA80971 and the short read archive under project SRP013743). In silico-derived full-length RNA sequences are provided for an additional 3391 genes. This process/pipeline is summarized in Fig. 1. A summary of these sequences is provided in Table 1.
We randomly chose 268 of these mRNA for comparison of the 5′, 3′ and ORF length comparison to the corresponding human mRNA. Data are presented in Additional file 4. For the 1041 protein-coding genes missing from the genome, we entered the gene symbols into the DAVID version 6.8 (https://david.ncifcrf.gov) to assess overrepresentation of groups of gene with related function. The functional data were limited to human. Nine hundred and fifty six genes out of 1041 genes were recognized and 955 had functional annotations, of the unrecognized gene 41 are pig or artiodactyl specific genes. Data on functional enrichment of genes with a multiple comparison adjustment (Benjamini) value of >0.05 are presented in Table 3. We chose the 60 largest proteins of extreme size (>3000 amino acids) to compare the status (number of loci and completeness) in the NCBI and Ensembl build 10.2 genome. Because exon preservation is usually well conserved and there is fragmentation of certain areas of the porcine genome, the number of exons for the corresponding human gene was used for comparison. Lastly, we determined the chromosomal location of 1307 duplicated gene artifacts (2889 loci, Additional file 2) to identify problematic regions. Data are expressed as duplication per megabase (number of bases derived from the NCBI genome build (http://www.ncbi.nlm.nih.gov/genome?term=sus%20scrofa) and are presented in Fig. 2.
The currently described database was constructed in the Filemaker Po Advanced v14.0 program (Filemaker Inc., Santa Clara, CA). The layout is illustrated in the sample database entry for the cytokine IL10 (Fig. 3 panels A–D). It was deployed using the Filemaker Server Advanced v14.0 program (Filemaker Inc., Santa Clara, CA). External access to the database has been successfully tested using Chrome, Internet Explorer and Safari browsers. Other areas of the database were populated from existing published or our own unpublished data. Each publication is manually reviewed and data (antibodies, real-time PCR assays, RNA or protein expression data, functional data) is abstracted and entered into the database, along with the Pubmed ID, in the appropriate field. We have developed Taqman real-time PCR assays for 1867 of these genes making them cross reactive for as many species as possible (1067 are partially or fully human gene cross reactive). This is to ensure that comparable areas of the gene are being analyzed as well as for economic reasons. We also conducted a literature survey to determine the sequence of porcine SYBR green PCR assays. Tissue-specific gene expression summaries, using these assays, are provided for these and other studies (i.e., those using microarray and RNASeq), and a comprehensive search of catalog and published literature to identify antibodies to the corresponding proteins. Last, the “Notes Field” in the database was populated with information such as types of errors discovered, degree of 5′ and 3′ UTR conservation, degree of positive selection pressure in various species, and intron status. When the gene (sequence) is present in the genome but not annotated as a gene, we annotate the gene in the Notes field as “Not an identified gene in Ensembl build 10.2.” or “not an identified gene in NCBI build 10.2”.
To date, we have generated 9720 full-length transcripts representing 9165 genes (Table 1). They include 1354 genes missing from Ensembl build 10.2 (Table 2 and Additional file 1) and 1400 genes that have been sequenced at least two times (gene duplicated artifacts shown in Table 2 and Additional file 2 that were annotated as separate genes in either Ensembl or NCBI builds. Functional enrichment analysis of 1041 protein-coding genes that are missing from the genome reveals that genes that are annotated as cytokines (24, p = 0.0053) and transcription factors (68) (particularly Homeodomain-like transcription factors (34, p = 0.032) and CENP-B/Helix-turn-helix (HTH) domains (6, p = 0.035) are significantly overrepresented (Table 3). Of note, the great majority of the Interleukin 1 Superfamily (IL1F10, IL1RN, IL36A, IL36B, IL36G, IL36RN, IL37) members are significantly (p = 0.0073) overrepresented. Data analysis that do not account for these genes risk missing assessment of important genes involved in inflammation and development.
Based upon gene number estimates from other closely related species such as human and cow, we estimated that our database has a coverage rate of approximately 42% of the porcine genome. These represent sequences found in 10,232 Unigene entries (1.45 per gene), 9967 NCBI loci (5756 are single loci that are not duplicated gene artifacts or split into multiple loci, and 1793 genes have multiple (4211) loci. A total of 2109 and 1616 of the genes have no assigned Unigene number or NCBI loci, respectively. In addition to GO and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, literature-based functional annotations (derived from more than 5500 references) are provided for these sequences. We have also discovered a relatively large number (178) of porcine or artiodactyl-specific paralogs (Additional file 3) for 104 protein or non-protein coding porcine genes. For genes with multiple paralogs, genes are named in the order of phylogenetic distance of the parent human or bovine gene. Some of these genes are expressed pseudogenes. Some of these genes have been previously discussed (i.e., CD36, IL1B [25, 31]) or will be discussed in the following sections.
The transcripts we have generated for protein-coding genes include, on average, 70.5% of the corresponding 5′ and 3′ ends (each) of the human sequence (Additional file 4). The ORF is 99.4% conserved on a nucleotide count basis. These percentages indicate the fidelity of our procedure. We discovered extensive gene truncation (incomplete ORF) and gene duplicated artifacts (genes sequenced more than once) among the machine annotated versions of these genes. These problems are common among 1st drafts of other genomes [32, 33]. Gene duplicated artifacts appear most frequently for chromosomes 12, 2 and 3, and less frequently for chromosomes X, 11, 13, and 1 (Fig. 2). The most frequent areas should be targeted for re-sequencing or reassembly. Analysis of the 60 largest porcine proteins in the database shows that gene fragmentation and truncation roughly correlate with protein size and number of exons (Table 4). Figure 4 shows BLAST search results from two extremely large proteins, hemicentin (HMCN1, panel a) and titin (TTN, panel b) that have 9 loci assignments each, in the current NCBI build. Surprisingly, these proteins are not represented in Ensembl build 10.2 as annotated genes. Overall, of the 60 largest porcine proteins, only 6 and 10 are represented as single full-length sequences of the correct size in Ensembl and Genbank, respectively. We have deposited 12 de novo assemblies in the TSA archive and have provided in silico predicted RNA and protein sequences for 37 of these genes.
In previous studies, we extensively compared porcine, human and mouse genes related to immunity and inflammation [2, 3, 25, 26]. In the following section, we will summarize our findings for three major Superfamilies or functionally related groups of proteins (CD marker genes, Solute carrier superfamily, ATP binding cassette superfamily) or non-coding RNA (microRNA) that have complete or nearly complete representation. CD markers (accessible as a group by entering CD markers in the Annotations field) encode a heterogeneous group of cell surface proteins. The Human Leucocyte Differentiation Antigen (HLDA) workshop has designated 408 molecules (some of which are grouped within a CD) as CD markers. Based upon our assembly and analysis, we could establish 1:1 orthology for 357 porcine genes to those that compose HLDA version 10. Forty-three genes are not present in the porcine genome or could not be designated as 1:1 orthologs. Of these, nine genes (CLEC4C, CLEC4M, SIGLEC7, BTN3A1, LILRA1, LAIR2, PSG1, SIRPG, TNFRSF10C, are primate-specific [35–37]. KLRC2 (CD159c) is found in humans and rodents but not pigs. FCGR2C is a human-specific gene/pseudogene that belongs to a family of three low-affinity immunoglobulin gamma Fc receptors (CD32). We have determined that pigs have two member of this family that roughly corresponds to FCGR2A and FCGR2B. TNFRSF14 (CD270) is a marker for B cells, dendritic cells, monocytes, and Treg cells found in humans and rodents, but not cows. Although, canine, feline, equine and ursine homologs have been identified, this gene may be a pseudogene in pigs as the putative ORF is interrupted by an endogenous retroviral sequence (H. Dawson, unpublished). FCRL2 (CD307b) is a marker for B cells in humans. Although sequences corresponding to FCRL2 have been identified in other mammals including dog and horse, no mouse ortholog has been identified. This gene shows evidence of positive selection in humans and is most likely a pseudogene in pigs.
Due to rapid evolution and post-speciation gene duplication, no 1:1 orthology could be established for most mouse and pig LILR or KIR family members, including LILRA4 (CD85G) and LILRB4 (CD85K). Similarly, other than CEACAM1 (CD66) and CEACAM6 (CD66C), no 1:1 orthology could be established for most pig and mouse CEACAM family members (CEACAM3 (CD66D) CEACAM5 (CD66E). CEACAM8 (CD67) may be a pseudogene as ESTs in Unigene Ssc.60435 predict a 243 amino acid protein interrupted by several stop codons. CEACAM8 and CEACAM6 were previously determined to have no direct murine orthologs. Several other shared human-pig CD marker orthologs (ADGRE2 (CD312), ADGRE3 (CD313r), CD1A, CD1E, CR1 (CD35), CD58, FCGR2A (CD32), FCAR (CD89), FCRL3 (CD307c), FCRL4 (CD307d), ICAM3 (CD50), NCR2 (CD336), NCR3 (CD337) and TLR10 (CD290r) have no rodent orthologs [2, 40, 43].
A significant number of errors were discovered in genes encoding porcine orthologs of human CD markers; 25 are not present in Ensembl build 10.2, 88 of the proteins are truncated and 52 are duplicated gene artifacts. Sixty-seven full-length mRNA sequences encoding proteins, assembled from macrophage RNA-Seq reads, have been deposited in Genbank. An additional 79 in silico constructs are provided. Antibody data, gathered from publications, manufacturers or generated in house, is provided for 186 proteins including 395 monoclonal and 285 polyclonal antibodies. Additional cross reactivity for 29 proteins is expected because they are >95% similar to human proteins. Several of the CD Marker family are members of other gene families including the Solute Carrier and ATP-binding Cassette Super Family.
The Human Genome Organization’s gene nomenclature committee (HGNC) has assigned 395 genes to the Solute Carrier Superfamily, 21 are pseudogenes and three hundred seventy four encode proteins (accessible as a group by entering Solute Carrier Superfamily in the Annotations field). These are organized into 52 subfamilies; about 25% are dedicated to nutrient transport. The porcine Solute Carrier Super family contains 398 protein-coding members and all human subfamilies are represented. Forty-two of these genes are present in other porcine genomes but missing from Ensembl build 10.2, 113 are truncated and 58 of these are duplicated gene artifacts. Sixty three full-length mRNA sequences, assembled from macrophage RNA-Seq reads, have been deposited in Genbank and an additional 159 in silico constructs are provided. Forty-two of these genes are missing from all porcine genomes or are present as pseudogenes. Among these genes are UCP1 (thermogenein), a protein involved in non-shivering thermogenesis and a pseudogene in pigs and SLC52A2, a primate specific riboflavin transporter. Other species-specific genes include eight primate-specific (SLC2A14, SLC22A24, SLC35E2, SLC35G3, SLC35G4, SLC35G5, SLCO1B1, SLCO1B7), one human specific (SLC22A25) gene and 14 mouse or rodent-specific genes (Slc6a20b, Slc7a12, Slc21a4, Slc22a19, Slc22a21, Slc22a22, Slc22a26, Slc22a27, Slc22a28, Slc22a29, Slc22a30, Slco1a1, Slco1b2, and Slco6b1). SLC25A18 is present in human and rodent genomes but is missing from bovine and porcine genomes. SLC25A52 is present in primate and rat genomes but not mouse. SLC9C2 is a pseudogene in mouse. SLC22A31 is an expressed pseudogene in pigs and is missing in rodents. SLC22A11 is an expressed pseudogene in pigs and a non-expressed pseudogene in mouse. Lastly, SLC23A4, an intestinal nucleobase transporter, is a pseudogene in humans but is present in pig, cow and rodent genomes. Several porcine or artiodactyl-specific gene expansions are found in subfamilies (Additional file 3) including SLC7A3 (14 members), SLC7A13 (3 members) SLC22A6 (2 members), SLC22A10 (4 members) and SLC47A1 (2 members). The biological functions of these paralogs remain to be determined; however the parent genes are involved in amino acid (SLC7A3, SLC7A13) or dipeptide transport (SLC22A6) [48, 49].
The HGNC has assigned 51 genes to the ATP binding Cassette Superfamily, three are pseudogenes and 48 encode proteins (accessible as a group by entering ATP binding Cassette Superfamily in the Annotations field). These are organized into five subfamilies (A-G), about 20% are dedicated to nutrient (i.e., carotenoid, cholesterol and vitamin A) transport. The porcine ATP binding Cassette Family contains 57 members and all human subfamilies are represented. These include five that are missing from Ensembl build 10.2 and 18 that are duplicated gene artifacts. Five of these genes are present in other porcine genomes, but missing from Ensembl build 10.2, 21 are truncated, and 18 of these genes are duplicated gene artifacts, Eleven full-length mRNA sequences, assembled from macrophage RNA-Seq reads, have been deposited in Genbank and an additional 24 in silico constructs are provided.
An analysis of this superfamily revealed that ABCC11 has no murine ortholog and ABCA8 has no direct rodent ortholog as the gene has diverged into two paralogs, Abca8a and Abca8b. The ABCC4, a prostaglandin E2 transporter, has diverged from the parent gene into five paralogs (ABCC4L1, ABCC4L2, ABCC4L3, ABCC4L4 and ABCC4L5 (Additional file 3). ABCA10, involved in human macrophage cholesterol transport, is a pseudogene in rodents. It may be an expressed pseudogene in pigs as the predicted protein is half (787 amino acids) the size of human ABCA10 (1543 amino acids) and weak expression (by RNASeq) was detected in macrophages and moderated expression in intestine (H. Dawson, unpublished). ABCA17 is an expressed pseudogene is humans and pigs. Like the Solute Carrier Superfamily, most of the genes in the ATP binding Cassette Super family have not been characterized at the functional level. Nevertheless, the similarities and differences in the ATP binding Cassette and ATP binding Cassette Super families impact the suitability of rodents and pigs as models for human drug and nutrient transport and metabolism.
The exact number of microRNAs in the porcine genome is unknown. There are 4272 annotated microRNAs in the human genome (build 30). Although there are several papers describing the measurement of porcine microRNAs in various tissues or estimating the number in the porcine genome [54–57] and three partially overlapping sources of porcine microRNA sequences, the exact number of porcine microRNAs is currently unknown. There are only 382, 385 and 816 (non-redundant) annotated pig miRNA sequences in Mirbase, NCBI gene build, and Ensembl build 10.2, respectively. These three sources of information have a significant amount of overlap (Fig. 5a). We have consolidated this information and provide sequence data for our own predicted sequences based on conserved sequence identity to 1900 human, mouse or bovine sequences, to provide 1033 non-redundant porcine microRNA sequences (accessible as a group by entering MicroRNA in the Annotations field). Of note, all of the sequences found in Mirbase were found in the NCBI gene build, 59 of the microRNA sequences in Ensembl were found to be duplicated artifacts, and 214 of the 1033 sequences are not present in the current Ensembl gene build (10.2). This includes 81 that we have predicted based upon their presence in other species and other unfinished porcine genomes. We discovered the following species- or genera-specific microRNA; pigs (454), humans (199), primates (111) bovine (179), mouse (76) and rodents (20). Many of the porcine-specific microRNA have arisen from biological duplication/expansion (Additional file 3). A comparison of microRNA that are present in pigs and shared among at least one of the three other species (human, cows, and mice) revealed that 318 microRNA are shared among the four species, 107 are shared between pigs, humans and cows but not mice, and 34 are shared between pigs, mice and cows but not humans (Fig. 5b). Thus, the frequency of non-conserved microRNA preservation between human and pig is nearly three times that of mouse to pig.
The Porcine Translational Research Database is named because of its unique utility to translate findings made in rodents to pigs and from those in pigs to humans. A comprehensive literature-based survey was conducted to identify genes that have demonstrated function in humans, mice or pigs. The resulting data in the database is documented by >6000 references. The database currently contains 65 data fields for each entry. Our efforts to improve the genome and its annotation are similar to other efforts, for example the sequencing of 12,000 genes to supplement annotation of the pig genome [32, 33, 58] and de novo assembly of multiple pig genomes to reveal 1737 protein coding genes that are missing from Ensembl build 10.2. The online Supplemental data from the latter manuscript was unavailable at the time of the preparation of this manuscript so no comparison could be made. The manual assembly of >9700 RNA sequences has direct practical implications for genomics-based analysis. The state of the current genome build (mis-annotations, duplication artifacts, and missing sequences) effectively prohibits its use for aligning RNAseq reads. We have used these sequences to compare gene expression separately from Ensembl 10.2 and have also compared the number of reads obtained from the corresponding templates in Ensembl 10.2. For the great majority of transcripts compared, as expected, our full-length sequences provided a higher level of sensitivity than the corresponding Ensembl sequences (H. Dawson unpublished).
The full 5′ and 3′ representation of each gene will also allow for characterization of regulatory regions and miRNA target sites. In our estimation, >40% of transcripts in Ensembl or NCBI genomes do not represent the full-length gene. Our efforts will also allow for further consolidation of porcine Unigene numbers. Currently, each gene is represented by from 0 to >10 Unigene assignments, and >10% of genes have more than one.
It is significant that we discovered a large number of errors (about 30% of entries) in the publicly available sequence databases (these can be accessed by searching the “Notes Field” using the word “error” (Fig. 3)). In addition to the duplication artifacts, mis-annotations and missing genes, we also encountered a number of RNA sequences in publically available archives belonging to other species. For, example, human (AHR, AF233432.1), panda (IL2, NM_001199892.1) and rat (NUDT14, ESTs in Unigene Ssc.85635) RNA sequences are annotated as porcine derived. We also found sources of contaminating DNA from completely unrelated species. For example, about 1/5 of porcine chromosome 4 clone CU076066.6 is from Zebrafish. These sequences represent 6 Zebrafish genes (LOC100003615, LOC447815, LOC108179932, LOC108183883, LOC108183971, and LOC103910681) and are annotated as porcine genes by Ensembl build 10.2 (ENSSSCG00000006223) and NCBI genomes (LOC100739857). Similarly, several NCBI loci (ASNA1L*, LOC100737282, LOC100737202, LOC100620149, LOC100737282) and one Ensembl locus (ENSSSCG00000026988) are derived from contaminating Babesia bigemina genomic DNA.
We have discovered several sources of systematic errors in the Ensmbl and NCBI gene/protein prediction or annotation pipelines. For example all selenoproteins in Ensembl are truncated because the codon (UGA) for selenocysteine is mistranslated or translated as a stop codon. We and others have identified a systematic error in the identification of another gene family, the Taste receptor, type 2 (TAS2R) Superfamily. Despite being intronless and mostly devoid of 5′ and 3′ UTR regions, Ensembl consistently fails to recognize them as genes. These data illustrate the critical importance of the manual-curation process to reduce errors.
We believe that this is the largest manually curated database for any veterinary species and that the infomantics are unique among those targeting a veterinary species in regard to linking gene expression to gene function, identification of related gene pathways, and connectivity with other porcine gene databases, as well as for reagents that measure gene and protein expression. In addition, it is the largest source of centralized antibody information for the pig. Any database must be updated frequently in order to be useful. Currently the database is updated monthly and we anticipate expanding the content to include all porcine genes. There are several Super families of genes that will be the next targets of our efforts. One is the GPCR super family, the exact size of the GPCR super family is still unknown, but nearly 800 different human genes (or ~4% of the entire protein-coding genome) have been predicted to code for them. We will also continue to develop and annotate new assays. We intend to include our own prediction analysis for the promoter and 3′ UTR region of RNA for transcription factor and microRNA binding sites. Lastly, we intend to synchronize our database with the porcine “Snowball” array and porcine gene expression atlas.