Comparative genomics of hepatitis A virus, hepatitis C virus, and hepatitis E virus provides insights into the evolutionary history of Hepatovirus species


Several viral species can cause liver inflammation (hepatitis) in humans. Three of the more common hepatitis viruses contain a genome of positive‐strand ssRNA: hepatitis A virus (HAV, also known as Hepatovirus A, a Hepatovirus, member of Picornaviridae), hepatitis C virus (HCV, Hepacivirus C, a Flaviviridae member), and hepatitis E virus (HEV, Orthohepevirus A, of the Hepeviridae family). Other viral species causing human hepatitis can contain an ssRNA(‐) genome (e.g., hepatitis D virus) or are retro‐transcribing viruses (such as the hepatitis B virus). Despite their shared replication strategy, host specificity and tissue tropism, HAV, HCV, and HEV share very little sequence conservation, indicative of their independent evolutionary origins. We compared the genome sequences of a large number of these three viral species to evaluate commonalities and differences between and within the respective clades.

The genome of HAV is approximately 7.5 kb long and encodes a polyprotein that is processed into four structural and six nonstructural proteins by a proteinase (recently reviewed in McKnight & Lemon, 2018). Lacking the cap structure that is common in other RNA virus species, HAV initiates translation by means of a secondary structure formed by the 5′‐untranslated region of the RNA genome, which functions as an internal ribosome entry site (McKnight & Lemon, 2018; Vaughan et al., 2014). Of note is that the codon use of HAV is quite distinct from that of its host, a property that is also reflected by its low GC content (37%); as a consequence, this virus is slow to replicate in human cells. The virus is transmitted via the fecal–oral route, and the relatively high stability of unenveloped virus particles in the environment enables transmission via fecally contaminated food and water. Yearly, approximately 1.5 million clinical cases of HAV occur globally, although at least ten times as many new undocumented infections may occur, as suggested by serological evidence (reviewed in Vaughan et al., 2014). The infection is mostly self‐limiting and results in lifelong immunity.

In contrast, infection with HCV is often chronic, and this parenterally transmitted virus is one of the leading causes of chronic liver disease. It is responsible for approximately 180 million infections worldwide, with 3 million new infections occurring annually (Preciado et al., 2014). The virus can be transmitted via unsafe medical practices including blood transfusions and needle reuse. As a result, developing countries present higher incidences than developed countries, with vast differences between countries (reviewed in Ansaldi, Orsi, Sticchi, Bruzzone, & Icardi, 2014). The genome of HCV is about 9.6 kb, has a GC content of 56% and codes for a polypeptide that after processing results in 10 mature proteins.

Hepatitis E virus is the most recently discovered of the three species considered here, but it actually is the most common cause of acute viral hepatitis in humans (Kamar et al., 2017), with an estimated 20 million novel infections worldwide, of which 3 million are symptomatic and around 70,000 are lethal; these may be underestimates, even in developed countries (Webb & Dalton, 2019). Hepatitis E virus is spread via the fecal–oral route as well as by animal contact or via contaminated food of animal origin; parenteral transmission has also been described. Most infections are self‐limiting. The genome of HEV is 7.2 kb with a GC content of 56%, and the 5′‐untranslated region is capped. The ssRNA(+) genome encodes a large polypeptide, for which it is uncertain whether it is active as such or is first processed into separate proteins with distinct functions, and 2 shorter proteins, translated from partly overlapping open reading frames (recently reviewed in Primadharsini, Nagashima, & Okamoto, 2019).

All three viral species are subdivided into genotypes and subtypes therein, based on hypervariable regions of their genomes. For HAV, three genotypes that infect humans (I, II and III, each with subtypes A and B) are recognized, and these belong to a single serotype. Three more genotypes are specific to the simian host (Costa‐Mattioli et al., 2003), while a previously described genotype VII is reclassified as IIB. A substitution rate of 9.76 × 10^-4 substitutions per site per year (ssy) was calculated, based on complete VP1 sequences from French genotype IA isolates (Moratorio et al., 2007), but this may be an overestimate, as an analysis of complete genome sequences from multiple countries produced an estimate of 1.00 × 10^-4 ssy (Kulkarni, Walimbe, Cherian, & Arankalle, 2009). Thus, the rate is likely to be around one substitution per ten thousand sites per year, or roughly one or two substitutions per viral genome per year. This reflects the slow evolutionary rate of this virus. A last common ancestor of human HAV and the simian genotypes was estimated to have existed between 1,250 and 3,500 years ago (Kulkarni et al., 2009). However, human HAV is presumed to have originated from a rodent virus as a result of a host jump (Drexler et al., 2015).
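The conversion from a per‐site rate to a per‐genome rate used above is a simple multiplication by genome length. The following minimal sketch makes that arithmetic explicit; the genome length of 7,500 nt is the approximate HAV genome size quoted earlier, and the function name is purely illustrative.

```python
# Expected substitutions per genome per year = rate (per site per year) x genome length.
HAV_GENOME_LENGTH = 7500  # approximate HAV genome size in nucleotides (about 7.5 kb)


def substitutions_per_genome_per_year(rate_per_site: float, genome_length: int) -> float:
    """Scale a per-site, per-year substitution rate to the whole genome."""
    return rate_per_site * genome_length


# Whole-genome estimate (Kulkarni et al., 2009) and VP1-based estimate (Moratorio et al., 2007).
low = substitutions_per_genome_per_year(1.00e-4, HAV_GENOME_LENGTH)
high = substitutions_per_genome_per_year(9.76e-4, HAV_GENOME_LENGTH)
print(f"{low:.2f} to {high:.2f} substitutions per genome per year")
```

With the whole‐genome estimate this yields just under one substitution per genome per year, whereas the VP1‐based estimate, considered a likely overestimate, would correspond to about seven.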

The genomic variation of HCV is much more extensive than that of HAV. HCV is subdivided into at least 7 major genotypes, with multiple subtypes therein (Simmonds et al., 2005; Smith et al., 2014). Genetic diversity between the HCV genotypes is about 30%, while subtypes within a given genotype differ by 15%–25% (Hartlage, Cullen, & Kapoor, 2016; Preciado et al., 2014). The different genotypes roughly coincide with geographical distribution, with genotypes 1, 2, and 3 being globally detected; genotypes 4 and 5 are more prevalent in Africa and the Middle East, and genotype 6 is found in Southeast Asia, as is reviewed elsewhere (Ansaldi et al., 2014). The RNA‐dependent RNA polymerase of HCV lacks proof‐reading activity, resulting in a high mutational rate of 10^-5 to 10^-4 substitutions per nucleotide per replication cycle (Duffy, Shackelton, & Holmes, 2008), thus producing a heterogeneous quasi‐species population within infected individuals. Estimates for individual nucleotides produced a substitution rate of between 1.40 and 1.72 × 10^-3 ssy (Takahashi et al., 2004), which is 10 times higher than that of HAV. Hepacivirus species have now also been isolated from dogs, horses, and other mammals, which raises the possibility that HCV originated from a nonprimate host. In particular, a virus replicating in the equine host might have either made a natural host jump or it may have been aided by medical practices with horse‐derived products (Hartlage et al., 2016).

There are currently 8 recognized genotypes within HEV (Smith et al., 2016), of which genotypes 1 and 2 are exclusively found in humans, while genotypes 3 and 4 are shared between humans and other mammalian hosts, in particular pig/wild boar and rabbits. Genotypes that have not been described in humans but are isolated from other mammals (e.g., camels) are not considered here. The substitution rate of HEV was estimated between 3 and 5 × 10^-3 ssy with differences observed between genotypes (Brayne, Dearlove, Lester, Kosakovsky Pond, & Frost, 2017).

Here, we compare the genotype groupings of all three viral species utilizing several thousand genome sequences downloaded from public databases. The phylogenetic analysis was restricted to human isolates, except for HEV for which pig/wild boar and rabbit isolates were included. Their codon usage and amino acid frequencies were compared, and viral genomes of other species were then included to provide insights toward the possible evolutionary origin of HAV.

Hepatitis virus datasets

In March 2019, over 5,000 viral genomes of HAV, HCV, and HEV were downloaded from GenBank to assess their inter‐ and intraspecies relationships. The size of downloaded sequences was restricted to 7,000–7,900 bp for HAV, 9,000–9,990 bp for HCV, and 7,000–7,800 bp for HEV. Animal isolates were excluded, except for swine/wild boar/rabbit HEV isolates. Sequences containing stretches of more than two ambiguous nucleotides were removed. Finally, redundant sequences were removed. A provisional phylogenetic tree was constructed (see below), and exceptionally long branches were checked in detail; these were without exception isolates of animal origin, which were subsequently removed. The final datasets contained 134 HAV genomes, 2,542 HCV genomes, and 557 HEV genomes.
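The filtering steps described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: "ambiguous" is taken to mean any symbol other than A, C, G, T or U, a disqualifying stretch is taken to be three or more such symbols in a row, and redundancy is taken to mean identical sequences. The function name and data layout are hypothetical.

```python
import re

# Length windows (bp) as stated in the text.
LENGTH_RANGE = {"HAV": (7000, 7900), "HCV": (9000, 9990), "HEV": (7000, 7800)}

# Assumption: a run of three or more non-ACGTU symbols counts as an ambiguous stretch.
AMBIGUOUS_RUN = re.compile(r"[^ACGTU]{3,}")


def filter_genomes(records, species):
    """Filter (identifier, sequence) pairs by length, ambiguity and redundancy."""
    shortest, longest = LENGTH_RANGE[species]
    seen = set()
    kept = []
    for identifier, seq in records:
        seq = seq.upper()
        if not shortest <= len(seq) <= longest:
            continue  # outside the size window for this species
        if AMBIGUOUS_RUN.search(seq):
            continue  # contains a stretch of ambiguous nucleotides
        if seq in seen:
            continue  # assumption: redundancy means an identical sequence
        seen.add(seq)
        kept.append((identifier, seq))
    return kept
```

Host‐based exclusion (animal isolates) and the manual inspection of long branches are not covered by this sketch, as they depend on annotation and on the provisional tree.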

Phylogenetic analysis

The genomes were aligned by MAFFT (Yamada, Tomii, & Katoh, 2016), and FastTree was used to build phylogenetic maximum‐likelihood (ML) trees (Price, Dehal, & Arkin, 2009). This infers approximately maximum‐likelihood phylogenetic trees and is much faster than other algorithms; we used the generalized time‐reversible (GTR) model of nucleotide evolution and the Shimodaira‐Hasegawa test for statistical confidence of internal nodes. Information on genotypes and subtypes that were included in GenBank annotations was used to map these on the trees. For visual representation, the HCV and HEV trees are shown after collapsing branches at 90% identity.
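As an illustration of this workflow, the sketch below assembles MAFFT and FastTree command lines from Python. The options actually used are not stated in the text: `--auto` is an assumption for MAFFT, `-nt -gtr` selects the GTR nucleotide model in FastTree, and the executable names may differ per installation. FastTree reports Shimodaira‐Hasegawa‐like local support values by default. File names are hypothetical.

```python
import subprocess


def mafft_command(fasta_in):
    """Command line for a MAFFT alignment; --auto is an assumed setting."""
    return ["mafft", "--auto", fasta_in]


def fasttree_command(alignment):
    """Command line for FastTree on a nucleotide alignment with the GTR model."""
    return ["FastTree", "-nt", "-gtr", alignment]


def build_tree(fasta_in, alignment_out, tree_out):
    """Run both tools in sequence; each writes its result to standard output."""
    with open(alignment_out, "w") as handle:
        subprocess.run(mafft_command(fasta_in), stdout=handle, check=True)
    with open(tree_out, "w") as handle:
        subprocess.run(fasttree_command(alignment_out), stdout=handle, check=True)
```

A call such as `build_tree("hav.fasta", "hav.aln", "hav.nwk")` would require both programs to be installed and on the search path.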

Codon usage analysis

Codon usage tables were calculated with the Codon Usage Calculator for representative genomes of each genotype per species; as there were only minor differences between genotypes within a species, the results were averaged and plotted in net plots with Excel. The codons of overlapping coding regions in HEV were first removed for this analysis, and their effect was assessed by a separate analysis where they were added in both frames, which did not affect the overall results. For these analyses, the open reading frames were extracted from the following genomes: For 6 genomes of HAV: AB623053.1 (genotype IA), M14707.1 (IB), AY644676.1 (IIA), AY644670.1 (IIB), FJ360731.1 (IIIA), and AB300205.1 (IIIB); for 6 genomes of HCV: EU781811.1 (1a), KF676352.1 (2a), KY620493.1 (3a), DQ418788.1 (4a), KJ925147.1 (5a), and DQ480522.1 (6a); for 9 genomes of HEV: LC225387.1/human (1a), MH809516.1/human (2a), KX462160.1/human, MF444099.1/human, EU375463.1/swine, MH184584.1/swine, JX565469.1/rabbit (all 3a), HQ634346.1/human, and DQ279091.1/swine (both 4a). Other virus species included for comparison are listed in Table 1. In the table and throughout the text, Uracil (present in RNA) is written as Thymine (T) as this is the nucleotide used in genome data. If all four nucleotides are evenly distributed, 25% would be expected for each. Overrepresented nucleotides (more than 30%) are shaded green, and underrepresented nucleotides (less than 20%) are shaded gray in the table.
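The nucleotide composition thresholds used for shading can be expressed in a few lines. The sketch below is only an illustration of the stated cut‐offs (more than 30% and less than 20%); the function names are hypothetical, and characters other than A, C, G, T and U are simply ignored, which is an assumption.

```python
from collections import Counter


def base_composition(seq):
    """Percentage of A, C, G and T in a sequence; U is counted as T."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def flag_bases(composition, high=30.0, low=20.0):
    """Label each base as over- or underrepresented, following the table shading."""
    labels = {}
    for base, pct in composition.items():
        if pct > high:
            labels[base] = "over"
        elif pct < low:
            labels[base] = "under"
        else:
            labels[base] = "neutral"
    return labels


def gc_content(composition):
    """Combined percentage of G and C."""
    return composition["G"] + composition["C"]
```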

For comparisons of amino acid composition between polypeptides, the polypeptides and complete open reading frames of the virus species listed above were compared and the analysis was extended to virus species of other families as listed in Table 1.

The effective number of codons (ENC) was calculated using DAMBE (Xia, 2013).


The three viral species HAV, HCV, and HEV are not phylogenetically related. Separate phylogenetic trees were produced for the three viral species, with 134 HAV genomes, 2,542 HCV genomes, and 557 HEV genomes, as shown in Figure 1. All trees were drawn to scale, to illustrate the much lower genomic diversity of HAV compared with the other two species. The trees for HCV and HEV were collapsed at 90% identity for graphical representation. The HAV branches remained un‐collapsed, since the shown branches per genotype all have a similarity >90%. The genomic diversity of HEV is higher than that of HAV but lower than that of HCV, as illustrated by the branch lengths of the trees. The produced HAV tree is in good agreement with previously published data (Vaughan et al., 2014), although we did not include simian genotypes IV to VI. Our HCV tree that was based on 2,542 genomes is also mostly in agreement with previous publications that used smaller datasets. A neighbor joining (NJ) tree based on 162 genome sequences (Jackowiak et al., 2014) already identified a relationship between genotype 1 (Gt1) and Gt4, which is also visible in Figure 1; these two genotypes have evolved later than the other genotypes (Preciado et al., 2014). The relationship between Gt2 and Gt7 was also noted before. However, our data produced a better resolution of genotypes than the NJ tree published by Jackowiak and colleagues. An ML tree based on 129 genome sequences (Smith et al., 2014) also did not fully resolve the branch of Gt1 and Gt4 with respect to the other genotypes. It is unclear on what data the tree shown in a review article (Preciado et al., 2014) was based, in which Gt1 and Gt4 were separated. In our tree, the branch leading to Gt1 and Gt4 is placed between Gt3 and Gt6/Gt8.
Upon closer investigation of HCV genomes for which a subtype was specified in their GenBank annotation, only a few annotations did not match the current nomenclature, all of which were given a genotype before standardization of the nomenclature (Tokita et al., 1996). This illustrates that historical GenBank annotations must be interpreted with caution. The six genotypes of HEV are also well resolved in Figure 1, but the subtypes within these genotypes are not. Notably, one collapsed branch contained members of subtypes 1a, 1b, 1c, 1d, and 1f, whose close distances have been noted before (Smith et al., 2016). On the other hand, the genetic diversity within Gt3 is extensive, even if rabbit isolates are ignored (Figure 1). Based on their overall genomic similarity, members of Gt3 could be considered to belong to multiple genotypes that could be newly defined. This was observed by Smith and colleagues as well, but they decided to keep the nomenclature of genotypes 1–4 as proposed by Lu, Li, and Hagedorn (2006), since this was already well established. As a result, there is no consistent degree of similarity within the different genotypes and their subtypes for HEV, in contrast to the more transparent nomenclature of HCV.

The codon usage of the three viral species was next analyzed. Typically, codon usage is expressed as relative synonymous codon usage (RSCU), which calculates the over‐ or underabundance of specific codons relative to their expected frequencies based on the nucleotide composition of an open reading frame (ORF). For instance, codon use of HEV was analyzed by RSCU (Hu et al., 2011). Appendix Figure A1 compares the RSCU values of the three viral species, plotted in a wheel plot. This identified differences in codon usage between HAV on the one hand and HCV/HEV on the other hand. The codon usage of HAV is suboptimal for replication in the human host, while the codon usage of HCV is highly adapted to that of human cells, as has been described before (Pintó, Aragonès, Costafreda, Ribes, & Bosch, 2007). RSCU values are an excellent means to compare codon usage of individual genes within a given (prokaryotic) genome, as cells typically control gene expression by minor codon preferences (Sharp & Li, 1986). However, when comparing virus proteomes with large differences in codon usage, we consider it useful to look at this usage without correcting for nucleotide composition differences, as a dependence exists between nucleotide composition and codon usage preferences. Thus, we calculated codon usage as the fraction of used codons per given amino acid. Without a correction for nucleotide composition, the differences between HAV and HCV/HEV are amplified, and now the values can be compared with the overall codon usage of human cells (Figure 2). As can be seen, two trends describe the deoptimized codon use of HAV: (a) the virus strongly prefers codons with T over C at the third (wobble) position, while for human cells it is the other way round; this is clearly visible for Cys, Asp, Phe, Asn, and Tyr (Figure 2c).
These are all amino acids for which only two codons are available, but the same third‐base preference for T can also be seen for Ala and Pro; (b) A weaker preference for codons having A at the third position is visible for Lys, Gly, and Arg. The deoptimized codon usage is still evident when only those amino acids are considered that occur at a high frequency (>4.5%) in the polypeptide of HAV (Figure 2d). That HAV strongly prefers T at the third position may be partly responsible for its high T‐content (32.8% on average, Table 1), but this is not a general rule. For instance, rabies virus (a negative strand ssRNA virus) prefers codons ending in G while its genome contains only 22.7% G (Zhang et al., 2018).
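To make the two codon usage measures concrete, the sketch below computes both the fraction of each codon per amino acid and a simple RSCU value from an ORF. It assumes the standard genetic code and uses the common formulation in which RSCU is the observed codon count divided by the count expected if all synonymous codons for that amino acid were used equally; this may differ in detail from the calculations behind the figures. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(orf):
    """Count in-frame codons; codons containing ambiguous bases are skipped."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3
    codons = (orf[i:i + 3] for i in range(0, usable, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def codon_fractions_and_rscu(orf):
    """Return (fraction per amino acid, RSCU) for every codon of the amino acids present."""
    counts = codon_counts(orf)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    fractions, rscu = {}, {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this ORF
        for c in codons:
            fractions[c] = counts[c] / total
            rscu[c] = fractions[c] * len(codons)
    return fractions, rscu
```

For a two‐codon amino acid such as Phe, a fraction of 0.75 for TTT corresponds to an RSCU of 1.5, while equal use of both codons gives an RSCU of 1.0.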

Various explanations have been proposed for the observed codon usage of HAV. Vaughan and coworkers proposed that it slows down translation of proteins, resulting in better competition for loaded tRNAs during translation of virus proteins against that of host proteins (Vaughan et al., 2014). However, most variation is in the third‐base wobble, in which case there would be little selection for less commonly used tRNAs and amino‐acyltransferases. Nevertheless, nonpreferred third bases can slow down the translation machinery, which presumably allows better protein folding of the capsid protein (Pintó et al., 2018). Costafreda and coworkers have shown that selection for deoptimized HAV codons was related to transcription efficiency, antigenicity of capsid protein, plaque size, and survival rates of virions (Costafreda et al., 2014). However, it is hard to envisage how this situation might have evolved when an ancestor of HAV had a codon usage that was better adapted to the mammalian host. In general, the direction of virus evolution would be toward more efficient, not toward less efficient translation and replication in a given host, as it would result in more (or more rapid) virion production. Moreover, if the selective pressure would mostly apply to optimal folding of the capsid protein, only that coding region of the genome would depend on using deoptimized codons, but the virus proteome is consistently using codons that human cells do not prefer, over its complete ORF length. This is shown in Appendix Figure A2. We consider the most likely explanation for the current codon usage of HAV that it is a remnant of an ancestor virus that replicated in a host with a codon usage preference different to that of humans.

The most likely direct ancestor of HAV was a virus replicating in rodents, although simian HAV is more closely related to human HAV than rodent hepatoviruses are (Drexler et al., 2015). The codon usage in simian HAV (genotype V) strongly resembles that of human HAV, as does that of rodent hepatoviral species (Appendix Figure A3 panel a). Since Drexler and colleagues had proposed that rodent HAV species may have originated from an ancestor replicating in bats, we further assessed the codon usage of hepatovirus from bat species. Two of the currently 7 available genome sequences of bat hepatovirus were selected, one from a fruit‐eating bat and one from an insectivore species (Table 1). The latter (African sheath‐tailed bat) is widespread in Africa and prefers a diet of beetles and lepidopterans (McWilliam, 1987). Both investigated bat hepatovirus species had a codon usage extremely similar to that of the other analyzed hepatovirus species (Figure A3 panel a). Thus, possibly the selection mechanism proposed for HAV that resulted in de‐optimized codon use for its human host also applies to these other hepatoviruses that replicate in other mammalian hosts.

Alternatively, a possible ancestor with a codon usage that was adapted to an alternative host must be sought in a more distant evolutionary history. If such an ancestor virus once existed, it more likely replicated in a host with a GC content much lower than that of mammals, as the overall codon usage of mammalian cells does not vary much between species. Instead, codon usage in mammals is primarily governed by within‐genome variation in GC content, and only weakly, if at all, correlates to gene expression and tRNA content (Galtier et al., 2018).

We first assessed the possibility that a putative ancestor of HAV and other hepatoviruses was a virus propagating in bacteria that had a GC content in the range of the HAV genome. If a putative bacteriophage was the ancestor of HAV and other hepatoviral species, a host and kingdom jump would most likely have taken place in the gut. Therefore, to test this hypothesis, we compared the codon usage of three species of bacteria that are abundant in a mammalian gut and have a GC content around 37% (the current base G + C composition of HAV), for which we chose Acinetobacter baumannii, a member of the Gram‐negative class Gammaproteobacteria (Whitman et al., 2018); Enterococcus faecium, a Gram‐positive Firmicute in the class Bacilli; and Prevotella oralis, a Gram‐negative member of the class Bacteroidia. In Appendix Figure A4, it is shown that the best match in codon use between HAV and these bacterial species was found for E. faecium.

Bacteriophages with an ssRNA (+) genome have been described, for instance bacteriophage MS2, which infects Escherichia coli and other members of Enterobacteriaceae. It is an icosahedral virus, just like HAV is. Another example is phage AP205 that propagates in Acinetobacter species (Klovins, Overbeek, Worm, Ackermann, & Duin, 2002). Single‐strand RNA bacteriophages are typically members of the Leviviridae family (Olsthoorn & van Duin, 2011) that bear no sequence resemblance to HAV. So far, a bacteriophage with structural or sequence similarity to HAV has not been described, but it should be noted that RNA phages have not been extensively studied or described, and this type of bacteriophages suffers from underreporting (Callanan et al., 2018).

Another possibility of an ancestral virus for HAV was assessed based on the reported structural similarity between HAV and insect viruses that are members of Dicistroviridae (also Picornavirales; Wang et al., 2015). In particular, Wang and colleagues observed structural similarity between HAV and triatoma virus that replicates in triatomines (kissing bugs, Czibener, Torre, Muscio, Ugalde, & Scodeller, 2000) and with cricket paralysis virus (CrPV) that propagates in cricket species endemic to Australia (Wilson, Powell, Hoover, & Sarnow, 2000). When we compared codon usage of HAV to that of these two viral species, a striking similarity was observed. In particular, the codon use of triatoma virus is highly similar to that of HAV (Figure 3) and that similarity is higher than that of HAV to CrPV or to the tested potential bacterial hosts. Figure 3c further demonstrates that the codon usage of triatoma virus is well adapted to its natural insect host, Triatoma infestans.

We next tested if the similarity in codon usage is restricted to HAV, its direct cousins, possible ancestors, and the two insect virus species. That was not the case, as another Dicistroviridae member, Israel acute paralysis virus, showed the same pattern. Even an insect virus not belonging to Dicistroviridae, varroa destructor virus (Iflaviridae, replicating in the varroa mite that is parasitic to bees) produced a very similar codon usage plot (Figure 3d). We then extended the comparison to other ss(+)RNA virus families and identified an equally strong similarity to human coronavirus, a Nidovirales member. Coronaviruses are not known to replicate in insects, but it has been proposed that they might have an insect virus as their ancestor (Nga et al., 2011; Zirkel et al., 2011).

This is not to say that HCV and HEV are exceptional with respect to their codon usage. Other Picornaviridae members such as human cosavirus or rhinovirus have codon preferences that more resemble HCV and HEV than HAV (Appendix Figure A3 panel b), although between these species slightly more variation is observed for single codons than we observe for HAV and the virus species shown in Figure 3d. These two Picornaviridae illustrate that the observed distinction in codon usage does not follow taxonomic divisions, as they do not group with HAV and other Picornavirales. Norovirus (a Caliciviridae member) also matches the HCV/HEV codon usage pattern. Human pegivirus (HPgV), which is also known by the alternative names hepatitis G virus or GB virus C, is also included in this comparison. The virus rarely infects hepatocytes and its role in human disease is still being discussed (Marano et al., 2017). Its codon usage also resembles that of HCV and HEV, although it is a member of the Flaviviridae (Table 1). Zika virus and West Nile virus (also Flaviviridae) are transmitted by mosquitoes; however, they do not have an insect‐like signature as their codon usage is also similar to that of HCV and HEV (Figure A3).

The effective number of codons (ENC) was also calculated, using the method by Xia (Xia, 2013) which is an improved version of the original method by Wright (Wright, 1990). A theoretical proteome under maximal codon bias that would only use a single codon for each of the 20 amino acids would result in an ENC score of 20, while a proteome using all 61 possible codons free of any bias would score 61. The values obtained are shown in Appendix Table A1; HAV had a score of 46.9, HEV scored 51.9, and HCV scored 54.0. The insect virus species varied from 45.0 (triatoma virus) to 52.6 (cricket paralysis virus). The highest score was reported for norovirus (58.3) and the lowest for rodent hepatovirus (43.1). These results support the view that a codon bias exists for HAV, but it seems an even stronger bias exists for rodent hepatovirus and for triatoma virus, although codon usage of the latter is well adapted to its host.
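The two boundary values of 20 and 61 follow directly from the original formulation by Wright (1990), which can be sketched as below. Note that this is the original estimator, not the improved version by Xia (2013) as implemented in DAMBE that was used for the reported values; the function names are hypothetical, and the averaging of F over amino acids within each degeneracy class is left to the caller.

```python
def homozygosity(counts):
    """Wright's F for one amino acid, from the counts of its synonymous codons."""
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two codons are needed for this amino acid")
    sum_p_squared = sum((c / n) ** 2 for c in counts)
    return (n * sum_p_squared - 1) / (n - 1)


def effective_number_of_codons(f2, f3, f4, f6):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6.

    f2, f3, f4 and f6 are the average F values of the amino acids encoded by
    two, three, four and six codons; the constant 2 accounts for Met and Trp.
    """
    return 2 + 9 / f2 + 1 / f3 + 5 / f4 + 3 / f6


# Maximal bias: one codon per amino acid, so F = 1 in every class.
max_bias = effective_number_of_codons(1, 1, 1, 1)
# No bias: all synonymous codons used equally, so F = 1/k for k-fold degenerate amino acids.
no_bias = effective_number_of_codons(1 / 2, 1 / 3, 1 / 4, 1 / 6)
```

Here `max_bias` evaluates to 20 and `no_bias` to 61 (up to floating‐point rounding), matching the theoretical limits given above.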

Since codon preference is related to nucleotide composition, this parameter was next compared for all coding regions of the virus species included in the comparison. Nucleotide composition is normally expressed as %GC, but for single‐strand genomes, the contribution of individual bases was also assessed (Figure 4). Either analysis clearly divided the various virus species into two groups, with HAV, the insect viruses and coronavirus, rhinovirus, and cosavirus being more AT rich and relatively low in C (panel 4A), while Zika virus, WNV, norovirus, and HPgV grouped with HCV and HEV and were all rich in G and C (panel 4B). Simian, rodent, and bat hepatovirus were very similar to HAV. These were not included in panel 4A for clarity, but their similarity in base composition can be seen in Table 1. In terms of %GC versus %AT, rhinovirus and cosavirus were closer to HAV than they were to HCV/HEV (panel 4C) although in terms of their codon usage preference, they clearly did not belong to the HAV group. This shows that the codon usage findings only partly correlate with nucleotide composition.

We next assessed if preference for a certain base at a certain position in the codon correlated with nucleotide composition. For this, the frequency at which a base was found at the first, second, and third position was plotted against the percentage of that base in the complete proteome‐coding region of the genome, for each of the analyzed virus species (Figure 5). Any deviation from the x = y diagonal identifies an under‐ or overabundance of that nucleotide at that position. A consistent overrepresentation of G at the first position was observed in all analyzed virus species. An underrepresentation of G at the second, and of T at the first position was also consistently observed. The striking preference of HAV for T at the third position is clearly visible, as is the avoidance of C at that same position. That HAV also has a notable preference for C at the second position had not been apparent from the wheel plot of Figure 2. In none of these analyses did HAV behave differently from some or all of the other analyzed genomes.
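The comparison of position‐specific base frequencies with overall composition can be sketched as follows. This is a minimal illustration that assumes an in‐frame coding sequence without ambiguous bases; the function names are hypothetical. A positive value returned by `deviation_from_diagonal` corresponds to a point above the x = y diagonal (overabundance at that codon position), a negative value to a point below it.

```python
def positional_base_frequencies(orf):
    """Return (overall, by_position) base percentages for a coding sequence."""
    orf = orf.upper().replace("U", "T")
    orf = orf[: len(orf) - len(orf) % 3]  # drop an incomplete trailing codon
    overall = {b: 100.0 * orf.count(b) / len(orf) for b in "ACGT"}
    by_position = []
    for pos in range(3):
        column = orf[pos::3]  # every first, second or third codon base
        by_position.append({b: 100.0 * column.count(b) / len(column) for b in "ACGT"})
    return overall, by_position


def deviation_from_diagonal(orf):
    """Positional frequency minus overall frequency, per codon position and base."""
    overall, by_position = positional_base_frequencies(orf)
    return [{b: freq[b] - overall[b] for b in "ACGT"} for freq in by_position]
```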

Finally, because of the lack of sequence similarity between the various virus species, the amino acid composition of their proteomes was compared, as this is even less dependent on nucleotide composition than codon usage preference is. Again, this comparison segregated the analyzed virus proteomes into two groups, one containing virus species with an amino acid frequency more HAV‐like and the other group more resembling HCV and HEV (Figure 6). The insect virus proteomes of triatoma virus, IAPV, CrPV, and varroa destructor virus have amino acid frequencies similar to that of HAV. In addition, cosavirus, norovirus and rhinovirus have amino acid frequencies resembling HAV, while their codon usage is more similar to that of HCV/HEV and their nucleotide composition is less rich in A and T. Zika virus, HPgV, and WNV more resembled HCV and HEV in terms of amino acid composition (Figure 6 panel b).
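An amino acid frequency profile of the kind compared here can be derived from an ORF as sketched below, assuming the standard genetic code. The summed absolute difference used to compare two profiles is a hypothetical measure added for illustration only; the comparison in Figure 6 is graphical and no such distance is reported.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(orf):
    """Translate an in-frame ORF; stop codons and ambiguous codons are skipped."""
    orf = orf.upper().replace("U", "T")
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3])
        if aa is None or aa == "*":
            continue
        protein.append(aa)
    return "".join(protein)


def amino_acid_frequencies(protein):
    """Percentage of each amino acid present in a protein sequence."""
    counts = Counter(protein)
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def composition_distance(freq_a, freq_b):
    """Illustrative measure: summed absolute difference in percentage points."""
    keys = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)
```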

In summary, HAV shares a high AT‐content in its coding regions with positive ssRNA invertebrate virus species (triatoma virus, CrPV, IAPV, and varroa destructor virus) and with coronavirus, as shown in Figure 4. These virus species also display a conserved codon preference (Figure 3d) and share a similarity in amino acid frequency for their total proteome (Figure 6a). HCV and HEV form a separate group together with Zika virus, HPgV, and WNV in terms of their amino acid frequencies (Figure 6b). The proteome constituents of rhinovirus, cosavirus, and norovirus are more like HAV than HEV/HCV (Figure 6a), while their codon usage resembles that of HEV/HCV (Figure A3 panel b). These findings indicate that codon usage preference can vary between viruses with similar amino acid frequency, and these parameters are not completely dictated by nucleotide composition.

In all analyses presented here, HCV and HEV group together, although these virus species do not share sequence similarity and are not classified in the same taxonomic families (Table 1). A structural similarity between HEV and Caliciviridae (to which norovirus belongs) has been noted before (Bradley, 1990), but it is apparent that HEV also bears resemblance to Flaviviridae, as exemplified not only by HCV but also by the other Flaviviridae included here (HPgV, Zika virus and WNV).

A full explanation for the observations regarding HAV cannot be given, but it opens the intriguing possibility that HAV, its close relatives simian, rodent, and bat hepatovirus, and the insect virus species analyzed here are somehow related, and might even have shared a common ancestor. That ancestor might have been an insect virus that underwent a host jump to bats, after which it was passed on to rodents and eventually simians and humans. The jump from insect to bat may have occurred in the blood (in case a blood‐sucking insect was the source) or in the gut of insectivorous bats. A candidate for this putative common ancestor has not been identified, as no invertebrate virus is yet described with sequence similarity to HAV, but its existence can be hypothesized. In contrast, HEV and HCV seem to have a common ancestor not related to that of HAV and form a different group of (human) ssRNA(+) virus species that includes Zika virus, HPgV, and WNV, while a striking resemblance between HEV and HCV for all analyzed parameters is observed. An alternative explanation for the observed similarities is that the various virus species found to share the identified features with virus species of different taxonomic families have undergone parallel evolution that drove these species toward identical amino acid frequencies and conserved codon use, even if (as in the case of human vs. insect viruses) their hosts have alternative codon preferences. We consider that second possibility less likely.


Although no sequence similarity is detected between the various virus species compared here, in combination the presented data make it plausible that an ancestor virus of HAV and other hepatoviral species was an insect virus, with a codon use adapted to that host, whose signature is still visible in the current HAV genome. We consider it possible that a blood‐sucking insect such as a triatomine, which feeds on mammals, may have been the source for a virus crossing host species. Alternatively, a host jump may have taken place in the gut of insectivorous bats. More speculative is the possibility that in a long evolutionary past all these virus species may have originated from bacteriophages that propagated in Gram‐negative AT‐rich bacteria, with which they still share codon preference and amino acid frequency.


None declared.


TMW designed the study, produced Figures 2, 3, 4, 5, 6, wrote the first draft of the manuscript, and interpreted the data; S‐RJ produced and curated the required datasets and produced Figure 1; MR advised on software tools for all figures, interpreted the data, and edited the manuscript; DWU ensured funding, advised on all figures, interpreted the data, and edited the manuscript.


None required.