It is well known that the 64 codons of the genetic code encode the 20 standard amino acids as well as three translation termination signals (UAA, UAG, UGA). Each amino acid is encoded by at least one codon (e.g., Met and Trp); however, owing to the degeneracy of the genetic code, some amino acids are encoded by up to six codons (e.g., Leu, Ser and Arg). Codons encoding the same amino acid are referred to as synonymous codons. Studies have indicated that synonymous codon usage is non-random and species-specific. Some synonymous codons are more frequent than others both within and between genes, and this phenomenon is termed synonymous codon usage bias. In general, genome dynamics, primarily mutation pressure, facilitate the evolution of novel viruses and strains and contribute to adaptation to environment and host. Hence, codon usage variation is considered to be an indicator of the type of force that influences genome evolution. Investigation of codon bias and the forces that influence it provides insights into the fundamental mechanisms of viral evolution. Thus, understanding codon bias is essential for understanding the interplay between a virus and its host.
It is well established that mutational pressure and natural selection are the two major factors accounting for codon usage variation in mammalian, protozoan and endosymbiotic bacterial genes. In their investigation of codon usage variation, Shackelton et al. (2006) found that codon usage bias was strongly correlated with overall genomic GC content, indicating that compositional constraint under mutation pressure, rather than natural selection, was the main factor determining the use of specific codons. Naya et al. (2001) examined the Chlamydomonas reinhardtii genome, which has a high GC content, and found no evidence that base constraint under mutation pressure was responsible for determining the codon usage pattern. Recently, it was also reported that codon usage variation is related to gene function and length, DNA replication and selective transcription, protein secondary structure and environmental factors.
Torque teno virus (TTV) is a small, non-enveloped virus with a single-stranded, negative-sense, circular DNA genome, and it has been classified as a member of the recently established Anelloviridae family. It was first identified in 1997 in a Japanese patient with post-transfusion hepatitis of unknown aetiology. Subsequently, TTV has been detected in humans, chimpanzees, poultry, swine, cattle, sheep, cats and dogs. TTV was first detected in swine in 1999, and two genetically distinct species, Torque teno sus virus 1 (TTSuV1) and 2 (TTSuV2), have been identified based on the low sequence identity between the two variants.
Recently, Torque teno sus virus (TTSuV) infection of pigs has become widespread in many countries, including the USA, Canada, Spain, Germany, China, Japan, Korea and Brazil. Although TTV infection in humans has not yet been directly associated with any disease, TTSuVs have been shown to be involved in co-infection with other pathogens, including the experimental induction of porcine dermatitis and nephropathy syndrome in combination with porcine reproductive and respiratory syndrome virus infection, and of post-weaning multisystemic wasting syndrome (PMWS) in combination with porcine circovirus type 2 (PCV2) infection in a gnotobiotic pig model. Moreover, Kekarainen et al. (2006) found that TTSuV2 was detected at a significantly higher rate in PMWS pigs than in healthy pigs. Other research confirmed that the replication of TTSuV2, but not of TTSuV1, was up-regulated in pigs with PMWS. This result was supported by Taira et al. (2009), who examined animals suspected of infection with PMWS and porcine respiratory disease complex. However, due to the limited number of animal species examined and the lack of information about viral cell and tissue tropism, the characteristics and evolution of TTSuV are not fully understood.
We previously investigated synonymous codon usage in TTSuV1 and began to suspect that this method might be important for elucidating the molecular mechanism and evolutionary process of TTSuV. In this study, synonymous codon usage bias was analyzed in the coding sequences (CDS) from the 41 available TTSuV2 genomes, and the codon usage patterns of TTSuV2 and TTSuV1 were compared.
Complete genome sequences from 41 TTSuV2 isolates were downloaded from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/Genbank/). Each TTSuV2 CDS was analyzed using DNAStar version 7.1 (DNAStar, Madison, WI). Table 1 summarizes relevant details about these viral sequences.
The Recombination Analysis Tool (RAT, http://cbr.jic.ac.uk/dicks/software/RAT/) was used to detect recombination events in TTSuV2 and TTSuV1 sequences. Recombination is a major driving force that shapes genome evolution, and it is believed to influence the efficacy of natural selection on codon usage. RAT uses a distance-based algorithm to perform pair-wise comparisons within multiple sequence alignments (DNA or protein). The RAT graph represents the genetic distance of each sequence in the alignment to a reference sequence (Y-axis) for each position in the sequence (X-axis). A putative recombination event is detected when the lines representing two sequences intersect in the graph.
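The idea behind the distance graph can be illustrated with a short Python sketch. This is not RAT's actual implementation: the function names, the use of the simple proportion of mismatches as the distance, and the window and step sizes are all assumptions made for illustration, and the input sequences are assumed to be already aligned and of equal length.

```python
def window_distances(reference, query, window=4, step=4):
    """Proportion of mismatches between query and reference in sliding windows.

    Hypothetical helper: the distance measure and window settings are
    assumptions, not those used by RAT.
    """
    distances = []
    for start in range(0, len(reference) - window + 1, step):
        ref_w = reference[start:start + window]
        qry_w = query[start:start + window]
        mismatches = sum(1 for r, q in zip(ref_w, qry_w) if r != q)
        distances.append(mismatches / window)
    return distances


def crossing_windows(dist_a, dist_b):
    """Indices of windows at which the two distance curves swap order."""
    crossings = []
    previous = 0  # sign of (a - b) at the last window where the curves differed
    for index, (a, b) in enumerate(zip(dist_a, dist_b)):
        sign = (a > b) - (a < b)
        if sign != 0:
            if previous != 0 and sign != previous:
                crossings.append(index)
            previous = sign
    return crossings


# Toy aligned sequences (illustrative only, not TTSuV data).
reference = "AAAAAAAA"
seq_a = "AAAATTTT"  # close to the reference in the first half only
seq_b = "TTTTAAAA"  # close to the reference in the second half only
curve_a = window_distances(reference, seq_a)
curve_b = window_distances(reference, seq_b)
print(crossing_windows(curve_a, curve_b))
```

In this toy case the two curves intersect between the first and second windows, which is the kind of pattern that would be flagged as a putative recombination event.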
Compositional properties measures
General nucleotide composition (A%, C%, T% and G%) and nucleotide composition at the third position of each codon (A3S%, C3S%, T3S% and G3S%) were analyzed for TTSuV2 CDSs using Molecular Evolutionary Genetics Analysis (MEGA) software version 5.0. The GC and GC3S indices were used to describe the G + C content of the whole gene sequence and of the third position of synonymous codons (excluding Met, Trp and termination codons), respectively.
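As a rough illustration of these indices, the following Python sketch computes GC% and third-position composition for a single CDS. It uses simple proportions and hypothetical function names; MEGA and CodonW may normalise the third-position indices differently, so the output should not be expected to reproduce their values exactly.

```python
# Codons excluded from third-position statistics: Met (ATG), Trp (TGG)
# and the three termination codons, following the definition in the text.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def split_codons(cds):
    """Split a coding sequence into complete codons (DNA alphabet, upper case)."""
    seq = cds.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_percent(cds):
    """Overall G + C content of the sequence, in percent."""
    seq = cds.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def third_position_percent(cds):
    """Percent of A, C, G, T and G + C at the third position of synonymous codons.

    Simplifying assumption: plain proportions over all retained codons.
    """
    thirds = [c[2] for c in split_codons(cds) if c not in EXCLUDED]
    total = len(thirds)
    result = {base + "3S": 100.0 * thirds.count(base) / total for base in "ACGT"}
    result["GC3S"] = result["G3S"] + result["C3S"]
    return result


# Toy CDS (illustrative only): ATG GCA GCC TGG AAA TAA
toy_cds = "ATGGCAGCCTGGAAATAA"
print(round(gc_percent(toy_cds), 2))
print(third_position_percent(toy_cds))
```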
Measure of synonymous codon usage
Relative synonymous codon usage (RSCU) values and effective number of codons (ENC) values were calculated using CodonW software version 1.4 (http://codonw.sourceforge.net). The RSCU is defined as the ratio between the usage frequency of one codon in the gene and its expected frequency in the synonymous codon family (i.e., the observed frequency of a codon adjusted for amino acid composition). The RSCU value is calculated according to the following published equation:

$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \, n_i$$

where $g_{ij}$ denotes the observed number of the codon (i) in the CDS for the corresponding amino acid (j), and $n_i$ denotes the total number of synonymous codons encoding that amino acid. Codons with RSCU values greater than 1.0 exhibit positive codon usage bias, while those with RSCU values less than 1.0 have negative codon usage bias. RSCU values of 1.0 indicate that the codon frequencies are equal or random.
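The RSCU definition given above can be sketched in a few lines of Python. This is a minimal illustration under the standard genetic code, not the CodonW implementation; `CODON_TABLE` and `rscu` are names introduced here for illustration, and amino acids that do not occur in the sequence are simply skipped.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order; "*" marks termination codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for the 59 synonymous sense codons present by family."""
    seq = cds.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # skip termination codons, Met and Trp
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            # observed count divided by the count expected under equal usage
            values[codon] = counts[codon] * len(family) / total
    return values


# Toy sequence (illustrative only): three GCA and one GCC codon, all Ala.
toy = rscu("GCAGCAGCAGCC")
print(toy["GCA"], toy["GCC"], toy["GCT"], toy["GCG"])
```

Within each synonymous family the RSCU values sum to the number of codons in that family, which is a convenient sanity check on any implementation.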
The ENC is the most useful estimator of absolute synonymous codon usage bias and can indicate the degree of synonymous codon bias in a codon family. ENC values range from 20 (only one synonymous codon occurs in the CDS) to 61 (all synonymous codons occur with equal frequency). A gene with an ENC value lower than 35 is generally considered to have significant codon usage bias.
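For reference, the ENC statistic is usually computed following Wright's formulation, reproduced here as a sketch. The symbols $F$, $p_i$, $n$ and $\bar{F}_k$ are Wright's notation rather than notation from this study, and implementation details such as the handling of absent amino acids in CodonW are not covered.

```latex
% Homozygosity of one amino acid observed n times,
% with p_i the frequency of its i-th synonymous codon:
F = \frac{n \sum_{i} p_i^{2} - 1}{n - 1}

% ENC from the mean homozygosity \bar{F}_k of the k-fold degenerate classes
% (standard code: 2 single-codon amino acids, 9 two-fold, 1 three-fold,
%  5 four-fold and 3 six-fold families):
\mathrm{ENC} = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3}
                 + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}
```

When every amino acid uses a single codon, each $\bar{F}_k = 1$ and ENC = 20; when synonymous codons are used equally, $\bar{F}_k$ approaches $1/k$ and ENC approaches 61, which gives the range stated above.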
Correspondence analysis (COA), a multivariate ordination method closely related to principal component analysis, was performed with CodonW software version 1.4. COA is one of the most commonly used multivariate statistical methods in codon usage studies. In this analysis, COA was used to study the major trends in sequence variation and to distribute genes along continuous axes according to these trends. Each gene was represented as a 59-dimensional vector, with each dimension corresponding to the RSCU value of one sense codon (excluding Met, Trp and termination codons). Major trends of variation within this dataset can be determined from the relative inertia: genes were positioned along the axes of greatest inertia to determine the major factors affecting codon usage bias.
Correlation analysis was performed to examine the relationship between nucleotide composition and synonymous codon usage pattern using Spearman's rank correlation analysis. A phylogenetic tree was constructed by the neighbor-joining method with 1000 bootstrap replicates, based on the Clustal W alignment produced with MEGA software version 5. Cluster analysis was performed using the hierarchical cluster method, and the distances between selected sequences were calculated as Euclidean distances. All statistical results were analyzed using Student's t-test in SPSS software version 11.6 for Windows (p > 0.05, no difference; 0.01 < p < 0.05, non-significant difference; p < 0.01, significant difference).
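Spearman's rank correlation is the Pearson correlation of the ranked values, which the following Python sketch makes explicit. It is a minimal illustration with hypothetical function names and made-up input values, and it does not compute p-values; the analyses reported here were run in SPSS.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Made-up composition values for four hypothetical genomes (illustrative only).
a_percent = [30.1, 31.4, 29.8, 32.0]
a3s_percent = [40.2, 42.5, 39.9, 41.0]
print(round(spearman(a_percent, a3s_percent), 3))
```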
Recombination is believed to influence the efficacy of natural selection on codon usage. A single recombinant sequence present in an alignment can seriously influence the branch order and branch length of the trees generated using standard phylogenetic methods. Therefore, it was necessary to exclude any TTSuV2 and TTSuV1 sequences found to be recombinant from further analysis. Recombination analysis of a nucleotide sequence alignment including all 41 TTSuV2 sequences and 29 TTSuV1 sequences was performed using RAT software (Figure 1). The resulting graph provided no evidence for recombination within or between TTSuV2 and TTSuV1 sequences. However, the graph indicated that the sequences diverged at nucleotide position 2282 into branches corresponding to TTSuV2 and TTSuV1.
The 41 TTSuV2 sequences were further analyzed for codon usage bias and the synonymous codon usage pattern between TTSuV2 and TTSuV1 (previously analyzed) was compared, as described in the following sections.
The nucleotide content of the TTSuV2 genomes is provided in Table 2. In the CDSs from the 41 genomes, A and G occurred more frequently than C and T. A occurred most frequently at the third codon position (average A3S% = 41.77%) and T occurred the least frequently (average T3S% = 27.67%). The overall nucleotide composition and the composition at the third codon position in TTSuV2 genomes suggest that compositional constraint might be influencing the codon usage pattern of this genome. The GC% of TTSuV2 genomes (42.9% to 46.7%, average 45.1%) is lower than for other vertebrate DNA viruses. The GC3S% ranged from 43.2% to 48.2% with a mean value of 46.2%. Due to this compositional constraint, it was expected that A would occur most frequently at the third codon position in TTSuV2 genomes.
The ENC values of these TTSuV2 genomes were much higher than those of other DNA viruses, varying from 55.20 to 58.18 with a mean value of 56.21. This result indicates that codon usage bias is not remarkable in TTSuV2 genomes and is apparently maintained at a stable level.
Codon usage in TTSuV2
The overall RSCU values for the 59 codons in all 41 TTSuV2 genomes indicated that A and C occurred most frequently at the third codon position (e.g., GUA for Val, GCA for Ala, CAA for Gln and AAC for Asn), as shown in Table 3. In addition, the CCU, ACU and UAU codons, encoding Pro, Thr and Tyr, respectively, occurred more frequently than the other synonymous codons for these amino acids. Two codons encoding Arg, CGA and CGC, also occurred more frequently than their synonymous codons. These results support the hypothesis that compositional constraint is a major contributing factor in the codon usage pattern of TTSuV2 genomes.
For TTSuV2 sequences, ENC was plotted against both the GC content at the third synonymous codon position (GC3S%) and the expected ENC values, as determined by CodonW analysis (Figure 2). All actual codon usage indices were lower than expected, although differences were small. In addition, a positive correlation (r = 0.316, 0.01 < p < 0.05) between GC3S and ENC values was found. These results taken together support the conclusion that factors other than compositional constraint under mutation pressure (the major factor accounting for codon usage bias) have influenced TTSuV2 evolution.
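The expected curve in an ENC plot is conventionally drawn from Wright's approximation, which gives the ENC expected when codon usage is constrained only by GC3S. The Python sketch below evaluates it at the mean GC3S reported for TTSuV2; the function name is hypothetical, and it is an assumption that CodonW's expected values follow this same formula.

```python
def expected_enc(gc3s):
    """Expected ENC when codon choice depends only on GC3S.

    gc3s is the G + C fraction at synonymous third positions, in [0, 1].
    Wright's approximation: ENC = 2 + s + 29 / (s^2 + (1 - s)^2).
    """
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


# Mean values reported in the text for the 41 TTSuV2 genomes.
mean_gc3s = 0.462
observed_mean_enc = 56.21

expected = expected_enc(mean_gc3s)
print(round(expected, 2), observed_mean_enc < expected)
```

At the mean GC3S of 46.2% the approximation gives an expected ENC of about 60.13, so the observed mean of 56.21 lies below the curve, in line with the description of Figure 2.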
COA of codon usage
To investigate RSCU variation, COA was performed using the 41 TTSuV2 genomes as a single dataset. As described in the "Materials and methods" section, the distribution of genes on the COA axis was used to identify the source of the variation among a set of multivariate data points. A major trend in the first axis (f1′) accounted for 16.91% of total synonymous codon usage variation, and the second major trend in the second axis (f2′) accounted for 13.72% of the total variation (data not shown).
COA was performed for TTSuV1 and TTSuV2 genomes separately and the first two axes of the plots are shown in Figure 3. Although TTSuV1 and TTSuV2 genes occupied all four quadrants of the rectangular coordinate system, the points were generally separated from each other. This result reveals that variation in codon usage might be one of the factors driving the observed aspect of TTSuV evolution.
Effect of mutational bias on codon usage variation
To explore whether the evolution of codon usage bias in TTSuV2 CDS was driven by mutation pressure alone or whether translational selection from its host also contributed, we first examined the correlation between general nucleotide composition (A%, T%, G%, C%, GC%) and nucleotide composition at the third codon position (A3S%, T3S%, G3S%, C3S%, GC3S%) using Spearman's rank correlation analysis (Table 4). A significant positive correlation was observed between A% and A3S% (r = 0.761, p < 0.01), C% and C3S% (r = 0.392, 0.01 < p < 0.05), and GC% and GC3S% (r = 0.645, p < 0.01), and a significant negative correlation was observed for most heterogeneous nucleotide comparisons. Taken alone, these results suggest that compositional constraints under mutation pressure determine the codon usage pattern of TTSuV2. However, the significant positive correlations between G% and C3S% (r = 0.434, p < 0.01) and between GC% and T3S% (r = 0.434, p < 0.01), together with the absence of correlation between T% and T3S% (r = 0.175, p > 0.05) and between G% and G3S% (r = 0.171, p > 0.05), suggest that natural selection from the host might have played an appreciable role in determining the codon usage pattern of this virus.
Furthermore, G + C content at the first and second codon positions (GC1% and GC2%) was compared with the G + C content at the third codon position (GC3%). Highly significant correlations were observed between GC1% and GC2% (r = 0.551, p < 0.01), GC1% and GC3% (r = 0.699, p < 0.01), and GC2% and GC3% (r = 0.490, p < 0.01). Since the effects were present at all codon positions, the results further support the hypothesis that nucleotide constraint under mutation pressure was a main determinant of the synonymous codon usage pattern in TTSuV2.
Correlation analysis was also performed between the first two principal axes (f1′ and f2′) of the COA and A%, T%, G%, C%, GC%, A3S%, T3S%, G3S%, C3S% and GC3S% (Table 5). The first principal axis (f1′) exhibited a significant positive correlation with G%, C%, GC%, C3S% and GC3S% and a negative correlation with A% and A3S%. Interestingly, except for G3S% (r = –0.357, 0.01 < p < 0.05), the second principal axis (f2′) showed no correlation with any nucleotide content. These results further support the conclusion that compositional constraint under mutational bias is an important factor determining the synonymous codon usage pattern in TTSuV2, but that other factors, such as natural selection, also contributed.
Relationship between TTSuV and host codon usage patterns
In the ENC plot (Figure 2), most points lay close to, but under, the expected curve, which suggested that factors other than mutation pressure also contributed to codon usage bias. To examine this further, a comparative analysis of RSCU values was performed for TTSuV2, TTSuV1 and swine, the natural host of these viruses. We found that the codon usage pattern of TTSuV2 largely coincided with that of TTSuV1 and that the similarity between the viruses and the host was low. In particular, except for CCU encoding Pro and UAU encoding Tyr, all the preferentially used codons in TTSuV2 and TTSuV1 had an A or C in the third codon position, for example UUA for Leu, AUA for Ile, UCA for Ser, CAC for His, GAC for Asp and UGC for Cys (Table 3). In contrast, the most frequent codons in swine had a T or A at the third codon position. Although some codons frequent in swine, such as CAC for His, AAA for Lys, GAC for Asp and GAA for Glu, were also frequent in TTSuV2 and TTSuV1, the high-frequency codons in swine (CUG for Leu, UCU for Ser, UGU for Cys) were generally low-frequency codons in TTSuV2 and TTSuV1. It is worth noting that the similarity to swine was higher for TTSuV1 than for TTSuV2. For several codons, including GUG for Val, GCU for Ala, CAG for Gln and AAU for Asn, the RSCU values in TTSuV1 and swine were clearly different from those in TTSuV2. This suggests that TTSuV1 might have adapted to its host under natural selection to some degree for improved translation efficiency, and that selection pressure from the host had less effect on the codon usage pattern of TTSuV2.
Phylogenetic and cluster analysis
A cluster tree was generated with the RSCU values from all 41 TTSuV2 genomes using a hierarchical cluster method. As shown in Figure 4, the TTSuV2 CDS were divided into three main lineages (I–III). Lineage I comprised two strains isolated from the USA, one from Germany and five from China. Twenty-two strains isolated from Brazil, Spain and China were grouped into Lineage II. Lineage III comprised strains isolated from China only. Some genes from different isolates were classified into the same lineage, while other genes from the same isolate were classified into different lineages; thus, lineage did not correspond well with geographical distribution.
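Hierarchical clustering on Euclidean distances between RSCU vectors can be sketched as follows in Python. The linkage criterion used in the study is not stated, so average linkage is an assumption here, as are the function name and the toy two-dimensional vectors (real RSCU vectors have 59 dimensions).

```python
from math import dist  # Euclidean distance, available from Python 3.8


def average_linkage(vectors):
    """Agglomerative clustering with average linkage (assumed criterion).

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are indices into `vectors`.
    """
    clusters = {i: [i] for i in range(len(vectors))}
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for a_pos, a in enumerate(ids):
            for b in ids[a_pos + 1:]:
                # mean pairwise Euclidean distance between the two clusters
                d = sum(dist(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy vectors standing in for RSCU profiles (illustrative only).
toy_vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = average_linkage(toy_vectors)
for members_a, members_b, distance in history:
    print(members_a, members_b, round(distance, 3))
```

Reading the merge history from first to last corresponds to reading a cluster tree from its tips toward its root; the last merge separates the two most distinct groups.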
The phylogenetic analysis of all 41 TTSuV2 (black dots) and 29 TTSuV1 sequences (white dots) was performed to determine the conservation and variation of codon usage pattern within TTSuV lineages (Figure 5). The two major branches of the resulting phylogenetic tree corresponded to TTSuV2 and TTSuV1, and each branch had several minor branches. Thus, phylogenetic analysis of the two viruses did not reveal correlations between sequence differences and geographical distribution.
TTSuV is an emerging small DNA virus that is widely distributed in pig-farming countries. Although reports implicate TTSuV in co-infection with other pathogens, in-depth studies on its molecular characteristics and pathogenic mechanism are lacking. Analysis of synonymous codon usage is a well-established technique for extracting genetic information from viral genomes. Most codon usage studies have focused on higher organisms or microorganisms with large genomes and on viruses that pose a great threat to human health, such as human immunodeficiency virus, human bocavirus, hepatitis virus and influenza A virus. Results from analyzing codon usage bias in TTSuV genomes are expected to contribute to the knowledge of the characteristics and molecular evolution of this virus. This report extends our investigation of synonymous codon usage variation in TTSuV1 and provides the first analysis of TTSuV2.
Recombination is an important event in viral evolution and epidemiology. It is interesting to note that recombinant viruses appear to be highly pathogenic, suggesting that recombination events either preserve or increase the pathogenicity of the original strains. Various studies have demonstrated that natural inter- and intra-genotypic recombination occurs frequently in viruses, as shown for highly pathogenic porcine reproductive and respiratory syndrome viruses, PCV2, human enterovirus 71 and rabbit haemorrhagic disease virus. Thus, before analyzing codon usage bias for TTSuV2, we first conducted recombination analysis of 41 TTSuV2 sequences and 29 TTSuV1 sequences, and found no evidence for recombination between the two viruses (Figure 1).
In this study, we analyzed synonymous codon usage bias in TTSuV2 CDS, as well as the relationship between the codon usage patterns of TTSuV2 and TTSuV1. The most frequent codons in both TTSuV2 and TTSuV1 had A or C at the third codon position. Mean ENC values for H5N1 influenza A virus, severe acute respiratory syndrome coronavirus and human bocavirus, reported as 50.91, 48.99 and 44.45, respectively, are lower than the ENC values for TTSuV2 and TTSuV1 (56.21 and 56.46, respectively), indicating a relatively low codon usage bias for these two viruses. Codon usage patterns for TTSuV2 and TTSuV1 were remarkably similar. In addition, no significant relationship was found between the codon usage pattern of TTSuV2 and that of its host, although TTSuV1 codon usage was comparatively more similar to that of swine than was TTSuV2 codon usage (Table 3). This observation might be the result of genome composition evolution and dynamic processes of mutation and selection that enabled TTSuV1 to escape antiviral cell responses and adapt its codon usage to its host environment.
In this study, nucleotide frequency at the third codon position of synonymous codons correlated with general composition for some codons but not for others (Table 4). The GC content was similar at all codon positions in TTSuV2 genomes, presumably as a result of mutational pressure. In addition, the general correlation between codon usage bias and compositional constraint suggests that mutational pressure was an important factor determining codon usage in TTSuV2, as seen in the highly significant correlations among GC1%, GC2% and GC3% (p < 0.01) and the strong correlations of f1′ values with A%, G%, C%, GC%, A3S%, G3S% and GC3S% (p < 0.01) (Table 5). Furthermore, in the ENC plot, all values for TTSuV2 genomes were below the expected curve (Figure 2). Taken together, the above evidence indicates that compositional constraint under mutational pressure significantly contributed to the variation of synonymous codon usage in TTSuV2 genomes.
Natural selection has been shown to influence the synonymous codon usage pattern in viruses, and this conclusion is supported by this study. First, although the GC3S% for the TTSuV2 genome is lower than average (46.20%), the most frequent codons had A or C at the third codon position (Table 3). Second, a significant positive correlation existed between G% and C3S%, and between GC% and T3S% (p < 0.01), whereas no correlation was detected between T% and T3S% or between G% and G3S% (p > 0.05) (Table 4). Except for G3S%, no correlation was found between f2′ values and A%, T%, G%, C%, GC%, A3S%, T3S%, C3S% or GC3S% (p > 0.05) in this study (Table 5). Third, most points in the ENC plot were close to the expected curve, although all were below it (Figure 2). The above evidence suggests that, in addition to mutation pressure, natural selection also played an important role in determining codon usage bias in TTSuV2 genomes. Thus, codon bias in the TTSuV2 genome is multi-factorial. We believe that these characteristics of TTSuV2 genomes might have conferred an adaptive advantage, resulting in highly efficient dissemination of this virus through different modes of transmission.
The analysis of TTSuV genome sequences identified two genetically distinct species, TTSuV1 and TTSuV2. COA was performed to detect possible codon usage variation between these two viruses. Unexpectedly, the distribution of the two viruses showed that genetically distinct species were distantly located in the plane defined by the first two axes of the analysis (Figure 3). A cluster tree analysis based on the RSCU values of TTSuV2 genomes revealed that geographic factors failed to correspond to the codon usage pattern of this virus (Figure 4). Further, the phylogenetic tree had two major branches corresponding to the two different species, and no specific geographical correlation was detected in this analysis (Figure 5). It seems likely that, given extensive international communication and various modes of transmission for this virus, geographical distance is a weak factor in the distribution of TTSuV2 in different countries.
In summary, our investigation of synonymous codon usage pattern in TTSuV2 CDS revealed that codon usage bias is not remarkable, possibly representing the interactions between compositional constraint under mutation pressure and natural selection. However, both TTSuV1 and TTSuV2 genomes exhibited significant synonymous codon usage bias favoring A or C at the third codon position, presumably determined by compositional constraint under mutation pressure. Although the analysis of synonymous codon usage does not perfectly reflect the genetic variation of TTSuV2 nor does it distinguish between TTSuV1 and TTSuV2, our results provide an insight into the codon usage variation in TTSuV2 genes that may also facilitate understanding of TTSuV evolution.