Dataset: 11.1K articles from the COVID-19 Open Research Dataset (PMC Open Access subset)
All articles are made available under a Creative Commons or similar license. Specific licensing information for individual articles can be found in the PMC source and CORD-19 metadata.

Cell Tropism Predicts Long-term Nucleotide Substitution Rates of Mammalian RNA Viruses

Introduction

RNA viruses are responsible for a disproportionate number of emerging human diseases, including influenza, Ebola hemorrhagic fever, hantavirus pulmonary syndrome, and Middle East respiratory syndrome, which place tremendous health and economic burdens on both the developing and developed world. In 2008, rotavirus and measles virus caused the deaths of 570,000 children under the age of five, making them two of the leading killers of children worldwide. In 2009, it was estimated that rotavirus infections alone result in $325 million in medical treatment costs and $423 million in societal costs each year. Further, the implementation of many intervention strategies has either failed or been delayed as a result of the evolutionary dynamics of these pathogens.

Differences in viral evolutionary dynamics, such as rates of evolution, can explain why certain viruses have the capacity to adapt to new host species, increase in virulence, or develop resistance to antivirals. Therefore, understanding why some RNA viruses evolve more quickly can facilitate better prediction of their pathogenic and epidemiological potential. Though extremely high nucleotide substitution rates are a defining feature of RNA virus evolution, there have been few attempts to comprehensively examine the driving genomic and ecological factors behind these rates.

Differences in the strength and direction of selection pressures on these viruses result in variation among their substitution rates. However, while some general patterns have been observed in selection pressures, such as enhanced purifying selection on the structural proteins of arboviruses, there have been no attempts to quantify the relationship between selection pressures and long-term viral substitution rates.

The high rates of RNA virus evolution are most commonly attributed to their replication with error-prone RNA-dependent RNA polymerases (RdRps), but these nucleotide substitution rates are known to span at least three orders of magnitude, and do not correlate well with experimentally measured viral mutation rates. Further, the substitution rates of some DNA viruses, which replicate with high-fidelity DNA polymerases, are comparable to the high substitution rates of RNA viruses. Therefore, the polymerase error rate alone cannot explain the substitution rate variation in RNA viruses.

Along with mutation rate, viral replication frequency directly impacts the rate at which mutations can be introduced, and ultimately fixed as substitutions. Replication frequencies could be influenced by a variety of factors related to viral genomic architecture or ecology. For example, weak negative correlations between viral genome lengths and substitution rates have been attributed to either enhanced replication frequencies or higher mutation rates in viruses with smaller genomes. It has also been suggested that different transmission and infection modes result in differences in generation time, ultimately causing variation among per-year rates of synonymous substitution of RNA virus structural genes.

In this modern survey of mammalian RNA virus evolution rates, we generated and compiled published substitution rates of structural and non-structural genes produced by Bayesian coalescent analyses. We analyzed these rates as a function of seven factors related to virus genomic architecture (i.e., genome length, genome sense, and whether or not the genome is segmented) and virus ecology (i.e., target cell, transmission mode, host range, and whether the infection is acute or persistent). We also evaluated the relationships of viral substitution rates with dN/dS estimates, experimentally measured mutation rates, and estimated generation times. Though recombination undeniably plays a role in shaping viral evolutionary dynamics and could inflate substitution rate estimates, we conservatively removed any potential recombinants from our datasets prior to analysis. Through this broad analysis, we were able to demonstrate that cell tropism, and its impact on viral generation time, has the greatest influence on rates of mammalian RNA virus evolution.

Datasets

A review of the literature yielded 92 published Bayesian nucleotide substitution rate estimates for the structural genes of 35 different mammalian RNA viral species, and 21 published Bayesian rates for RdRps or a non-structural gene of 14 different viral species (referred to collectively as “non-structural,” Table S1). These rates were supplemented with 26 novel Bayesian substitution rates of structural genes of 19 different viral species, and 19 novel Bayesian rates of non-structural genes of 16 different viral species (Table S2). Collectively, these rates span three orders of magnitude, ranging from 3.0×10⁻⁵ to 1.5×10⁻² nucleotide substitutions per site per year (ns/s/y) and 2.0×10⁻⁵ to 1.3×10⁻² ns/s/y for the structural genes and non-structural genes, respectively (Table S1).

Plotting the levels of each variable by ascending mean substitution rate revealed similar patterns (i.e., the same ordering of levels) for both the structural (S) and non-structural (NS) datasets in three of these variables, excepting transmission route. Viral substitution rates grouped according to target cell (panels 1A and 1B), transmission route (panels 1C and 1D), infection type (panels 1E and 1F), and host range (panels 1G and 1H) are shown in Figure 1.

Substitution rates were also grouped by viral genomic architecture (genome sense/strandedness, Figure 2A and 2B, and genome segmentation, Figure 2C and 2D) and plotted against viral genome length (Figure 2E and 2F). There were no apparent relationships between genomic properties and substitution rates (Figure 2), including no linear relationship between substitution rates and genome lengths in either dataset (coefficient of determination, S: R² = 0.06, NS: R² = 0.08).

dN/dS estimates calculated in this study were compiled with published estimates also calculated using the Single Likelihood Ancestor Counting (SLAC) method (56 structural gene dN/dS estimates, 33 non-structural gene dN/dS estimates total, Table S1).

Statistical analyses

ANCOVA analyses were performed separately on the structural and non-structural gene datasets to determine which, if any, of seven factors (target cell, transmission route, infection mode, host range, genome length, genome sense, and genome segmentation) significantly predict the nucleotide substitution rates of mammalian RNA viruses. To explore the many dummy-coded categorical variables, three analyses were run using different variable levels as the base levels (see Methods for details, Tables 1 and 2). For all of the ANCOVA analyses, the adjusted coefficient of determination (adjusted R²) was ≥0.73, indicating that over 70% of the substitution rate variability can be explained by the predictor variables included in this study. Standardized residual plots identified only six potential outliers of the 118 structural gene rates and one potential outlier of the 40 non-structural gene rates (Figure S1), indicating that the data are normally distributed and therefore amenable to a general linear model.
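Dummy coding with a rotating base level can be illustrated with a minimal sketch. The function name `dummy_code` and the short list of target cells are hypothetical and are not taken from the study's SPSS workflow; the point is only that each analysis omits one level, and every remaining level is then compared against it.

```python
def dummy_code(values, base_level):
    """Return one 0/1 indicator column per level, omitting the base level.

    Coefficients fitted to these columns are contrasts against the base level,
    which is why re-running the model with a different base can change which
    levels appear significant.
    """
    levels = sorted(set(values) - {base_level})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical target-cell assignments for five rate estimates.
target_cells = ["epithelial", "neuron", "leukocyte", "epithelial", "hepatocyte"]

# Three analyses, each with a different base level (choice of bases is illustrative).
for base in ("epithelial", "neuron", "leukocyte"):
    columns = dummy_code(target_cells, base)
    print(base, "->", sorted(columns))
```

With "neuron" as the base, for example, the epithelial column is 1 only for the first and fourth estimates, and its coefficient would describe epithelial-targeting viruses relative to neurotropic ones.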

Regardless of the base levels, target cells were the only significant predictors of log-transformed substitution rates for both structural and non-structural genes (Tables 1 and 2), with cell tropism as the only significant predictor variable by type III sum of squares (SS) analyses (P<0.0001 and P = 0.003 for the structural and non-structural gene datasets, respectively). Targeting epithelial cells or neurons was found to be the most significant predictor of structural gene rates in each analysis where these were not the base levels (P<0.0001, Table 1, Figure 3), while targeting neurons was found to be the sole significant predictor of substitution rates for the smaller non-structural gene dataset (P = 0.009, Table 2, Figure 3). Further, there was a high correlation between each viral species' estimated structural gene substitution rate and its corresponding non-structural gene rate (33 viruses, Pearson r = 0.87, P<0.0001). This suggests that if it were possible to calculate more non-structural rates, we would likely see results similar to those from the structural gene dataset.

To minimize any potential bias introduced by using multiple published rates for a single viral strain or species, we conducted control analyses using datasets with only one rate per species. For species with multiple substitution rates in one of our datasets, we calculated the average log substitution rate and used that as the sole substitution rate for the species in the control analysis. These data were also normally distributed (Figure S2), but the adjusted R² values for these analyses were slightly lower than for the full datasets (S: adjusted R² = 0.65, NS: adjusted R² = 0.70, Tables S3 and S4). These control results were consistent with those from the full dataset analyses: tropisms for epithelial cells or neurons were the most significant substitution rate predictors (Tables S3 and S4, Figure S3).
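The one-rate-per-species control can be sketched as follows. The species labels and rates are invented for illustration, and the base-10 logarithm is an assumption (the source says only that rates were log-transformed).

```python
import math
from collections import defaultdict


def one_rate_per_species(records):
    """Collapse (species, rate) pairs to the mean log10 rate per species."""
    logs = defaultdict(list)
    for species, rate in records:
        logs[species].append(math.log10(rate))
    return {species: sum(vals) / len(vals) for species, vals in logs.items()}


# Hypothetical rates in nucleotide substitutions per site per year.
records = [
    ("virus A", 1.0e-3),
    ("virus A", 1.0e-5),
    ("virus B", 2.0e-4),
]

collapsed = one_rate_per_species(records)
for species, mean_log_rate in collapsed.items():
    print(species, round(mean_log_rate, 3))
```

Averaging on the log scale gives the geometric mean of the original rates, so a single unusually high estimate pulls the species value less than an arithmetic mean would.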

Because of the high correlation between the structural and non-structural gene rates, we combined the two datasets (Figure 4) and performed a final set of three ANCOVA analyses using this combined dataset. The results from these analyses were nearly identical to those from the structural gene analyses (Table S5). The exception was that, in addition to cell tropism, Type III SS analysis also identified transmission route as a significant predictor variable (P = 0.007), though it was still less significant than cell tropism (P<0.0001). More specifically, in addition to different cell tropisms, transmission through arthropod vectors was also found to be a significant rate predictor in one of the three analyses (P = 0.002, Table S5).

To ensure that any substitution rate variability attributed to a given predictor variable was not significantly dependent on other predictor variables, we examined collinearity in all datasets. With the exception of the persistent infection variable, which was nested with the endothelial target cell variable and thus excluded, the ANCOVA analyses for the structural gene rate datasets and the combined rate dataset showed no significant collinearity (no variance inflation factors (VIF) were greater than 10). For the non-structural gene rate datasets, many different predictor variables had VIF>10. However, subsequent analyses where each individual variable was removed did not significantly reduce collinearity in these datasets (data not shown). Due to the consistent results between the structural and non-structural gene datasets, as well as those from the combined rate dataset, we concluded that correlations among independent variables did not significantly impact our results.
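For readers unfamiliar with the diagnostic, a variance inflation factor is 1 / (1 − R²), where R² comes from regressing one predictor on all the others. The sketch below is a plain-Python illustration with hypothetical predictor columns; it is not the SPSS procedure used in the study.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * yi for p, yi in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns, j):
    """Variance inflation factor of column j given the other columns.

    Perfectly nested predictors give R^2 = 1 and an undefined VIF, which is why
    such a variable has to be dropped before the check.
    """
    others = [c for k, c in enumerate(columns) if k != j]
    return 1.0 / (1.0 - r_squared(columns[j], others))


# Hypothetical predictors: two unrelated indicators, then two nearly redundant covariates.
independent = [[1, 0, 1, 0], [1, 1, 0, 0]]
redundant = [[1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1]]
print(round(vif(independent, 0), 2))
print(vif(redundant, 0) > 10)
```

Unrelated predictors give a VIF near 1, while the nearly redundant pair exceeds the threshold of 10 used in the text.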

Since target cells were found to be the only consistently significant predictors of substitution rates, a series of one-tailed t-tests was used to confirm which cell tropisms are associated with higher viral substitution rates than others. Viruses that target epithelial cells were found to have significantly higher structural gene substitution rates than viruses that target neurons, endothelial cells, or leukocytes (Table 3, P<0.0009). Similarly, viruses that target epithelial cells were found to have significantly higher non-structural gene substitution rates than viruses that target neurons, hepatocytes, or leukocytes (Table 4, P<0.0007). These results were recapitulated in the control datasets that only used one rate per viral species (Tables S6 and S7). It should be noted, however, that most of the viruses in this study that are classified as targeting leukocytes ultimately cause systemic infections and infect a wide variety of cell types. Consequently, viruses in the leukocyte target cell category had the most rate variation of all the target cell categories (Figure 1).

Because transmission through arthropod vectors was also found to be a significant rate predictor in the ANCOVA analyses based on the combined datasets and because of the correlation between epithelial cell tropism and fecal-oral/respiratory transmission, we evaluated any significant variation among substitution rates of viruses with different transmission routes. Using a series of one-tailed t-tests, we found that viruses that are transmitted through the fecal-oral/respiratory route have significantly higher substitution rates than those transmitted by arthropod vectors (P<0.0001). However, we also compared different cell tropisms within each of these transmission routes. We found that fecal-oral/respiratory transmitted viruses that target epithelial cells have significantly higher substitution rates than those that target other cell types (P<0.0001, Figure 5). Similarly, we found that neurotropic arboviruses have significantly lower substitution rates than arboviruses that target other cell types (P<0.001, Figure 5).

We also tested for linear relationships between viral substitution rates and other evolutionary parameters for which only smaller subsets of our datasets could be analyzed. Reliable experimentally measured mutation rates estimated as mutations per base per infectious cycle were only available for four different viruses included in this study (poliovirus 1, hepatitis C virus, influenza A virus, and influenza B virus). Mutation rates measured as mutations per base per strand replication were only available for three viruses included in this study (poliovirus 1, measles virus, and influenza A virus). These mutation rates were not significantly correlated with their corresponding substitution rate estimates (r = 0.69, P = 0.31 and r = −0.93, P = 0.25, for mutation rates measured as mutations per base per infection and mutation rates measured as mutations per base per replication, respectively). Similarly, there were no significant correlations between the estimated substitution rates and dN/dS estimates (ρ = −0.02, P = 0.88 and ρ = −0.07, P = 0.68, for the limited structural gene and non-structural gene datasets, respectively).

ANCOVA and t-tests consistently revealed epithelial cell tropism and neurotropism as the most significant viral substitution rate predictors. Since these two cell types have some of the highest and lowest turnover rates, respectively, of all mammalian cells, we sought to determine if there were any associations between host cell turnover rate and viral generation time. Using the model proposed by Sanjuán (2012) that relates the long-term substitution rate, K, to the mutation rate, μ, correcting for transient deleterious mutations, we were able to estimate generation times for the few viruses with reliable mutation rate estimates. This model, which also takes the genome length (G), the generation time (g), and the harmonic mean of the selection coefficient (sH) as parameters, confirmed that influenza A virus, influenza B virus, and poliovirus, which target epithelial cells, have substantially shorter generation times (<40 hours) than hepatitis C virus, which targets hepatocytes (>200 hours). These results, while based on a very limited dataset, provide quantitative evidence for a link between cell tropism and generation time. Shorter average generation times lead to more rounds of replication per year, which could neatly explain higher per-year substitution rates.
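The step from generation time to rounds of replication per year is simple arithmetic, sketched here under the simplifying assumption that replication proceeds back to back all year. The two inputs are the bounds quoted above; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def generations_per_year(generation_time_hours):
    """Rounds of replication per year if generations follow one another without pause."""
    return HOURS_PER_YEAR / generation_time_hours


print(generations_per_year(40))   # 219.0, a lower bound for the epithelial-targeting viruses
print(generations_per_year(200))  # 43.8, an upper bound for hepatitis C virus
```

A generation time under 40 hours therefore allows at least five times as many replication rounds per year as one over 200 hours.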

Selection pressures do not predict substitution rates

Variation in strength and/or direction of selection has frequently been invoked as a determinant of viral substitution rates. While positive selection can certainly result in variation among very short-term substitution rates, purifying selection tends to dominate over longer timescales. However, variation is observed in the strength of purifying selection due to differences in host ranges. For instance, as previously mentioned, viruses vectored by arthropods have unique evolutionary constraints placed on them by their host diversity. While previous studies found that arboviruses are under stronger purifying selection than non-arboviruses, we found that the dN/dS estimates based on structural genes of arboviruses were not significantly lower than those for non-arboviruses (P = 0.19). The dN/dS estimates based on non-structural genes of arboviruses were only moderately lower than those for non-arboviruses (P = 0.04). Further, we found no significant correlation between the estimated dN/dS and substitution rates, suggesting that detectable differences in selection pressures do not explain the variation in substitution rates of mammalian RNA viruses. To date, there are no data supporting a link between cell tropism and sustained differences in selection pressures.

Mutation and substitution rates are uncorrelated

Compared to the slower evolution of DNA viruses, the evolution of RNA viruses is dominated by their high mutation rates. Weak negative correlations between genome lengths and viral substitution rates have been attributed to a relationship between mutation rate and substitution rate, as smaller genomes could in theory withstand higher mutation rates than larger genomes. However, while differences in spontaneous mutation rates appear to be significantly correlated to the long-term substitution rates of DNA viruses, this linear relationship disappears past a certain mutation rate threshold: around 10⁻⁶ mutations per site per infectious cycle, the lower end of the mutation rate range of RNA viruses. It is, therefore, not surprising that we found no significant correlation between substitution rates and the available, reliable mutation rate estimates. Additionally, a recent study of the retrovirus HIV-1 found that infection of different cell types did not lead to differences in mutation rate, providing some evidence that mutation rate is not correlated with cell tropism. Together, these data suggest that mutation rate variation among different cell types is not driving higher substitution rates in epithelial-infecting mammalian RNA viruses.

Generation time could explain substitution rate variation

Ruling out selection, mutation rates, and recombination frequencies as drivers of RNA virus substitution rates implies that the rate variation is largely the result of variation in replication dynamics. Enhanced replication frequencies (shorter generation times) have been used to explain a variety of the previously suggested links between virus ecology and substitution rate. For example, viruses in the acute phase of an infection generally replicate more frequently than those in a persistent infection, and viruses in a latent phase do not replicate at all. Further, as an alternative to differential selection pressures, the argument that transmission mode drives viral substitution rates assumes that viruses that can be transmitted more rapidly will have shorter generation times (e.g., horizontal transmission vs. vertical transmission).

DNA viruses have shorter generation times in faster dividing cells, but the associations between cell tropism and RNA virus generation time are less obvious, as RNA viruses do not depend on cellular replication machinery. However, there is evidence that for at least some RNA viruses, viral genome replication is highly dependent on host cell proliferation, with RNA synthesis occurring at much lower rates in poorly proliferating cells than in rapidly dividing cells. For example, it has been repeatedly demonstrated that hepatitis C virus genome replication is enhanced in proliferating cells, perhaps due to higher levels of available nucleotides, or because of higher levels of viral protein synthesis facilitated by nuclear translation initiation factors that only become available in the cytoplasm during cell division. Similar dependence on cell proliferation for viral replication efficiency has been demonstrated in a number of picornaviruses. Further, using the model proposed by Sanjuán (2012), we found that viruses that infect epithelial cells have generation times that may be as much as 40-fold shorter than a virus that infects non-epithelial cells. This offers a possible mechanistic basis for our finding that viruses that target the fastest-dividing cells in the body (intestinal and respiratory epithelial cells) have higher substitution rates than viruses that infect cells that turn over at very low rates, if at all (neurons).

We are the first to provide statistical evidence that cell tropism predicts rates of mammalian RNA virus evolution, likely through its influence on virus generation time. These results offer a new perspective on why it has been difficult to create effective vaccines for viruses that infect epithelial tissue, such as rotavirus and enterovirus 71. Further, as it has been shown that higher rates of viral evolution can result in increased genetic diversity and higher epidemiological fitness, the higher substitution rates of epithelial-infecting viruses predict increased evolvability and greater potential for emergence in novel host species.

Published rates

Long-term nucleotide substitution rates of mammalian RNA viruses were collected from the literature, with a focus on finding rates for the outer structural gene containing the major antigenic site(s) and non-structural (preferably the RdRp) genes. While the RdRp genes of the (-)ssRNA and dsRNA viruses are classified as structural, or virion-associated, genes, they are generally thought to be more conserved and under very different selection pressures than the structural genes that interact with the host immune system. We excluded retroviruses from analysis because they are known to have highly variable substitution rates due to time spent integrated into DNA genomes, where they evolve at the rate of their hosts' genome. Viruses that predominantly infect non-mammals, with mammals serving as incidental, dead-end hosts, were also excluded. Only rates estimated for individual viral species or strains were used, not those that aggregated multiple species into one analysis. Similarly, only rates from single gene analyses were included, not those based on full genomes or multiple gene alignments. In order to minimize any rate discrepancies that could result from variations among datasets (e.g., number of taxa, temporal range, portion of gene analyzed) and/or subtle methodological variations, only rates produced by Bayesian coalescent analyses of datasets composed of at least 30 taxa, isolated over a minimum range of 15 years and spanning at least 40% of the analyzed gene were included. Bayesian coalescent analyses provide estimates of viral evolution that are calculated over a longer range than simply the date range over which the taxa were isolated. This is because they determine the likely phylogenetic relationship among the isolates and infer substitution rates over the entire evolutionary history of the sampled taxa: over decades, hundreds, even thousands of years. These rates can therefore be considered “long-term” nucleotide substitution rates.
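The three numeric inclusion criteria can be expressed as a simple filter. The `RateDataset` record and its field names are hypothetical, as are the example entries; only the thresholds (30 taxa, 15 years, 40% of the gene) come from the text.

```python
from dataclasses import dataclass


@dataclass
class RateDataset:
    name: str
    n_taxa: int
    first_year: int       # earliest isolation year in the alignment
    last_year: int        # latest isolation year in the alignment
    gene_fraction: float  # fraction of the analyzed gene covered (0 to 1)


def meets_criteria(d):
    """True if a dataset has >= 30 taxa, spans >= 15 years, and covers >= 40% of the gene."""
    return (
        d.n_taxa >= 30
        and d.last_year - d.first_year >= 15
        and d.gene_fraction >= 0.40
    )


# Hypothetical candidate datasets.
candidates = [
    RateDataset("dataset 1", n_taxa=45, first_year=1985, last_year=2010, gene_fraction=0.85),
    RateDataset("dataset 2", n_taxa=22, first_year=1985, last_year=2010, gene_fraction=0.85),
    RateDataset("dataset 3", n_taxa=45, first_year=2000, last_year=2010, gene_fraction=0.85),
]
accepted = [d.name for d in candidates if meets_criteria(d)]
print(accepted)
```

The other criteria in this section (Bayesian coalescent method, single gene, single species) are categorical and would need to be checked against each publication by hand.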

Data regarding genomic architecture and ecology were obtained for all viruses with published substitution rates that met these criteria. We included multiple rates for a given virus when available, except when a single study examined multiple lineages and summarized the results in a single rate. Corresponding dN/dS estimates were collected when available.

Sequence data

These published substitution rates were supplemented with novel BEAST rate analyses based on the sequence data available in GenBank (accessed through Taxonomy Browser, http://www.ncbi.nlm.nih.gov/Taxonomy). Sequences for structural and non-structural genes with years of isolation available in GenBank or the literature were manually aligned using Se-Al v2.0a11. Sequences with GenBank or published information that indicated they were genetically manipulated or extensively passaged in the lab prior to sequencing were eliminated from further analysis. The final datasets also adhered to the conservative criteria described above for published datasets.

Substitution rate and selection analyses

As recombination events can lead to over-estimation of nucleotide substitution rates, each dataset was scanned for recombination using seven different algorithms (RDP, GENECONV, Bootscan, MaxChi, Chimaera, SiScan, and 3seq) implemented in RDP v3.44. Sequences implicated as recombinant by two or more algorithms were excluded from further analysis. These finalized alignments were deposited into Dryad (doi:10.5061/dryad.58ss8). Modeltest v3.7 was used to determine the best-fit model of nucleotide substitution for each dataset (by AIC).
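The two-or-more-algorithms rule amounts to a vote count per sequence. The sketch below assumes the detection results have already been tabulated as a mapping from sequence identifier to the set of algorithms that flagged it; that data structure and the identifiers are hypothetical, and the sketch does not read RDP output.

```python
ALGORITHMS = {"RDP", "GENECONV", "Bootscan", "MaxChi", "Chimaera", "SiScan", "3seq"}


def keep_nonrecombinant(flags, min_algorithms=2):
    """Return sequence ids flagged by fewer than `min_algorithms` detection methods."""
    return [seq for seq, algos in flags.items() if len(algos & ALGORITHMS) < min_algorithms]


# Hypothetical tabulated results for four sequences.
flags = {
    "seq1": set(),
    "seq2": {"RDP"},
    "seq3": {"RDP", "GENECONV"},
    "seq4": {"MaxChi", "Chimaera", "3seq"},
}
kept = keep_nonrecombinant(flags)
print(kept)
```

Requiring agreement between at least two methods tolerates a single false positive while still removing sequences with consistent recombination signal.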

Long-term nucleotide substitution rates were estimated using BEAST v1.5.4. Each dataset was run for at least 50 million generations and until all parameters had stabilized (effective sample size >200). Each dataset was run with two different clock models (strict and uncorrelated lognormal) and three different demographic models (constant, exponential, and Bayesian skyline). The best-fitting clock/demographic model combination for each dataset was determined using Bayes factors as implemented in Tracer v1.5. For each best set of priors, two independent runs were performed to ensure that the results were replicable, and a control analysis was run without the dataset to ensure that the priors were not controlling the outcome of the analysis.
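Choosing among the six clock/demographic combinations by Bayes factor reduces to comparing log marginal likelihoods, since the log Bayes factor between two models is the difference of those values. The numbers below are invented, and the sketch assumes the log marginal likelihoods have already been estimated elsewhere; it does not reproduce how Tracer computes them.

```python
from itertools import product

CLOCKS = ("strict", "uncorrelated lognormal")
DEMOGRAPHICS = ("constant", "exponential", "Bayesian skyline")


def best_model(log_ml):
    """Return the model with the highest log marginal likelihood and its
    log Bayes factor over the runner-up."""
    ranked = sorted(log_ml, key=log_ml.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, log_ml[best] - log_ml[runner_up]


# Hypothetical natural-log marginal likelihoods, one per clock/demographic pair.
values = [-5230.4, -5228.1, -5225.9, -5221.7, -5219.3, -5214.6]
log_ml = dict(zip(product(CLOCKS, DEMOGRAPHICS), values))

best, ln_bf = best_model(log_ml)
print(best, round(ln_bf, 1))
```

How large a log Bayes factor must be to count as decisive is a separate judgment that the source does not specify.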

The Single Likelihood Ancestor Counting (SLAC), codon-based maximum likelihood method available in the HYPHY package on the Datamonkey web server was used to evaluate the strength of selection pressure on these datasets.

Statistical analyses

In order to determine which factors most significantly predict substitution rates of mammalian RNA viruses, ANCOVA analyses were run using SPSS Statistics v21 (IBM) with log-transformed mean substitution rates as the dependent variable and seven overarching predictor variables (target cell, transmission route, whether the infection is acute or persistent, host range, genome length, genome sense, and whether or not the genome is segmented). For each variable, different base levels were tested to ensure that the chosen base level did not significantly influence the results. Collinearity among the variables was also assessed, with variance inflation factors (VIF) greater than 10 indicating redundancy among variables. Separate ANCOVA analyses were run on the structural and non-structural gene datasets. As there were multiple published rates for some viral species and strains, additional analyses were run for both the S and NS datasets with only one substitution rate per virus species. When there were multiple rates for a given virus species, we calculated and used an average rate.

One-tailed t-tests were subsequently run in R v2.14.1 to provide an additional measure of significant directional variation among the log-transformed mean rates of different levels for any categorical variable that was found to be a significant rate predictor (α = 0.01, adjusted by Bonferroni correction for multiple comparisons) in the ANCOVA analyses. Additional t-tests were also conducted using the control datasets with one rate per virus species.
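The test statistic behind such a comparison can be sketched in plain Python. Two assumptions are made here: the unequal-variance (Welch) form is used, which the source does not confirm, and the Bonferroni helper takes the significance level and the number of comparisons as inputs because the text leaves open whether 0.01 is the level before or after adjustment. Converting t to a one-tailed P value requires the t distribution, which a statistics package provides. All data values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic and degrees of freedom for the hypothesis mean(a) > mean(b)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical log10 substitution rates for two target-cell groups.
epithelial = [-2.4, -2.7, -2.9, -2.5, -3.0]
neuronal = [-3.6, -3.9, -3.4, -4.1]

t, df = welch_t(epithelial, neuronal)
print(round(t, 2), round(df, 1))
print(bonferroni_alpha(0.05, 5))
```

A positive t supports the directional hypothesis that the first group has the higher mean rate; the one-tailed P value is the upper-tail probability of t at the computed degrees of freedom.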

Additionally, though dN/dS and mutation rate estimates were not available for every virus used in this study, the available data for each variable were compared to corresponding log-transformed mean substitution rate estimates using Spearman rank correlation (for dN/dS) or Pearson correlation coefficient (for mutation rates). Structural and non-structural gene rate estimates were also compared using Pearson correlation coefficient. All correlation analyses were performed in SPSS Statistics v21.
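The two correlation measures differ only in whether they operate on the raw values or on their ranks, as the following plain-Python sketch shows. It is an illustration with made-up numbers, not the SPSS procedure, and it computes the coefficients only, without P values.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical dN/dS estimates and log10 substitution rates.
dn_ds = [0.05, 0.12, 0.08, 0.30, 0.12]
log_rates = [-3.1, -2.8, -3.4, -2.9, -3.0]
print(round(spearman(dn_ds, log_rates), 2))
print(round(pearson(dn_ds, log_rates), 2))
```

Spearman's coefficient is insensitive to monotonic transformations of either variable, which makes it a reasonable choice for ratio-like quantities such as dN/dS.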