Dataset: 11.1K articles from the COVID-19 Open Research Dataset (PMC Open Access subset)
All articles are made available under a Creative Commons or similar license. Specific licensing information for individual articles can be found in the PMC source and CORD-19 metadata
More datasets: Wikipedia | CORD-19

Logo Beuth University of Applied Sciences Berlin

Made by DATEXIS (Data Science and Text-based Information Systems) at Beuth University of Applied Sciences Berlin

Deep Learning Technology: Sebastian Arnold, Betty van Aken, Paul Grundmann, Felix A. Gers and Alexander Löser. Learning Contextualized Document Representations for Healthcare Answer Retrieval. The Web Conference 2020 (WWW'20)

Funded by The Federal Ministry for Economic Affairs and Energy; Grant: 01MD19013D, Smart-MD Project, Digital Technologies

Imprint / Contact

Highlight for Query ‹SARS-CoV-2 medication

The Footprint of Genome Architecture in the Largest Genome Expansion in RNA Viruses


Genome size is the net result of evolution driven by the environment, mutation, and the genetics of a given organism,. Particularly mutation rate is a powerful evolutionary factor. The relation between mutation rate and genome size is inversely proportional for a range of life forms from viroids to viruses to bacteria, and it is positive for eukaryotes, suggestive of a causative link–[6]. The genome size of RNA viruses is restricted to a range of ∼2 to 32 kb that corresponds to a very narrow band on the genome size scale (ranging from 1 kb to 10 Mb) across which genome size increase is correlated with mutation rate decrease. This restricted genome size range of RNA viruses was believed to be a consequence of the universal lack of proof-reading factors, resulting in a low fidelity of RNA replication,.

In the above relation, mutation rate and proof-reading serve as a proxy for replication fidelity and genetic complexity, respectively. Replication fidelity, genome size, and genetic complexity were postulated to lock each other, through a triangular relation, in a low state in primitive self-replicating molecules. This trapping, known as the “Eigen paradox”, was extended to include RNA viruses, providing a conceptual rationale for the small range of genome sizes in these viruses. Recent studies of the order Nidovirales, a large group of RNA viruses that includes those with the largest genomes known to date, provided strong support for the postulated triangular relation,. Unexpectedly, they also revealed how nidoviruses may have solved the Eigen paradox by acquiring a proof-reading enzyme. These advancements implied that the control of genome size may be more complex than previously thought, in RNA viruses in general, and particularly in nidoviruses.

The order Nidovirales is comprised of viruses with enveloped virions and non-segmented single-stranded linear RNA genomes of positive polarity (ssRNA+), whose replication is mediated by a cognate RNA-dependent RNA polymerase (RdRp),. The order includes four families - the Arteriviridae and Coronaviridae (including vertebrate, mostly mammalian viruses), and the Roniviridae and Mesoniviridae (invertebrate viruses). The unusually broad 12.7 to 31.7 kb genome size range of this monophyletic group of viruses includes the largest known RNA genomes, which are employed by viruses from the families Roniviridae (∼26 kb) and Coronaviridae (from 26.3 to 31.7 kb), that have collectively been coined “large-sized nidoviruses”. Viruses from the Arteriviridae (with 12.7 to 15.7 kb genomes) and the recently established Mesoniviridae (20.2 kb), are considered small and intermediate-sized nidoviruses, respectively. Nidoviruses share a conserved polycistronic genomic architecture (known also as “organization”) in which the open reading frames (ORFs) are flanked by two untranslated regions (UTRs),–[26]. The two 5′-proximal ORFs 1a and 1b overlap by up to a few dozen nucleotides and are translated directly from the genomic RNA to produce polyproteins 1a (pp1a) and pp1ab, with the synthesis of the latter involving a −1 ribosomal frameshift (RFS) event–[29]. The pp1a and pp1ab are autoproteolytically processed into nonstructural proteins (nsp), named nsp1 to nsp12 in arteriviruses and nsp1 to nsp16 in coronaviruses (reviewed in). Most of them are components of the membrane-bound replication-transcription complex (RTC)–[33] that mediates genome replication and the synthesis of subgenomic RNAs (a process known also as “transcription”),. ORF1a encodes proteases for the processing of pp1a and pp1ab (reviewed in), trans-membrane domains/proteins (TM1, TM2, and TM3) anchoring the RTC–[38] and several poorly characterized proteins. ORF1b encodes the core enzymes of the RTC (reviewed in, see also below). Other ORFs, whose number varies considerably among nidoviruses are located downstream of ORF1b (hereafter collectively referred to as 3′ORFs). They are expressed from 3′-coterminal subgenomic mRNAs, and encode virion and, optionally, so-called “accessory proteins” (reviewed in–[43]). The latter, as well as several domains encoded in ORF1a and ORF1b, were implicated in the control of virus-host interactions–[48].

In addition to comparable genome architectures, nidoviruses share an array (synteny) of 6 replicative protein domains. Three of these are most conserved enzymes of nidoviruses: an ORF1a-encoded protease with chymotrypsin-like fold (3C-like protease, 3CLpro)–[51], an ORF1b-encoded RdRp,, and a superfamily 1 helicase (HEL1)–[57] (reviewed in). For other proteins, relationships have been established only between some nidovirus lineages, mostly due to poor sequence similarity. Two tightly correlated properties separate large- and intermediate-sized nidoviruses from all other ssRNA+ viruses, classified in several dozens of families and hundreds of species: a genome size exceeding 20 kb and the presence of a gene encoding a RNA 3′-to-5′ exoribonuclease (ExoN), which resides in nsp14 in the case of coronaviruses. The latter enzyme is distantly related to a DNA proofreading enzyme, and it is genetically segregated and expressed together with RdRp and HEL1,. Based on these properties ExoN was implicated in improving the fidelity of replication in large- and intermediate-sized nidoviruses. This hypothesis is strongly supported by the excessive accumulation of mutations in ExoN-defective mutants of two coronaviruses, mouse hepatitis virus and severe acute respiratory syndrome coronavirus (SARS-CoV), the identification of an RNA 3′-end mismatch excision activity in the SARS-CoV nsp10/nsp14 complex, and the high efficacy of a live coronavirus vaccine displaying impaired replication fidelity due to nsp14-knockout (for review see,). Although the molecular mechanisms underlying ExoN's function in fidelity control remain to be elucidated, its acquisition by nidoviruses likely enabled genome expansions beyond the limit observed for other non-segmented ssRNA+ viruses,. Since ExoN-encoding nidoviruses have evolved genomes that may differ by up to ∼12 kb (from 20.2 kb of Nam Dinh virus, NDiV, to 31.7 kb of Beluga whale coronavirus SW1, BWCoV-SW1), there must be other factors in addition to the proof-reading enzyme that control genome size.

In this study we sought to characterize the dynamics of nidovirus genome expansion (NGE). The NGE is defined by the entire range of the genome sizes of extant nidoviruses, from 12.7 to 31.7 kb, and thus concerns both pre- and post-ExoN acquisition events. Our analysis revealed that ExoN acquisition was part of a larger process with non-linear dynamics, during which distinct coding regions of the nidovirus genome were expanded to accommodate both an extremely large number of mutations and virus adaptation to different host species. Our results indicate that genome architecture and the associated region-specific division of labor leave a footprint on the expansion dynamics of RNA virus genomes through controlling replication fidelity and/or other mechanisms. Eventually, these constraints may determine the observed limit on RNA virus genome size.

The scales of per-residue evolutionary change in nidoviruses and the Tree of Life are comparable

Nidoviruses have evolved genomes in a size range that accounts for the upper ∼60% of the entire RNA virus genome size scale and include the largest RNA genomes. What did it take to produce this unprecedented innovation in the RNA virus world? This question could be addressed in two evolutionary dimensions: time and amount of substitutions. Due to both the lack of fossil records and high viral mutation rates, the time scale of distant relations of RNA viruses remains technically difficult to study. Hence, we sought to estimate the amount of accumulated replacements in conserved nidovirus proteins and to place it into a biological perspective by comparing it with that accumulated by proteins of cellular species in the Tree of Life (ToL).

To this end, we used a rooted phylogeny for a set of 28 nidovirus representatives (Table S1), which was based on a multiple alignment of nidovirus-wide conserved protein regions in the 3CLpro, the RdRp and the HEL1, as described previously. The 28 representatives covered the acknowledged species diversity of nidoviruses with completely sequenced genomes,,, and included two additional viruses. For the arterivirus species Porcine reproductive and respiratory syndrome virus we selected two viruses, representing the European and North American genotypes, respectively, because we observed an unusually high divergence of these lineages; for the ronivirus species Gill-associated virus we selected two viruses representing the genotypes gill-associated virus and yellow head virus, respectively, because these viruses showed a genetic distance comparable to that of some coronavirus species (CL & AEG, in preparation). The nidovirus-wide phylogenetic analysis consistently identified the five major lineages: subfamilies Coronavirinae and Torovirinae, and families Arteriviridae, Roniviridae and Mesoniviridae. The root was placed at the branch leading to arteriviruses (Fig. 1A) according to outgroup analyses. Accordingly, arteriviruses with genome sizes of 12.7 to 15.7 kb are separated in the tree from other nidoviruses with larger genomes (20.2–31.7 kb).

We compared the evolutionary space explored by nidoviruses, measured in number of substitutions per site in conserved proteins, with that of a single-copy protein dataset representing the ToL (Fig. 1B). Using a common normalized scale of, comparison of the viral and cellular trees and associated pairwise distance distributions revealed that the distances between cellular proteins (0.05–0.45 range) cover less than half the scale of those separating nidovirus proteins. (Fig. S1). Unlike cellular species, nidoviruses are grouped in a few compact clusters, which are very distantly related. The distances between nidovirus proteins are unevenly distributed, reflecting the current status of virus sampling: intragroup distances between nidoviruses forming major lineages are in the 0.0–0.25 range, while intergroup distances between nidoviruses that belong to different lineages are in the 0.55–1.0 range. The distances separating the intermediate-sized mesonivirus from other nidoviruses tend to be most equidistant, accounting for ∼15% of all distances in the 0.55–0.85 range. Consequently, nidovirus evolution involved the accumulation of mutations in the most conserved proteins at a scale comparable to that of the ToL. This observation is instructive in two ways. First, it can be contrasted with the conservation of nidovirus genome architecture, which emerges in this context as truly exceptional by conventional evolutionary considerations. Second, it makes it plausible that other, less conserved proteins might have diverged beyond the level that can be recognized by sequence alignment, thus establishing limits of the applicability of the alignment-based analysis of nidoviruses. We used both these insights to advance our study further (see below).

The scale of nidovirus genome size change is proportional to the amount of substitutions in the most conserved proteins

To quantify the relation of genome size change and the accumulation of substitutions, we plotted pairwise evolutionary distances (PED) separating the most conserved replicative proteins (Y axis) versus genome size differences (X axis) for all pairs of nidoviruses in our dataset (Fig. 2). It should be noted that the observed genome size differences may serve only as a low estimate for the actual genome size change, since it does not account for (expansion or shrinkage) events that happened in parallel between two viruses since their divergence. The obtained 378 values are distributed highly unevenly, occupying the upper left triangle of the plot. Using phylogenetic considerations (Figs. 1A and S1), four clusters could be recognized in the plot. Genetic variation within the four major virus groups with more than one species (arteri-, corona-, roni-, and toroviruses) is confined to a compact cluster I in the left bottom corner (X range: 0.033–4.521 kb, Y range: 0.051–1.401). Values quantifying genetic divergence between major lineages are partitioned in three clusters taking into account genome sizes: large-sized vs. large-sized nidoviruses (cluster II, X: 0.002–5.433 kb, Y: 3.197–4.292), intermediate-sized vs. other lineages (cluster III, X: 4.475–11.494 kb, Y: 2.896–4.553), and small-sized vs. large-sized nidoviruses (cluster IV, X: 10.536–18.978 kb, Y: 4.159–5.088). Points in clusters I, III and IV are indicative of a positive proportional relation between genome size change and the accumulation of replacements. The off-diagonal location of cluster II can be reconciled with this interpretation under the (reasonable) assumption that the three lineages of large-sized nidoviruses expanded their genomes independently and considerably since diverging from their most recent common ancestor (MRCA). This positive relation is also most strongly supported by the lack of points in the bottom-right corner of the plot (large difference in genome size; small genetic divergence). Overall, this analysis indicates that a considerable change in genome size in nidoviruses could have been accomplished only when accompanied by a large number of substitutions in the most conserved proteins.

Only a fraction of genome size changes may be attributed to known domain gain or loss

Next, we asked whether genome size change could be linked to domain gain and loss. We analyzed the phylogenetic distribution of protein domains that were found to be conserved in one or more of the five major nidovirus lineages. Ancestral state parsimonious reconstruction was performed for the following proteins: ORF1b-encoded ExoN, N7-methyltransferase (NMT), nidovirus-specific endoribonuclease (NendoU),, 2′-O-methyltransferase (OMT),, ronivirus-specific domain (RsD) (this study; see legend to Fig. S2), and ORF1a-encoded ADP-ribose-1″-phosphatase (ADRP)–[74]. This analysis revealed that domain gain and loss have accompanied NGE (Fig. S2 and Table S2). Particularly, the genetically segregated ExoN, OMT and NMT domains (Fig. 3) were acquired in a yet-to-be determined order during the critical transition from small-sized to intermediate-sized nidovirus genomes. However, the combined size of these domains accounts for only a fraction (49.7%) of the size difference (4,475 nt) between the genomes of NDiV (20,192 nt) and Simian hemorrhagic fever virus (SHFV), which has the largest known arterivirus genome (15,717 nt). The fraction that could be attributed to these and the three other protein domains is even smaller in other pairs of viruses representing different major nidovirus lineages (CL & AEG). This analysis is also complicated by the uncertainty about the genome sizes of nidovirus ancestors that acquired or lost domains.

The nidovirus genome can be partitioned according to functional conservations in genome architecture

In order to gain further insight in NGE dynamics, we analyzed large genome areas in which homology signals were not recoverable in the currently available dataset because of both the extreme divergence of distant nidoviruses and the relatively poor virus sampling (Fig. 1). To address this challenge, we developed an approach that establishes and exploits relationships between nidovirus genomes in an alignment-free manner on grounds other than sequence homology. To this end, we partitioned the nidovirus genome according to functional conservations in the genome architecture, using results for few characterized nidoviruses and bioinformatics-based analysis for most other viruses (reviewed in). With this approach, the genomes of all nidoviruses can be consistently partitioned into five regions in the 5′ to 3′ order: 5′-UTR, ORF1a, ORF1b, 3′ORFs, and 3′-UTR (Fig. 3, Table S3). The 5′-UTR and 3′-UTR flank the coding regions and account for <5% of the nidovirus genome size. The borders of the three ORF regions that overlap by few nucleotides in some or all nidoviruses were defined as follows: ORF1a: from the ORF1a initiation codon to the RFS shifty codons; ORF1b: from the RFS signal to the ORF1b termination codon; and 3′ORFs: from the ORF1b termination codon to the termination codon of the ORF immediately upstream of the 3′UTR. As we detail in the Supplementary text (Text S1), the three ORF regions are of similar size but differ in expression mechanism (Fig. 3 top) and principal function. Thus, ORF1a dominates the expression regulation of the entire genome, and ORF1b encodes the principal enzymes for RNA synthesis, while the 3′ORFs control genome dissemination. This region-specific association may be described as a division of labor.

The nidovirus genome expanded unevenly across the three major coding regions

We then asked how the different regions contributed to the genome expansion. We initially noted that the intermediate position of the mesonivirus between the two other nidovirus groups is observed only in genome-wide but not in region-specific size comparisons (Fig. 4). In the latter, the mesonivirus clusters with either small-sized (ORF1a and 3′ORFs) or large-sized (ORF1b) nidoviruses. This non-uniform position of the mesonivirus relative to other nidoviruses is indicative of a non-linear relationship between the size change of the complete genome and its individual regions during NGE. Accordingly, when fitting weighted linear regressions for the three regions separately to the six datasets formed by nidoviruses with small and large genomes, support for a linear relationship was found only for the 3′ORF dataset of large nidoviruses; for all other regions a linear relationship was not statistically significant (Fig. S3). These results prompted us to evaluate linear as well as non-linear regression models applied to a dataset including all known nidovirus species (n = 28) (Fig. 5). Two non-linear models were employed: third order monotone splines and a double-logistic regression. In the monotone splines, two parameters – the number and position of knots – determine the regression fit. We identified values for both parameters that result in the best fit (Fig. S4).

Using weighted r2 values, we observed that the splines model captures 92.9–96.1% of the data variation for the three ORF regions. This was a 5–22% gain in the fit compared to the linear model (75.9–90.8%) (Fig. 5). This gain was considered statistically significant (α = 0.05) in two F-tests, a specially designed one and a standard one, as well as in the LV-test for every ORF region (p = 0.019 or better) and, particularly, their combination (p = 9.1e-6 or better) (Table 1). The splines model also significantly outperforms the double-logistic model (p = 0.0014) (Table 1). These results established that the nidovirus genome expanded in a non-linear and region-specific fashion.

The three major coding regions expanded consecutively in a lineage-dependent manner

Like each region, also the entire genome must have expanded non-linearly during NGE. Revealing its dynamic was our next goal. To this end, we analyzed the contribution of each of the five genomic regions to the overall genome size increase under the three models (Fig. 6 and Fig. S5). The top-ranking splines model (Table 1) predicts a cyclic pattern of overlapping wavelike size increases for the three coding regions (the 5′ and 3′UTR account only for a negligibly minor increase that is limited to small nidoviruses). Each of the three coding regions was found to have increased at different stages during NGE (Fig. 6). A cycle involves expanding predominantly and consecutively the ORF1b, ORF1a, and 3′ORFs region. One complete cycle flanked by two partial cycles are predicted to have occurred during the NGE from small-sized to large-sized nidoviruses. The complete cycle encompasses almost the entire genome size range of nidoviruses, starting from 12.7 kb and ending at 31.7 kb. The dominance of an ORF region in the increase of genome size was characterized by two parameters: a genome size range (X axis in Fig. 6) in which the contribution of a region accounts for a >50% share of the total increase, and by the maximal share it attains in the NGE (Y axis in Fig. 6). For three major regions these numbers are: ORF1b, dominance in the 15.8–19.3 kb range with 72.9% maximal contribution at genome size 17.5 kb; ORF1a, 19.7–26.1 kb and 81.3% at 22.7 kb; 3′ORFs, 26.1–31.7 kb and 89.6% at 29.5 kb (Fig. 6).

Furthermore, the shapes of the three waves differ. The first one (ORF1b) is most symmetrical and it starts and ends at almost zero contribution to the genome size change. This indicates that the ORF1b expansion is exceptionally constrained, which is in line with the extremely narrow size ranges of ORF1b in arteri- and coronaviruses (with mean±s.d. of 4362±86 and 8073±50 nt, respectively; Fig. 4 and Fig. 6). The second wave (ORF1a) is tailed at the upper end and is connected to the ORF1a wave from the prior cycle. This ORF seems to have a relatively high baseline contribution (∼20%) to the genome size change up to the range of coronaviruses. The third wave (3′ORFs) is most asymmetrical (incomplete), as it only slightly decreases from its peak toward the largest nidovirus genome size to which this region remains the dominant contributor (∼77%).

One partial cycle, preceding the complete one, is observed inside the genome size range of arteriviruses and involves the consecutive expansions of ORF1a and 3′ORFs, respectively. Also the main, but still very limited contributions of 5′- and 3′-UTRs (<6%) are observed here. The start of another incomplete cycle, involving the expansion of ORF1b and overlapping with the complete cycle, is observed within the upper end of coronavirus genome sizes.

It must be stressed that nidoviruses occupy different positions on the trajectory that depicts the entire NGE dynamics. For the viruses with large genomes those with smaller genomes represent stages that they have passed during NGE; in this respect the latter may resemble ancestral viruses which have gone extinct. For the smaller genomes those with the larger ones represent stages that they have not reached during NGE. Mesonivirus and roniviruses seem to have been “frozen” after the first (ORF1b) and second (ORF1a) wave, respectively. The third wave (3′ORFs) was due to the genome expansion of coronaviruses and, to a lesser extent, toroviruses (compare the genome sizes of these viruses and the third wave position in Fig. 6). These observations reveal that the constraints on genome size due to genome architecture may be modulated in a lineage-dependent manner.

Concluding remarks and implications

It is broadly acknowledged that high mutation rates and large population sizes allow RNA viruses to explore an enormous evolutionary space and to adapt to their host,. Yet the low fidelity of replication also confines their evolution within a narrow genome size range that must affect their adaptation potential. Above, we present evidence for a new constraint on genome size in RNA viruses. In our analysis of nidoviruses, the conserved genome architecture and associated division of labor emerged as potentially powerful forces that are involved in selecting both new genes and positions of gene insertion during genome expansion. In this respect, the established wavelike dynamics of regional size increase could be seen as the footprint of genome architecture on genome size evolution. Ultimately, these constraints may determine the upper limit of the RNA virus genome size. The reported data point to an important evolutionary asymmetry during genome expansion, which concerns the relation between proteins controlling genome replication, expression, and dissemination, and may certainly be relevant beyond the viruses analyzed here.

Importantly, the major diversification of nidoviruses by genome expansion must have started at some early point after the acquisition of ExoN. From that point on, nidoviruses expanded their genomes in parallel in an increasing number of lineages, each of which may have acquired different domains in the same region. Extant representatives of the major lineages have very different genome sizes and essentially offer snapshots of different NGE stages. It seems that the host range may affect the outcome of this process, since the two families that infect invertebrates are on the lower end of the genome size range in the ExoN-encoding nidoviruses. For yet-to-be described nidoviruses, the genome expansion model can predict the sizes of the three coding regions by knowing the genome size only. The mechanistic basis of this fundamental relation can be probed by comparative structure-function analyses, which may also advance the development of nidovirus-based vectors and rational measures for virus control. Thus, the wavelike dynamics model links virus discovery to basic research and its various applications.


A dataset of nidoviruses representing species diversity from the three established and a newly proposed virus family was used (Table S1). A multiple alignment of nidovirus-wide conserved protein domains (28 species, 3 protein families, 604 aa alignment positions, 2.95% gap content) as described previously formed the basis of all phylogenetic analyses. To put the scale of the nidovirus evolution into an independent perspective, we compared it with a cellular dataset previously used to reconstruct the ToL, for which a concatenated alignment of single-copy proteins was used (30 species, 56 protein families, 3336 aa alignment positions, 2.8% gap content). The proteins used in the nidoviral and cellular datasets are the most conserved in their group and, as such, could be considered roughly equivalent and suitable for the purpose of this comparative analysis.

Phylogenetic analyses

Rooted phylogenetic reconstructions by Bayesian posterior probability trees utilizing BEAST under the WAG amino acid substitution matrix and relaxed molecular clock (lognormal distribution) were performed as described previously. Evolutionary pairwise distances were calculated from the tree branches. A maximum parsimony reconstruction of the ancestral nidovirus protein domain states at internal nodes of the nidovirus tree was conducted using PAML4.The quality of ancestral reconstructions was assessed by accuracy values provided by PAML4. The nidovirus genomic sequences are non-independent due to their phylogenetic relatedness. When calculating the contribution of individual sequences to the total observed genetic diversity the uneven sampling of different phyletic lineages must be accounted for. To correct for the uneven sampling we assigned relative weights to the 28 nidovirus species by using position-based sequence weights that were calculated on the alignment submitted for phylogeny reconstruction. The weights were normalized to sum up to one and were used in regression analyses (see below). The sequence weights varied ∼7 fold from 0.017 to 0.116. NDiV, which represents mesoniviruses, showed the largest weight of 0.116 that was distantly followed by those of the bafinivirus White bream virus (WBV; 0.075) and roniviruses (0.06 each); coronaviruses, making up the best-sampled clade, were assigned the lowest weights (0.017 to 0.028 each).

Statistical analysis of genome size change in nidoviruses

The genome of each nidovirus was consistently partitioned into five genomic regions according to external knowledge (see Results). To model the contribution of each genomic region to the total genome size change, we conducted weighted regression analyses (size of a genomic region on size of the genome) using three models – a linear and two non-linear ones. Position-based sequence weights were used and a confidence level of α = 0.05 was applied in all analyses. The regressions of the different genomic regions were not fitted separately but were joined to produce a genome-wide analysis. The combined contribution of all genomic regions to the genome size change must obviously sum up to 100%. To satisfy this common constraint, in each analysis, regression functions were fitted simultaneously to sizes of the genomic regions by minimizing the residual sum of squares, thereby constraining the sum of all slopes to be not larger than one. The linear model assumes a constant contribution of each genomic region during evolution which was modeled via linear regions.

In the first non-linear model we applied third order monotone splines with equidistant knots. We chose splines because of their flexibility and generality (we do not rely on a specific regression function). The monotonicity constraint was enforced to avoid overfitting which was observed otherwise, and third order functions were chosen to obtain smooth, second-order derivatives. We explored the dependence of the performance of the splines model on variations in two critical parameters, the number of knots and the start position of the first knot. These two parameters define a knot configuration and determine a partitioning of the data into bins. In the first test we evaluated five different configurations generating from three to seven knots. Configurations using eight or more knots resulted in some bins being empty and were therefore not considered. For each number of knots the position of the first knot and the knot distance were determined as resulting in that configuration for which the data points are distributed most uniformly among the resulting bins. The exception was the 3-knot configuration, in which the position of the second knot was selected as the intermediate position in the observed genome size range (22.2 kb). Only configurations with equidistant knots were considered. All probed splines models were evaluated by goodness-of-fit values (weighted version of the coefficient of determination r2). In the second test we evaluated the model dependence on the position of the first knot by considering all positions that do not result in empty bins for the optimal number of knots determined using the approach described above.

As another non-linear model we used a 7-parameter double-logistic regression function that mimics the splines model and more readily allows for biological interpretations. Logistic functions discriminate between two principal states – stationary and growth phases; a double-logistic curve comprises not more than three steady and two growth phases. The “length” of the different phases (in the dimension of the independent variable; e.g. genome size), the steady state values (in the dimension of the dependent variable, e.g. ORF size), and the “strength” of the growth (e.g. the maximum slope of the curve between two steady states) are controlled by the parameters of the regression function. Once estimated, the parameter values can be used to infer genome size intervals for which a particular ORF region is in a steady state as well as critical genome and ORF sizes at the transition between two steady states. Since double-logistic regressions did not converge for the 5′- and 3′-UTRs, linear functions were used for these two genome regions instead.

Linear (null hypothesis) and splines (alternative hypothesis) regression models were compared using standard weighted F-statistics and a specially designed permutation test (see below). To exclude overfitting as the cause of support of the more complex models, we utilized a more sophisticated framework (LV-Test) for the comparison of non-nested regression models (linear vs. double-logistic and splines vs. double-logistic) as detailed in. The test was further modified to include weighted residuals according to virus sequence weights that account for sequence dependence.

Since our null hypothesis (linear model) is at the boundaries of the parameter space, we developed a permutation test to further compare the linear and splines models. To this end, genome region sizes were transformed to proportions (region size divided by genome size), randomly permuted relative to genome sizes, and transformed back to absolute values. These transformations are compatible with the constraints of the null hypothesis and the requirement that region sizes have to sum to genome sizes. Weights were not permuted. The linear and splines models were fit to the permuted datasets and F-statistics were calculated as for the original dataset. The p-value of the test is the fraction of F-statistics of permuted datasets that are larger than the F of the original dataset. It was calculated using 1,000,000 permutations that were randomly sampled out of ∼1029 possible permutations.

Finally, we analyzed the contribution of each genome region to the total change in genome size under the three regression models. The contribution of each region according to a model was calculated as the ratio of change in region size to change in genome size (first derivative of the regression function) along the nidovirus genome size scale. These region-specific contributions were combined in a single plot for visualization purposes.

To conduct all statistical analyses and to visualize the results we used the R package.

Accession numbers

Accession numbers of virus genomes utilized in the study are shown in Table S1.