Dataset: 11.1K articles from the COVID-19 Open Research Dataset (PMC Open Access subset)
All articles are made available under a Creative Commons or similar license. Specific licensing information for individual articles can be found in the PMC source and CORD-19 metadata
.
More datasets: Wikipedia | CORD-19

Logo Beuth University of Applied Sciences Berlin

Made by DATEXIS (Data Science and Text-based Information Systems) at Beuth University of Applied Sciences Berlin

Deep Learning Technology: Sebastian Arnold, Betty van Aken, Paul Grundmann, Felix A. Gers and Alexander Löser. Learning Contextualized Document Representations for Healthcare Answer Retrieval. The Web Conference 2020 (WWW'20)

Funded by The Federal Ministry for Economic Affairs and Energy; Grant: 01MD19013D, Smart-MD Project, Digital Technologies

Imprint / Contact

Highlight for Query ‹Bordetella parapertussis whooping cough symptoms

Accounting for Programmed Ribosomal Frameshifting in the Computation of Codon Usage Bias Indices

Slippery site data

We obtained a dataset of the Saccharomyces cerevisiae S288C strain genes with predicted -1 PRF events from the Programmed Ribosomal Frameshifting database (PRFdb) of the University of Maryland (Belew et al. 2008). These putatitve -1 PRF signals were identified using first a filter for slippery site identification, and subsequent detection of mRNA pseudoknots by means of the Nupack algorithm (Dirks and Pierce 2004). For a slippery site to be found, a confluence of signals in mRNA is required. First, a site must match the pattern (X XXY YYZ), where X is some base, Y is either A or U, and Z is A, C or U. Second, a spacer sequence of a minimum of 8 nucleotides in length needs to exist between the slippery site and the following pseudoknot. Third, a pseudoknot predicted by minimum free energy values of the mRNA secondary structure is expected downstream the slippery site. The dataset also included a full list of gene annotations, accession numbers, the relative position of the slippery sites within the gene in which they were found, as well as the gene mRNA. For each slippery site, we computed were premature stop codons first appear downstream. With this, we computed the frameshifted sequences.

Protein expression and mRNA level data

We obtained protein expression as well as mRNA level data from the Supporting Information of (von der Haar 2008). Von der Haar has produced an extensive curated data set that merges data from various sources. For protein expression, the data include the seminal studies of Ghaemmaghami et al. (Ghaemmaghami et al. 2003), Newman et al. (Newman et al. 2006), and Lu et al. (Lu et al. 2007). Additional data stems from 46 further studies, specified in (von der Haar 2008). To ensure comparability, only studies were included in which yeast was grown in rich medium. Transcriptome data for mRNA levels were obtained from (Holstege et al. 1998; Arava et al. 2003; Holland 2002; Jelinsky and Samson 1999; Roth et al. 1998). These data include about 6000 genes. The construction of the curated data set is described in detail in (von der Haar 2008).

Indices for codon usage bias

Many indices have been proposed to assess the degree of codon usage bias present in a gien mRNA sequence. Here, we focus on the codon adaptation index (CAI) (Sharp and Li 1987; Carbone et al. 2003). In the following, we give the implemented definition of the CAI. Let L be the length of the mRNA sequence in codons, c is the index of the synonymous codons decoding the same amino acid a, and oac is the observed count of the synonymous codon c of amino acid a in the sequence. Ca is the index set of all codons within an amino acid a. A is the index set of all amino acids a. We defineωac ≐ oac∑c∈Caoac,(1)as the relative adaptiveness of codon ac within amino acid a of a given mRNA sequence.

Then, the codon adaptation index is defined asCAImRNA ≐ 1L∑a∈A∑c∈Caoacln(ωacref),(2)where oac is measured on the observed on the mRNA of interest, and ωacref is taken from a reference sequence of highly expressed genes. For the reference sequence, we concatenated all of the mRNA of ribosomal genes (Friberg et al. 2004). The ribosomal genes were taken from the ribosomal gene database http://ribosome.med.miyazaki-u.ac.jp/ (Nakao et al. 2004).

Data Availability

The code used for analysis of the data are available on Figshare (doi: 10.6084/m9.figshare.5955886; url: https://figshare.com/s/ff89f9e614b129469fd4). Supplemental material available at Figshare: https://doi.org/10.6084/m9.figshare.6970277.

The cost of -1 PRF maintenance

Effective -1 PRF events classically comprise three constitutive elements. The first element is a heptameric sequence –the slippery site– (Jacks and Varmus 1985; Plant et al. 2004). This site has the structure (X XXY YYZ), where X can be any base (A,C,G,U), Y is either A or U, and lastly, Z is either A, C or U (Plant et al. 2004; Plant and Dinman 2006; Ketteler 2012). The second element is a spacer sequence, separating the slippery site from the third element (Dinman, 2012b). The spacer sequence typically comprises around 1-12 nucleotides (Ketteler 2012; Dinman, 2012a). The third element is a pseudoknot, an mRNA secondary structure that is thought to generate a mechanical tension on the spacer sequence when the slippery site is being translated (Ketteler 2012; Dinman, 2012b, a). To release the tension, the ribosome, while stalling at the slippery site, is pushed one nucleotide back (Dinman, 2012b; Caliskan et al. 2015; Ketteler 2012). The frequency with which the ribosome is redirected to the -1 frame is called -1 PRF efficiency (Advani et al. 2013; Advani and Dinman 2016).

These criteria were used to generate an extensive database of -1 PRF signals, PRFdb, of the yeast genome by algorithmic search (see Materials and Methods). If PRFdb comprises a sufficiently large set of functional slippery sites, the effects of putative -1 PRF induced costs to translation efficiency should be detectable.

To this end, we devised three hypotheses. We derive these hypotheses from the assumption that the maintenance of -1 PRF signals carries an intrinsic cost to the organism. Furthermore, we assume that mutations that alter CUB are accumulated more slowly than mutations that alter -1 PRF efficiency, p. This is because the -1 PRF regulation mechanism is very likely to be affected by only a few mutations, whereas several mutations would be needed to significantly modify CUB. Hence, most short-term changes in protein expression demand in genes with slippery sites are more likely absorbed by -1 PRF.

We first introduce a framework to formulate these hypotheses (see Figure 1). Let L be the length (in nucleotides) of an mRNA sequence x. A single slippery site is located at site l of x, downstream from the start. Ribosomes are redirected to the -1 frame with probability p, the -1 PRF efficiency. They continue with the mRNA’s translation until encountering a premature termination codon (PTC) at a distance λ post slippery site.

In a first hypothesis, we formed expectations about how the number of slippery sites per gene should vary with the ratio of protein expression to mRNA levels per cell. This ratio has been used as a measure for translation efficiency in past studies (Tuller et al. 2010). To formulate the first hypothesis, we conduct a thought experiment. Let us assume that an organism’s environmental conditions favor lower expression of a protein from a -1 PRF-gene. Then, increasing the average p in that gene will adapt the organism to the new conditions. Conversely, if such environmental changes demanded higher protein expression, genes containing -1 PRF signals may adapt by reducing p. If protein demands exceed the production capacity of a gene with very small p, the -1 PRF mechanism may offer only costs, but no benefits from protein regulation. Then, mutations that remove the heptameric slippery site signature will be selected for. Taken together, these effects imply an asymmetry in the probability of slippery site loss depending on the direction of protein demand changes. Losses of slippery sites are likelier when protein demands increase than when they decrease. -1 PRF signals should thus become rarer with increasing protein expression levels.

The second hypothesis states that the costs of -1 PRF maintenance are reduced as slippery sites approach the 5′ end or start of the mRNA. Every time ribosomes slip through -1 PRF, the energy of translating a stretch of length l+λ of mRNA is wasted for protein production. Since λ is roughly constant across within-gene sites (see Fig. S1), the cost to translational efficiency incurred from -1 PRF should depend on l. Therefore, diminishing l will minimize cost. Thus, the distribution of slippery site positions relative to the gene length should be skewed toward mRNA starts, that is, small l values.

This effect is expected to be weaker than the effect in the first hypothesis. In the first hypothesis, if protein demands exceed yields generated by a gene and p is small, a fitness cost is incurred from i) the reduction of protein expression, ii) the cost of NMD-mediated mRNA degradation and iii) the cost of unnecessary translation of frameshifted mRNA sequences. Displacing the slippery site would only reduce the latter cost. Thus, we expect the signal for this hypothesis to be weaker than for the first hypothesis.

The third hypothesis states that a skew in the distribution of the location of the slippery site within a gene (relative to that gene’s length) should also become more pronounced with higher expression levels. This is because for fixed p, increasing expression levels should mirror an increase in -1 PRF induced translation costs. Thus, the benefits of slippery sites closer to translation-initiation positions should increase with expression levels. As in the second hypothesis, selection for such an effect is likely to be very weak compared to effects in the first hypothesis.

Figure 2 shows the results of testing all three hypotheses. Figure 2A shows that -1 PRF signals in genes become rarer with gene expression, consistent with our first hypothesis. Figure 2B shows that slippery sites positions are more prevalent in the first half of a gene than in the second, consistent with our second hypothesis. The frequency of slippery sites diminishes toward both extremes of mRNA sequences, and more markedly so toward the stop codon of the mRNA. This skew persists when including all genes from the PRFdb (see Figure S2). Figure 2C shows that the average slippery site position is displaced toward the 5′ end of the mRNA as protein expression levels increase from 102 molecules per cell to 105 molecules per cell. Above 105 molecules per cell, the uncertainty around the averages becomes large due to low sample sizes, and unambiguous deductions become impossible. The decreasing trend in the average l/L with protein expression is confirmed by a linear regression. Lastly, we retested all of the hypotheses with an alternative integrated data set from PaxDB, and obtained the same results (see Figure S3).

Thus, with increasing demand for protein production, the data suggest that production gains will primarily be attained by disposing of the -1 PRF mechanism present in a gene, rather than minimizing the cost of faulty translation. We speculate that, most likely, this occurs by altering the slippery site sequence. The test of the other two hypotheses involving slippery site displacement (or loss of large-l slippery sites) provide further evidence for an intrinsic -1 PRF cost to translation. The selective pressure on this latter adaptive process appears to be weak.

Codon Usage Bias Index Correction from Translation Efficiency:

In a first approach, we derive an estimator for a CUB index correction from basic relationships between CUB indices and translation efficiencies of an mRNA. In the absence of a -1 PRF signal, the CUB on x will be measured by some index function I(x). As noted, if a functional -1 PRF signal is present, I(x) does not account for the translation efficiency loss due to the unproductive translation of the hybrid sequence x′ (see Figure 1). We aim to derive a “corrected” index Ic(x) that more appropriately reflects the diminished translation efficiency in the presence of -1 PRF signals.

To find an expression for Ic(x), we begin with the assumption that there exists a monotonous mapping F that maps the translation efficiency η(x) of a sequence x (in the absence of -1 PRF) to a codon usage bias index I: I(x)=F(η(x)).

F is a monotonously increasing function, where increases in translation efficiency are reflected by increases in the codon usage bias index. Research by Tuller et al. further suggests that F is concave (Tuller et al. 2010), such that codon usage bias index values saturate with increasing translation efficiency values. We define translation efficiency η(x) classically as an input-output energy ratio. More specifically, η(x) is the ratio of the total per-protein energy Ep(x) contained in the n(x) synthesized proteins from x to the energy, Ei(x), used for producing those proteins in a sufficiently long time frame Δt:η(x)=n(x)Ep(x)Ei(x).(3)Here, the energy going into the synthesization machinery, Ei(x), does not contain the expenditure for -1 PRF’s regulatory use. Thus, for a sequence x that additionally carries a functional slippery site, we haveη(x)→ηc(x) ≡ nc(x)Ep,c(x)Ei,c(x)=n(x)Ep(x)Ei(x)+Ei(x′).(4)Here, the same number of proteins were produced as when -1 PRF was not considered, that is, nc(x)=n(x). Since the final proteins are structurally equivalent to those in the absence of -1 PRF, we also have Ep,c(x) ≡ Ep(x). However, more energy was expended to produce these proteins, and therefore Ei,c(x)=Ei(x)+Ei(x′)>Ei(x). Their synthesization requires at least Ei(x). A part of the expended energy on translation is not implemented in the proteins, but in the translation of x′ and the NMD pathway activation, Ei(x′). Thus, it follows that ηc(x)<η(x).

The corrected usage bias statistic Ic(x) is defined asIc(x) ≐ F(ηc(x))=F(n(x)Ep(x)Ei(x)+Ei(x′)).(5)Since the form of F as well as the values of Ep(x), Ei(x) and Ei(x′) are either difficult to measure or unknown, we aim to to compute Ic(x) indirectly from I(x). To this end, we separate the translation efficiency ηc(x) into two components:ηc(x)=η(x)+Q(x,x′).(6)Solving for Q(x,x′) givesQ(x,x′)=−n(x)Ep(x)(Ei(x)+Ei(x′))Ei(x′)Ei(x)=−ηc(x)R(x,x′),(7)where R(x,x′) ≡ Ei(x′)/Ei(x). With this, we haveηc(x)=η(x)1+R(x,x′).(8)Further analysis of the ratio R(x,x′) is complicated by the inability to directly measure Ei(x) and Ei(x′). To address this issue, we pursue an approach where the ratio is approximated by information about the relative one-elongation energies spent translating x and x′. More specifically, we assume that each time x is translated, an energy input of T(x) ≐ Ei(x)/n(x) is expended per time unit Δt. The total number of times translation is initiated on the mRNA sequence, N(x), is N(x)=n(x)/(1−p). This is because n(x) corresponds to the fraction 1−p of times that an mRNA-elongating ribosome remains in frame. Analogously, the number of times translation is interrupted by a -1 PRF event is n(x′) ≡ N(x)p=n(x)p/1−p. Hence, each time the sequence x′ is translated, an energy expenditure of T(x′) ≐ Ei(x′)/n(x′) is ensued. Note that n(x′) does not correspond to a protein number. The ratio between these two is approximated by:T(x′)T(x)=Ei(x′)Ei(x)(1−pp)≈l+λ+πL,(9)which is independent of the number of synthesized proteins n(x). Here, we have assumed that ratio of the per-run translation energies expended for the non-frameshifting to the frameshifting scenarios corresponds roughly to the ratio of the lengths of the translated sequences x′ to x, respectively. However, there is an extra cost to each frameshifted mRNA, x′ stemming from NMD-mediated degradation. This cost is accounted for by π, which is an unknown fixed cost associated with each -1 PRF event. π is measured in units of equivalents of translation cost per nucleotide.

With (9), we can approximate the R(x,x′):R(x,x′) ≐ Ei(x′)Ei(x)≈l+λ+πL(p1−p).(10)With this, we are ready to address the last approximation required to find an analytical expression for Ic(x). Since F is concave, it follows that F(aη)≥aF(η) for a∈ℝ. Thus, setting a=(1+R(x,x′))−1, and using definition (4), we find a lower bound for Ic(x):Ic(x) ≐ F(ηc(x))=F((1+R(x,x′))−1η(x))≥(1+R(x,x′))−1F(η(x))=Ic^(x) ≐ (1+l+λ+πL(p1−p))−1I(x).(11)The approximation for R(x,x′) results in reasonable and useful properties of ηc(x). More specifically, as p→1 we have that ηc(x)→0. This entails that no mRNA is translated into proteins, as expected. The same follows if π→∞. The extraction of a from within the brackets leads to the undesirable behavior that as p→1 or π→∞, we have that Ic^(x)→0. Instead, I^c(x) should approximate some minimal value Ic,min(x)=F(0).

A major drawback of Ic^(x) is its reliance on p and π, which are both unknown. Typically, p is assumed to lie between 1−10%, but may reach up to 70% (for example the EST2 gene in yeast, (Advani et al. 2013)). In the following, and if not stated otherwise, we assume that p is constant across all genes. It is unclear how large π should be. We have found no evidence that the value of π is dependent on any characteristic of mRNA, such as its length, or codon composition (Chang et al. 2007; Dinman, 2012b). NMD is an evolutionarily conserved surveillance pathway (Chang et al. 2007). Its activation may thus differ in energy expenditure between organisms (indeed, in yeast, NMD does not require an exon-junction complex, unlike most other eukaryotes). However, we have found no indication that the energy required to fully complete the NMD process will substantially vary from mRNA to mRNA within an organism. Thus, we assume that π is uniform across genes, although we do not know how large it is relative to other energy inputs. Again, if not stated otherwise, we assume that π=0 in order to give a conservative estimate of the effects of -1 PRF on the TEH.

Codon Usage Bias Index Correction from Averaging:

In a second approach, we define a correction for a -1 PRF-aware CUB index using a balancing principle. To this end, we add the CUB index I(x) in successful, non-NMD mediated translations of x and the index I(x′) of untranslated x′, while weighing both with their respective probability of occurrence:Ia(x) ≐ (1−p)I(x)+pI(x′).(12)The corrected index Ia(x) is thus the expected value of I when considering that x is translated a fraction (1−p) of the time, and x′ is translated a fraction p of the time. I(x′) therefore acts as a penalization function. Ia may underestimate the extra energy costs to -1 PRF incurred from the activation of the NMD-mediated degradation processes, since these are not comprised in I(x′). Similarly to I^c(x), the major drawback of the estimator Ia(x) is its dependence on p. As before, we thus evaluate Ia(x) for different, plausible values of p to assess -1 PRF’s effect on the TEH. However, Ia(x) has the advantage that it is also well defined for p=1.

Reexamining the TEH with corrected CUB indices:

Both corrections Ic^ and Ia represent a change in the value of a codon usage bias index given a -1 PRF efficacy. The models for correcting I presented here do not relate these measures to protein expression levels, P. They do therefore not predict protein expression. Instead, they allow us to reexamine the association between protein expression and a corrected codon usage bias index that in part underpins the TEH.

To examine how these corrections affect associations between CUB and protein expression, we used the widely employed codon adaptation index (CAI) (Carbone et al. 2003) as an example for a CUB index I. We then computed both corrected Ic^ and Ia for different values of p and for all genes in the von der Haar data set (von der Haar 2008) that contain -1 PRF signals. For Ic^ we also assumed different values of π.

A precise computation of Ic^ and Ia would require information about the value of p across genes. Since no such information is available as of now, we follow two approaches. First, we treat all -1 PRF sites equally, assuming an equal p value for all, independent of the protein expression levels P of the genes they are located in. Second, we assume that p is associated with P, and allow p to slowly decrease with the P of the gene.

Figure 4 shows that correcting for the presence of -1 PRF signals with a -1 PRF efficiency of up to p=0.3 does not substantially affect the CUB index to protein expression relationship for both CUB index corrections, if the extra cost of mRNA degradation by NMD is neglected (π=0). Figure 4A) shows that all CAI values for all genes are diminished when using the correction Ic^. A substantial correlation between protein expression and corrected CAI index remains. Accounting for the frameshifted sequences x′ also diminishes codon adaptation measures Ia considerably compared to uncorrected CUB index values, as shown in Figure 4B). However, unlike with Ic^ this reduction does not correspond to a uniform negative offset. Instead, the effect of accounting for -1 PRF is to both shift and broaden the distribution of Ia(x) values relative to I(x) (see Figure S4).

The two corrections Ic^ and Ia show different sensitivities to the -1 PRF efficiency. In particular, the sensitivity of Ic^ to -1 PRF efficiency is mediated by the value of π. Figure 5A shows that the correlations between both Ic^ and Ia to protein expression levels decline for increasing values of p and different values of π. As p increases, and for low costs of NMD-mediated mRNA decay, the correlation of Ic^ to protein expression levels declines to very low values (∼0.2 at p=0.9 for π=0,10,100). If the energetic cost of -1 PRF induced mRNA degradation, π, becomes larger (π=1000,10000), the correlation declines very rapidly with p, and almost vanishes when p reaches ∼0.5. For Ia estimates, correlations’ dependency on p is moderate, never falling below 0.4 at biologically implausible values of p of unity.

These results corroborate the support for the translation efficiency hypothesis. For uniform -1 PRF efficiencies and for Ic^-based estimates, correlations of corrected CAI to protein expression levels decline with p. They only become sufficiently diminished to challenge the TEH when both the values of π and p are very high. Indeed, according to our mathematical framework, π=1000,10000 corresponds to the cost of either translating a large gene, or tenfold that cost. Except for π=10000, all correlation values between corrected CUB indices and protein expression levels remain at around 0.4−0.5 for biologically relevant -1 PRF efficiencies of p=1−10%. For Ia estimates, correlation seems generally robust to changes in p.

Varying -1 PRF efficiency with protein expression levels:

Since the data in Figure 2 strongly suggest a cost to -1 PRF maintenance, we also explored how a -1 PRF efficiency decline with protein expression levels consistent with such cost would affect the correlations in Figure 5A. Unlike a uniform p across protein levels, corrections to Ic^ are expected to be larger at low expression levels than at high levels. The -1 PRF decline is assumed as follows: p(P)=pb/log10(P), where we call pb the baseline -1 PRF efficiency.

For Ic^, this additional assumption leads to a measure:Ic^(x)=(1+l+λ+πL(pblog10(P)−pb))−1I(x).(13)Analogously, Ia becomesIa(x)=I(x)−(pb−log10(P))(I(x)−I(x′)).(14)Figure 5B shows that under this assumption, correlations between protein expression and Ic^ rise with larger baseline -1 PRF efficiencies for all except the largest values of π, while again Ia remains unaffected. We observe analogous results using the Spearman rank correlation (see Figure S5).

When -1 PRF efficiencies decline with protein expression –in accordance with a cost to -1 PRF– most correlations between Ic^-based estimates and protein expression rise with -1 PRF baselines. Except for π=10000, all correlation values increase with -1 PRF baselines. This also adds support for the TEH, suggesting that -1 PRF conceals stronger associations than measured with uncorrected codon usage bias indices.

Discussion

In this study, we have investigated the hypothesis that -1 PRF presupposes translational costs to an organism, while at the same time generating benefits associated with protein expression regulation. We explored whether such a cost might be identified more directly in data and used both PRFdb (Belew et al. 2008) and the von der Haar data set (von der Haar 2008) to address this question. We devised three hypotheses for likely signals of such cost in -1 PRF carrying genes. We could not find contradictory evidence for any of those hypotheses in the data. In a second step, we explored whether these costs, if not accounted for, bias important measures of CUB. We devised two new general approaches to correct CUB indices for the presence of -1 PRF signals. We then tested whether the concealment of such costs may unduly influence, falsify or strengthen, one classical test of the translational efficiency hypothesis: the association of CUB indices with protein expression levels. Under the assumption of uniform -1 PRF efficiencies, the energetic costs related with NMD activation would need to be implausibly high to warrant this conclusion. We find that on the contrary, assuming that -1 PRF efficiency decreases with protein expression levels -as suggested by the existence of a cost to -1 PRF-, the TEH is strengthened. Thus, taken together, our results suggests that high organismal demands for specific proteins are reflected in CUB-mediated translation efficiency gains.

Our study comes with a series of caveats. A first caveat of this study lies in that the slippery sites are inferred (utilizing the Nupack algorithm (Dirks and Pierce 2004) for mRNA pseudoknot detection), but not always experimentally confirmed (Advani et al. 2013). Previous studies suggest that inferred slippery sites are likely to be functional. In (Advani et al. 2013) it is reported that in previous work, nine out of nine high confidence -1 PRF sites detected by methods used in the PRFdb in S. cerevisae were confirmed to be functional in vivo. Crucially, whether a -1 PRF site is regarded to be functional depends on whether it exceeds a predetermined p-threshold. For example, in (Advani et al. 2013), a search of -1 PRF slippery sites identified 10 candidate genes in EST2, three in EST1, 2 in STN, and 1 in CDC in mRNA involved in telomerase. Out of these, and employing a cutoff of 1% for -1 PRF efficiency, seven carry functional slippery sites (EST2: 3, EST1: 2, STN: 1, CDC: 1) (see (Advani et al. 2013), Table 1). Had a cutoff of 0% -1 PRF efficiency been employed, all slippery sites except one would be functional. This limited sample provides confidence that the methods described by (Belew et al. 2008) appropriately capture biological mechanisms.

Even with such uncertainty, it is unlikely that the presence of non-functional -1 PRF sites in the analyzed data would affect the claim of an intrinsic -1 PRF maintenance cost (Figure 2). This claim would only be biased if the probability of a candidate site to be functional were affected by either protein expression levels of the gene within which it resides or alternatively, the relative position in the gene of the putative slippery site. Note that these effects should arise from mechanisms that are independent of the ones studied here. Again, we are unaware of any such mechanisms acting in yeast, except for the requirement of a spacer sequence. A spacer sequence of a minimum length of say, 8 nt, will prohibit slippery sites to be located within 8 nt of the mRNA end. This restriction will slightly bias the a priori position of candidate slippery sites. However, as average mRNA by far exceed this length, such a restriction cannot explain the effects documented here.

Another caveat of our study was the lack of estimates for p for the whole data set. Due to this restriction, we could not explore how -1 PRF efficiency p relates to CAI or protein expression. As assumed in Figure 5B, we expect that -1 PRF maintenance costs should on average translate into an inverse relationship between p and protein expression levels, because higher protein expression demands by the organism should be countered with reductions of p, minimizing the loss of mRNA to NMD-induced degradation. However, it is also possible that there exist cases in which -1 PRF has important functions in highly expressed genes. Hence, although there are fewer slippery sites for highly expressed genes and they are costly in terms of translation efficiencies, these sites could have large -1 PRF efficiencies. We have not found evidence in the literature to support this notion, and it would concomitantly contradict the evidence here presented. Additional testing of our hypotheses on the costs -1 PRF maintenance would be greatly helped if information on how -1 PRF efficiencies vary with protein expression was available.

While our analysis of the behavior of Ic^ gives us an indication how -1 PRF costs to translation efficiency could bias CUB indices, these insights rely on the assumptions made during Ic^’s derivation. Importantly, Ic^ is a lower bound to Ic, which means that downward corrections of Ic are potentially exaggerated. Thus, the claim that corrected CUB indices will leave the basis for the TEH unaffected is resistant to such a bias. Another key assumption is that CUB indices of an mRNA sequence should increase monotonically with translation efficiency. In practice, this association is surely not perfect, due to multiple additional influences on CUB from unrelated biological processes. For these processes to systematically bias our results they should i) dominate over translational efficiency effects or ii) result in directional effects when combined. Despite these possible shortcomings, the assumption reflects key properties with which many CUB indices are designed, namely to mirror translational efficiency. For example, the CAI uses ribosomal mRNA as a reference to compute preferred codons because it is highly plausible that they are translationally efficient.

Moreover, our results suggest that utilizing expected values for corrected CUB index definition, such as in the case of Ia, is suboptimal. Analysis of the distribution of Ia levels with p=1 of genes in the von der Haar data set show that corrected codon adaptation can increase. This behavior contradicts the rationale behind introducing such corrections in the first place, as, surprisingly, I(x′)>I(x) can occur.

To test the hypotheses about the TEH derived here, we have only utilized genes with identified slippery sites, and not all genes. Testing on all genes would be indicated if the TEH was challenged. However, if the correlation coefficient measured within a subset in which a correction has been applied is not substantially diminished relative to uncorrected values, the same correction will not affect the whole set either. Therefore testing it on the whole set is not necessary.

The results indicating intrinsic -1 PRF maintenance cost and TEH support are in mutual agreement. -1 PRF maintenance induces a reduction of the size of l with increasing protein expression levels (Figure 2C). At fixed p, as assumed in our analysis, significant reductions in l would diminish the penalization to the original, un-corrected CUB. This could only be compensated by increases of p or π with protein expression levels, contrary to intuition and available evidence (Advani et al. 2013). In fact, our analysis shows that a more plausible p dependency to protein expression leads to a strengthening of the basic CUB index to expression level association.

These results have to be interpreted in the context of current codon usage bias research. The mechanism of translational selection (TEH) remains the main explanation put forward for selection based origins of codon bias. This explanation presupposes that silent mutations affect fitness via translation processes. These fitness increases originate from translation efficiency and accuracy gains, that in turn are hypothesized to stem from translation initiation and elongation processes. In fact, mRNA elongation rates do indeed appear to correlate positively with preferred codon frequencies (Gustafsson et al. 2004), although better evidence would be desirable. However, for initiation processes –which are assumed to contribute the bulk of these fitness increases–, the effects of CUB on initiation rate increase remain subject of debate (Quax et al. 2015). In fact, Tuller and Zur have analyzed the effect of the structure of the 5′ end of an mRNA on translation initiation and elongation rates and found various regulatory signals that affect these rates in different ways (Tuller and Zur 2015). Indeed, the induced folding at the 5′ end of the ORF appears to affect translation efficiency.

Overall, the evidence from codon usage bias statistics and its associations to tRNA abundance and protein expression offers a compelling narrative for the TEH. However, current research efforts aiming to identify the exact mechanisms that give rise to these associations must account for conflating selective forces. Indeed, the mechanisms laid out in the TEH may not be the only way in which selection shapes codon usage frequencies. For example, besides the abundance of tRNAs, other factors have been discovered to crucially affect elongation rates and hence, to be possible targets selection (Chamary et al. 2006; Resch et al. 2007, 2009; Hunt et al. 2009). More precisely, specific synonymous changes can influence mRNA splicing, mRNA secondary structure, protein stability as well as protein folding (Komar 2009; Chamary et al. 2006; Carlini and Genut 2006; Parmley et al. 2006; Tsai et al. 2008; Komar 2007). Synonymous changes may also alter the secondary structure of mRNA and thus affect the rate of translation –as established in vitro (Ivanov et al. 1997; Parmley and Hurst 2007; Tuller and Zur 2015) and subsequently in vivo (Kimchi-Sarfaty et al. 2007). A comprehensive review on how protein expression is fine tuned by codon usage bias is given by Quax (Quax et al. 2015). Further, in a very recent study, a synonymous difference between mammalian cytoskeletal β- and γ-actin proteins was found to affect co-translational processing, ubiquitination, and co-translational degradation, leading to differential stability properties of the corresponding protein products (Zhang et al. 2010).

Within genomes, the biologial role of a gene influences that gene’s CUB in various additional ways. Some codons are more abundant in genes depending on their function, displaying distinct codon bias patterns. Supek has reviewed how gene function modifies codon preferences (Supek 2016). Selective pressure to maintain (or alter) gene function is superimposed to what is expected from translational efficiency and accuracy optimization. Genes where altered CUB patterns have been found are involved in diverse functions: amino acid starvation responses, cyclical protein expression, tissue specific expression, cellular differentiation, stress responses, and carcinogenesis. While genes can differ in function, there are also differences in function within a gene’s sequence that also affect local CUB. For example, Tuller and Zur have surveyed the multiple roles of the 5′ end of coding sequences in gene expression regulation (Tuller and Zur 2015). They hypothesize that due to multitude of regulatory signals found in that region, selection pressures regarding codon utilization are likely different than in other regions. Unlike cross-gene function, such effects are local, which should not affect our analysis. How effects from gene function may affect our results, crucially depends on their frequency and direction. The literature reviewed here does not appear to warrant the assumption that all the function-dependent selective pressures will align to influence CUB in the same way, leading to systematic bias. The phenomenon of -1 PRF is thus only one of many ways in which CUB may be influenced.

The particular appeal of the -1 PRF phenomenon lies in its potential to elucidate many of these processes. Because slippery sites are precisely localized and -1 PRF efficiencies measurable, -1 PRF signals constitute natural experiments to translation efficiency and accuracy theories. This is because the separation of the mRNA by a slippery site should create differential translational costs across that mRNA. This translational cost gradient should be reflected in codon usage bias differences. -1 PRF based approaches to analyzing codon usage bias behavior, like the one presented in this study, may thus offer novel tools to better understand the means by which translation efficiency gains are realized in nature.