INTRODUCTION

With the recent focus on nonprotein coding RNA (ncRNA) genes, interest in detecting novel ncRNAs has rapidly emerged. Since the structure of RNA is evolutionarily more conserved than its sequence, predicting the RNA's; secondary structure is one of the most important steps towards its functional analysis. There are two fundamentally different approaches to predicting RNA secondary structures, namely free-energy minimization and probabilistic approaches, which often use phylogenetic information given by a multiple sequence alignment. This duality of approaches represents two different types of RNA structure information, the former relies on the physical properties of single sequences, while the latter uses evolutionary information in the form of compensatory base pair substitutions.

The advantage of energy minimization is that it relies on experimentally determined parameters. On the other hand, it is known that methods such as RNAfold (1) and Mfold (2) achieve an overall sensitivity of about 70% (3). For example, H/ACA snoRNAs usually do not produce the characteristic two-stem when the energy minimization alone is used to determine the structure. Instead, information on the functional sites have to be used as additional constraints to the thermodynamic folding (4–6). In general, there are several reasons for the inaccuracy of energy minimization. First, the energy model is incomplete, especially for multi-loop structures. Second, there can be alternative structures with similar free energies. And third, true structures are often stabilized by bound molecules. The only feasible way to determine the effects of these inaccuracies is to include phylogenetic information by looking for conserved substructures that cannot be accounted for by thermodynamic folding alone.

Hence, it is desirable to combine this duality of RNA structure information into a single optimization problem. This was probably addressed for the first time in 1985 by David Sankoff (7), who introduced an algorithm that solved the problem of simultaneously aligning and folding a set of unaligned RNA sequences. While, this is a kind of gold standard for comparative RNA secondary structure prediction, the ‘algorithm requires extreme amounts of memory and time’ (8). Thus, practical implementations of the Sankoff algorithm like FOLDALIGN (9–12), Dynalign (13,14), PMcomp (15), LocARNA (16,17) and PARTS (18), published more than 20 years later, introduce different constraints and somewhat arbitrary choices in their scoring schemes to make the approach tractable for realistic input sizes.

Probabilistic approaches to the problem of simultaneously aligning and folding a set of RNA sequences, e.g. Stemloc (19) and Consan (20), are usually based on stochastic context-free grammars (SCFG). Unlike the aforementioned Sankoff-like methods where the energy is explicitly reflected in the scoring scheme, these approaches rely purely on statistical learning methods to determine their parameters. A mixed approach is employed by CMfinder (21) that implicitly combines energy contributions with an SCFG. As a seed CMfinder uses energetically folded structures from which a covariance model (SCFG) is constructed in successive rounds of optimization. Another approach is SimulFold (22), which simultaneously infers structures (including pseudoknots), alignments and trees. There are no sequence-dependent energy contributions, but an energy term that depends on the topology of the consensus structure.

The main disadvantage of the Sankoff-like approaches is their high-computational cost. For this reason, a class of methods was developed that saves computational resources by predicting the optimal structure from given RNA alignments, which are usually produced by multiple sequence alignment methods. This approach has proven useful in genomic screens for ncRNAs (23,24), despite their limitations in finding RNA structures in more divergent sequences (25). It has been shown that the quality of the predictions breaks down in cases where sequence identity is <60% (26). On the other hand, these methods can also be applied to improve the consensus structure prediction in Sankoff-like approaches. The reason is simply that the Sankoff-like approaches usually apply a progressive strategy by combining pairwise alignments to build the final multiple alignment. Thus, consensus structure prediction can be improved when considering the complete phylogenetic information.

In this article, we extend Pfold (27) to simultaneously use evolutionary and energetic information while searching for the common structure in a set of prealigned sequences. The probabilistic approach underlying Pfold combines an explicit evolutionary model of the RNA sequences with a probabilistic model of the secondary structures. When trying to extend this approach to incorporate thermodynamic folding, the first idea was to develop a combined probabilistic model for evolution and folding, using the partition function approach of McCaskill (28) as a probabilistic model for thermodynamic folding.

There are two main problems for a combined probabilistic model. First, there is no simple way to weight the different information sources in such a combined model. And second, the structure prediction would be based on a maximum likelihood or maximum a posteriori (MAP) approach. Recent work by Carvalho and Lawrence (29) has shown that this approach often does not yield to the desired result. Basically, the implicit assumption of a maximum likelihood or MAP approach is that the structure with the highest probability is also the structure where the ensemble of close neighbors has the highest probability mass, which is often not the case. Hence, Carvalho and Lawrence proposed different new classes of estimators, which included estimators already used in sequence alignment and RNA secondary structure prediction. One example is the maximum expected accuracy (MEA) method originally introduced by ref. (30) and successfully applied to sequence alignment in ProbCons (31) and to RNA structure prediction in CONTRAfold (3). A partition function version of the Sankoff algorithm (which is related to MEA scoring) was introduced in ref. (32). Another example is Pfold itself, since the current implementation does not use the maximum likelihood approach originally introduced in ref. (33), but is based on reliability scores (27), which can be interpreted as a variant of MEA scoring.

Hence, our approach is based on a combined MEA score. The combined score is based on both base pair and single-strand probabilities as calculated by RNAfold and the reliabilities of base pairs and single-stranded positions as extracted from Pfold. The combined MEA scoring has the advantage that it allows for an explicit weighting of the contribution of phylogenetic and thermodynamic information and smoothens the effects of incorrectly predicted base pair probabilities. The latter is due to the fact that the algorithm searches for a structure that shares as many base pairs as possible with all alternative structures. We implemented this Probabilistic Evolutionary and Thermodynamic folding algorithm (PETfold) for multiple RNA sequence alignments.

Recently, several methods for finding the common structure in a set of sequences that are either unaligned or aligned have been published (34–37). Essentially, they all carry out a global alignment or predict a common structure on a set of globally aligned sequences. RNAalifold (38) basically applies energy minimization to a complete alignment. In addition, it introduces an evolutionarily motivated score to measure sequence covariation for base pairs. Since it is widely used and applies a thermodynamic-based scoring scheme, it was chosen as a benchmark method in addition to Pfold.

The Pfold model

We now recall the Pfold (33) model. In the following, a secondary structure σ is a set of base pairs that do not cross. Let A be a multiple alignment of the sequences s1 … sn. Furthermore, let T be a given evolutionary tree and M be a prior model for the secondary structures. Thus the model provides a probability distribution on the structures, given the data (i.e. the alignment A) and the background information (i.e. the secondary structure background model M and the tree T):

1

Since Pr [A|T,M] is independent from the structure σ, we need to optimize only Pr [A|T,σ,M] P[σ |T,M].

Here, P[σ|T,M] is the prior distribution over all secondary structures. M is given by the following simple SCFG, which is taken from Pfold:

2

It has been shown by Dowell and Eddy (39) that this grammar performs best on a given benchmark data set. Furthermore, this grammar is unambiguous, i.e. every structure is generated in a unique way. For each structure σ, let τM(σ) be the associated parse tree that produces the structure σ using the grammar M. With r(σ) we denote the root node of τM(σ).

The term Pr [A|T,σ,M] provides a distribution over alignments. Here, only the sequences evolve and the common structure σ (adopted by all sequences) determines the mode of evolution. The alignment probability is calculated as the product of alignment column probabilities, which are calculated using Felsenstein's; dynamic programming for phylogenetic trees (see Supplementary Material for details).

The full term Pr [A|T,σ,M] P[σ |T,M] now can be calculated with a combined SCFG by multiplying the probabilities for the secondary structure rules in Equation (2) with the probabilities for generating the associated alignment columns (see Supplementary Material again). In the following, we will denote it with Pr(r(σ), A).

Originally, Pfold (33) used a MAP approach for calculating the consensus structure σ. Thus, σ was defined to be the structure that maximizes Pr(r(σ), A). This maximization problem can be solved using the CYK algorithm. Later, this was replaced in Pfold by a more successful MEA approach discussed in the next section.

Maximum expected accuracy

The basic idea of MEA scoring is to consider Pr(r(σ), A) as a probability distribution over consensus structures, where higher probabilities denote a better fit to the alignment and its associated evolutionary history. We write position i ∉ σ as short for ‘position i is not involved in a base pair in σ’ [i.e. ∀ j: ((i,j) ∉ σ ∧ (j,i) ∉ σ)]. Given the structure σ and an alignment A with m columns, the set of all unpaired positions in the consensus structure is denoted as σ ={i ∉ σ | 1 ≤ i ≤ m}. Then, we can compute the expected overlap ex-overevo(σ) of a specific consensus structure σ with all possible consensus structures, weighted according to their probabilities. This can be done as follows:

where the term δ(ψ) for a proposition ψ ≡ ((i,j)∈σ′) or ψ ≡ (i ∈ sg(σ′)) is 1 if the ψ is true and 0 otherwise. α is a free parameter that weights single stranded against base pair positions. Since a base pair occupies two positions, α should be smaller than 1\2 (where α=1\2 means that prediction errors on base pairs and single stranded positions are weighted equally).

We now define the reliability scores for a base pair (i, j) and for a single-stranded base i as in Pfold by

3

Both can be calculated by an inside/outside algorithm. Using the terms in Equation (3), we can redefine the expected overlap of a consensus structure as

which allows us to calculate the consensus structure with the maximal expected overlap using a Nussinov style algorithm.

Extension by folding energies

Before we can extend the model, we have to more precisely define what it means that two sequences can adopt the same consensus structure under a given alignment matrix A = (au,l), where u denotes the number of sequences and l the number of alignment columns. This is necessary because a consensus structure is defined as a set of paired alignment columns. Hence, let n be the number of sequences and let fuA(i)=l be the alignment column corresponding to position i in sequence su. The mapping fuA can be extended to structures:

In the previous section, we searched for a consensus structure that had the maximal expected overlap with other possible consensus structures defined by the probabilistic evolutionary model, thus minimizing the expected number of evolutionary prediction errors. Now, we also want to evaluate the expected overlap for each sequence s with its ensemble of structures as given by the energy model. This implies that for each sequence s, we consider the distribution of structures as introduced by McCaskill (28). For this purpose, let be the base pair probabilities for a sequence s as calculated by, e.g. RNAfold −p (1) and the probability for position k being single stranded in sequence s. The combined expected overlap now consists of two parts, generally weighted with 1 for the conservation part and β for the thermodynamic overlap:

4

where ex-overstr(σ) is the expected overlap of σ with all structures from all sequences. Formally, this is defined by

Here, denotes the base pair probability for sequence su, written by alignment columns:

and analogously for .

Overall, we obtain

The consensus structure maximizing this expectation can again be calculated by a Nussinov-style algorithm.

Reliably conserved substructure

Highly conserved substructures are under strong evolutionary pressure, possibly caused by interactions with bound molecules or other functional constraints. As mentioned in the Introduction section, these substructures cannot be accounted for in thermodynamic folding algorithms. We refer to them as a reliably conserved substructure σrel, which is defined as follows. A substructure is a structure that fixes only some of the positions in a sequence, hence being a partial structure. A partial structure σp is a tuple consisting of a set of base pairs and a set of single-stranded positions such that no single-stranded position in is part of a base pair in . Then, a partial structure σp is a substructure of a structure σ if and .

The reliably conserved substructure σrel is a substructure of σ that is determined by the evolutionarily highly reliable positions. They are selected by using thresholds on the reliability scores of single-stranded (pthresholdss) and base-paired positions (pthresholdbp). We get σrel by using a Nussinov-style approach with the additional condition that

There are two possible approaches to determine these thresholds. First, they can be estimated through parameter tuning using a data set of known RNA families to predict RNA structures most similar to their structure annotations (see Result section). Second, a statistical approach can be applied by determining positions whose reliability is significantly better than the average (see Supplementary Material). Our observation was that in practice the first approach worked better than the second.

For specific cases where there are many additional constraints imposed on the structure, it is a good strategy to identify the reliably conserved substructure σrel

first, and then search for the consensus structure σ that maximizes ex - over (σ) and contains σrel as a substructure. Reliably conserved substructures are easily integrated into the earlier described Nussinov-style algorithm by setting the weighting factor of thermodynamic overlap β to zero in Equation (2) (see Supplementary Material for details).

Gaps

Gaps are treated in a two-step procedure. First, all columns are removed from the alignment where ≤75% of the sequences have nucleotides. These columns are integrated as gaps in the consensus secondary structure at the end, a strategy that is adapted from Pfold. Second, sequence-dependent probabilities and are calculated without gaps and probabilities of gaps are estimated as the average in the appropriate column of the alignment.

Data

Rfam (40) is the most up-to-date comprehensive and freely accessible RNA structure database, so we used 46 Rfam (version 8.0) seed alignments for parameter tuning and for PETfold benchmarking. Our data set consists of 17 RNA families recommended by refs. (21) and (11) and 29 additional families possessing a high-quality alignment documented by the SARSE project (41). The alignments of the second set of data are characterized by a score for inconsistent base pairs ≤4 and a score for novel base pairs ≤4. These criteria are satisfied by 43 families, of which two families overlap the first set and 12 were rejected because of an annotated structure with <40% of bases being involved in base pairs. Large alignments were reduced by sequence similarity as in (41). An overview of all RNA families in the data set is given in Supplementary Table 1.

Performance evaluation

The RNA structure predictions of PETfold, Pfold and RNAalifold are evaluated by looking for their correlation to the related Rfam structure annotation, ignoring noncanonical base pairs. Therefore, we used the Matthews correlation coefficient (42) defined as

5

where Pt is the number of mutual base pairs in the two assignments (true positives), Nt the number of mutual pairs of bases not base pairing (true negatives), Pf the number of predicted base pairs not in the Rfam assignment (false positives) and Nf the number of base pairs in the Rfam assignment not predicted to pair (false negatives). We also calculated sensitivity, SEN = Pt / (Pt + Nf) and positive predictive value, PPV = Pt / (Pt + Pf). Their geometric mean is a good approximation to the MCC for base pair prediction (43). The parameter tuning was done by optimizing the MCC to get a best possible tradeoff between sensitivity and PPV. In addition, we used the more stringent structure evaluation scheme R5 introduced by Gorodkin et al. (44) (see Supplementary Material for full details). This was necessary because MCC does not evaluate whether or not two positions coincide in their base pair partners when the two positions do base pair.

Pfold similarity

PETfold without thermodynamic probabilities (set parameter β = 0) calculates exactly the same most reliable structure as Pfold. We interpret highly conserved parts of this RNA structure as evolutionary substructure σrel.

Parameter tuning

The default parameters of PETfold are set by optimizing the MCC of predicted RNA secondary structure to the related Rfam structure annotation over all RNA families in our data set. Hence, we ran PETfold with different values for the single-stranded probability weighting factor (α ≤ 0.5) and different reliability thresholds for evolutionary highly conserved single-stranded (pthresholdss) and base-paired positions (pthresholdbp). The latter two values determine the reliably conserved substructure σrel. The conservation part and thermodynamic overlap are equally weighted (β = 1) as PETfold should not be overfitted to RNA families with strong conservation and lower thermodynamic stability or vice versa.

As shown in Figure 1, PETfold with α = 0.2 and pthresholdss = 1 predicts the best RNA structure as compared to the Rfam annotated structure. These settings optimize the structure of 30 RNA families. The total rejection of the reliably conserved substructure σrel (pthresholdss = pthresholdbp = 1) or pthresholdbp < 0.9 slightly decreases the performance of PETfold. Therefore, we are pragmatic and use pthresholdbp = 0.9, which predicts the best RNA structure for 38 families. In summary, we choose α = 0.2, β = 1, pthresholdss = 1 and pthresholdbp = 0.9 as default parameters for PETfold having optimized the structures of 25 RNA families that comprise 54% of the data set.

The parameter tuning of PETfold showed that in many cases the quality improves by adding thermodynamic probabilities even when there are highly conserved substructures. PETfold produces the highest MCCs when single-stranded positions are not considered as part of the reliably conserved substructure σrel. Furthermore, defining σrel which consists of evolutionarily conserved base pairs with high reliability resulted only in a minor improvement. This emphasizes that phylogenetic structure information is mostly supported by folding energy. Another unexpected result was the poor weighting of single stranded against base pair positions (α = 0.2), i.e. base pair probabilities have a larger impact on RNA structure prediction.

Performance

Next, we compared the RNA structure predictions of PETfold with Pfold and RNAalifold, using the default parameters for each program. On average over the entire data set, PETfold performed better than the other methods for a wide range of parameter settings. PETfold with default parameters predicts base pairs with 0.85 PPV, 0.88 SEN and 0.86 accuracy. Its mean MCC to the Rfam annotations of 0.85 is significantly higher than the 0.71 MCC obtained for Pfold and the 0.79 obtained for RNAalifold (Table 1).

The MCCs for PETfold, Pfold and RNAalifold are listed for all families in the data set in Supplementary Table 1. PETfold predicts a structure that is better than the ones predicted by Pfold and RNAalifold for 18 RNA families. In contrast, Pfold achieves this for only 7 families and RNAalifold for 16 families. Considering a confidence interval of 0.01, we observed 27 of the most accurate predictions by PETfold, 15 by Pfold and 18 by RNAalifold. These include cases where two or three methods predicted the same structure. More specifically, seven structures were predicted by PETfold and Pfold, three by PETfold and RNAalifold and two structures [Histone3 (45) and Rhino_CRE (46)] were predicted by all three methods.

Concerning the structure predictions that completely agreed with the Rfam annotations, the three methods correctly predict the Histone3 structure. In addition, Pfold predicts the exact structure of mir-194 (47) microRNA precursor and RNAalifold of the viral 3′-UTR stem-loop s2m (48). In contrast, PETfold provides accurate structure predictions for four viral RNA families [HepC_CRE (49), IBV_D-RNA (50), TCV_H5 (51) and TCV_Pr (51)]. Their alignments contain between 3 and 64 sequences and have mean pairwise identities (MPI) ranging from 77% to 95%, showing that PETfold is not overfitted to certain alignment features. Furthermore, a scan of all 574 seed alignments in Rfam version 8.0 results in the best performance by PETfold (MCC = 0.63), followed by Pfold (MCC = 0.62) and RNAalifold (MCC = 0.58). However, it should be noted that a large number of annotated RNA structures in Rfam are themselves predictions, many by Pfold.

The performance of PETfold increases greatly in the case of several H/ACA snoRNAs, the already mentioned example of ncRNAs that often do not fold into the structure of lowest free energy, e.g. HACA_sno_snake (52) (MCC with PETfold: 0.79, Pfold: 0.25, RNAalifold: 0.41) as a member of our data set and SNORA51 (53) (MCC with PETfold: 0.91, Pfold: 0.13, RNAalifold: 0.57). The functional RNA element of HIV-1 mRNA, Rev response element (RRE), has a verified RNA secondary structure of 337 nt in length (54). The accuracy of the predicted structure for this element increases by 30% when using PETfold (MCC=0.90).

PETfold takes O(L2) space and O(L3) time when the number of sequences N is much smaller than the sequence length L. The most time-consuming parts of the algorithm are the calculation of evolutionarily reliabilities using Pfold (O(L3) + O(L2)), the calculation of energy-based probabilities of N sequences using RNAfold (N × O(L3)) and the Nussinov-style algorithm (O(L3)). In practice, the running time of PETfold is approximately twice that of Pfold and much longer than that of RNAalifold (see Table 1 for details). Major reasons for the longer runtimes are the implementation of PETfold in Perl and the external calls to Pfold and RNAfold.

Family specific parameter settings

The suggested PETfold parameters are based on the average performance over all RNA families in the data set. We observed a common distribution of PETfold performance for different parameter settings in which low-reliability thresholds for evolutionary substructures σrel decrease the prediction accuracy (Figure 1). Thus, the general application of evolutionary constraints has negative influences on the structure prediction.

Nevertheless, several RNA families show a different performance distribution. For instance, PETfold with default parameters performs worse than RNAalifold for the Rotavirus cis-acting replication element (Rota_CRE) (55). However, if we lower the threshold for conserved base pairs, then PETfold predicts a structure with MCC=0.84 compared to the MCC=0.76 achieved by RNAalifold.

We could not find a general correlation of the best performing reliability thresholds with previously used alignment features such as the structural conservation index (SCI) (23) or mean pairwise sequence identity (MPI). However, structurally diverse alignments tend to benefit from lower reliability thresholds due to a higher evolutionary information content. The alignment diversity can be measured by Pcluster (41), which divides the sequences of an RNA alignment into subgroups with different consensus structures. On average, PETfold performs best with high-reliability thresholds for RNA families with low number of structural clusters. This correlation is shown in Figure 2. For instance, the best performing reliability thresholds for Rota_CRE, whose alignment is divided in three clusters by Pcluster, are well described by Figure 2.

In summary, PETfold's default parameters should be regarded as a reasonable approximation for the RNA family alignments with different levels of diversity.

DISCUSSION

We presented a new method, named PETfold, which unifies probabilistic evolutionary and thermodynamic models to predict global RNA structures from multiple alignments. Highly conserved RNA structures are probably naturally selected due to their functional relevance, while less conserved sequences tend to fold in a state of minimum free energy. Combining both types of information increases the predictive power: we rely on phylogenetic information if the structure elements are highly conserved and extend the model by including folding energies to build the surrounding structure.

The 46 Rfam seed alignments used for benchmarking are selected for their consistency. They might include manually optimized structure information, that are not achieved by sequence alignment algorithms. However, we are interested in the best possible structure prediction, which is often only found using conserved structure information in the alignment. The analyses of PETfold revealed that the selection of a reliably conserved substructure σrel is less important than expected. This has been confirmed by estimating the statistical significance of the reliability scores, which shows that only high-reliability scores are significant. On the other hand, the combination of an evolutionary model with a thermodynamic model outperforms the widely used RNA structure predictors Pfold and RNAalifold.

Nevertheless, there are cases where the evolutionary conservation is more important. Especially for RNA families with an active site like H/ACA snoRNAs, the usage of reliably conserved substructures improves the predictions. A first guess, namely that the impact of the evolutionary substructure might be negatively correlated with the SCI, could not be confirmed. However, we observed a correlation to the number of structural clusters in an alignment as measured by Pcluster. This should be investigated in greater detail since it provides a new possibility to classify multiple RNA alignments. Currently, the measurements as used, e.g. in BRAliBase (26) are MPI and SCI.

There are two possible improvements. One could be the application of stacking base pair probabilities (56). These probabilities can be easily integrated in the MEA approach if the rule F→dFd of the SCFG is applied. Nevertheless, a first trial of this extension using the stacking base pair probabilities calculated by RNAfold (1) did not improve the performance of PETfold and may be due to the fact that these are not independent of the simple base pair probabilities.

A second possible improvement considers the fact that structural stability is at least partially accounted twice since Pfold already favors stable base pairs such as G—C. This does not pose a major problem since the thermodynamic part adds a lot of new information like sequence-dependent stacking or properties of the complete structure ensemble (encoded in base pair probabilities). Nevertheless, it would be nice to explicitly separate these different information sources in a future version.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online. The source code of PETfold is available under the GNU Public License at http://genome.ku.dk/resources/petfold.

FUNDING

Funding for open access charge: Albert-Ludwigs-Universität Freiburg.

Conflict of interest statement. None declared.