1 INTRODUCTION

Macromolecules such as DNA, RNA and proteins have the ability to form diverse tertiary structures, which enable functionality and thus, life. For many decades, proteins were deemed the global players in the cell until RNA entered the spotlight. For example, RNA structures have been found to be catalytically active, which was assumed to be the privilege of proteins. Furthermore, small RNAs are known to regulate gene expression and RNA viruses employ a plethora of structure elements to invade the host cell.

To gain insight into macromolecule function, one must investigate the structure. The first step in RNA folding is stable base pairing that leads to a secondary structure. As RNA structure formation is of hierarchical nature, secondary structure is the basis for the tertiary fold that produces the functional structure. Especially for RNAs, structure determination by experimental means is an intricate and expensive task. Computational RNA structure prediction is therefore an invaluable tool for biologists. Comparative structure prediction is considered the most reliable approach for computational RNA structure prediction. Single sequence structure prediction is always limited by the accuracy of the underlying folding model. Three main streams have been identified for comparative RNA secondary structure prediction: (i) predict a structure from a pre-computed sequence alignment; (ii) simultaneously compute an alignment and a structure and (iii) alignment-free methods (Gardner and Giegerich, 2004).

Tools for multiple sequence alignments such as ClustalW (Thompson et al., 1994) are readily available and thus, structure prediction from an alignment is a tempting approach [e.g. RNAalifold (Hofacker et al., 2002)]. Such methods heavily depend on the sequence conservation and quality of the underlying alignment. However, ncRNAs are conserved rather on the structure level than on the sequence level. The gold standard of RNA comparative structure prediction is the Sankoff approach as it does not rely on a high-quality sequence alignment and captures the structural conservation of ncRNAs. Sankoff (1985) introduced a theoretical dynamic programming algorithm for simultaneous folding and aligning for a set of N sequences that takes O(n3N) time and O(n2N) space. Practical variants have been derived which more or less retain the Sankoff principle by sacrificing optimality. Alignment-free methods aim to avoid the pragmatic restrictions made in a practical Sankoff approach as well as the reliance on a high-quality alignment [e.g. CARNAC (Perriquet et al., 2003)]. Note that all of these comparative structure prediction methods exclude the prediction of RNA pseudoknots.

RNA pseudoknots are crossing structure elements with diverse functions. The principle of pseudoknot formation is that bases within a loop region pair with complementary unpaired bases outside the loop. From an algorithmic point of view, even the simplest type of pseudoknot adds considerable computational demands due to crossing base pairs. In fact, the majority of comparative RNA structure prediction methods exclude pseudoknots. Biologists have delivered a wealth of studies, which show that pseudoknots have an astonishing number of diverse functions and occur in most classes of RNA (Staple and Butcher, 2005). RNA viruses use pseudoknots for hijacking the replication apparatus of the host (Brierley et al., 2007).

A limited number of RNA comparative structure prediction methods can handle pseudoknots due to the computational complexity. Several of these methods take a sequence alignment as an input. ILM is an algorithm that takes as an input either individual sequences or a sequence alignment (Ruan et al., 2004). A base pair score matrix is prepared initially and helices are added to the structure in an iterative fashion. In the approach hxmatch, a maximum weighted matching algorithm with combined thermodynamic and covariance scores is used (Witwer et al., 2004). This program gives the option to be combined with RNAalifold. KNetFold is a machine learning method, which takes a sequence alignment as an input and outputs a consensus structure allowing pseudoknots (Bindewald and Shapiro, 2006). Simulfold takes an alignment as an input and simultaneously calculates a structure including pseudoknots, a multiple-sequence alignment and an evolutionary tree by sampling from the joint posterior distributions (Meyer and Miklos, 2007). Tfold combines stem stability, covariation and conservation to search for compatible stems and subsequently for pseudoknots for a set of aligned homologous sequences (Engelen and Tahi, 2010). Several comparative structure prediction methods including pseudoknots do not rely on an initial sequence alignment. The graph-theoretical approach comRNA computes stem similarity scores and uses a maximum clique finding algorithm to find pseudoknotted structures (Ji et al., 2004). SCARNA performs pairwise structural alignment of stem fragments with fixed lengths derived from the probability dot plot (Tabei et al., 2008).

In the following, a novel comparative approach for predicting structures including H-type pseudoknots called DotKnot-PW will be introduced. The input consists of two unaligned, evolutionarily related RNA sequences. Similarity scores between structure elements will be calculated. Statistically significant pairs will be used to find the set of conserved structure elements common to two sequences, which maximize a combined thermodynamic and similarity score. Using a hand-curated test set of pseudoknotted structures with experimental support, the prediction accuracy of DotKnot-PW will be compared with methods from the literature.

2 APPROACH

Pseudoknots are functional elements in RNA structures and therefore, the most promising approach for comparative prediction is a structure comparison with less focus on exact sequence matching. In fact, perfect conservation on the sequence level can be more of a curse than a blessing. Especially ncRNAs are known to evolve quickly and so-called consistent and compensatory base pairs in both sequences will give much more confidence for structure conservation than a sequence alignment. One strong point of the DotKnot method for single sequence pseudoknot prediction (Sperschneider and Datta, 2010; Sperschneider et al., 2011) is that the set of possible H-type pseudoknot candidates (and secondary structure elements) is explicitly computed and thus readily available for further investigation. The main steps in the pairwise pseudoknot prediction approach DotKnot-PW are as follows (Fig. 1):

Run DotKnot for two unaligned sequences Seqx and Seqy. This returns secondary structure element and H-type pseudoknot candidate dictionaries.Calculate pairwise base pair similarity scores for secondary structure elements and H-type pseudoknot candidates. Keep significant pairs that have a low estimated P-value.Use significant pairs to calculate the set of conserved structure elements and pseudoknots for the two sequences that maximizes a combined free energy and similarity score.

The key point of the DotKnot-PW approach is how to score the similarity of stems, secondary structure elements and H-type pseudoknot candidates derived from sequences Seqx and Seqy. Related work has been done for stem finding in unaligned sequences, where stem candidates are assigned a matching score across unaligned sequences, e.g. in SCARNA. Another point is how to assess the significance of a similarity score using P-values. These points will be explained in detail in the following section.

3.1 Base pair similarity scores of stems

For two given stems si(x) and sj(y) with fixed lengths in sequences Seqx and Seqy, respectively, the base pair similarity score sim[si(x), sj(y)] is calculated using an ungapped local structure alignment of the base pairs with the RIBOSUM85-60 matrix. As an example, consider the following optimal ungapped local structure alignment of the two stems with base pair similarity score of sim[s1(x), s2(y)] = 22.04 using the RIBOSUM85-60 matrix.

s1(x)

UCUCUAUC … … .GAUAGAGA

((((((((… … .))))))))

s1(y)

–UUGUAC… … .GUACAA–

–((((((… … .))))))–

To evaluate the significance of base pair similarity scores instead of the raw score, one has to find out what the underlying probability distribution is. Similar to the case of ungapped local sequence alignments (Karlin and Altschul, 1990), it is assumed here that the base pair similarity scores follow an extreme value distribution. However, the main difference is that a comparison between fixed-length stem fragments is made. It is important to remember that parameters λ and K describe the extreme value distribution of optimal local alignment scores in the asymptotic limit of long sequences (Altschul et al., 2001). Here, the parameters for the generalized extreme value distribution are pre-calculated using maximum likelihood fitting of a distribution to the histogram of a large sample of random base pair similarity scores. The maximum likelihood fitting was performed using the ismev package of the R statistical language for a range of stem lengths (see Supplementary Material). The P-value is defined as the probability to obtain a score greater than or equal to the observed score strictly by chance. A stem si(x) in sequence Seqx and a stem sj(x) in sequence Seqy are a significant pair if the score sim[si(x), sj(y)] has an estimated P-value less than α. Stem pairs with a P-value larger than α are not considered in the following.

3.2 Base pair similarity scores of interrupted stems

For two interrupted stems, the base pair similarity score is calculated by deleting bulges and internal loops and scoring stems as consecutive base pairs. Base pair similarity scores for regular and interrupted stems are also calculated if the difference in number of base pairs is less than 5. For example, a stem with one bulge might be a conserved match with a regular stem. A stem si(x) in sequence Seqx and a stem sj(y) in sequence Seqy are a significant pair if the score sim[si(x), sj(y)] has an estimated P-value less than α.

3.3 Base pair similarity scores of multiloops

Calculating the base pair similarity score for two multiloop structures is complex due to the variety of inner loop elements, which may be regular or interrupted stems. A multiloop can be decomposed into an outer stem and a set of inner structure elements Si(x) = [s1(x),…,sk(x)], where k ≥ 2. The base pair similarity score for the outer stems of two multi-loops can be easily obtained from the previously calculated base pair similarity scores. If the outer stem is a conserved match, a local alignment on the set of inner structure elements is used to find the base pair similarity score. Here, gaps are allowed in the local alignment of inner structure elements; however, no gap penalty is used. Let two sets of inner structure elements Si(x) = [s1(x),…, sn(x)] and Sj(y) = [s1(y),…, sm(y)] be given. Let H(i, j) be the maximum similarity score between a suffix of Si(x) and a suffix of Sj(y). The optimal local alignment is calculated as follows:

A multiloop in sequence Seqx and a multiloop in sequence Seqy are a significant pair if the similarity score has an estimated P-value less than α.

3.4 Base pair similarity scores of H-type pseudoknots

A H-type pseudoknot has two pseudoknot stems S1 and S2. The prerequisite for a conserved pseudoknot pair is that both core H-type pseudoknot stem pairs [S1(x), S1(y)] and [S2(x), S2(y)] are significant. The base pair similarity score for two H-type pseudoknots pi(x) and pj(y) in sequences Seqx and Seqy, respectively, is the sum of base pair similarity scores for the core pseudoknot stems as well as the base pair similarity score from a gapped local alignment of the recursive secondary structure elements in the loops (as described for multiloops). A pseudoknot pi(x) in sequence Seqx and a pseudoknot pj(y) in sequence Seqy are a significant pseudoknot pair if the similarity score sim[pi(x), pj(y)] has an estimated P-value less than α.

3.5 Dissimilarity and weight of significant structure elements pairs

The base pair similarity score calculated in the previous sections might not be powerful enough to distinguish true positive conserved structure element pairs from false-positive structure element pairs due to the finite lengths of stems and exclusion of loop sequences in the alignment. Therefore, a dissimilarity score is also used to confirm whether a pair is significant. The dissimilarity for two given structure elements si(x) and sj(y) in sequences Seqx and Seqy is defined as:

where dissim1 is the difference in the stem lengths and dissim2 is the difference in the number of loop lengths. As an example, consider the pseudoknot pair p1(x) and p1(y) in sequences Seqx and Seqy, respectively, with stems S1, S2 and loops L1, L2, L3. The pseudoknot pair has dissimilarity of 6.

p1(x)

UCUCUAUCAGAAUGGAUGUCUUGCUGCUAUAAUAGAUAGAGAAGGUUAUAGCAG

((((((((… … … … …[[[[[[[[[[.)))))))).]]]]]]]]]]

p1(y)

UUGUACAGAAUGGUAAGCCAAGUGUCAAUAGGAGGUACAAGCAACCUAUUGCAU

((((((… … … … …[[[.[[[[[[[.))))))… .]]]]]]]]]]

A weight is assigned to a significant pair, which is a combination of the free energy, covariation and dissimilarity. The overall weight s of a significant structure element pair [si(x), sj(y)] in sequences Seqx and Seqy is a combination of the free energy weights w[si(x)] and w[sj(y)], base pair similarity score sim[si(x), sj(y)] and dissimilarity dissim[si(x), sj(y)]:

Only structure element pairs with positive score s are allowed in the following dynamic programming algorithm. Here, α and γ are set to 0.5 and β is set to 1.

3.6 Finding best set of significant structure elements

Let be the number of structure elements in the first sequence Seqx and be the number of structure elements in the second sequence Seqy. Each structure element has a left and right endpoint in the sequence and is a stem, interrupted stem, multiloop or H-type pseudoknot. Structure elements can also be represented as nodes in a graph. In each sequence, the structure elements are ordered by their right endpoints. An edge is drawn between two structure elements in the first and the second sequence if their base pair similarity score has a P-value less than α. Given the set of edges between nodes and , the goal is to find the set of edges with maximum weight that are non-crossing. This relates to finding the set of non-overlapping structure elements in the two sequences that maximize the score under the requirement that the interval ordering is preserved. A set of structure elements in the first and second sequence, which preserves the interval ordering is called a feasible structure element alignment and must satisfy the following two requirements. Each structure element can be aligned with at most one other structure element in the other sequence. The order of structure elements must be preserved with respect to the alignment. That is, if structure elements and in the first sequence are aligned with and in the second sequence, respectively, the pairs may never overlap: (Fig. 2).

Given nodes in the first sequence Seqx and in the second sequence Seqy, let f(i, a) be the maximum sum of edge weights for nodes between 1 and i in the first sequence and 1 and a in the second sequence such that the edges are non-crossing (i ≤ n and a ≤ m). The nodes that maximize the sum of edge weights are called an optimal structure element alignment for the two sequences. The optimal structure element alignment is calculated using dynamic programming. For a given structure element pi with start point ai and end point bi, let pre(i) be the non-overlapping predecessor. For each structure element, its predecessor is pre-computed using the sorted list of structure elements. The recursion for calculating the optimal structure element alignment is as follows:

Furthermore, nested structures are taken into account for significant outer stem pairs, which have estimated P-value less than α. For each significant outer hairpin loop pair, the optimal structure element alignment of inner elements is computed.

3.7 Time requirements

For two unaligned RNA sequences Seqx and Seqy, the single sequence prediction method DotKnot returns structure element dictionaries derived from the probability dot plot. Let n and m be the number of structure elements in sequences Seqx and Seqy, respectively. Calculating the similarity scores and the optimal structure element alignment takes O(nm) time. Furthermore, nested structures are taken into account for significant outer stem pairs, which have estimated P-value less than α. Let a be the number of significant stem pairs, where both stems are hairpin loops. For each significant outer hairpin loop pair, the optimal structure alignment of inner elements is computed. In the worst case, this increases time requirements to O(a × nm). The number of structure elements depends on the base composition of the sequence. Empirically, n and m can be observed to grow linearly with the length of the sequence for uniform base distribution (see Supplementary Material). In practice, DotKnot-PW can be expected to run in the order of minutes for sequences shorter than 500 nt.

4 RESULTS

Many pseudoknot prediction programs have been evaluated using all the entries in the PseudoBase database (van Batenburg et al., 2000). There are several caveats in this approach. First, the sequences given in PseudoBase are those which exactly harbor the pseudoknot. However, in practice structure prediction algorithms will be applied to longer sequences without prior knowledge of the pseudoknot location. Second, long-range pseudoknot entries appear in a truncated version in the database. Third, some classes of pseudoknots have a large number of entries (such as short H-type pseudoknots in the 3′-untranslated regions of plant viruses), whereas more complex types of pseudoknots only have one representative (such as long-range rRNA pseudoknots). Therefore, a hand-curated dataset of pseudoknot structures will be used here.

When it comes to pseudoknots, many structures have been published based on a secondary structure predicted by free energy minimization. These predicted secondary structures are used as a working model and refined using experimental techniques such as chemical and enzymatic probing. However, the native structure remains unsolved unless tertiary structure determination methods such as X-ray crystallography are used. Testing structures that are based on computer predictions with no experimental support creates a bias in the benchmark and will be avoided in this evaluation.

A total of 16 pseudoknotted reference structures from different RNA types were collected, which have strong experimental support. For each reference structure, a supporting set of 10 evolutionarily related sequences was obtained from the RFAM database (Gardner et al., 2010). Note that for the vast majority of supporting sequences, no experimentally determined structures are available. The average pairwise sequence identities vary from 55% to 99%. Given a reference structure, the performance of prediction algorithms is evaluated in terms of sensitivity (S), i.e. the percentage of base pairs in the reference structure, which are predicted correctly, as well as positive predictive value (PPV), i.e. the percentage of predicted pairs, which are in the reference structure. The Matthews correlation coefficient (MCC) is also reported and is in the range from −1 to 1, where 1 corresponds to a perfect prediction and −1 to a prediction that is in total disagreement with the reference structure. The performance of each method for predicting the reference structure was evaluated as described in Gardner and Giegerich (2004).

DotKnot-PW was compared with methods that are freely available and use standard input and output formats. The comparative methods are CARNAC, Tfold and hxmatch (with the -A option using RNAalifold). All of these methods return structure predictions for only the reference structure with regards to the support set of evolutionarily related sequences. Tfold and hxmatch take a sequence alignment as the input. ClustalW with the default parameters was used to produce the initial sequence alignment. DotKnot-PW and CARNAC take a set of unaligned sequences as the input. Furthermore, prediction results for the reference sequence (not the supporting sequences) were obtained from the single sequence methods DotKnot (Sperschneider and Datta, 2010; Sperschneider et al., 2011), ProbKnot (Bellaousov and Mathews, 2010), IPknot (Sato et al., 2011) and RNAfold (Hofacker et al., 1994). Note that all methods except CARNAC and RNAfold allow pseudoknot prediction.

The results are shown in Table 1. DotKnot-PW has the highest average MCC of 0.75 for the test sequences. For each reference structure with the 10 support sequences from the corresponding RFAM family, 10 predictions are returned ordered by the combined free energy and similarity score. If only the pairwise prediction with highest combined free energy and similarity score is taken, DotKnot-PW has an improved average MCC of 0.81. Tfold and hxmatch have average MCC of 0.6 and 0.59, respectively. CARNAC has average MCC of 0.45 with much higher average specificity than sensitivity.

The prediction results for single sequence structure prediction for each of the reference sequences with experimentally determined structures are also shown in Table 1. Note that this does not include the prediction for the support sequences from RFAM, as no experimentally determined structures are available. All single sequence pseudoknot prediction methods show improved results over using RNAfold. DotKnot has the highest average MCC of 0.76, followed by IPKnot and ProbKnot.

As an example, consider the S15 mRNA pseudoknot that binds to specific proteins in the autoregulation mechanism of ribosomal protein S15 synthesis (Philippe et al., 1995). For the reference sequence S15 and 10 support sequences from the corresponding RFAM family, DotKnot-PW returns pairwise predictions ordered by the combined free energy and similarity score. The top two pairwise predictions with the highest scores are shown in Figure 3.

5.1 The underlying folding model

Single sequence prediction methods are always limited by the underlying RNA folding model. This may be the set of free energy parameters used by free energy minimization methods or the underlying methodological framework such as maximum expected accuracy methods. DotKnot-PW shows excellent results on H-type pseudoknots with short interhelix loops. For this type of pseudoknots, DotKnot-PW uses free energy pseudoknot parameters by Cao and Chen (2006, 2009) based on polymer statistical mechanics. Improvements of the accuracy of free energy parameters, both for secondary structures and pseudoknots, will lead to more accurate prediction methods. However, one has to keep in mind that the algorithms themselves must be designed in such a fashion that novel parameters can be efficiently incorporated. The heuristic framework of DotKnot-PW has been designed such that it can incorporate sophisticated free energy parameters for pseudoknots, secondary structures and coaxial stacking. In the future, DotKnot-PW could also use contributions from basic tertiary structure elements such as base triples around the pseudoknot junction or stem-loop interactions.

5.2 The type of pseudoknot

Pseudoknot prediction algorithms come in two flavors: either they can predict a certain, restricted class of pseudoknots or they do not have a restriction on the type of pseudoknot that can be predicted. For methods using free energy parameters, the inclusion of general types of pseudoknots might be more of a curse than a blessing, as no reliable free energy parameters for complex pseudoknots are available. DotKnot-PW has restrictions on the type of pseudoknot that can be predicted. However, this does not always lead to poor prediction results in practice. For example, DotKnot-PW shows the best result for the HDV ribozyme, which is a complex double nested pseudoknot.

5.3 The trouble with benchmarking

The results from the benchmark for structure prediction in Table 1 must be interpreted with care. First, the tested methods can be run with different parameters, possibly producing better results. However, as a typical user has no prior knowledge about the structure, the default parameters for each method are used. Of course, a comprehensive benchmark should include a larger number of structures to obtain a more reliable evaluation. However, in this study, the focus has been on a test set where the structures are supported experimentally. Many structures have been published, which were determined using computational tools and this will inevitably create a bias in a benchmark, and thus they were excluded here.

5.4 Gaining confidence with multiple sequences

Here, an extension of the single sequence prediction method DotKnot was presented based on the pairwise comparison of structure elements. This approach called DotKnot-PW is designed as an algorithm for finding the structure including H-type pseudoknots common to two sequences. As shown in Table 1, DotKnot-PW can greatly improve structure predictions for RNA families when compared with the single sequence prediction using DotKnot. In some cases, a comparative approach might have lower sensitivity than a single sequence prediction; however, this should not generally be judged as ‘inferior’. For example, ncRNAs might preserve some integral base pairs throughout evolution and only these will be detected by a comparative approach, which returns the set of base pairs common to a set of evolutionarily related sequences. DotKnot-PW uses a set of unaligned sequences as the input; therefore, no expert user intervention is required. In the future, DotKnot-PW will be extended to include intramolecular kissing hairpins. Furthermore, constrained folding will be implemented to predict a structure subject to constraints, e.g. enforce certain base pairs or regions, which must remain unpaired.

6 CONCLUSION

DotKnot-PW has been designed as a dedicated pseudoknot prediction tool and should be applied to RNA sequences where pseudoknotted interactions are suspected in the structures. Prediction accuracy will inevitably decrease for sequences, which are longer than say 400 nt for any single sequence structure prediction method (Reeder et al., 2006). To achieve reliable results, short sequences should be folded using DotKnot and predictions should be compared with results from other methods from the literature. To gain confidence in predictions, subsequent comparative prediction using DotKnot-PW and other comparative methods is highly recommended. Ideally, experimental verification of computationally predicted pseudoknots should be sought.