Position specific scoring matrices (PSSMs) for the apical loop and the upper stem of bacterial SECIS element are derived from the segment-based sequence alignment.Log-likelihood ratio was used for scoring the relative likelihood of each alignment. Each element M i, j in the PSSM was generated
according to the following formula:

where N is the total number of bSECIS elements in our training dataset; ni, j is the number of occurrences of letter i (the nucleotides A, C, G and U and the gap -) at position j of an alignment; and pi is the a priori probability of letter i. Given the assumption that the distribution of letters is independent, pi is an overall frequency of the letters in all sequences or the frequency within a subset of sequences (our training data, Akmaev et al., 2000). We tried both pi models and found that there was no significant difference between them in regard to computing the probability of the original alignment (Pali) using the following multinomial distribution:

where A is the total number of letters in the sequence alphabet (4 for DNA/RNA and 20 for protein) and C is the total number of columns in the alignment. The final PSSMs (Gribskov et al., 1987) were composed of 5 rows (A, C, G, U and gap -) and L columns (L = maximum length of an alignment).
To find optimal bSECIS elements in the candidate set, we developed a quasi-greedy alignment algorithm based on the Gotoh dynamic programming algorithm (Gotoh, 1982; Gotoh, 1999). First, the apical loop and the upper-stem of each candidate bSECIS element were compared with all possible sequence combinations derived from PSSMs. For a set of N different sequences and a 5xL profile matrix, the standard algorithm for comparison is a greedy algorithm with a time complexity of O(5LN3). However, many of these alignments are redundant or have very low scores. Therefore, to optimize the alignment algorithm, we added additional constraints, including eliminating sequence combinations with negative weight scores or excessive number of gaps. A similar but more sophisticated strategy was successfully applied to the problem of identification of RNA sequence motifs (Gorodkin et al., 2001). For each alignment, a modified Gotoh algorithm was used to compare the basic structural elements of each candidate bSECIS element to the profiles based on substitution scores and gap penalties (Gotoh, 1982). Our major modification to the original algorithm was in the scoring scheme. In the unmodified algorithm, score at any position in each alignment is based on the comparison of nucleotides at the corresponding positions in the aligned sequences. In our analysis, the score was obtained not only from the substitution score, but also from the weight matrices and our bSECIS structural model. The total score of each comparison was the sum of individual position scores.
The next step calculated a total weight score for every candidate bSECIS element by summing individual scores of each structural element, and compared alignments to each other. Finally, optimal bSECIS elements and their predicted ORFs were presented for selecting optimal M <= N sequences with weight scores greater than the predefined cutoff.
We approximated the occurrence of positive-scoring candidate bSECIS elements in a random database as a compound Poisson process. Therefore, alignment scores (S) should follow an Extreme Value Distribution (EVD, Frith et al., 2002) with the E-value (E) defined as:

where N is the length of a putative bSECIS element, F is the finite length correction, and
and K are normalizing parameters. This assumption is related to the BLAST statistics of Karlin and Altschul (Karlin and Altschul, 1990). In practice, the values of F and K are much less important than that of
, which appears in the exponent of the extreme value distribution.
Considering that the exact values for these parameters is hard to compute, we simplified the above formulas to give a closed solution based on three assumptions: (i) high-scoring bSECIS elements are rare (< 5%) in a random database; and (ii) optimal bSECIS elements do not have a significant tendency to overlap with themselves or (iii) with other optimal bSECIS candidates. Our E-values were calculated from three factors: bit score, length of each putative bSECIS element and size of a random database (see below). First, the training set SECIS elements were randomly inserted into a 1 Mb randomized nucleotide database. Normalized bit score (Sb) was derived from the raw weight score (Sw) using the following normalizing equation:

where the initial values of
and K were determined empirically according to the BLAST statistics:
= 1.5 and K = 0.5. As a result, bit scores and E-values could be independent of the original scoring system, allowing those calculated with particular rewards and penalties to be compared directly to those calculated with different rewards and penalties. We then enumerated all possible positive scoring sequences of a maximum length equal to the longest bSECIS element, and measured the distribution of their scores. To maintain good confidence levels, we defined a bit score threshold (T), which gives a low enough (< 5%) false positive rate (FPR). The following constraint should be satisfied:

where NSb>Tis the number of known bSECIS elements in the training set with bit score >= threshold, and
is the number of hits detected in the randomized nucleotide database with score >= threshold. We then adjusted
and K manually to determine a trade off until the Sb-FPR distribution fitted nicely a generalized EVD (data not shown). Finally the following approximate equation was obtained to calculate our E-value:

where L is the effective length of each bSECIS element.
Akmaev, V.R., Kelley, S.T. and Stormo, G.D. (2000) Phylogenetically enhanced statistical tools for RNA structure prediction. Bioinformatics, 16, 501-512.
Frith, M.C., Spouge, J.L., Hansen,U. and Weng, Z. (2002) Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res., 30, 3214-3224.
Gorodkin, J., Stricklin, S.L. and Stormo, G.D. (2001) Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Res., 29, 2135-2144.
Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705-708.
Gotoh, O. (1999) Multiple sequence alignment: algorithms and applications. Adv. Biophys., 36, 159-206.
Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci., 84, 4355-4358.
Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci., 87, 2264-2268. A. Two forms of eukaryotic SECIS elements The allowed lengths of helices and loops are indicated. Conserved nucleotides are shown in red, including the four non-Watson-Crick base pairs (quartet or SECIS core) and two unpaired adenosines in the apical loop of form I or internal loop of form II structures.
B. SECIS elements in Methanococcus jannaschii selenoprotein mRNAs Methanococcus jannaschii SECIS elements are shown. The conserved GAA___A internal loop and three consecutive C-G or G-C base pairs are highlighted in red. fdhA, formate dehydrogenase; vhuU, F420-non-reducing hydrogenase; fwdB, formylmethanofuran dehydrogenase; fruA, F420-reducing hydrogenase; selD, selenophosphate synthetase.
C. SECIS elements in the Escherichia coli formate dehydrogenase fdhF and fdnG mRNAs The bulged and apical loop nucleotides that interact with the translation elongation factor SelB are highlighted in red. The Sec UGA codon is shown in bold.
Selenoproteins are shown in red and Cys-containing homologs in blue. fdhF, formate dehydrogenase H; fdnG, formate dehydrogenase N; fdoG, formate dehydrogenase O; fdhA, formate dehydrogenase alpha subunit (subfamily designation is not clear).
Only sequences downstream of UGA/UGU codons are shown. In-frame UGA/UGU codons are shown in bold. Conserved guanosines (G) in the apical loops are highlighted in red.
Supplementary figures
Figure 1. SECIS elements in eukaryotes, archaea and bacteria.
Figure 2. Phylogenetic analysis of formate dehydrogenases.
Figure 3. Predicted bacterial SECIS elements in Sec-containing and Cys-containing fdnG genes in Mannheimia succiniciproducens.