Proceedings of the 8th International Workshop on Tree Adjoining Grammar and Related Formalisms, pages 57–64,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Stochastic Multiple Context-Free Grammar
for RNA Pseudoknot Modeling
Yuki Kato
Graduate School of
Information Science,
Nara Institute of
Science and Technology
8916-5 Takayama, Ikoma,
Nara 630-0192, Japan
yuuki-ka@is.naist.jp
Hiroyuki Seki
Graduate School of
Information Science,
Nara Institute of
Science and Technology
8916-5 Takayama, Ikoma,
Nara 630-0192, Japan
seki@is.naist.jp
Tadao Kasami
Graduate School of
Information Science,
Nara Institute of
Science and Technology
8916-5 Takayama, Ikoma,
Nara 630-0192, Japan
kasami@naist.jp
Abstract
Several grammars have been proposed for modeling RNA pseudoknotted structure. In this paper, we focus on multiple context-free grammars (MCFGs), which are a natural extension of context-free grammars and can represent pseudoknots, and extend a specific subclass of MCFGs to a probabilistic model called SMCFG. We present a polynomial time parsing algorithm for finding the most probable derivation tree and a probability parameter estimation algorithm. Furthermore, we show some experimental results of pseudoknot prediction using the SMCFG algorithm.
1 Introduction
Non-coding RNAs fold into characteristic struc-
tures determined by interactions between mostly
Watson-Crick complementary base pairs. Such a
base paired structure is called the secondary structure. A pseudoknot (Figure 1 (a)) is one of the typical substructures found in the secondary structures
of several RNAs, including rRNAs, tmRNAs and
viral RNAs. An alternative graphic representation
of a pseudoknot is arc depiction where arcs con-
nect base pairs (Figure 1 (b)). It has been rec-
ognized that pseudoknots play an important role
in RNA functions such as ribosomal frameshifting
and regulation of translation.
Many attempts have so far been made at modeling RNA secondary structure by formal grammars. In a grammatical approach, secondary structure prediction can be viewed as a parsing problem.
However, there may be many different derivation
trees for an input sequence. Thus, it is necessary to
have a method of extracting biologically realistic
[Figure 1: Example of RNA secondary structure. (a) Pseudoknot. (b) Arc depiction of (a), drawn over the sequence c a g g c u g a c c u g c u c a g.]
derivation trees among them. One solution to this
problem is to extend a grammar to a probabilistic
model and find the most likely derivation tree, and
another is to take free energy minimization into ac-
count. Eddy and Durbin (1994), and Sakakibara et
al. (1994) modeled RNA secondary structure with-
out pseudoknots by using stochastic context-free
grammars (stochastic CFGs or SCFGs). For pseudoknotted structure (Figure 1 (a)), however, another approach has to be taken, since a single CFG lacks the generative power to represent the crossing dependencies of base pairs in pseudoknots (Figure 1 (b)). Brown and Wilson (1996) proposed a model based on intersections of SCFGs
to describe RNA pseudoknots. Cai et al. (2003)
introduced a model based on parallel communi-
cation grammar systems using a single CFG syn-
chronized with a number of regular grammars.
Akutsu (2000) provided dynamic programming al-
gorithms for RNA pseudoknot prediction without
using grammars. On the other hand, several gram-
mars have been proposed where the grammar itself
can fully describe pseudoknots. Rivas and Eddy
(1999, 2000) provided a dynamic programming
algorithm for predicting RNA secondary structure
including pseudoknots, and introduced a new class
of grammars called RNA pseudoknot grammars
(RPGs) for deriving sequences with gaps. Uemura et al. (1999) defined specific subclasses of tree adjoining grammars (TAGs) named SL-TAGs and extended SL-TAGs (ESL-TAGs), and predicted RNA pseudoknots by using the parsing algorithm for ESL-TAGs. Matsui et al. (2005)
proposed pair stochastic tree adjoining grammars
(PSTAGs) based on ESL-TAGs and tree automata
for aligning and predicting pseudoknots, which
showed good prediction accuracy. These grammars have generative power stronger than that of CFGs and polynomial time algorithms for the parsing problem.
In our previous work (Kato et al., 2005),
we identified RPGs, SL-TAGs and ESL-TAGs
as subclasses of multiple context-free grammars
(MCFGs) (Kasami et al., 1988; Seki et al., 1991),
which can model RNA pseudoknots, and showed a
candidate subclass of the minimum grammars for
representing pseudoknots. The generative power
of MCFGs is stronger than that of CFGs and
MCFGs have a polynomial time parsing algo-
rithm like the CYK (Cocke-Younger-Kasami) al-
gorithm for CFGs. In this paper, we extend the
above candidate subclass of MCFGs to a prob-
abilistic model called a stochastic MCFG (SM-
CFG). We present a polynomial time parsing algo-
rithm for finding the most probable derivation tree,
which is applicable to RNA pseudoknot prediction. In addition, we describe a probability parameter estimation method based on the EM (expectation maximization) algorithm. Finally, we show some experimental results on pseudoknot prediction for three RNA families using the SMCFG algorithm, which show good prediction accuracy.
2 Stochastic Multiple Context-Free
Grammar
A stochastic multiple context-free grammar (stochastic MCFG, or SMCFG) is a probabilistic extension of MCFG (Kasami et al., 1988; Seki et al., 1991) or linear context-free rewriting system (Vijay-Shanker et al., 1987). An SMCFG is a 5-tuple $G = (N, T, F, P, S)$ where $N$ is a finite set of nonterminals, $T$ is a finite set of terminals, $F$ is a finite set of functions, $P$ is a finite set of (production) rules and $S \in N$ is the start symbol. For each $A \in N$, a positive integer denoted by $\dim(A)$ is given and $A$ derives $\dim(A)$-tuples of terminal sequences. For the start symbol $S$, $\dim(S) = 1$. For each $f \in F$, positive integers $d_i$ ($0 \le i \le k$) are given and $f$ is a total function from $(T^*)^{d_1} \times \cdots \times (T^*)^{d_k}$ to $(T^*)^{d_0}$ where each component of $f$ is defined as the concatenation of some components of arguments and constant sequences. Note that each component of an argument should occur in the function value at most once (linearity). For example, $f[(x_{11}, x_{12}), (x_{21}, x_{22})] = (x_{11} x_{21}, x_{12} x_{22})$. Each rule in $P$ has the form $A_0 \xrightarrow{p} f[A_1, \ldots, A_k]$ where $A_i \in N$ ($0 \le i \le k$), $f : (T^*)^{\dim(A_1)} \times \cdots \times (T^*)^{\dim(A_k)} \to (T^*)^{\dim(A_0)} \in F$ and $p$ is a real number with $0 \le p \le 1$ called the probability of this rule. The probabilities of the rules with the same left-hand side should sum to one. If we are not interested in $p$, we just write $A_0 \to f[A_1, \ldots, A_k]$. If $k \ge 1$, then the rule is called a nonterminating rule, and if $k = 0$, then it is called a terminating rule. A terminating rule $A_0 \to f[\,]$ with $f^{[h]}[\,] = \beta_h$ ($1 \le h \le \dim(A_0)$) is simply written as $A_0 \to (\beta_1, \ldots, \beta_{\dim(A_0)})$.
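The definition of an MCFG function can be made concrete with a small sketch. The encoding below is an assumption made for illustration only and is not part of the paper: each output component is a list whose items are either constant strings or pairs (argument index, component index), using 0-based indices. The names `F_EXAMPLE`, `is_linear` and `apply_function` are hypothetical.

```python
# Hypothetical encoding of f[(x11, x12), (x21, x22)] = (x11 x21, x12 x22).
# Each item is a constant string or a pair (argument index, component index).
F_EXAMPLE = [
    [(0, 0), (1, 0)],  # first output component:  x11 x21
    [(0, 1), (1, 1)],  # second output component: x12 x22
]


def is_linear(template):
    """Return True if every argument component occurs at most once."""
    used = [item for comp in template for item in comp
            if isinstance(item, tuple)]
    return len(used) == len(set(used))


def apply_function(template, args):
    """Evaluate the encoded function on a list of tuples of strings."""
    if not is_linear(template):
        raise ValueError("function violates the linearity condition")
    return tuple(
        "".join(item if isinstance(item, str) else args[item[0]][item[1]]
                for item in comp)
        for comp in template
    )


result = apply_function(F_EXAMPLE, [("ab", "cd"), ("e", "f")])
print(result)  # ('abe', 'cdf')
```

The linearity check mirrors the condition that each component of an argument occurs at most once in the function value; a template that reuses a component is rejected before evaluation.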
We recursively define the relation $\stackrel{*}{\Rightarrow}$ by the following (L1) and (L2): (L1) if $A \xrightarrow{p} \alpha \in P$ ($\alpha \in (T^*)^{\dim(A)}$), then we write $A \stackrel{*}{\Rightarrow} \alpha$ with probability $p$, and (L2) if $A \xrightarrow{p} f[A_1, \ldots, A_k] \in P$ and $A_i \stackrel{*}{\Rightarrow} \alpha_i \in (T^*)^{\dim(A_i)}$ ($1 \le i \le k$) with probabilities $p_1, \ldots, p_k$, respectively, then we write $A \stackrel{*}{\Rightarrow} f[\alpha_1, \ldots, \alpha_k]$ with probability $p \cdot \prod_{i=1}^{k} p_i$. In parallel with the relation $\stackrel{*}{\Rightarrow}$, we define derivation trees as follows: (D1) if $A \xrightarrow{p} \alpha \in P$ ($\alpha \in (T^*)^{\dim(A)}$), then the ordered tree with the root labeled $A$ which has $\alpha$ as its only child is a derivation tree for $\alpha$ with probability $p$, and (D2) if $A \xrightarrow{p} f[A_1, \ldots, A_k] \in P$, $A_i \stackrel{*}{\Rightarrow} \alpha_i \in (T^*)^{\dim(A_i)}$ ($1 \le i \le k$) and $t_1, \ldots, t_k$ are derivation trees for $\alpha_1, \ldots, \alpha_k$ with probabilities $p_1, \ldots, p_k$, respectively, then the ordered tree with the root labeled $A$ (or $A : f$ if necessary) which has $t_1, \ldots, t_k$ as (immediate) subtrees from left to right is a derivation tree for $f[\alpha_1, \ldots, \alpha_k]$ with probability $p \cdot \prod_{i=1}^{k} p_i$. Example rules are $A \xrightarrow{0.3} f[A]$ where $f[(x_1, x_2)] = (a x_1 b, c x_2 d)$ and $A \xrightarrow{0.7} (ab, cd)$. Then, $A \stackrel{*}{\Rightarrow} (ab, cd)$ by the second rule, which is followed by $A \stackrel{*}{\Rightarrow} f[(ab, cd)] = (aabb, ccdd)$ by the first rule. The probability of the latter derivation is
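$0.3 \cdot 0.7 = 0.21$. The language generated by an SMCFG $G$ is defined as $L(G) = \{w \in T^* \mid S \stackrel{*}{\Rightarrow} w \text{ with probability greater than } 0\}$.

Table 1: SMCFG $G_s$

| Type | Rule set | Function | Transition probability | Emission probability |
|---|---|---|---|---|
| E | $W_v \to (\varepsilon, \varepsilon)$ | | 1 | 1 |
| S | $W_v \to J[W_y]$ | $J[(x_1, x_2)] = x_1 x_2$ | $t_v(y)$ | 1 |
| D | $W_v \to SK[W_y]$ | $SK[(x_1, x_2)] = (x_1, x_2)$ | $t_v(y)$ | 1 |
| B1 | $W_v \to C_1[W_y, W_z]$ | $C_1[x_1, (x_{21}, x_{22})] = (x_1 x_{21}, x_{22})$ | 1 | 1 |
| B2 | $W_v \to C_2[W_y, W_z]$ | $C_2[x_1, (x_{21}, x_{22})] = (x_{21} x_1, x_{22})$ | 1 | 1 |
| B3 | $W_v \to C_3[W_y, W_z]$ | $C_3[x_1, (x_{21}, x_{22})] = (x_{21}, x_1 x_{22})$ | 1 | 1 |
| B4 | $W_v \to C_4[W_y, W_z]$ | $C_4[x_1, (x_{21}, x_{22})] = (x_{21}, x_{22} x_1)$ | 1 | 1 |
| U1L | $W_v \to UP^{a_i}_{1L}[W_y]$ | $UP^{a_i}_{1L}[(x_1, x_2)] = (a_i x_1, x_2)$ | $t_v(y)$ | $e_v(a_i)$ |
| U1R | $W_v \to UP^{a_j}_{1R}[W_y]$ | $UP^{a_j}_{1R}[(x_1, x_2)] = (x_1 a_j, x_2)$ | $t_v(y)$ | $e_v(a_j)$ |
| U2L | $W_v \to UP^{a_k}_{2L}[W_y]$ | $UP^{a_k}_{2L}[(x_1, x_2)] = (x_1, a_k x_2)$ | $t_v(y)$ | $e_v(a_k)$ |
| U2R | $W_v \to UP^{a_l}_{2R}[W_y]$ | $UP^{a_l}_{2R}[(x_1, x_2)] = (x_1, x_2 a_l)$ | $t_v(y)$ | $e_v(a_l)$ |
| P | $W_v \to BP^{a_i a_l}[W_y]$ | $BP^{a_i a_l}[(x_1, x_2)] = (a_i x_1, x_2 a_l)$ | $t_v(y)$ | $e_v(a_i, a_l)$ |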
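The example rules $A \xrightarrow{0.3} f[A]$ and $A \xrightarrow{0.7} (ab, cd)$ from the definition of derivations can be traced with a few lines of code. This is a minimal sketch; the helper name `derive` and its argument are hypothetical and only serve to count how often the first rule is applied on top of the terminating rule.

```python
def f(x):
    """f[(x1, x2)] = (a x1 b, c x2 d), as in the example rules."""
    x1, x2 = x
    return ("a" + x1 + "b", "c" + x2 + "d")


def derive(num_wrapping_steps):
    """Apply the terminating rule once, then the rule A -> f[A] repeatedly.

    Returns the derived tuple and the probability of that derivation.
    """
    value, prob = ("ab", "cd"), 0.7      # A -> (ab, cd) with probability 0.7
    for _ in range(num_wrapping_steps):  # A -> f[A] with probability 0.3
        value, prob = f(value), 0.3 * prob
    return value, prob


value, prob = derive(1)
print(value, round(prob, 2))  # ('aabb', 'ccdd') 0.21
```

Each further application of the first rule multiplies the derivation probability by 0.3, which is exactly the product $p \cdot \prod_i p_i$ in (L2) for a rule with a single nonterminal on its right-hand side.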
In this paper, we focus on an SMCFG $G_s = (N, T, F, P, S)$ that satisfies the following conditions: $G_s$ has $m$ different nonterminals denoted by $W_1, \ldots, W_m$, each of which uses exactly one type of rule, denoted by E, S, D, B1, B2, B3, B4, U1L, U1R, U2L, U2R or P¹ (see Table 1). The type of $W_v$ is denoted by type($v$) and we predefine type(1) = S, that is, $W_1$ is the start symbol. Consider a sample rule set $W_v \to UP^{\alpha}_{1L}[W_y] \mid UP^{\alpha}_{1L}[W_z]$ where $UP^{\alpha}_{1L}[(x_1, x_2)] = (\alpha x_1, x_2)$ and $\alpha \in T$. For each rule $r$, two real values called the transition probability $p_1$ and the emission probability $p_2$ are specified in Table 1. The probability of $r$ is simply defined as $p_1 \cdot p_2$. In applications, $p_1 = t_v(y)$ and $p_2 = e_v(a_i), \ldots$ in Table 1 are the parameters of the grammar, which are set by hand or by a training algorithm (Section 3.3) depending on the set of possible sequences to be analyzed.
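To make the functions of Table 1 concrete, the following sketch writes them as plain Python functions on pairs of strings and builds one small derivation bottom-up. The parameter values (0.25, 0.5 and transition probabilities of 1.0) are made up for illustration, and the function signatures (emitted symbols passed as leading arguments) are an assumption of this sketch.

```python
def J(x):
    """START: J[(x1, x2)] = x1 x2."""
    return x[0] + x[1]

def SK(x):
    """DELETE: SK[(x1, x2)] = (x1, x2)."""
    return (x[0], x[1])

# BIFURCATION: a one-dimensional string y is attached to a pair x.
def C1(y, x): return (y + x[0], x[1])
def C2(y, x): return (x[0] + y, x[1])
def C3(y, x): return (x[0], y + x[1])
def C4(y, x): return (x[0], x[1] + y)

# UNPAIR: one unpaired symbol a is added at one of the four ends.
def UP1L(a, x): return (a + x[0], x[1])
def UP1R(a, x): return (x[0] + a, x[1])
def UP2L(a, x): return (x[0], a + x[1])
def UP2R(a, x): return (x[0], x[1] + a)

def BP(a, b, x):
    """PAIR: BP[(x1, x2)] = (a x1, x2 b)."""
    return (a + x[0], x[1] + b)

def rule_probability(transition, emission):
    """Probability of a rule: p1 * p2."""
    return transition * emission

# Bottom-up derivation with hypothetical parameters.
value, prob = ("", ""), 1.0                                  # E: (eps, eps)
value, prob = UP1L("a", value), prob * rule_probability(1.0, 0.25)
value, prob = BP("g", "c", value), prob * rule_probability(1.0, 0.5)
sequence, prob = J(value), prob * rule_probability(1.0, 1.0)
print(sequence, prob)  # gac 0.125
```

In this derivation the pair rule places g and c at the outer ends of the two components, so the derived sequence gac carries one base pair (g, c) around one unpaired a.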
3 Algorithms for SMCFG
In RNA structure analysis using stochastic gram-
mars, we have to deal with the following three
problems: (1) calculate the optimal alignment of
a sequence to a stochastic grammar (alignment
problem), (2) calculate the probability of a se-
quence given a stochastic grammar (scoring prob-
lem), and (3) estimate optimal probability param-
eters for a stochastic grammar given a set of exam-
ple sequences (training problem). In this section,
we give solutions to each problem for the specific
SMCFG Gs = (N,T,F,P,S).
¹These types stand for END, START, DELETE, BIFURCATION, UNPAIR and PAIR, respectively.
3.1 Alignment Problem
The alignment problem for $G_s$ is to find the most probable derivation tree for a given input sequence. This problem can be solved by a dynamic
programming algorithm similar to the CYK algorithm for SCFGs (Durbin et al., 1998), and in this paper, we also call the parsing algorithm for $G_s$ the CYK algorithm. We fix an input sequence $w = a_1 \cdots a_n$ ($|w| = n$). Let $\gamma_v(i, j)$ and $\gamma_y(i, j, k, l)$ be the logarithms of the maximum probabilities of a derivation subtree rooted at a nonterminal $W_v$ for a terminal subsequence $a_i \cdots a_j$ and of a derivation subtree rooted at a nonterminal $W_y$ for a tuple of terminal subsequences $(a_i \cdots a_j, a_k \cdots a_l)$, respectively. The variables $\gamma_v(i, i-1)$ and $\gamma_y(i, i-1, j, j-1)$ are the logarithms of the maximum probabilities for an empty sequence $\varepsilon$ and a pair of $\varepsilon$. Let $\tau_v(i, j)$ and $\tau_y(i, j, k, l)$ be traceback variables for constructing a derivation tree, which are calculated together with $\gamma_v(i, j)$ and $\gamma_y(i, j, k, l)$. We define $C_v = \{y \mid W_v \to f[W_y] \in P, f \in F\}$. To avoid non-emitting cycles, we assume that the nonterminals are numbered such that $v < y$ for all $y \in C_v$. The CYK algorithm uses a five-dimensional dynamic programming matrix to calculate $\gamma$, which leads to $\log P(w, \hat{\pi} \mid \theta)$ where $\hat{\pi}$ is the most probable derivation tree and $\theta$ is the entire set of probability parameters. The detailed description of the CYK algorithm is as follows:
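Algorithm 1 (CYK).
Initialization:
for $i \leftarrow 1$ to $n+1$, $j \leftarrow i$ to $n+1$, $v \leftarrow 1$ to $m$
  do if type($v$) = E
    then $\gamma_v(i, i-1, j, j-1) \leftarrow 0$
    else $\gamma_v(i, i-1, j, j-1) \leftarrow -\infty$
Iteration:
for $i \leftarrow n$ downto 1, $j \leftarrow i-1$ to $n$, $k \leftarrow n+1$ downto $j+1$, $l \leftarrow k-1$ to $n$, $v \leftarrow 1$ to $m$
  do if type($v$) = E
    then if $j = i-1$ and $l = k-1$
      then skip
      else $\gamma_v(i, j, k, l) \leftarrow -\infty$
  if type($v$) = S
    then $\gamma_v(i, j) \leftarrow \max_{y \in C_v} \max_{h = i-1, \ldots, j} [\log t_v(y) + \gamma_y(i, h, h+1, j)]$
      $\tau_v(i, j) \leftarrow \arg\max_{(y, h)} [\log t_v(y) + \gamma_y(i, h, h+1, j)]$
  if type($v$) = B1 and $W_v \to C_1[W_y, W_z]$
    then $\gamma_v(i, j, k, l) \leftarrow \max_{h = i-1, \ldots, j} [\gamma_y(i, h) + \gamma_z(h+1, j, k, l)]$
      $\tau_v(i, j, k, l) \leftarrow \arg\max_{(y, z, h)} [\gamma_y(i, h) + \gamma_z(h+1, j, k, l)]$
  if type($v$) = B2 and $W_v \to C_2[W_y, W_z]$
    then $\gamma_v(i, j, k, l) \leftarrow \max_{h = i-1, \ldots, j} [\gamma_y(h+1, j) + \gamma_z(i, h, k, l)]$
      $\tau_v(i, j, k, l) \leftarrow \arg\max_{(y, z, h)} [\gamma_y(h+1, j) + \gamma_z(i, h, k, l)]$
  if type($v$) = B3 and $W_v \to C_3[W_y, W_z]$
    then $\gamma_v(i, j, k, l) \leftarrow \max_{h = k-1, \ldots, l} [\gamma_z(i, j, h+1, l) + \gamma_y(k, h)]$
      $\tau_v(i, j, k, l) \leftarrow \arg\max_{(y, z, h)} [\gamma_z(i, j, h+1, l) + \gamma_y(k, h)]$
  if type($v$) = B4 and $W_v \to C_4[W_y, W_z]$
    then $\gamma_v(i, j, k, l) \leftarrow \max_{h = k-1, \ldots, l} [\gamma_z(i, j, k, h) + \gamma_y(h+1, l)]$
      $\tau_v(i, j, k, l) \leftarrow \arg\max_{(y, z, h)} [\gamma_z(i, j, k, h) + \gamma_y(h+1, l)]$
  if type($v$) = P
    then if $j = i-1$ or $l = k-1$
      then $\gamma_v(i, j, k, l) \leftarrow -\infty$
      else $\gamma_v(i, j, k, l) \leftarrow \max_{y \in C_v} [\log e_v(a_i, a_l) + \log t_v(y) + \gamma_y(i+1, j, k, l-1)]$
        $\tau_v(i, j, k, l) \leftarrow \arg\max_{y} [\log e_v(a_i, a_l) + \log t_v(y) + \gamma_y(i+1, j, k, l-1)]$
    else $\gamma_v(i, j, k, l) \leftarrow \max_{y \in C_v} [\log e_v(a_i, a_j, a_k, a_l) + \log t_v(y) + \gamma_y(i + \Delta^{1L}_v, j - \Delta^{1R}_v, k + \Delta^{2L}_v, l - \Delta^{2R}_v)]$
      $\tau_v(i, j, k, l) \leftarrow \arg\max_{y} [\log e_v(a_i, a_j, a_k, a_l) + \log t_v(y) + \gamma_y(i + \Delta^{1L}_v, j - \Delta^{1R}_v, k + \Delta^{2L}_v, l - \Delta^{2R}_v)]$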
Note: $e_v(a_i, a_j, a_k, a_l) = e_v(a_i)$ for type($v$) = U1L, $e_v(a_i, a_j, a_k, a_l) = e_v(a_j)$ for type($v$) = U1R, $e_v(a_i, a_j, a_k, a_l) = e_v(a_k)$ for type($v$) = U2L, $e_v(a_i, a_j, a_k, a_l) = e_v(a_l)$ for type($v$) = U2R, and $e_v(a_i, a_j, a_k, a_l) = 1$ for the other types except P. Also, $\Delta^{1L}_v = 1$ for type($v$) = U1L, $\Delta^{1R}_v = 1$ for type($v$) = U1R, $\Delta^{2L}_v = 1$ for type($v$) = U2L, $\Delta^{2R}_v = 1$ for type($v$) = U2R, and $\Delta^{1L}_v, \ldots, \Delta^{2R}_v$ are set to 0 for the other types except P.
When the calculation terminates, we obtain $\log P(w, \hat{\pi} \mid \theta) = \gamma_1(1, n)$. If there are $b$ BIFURCATION nonterminals and $a$ other nonterminals, the time and space complexities of the CYK algorithm are $O(amn^4 + bn^5)$ and $O(mn^4)$, respectively. To recover the optimal derivation tree, we use the traceback variables $\tau$. Owing to space limitations, the full description of the traceback algorithm is omitted (see Kato and Seki, 2006).
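The recurrences of the CYK algorithm can be illustrated on a restricted case. The sketch below is not the authors' implementation: it covers only the types E, S, D, U1L, U1R, U2L, U2R and P (no bifurcations, no traceback), and it fills the table top-down with memoization instead of using the bottom-up loops of Algorithm 1. The data layout (`types`, `children`, `log_t`, `log_e`, with emissions keyed by tuples of symbols) is an assumption of this sketch, as is the requirement that every child index is larger than its parent's.

```python
import math
from functools import lru_cache

NEG_INF = float("-inf")
# type -> (delta_1L, delta_1R, delta_2L, delta_2R); P consumes a_i and a_l.
DELTAS = {"D": (0, 0, 0, 0), "U1L": (1, 0, 0, 0), "U1R": (0, 1, 0, 0),
          "U2L": (0, 0, 1, 0), "U2R": (0, 0, 0, 1), "P": (1, 0, 0, 1)}


def cyk_log_prob(seq, types, children, log_t, log_e):
    """Return gamma_1(1, n) for a bifurcation-free grammar (W_1 of type S)."""
    n = len(seq)

    def emitted(t, i, j, k, l):
        """Symbols written by a rule of type t (1-based positions)."""
        pos = {"U1L": (i,), "U1R": (j,), "U2L": (k,), "U2R": (l,),
               "P": (i, l)}[t]
        return tuple(seq[p - 1] for p in pos)

    @lru_cache(maxsize=None)
    def gamma2(v, i, j, k, l):
        """Log max probability that W_v derives (a_i..a_j, a_k..a_l)."""
        t = types[v]
        if t == "E":
            return 0.0 if (j == i - 1 and l == k - 1) else NEG_INF
        d1l, d1r, d2l, d2r = DELTAS[t]
        # Each component must be long enough for the symbols it emits.
        if j - i + 1 < d1l + d1r or l - k + 1 < d2l + d2r:
            return NEG_INF
        e = 0.0 if t == "D" else log_e[v].get(emitted(t, i, j, k, l), NEG_INF)
        if e == NEG_INF:
            return NEG_INF
        return max((e + log_t[v][y]
                    + gamma2(y, i + d1l, j - d1r, k + d2l, l - d2r)
                    for y in children[v]), default=NEG_INF)

    def gamma1(v, i, j):
        """S-type recurrence: split a_i..a_j into two components at h."""
        return max((log_t[v][y] + gamma2(y, i, h, h + 1, j)
                    for y in children[v] for h in range(i - 1, j + 1)),
                   default=NEG_INF)

    return gamma1(1, 1, n)


# Hypothetical four-nonterminal grammar: S -> P -> U1L -> E.
types = {1: "S", 2: "P", 3: "U1L", 4: "E"}
children = {1: [2], 2: [3], 3: [4], 4: []}
log_t = {1: {2: 0.0}, 2: {3: 0.0}, 3: {4: 0.0}}
log_e = {2: {("g", "c"): math.log(0.5), ("a", "u"): math.log(0.5)},
         3: {("a",): math.log(0.25), ("c",): math.log(0.25),
             ("g",): math.log(0.25), ("u",): math.log(0.25)}}
score = cyk_log_prob("gac", types, children, log_t, log_e)
print(math.exp(score))
```

For the input gac the only derivation with nonzero probability pairs g with c and leaves a unpaired, so the score is log(0.5 · 0.25). Memoization visits only reachable cells, which is a different design from the full table but yields the same maxima for this restricted rule set.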
3.2 Scoring Problem
As in SCFGs (Durbin et al., 1998), the scoring problem for $G_s$ can be solved by the inside algorithm. The inside algorithm calculates the summed probabilities $\alpha_v(i, j)$ and $\alpha_y(i, j, k, l)$ of all derivation subtrees rooted at a nonterminal $W_v$ for a subsequence $a_i \cdots a_j$ and of all derivation subtrees rooted at a nonterminal $W_y$ for a tuple of subsequences $(a_i \cdots a_j, a_k \cdots a_l)$, respectively. The variables $\alpha_v(i, i-1)$ and $\alpha_y(i, i-1, j, j-1)$ are defined for empty sequences in a similar way to the CYK algorithm. Therefore, we can easily obtain the inside algorithm by replacing max operations with summations in the CYK algorithm. When the calculation terminates, we obtain the probability $P(w \mid \theta) = \alpha_1(1, n)$. The time and space complexities of the algorithm are identical to those of the CYK algorithm.
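The step from CYK to inside (max replaced by summation) can be shown on a single table cell. The sketch below assumes that the inside variables are kept in log space, which the paper does not state (it defines them as probabilities); under that assumption the summation becomes a log-sum-exp. The candidate values are made up for illustration.

```python
import math


def log_sum_exp(values):
    """Return log(sum(exp(x))) computed stably; -inf for no finite input."""
    values = list(values)
    largest = max(values, default=float("-inf"))
    if largest == float("-inf"):
        return largest
    return largest + math.log(sum(math.exp(x - largest) for x in values))


# Hypothetical log probabilities of two alternative ways to fill one cell.
candidates = [math.log(0.06), math.log(0.015)]

viterbi_cell = max(candidates)         # CYK: keep the best alternative
inside_cell = log_sum_exp(candidates)  # inside: add up all alternatives

print(math.exp(viterbi_cell), math.exp(inside_cell))
```

The CYK value corresponds to the single best alternative (0.06), while the inside value accumulates both (0.075); everything else in the recurrence stays the same.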
In order to re-estimate the probability parameters of $G_s$, we need the outside algorithm. The outside algorithm calculates the summed probability $\beta_v(i, j)$ of all derivation trees excluding subtrees rooted at a nonterminal $W_v$ generating a subsequence $a_i \cdots a_j$. Also, it calculates $\beta_y(i, j, k, l)$, the summed probability of all derivation trees excluding subtrees rooted at a nonterminal $W_y$ generating a tuple of subsequences $(a_i \cdots a_j, a_k \cdots a_l)$. In the algorithm, we will use $P_v = \{y \mid W_y \to f[W_v] \in P, f \in F\}$. Note that calculating the outside variables $\beta$ requires the inside variables $\alpha$. Unlike the CYK and inside algorithms, the outside algorithm recursively works its way inward. The time and space complexities of the outside algorithm are the same as those of the CYK and inside algorithms. Formally, the outside algorithm is as follows:
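Algorithm 2 (Outside).
Initialization:
$\beta_1(1, n) \leftarrow 1$
Iteration:
for $i \leftarrow 1$ to $n+1$, $j \leftarrow n$ downto $i-1$, $k \leftarrow j+1$ to $n+1$, $l \leftarrow n$ downto $k-1$, $v \leftarrow 1$ to $m$
  do if type($v$) = S and $W_y \to C_1[W_v, W_z]$
    then $\beta_v(i, j) \leftarrow \sum_{h=j}^{n} \sum_{k'=h+1}^{n+1} \sum_{l'=k'-1}^{n} \beta_y(i, h, k', l')\, \alpha_z(j+1, h, k', l')$
  if type($v$) = S and $W_y \to C_2[W_v, W_z]$
    then $\beta_v(i, j) \leftarrow \sum_{h=1}^{i} \sum_{k'=j+1}^{n+1} \sum_{l'=k'-1}^{n} \beta_y(h, j, k', l')\, \alpha_z(h, i-1, k', l')$
  if type($v$) = S and $W_y \to C_3[W_v, W_z]$
    then $\beta_v(i, j) \leftarrow \sum_{h=1}^{i} \sum_{k'=h-1}^{i-1} \sum_{l'=j}^{n} \beta_y(h, k', i, l')\, \alpha_z(h, k', j+1, l')$
  if type($v$) = S and $W_y \to C_4[W_v, W_z]$
    then $\beta_v(i, j) \leftarrow \sum_{h=1}^{i} \sum_{k'=h-1}^{i-1} \sum_{l'=k'+1}^{i} \beta_y(h, k', l', j)\, \alpha_z(h, k', l', i-1)$
  if type($v$) $\ne$ S and $W_y \to C_1[W_z, W_v]$
    then $\beta_v(i, j, k, l) \leftarrow \sum_{h=1}^{i} \beta_y(h, j, k, l)\, \alpha_z(h, i-1)$
  if type($v$) $\ne$ S and $W_y \to C_2[W_z, W_v]$
    then $\beta_v(i, j, k, l) \leftarrow \sum_{h=j}^{k-1} \beta_y(i, h, k, l)\, \alpha_z(j+1, h)$
  if type($v$) $\ne$ S and $W_y \to C_3[W_z, W_v]$
    then $\beta_v(i, j, k, l) \leftarrow \sum_{h=j+1}^{k} \beta_y(i, j, h, l)\, \alpha_z(h, k-1)$
  if type($v$) $\ne$ S and $W_y \to C_4[W_z, W_v]$
    then $\beta_v(i, j, k, l) \leftarrow \sum_{h=l}^{n} \beta_y(i, j, k, h)\, \alpha_z(l+1, h)$
    else $\beta_v(i, j, k, l) \leftarrow \sum_{y \in P_v} \beta_y(i - \Delta^{1L}_y, j + \Delta^{1R}_y, k - \Delta^{2L}_y, l + \Delta^{2R}_y)\, e_y(a_{i - \Delta^{1L}_y}, a_{j + \Delta^{1R}_y}, a_{k - \Delta^{2L}_y}, a_{l + \Delta^{2R}_y})\, t_y(v)$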
3.3 Training Problem
The training problem for Gs can be solved by the
EM algorithm called the inside-outside algorithm
where the inside variables α and outside variables
β are used to re-estimate probability parameters.
First, we consider the probability that a nonterminal $W_v$ is used at positions $i$, $j$, $k$ and $l$ in a derivation of a single sequence $w$. If type($v$) = S, the probability is $\frac{1}{P(w \mid \theta)} \alpha_v(i, j) \beta_v(i, j)$, otherwise $\frac{1}{P(w \mid \theta)} \alpha_v(i, j, k, l) \beta_v(i, j, k, l)$. By summing these over all positions in the sequence, we can obtain the expected number of times that $W_v$ is used for $w$ as follows: for type($v$) = S, the expected count is
$$\frac{1}{P(w \mid \theta)} \sum_{i=1}^{n+1} \sum_{j=i-1}^{n} \alpha_v(i, j)\, \beta_v(i, j),$$
otherwise
$$\frac{1}{P(w \mid \theta)} \sum_{i=1}^{n+1} \sum_{j=i-1}^{n} \sum_{k=j+1}^{n+1} \sum_{l=k-1}^{n} \alpha_v(i, j, k, l)\, \beta_v(i, j, k, l).$$
Next, we extend these expected values from a single sequence $w$ to multiple independent sequences $w^{(r)}$ ($1 \le r \le N$). Let $\alpha^{(r)}$ and $\beta^{(r)}$ be the inside and outside variables calculated for each input sequence $w^{(r)}$. Then we can obtain the expected number of times that a nonterminal $W_v$ is used for training sequences $w^{(r)}$ ($1 \le r \le N$) by summing the above terms over all sequences: for type($v$) = S,
$$E(v) = \sum_{r=1}^{N} \sum_{i=1}^{n+1} \sum_{j=i-1}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \alpha^{(r)}_v(i, j)\, \beta^{(r)}_v(i, j),$$
otherwise
$$E(v) = \sum_{r=1}^{N} \sum_{i=1}^{n+1} \sum_{j=i-1}^{n} \sum_{k=j+1}^{n+1} \sum_{l=k-1}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \alpha^{(r)}_v(i, j, k, l)\, \beta^{(r)}_v(i, j, k, l).$$
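Similarly, for a given $W_y$, the expected number of times that a rule $W_v \to f[W_y]$ is applied can be obtained as follows: for type($v$) = S,
$$E(v \to y) = \sum_{r=1}^{N} \sum_{i=1}^{n+1} \sum_{j=i-1}^{n} \sum_{h=i-1}^{j} \frac{1}{P(w^{(r)} \mid \theta)}\, \beta^{(r)}_v(i, j)\, t_v(y)\, \alpha^{(r)}_y(i, h, h+1, j),$$
otherwise
$$E(v \to y) = \sum_{r=1}^{N} \sum_{i=1}^{n+1} \sum_{j=i-1}^{n} \sum_{k=j+1}^{n+1} \sum_{l=k-1}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \beta^{(r)}_v(i, j, k, l)\, e_v(a_i, a_j, a_k, a_l)\, t_v(y)\, \alpha^{(r)}_y(i + \Delta^{1L}_v, j - \Delta^{1R}_v, k + \Delta^{2L}_v, l - \Delta^{2R}_v).$$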
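For a given terminal $a$ or a pair of terminals $(a, b)$, the expected number of times that a rule containing $a$ (or $a$ and $b$) is applied is as shown below: for type($v$) = U1L,
$$E(v \to a) = \sum_{r=1}^{N} \sum_{i=1}^{n} \sum_{j=i}^{n} \sum_{k=j+1}^{n+1} \sum_{l=k-1}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \delta(a^{(r)}_i = a)\, \beta^{(r)}_v(i, j, k, l)\, \alpha^{(r)}_v(i, j, k, l),$$
for type($v$) = U1R,
$$E(v \to a) = \sum_{r=1}^{N} \sum_{i=1}^{n} \sum_{j=i}^{n} \sum_{k=j+1}^{n+1} \sum_{l=k-1}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \delta(a^{(r)}_j = a)\, \beta^{(r)}_v(i, j, k, l)\, \alpha^{(r)}_v(i, j, k, l),$$
for type($v$) = U2L,
$$E(v \to a) = \sum_{r=1}^{N} \sum_{i=1}^{n-1} \sum_{j=i-1}^{n-1} \sum_{k=j+1}^{n} \sum_{l=k}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \delta(a^{(r)}_k = a)\, \beta^{(r)}_v(i, j, k, l)\, \alpha^{(r)}_v(i, j, k, l),$$
for type($v$) = U2R,
$$E(v \to a) = \sum_{r=1}^{N} \sum_{i=1}^{n-1} \sum_{j=i-1}^{n-1} \sum_{k=j+1}^{n} \sum_{l=k}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \delta(a^{(r)}_l = a)\, \beta^{(r)}_v(i, j, k, l)\, \alpha^{(r)}_v(i, j, k, l),$$
and for type($v$) = P,
$$E(v \to ab) = \sum_{r=1}^{N} \sum_{i=1}^{n-1} \sum_{j=i}^{n-1} \sum_{k=j+1}^{n} \sum_{l=k}^{n} \frac{1}{P(w^{(r)} \mid \theta)}\, \delta(a^{(r)}_i = a, a^{(r)}_l = b)\, \beta^{(r)}_v(i, j, k, l)\, \alpha^{(r)}_v(i, j, k, l),$$
where $\delta(C)$ is 1 if the condition $C$ in the parentheses is true, and 0 if $C$ is false.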
Now, we re-estimate probability parameters by using the above expected counts. Let $\hat{t}_v(y)$ be re-estimated transition probabilities from $W_v$ to $W_y$. Also, let $\hat{e}_v(a)$ and $\hat{e}_v(a, b)$ be re-estimated emission probabilities that $W_v$ emits a symbol $a$ and two symbols $a$ and $b$, respectively. We can obtain each re-estimated probability by the following equations:
$$\hat{t}_v(y) = \frac{E(v \to y)}{E(v)}, \quad \hat{e}_v(a) = \frac{E(v \to a)}{E(v)}, \quad \hat{e}_v(a, b) = \frac{E(v \to ab)}{E(v)}. \qquad (3.1)$$
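Equation (3.1) is a plain ratio of expected counts, which the following minimal sketch applies to one nonterminal. The function name `reestimate`, the dictionary layout and the fallback to the old parameters when $E(v)$ is zero are assumptions of this sketch; the paper does not say how unused nonterminals are handled. The counts are made up.

```python
def reestimate(rule_counts, state_count, old_params=None):
    """Apply (3.1): divide each expected rule count by E(v).

    rule_counts maps a target (a child index y, a symbol a, or a pair
    (a, b)) to its expected count; state_count is E(v).
    """
    if state_count <= 0.0:
        # W_v was never used: keep the previous parameters (assumption).
        return dict(old_params or {})
    return {target: count / state_count
            for target, count in rule_counts.items()}


# Hypothetical expected counts for one nonterminal W_v with E(v) = 4.0.
new_t = reestimate({2: 3.0, 3: 1.0}, 4.0)
new_e = reestimate({("g", "c"): 3.0, ("a", "u"): 1.0}, 4.0)
print(new_t, new_e)
```

With consistent counts the re-estimated values for one nonterminal sum to one, matching the normalization required of rules with the same left-hand side.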
Note that the expected count that corresponds to the type of each nonterminal must be substituted into the above equations. In summary, the inside-outside algorithm is as follows:
Algorithm 3 (Inside-Outside).
Initialization: Pick arbitrary probability parameters of the model.
Iteration: Calculate the new probability parameters using (3.1). Calculate the new log likelihood $\sum_{r=1}^{N} \log P(w^{(r)} \mid \theta)$ of the model.
Termination: Stop if the change in log likelihood is less than a predefined threshold.
4 Experimental Results
4.1 Data for Experiments
The dataset for experiments was taken from an RNA family database called “Rfam” (version 7.0) (Griffiths-Jones et al., 2003), which is a database of multiple sequence alignments and covariance models (Eddy and Durbin, 1994) representing non-coding RNA families. We selected three viral RNA families with pseudoknot annotations named Corona_pk3 (Corona), HDV_ribozyme (HDV) and Tombus_3_IV (Tombus) (see Table 2). Corona_pk3 has a simple pseudoknotted structure, whereas HDV_ribozyme and Tombus_3_IV have more complicated structures with pseudoknots.
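Table 2: Three RNA families from Rfam ver. 7.0

| Family | Range of length | # of annotated sequences | # of test sequences |
|---|---|---|---|
| Corona_pk3 | 62–64 | 14 | 10 |
| HDV_ribozyme | 87–91 | 15 | 10 |
| Tombus_3_IV | 89–92 | 18 | 12 |

Table 3: Prediction results

| Family | Precision [%] (average) | Precision [%] (min) | Precision [%] (max) | Recall [%] (average) | Recall [%] (min) | Recall [%] (max) | CPU time [sec] (average) | CPU time [sec] (min) | CPU time [sec] (max) |
|---|---|---|---|---|---|---|---|---|---|
| Corona_pk3 | 99.4 | 94.4 | 100.0 | 99.4 | 94.4 | 100.0 | 27.8 | 26.0 | 30.4 |
| HDV_ribozyme | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 252.1 | 219.0 | 278.4 |
| Tombus_3_IV | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 244.8 | 215.2 | 257.5 |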
4.2 Implementation
We specified a particular SMCFG $G_s$ by utilizing the secondary structure annotation of each family. Rules were determined by considering the consensus secondary structure. Probability parameters were estimated from a few selected sequences by the simplest pseudocounting method, known as Laplace's rule (Durbin et al., 1998): one extra count is added to the true count of each base configuration observed in the selected sequences. Note
that the inside-outside algorithm was not used in
the experiments. The other sequences in the align-
ment were used as the test sequences for predic-
tion (see Table 2). We implemented the CYK al-
gorithm with traceback in ANSI C on a machine
with Intel Pentium D CPU 2.80 GHz and 2.00 GB
RAM. A straightforward implementation runs out of memory because of the high-order dynamic programming matrix (recall that the space complexity of the CYK algorithm is $O(mn^4)$). The dynamic programming matrix in our specified model is sparse, and therefore we implemented the matrix as a hash table storing only nonzero probability values (equivalently, finite values of the logarithm of probabilities).
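Two ideas from this section can be sketched briefly: Laplace's rule for parameter estimation and a hash table that stores only finite log probabilities. The authors' implementation is in ANSI C and is not shown in the paper; the Python below is an illustrative stand-in, and the names `laplace_estimate` and `SparseLogTable` as well as the key layout `(v, i, j, k, l)` are assumptions.

```python
from collections import Counter

NEG_INF = float("-inf")


def laplace_estimate(observations, alphabet):
    """Laplace's rule: add one extra count to every possible outcome.

    Assumes every observation is a member of alphabet.
    """
    counts = Counter(observations)
    total = len(observations) + len(alphabet)
    return {symbol: (counts[symbol] + 1) / total for symbol in alphabet}


class SparseLogTable:
    """Dynamic programming table that keeps only finite log probabilities."""

    def __init__(self):
        self._cells = {}

    def set(self, key, log_prob):
        # Cells with probability zero (log = -inf) are never stored.
        if log_prob != NEG_INF:
            self._cells[key] = log_prob

    def get(self, key):
        # Missing cells read back as -inf, i.e. probability zero.
        return self._cells.get(key, NEG_INF)

    def __len__(self):
        return len(self._cells)


emission = laplace_estimate(["a", "a", "u"], "acgu")
table = SparseLogTable()
table.set((2, 1, 2, 3, 3), -0.69)     # key layout (v, i, j, k, l)
table.set((2, 1, 1, 2, 3), NEG_INF)   # not stored
print(emission["a"], len(table), table.get((2, 1, 1, 2, 3)))
```

With three observations over a four-letter alphabet, the estimate for a is (2 + 1) / (3 + 4), and no symbol receives probability zero. The table grows only with the number of reachable cells, which is what makes the sparse representation useful when most of the $O(mn^4)$ entries are $-\infty$.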
4.3 Tests
We tested prediction accuracy by calculating precision and recall (sensitivity). Precision is the ratio of the number of correct base pairs predicted by the algorithm to the total number of predicted base pairs, and recall is the ratio of the number of correct base pairs predicted by the algorithm to the total number of base pairs specified by the trusted annotation. The results are shown in Table 3. A nearly correct prediction (94.4% precision and recall) for Corona_pk3 is shown in Figure 2, where underlined base pairs agree with trusted ones. The secondary structures predicted by our algorithm agree very well with the trusted structures.
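The two accuracy measures can be computed directly from sets of base pairs. The sketch below reads structures written in a bracket notation with two bracket types, so that crossing stems can be expressed; the two structure strings are made-up toy examples, not data from the experiments, and the function names are hypothetical.

```python
def base_pairs(structure):
    """Return the set of (i, j) pairs encoded with '()' and '[]' brackets."""
    opening = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = set()
    for pos, ch in enumerate(structure):
        if ch in stacks:
            stacks[ch].append(pos)
        elif ch in opening:
            pairs.add((stacks[opening[ch]].pop(), pos))
    return pairs


def precision_recall(predicted, trusted):
    """Precision and recall of predicted base pairs against trusted ones."""
    correct = len(predicted & trusted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(trusted) if trusted else 0.0
    return precision, recall


# Toy structures (hypothetical): two crossing stems form a pseudoknot.
trusted = base_pairs("[[..((..]]..))")
predicted = base_pairs("[[..(...]]..).")
precision, recall = precision_recall(predicted, trusted)
print(precision, recall)
```

Here two of the three predicted pairs are in the trusted set of four pairs, so precision is 2/3 and recall is 1/2. Using a separate stack per bracket type is what lets the parser handle crossing stems, which a single stack could not.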
4.4 Comparison with PSTAG
We compared the prediction accuracy of our SMCFG algorithm with that of the PSTAG algorithm (Matsui et al., 2005) (see Table 4). PSTAGs, as we have mentioned before, were proposed for modeling pairwise alignment of RNA sequences with pseudoknots and assign a probability to each alignment of TAG derivation trees. The PSTAG algorithm, based on dynamic programming, calculates the most likely alignment for a pair of TAG derivation trees where one of them is in the form of an unfolded sequence and the other is a TAG derivation tree for a known structure. On the same test sets, the SMCFG method achieves higher precision and recall than the PSTAG method.
[Figure 2: Comparison of a prediction result with a trusted structure in Rfam, for Corona_pk3 (EMBL accession #: X51325.1), sequence CUAGUCUUAUACACAAUGGUAAGCCAGUGGUAGUAAAGGUAUAAGAAAUUUGCUACUAUGUUA. Panels: trusted structure in Rfam; prediction by SMCFG.]

Table 4: Comparison between SMCFG and PSTAG

| Model | Average precision [%] (Corona) | Average precision [%] (HDV) | Average precision [%] (Tombus) | Average recall [%] (Corona) | Average recall [%] (HDV) | Average recall [%] (Tombus) |
|---|---|---|---|---|---|---|
| SMCFG | 99.4 | 100.0 | 100.0 | 99.4 | 100.0 | 100.0 |
| PSTAG | 95.5 | 95.6 | 97.4 | 94.6 | 94.1 | 97.4 |

5 Conclusion
In this paper, we have proposed a probabilistic model named SMCFG, and designed a polynomial time parsing algorithm and a parameter estimation algorithm for SMCFG. Moreover, we have carried out computational experiments on RNA secondary structure prediction with pseudoknots using the SMCFG parsing algorithm, which show good prediction accuracy.
Acknowledgments
This work is supported in part by Grant-in-Aid for Scientific Research from Japan Society for the Promotion of Science (JSPS). We also wish to thank JSPS Research Fellowships for Young Scientists for their generous financial assistance. The authors thank Dr. Yoshiaki Takata for his useful comments on implementation of high dimensional dynamic programming.

References
Tatsuya Akutsu. 2000. Dynamic programming al-
gorithms for RNA secondary structure prediction
with pseudoknots. Discrete Applied Mathematics,
104:45–62.
Michael Brown and Charles Wilson. 1996. RNA pseu-
doknot modeling using intersections of stochastic
context free grammars with applications to database
search. Proc. Pacific Symposium on Biocomputing,
109–125.
Liming Cai, Russell L. Malmberg, and Yunzhou Wu.
2003. Stochastic modeling of RNA pseudoknotted
structures: a grammatical approach. Bioinformatics,
19(1):i66–i73.
Richard Durbin, Sean R. Eddy, Anders Krogh, and
Graeme Mitchison. 1998. Biological Sequence
Analysis, Cambridge University Press.
Sean R. Eddy and Richard Durbin. 1994. RNA
sequence analysis using covariance models. Nuc.
Acids Res., 22(11):2079–2088.
Sam Griffiths-Jones, Alex Bateman, Mhairi Marshall,
Ajay Khanna, and Sean R. Eddy. 2003. Rfam: an
RNA family database. Nuc. Acids Res., 31(1):439–
441.
Tadao Kasami, Hiroyuki Seki, and Mamoru Fujii.
1988. Generalized context-free grammar and multi-
ple context-free grammar. IEICE Trans. Inf. & Syst.,
J71-D(5):758–765 (in Japanese).
Yuki Kato, Hiroyuki Seki, and Tadao Kasami. 2005.
On the generative power of grammars for RNA sec-
ondary structure. IEICE Trans. Inf. & Syst., E88-
D(1):53–64.
Yuki Kato and Hiroyuki Seki. 2006. Stochastic
multiple context-free grammar for RNA pseudoknot
modeling. NAIST Info. Sci. Tech. Rep. (NAIST-IS-
TR2006002).
Hiroshi Matsui, Kengo Sato, and Yasubumi Sakak-
ibara. 2005. Pair stochastic tree adjoining gram-
mars for aligning and predicting pseudoknot RNA
structures. Bioinformatics, 21(11):2611–2617.
Elena Rivas and Sean R. Eddy. 1999. A dynamic pro-
gramming algorithm for RNA structure prediction
including pseudoknots. J. Mol. Biol., 285:2053–
2068.
Elena Rivas and Sean R. Eddy. 2000. The language of
RNA: A formal grammar that includes pseudoknots.
Bioinformatics, 16(4):334–340.
Yasubumi Sakakibara, Michael Brown, Richard
Hughey, I. Saira Mian, Kimmen Sjölander, Rebecca
C. Underwood, and David Haussler. 1994. Stochas-
tic context-free grammars for tRNA modeling. Nuc.
Acids Res., 22:5112–5120.
Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and
Tadao Kasami. 1991. On multiple context-free
grammars. Theor. Comput. Sci., 88:191–229.
Yasuo Uemura, Aki Hasegawa, Satoshi Kobayashi, and
Takashi Yokomori. 1999. Tree adjoining grammars
for RNA structure prediction. Theor. Comput. Sci.,
210:277–303.
K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi.
1987. Characterizing structural descriptions pro-
duced by various grammatical formalisms. Proc.
25th Annual Meeting of Association for Computa-
tional Linguistics (ACL87), 104–111.
