Kullback-Leibler Distance
between Probabilistic Context-Free Grammars
and Probabilistic Finite Automata
Mark-Jan Nederhof
Faculty of Arts
University of Groningen
P.O. Box 716
NL-9700 AS Groningen, The Netherlands
markjan@let.rug.nl
Giorgio Satta
Department of Information Engineering
University of Padua
via Gradenigo, 6/A
I-35131 Padova, Italy
satta@dei.unipd.it
Abstract
We consider the problem of computing the
Kullback-Leibler distance, also called the
relative entropy, between a probabilistic
context-free grammar and a probabilistic finite
automaton. We show that there is
a closed-form (analytical) solution for one
part of the Kullback-Leibler distance, viz.
the cross-entropy. We discuss several ap-
plications of the result to the problem of
distributional approximation of probabilistic
context-free grammars by means of probabilistic
finite automata.
1 Introduction
Among the many formalisms used for descrip-
tion and analysis of syntactic structure of natu-
ral language, the class of context-free grammars
(CFGs) is by far the best understood and most
widely used. Many formalisms with greater generative
power, in particular the different types
of unification grammars, are ultimately based
on CFGs.
Regular expressions, with their procedural
counterpart of finite automata (FAs), are not
able to describe hierarchical, tree-shaped struc-
ture, and thereby seem less suitable than CFGs
for full analysis of syntactic structure. How-
ever, there are many applications where only
partial or approximated analysis of structure is
needed, and where full context-free processing
could be prohibitively expensive. Such appli-
cations can for example be found in real-time
speech recognition systems: of the many hy-
potheses returned by a speech recognizer, shal-
low syntactic analysis may be used to select a
small subset of those that seem most promis-
ing for full syntactic processing in a next phase,
thereby avoiding further computational costs
for the less promising hypotheses.
As FAs cannot describe structure as such, it is impractical to write the automata required for such applications by hand, and even difficult to derive them automatically by training. For this reason, the FAs that are used are often derived from CFGs, by means of some form of approximation. An overview of different methods of approximating CFGs by FAs, along with an experimental comparison, was given by (Nederhof, 2000).
The next step is to assign probabilities to the
transitions of the approximating FA, as the ap-
plication outlined above requires a qualitative
distinction between hypotheses rather than the
purely boolean distinction of language member-
ship. Under certain circumstances, this may be
done by carrying over the probabilities from an
input probabilistic CFG (PCFG), as shown for
the special case of n-grams by (Rimon and Herz,
1991; Stolcke and Segal, 1994), or by training
of the FA on a corpus generated by the PCFG
(Jurafsky et al., 1994). See also (Mohri and
Nederhof, 2001) for discussion of related ideas.
An obvious question to ask is then how well the resulting probabilistic FA (PFA) approximates the input PCFG, possibly for different methods of determining an FA and different ways of attaching probabilities to the transitions. Until now, any direct way of measuring the distance between a PCFG and a PFA has been lacking. As we will argue in this paper, the natural distance measure between probability distributions, the Kullback-Leibler (KL) distance, is difficult to compute. (The KL distance is also called relative entropy.) We can however derive a closed-form (analytical) solution for the cross-entropy of a PCFG and a PFA, provided the FA underlying the PFA is deterministic. The difference between the cross-entropy and the KL distance is the entropy of the PCFG, which does not depend on the PFA. This means that if we are interested in the relative quality of different approximating PFAs with respect to a single input PCFG, the cross-entropy may be used instead of the KL distance. The constraint of determinism is not a problem in practice, as any FA can be determinized, and FAs derived by approximation algorithms are normally determinized (and minimized).
As a second possible application, we now look more closely into the matter of determinization of finite-state models. Not all PFAs can be determinized, as discussed by (Mohri, 1997). This is unfortunate, as deterministic (P)FAs process input with time and space costs independent of the size of the automaton, whereas these costs are linear in the size of the automaton in the nondeterministic case, which may be too high for some real-time applications. Instead of distribution-preserving determinization, we may therefore approximate a nondeterministic PFA by a deterministic PFA whose probability distribution is close to, but not necessarily identical to, that of the first PFA. Again, an important question is how close the two models are to each other. It was argued before by (Juang and Rabiner, 1985; Falkhausen et al., 1995; Vihola et al., 2002) that the KL distance between finite-state models is difficult to compute in general. The theory developed in this paper shows however that the cross-entropy between the input PFA and the approximating deterministic PFA can be expressed in closed form, relying on the fact that a PFA can be seen as a special case of a PCFG. Thereby, different approximating deterministic PFAs can be compared for closeness to the input PFA. We can even compute the KL distance between two unambiguous PFAs, in closed form. (It is not difficult to see that ambiguity is a decidable property for FAs.)
The structure of this paper is as follows. We provide some preliminary definitions in Section 2. Section 3 discusses the expected frequency of a rule in derivations allowed by a PCFG, and explains how such values can be effectively computed. The KL distance between a PCFG and a PFA is closely related to the entropy of the PCFG, which we discuss in Section 4. Essential to our approach is the intersection of PCFGs and PFAs, to be discussed in Section 5. As we show in Section 6, the part of the KL distance expressing the cross-entropy can be computed in closed form, based on this intersection. Section 7 concludes this paper.
2 Preliminaries
Throughout the paper we use mostly stan-
dard formal language notation, as for instance
in (Hopcroft and Ullman, 1979; Booth and
Thompson, 1973), which we summarize below.
A context-free grammar (CFG) is a 4-tuple $G = (\Sigma, N, S, R)$ where $\Sigma$ and $N$ are finite disjoint sets of terminals and nonterminals, respectively, $S \in N$ is the start symbol and $R$ is a finite set of rules. Each rule has the form $A \to \alpha$, where $A \in N$ and $\alpha \in (\Sigma \cup N)^*$.
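The ‘derives’ relation $\Rightarrow$ associated with $G$ is defined on triples consisting of two strings $\alpha, \beta \in (\Sigma \cup N)^*$ and a rule $\pi \in R$. We write $\alpha \overset{\pi}{\Rightarrow} \beta$ if and only if $\alpha$ is of the form $uA\delta$ and $\beta$ is of the form $u\gamma\delta$, for some $u \in \Sigma^*$, $\delta \in (\Sigma \cup N)^*$, and $\pi = (A \to \gamma)$. A left-most derivation (for $G$) is a string $d = \pi_1 \cdots \pi_m$, $m \geq 0$, such that $\alpha_0 \overset{\pi_1}{\Rightarrow} \alpha_1 \overset{\pi_2}{\Rightarrow} \cdots \overset{\pi_m}{\Rightarrow} \alpha_m$, for some $\alpha_0, \ldots, \alpha_m \in (\Sigma \cup N)^*$; $d = \epsilon$ (where $\epsilon$ denotes the empty string) is also a left-most derivation. In the remainder of this paper, we will let the term ‘derivation’ refer to ‘left-most derivation’, unless specified otherwise. If $\alpha_0 \overset{\pi_1}{\Rightarrow} \cdots \overset{\pi_m}{\Rightarrow} \alpha_m$ for some $\alpha_0, \ldots, \alpha_m \in (\Sigma \cup N)^*$, then we say that $d = \pi_1 \cdots \pi_m$ derives $\alpha_m$ from $\alpha_0$ and we write $\alpha_0 \overset{d}{\Rightarrow} \alpha_m$; $d = \epsilon$ derives any $\alpha_0 \in (\Sigma \cup N)^*$ from itself.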
A (left-most) derivation $d$ such that $S \overset{d}{\Rightarrow} w$, $w \in \Sigma^*$, is called a complete derivation. If $d$ is a complete derivation, we write $y(d)$ to denote the (unique) string $w \in \Sigma^*$ such that $S \overset{d}{\Rightarrow} w$. The language generated by $G$ is the set of all strings $y(d)$ derived by complete derivations, i.e., $L(G) = \{w \mid S \overset{d}{\Rightarrow} w,\ d \in R^*,\ w \in \Sigma^*\}$. It is well-known that there is a one-to-one correspondence between complete derivations and parse trees for strings in $L(G)$.
A probabilistic CFG (PCFG) is a pair $G_p = (G, p_G)$, where $G$ is a CFG and $p_G$ is a function from $R$ to real numbers in the interval $[0, 1]$. A PCFG is proper if $\sum_{\pi = (A \to \alpha)} p_G(\pi) = 1$ for all $A \in N$. Function $p_G$ can be used to associate probabilities to derivations of the underlying CFG $G$, in the following way. For $d = \pi_1 \cdots \pi_m \in R^*$, $m \geq 0$, we define $p_G(d) = \prod_{i=1}^{m} p_G(\pi_i)$ if $S \overset{d}{\Rightarrow} w$ for some $w \in \Sigma^*$, and $p_G(d) = 0$ otherwise. The probability of a string $w \in \Sigma^*$ is defined as $p_G(w) = \sum_{d : y(d) = w} p_G(d)$. A PCFG is consistent if $\sum_w p_G(w) = 1$. Consistency implies that the PCFG defines a probability distribution on the set of terminal strings as well as on the set of grammar derivations. If a PCFG is proper, then consistency means that no probability mass is lost in ‘infinite’ derivations.
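As a concrete illustration of these definitions, the following minimal Python sketch checks properness and computes $p_G(d)$ for a small grammar. The encoding of rules as (left-hand side, right-hand side, probability) triples, the toy grammar itself and the function names are assumptions made for this example only; `derivation_probability` further assumes that its argument is a complete derivation.

```python
import math
from collections import defaultdict

# Hypothetical toy PCFG (not taken from the paper): (lhs, rhs, probability).
NONTERMINALS = {"S", "A"}
RULES = [
    ("S", ("a", "S", "b"), 0.3),
    ("S", ("A",), 0.7),
    ("A", ("c", "A"), 0.5),
    ("A", ("c",), 0.5),
]

def is_proper(rules, nonterminals, tol=1e-9):
    """True if, for every nonterminal A, the probabilities of the A-rules sum to 1."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    # A nonterminal without any rule has total 0.0 and so makes the grammar non-proper.
    return all(math.isclose(totals[a], 1.0, abs_tol=tol) for a in nonterminals)

def derivation_probability(derivation):
    """Product of the rule probabilities; assumes a complete derivation is given."""
    prob = 1.0
    for _lhs, _rhs, p in derivation:
        prob *= p
    return prob

# Complete left-most derivation of "acb": S => aSb => aAb => acb.
d = [RULES[0], RULES[1], RULES[3]]
print(is_proper(RULES, NONTERMINALS))
print(derivation_probability(d))
```

Checking consistency is not attempted here, since it requires summing over infinitely many derivations.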
A finite automaton (FA) is a 5-tuple $M = (\Sigma, Q, q_0, Q_f, T)$, where $\Sigma$ and $Q$ are two finite sets of terminals and states, respectively, $q_0$ is the initial state, $Q_f \subseteq Q$ is the set of final states, and $T$ is a finite set of transitions, each of the form $s \overset{a}{\mapsto} t$, where $s, t \in Q$ and $a \in \Sigma$. A probabilistic finite automaton (PFA) is a pair $M_p = (M, p_M)$, where $M$ is an FA and $p_M$ is a function from $T$ to real numbers in the interval $[0, 1]$.$^1$
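For a fixed (P)FA $M$, we define a configuration to be an element of $Q \times \Sigma^*$, and we define the relation $\vdash$ on triples consisting of two configurations and a transition $\tau \in T$ by: $(s, w) \overset{\tau}{\vdash} (t, w')$ if and only if $w$ is of the form $aw'$, for some $a \in \Sigma$, and $\tau = (s \overset{a}{\mapsto} t)$. A complete computation is a string $c = \tau_1 \cdots \tau_m$, $m \geq 0$, such that $(s_0, w_0) \overset{\tau_1}{\vdash} (s_1, w_1) \overset{\tau_2}{\vdash} \cdots \overset{\tau_m}{\vdash} (s_m, w_m)$, for some $(s_0, w_0), \ldots, (s_m, w_m) \in Q \times \Sigma^*$, with $s_0 = q_0$, $s_m \in Q_f$ and $w_m = \epsilon$, and we write $(s_0, w_0) \overset{c}{\vdash} (s_m, w_m)$. The language accepted by $M$ is $L(M) = \{w \in \Sigma^* \mid (q_0, w) \overset{c}{\vdash} (s, \epsilon),\ c \in T^*,\ s \in Q_f\}$.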
For a PFA $M_p = (M, p_M)$, and $c = \tau_1 \cdots \tau_m \in T^*$, $m \geq 0$, we define $p_M(c) = \prod_{i=1}^{m} p_M(\tau_i)$ if $c$ is a complete computation, and $p_M(c) = 0$ otherwise. A PFA is consistent if $\sum_c p_M(c) = 1$. We say $M$ is unambiguous if for each $w \in \Sigma^*$, $\exists s \in Q_f\,[(q_0, w) \overset{c}{\vdash} (s, \epsilon)]$ for at most one $c \in T^*$. We say $M$ is deterministic if for each $s$ and $a$, there is at most one transition $s \overset{a}{\mapsto} t$. Determinism implies unambiguity. It can be more readily checked whether an FA is deterministic than whether it is unambiguous. Furthermore, any FA can be effectively turned into a deterministic FA accepting the same language. Therefore, this paper will assume that FAs are deterministic, although technically, unambiguity is sufficient for our constructions to apply.
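The determinism check and the unique complete computation of a deterministic FA can be sketched in a few lines of Python. The representation of transitions as `(s, a, t)` triples, the example automaton accepting $a^*b$ and all function names are assumptions introduced for this illustration.

```python
def is_deterministic(transitions):
    """True if no two transitions share the same source state and input symbol."""
    seen = set()
    for s, a, _t in transitions:
        if (s, a) in seen:
            return False
        seen.add((s, a))
    return True

def accepting_computation(transitions, q0, finals, w):
    """For a deterministic FA, return the unique complete computation on w, or None."""
    step = {(s, a): t for s, a, t in transitions}
    state, used = q0, []
    for a in w:
        if (state, a) not in step:
            return None
        nxt = step[(state, a)]
        used.append((state, a, nxt))
        state = nxt
    return used if state in finals else None

def string_probability(p_m, q0, finals, w):
    """p_M(w): product of transition probabilities along the unique computation."""
    c = accepting_computation(list(p_m), q0, finals, w)
    if c is None:
        return 0.0
    prob = 1.0
    for tau in c:
        prob *= p_m[tau]
    return prob

# Hypothetical deterministic PFA for a*b: transition -> probability.
P_M = {(0, "a", 0): 0.4, (0, "b", 1): 0.6}
print(is_deterministic(list(P_M)))
print(string_probability(P_M, 0, {1}, "aab"))
```

Checking unambiguity for a nondeterministic FA would need more machinery, which is one reason the text settles for determinism.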
3 Expectation of rule frequency
Here we discuss how we can compute the ex-
pectation of the frequency of a rule or a non-
terminal over all derivations of a probabilistic
context-free grammar. These quantities will be
used later by our algorithms.
$^1$Our definition of PFAs amounts to a slight loss of generality with respect to standard definitions, in that there are no epsilon transitions and no probability function on states being final. We want to avoid these concepts as they would cause some technical complications later in this article. There is no loss of generality however if we may assume an end-of-sentence marker, which is often the case in practice.
Let $(A \to \alpha) \in R$ be a rule of PCFG $G_p$, and let $d \in R^*$ be a complete derivation in $G_p$. We define $f(A \to \alpha, d)$ as the number of occurrences, or frequency, of $A \to \alpha$ in $d$. Similarly, the frequency of nonterminal $A$ in $d$ is defined as $f(A, d) = \sum_\alpha f(A \to \alpha, d)$. We consider the following related quantities:

$$E_{p_G} f(A \to \alpha, d) = \sum_d p_G(d) \cdot f(A \to \alpha, d),$$

$$E_{p_G} f(A, d) = \sum_d p_G(d) \cdot f(A, d) = \sum_\alpha E_{p_G} f(A \to \alpha, d).$$
A method for the computation of these quantities is reported in (Hutchins, 1972), based on the so-called momentum matrix. We propose an alternative method here, based on an idea related to the inside-outside algorithm (Baker, 1979; Lari and Young, 1990; Lari and Young, 1991). We observe that we can factorize a derivation $d$ at each occurrence of rule $A \to \alpha$ into an ‘innermost’ part $d_2$ and two ‘outermost’ parts $d_1$ and $d_3$. We can then write
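$$E_{p_G} f(A \to \alpha, d) = \sum_{\substack{d = \pi_1 \cdots \pi_m,\ m_1,\ m_2,\ w,\ \beta,\ v,\ x: \\ S \overset{d_1}{\Rightarrow} wA\beta,\ \text{with } d_1 = \pi_1 \cdots \pi_{m_1 - 1}, \\ (A \to \alpha) = \pi_{m_1}, \\ \alpha \overset{d_2}{\Rightarrow} v,\ \text{with } d_2 = \pi_{m_1 + 1} \cdots \pi_{m_2}, \\ \beta \overset{d_3}{\Rightarrow} x,\ \text{with } d_3 = \pi_{m_2 + 1} \cdots \pi_m}} \ \prod_{i=1}^{m} p_G(\pi_i).$$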
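Next we group together all of the innermost and all of the outermost derivations and write

$$E_{p_G} f(A \to \alpha, d) = \mathrm{out}_{G_p}(A) \cdot p_G(A \to \alpha) \cdot \mathrm{in}_{G_p}(\alpha)$$

where

$$\mathrm{out}_{G_p}(A) = \sum_{\substack{d = \pi_1 \cdots \pi_m,\ d' = \pi'_1 \cdots \pi'_{m'},\ w,\ \beta,\ x: \\ S \overset{d}{\Rightarrow} wA\beta,\ \beta \overset{d'}{\Rightarrow} x}} \ \prod_{i=1}^{m} p_G(\pi_i) \cdot \prod_{i=1}^{m'} p_G(\pi'_i)$$

and

$$\mathrm{in}_{G_p}(\alpha) = \sum_{\substack{d = \pi_1 \cdots \pi_m,\ v: \\ \alpha \overset{d}{\Rightarrow} v}} \ \prod_{i=1}^{m} p_G(\pi_i).$$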
Both $\mathrm{out}_{G_p}(A)$ and $\mathrm{in}_{G_p}(\alpha)$ can be described in terms of recursive equations, of which the least fixed-points are the required values. If $G_p$ is proper and consistent, then $\mathrm{in}_{G_p}(\alpha) = 1$ for each $\alpha \in (\Sigma \cup N)^*$. Quantities $\mathrm{out}_{G_p}(A)$ for every $A$ can all be (exactly) calculated by solving a linear system, requiring an amount of time proportional to the cube of the size of $G_p$; see for instance (Corazza et al., 1991).
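The linear system can be made explicit for the proper and consistent case, where every inside value equals 1. Under that assumption, the outside values satisfy $\mathrm{out}(A) = [A = S] + \sum_{B \to \beta} \mathrm{out}(B) \cdot p_G(B \to \beta) \cdot \#_A(\beta)$, with $\#_A(\beta)$ the number of occurrences of $A$ in $\beta$. The Python sketch below solves this system by plain Gauss-Jordan elimination; the rule encoding, the toy grammar and the function names are assumptions for illustration, not the procedure of (Corazza et al., 1991).

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def outside_values(rules, nonterminals, start):
    """out(A) for all A, assuming a proper and consistent PCFG (all inside values 1)."""
    index = {a: i for i, a in enumerate(nonterminals)}
    n = len(nonterminals)
    # Row i encodes: out(A_i) - sum_B coefficient(B, A_i) * out(B) = [A_i is the start symbol]
    coeff = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for lhs, rhs, prob in rules:
        for sym in rhs:
            if sym in index:
                coeff[index[sym]][index[lhs]] -= prob
    rhs_vec = [1.0 if a == start else 0.0 for a in nonterminals]
    return dict(zip(nonterminals, solve_linear(coeff, rhs_vec)))

# Hypothetical toy PCFG: (lhs, rhs, probability).
RULES = [
    ("S", ("a", "S", "b"), 0.3),
    ("S", ("A",), 0.7),
    ("A", ("c", "A"), 0.5),
    ("A", ("c",), 0.5),
]
out = outside_values(RULES, ["S", "A"], "S")
print(out)
```

For this grammar the system reads $0.7 \cdot \mathrm{out}(S) = 1$ and $0.5 \cdot \mathrm{out}(A) = 0.7 \cdot \mathrm{out}(S)$, so $\mathrm{out}(S) = 10/7$ and $\mathrm{out}(A) = 2$.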
On the basis of all the above quantities, a number of useful statistical properties of $G_p$ can be easily computed, such as the expected length of derivations, denoted $EDL(G_p)$, and the expected length of sentences, denoted $EWL(G_p)$, discussed before by (Wetherell, 1980). These quantities satisfy the relations

$$EDL(G_p) = E_{p_G} |d| = \sum_{A \to \alpha} \mathrm{out}_{G_p}(A) \cdot p_G(A \to \alpha) \cdot \mathrm{in}_{G_p}(\alpha),$$

$$EWL(G_p) = E_{p_G} |y(d)| = \sum_{A \to \alpha} \mathrm{out}_{G_p}(A) \cdot p_G(A \to \alpha) \cdot \mathrm{in}_{G_p}(\alpha) \cdot |\alpha|_\Sigma,$$

where for a string $\gamma \in (N \cup \Sigma)^*$ we write $|\gamma|_\Sigma$ to denote the number of occurrences of terminal symbols in $\gamma$.
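A small Python sketch shows how the two relations are evaluated once the outside values are known. It assumes a proper and consistent grammar, so that all inside values are 1, and it takes the outside values of a hypothetical toy grammar as given (they were worked out by hand for this example); the encoding and names are illustrative only.

```python
# Hypothetical toy PCFG: (lhs, rhs, probability).
NONTERMINALS = {"S", "A"}
RULES = [
    ("S", ("a", "S", "b"), 0.3),
    ("S", ("A",), 0.7),
    ("A", ("c", "A"), 0.5),
    ("A", ("c",), 0.5),
]
# Outside values for this grammar, assumed precomputed: out(S) = 10/7, out(A) = 2.
OUT = {"S": 10 / 7, "A": 2.0}

def expected_derivation_length(rules, out):
    """EDL: sum over rules of out(A) * p(A -> alpha), with inside values equal to 1."""
    return sum(out[lhs] * prob for lhs, _rhs, prob in rules)

def expected_sentence_length(rules, nonterminals, out):
    """EWL: as EDL, but each rule is weighted by its number of terminal symbols."""
    total = 0.0
    for lhs, rhs, prob in rules:
        n_terminals = sum(1 for sym in rhs if sym not in nonterminals)
        total += out[lhs] * prob * n_terminals
    return total

edl = expected_derivation_length(RULES, OUT)
ewl = expected_sentence_length(RULES, NONTERMINALS, OUT)
print(edl, ewl)
```

For this grammar the values are $24/7$ and $20/7$: the rule $S \to aSb$ is applied $3/7$ times on average and contributes two terminals each time, and the two $A$-rules together contribute two occurrences of $c$ on average.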
4 Entropy of PCFGs
In this section we introduce the notion of deriva-
tional entropy of a PCFG, and discuss an algo-
rithm for its computation.
Let $G_p = (G, p_G)$ be a PCFG. For a nonterminal $A$ of $G$, let us define the entropy of $A$ as the entropy of the distribution $p_G$ on all rules of the form $A \to \alpha$, i.e.,

$$H(A) = E_{p_G} \log \frac{1}{p_G(A \to \alpha)} = \sum_\alpha p_G(A \to \alpha) \log \frac{1}{p_G(A \to \alpha)}.$$
The derivational entropy of $G_p$ is defined as the expectation of the information of the complete derivations generated by $G_p$, i.e.,

$$H_d(G_p) = E_{p_G} \log \frac{1}{p_G(d)} = \sum_d p_G(d) \log \frac{1}{p_G(d)}. \tag{1}$$
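We now characterize derivational entropy using expected rule frequencies as

$$\begin{aligned}
H_d(G_p) &= \sum_d p_G(d) \log \frac{1}{p_G(d)} \\
&= \sum_d p_G(d) \log \prod_{A \to \alpha} \left( \frac{1}{p_G(A \to \alpha)} \right)^{f(A \to \alpha, d)} \\
&= \sum_d p_G(d) \sum_{A \to \alpha} f(A \to \alpha, d) \log \frac{1}{p_G(A \to \alpha)} \\
&= \sum_{A \to \alpha} \log \frac{1}{p_G(A \to \alpha)} \cdot \sum_d p_G(d) \cdot f(A \to \alpha, d) \\
&= \sum_{A \to \alpha} \log \frac{1}{p_G(A \to \alpha)} \cdot E_{p_G} f(A \to \alpha, d) \\
&= \sum_A \sum_\alpha \log \frac{1}{p_G(A \to \alpha)} \cdot \mathrm{out}_{G_p}(A) \cdot p_G(A \to \alpha) \cdot \mathrm{in}_{G_p}(\alpha) \\
&= \sum_A \mathrm{out}_{G_p}(A) \sum_\alpha p_G(A \to \alpha) \log \frac{1}{p_G(A \to \alpha)} \cdot \mathrm{in}_{G_p}(\alpha).
\end{aligned}$$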
As already discussed, under the assumption that $G_p$ is proper and consistent we have $\mathrm{in}_{G_p}(\alpha) = 1$ for every $\alpha$. Thus we can write

$$H_d(G_p) = \sum_A \mathrm{out}_{G_p}(A) \cdot H(A). \tag{2}$$

The computation of $\mathrm{out}_{G_p}(A)$ was discussed in Section 3, and also $H(A)$ can easily be calculated.
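Characterization (2) can be evaluated directly once the outside values are available. The Python sketch below does so for a hypothetical toy grammar that is proper and consistent; the rule encoding, the precomputed outside values and the use of base-2 logarithms are assumptions of this example.

```python
import math

# Hypothetical toy PCFG: (lhs, rhs, probability).
RULES = [
    ("S", ("a", "S", "b"), 0.3),
    ("S", ("A",), 0.7),
    ("A", ("c", "A"), 0.5),
    ("A", ("c",), 0.5),
]
# Outside values for this grammar, assumed precomputed: out(S) = 10/7, out(A) = 2.
OUT = {"S": 10 / 7, "A": 2.0}

def nonterminal_entropy(rules, a):
    """H(A): entropy (in bits) of the distribution over the rules with left-hand side A."""
    return sum(p * math.log2(1.0 / p) for lhs, _rhs, p in rules if lhs == a and p > 0)

def derivational_entropy(rules, out):
    """H_d(G_p) = sum over A of out(A) * H(A), valid for proper and consistent G_p."""
    return sum(out[a] * nonterminal_entropy(rules, a) for a in out)

hd = derivational_entropy(RULES, OUT)
print(hd)
```

Because every complete derivation of this grammar consists of $k$ uses of $S \to aSb$, one use of $S \to A$, $j$ uses of $A \to cA$ and one use of $A \to c$, the value can be cross-checked by summing $p_G(d) \log \frac{1}{p_G(d)}$ over all pairs $(k, j)$ up to a large bound.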
Under the restrictive assumption that a PCFG is proper and consistent, the characterization in (2) was already known from (Grenander, 1976, Theorem 10.7, pp. 90–92). The proof reported in that work is different from ours and uses a momentum matrix (Section 3). Our characterization above is more general and uses simpler notation than the one in (Grenander, 1976).
The sentential entropy, or entropy for short, of $G_p$ is defined as the expectation of the information of the strings generated by $G_p$, i.e.,

$$H(G_p) = E_{p_G} \log \frac{1}{p_G(w)} = \sum_w p_G(w) \log \frac{1}{p_G(w)}, \tag{3}$$

assuming $0 \cdot \log \frac{1}{0} = 0$, for strings $w$ not generated by $G_p$. It is not difficult to see that $H(G_p) \leq H_d(G_p)$ and equality holds if and only if $G$ is unambiguous (Soule, 1974, Theorem 2.2). As ambiguity of CFGs is undecidable, it follows that we cannot hope to obtain a closed-form solution for $H(G_p)$ for which equality to (2) is decidable. We will return to this issue in Section 6.
5 Weighted intersection
In order to compute the cross-entropy defined in the next section, we need to derive a single probabilistic model that simultaneously accounts for both the computations of an underlying FA and the derivations of an underlying PCFG. We start from a construction originally presented in (Bar-Hillel et al., 1964), that computes the intersection of a context-free language and a regular language. The input consists of a CFG $G = (\Sigma, N, S, R)$ and an FA $M = (\Sigma, Q, q_0, Q_f, T)$; note that we assume, without loss of generality, that $G$ and $M$ share the same set of terminals $\Sigma$.
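The output of the construction is CFG $G_\cap = (\Sigma, N_\cap, S_\cap, R_\cap)$, where $N_\cap = (Q \times (\Sigma \cup N) \times Q) \cup \{S_\cap\}$, and $R_\cap$ consists of the set of rules that is obtained as follows.
• For each $s \in Q_f$, let $S_\cap \to (q_0, S, s)$ be a rule of $G_\cap$.
• For each rule $A \to X_1 \cdots X_m$ of $G$ and each sequence of states $s_0, \ldots, s_m$ of $M$, with $m \geq 0$, let $(s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)$ be a rule of $G_\cap$; for $m = 0$, $G_\cap$ has a rule $(s_0, A, s_0) \to \epsilon$ for each state $s_0$.
• For each transition $s \overset{a}{\mapsto} t$ of $M$, let $(s, a, t) \to a$ be a rule of $G_\cap$.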
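Note that for each rule $(s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)$ there is a unique rule $A \to X_1 \cdots X_m$ from which it has been constructed by the above. Similarly, each rule $(s, a, t) \to a$ uniquely identifies a transition $s \overset{a}{\mapsto} t$. This means that if we take a complete derivation $d_\cap$ in $G_\cap$, we can extract a sequence $h_1(d_\cap)$ of rules from $G$ and a sequence $h_2(d_\cap)$ of transitions from $M$, where $h_1$ and $h_2$ are string homomorphisms that we define point-wise as
• $h_1(\pi_\cap) = \epsilon$, if $\pi_\cap$ is $S_\cap \to (q_0, S, s)$;
  $h_1(\pi_\cap) = \pi$, if $\pi_\cap$ is $(s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)$ and $\pi$ is $(A \to X_1 \cdots X_m)$;
  $h_1(\pi_\cap) = \epsilon$, if $\pi_\cap$ is $(s, a, t) \to a$;
• $h_2(\pi_\cap) = \epsilon$, if $\pi_\cap$ is $S_\cap \to (q_0, S, s)$;
  $h_2(\pi_\cap) = \tau$, if $\pi_\cap$ is $(s, a, t) \to a$ and $\tau$ is $s \overset{a}{\mapsto} t$;
  $h_2(\pi_\cap) = \epsilon$, if $\pi_\cap$ is $(s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)$.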
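We define $h(d_\cap) = (h_1(d_\cap), h_2(d_\cap))$. It can be easily shown that if $S_\cap \overset{d_\cap}{\Rightarrow} w$ and $h(d_\cap) = (d, c)$, then for the same $w$ we have $S \overset{d}{\Rightarrow} w$ and $(q_0, w) \overset{c}{\vdash} (s, \epsilon)$, for some $s \in Q_f$. Conversely, if for some $w$, $d$ and $c$ we have $S \overset{d}{\Rightarrow} w$ and $(q_0, w) \overset{c}{\vdash} (s, \epsilon)$, for some $s \in Q_f$, then there is precisely one derivation $d_\cap$ such that $h(d_\cap) = (d, c)$ and $S_\cap \overset{d_\cap}{\Rightarrow} w$.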
As noted before by (Nederhof and Satta, 2003), this construction can be extended to apply to a PCFG $G_p = (G, p_G)$ and an FA $M$. The output is a PCFG $G_{\cap,p} = (G_\cap, p_{G_\cap})$, where $G_\cap$ is defined as above and $p_{G_\cap}$ is defined by:
• $p_{G_\cap}(S_\cap \to (q_0, S, s)) = 1$;
• $p_{G_\cap}((s_0, A, s_m) \to (s_0, X_1, s_1) \cdots (s_{m-1}, X_m, s_m)) = p_G(A \to X_1 \cdots X_m)$;
• $p_{G_\cap}((s, a, t) \to a) = 1$.
Note that $G_{\cap,p}$ is non-proper. More specifically, probabilities of rules with left-hand side $S_\cap$ or $(s_0, A, s_m)$ might not sum to one. This is not a problem for the algorithms presented in this paper, as we have never assumed properness for our PCFGs. What is most important here is the following property of $G_{\cap,p}$. If $d_\cap$, $d$ and $c$ are such that $h(d_\cap) = (d, c)$, then $p_{G_\cap}(d_\cap) = p_G(d)$.
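The weighted construction is short enough to sketch in Python. The encoding below (rules as `(lhs, rhs, probability)` triples, triples `(s, X, t)` as new nonterminals, the string `"S_cap"` for $S_\cap$) and the toy inputs are assumptions made for this illustration. The sketch enumerates all state sequences naively and does not remove useless rules.

```python
from itertools import product

def weighted_intersection(rules, start, states, q0, finals, transitions):
    """Rules of the intersection grammar as (lhs, rhs, probability) triples."""
    new_rules = []
    # One start rule per final state, with probability 1.
    for s in finals:
        new_rules.append(("S_cap", ((q0, start, s),), 1.0))
    # Each grammar rule is copied for every state sequence s_0 ... s_m,
    # keeping its original probability; m = 0 yields (s_0, A, s_0) -> epsilon.
    for lhs, rhs, prob in rules:
        m = len(rhs)
        for seq in product(states, repeat=m + 1):
            new_rhs = tuple((seq[i], rhs[i], seq[i + 1]) for i in range(m))
            new_rules.append(((seq[0], lhs, seq[m]), new_rhs, prob))
    # One rule per transition of the FA, with probability 1.
    for s, a, t in transitions:
        new_rules.append(((s, a, t), (a,), 1.0))
    return new_rules

# Hypothetical inputs: PCFG for a^n b and a deterministic FA for a*b.
RULES = [("S", ("a", "S"), 0.5), ("S", ("b",), 0.5)]
STATES = [0, 1]
TRANSITIONS = [(0, "a", 0), (0, "b", 1)]
cap_rules = weighted_intersection(RULES, "S", STATES, 0, [1], TRANSITIONS)
print(len(cap_rules))
```

In this toy case the rules with left-hand side `(0, "S", 1)` have probabilities summing to 1.5, which illustrates the non-properness noted above.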
Let us now assume that $M$ is deterministic. (In fact, the weaker condition of $M$ being unambiguous is sufficient for our purposes, but unambiguity is not a very practical condition.) Given a string $w$ and a transition $s \overset{a}{\mapsto} t$ of $M$ we define $f(s \overset{a}{\mapsto} t, w)$ as the frequency (number of occurrences) of $s \overset{a}{\mapsto} t$ in the unique computation of $M$, if it exists, that accepts $w$; this frequency is 0 if $w$ is not accepted by $M$. On the basis of the above construction of $G_{\cap,p}$ and of Section 3, we find

$$\begin{aligned}
E_{p_G} f(s \overset{a}{\mapsto} t, y(d)) &= \sum_d p_G(d) \cdot f(s \overset{a}{\mapsto} t, y(d)) \\
&= \mathrm{out}_{G_{\cap,p}}((s, a, t)) \cdot p_{G_\cap}((s, a, t) \to a) \cdot \mathrm{in}_{G_{\cap,p}}(a) \\
&= \mathrm{out}_{G_{\cap,p}}((s, a, t)).
\end{aligned} \tag{4}$$
6 Kullback-Leibler distance
In this section we consider the Kullback-Leibler distance between a PCFG and a PFA, and present a method for its optimization under certain assumptions. Let $G_p = (G, p_G)$ be a consistent PCFG and let $M_p = (M, p_M)$ be a consistent PFA. We demand that $M$ be deterministic (or more generally, unambiguous). Let us first assume that $L(G) \subseteq L(M)$; we will later drop this constraint.
The cross-entropy of $G_p$ and $M_p$ is defined as usual for probabilistic models, viz. as the expectation under distribution $p_G$ of the information of the strings generated by $M$, i.e.,

$$H(G_p \| M_p) = E_{p_G} \log \frac{1}{p_M(w)} = \sum_w p_G(w) \log \frac{1}{p_M(w)}.$$
The Kullback-Leibler distance of $G_p$ and $M_p$ is defined as

$$D(G_p \| M_p) = E_{p_G} \log \frac{p_G(w)}{p_M(w)} = \sum_w p_G(w) \log \frac{p_G(w)}{p_M(w)}.$$
Quantity $D(G_p \| M_p)$ can also be expressed as the difference between the cross-entropy of $G_p$ and $M_p$ and the entropy of $G_p$, i.e.,

$$D(G_p \| M_p) = H(G_p \| M_p) - H(G_p). \tag{5}$$
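Let $G_{\cap,p}$ be the PCFG obtained by intersecting $G_p$ with the non-probabilistic FA $M$ underlying $M_p$, as in Section 5. Using (4) the cross-entropy of $G_p$ and $M_p$ can be expressed as

$$\begin{aligned}
H(G_p \| M_p) &= \sum_w p_G(w) \log \frac{1}{p_M(w)} \\
&= \sum_d p_G(d) \log \frac{1}{p_M(y(d))} \\
&= \sum_d p_G(d) \log \prod_{s \overset{a}{\mapsto} t} \left( \frac{1}{p_M(s \overset{a}{\mapsto} t)} \right)^{f(s \overset{a}{\mapsto} t,\, y(d))} \\
&= \sum_d p_G(d) \sum_{s \overset{a}{\mapsto} t} f(s \overset{a}{\mapsto} t, y(d)) \log \frac{1}{p_M(s \overset{a}{\mapsto} t)} \\
&= \sum_{s \overset{a}{\mapsto} t} \log \frac{1}{p_M(s \overset{a}{\mapsto} t)} \cdot \sum_d p_G(d) \cdot f(s \overset{a}{\mapsto} t, y(d)) \\
&= \sum_{s \overset{a}{\mapsto} t} \log \frac{1}{p_M(s \overset{a}{\mapsto} t)} \cdot E_{p_G} f(s \overset{a}{\mapsto} t, y(d)) \\
&= \sum_{s \overset{a}{\mapsto} t} \log \frac{1}{p_M(s \overset{a}{\mapsto} t)} \cdot \mathrm{out}_{G_{\cap,p}}((s, a, t)).
\end{aligned}$$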
We can combine the above with (5) to obtain

$$D(G_p \| M_p) = \sum_{s \overset{a}{\mapsto} t} \mathrm{out}_{G_{\cap,p}}((s, a, t)) \cdot \log \frac{1}{p_M(s \overset{a}{\mapsto} t)} - H(G_p).$$

The values of $\mathrm{out}_{G_{\cap,p}}$ can be calculated easily, as discussed in Section 3. Computation of $H(G_p)$ in closed-form is problematic, as already pointed out in Section 4. However, for many purposes computation of $H(G_p)$ is not needed.
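Once the expected transition frequencies $\mathrm{out}_{G_{\cap,p}}((s, a, t))$ are known, evaluating the cross-entropy is a single weighted sum. The Python sketch below takes those frequencies as given for a hypothetical toy case: a PCFG with rules $S \to aS$ and $S \to b$, each with probability 0.5, and a deterministic PFA for $a^*b$. The dictionary encoding, the hand-derived frequencies and the base-2 logarithm are assumptions of this example.

```python
import math

def cross_entropy(expected_freq, p_m):
    """Sum over transitions of expected frequency times log2(1 / p_M(transition))."""
    return sum(
        freq * math.log2(1.0 / p_m[tau])
        for tau, freq in expected_freq.items()
        if freq > 0
    )

# Hypothetical deterministic PFA for a*b: transition (s, a, t) -> probability.
P_M = {(0, "a", 0): 0.4, (0, "b", 1): 0.6}

# Expected transition frequencies under the toy PCFG, assumed precomputed:
# the string a^n b has probability 0.5^(n+1), so the a-loop is taken once on
# average and the b-transition exactly once.
EXPECTED_FREQ = {(0, "a", 0): 1.0, (0, "b", 1): 1.0}

h = cross_entropy(EXPECTED_FREQ, P_M)
print(h)
```

Different choices of $p_M$ for the same underlying FA can be compared by calling `cross_entropy` with the same expected frequencies, since those depend only on the PCFG and the non-probabilistic FA.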
For example, assume that the non-probabilistic FA $M$ underlying $M_p$ is given, and our goal is to measure the distance between $G_p$ and $M_p$, for different choices of $p_M$. Then the choice that minimizes $H(G_p \| M_p)$ determines the choice that minimizes $D(G_p \| M_p)$, irrespective of $H(G_p)$. Formally, we can use the above characterization to compute

$$p_M^* = \operatorname*{argmin}_{p_M} D(G_p \| M_p) = \operatorname*{argmin}_{p_M} H(G_p \| M_p).$$
When $L(G) - L(M)$ is non-empty, both $D(G_p \| M_p)$ and $H(G_p \| M_p)$ are undefined, as their definitions imply a division by $p_M(w) = 0$ for $w \in L(G) - L(M)$. In cases where the non-probabilistic FA $M$ is given, and our goal is to compare the relative distances between $G_p$ and $M_p$ for different choices of $p_M$, it makes sense to ignore strings in $L(G) - L(M)$, and define $D(G_p \| M_p)$, $H(G_p \| M_p)$ and $H(G_p)$ on the domain $L(G) \cap L(M)$. Our equations above then still hold. Note that strings in $L(M) - L(G)$ can be ignored since they do not contribute non-zero values to $D(G_p \| M_p)$ and $H(G_p \| M_p)$.
7 Conclusions
We have discussed the computation of the KL distance between PCFGs and deterministic PFAs. We have argued that exact computation is difficult in general, but for determining the relative qualities of different PFAs, with respect to their closeness to an input PCFG, it suffices to compute the cross-entropy. We have shown that the cross-entropy between a PCFG and a deterministic PFA can be computed exactly.
These results can also be used for comparing
a pair of PFAs, one of which is deterministic.
Generalization of PCFGs to probabilistic tree-
adjoining grammars (PTAGs) is also possible,
by means of the intersection of a PTAG and a
PFA, along the lines of (Lang, 1994).
Acknowledgements
Helpful comments from Zhiyi Chi are gratefully acknowledged. The first author is supported by the PIONIER Project Algorithms for Linguistic Processing, funded by NWO (Dutch Organization for Scientific Research). The second author is partially supported by MIUR (Italian Ministry of Education) under project PRIN No. 2003091149 005.

References

J.K. Baker. 1979. Trainable grammars for speech recognition. In J.J. Wolf and D.H. Klatt, editors, Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America, pages 547–550.

Y. Bar-Hillel, M. Perles, and E. Shamir. 1964. On formal properties of simple phrase structure grammars. In Y. Bar-Hillel, editor, Language and Information: Selected Essays on their Theory and Application, chapter 9, pages 116–150. Addison-Wesley.

T.L. Booth and R.A. Thompson. 1973. Applying probabilistic measures to abstract languages. IEEE Transactions on Computers, C-22(5):442–450, May.

A. Corazza, R. De Mori, R. Gretter, and G. Satta. 1991. Computation of probabilities for an island-driven parser. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):936–950.

M. Falkhausen, H. Reininger, and D. Wolf. 1995. Calculation of distance measures between Hidden Markov Models. In Proceedings of Eurospeech ’95, pages 1487–1490, Madrid.

U. Grenander. 1976. Lectures in Pattern Theory, Vol. I: Pattern Synthesis. Springer-Verlag.

J.E. Hopcroft and J.D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.

S.E. Hutchins. 1972. Moments of strings and derivation lengths of stochastic context-free grammars. Information Sciences, 4:179–191.

B.-H. Juang and L.R. Rabiner. 1985. A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64(2):391–408.

D. Jurafsky, C. Wooters, G. Tajchman, J. Segal, A. Stolcke, E. Fosler, and N. Morgan. 1994. The Berkeley Restaurant Project. In Proceedings of the International Conference on Spoken Language Processing (ICSLP-94), pages 2139–2142, Yokohama, Japan.

B. Lang. 1994. Recognition can be harder than parsing. Computational Intelligence, 10(4):486–494.

K. Lari and S.J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.

K. Lari and S.J. Young. 1991. Applications of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 5:237–257.

M. Mohri and M.-J. Nederhof. 2001. Regular approximation of context-free grammars through transformation. In J.-C. Junqua and G. van Noord, editors, Robustness in Language and Speech Technology, pages 153–163. Kluwer Academic Publishers.

M. Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311.

M.-J. Nederhof and G. Satta. 2003. Probabilistic parsing as intersection. In 8th International Workshop on Parsing Technologies, pages 137–148, LORIA, Nancy, France, April.

M.-J. Nederhof. 2000. Practical experiments with regular approximation of context-free languages. Computational Linguistics, 26(1):17–44.

M. Rimon and J. Herz. 1991. The recognition capacity of local syntactic constraints. In Fifth Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, pages 155–160, Berlin, Germany, April.

S. Soule. 1974. Entropies of probabilistic grammars. Information and Control, 25:57–74.

A. Stolcke and J. Segal. 1994. Precise N-gram probabilities from stochastic context-free grammars. In 32nd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 74–79, Las Cruces, New Mexico, USA, June.

M. Vihola, M. Harju, P. Salmela, J. Suontausta, and J. Savela. 2002. Two dissimilarity measures for HMMs and their application in phoneme model clustering. In ICASSP 2002, volume I, pages 933–936.

C.S. Wetherell. 1980. Probabilistic languages: A review and some open questions. Computing Surveys, 12(4):361–379, December.
