A General Technique to Train Language
Models on Language Models
Mark-Jan Nederhof∗
University of Groningen
We show that under certain conditions, a language model can be trained on the basis of a
second language model. The main instance of the technique trains a finite automaton on the
basis of a probabilistic context-free grammar, such that the Kullback-Leibler distance between
grammar and trained automaton is provably minimal. This is a substantial generalization of
an existing algorithm to train an n-gram model on the basis of a probabilistic context-free
grammar.
1. Introduction
In this article, the term language model is used to refer to any description that assigns
probabilities to strings over a certain alphabet. Language models have important
applications in natural language processing, and in particular, in speech recognition
systems (Manning and Schütze 1999).
Language models often consist of a symbolic description of a language, such as
a finite automaton (FA) or a context-free grammar (CFG), extended by a probability
assignment to, for example, the transitions of the FA or the rules of the CFG, by which
we obtain a probabilistic finite automaton (PFA) or probabilistic context-free grammar
(PCFG), respectively. For certain applications, one may first determine the symbolic part
of the automaton or grammar and in a second phase try to find reliable probability
estimates for the transitions or rules. The current article is concerned with the second
problem, that of extending FAs or CFGs to become PFAs or PCFGs. We refer to this
process as training.
Training is often done on the basis of a corpus of actual language use in a certain
domain. If each sentence in this corpus is annotated by a list of transitions of an
FA recognizing the sentence or a parse tree for a CFG generating the sentence, then
training may consist simply in relative frequency estimation. This means that we estimate
probabilities of transitions or rules by counting their frequencies in the corpus, relative
to the frequencies of the start states of transitions or to the frequencies of the left-hand
side nonterminals of rules, respectively. By this estimation, the likelihood of the corpus
is maximized.
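As a rough illustration of relative frequency estimation for rules, the following Python sketch counts rules in a small annotated corpus. The function name `relative_frequency`, the encoding of rules as pairs, and the toy corpus are assumptions made for this example; they do not come from the article.

```python
from collections import Counter

def relative_frequency(derivations):
    """Estimate rule probabilities by relative frequency.

    Each derivation is a list of rules (lhs, rhs), with rhs a tuple of symbols.
    """
    rule_counts = Counter()
    lhs_counts = Counter()
    for derivation in derivations:
        for lhs, rhs in derivation:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    # Count of each rule relative to the count of its left-hand side.
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

# Hypothetical toy corpus: the parse trees of "ab" and "b", written as rule sequences.
corpus = [
    [("S", ("a", "S")), ("S", ("b",))],
    [("S", ("b",))],
]
probs = relative_frequency(corpus)
```

In this toy corpus the rule S → a S occurs once and S → b twice, so the estimates are 1/3 and 2/3.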
The technique we introduce in this article is different in that training is done on
the basis not of a finite corpus, but of an input language model. Our goal is to find
estimates for the probabilities of transitions or rules of the input FA or CFG such that
the resulting PFA or PCFG approximates the input language model as well as possible,
or more specifically, such that the Kullback-Leibler (KL) distance (or relative entropy)
between the input model and the trained model is minimized. The input FA or CFG to
be trained may be structurally unrelated to the input language model.
∗ Faculty of Arts, Humanities Computing, P.O. Box 716, NL-9700 AS Groningen, The Netherlands.
E-mail: markjan@let.rug.nl.
Submission received: 20th January 2004; Revised submission received: 5th August 2004; Accepted for
publication: 19th September 2004
© 2005 Association for Computational Linguistics
Computational Linguistics Volume 31, Number 2
This technique has several applications. One is an extension with probabilities
of existing work on approximation of CFGs by means of FAs (Nederhof 2000). The
motivation for this work was that application of FAs is generally less costly than
application of CFGs, which is an important benefit when the input is very large, as
is often the case in, for example, speech recognition systems. The practical relevance of
this work was limited, however, by the fact that in practice one is more interested in
the probabilities of sentences than in a purely Boolean distinction between grammatical
and ungrammatical sentences.
Several approaches were discussed by Mohri and Nederhof (2001) to extend this
work to approximation of PCFGs by means of PFAs. A first approach is to directly map
rules with attached probabilities to transitions with attached probabilities. Although
this is computationally the easiest approach, the resulting PFA may be a very inaccurate
approximation of the probability distribution described by the input PCFG. In particular,
there may be assignments of probabilities to the transitions of the same FA that lead
to more accurate approximating language models.
A second approach is to train the approximating FA by means of a corpus. If
the input PCFG was itself obtained by training on a corpus, then we already possess
training material. However, this may not always be the case, and no training material
may be available. Furthermore, as a determinized approximating FA may be much
larger than the input PCFG, the sparse-data problem may be more severe for the
automaton than it was for the grammar.¹ Hence, even if sufficient material was available
to train the CFG, it may not be sufficient to accurately train the FA.
A third approach is to construct a training corpus from the PCFG by means of
a (pseudo)random generator of sentences, such that sentences that are more likely
according to the PCFG are generated with greater likelihood. This has been proposed
by Jurafsky et al. (1994), for the special case of bigrams, extending a nonprobabilistic
technique by Zue et al. (1991). It is not clear, however, whether this idea is feasible
for training of finite-state models that are larger than bigrams. The reason is that
very large corpora would have to be generated in order to obtain accurate probability
estimates for the PFA. Note that the number of parameters of a bigram model is
bounded by the square of the size of the lexicon; such a bound does not exist for
general PFAs.
The current article discusses a fourth approach. In the limit, it is equivalent to the
third approach above, as if an infinite corpus were constructed on which the PFA is
trained, but we have found a way to avoid considering sentences individually. The key
idea that allows us to handle an infinite set of strings generated by the PCFG is that we
construct a new grammar that represents the intersection of the languages described by
the input PCFG and the FA. Within this new grammar, we can compute the expected
frequencies of transitions of the FA, using a fairly standard analysis of PCFGs. These
expected frequencies then allow us to determine the assignment of probabilities to
transitions of the FA that minimizes the KL distance between the PCFG and the resulting
PFA.
1 In Nederhof (2000), several methods of approximation were discussed that lead to determinized
approximating FAs that can be much larger than the input CFGs.
Nederhof Training Models on Models
The only requirement is that the FA to be trained be unambiguous, by which we
mean that each input string can be recognized by at most one computation of the FA.
The special case of n-grams has already been formulated by Stolcke and Segal (1994),
realizing an idea previously envisioned by Rimon and Herz (1991). An n-gram model is
here seen as a (P)FA that contains exactly one state for each possible history of the n − 1
previously read symbols. It is clear that such an FA is unambiguous (even deterministic)
and that our technique therefore properly subsumes the technique by Stolcke and Segal
(1994), although the way that the two techniques are formulated is rather different. Also
note that the FA underlying an n-gram model accepts any input string over the alphabet,
which does not hold for general (unambiguous) FAs.
Another application of our work involves determinization and minimization of
PFAs. As shown by Mohri (1997), PFAs cannot always be determinized, and no practical
algorithms are known to minimize arbitrary nondeterministic (P)FAs. This can be a
problem when deterministic or small PFAs are required. We can, however, always
compute a minimal deterministic FA equivalent to an input FA. The new results in this
article offer a way to extend this determinized FA to a PFA such that it approximates
the probability distribution described by the input PFA as well as possible, in terms of
the KL distance.
Although the proposed technique has some limitations, in particular the requirement that
the model to be trained be unambiguous, it is by no means restricted to language models based on
finite automata or context-free grammars, as several other probabilistic grammatical
formalisms can be treated in a similar manner.
The structure of this article is as follows. We provide some preliminary definitions
in Section 2. Section 3 discusses how the expected frequency of a rule in a PCFG can be
computed. This is an auxiliary step in the algorithms to be discussed below. Section 4
defines a way to combine a PFA and a PCFG into a new PCFG that extends a well-known
representation of the intersection of a regular and a context-free language. Thereby
we merge the input model and the model to be trained into a single structure. This
structure is the foundation for a number of algorithms, presented in section 5, which
allow, respectively, training of an unambiguous FA on the basis of a PCFG (section 5.1),
training of an unambiguous CFG on the basis of a PFA (section 5.2), and training of an
unambiguous FA on the basis of a PFA (section 5.3).
2. Preliminaries
Many of the definitions on probabilistic context-free grammars are based on Santos
(1972) and Booth and Thompson (1973), and the definitions on probabilistic finite
automata are based on Paz (1971) and Starke (1972).
A context-free grammar $G$ is a 4-tuple $(\Sigma, N, S, R)$, where $\Sigma$ and $N$ are two finite
disjoint sets of terminals and nonterminals, respectively, $S \in N$ is the start symbol, and
$R$ is a finite set of rules, each of the form $A \to \alpha$, where $A \in N$ and $\alpha \in (\Sigma \cup N)^*$. A
probabilistic context-free grammar $G$ is a 5-tuple $(\Sigma, N, S, R, p_G)$, where $\Sigma$, $N$, $S$ and $R$
are as above, and $p_G$ is a function from rules in $R$ to probabilities.
In what follows, symbol $a$ ranges over the set $\Sigma$, symbols $w, v$ range over the
set $\Sigma^*$, symbols $A, B$ range over the set $N$, symbol $X$ ranges over the set $\Sigma \cup N$,
symbols $\alpha, \beta, \gamma$ range over the set $(\Sigma \cup N)^*$, symbol $\rho$ ranges over the set $R$, and
symbols $d, e$ range over the set $R^*$. With slight abuse of notation, we treat a rule
$\rho = (A \to \alpha) \in R$ as an atomic symbol when it occurs within a string $d\rho e \in R^*$.
The symbol $\epsilon$ denotes the empty string. String concatenation is represented by
operator $\cdot$ or by empty space.
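For a fixed (P)CFG $G$, we define the relation $\Rightarrow$ on triples consisting of two strings
$\alpha, \beta \in (\Sigma \cup N)^*$ and a rule $\rho \in R$ by $\alpha \stackrel{\rho}{\Rightarrow} \beta$ if and only if $\alpha$ is of the form $wA\delta$ and $\beta$
is of the form $w\gamma\delta$, for some $w \in \Sigma^*$ and $\delta \in (\Sigma \cup N)^*$, and $\rho = (A \to \gamma)$. A leftmost
derivation (in $G$) is a string $d = \rho_1 \cdots \rho_m$, $m \geq 0$, such that
$\alpha_0 \stackrel{\rho_1}{\Rightarrow} \alpha_1 \stackrel{\rho_2}{\Rightarrow} \cdots \stackrel{\rho_m}{\Rightarrow} \alpha_m$, for
some $\alpha_0, \ldots, \alpha_m \in (\Sigma \cup N)^*$; $d = \epsilon$ is always a leftmost derivation. In the remainder
of this article, we let the term derivation refer to leftmost derivation, unless specified
otherwise. If $\alpha_0 \stackrel{\rho_1}{\Rightarrow} \cdots \stackrel{\rho_m}{\Rightarrow} \alpha_m$ for some $\alpha_0, \ldots, \alpha_m \in (\Sigma \cup N)^*$, then we say that
$d = \rho_1 \cdots \rho_m$ derives $\alpha_m$ from $\alpha_0$, and we write $\alpha_0 \stackrel{d}{\Rightarrow} \alpha_m$; $\epsilon$ derives any $\alpha_0 \in (\Sigma \cup N)^*$
from itself. A derivation $d$ such that $S \stackrel{d}{\Rightarrow} w$, for some $w \in \Sigma^*$, is called a complete
derivation. We say that $G$ is unambiguous if for each $w \in \Sigma^*$, $S \stackrel{d}{\Rightarrow} w$ for at most
one $d \in R^*$.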
Let $G$ be a fixed PCFG $(\Sigma, N, S, R, p_G)$. For $\alpha, \beta \in (\Sigma \cup N)^*$ and $d = \rho_1 \cdots \rho_m \in R^*$,
$m \geq 0$, we define $p_G(\alpha \stackrel{d}{\Rightarrow} \beta) = \prod_{i=1}^{m} p_G(\rho_i)$ if $\alpha \stackrel{d}{\Rightarrow} \beta$, and $p_G(\alpha \stackrel{d}{\Rightarrow} \beta) = 0$ otherwise. The
probability $p_G(w)$ of a string $w \in \Sigma^*$ is defined to be $\sum_{d} p_G(S \stackrel{d}{\Rightarrow} w)$.
PCFG $G$ is said to be proper if $\sum_{\rho, \alpha} p_G(A \stackrel{\rho}{\Rightarrow} \alpha) = 1$ for all $A \in N$, that is, if the
probabilities of all rules $\rho = (A \to \alpha)$ with left-hand side $A$ sum to one. PCFG $G$ is said to
be consistent if $\sum_{w} p_G(w) = 1$. Consistency implies that the PCFG defines a probability
distribution on the set of terminal strings. There is a practical sufficient condition for
consistency that is decidable (Booth and Thompson 1973).
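The definitions of derivation probability and properness can be made concrete with a small Python sketch. The helper names `is_proper` and `derivation_probability`, the encoding of rules as pairs, and the toy grammar are assumptions for illustration only. The sketch does not check that the rule sequence is a valid derivation, and it checks properness only for nonterminals that have at least one rule.

```python
from collections import defaultdict
from fractions import Fraction

def is_proper(rule_probs):
    """True if, for every left-hand side, the rule probabilities sum to one."""
    totals = defaultdict(Fraction)
    for (lhs, _rhs), p in rule_probs.items():
        totals[lhs] += p
    return all(total == 1 for total in totals.values())

def derivation_probability(rule_probs, derivation):
    """Product of the probabilities of the rules in a derivation."""
    result = Fraction(1)
    for rule in derivation:
        result *= rule_probs[rule]
    return result

# Hypothetical toy PCFG: S -> a S [1/3], S -> b [2/3].
toy = {
    ("S", ("a", "S")): Fraction(1, 3),
    ("S", ("b",)): Fraction(2, 3),
}
# The only derivation of the string "ab".
p_ab = derivation_probability(toy, [("S", ("a", "S")), ("S", ("b",))])
```

For this toy grammar the string probabilities form a geometric series, $\sum_{n \geq 0} (1/3)^n \cdot 2/3 = 1$, so it is consistent as well as proper.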
A PCFG is said to be reduced if for each nonterminal $A$, there are $d_1, d_2 \in R^*$,
$w_1, w_2 \in \Sigma^*$, and $\beta \in (\Sigma \cup N)^*$ such that
$p_G(S \stackrel{d_1}{\Rightarrow} w_1 A \beta) \cdot p_G(w_1 A \beta \stackrel{d_2}{\Rightarrow} w_1 w_2) > 0$. In
words, if a PCFG is reduced, then for each nonterminal $A$, there is at least one derivation
$d_1 d_2$ with nonzero probability that derives a string $w_1 w_2$ from $S$ and that includes
some rule with left-hand side $A$. A PCFG $G$ that is not reduced can be turned into
one that is reduced and that describes the same probability distribution, provided that
$\sum_{w} p_G(w) > 0$. This reduction consists in removing from the grammar any nonterminal
$A$ for which the above conditions do not hold, together with any rule that contains
such a nonterminal; see Aho and Ullman (1972) for reduction of CFGs, which is very
similar.
A finite automaton $M$ is a 5-tuple $(\Sigma, Q, q_0, q_f, T)$, where $\Sigma$ and $Q$ are two
finite sets of terminals and states, respectively, $q_0, q_f \in Q$ are the initial and final
states, respectively, and $T$ is a finite set of transitions, each of the form $r \stackrel{a}{\mapsto} s$, where
$r \in Q - \{q_f\}$, $s \in Q$, and $a \in \Sigma$.² A probabilistic finite automaton $M$ is a 6-tuple
$(\Sigma, Q, q_0, q_f, T, p_M)$, where $\Sigma$, $Q$, $q_0$, $q_f$, and $T$ are as above, and $p_M$ is a function from
transitions in $T$ to probabilities.
In what follows, symbols $q, r, s$ range over the set $Q$, symbol $\tau$ ranges over the set $T$,
and symbol $c$ ranges over the set $T^*$.
For a fixed (P)FA $M$, we define a configuration to be an element of $Q \times \Sigma^*$, and we
define the relation $\vdash$ on triples consisting of two configurations and a transition $\tau \in T$
by $(r, w) \stackrel{\tau}{\vdash} (s, w')$ if and only if $w$ is of the form $aw'$, for some $a \in \Sigma$, and $\tau = (r \stackrel{a}{\mapsto} s)$.
A computation (in $M$) is a string $c = \tau_1 \cdots \tau_m$, $m \geq 0$, such that
$(r_0, w_0) \stackrel{\tau_1}{\vdash} (r_1, w_1) \stackrel{\tau_2}{\vdash} \cdots \stackrel{\tau_m}{\vdash} (r_m, w_m)$,
for some $(r_0, w_0), \ldots, (r_m, w_m) \in Q \times \Sigma^*$; $c = \epsilon$ is always a computation.
If $(r_0, w_0) \stackrel{\tau_1}{\vdash} \cdots \stackrel{\tau_m}{\vdash} (r_m, w_m)$ for some $(r_0, w_0), \ldots, (r_m, w_m) \in Q \times \Sigma^*$
and $c = \tau_1 \cdots \tau_m \in T^*$, then we write $(r_0, w_0) \stackrel{c}{\vdash} (r_m, w_m)$. We say that $c$ recognizes $w$ if
$(q_0, w) \stackrel{c}{\vdash} (q_f, \epsilon)$.
2 That we only allow one final state is not a serious restriction with regard to the set of strings we can
process; only when the empty string is to be recognized could this lead to difficulties. Lifting the
restriction would encumber the presentation with treatment of additional cases without affecting,
however, the validity of the main results.
Let $M$ be a fixed FA $(\Sigma, Q, q_0, q_f, T)$. The language $L(M)$ accepted by $M$ is
defined to be $\{w \in \Sigma^* \mid \exists c\, [(q_0, w) \stackrel{c}{\vdash} (q_f, \epsilon)]\}$. We say $M$ is unambiguous if for each
$w \in \Sigma^*$, $(q_0, w) \stackrel{c}{\vdash} (q_f, \epsilon)$ for at most one $c \in T^*$. We say $M$ is deterministic if for each
$(r, w) \in Q \times \Sigma^*$, there is at most one combination of $\tau \in T$ and $(s, w') \in Q \times \Sigma^*$ such
that $(r, w) \stackrel{\tau}{\vdash} (s, w')$. Turning a given FA into one that is deterministic and accepts the
same language is called determinization. All FAs can be determinized. Turning a given
(deterministic) FA into the smallest (deterministic) FA that accepts the same language
is called minimization. There are effective algorithms for minimization of deterministic
FAs.
Let $M$ be a fixed PFA $(\Sigma, Q, q_0, q_f, T, p_M)$. For $(r, w), (s, v) \in Q \times \Sigma^*$ and
$c = \tau_1 \cdots \tau_m \in T^*$, we define $p_M((r, w) \stackrel{c}{\vdash} (s, v)) = \prod_{i=1}^{m} p_M(\tau_i)$ if $(r, w) \stackrel{c}{\vdash} (s, v)$, and
$p_M((r, w) \stackrel{c}{\vdash} (s, v)) = 0$ otherwise. The probability $p_M(w)$ of a string $w \in \Sigma^*$ is defined
to be $\sum_{c} p_M((q_0, w) \stackrel{c}{\vdash} (q_f, \epsilon))$.
PFA $M$ is said to be proper if $\sum_{\tau, a, s:\ \tau = (r \stackrel{a}{\mapsto} s) \in T} p_M(\tau) = 1$ for all $r \in Q - \{q_f\}$.
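The probability of a string under a PFA sums over all computations that recognize it. A minimal Python sketch of that sum follows; the function names, the encoding of transitions as triples, and the toy automaton are assumptions for this example. It propagates summed weights state by state instead of enumerating computations, and its properness check only looks at states that occur as the source of some transition.

```python
from fractions import Fraction

def string_probability(transitions, q0, qf, w):
    """p_M(w): total probability of all computations that recognize w.

    transitions maps triples (r, a, s) to probabilities.
    """
    # weights[q] = summed probability of computations from q0 to q on the prefix read so far
    weights = {q0: Fraction(1)}
    for a in w:
        step = {}
        for (r, b, s), p in transitions.items():
            if b == a and r in weights:
                step[s] = step.get(s, Fraction(0)) + weights[r] * p
        weights = step
    return weights.get(qf, Fraction(0))

def is_proper(transitions, qf):
    """True if the outgoing probabilities of every source state other than qf sum to one."""
    totals = {}
    for (r, _a, _s), p in transitions.items():
        totals[r] = totals.get(r, Fraction(0)) + p
    return all(total == 1 for r, total in totals.items() if r != qf)

# Hypothetical toy PFA recognizing a*b.
toy = {("q0", "a", "q0"): Fraction(1, 3), ("q0", "b", "qf"): Fraction(2, 3)}
p_ab = string_probability(toy, "q0", "qf", "ab")
```

Here the string ab has the single computation with probability 1/3 · 2/3 = 2/9, while a string such as a, which is not in the language, gets probability zero.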
3. Expected Frequencies of Rules
Let $G$ be a PCFG $(\Sigma, N, S, R, p_G)$. We assume without loss of generality that $S$ does not
occur in the right-hand side of any rule from $R$. For each rule $\rho$, we define

$$E(\rho) = \sum_{d, d', w} p_G(S \stackrel{d \rho d'}{\Rightarrow} w) \tag{1}$$

If $G$ is proper and consistent, (1) is the expected frequency of $\rho$ in a complete derivation.
Each complete derivation $d \rho d'$ can be written as $d \rho d'' d'''$, with $d' = d'' d'''$, where

$$S \stackrel{d}{\Rightarrow} w' A \beta, \quad A \stackrel{\rho}{\Rightarrow} \alpha, \quad \alpha \stackrel{d''}{\Rightarrow} w'', \quad \beta \stackrel{d'''}{\Rightarrow} w''' \tag{2}$$

for some $A$, $\alpha$, $\beta$, $w'$, $w''$, and $w'''$. Therefore

$$E(\rho) = \mathit{outer}(A) \cdot p_G(\rho) \cdot \mathit{inner}(\alpha) \tag{3}$$

where we define

$$\mathit{outer}(A) = \sum_{d, w', \beta, d''', w'''} p_G(S \stackrel{d}{\Rightarrow} w' A \beta) \cdot p_G(\beta \stackrel{d'''}{\Rightarrow} w''') \tag{4}$$

$$\mathit{inner}(\alpha) = \sum_{d'', w''} p_G(\alpha \stackrel{d''}{\Rightarrow} w'') \tag{5}$$

for each $A \in N$ and $\alpha \in (\Sigma \cup N)^*$. From the definition of inner, we can easily derive the
following equations:

$$\mathit{inner}(a) = 1 \tag{6}$$

$$\mathit{inner}(A) = \sum_{\rho, \alpha:\ \rho = (A \to \alpha)} p_G(\rho) \cdot \mathit{inner}(\alpha) \tag{7}$$

$$\mathit{inner}(X\beta) = \mathit{inner}(X) \cdot \mathit{inner}(\beta) \tag{8}$$

This can be taken as a recursive definition of inner, assuming $\beta \neq \epsilon$ in (8). Similarly, we
can derive a recursive definition of outer:

$$\mathit{outer}(S) = 1 \tag{9}$$

$$\mathit{outer}(A) = \sum_{\rho, B, \alpha, \beta:\ \rho = (B \to \alpha A \beta)} \mathit{outer}(B) \cdot p_G(\rho) \cdot \mathit{inner}(\alpha) \cdot \mathit{inner}(\beta) \tag{10}$$

for $A \neq S$.
In general, there may be cyclic dependencies in the equations for inner and outer;
that is, for certain nonterminals $A$, $\mathit{inner}(A)$ and $\mathit{outer}(A)$ may be defined in terms
of themselves. There may even be no closed-form expression for $\mathit{inner}(A)$. However,
one may approximate the solutions to arbitrary precision by means of fixed-point
iteration.
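The fixed-point iteration for inner and outer can be sketched in Python as follows. This is only one possible reading of equations (6) through (10): the function names, the fixed iteration count, and the toy grammar are assumptions made here, every nonterminal is assumed to have at least one rule, and $S$ is assumed not to occur in any right-hand side. Both iterations start from zero and are simply run for a fixed number of rounds instead of testing for convergence.

```python
def inner_of(symbols, inner):
    """inner of a string of symbols: equations (6) and (8)."""
    result = 1.0
    for X in symbols:
        result *= inner.get(X, 1.0)  # a symbol without an entry is treated as a terminal
    return result

def expected_frequencies(rules, start, iterations=200):
    """rules maps (lhs, rhs) to a probability; returns E(rule) as in equation (3)."""
    nonterminals = {lhs for lhs, _ in rules}
    inner = dict.fromkeys(nonterminals, 0.0)
    for _ in range(iterations):  # equation (7), iterated from 0
        inner = {A: sum(p * inner_of(rhs, inner)
                        for (lhs, rhs), p in rules.items() if lhs == A)
                 for A in nonterminals}
    outer = dict.fromkeys(nonterminals, 0.0)
    for _ in range(iterations):
        new_outer = dict.fromkeys(nonterminals, 0.0)
        new_outer[start] = 1.0  # equation (9)
        for (B, rhs), p in rules.items():
            for i, A in enumerate(rhs):
                if A in nonterminals and A != start:  # equation (10)
                    new_outer[A] += (outer[B] * p * inner_of(rhs[:i], inner)
                                     * inner_of(rhs[i + 1:], inner))
        outer = new_outer
    return {(lhs, rhs): outer[lhs] * p * inner_of(rhs, inner)
            for (lhs, rhs), p in rules.items()}

# Hypothetical toy PCFG: S -> A [1], A -> a A [1/3], A -> b [2/3].
toy = {
    ("S", ("A",)): 1.0,
    ("A", ("a", "A")): 1 / 3,
    ("A", ("b",)): 2 / 3,
}
E = expected_frequencies(toy, "S")
```

For this grammar the iteration approaches $\mathit{inner}(A) = 1$ and $\mathit{outer}(A) = 3/2$, so the rule A → a A has expected frequency 1/2 and A → b has expected frequency 1, which agrees with the expected number of symbols a in a string drawn from a*b with continuation probability 1/3.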
4. Intersection of Context-Free and Regular Languages
We recall a construction from Bar-Hillel, Perles, and Shamir (1964) that computes the
intersection of a context-free language and a regular language. The input consists of a
CFG $G = (\Sigma, N, S, R)$ and an FA $M = (\Sigma, Q, q_0, q_f, T)$; note that we assume, without loss
of generality, that $G$ and $M$ share the same set of terminals $\Sigma$.
The output of the construction is CFG $G_\cap = (\Sigma, N_\cap, S_\cap, R_\cap)$, where
$N_\cap = Q \times (\Sigma \cup N) \times Q$, $S_\cap = (q_0, S, q_f)$, and $R_\cap$ consists of the set of rules that is obtained as
follows:
• For each rule $\rho = (A \to X_1 \cdots X_m) \in R$, $m \geq 0$, and each sequence of states
  $r_0, \ldots, r_m \in Q$, let the rule $\rho_\cap = ((r_0, A, r_m) \to (r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m))$
  be in $R_\cap$; for $m = 0$, $R_\cap$ contains a rule $\rho_\cap = ((r_0, A, r_0) \to \epsilon)$ for each
  state $r_0$.
• For each transition $\tau = (r \stackrel{a}{\mapsto} s) \in T$, let the rule $\rho_\cap = ((r, a, s) \to a)$ be
  in $R_\cap$.
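The two rule schemas above can be sketched in Python. The function name `intersect` and the encoding of rules and transitions as tuples are assumptions for illustration. The sketch builds every rule without pruning, so the resulting grammar is in general not reduced.

```python
from itertools import product

def intersect(rules, transitions, states):
    """Unpruned intersection grammar.

    rules: set of (lhs, rhs) with rhs a tuple of symbols;
    transitions: set of (r, a, s); states: list of FA states.
    """
    new_rules = set()
    for lhs, rhs in rules:
        m = len(rhs)
        # One rule per sequence of states r_0, ..., r_m (for m = 0, a single state r_0).
        for seq in product(states, repeat=m + 1):
            new_rhs = tuple((seq[i], rhs[i], seq[i + 1]) for i in range(m))
            new_rules.add(((seq[0], lhs, seq[m]), new_rhs))
    for r, a, s in transitions:
        # One rule (r, a, s) -> a per transition.
        new_rules.add(((r, a, s), (a,)))
    return new_rules

# Hypothetical toy inputs: a grammar for a*b and a two-state FA for the same language.
toy_rules = {("S", ("A",)), ("A", ("a", "A")), ("A", ("b",))}
toy_transitions = {("q0", "a", "q0"), ("q0", "b", "qf")}
g_cap = intersect(toy_rules, toy_transitions, ["q0", "qf"])
```

A rule with $m$ right-hand-side symbols yields $|Q|^{m+1}$ rules, so the toy inputs give 4 + 8 + 4 rules from the grammar plus 2 from the transitions.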
Note that for each rule $(r_0, A, r_m) \to (r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m)$ from $R_\cap$, there is a
unique rule $A \to X_1 \cdots X_m$ from $R$ from which it has been constructed by the above.
Similarly, each rule $(r, a, s) \to a$ uniquely identifies a transition $r \stackrel{a}{\mapsto} s$. This means that if
we take a derivation $d_\cap$ in $G_\cap$, we can extract a sequence $h_1(d_\cap)$ of rules from $G$ and a
sequence $h_2(d_\cap)$ of transitions from $M$, where $h_1$ and $h_2$ are string homomorphisms that
we define pointwise as

$$h_1(\rho_\cap) = \begin{cases}
\rho & \text{if } \rho_\cap = ((r_0, A, r_m) \to (r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m)) \text{ and } \rho = (A \to X_1 \cdots X_m) \quad (11) \\
\epsilon & \text{if } \rho_\cap = ((r, a, s) \to a) \quad (12)
\end{cases}$$

$$h_2(\rho_\cap) = \begin{cases}
\tau & \text{if } \rho_\cap = ((r, a, s) \to a) \text{ and } \tau = (r \stackrel{a}{\mapsto} s) \quad (13) \\
\epsilon & \text{if } \rho_\cap = ((r_0, A, r_m) \to (r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m)) \quad (14)
\end{cases}$$

We define $h(d_\cap) = (h_1(d_\cap), h_2(d_\cap))$. It can be easily shown that if $h(d_\cap) = (d, c)$ and
$S_\cap \stackrel{d_\cap}{\Rightarrow} w$, then for the same $w$, we have $S \stackrel{d}{\Rightarrow} w$ and $(q_0, w) \stackrel{c}{\vdash} (q_f, \epsilon)$. Conversely, if for some
$w$, $d$, and $c$ we have $S \stackrel{d}{\Rightarrow} w$ and $(q_0, w) \stackrel{c}{\vdash} (q_f, \epsilon)$, then there is precisely one derivation $d_\cap$
such that $h(d_\cap) = (d, c)$ and $S_\cap \stackrel{d_\cap}{\Rightarrow} w$.
It was observed by Lang (1994) that $G_\cap$ can be seen as a parse forest, that is, a
compact representation of all parse trees according to G that derive strings recognized
by M. The construction can be generalized to, for example, tree-adjoining grammars
(Vijay-Shanker and Weir 1993) and range concatenation grammars (Boullier 2000;
Bertsch and Nederhof 2001). The construction for the latter also has implications for
linear context-free rewriting systems (Seki et al. 1991).
The construction has been extended by Nederhof and Satta (2003) to apply to a
PCFG $G = (\Sigma, N, S, R, p_G)$ and a PFA $M = (\Sigma, Q, q_0, q_f, T, p_M)$. The output is a
PCFG $G_\cap = (\Sigma, N_\cap, S_\cap, R_\cap, p_\cap)$, where $N_\cap$, $S_\cap$, and $R_\cap$ are as before, and $p_\cap$ is
defined by

$$p_\cap((r_0, A, r_m) \to (r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m)) = p_G(A \to X_1 \cdots X_m) \tag{15}$$

$$p_\cap((r, a, s) \to a) = p_M(r \stackrel{a}{\mapsto} s) \tag{16}$$

If $d_\cap$, $d$, and $c$ are such that $h(d_\cap) = (d, c)$, then clearly $p_\cap(d_\cap) = p_G(d) \cdot p_M(c)$.
5. Training Models on Models
We restrict ourselves to a few cases of the general technique of training a model on the
basis of another model.
5.1 Training a PFA on a PCFG
Let us assume we have a proper and consistent PCFG $G = (\Sigma, N, S, R, p_G)$ and an FA
$M = (\Sigma, Q, q_0, q_f, T)$ that is unambiguous. This FA may have resulted from (nonprobabilistic)
approximation of CFG $(\Sigma, N, S, R)$, but it may also be totally unrelated to $G$.
Note that an FA is guaranteed to be unambiguous if it is deterministic; any FA can be
determinized. Our goal is now to assign probabilities to the transitions from FA M to
obtain a proper PFA that approximates the probability distribution described by G as
well as possible.
Let us define $\mathbf{1}$ as the function that maps each transition from $T$ to one. This means
that for each $r$, $w$, $c$ and $s$, $\mathbf{1}((r, w) \stackrel{c}{\vdash} (s, \epsilon)) = 1$ if $(r, w) \stackrel{c}{\vdash} (s, \epsilon)$, and
$\mathbf{1}((r, w) \stackrel{c}{\vdash} (s, \epsilon)) = 0$ otherwise.
Of the set of strings generated by G, a subset is recognized by computations of M;
note again that there can be at most one such computation for each string. The expected
frequency of a transition τ in such computations is given by
$$E(\tau) = \sum_{w, c, c'} p_G(w) \cdot \mathbf{1}((q_0, w) \stackrel{c \tau c'}{\vdash} (q_f, \epsilon)) \tag{17}$$
Now we construct the PCFG $G_\cap$ as explained in section 4 from the PCFG $G$ and the
PFA $(\Sigma, Q, q_0, q_f, T, \mathbf{1})$. Let $\tau = (r \stackrel{a}{\mapsto} s) \in T$ and $\rho = ((r, a, s) \to a)$. On the basis of the
properties of function $h$, we can now rewrite $E(\tau)$ as

$$\begin{aligned}
E(\tau) &= \sum_{d, w, c, c'} p_G(S \stackrel{d}{\Rightarrow} w) \cdot \mathbf{1}((q_0, w) \stackrel{c \tau c'}{\vdash} (q_f, \epsilon)) \\
&= \sum_{e, d, w, c, c':\ h(e) = (d, c \tau c')} p_G(S \stackrel{d}{\Rightarrow} w) \cdot \mathbf{1}((q_0, w) \stackrel{c \tau c'}{\vdash} (q_f, \epsilon)) \\
&= \sum_{e, e', w} p_\cap(S_\cap \stackrel{e \rho e'}{\Rightarrow} w) \\
&= E(\rho)
\end{aligned} \tag{18}$$

Hereby we have expressed the expected frequency of a transition $\tau = (r \stackrel{a}{\mapsto} s)$ in
terms of the expected frequency of rule $\rho = ((r, a, s) \to a)$ in derivations in PCFG $G_\cap$.
It was explained in section 3 how such a value can be computed. Note that since
by definition $\mathbf{1}(\tau) = 1$, also $p_\cap(\rho) = 1$. Furthermore, for the right-hand side $a$ of $\rho$,
$\mathit{inner}(a) = 1$. Therefore,

$$\begin{aligned}
E(\tau) &= \mathit{outer}((r, a, s)) \cdot p_\cap(\rho) \cdot \mathit{inner}(a) \\
&= \mathit{outer}((r, a, s))
\end{aligned} \tag{19}$$
To obtain the required PFA $(\Sigma, Q, q_0, q_f, T, p_M)$, we now define the probability
function $p_M$ for each $\tau = (r \stackrel{a}{\mapsto} s) \in T$ as

$$p_M(\tau) = \frac{\mathit{outer}((r, a, s))}{\sum_{a', s':\ (r \stackrel{a'}{\mapsto} s') \in T} \mathit{outer}((r, a', s'))} \tag{20}$$

That such a relative frequency estimator $p_M$ minimizes the KL distance between $p_G$ and
$p_M$ on the domain $L(M)$ is proven in the appendix.
An example with finite languages is given in Figure 1. We have, for example,

$$p_M(q_0 \stackrel{a}{\mapsto} q_1) = \frac{\mathit{outer}((q_0, a, q_1))}{\mathit{outer}((q_0, a, q_1)) + \mathit{outer}((q_0, c, q_1))} = \frac{\frac{1}{3}}{\frac{1}{3} + \frac{2}{3}} = \frac{1}{3} \tag{21}$$
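The normalization in (20) can be sketched in Python once the values of outer are available. The function name `normalize_transitions` is an assumption, and only the two outer values that appear in (21) are used; the remaining transitions of the automaton in Figure 1 are not reproduced here.

```python
from fractions import Fraction

def normalize_transitions(outer):
    """outer maps (r, a, s) to outer((r, a, s)); returns p_M as in equation (20)."""
    totals = {}
    for (r, _a, _s), value in outer.items():
        totals[r] = totals.get(r, 0) + value
    # Transitions leaving a state with zero total expected frequency are left unassigned.
    return {t: value / totals[t[0]] for t, value in outer.items() if totals[t[0]] != 0}

# The two outer values from equation (21).
outer = {("q0", "a", "q1"): Fraction(1, 3), ("q0", "c", "q1"): Fraction(2, 3)}
p_M = normalize_transitions(outer)
```

With these inputs the transition from $q_0$ to $q_1$ on $a$ receives probability 1/3, as in (21), and the one on $c$ receives 2/3.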
5.2 Training a PCFG on a PFA
Similarly to section 5.1, we now assume we have a proper PFA $M = (\Sigma, Q, q_0, q_f, T, p_M)$
and a CFG $G = (\Sigma, N, S, R)$ that is unambiguous. Our goal is to find a
function $p_G$ that lets proper and consistent PCFG $(\Sigma, N, S, R, p_G)$ approximate $M$ as
well as possible. Although CFGs used for natural language processing are usually
ambiguous, there may be cases in other fields in which we may assume grammars are
unambiguous.
Figure 1
Example of input PCFG $G$, with rule probabilities between square brackets, input FA $M$, the
reduced PCFG $G_\cap$, and the resulting trained PFA.
Let us define $\mathbf{1}$ as the function that maps each rule from $R$ to one. Of the set of
strings recognized by $M$, a subset can be derived in $G$. The expected frequency of a rule
$\rho$ in those derivations is given by

$$E(\rho) = \sum_{d, d', w} p_M(w) \cdot \mathbf{1}(S \stackrel{d \rho d'}{\Rightarrow} w) \tag{22}$$

Now we construct the PCFG $G_\cap$ from the PCFG $G = (\Sigma, N, S, R, \mathbf{1})$ and the
PFA $M$ as explained in section 4. Analogously to section 5.1, we obtain for each
$\rho = (A \to X_1 \cdots X_m)$

$$\begin{aligned}
E(\rho) &= \sum_{r_0, r_1, \ldots, r_m} E((r_0, A, r_m) \to (r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m)) \\
&= \sum_{r_0, r_1, \ldots, r_m} \mathit{outer}((r_0, A, r_m)) \cdot \mathit{inner}((r_0, X_1, r_1) \cdots (r_{m-1}, X_m, r_m))
\end{aligned} \tag{23}$$

To obtain the required PCFG $(\Sigma, N, S, R, p_G)$, we now define the probability function
$p_G$ for each $\rho = (A \to \alpha)$ as

$$p_G(\rho) = \frac{E(\rho)}{\sum_{\rho' = (A \to \alpha') \in R} E(\rho')} \tag{24}$$

The proof that this relative frequency estimator $p_G$ minimizes the KL distance between
$p_M$ and $p_G$ on the domain $L(G)$ is almost identical to the proof in the appendix for a
similar claim from section 5.1.
5.3 Training a PFA on a PFA
We now assume we have a proper PFA $M_1 = (\Sigma, Q_1, q_{0,1}, q_{f,1}, T_1, p_1)$ and an FA
$M_2 = (\Sigma, Q_2, q_{0,2}, q_{f,2}, T_2)$ that is unambiguous. Our goal is to find a function $p_2$ so that
proper PFA $(\Sigma, Q_2, q_{0,2}, q_{f,2}, T_2, p_2)$ approximates $M_1$ as well as possible, minimizing
the KL distance between $p_1$ and $p_2$ on the domain $L(M_2)$.
One way to solve this problem is to map $M_2$ to an equivalent right-linear CFG $G$ and
then to apply the algorithm from section 5.2. The obtained probability function $p_G$ can
be translated back to an appropriate function $p_2$. For this special case, the construction
from section 4 can be simplified to the “cross-product” construction of finite automata
(see, e.g., Aho and Ullman 1972). The simplified forms of the functions inner and outer
from section 3 are commonly called forward and backward, respectively, and they are
defined by systems of linear equations. As a result, we can compute exact solutions, as
opposed to approximate solutions by iteration.
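A possible shape of this computation is sketched below in Python, assuming a cross-product automaton whose transitions carry the probabilities of $M_1$. The function names, the exact Gauss-Jordan solver over fractions, and the toy automata are all assumptions made for this example; in particular, the linear systems are assumed to be nonsingular. The forward value of a product state sums the weights of all paths from the initial state, the backward value sums the weights of all paths to the final state, and the expected frequency of a transition of $M_2$ is accumulated from forward value times weight times backward value.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination; A is assumed nonsingular."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for i in range(n):
            factor = M[i][col]
            if i != col and factor != 0:
                M[i] = [x - factor * y for x, y in zip(M[i], M[col])]
    return [M[i][n] for i in range(n)]

def train_pfa_on_pfa(t1, start1, final1, t2, start2, final2):
    """t1 maps (r, a, s) to a probability (PFA M1); t2 is a set of (r, a, s) (FA M2)."""
    # Cross-product transitions: both automata read the same symbol.
    prod = [((r1, r2), (s1, s2), p, (r2, a, s2))
            for (r1, a, s1), p in t1.items()
            for (r2, b, s2) in t2 if a == b]
    start, final = (start1, start2), (final1, final2)
    states = sorted({start, final} | {q for e in prod for q in e[:2]})
    idx = {q: i for i, q in enumerate(states)}
    n = len(states)
    # forward: f = e_start + W^T f ; backward: b = e_final + W b
    F = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    B = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for src, dst, p, _ in prod:
        F[idx[dst]][idx[src]] -= p
        B[idx[src]][idx[dst]] -= p
    fwd = solve(F, [Fraction(int(q == start)) for q in states])
    bwd = solve(B, [Fraction(int(q == final)) for q in states])
    expected = dict.fromkeys(t2, Fraction(0))
    for src, dst, p, tau2 in prod:
        expected[tau2] += fwd[idx[src]] * p * bwd[idx[dst]]
    totals = {}
    for (r, _a, _s), e in expected.items():
        totals[r] = totals.get(r, 0) + e
    return {t: e / totals[t[0]] for t, e in expected.items() if totals[t[0]] != 0}

# Hypothetical inputs: M1 alternates between two states, M2 has a single loop state.
m1 = {("p0", "a", "p1"): Fraction(1, 2), ("p0", "b", "pf"): Fraction(1, 2),
      ("p1", "a", "p0"): Fraction(1, 4), ("p1", "b", "pf"): Fraction(3, 4)}
m2 = {("q0", "a", "q0"), ("q0", "b", "qf")}
p2 = train_pfa_on_pfa(m1, "p0", "pf", m2, "q0", "qf")
```

In this toy case the forward values of the two non-final product states are 8/7 and 4/7 and all backward values are 1, so the loop on a has expected frequency 5/7 and the transition on b has expected frequency 1. After normalization the trained automaton assigns 5/12 and 7/12.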
Appendix
We now prove that the choice of $p_M$ in section 5.1 is such that it minimizes the Kullback-Leibler
distance between $p_G$ and $p_M$, restricted to the domain $L(M)$. Without this
restriction, the KL distance is given by

$$D(p_G \,\|\, p_M) = \sum_{w} p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)} \tag{25}$$
This can be used for many applications mentioned in section 1. For example, an FA $M$
approximating a CFG $G$ is guaranteed to be such that $L(M) \supseteq L(G)$ in the case of most
practical approximation algorithms. However, if there are strings $w$ such that $w \notin L(M)$
and $p_G(w) > 0$, then (25) is infinite, regardless of the choice of $p_M$. We therefore restrict
$p_G$ to the domain $L(M)$ and normalize it to obtain

$$p_{G|M}(w) = \begin{cases}
\dfrac{p_G(w)}{Z}, & \text{if } w \in L(M) \quad (26) \\
0, & \text{otherwise} \quad (27)
\end{cases}$$

where $Z = \sum_{w:\ w \in L(M)} p_G(w)$. Note that $p_{G|M} = p_G$ if $L(M) \supseteq L(G)$. Our goal is now to
show that our choice of $p_M$ minimizes

$$\begin{aligned}
D(p_{G|M} \,\|\, p_M) &= \sum_{w:\ w \in L(M)} p_{G|M}(w) \cdot \log \frac{p_{G|M}(w)}{p_M(w)} \\
&= \log \frac{1}{Z} + \frac{1}{Z} \sum_{w:\ w \in L(M)} p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)}
\end{aligned} \tag{28}$$
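The rewriting in (28) can be checked numerically on a small finite example. In the Python sketch below, the function names and the three-string distribution are assumptions for illustration; logarithms are taken to base 2, in line with the powers of 2 used later in the proof, and $p_M(w)$ is assumed to be positive for every $w$ in the language with $p_G(w) > 0$.

```python
import math

def restricted_kl(p_G, p_M, language):
    """Left-hand side of (28): D(p_{G|M} || p_M), with base-2 logarithms."""
    Z = sum(p for w, p in p_G.items() if w in language)
    return sum((p / Z) * math.log2((p / Z) / p_M[w])
               for w, p in p_G.items() if w in language and p > 0)

def rhs_of_28(p_G, p_M, language):
    """Right-hand side of (28)."""
    Z = sum(p for w, p in p_G.items() if w in language)
    total = sum(p * math.log2(p / p_M[w])
                for w, p in p_G.items() if w in language and p > 0)
    return math.log2(1 / Z) + total / Z

# Hypothetical finite distributions; "c" lies outside L(M).
p_G = {"ab": 0.5, "b": 0.25, "c": 0.25}
p_M = {"ab": 0.5, "b": 0.5}
language = {"ab", "b"}
lhs = restricted_kl(p_G, p_M, language)
rhs = rhs_of_28(p_G, p_M, language)
```

Here $Z = 3/4$, and both sides evaluate to $\log_2 (4/3) - 1/3$.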
As $Z$ is independent of $p_M$, it is sufficient to show that our choice of $p_M$ minimizes

$$\sum_{w:\ w \in L(M)} p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)} \tag{29}$$

Now consider the expression

$$\prod_{\tau} p_M(\tau)^{E(\tau)} \tag{30}$$
By the usual proof technique with Lagrange multipliers, it is easy to show that our
choice of $p_M$ in section 5.1, given by

$$p_M(\tau) = \frac{E(\tau)}{\sum_{\tau', a', s':\ \tau' = (r \stackrel{a'}{\mapsto} s') \in T} E(\tau')} \tag{31}$$

for each $\tau = (r \stackrel{a}{\mapsto} s) \in T$, is such that it maximizes (30), under the constraint of
properness.
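For readers who want the omitted step, one way to carry out the Lagrange multiplier argument is sketched below. The multipliers $\lambda_r$ and the use of the natural logarithm are choices made for this sketch, and it assumes that every state $r \in Q - \{q_f\}$ has at least one outgoing transition with $E(\tau) > 0$. Since the logarithm is monotone, maximizing (30) is equivalent to maximizing $\sum_\tau E(\tau) \ln p_M(\tau)$, which is concave in $p_M$, so the stationary point is a maximum.

```latex
% Maximize the logarithm of (30) subject to properness; \lambda_r are the multipliers.
\begin{align*}
\Lambda &= \sum_{\tau \in T} E(\tau) \ln p_M(\tau)
  - \sum_{r \in Q - \{q_f\}} \lambda_r
    \Bigl( \sum_{a', s':\, (r \stackrel{a'}{\mapsto} s') \in T}
           p_M(r \stackrel{a'}{\mapsto} s') - 1 \Bigr) \\
\frac{\partial \Lambda}{\partial p_M(\tau)} &= \frac{E(\tau)}{p_M(\tau)} - \lambda_r = 0
  \quad\Longrightarrow\quad p_M(\tau) = \frac{E(\tau)}{\lambda_r}
  \qquad \text{for } \tau = (r \stackrel{a}{\mapsto} s) \\
\sum_{a', s':\, \tau' = (r \stackrel{a'}{\mapsto} s') \in T} p_M(\tau') = 1
  &\quad\Longrightarrow\quad
  \lambda_r = \sum_{a', s':\, \tau' = (r \stackrel{a'}{\mapsto} s') \in T} E(\tau')
\end{align*}
```

Substituting $\lambda_r$ back into $p_M(\tau) = E(\tau) / \lambda_r$ gives exactly the relative frequency estimator in (31).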
For $\tau \in T$ and $w \in \Sigma^*$, we define $\#_\tau(w)$ to be zero, if $w \notin L(M)$, and otherwise to be
the number of occurrences of $\tau$ in the (unique) computation that recognizes $w$. Formally,
$\#_\tau(w) = \sum_{c, c'} \mathbf{1}((q_0, w) \stackrel{c \tau c'}{\vdash} (q_f, \epsilon))$. We rewrite (30) as

$$\begin{aligned}
\prod_{\tau} p_M(\tau)^{E(\tau)}
&= \prod_{\tau} p_M(\tau)^{\sum_{w} p_G(w) \cdot \#_\tau(w)} \\
&= \prod_{w} \prod_{\tau} p_M(\tau)^{p_G(w) \cdot \#_\tau(w)} \\
&= \prod_{w} \Bigl( \prod_{\tau} p_M(\tau)^{\#_\tau(w)} \Bigr)^{p_G(w)} \\
&= \prod_{w:\ p_M(w) > 0} p_M(w)^{p_G(w)} \\
&= \prod_{w:\ p_M(w) > 0} 2^{\,p_G(w) \cdot \log p_M(w)} \\
&= \prod_{w:\ p_M(w) > 0} 2^{\,p_G(w) \cdot \log p_M(w) - p_G(w) \cdot \log p_G(w) + p_G(w) \cdot \log p_G(w)} \\
&= \prod_{w:\ p_M(w) > 0} 2^{\,-p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)} + p_G(w) \cdot \log p_G(w)} \\
&= 2^{\,-\sum_{w:\ p_M(w) > 0} p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)}} \cdot 2^{\,\sum_{w:\ p_M(w) > 0} p_G(w) \cdot \log p_G(w)}
\end{aligned} \tag{32}$$
We have already seen that the choice of $p_M$ that maximizes (30) is given by (31), and
(31) implies $p_M(w) > 0$ for all $w$ such that $w \in L(M)$ and $p_G(w) > 0$. Since $p_M(w) > 0$ is
impossible for $w \notin L(M)$, the value of

$$2^{\,\sum_{w:\ p_M(w) > 0} p_G(w) \cdot \log p_G(w)} \tag{33}$$

is determined solely by $p_G$ and by the condition that $p_M(w) > 0$ for all $w$ such that
$w \in L(M)$ and $p_G(w) > 0$. This implies that (30) is maximized by choosing $p_M$ such
that

$$2^{\,-\sum_{w:\ p_M(w) > 0} p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)}} \tag{34}$$

is maximized, or alternatively that

$$\sum_{w:\ p_M(w) > 0} p_G(w) \cdot \log \frac{p_G(w)}{p_M(w)} \tag{35}$$

is minimized, under the constraint that $p_M(w) > 0$ for all $w$ such that $w \in L(M)$ and
$p_G(w) > 0$. For this choice of $p_M$, (29) equals (35).
Conversely, if a choice of $p_M$ minimizes (29), we may assume that $p_M(w) > 0$ for
all $w$ such that $w \in L(M)$ and $p_G(w) > 0$, since otherwise (29) is infinite. Again, for this
choice of $p_M$, (29) equals (35). It follows that the choice of $p_M$ that minimizes (29) concurs
with the choice of $p_M$ that maximizes (30), which concludes our proof.
Acknowledgments
Comments by Khalil Sima’an, Giorgio Satta,
Yuval Krymolowski, and anonymous
reviewers are gratefully acknowledged. The
author is supported by the PIONIER Project
Algorithms for Linguistic Processing, funded
by NWO (Dutch Organization for Scientific
Research).

References
Aho, Alfred V. and Jeffrey D. Ullman. 1972. Parsing, volume 1 of The Theory of Parsing, Translation and Compiling. Prentice Hall, Englewood Cliffs, NJ. 
Bar-Hillel, Yehoshua, M. Perles, and E. Shamir. 1964. On formal properties of simple phrase structure grammars. In Yehoshua Bar-Hillel, editor, Language and Information: Selected Essays on Their Theory and Application. Addison-Wesley, Reading, MA, pages 116–150.
Bertsch, Eberhard and Mark-Jan Nederhof. 2001. On the complexity of some extensions of RCG parsing. In Proceedings of the Seventh International Workshop on Parsing Technologies, pages 66–77, Beijing, October.
Booth, Taylor L. and Richard A. Thompson. 1973. Applying probabilistic measures to abstract languages. IEEE Transactions on Computers, C-22(5):442–450.
Boullier, Pierre. 2000. Range concatenation grammars. In Proceedings of the Sixth International Workshop on Parsing Technologies, pages 53–64, Trento, Italy, February.
Jurafsky, Daniel, Chuck Wooters, Gary Tajchman, Jonathan Segal, Andreas Stolcke, Eric Fosler, and Nelson Morgan. 1994. The Berkeley Restaurant Project. In Proceedings of the International Conference on Spoken Language Processing (ICSLP-94), pages 2139–2142, Yokohama, Japan.
Lang, Bernard. 1994. Recognition can be harder than parsing. Computational Intelligence, 10(4):486–494.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Mohri, Mehryar. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311.
Mohri, Mehryar and Mark-Jan Nederhof. 2001. Regular approximation of context-free grammars through transformation. In J.-C. Junqua and G. van Noord, editors, Robustness in Language and Speech Technology. Kluwer Academic, pages 153–163.
Nederhof, Mark-Jan. 2000. Practical experiments with regular approximation of context-free languages. Computational Linguistics, 26(1):17–44.
Nederhof, Mark-Jan and Giorgio Satta. 2003. Probabilistic parsing as intersection. In Proceedings of the Eighth International Workshop on Parsing Technologies, pages 137–148, Laboratoire Lorrain de recherche en informatique et ses applications (LORIA), Nancy, France, April.
Paz, Azaria. 1971. Introduction to Probabilistic Automata. Academic Press, New York. 
Rimon, Mori and J. Herz. 1991. The recognition capacity of local syntactic constraints. In Proceedings of the Fifth Conference of the European Chapter of the ACL, pages 155–160, Berlin, April.
Santos, Eugene S. 1972. Probabilistic grammars and automata. Information and Control, 21:27–47.
Seki, Hiroyuki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88:191–229.
Starke, Peter H. 1972. Abstract Automata. North-Holland, Amsterdam.
Stolcke, Andreas and Jonathan Segal. 1994. Precise N-gram probabilities from stochastic context-free grammars. In Proceedings of the 32nd Annual Meeting of the ACL, pages 74–79, Las Cruces, NM, June.
Vijay-Shanker, K. and David J. Weir. 1993. The use of shared forests in tree adjoining grammar parsing. In Proceedings of the Sixth Conference of the European Chapter of the ACL, pages 384–393, Utrecht, The Netherlands, April.
Zue, Victor, James Glass, David Goodine, Hong Leung, Michael Phillips, Joseph Polifroni, and Stephanie Seneff. 1991. Integration of speech recognition and natural language processing in the MIT Voyager system. In Proceedings of the ICASSP-91, Toronto, volume 1, pages 713–716.
