Generalized Algorithms for Constructing Statistical Language Models
Cyril Allauzen, Mehryar Mohri, Brian Roark
AT&T Labs – Research
180 Park Avenue
Florham Park, NJ 07932, USA
{allauzen,mohri,roark}@research.att.com
Abstract
Recent text and speech processing applications such as
speech mining raise new and more general problems re-
lated to the construction of language models. We present
and describe in detail several new and efficient algorithms
to address these more general problems and report ex-
perimental results demonstrating their usefulness. We
give an algorithm for efficiently computing the expected
counts of any sequence in a word lattice output by a
speech recognizer or in an arbitrary weighted automaton;
describe a new technique for creating exact representations
of n-gram language models by weighted automata
whose size is practical for offline use even for a vocabulary
size of about 500,000 words and an n-gram order
n = 6; and present a simple and more general technique
for constructing class-based language models that allows
each class to represent an arbitrary weighted automaton.
An efficient implementation of our algorithms and tech-
niques has been incorporated in a general software library
for language modeling, the GRM Library, that includes
many other text and grammar processing functionalities.
1 Motivation
Statistical language models are crucial components of
many modern natural language processing systems such
as speech recognition, information extraction, machine
translation, or document classification. In all cases, a
language model is used in combination with other in-
formation sources to rank alternative hypotheses by as-
signing them some probabilities. There are classical
techniques for constructing language models, such as n-gram
models with various smoothing techniques (see
Chen and Goodman (1998) and the references therein for
a survey and comparison of these techniques).
In some recent text and speech processing applications,
several new and more general problems arise that are re-
lated to the construction of language models. We present
new and efficient algorithms to address these more gen-
eral problems.
Counting. Classical language models are constructed
by deriving statistics from large input texts. In speech
mining applications or for adaptation purposes, one often
needs to construct a language model based on the out-
put of a speech recognition system. But, the output of a
recognition system is not just text. Indeed, the word er-
ror rate of conversational speech recognition systems is
still too high in many tasks to rely only on the one-best
output of the recognizer. Thus, the word lattice output
by speech recognition systems is used instead because it
contains the correct transcription in most cases.
A word lattice is a weighted finite automaton (WFA)
output by the recognizer for a particular utterance. It
typically contains a very large set of alternative transcriptions
of that utterance with the corresponding
weights or probabilities. A necessary step for construct-
ing a language model based on a word lattice is to derive
the statistics for any given sequence from the lattices or
WFAs output by the recognizer. This cannot be done by
simply enumerating each path of the lattice and counting
the number of occurrences of the sequence considered in
each path since the number of paths of even a small au-
tomaton may be more than four billion. We present a
simple and efficient algorithm for computing the expected
count of any given sequence in a WFA and report experi-
mental results demonstrating its efficiency.
Representation of language models by WFAs. Classical
n-gram language models admit a natural representation
by WFAs in which each state encodes a left context
of width less than n. However, the size of that representation
makes it impractical for offline optimizations such
as those used in large-vocabulary speech recognition or
general information extraction systems. Most offline representations
of these models are based instead on an approximation
to limit their size. We describe a new technique
for creating an exact representation of n-gram language
models by WFAs whose size is practical for offline
use even in tasks with a vocabulary size of about 500,000
words and for n = 6.
Class-based models. In many applications, it is nat-
ural and convenient to construct class-based language
models, that is, models based on classes of words (Brown
et al., 1992). Such models are also often more robust
since they may include words that belong to a class but
that were not found in the corpus. Classical class-based
models are based on simple classes such as a list of
words. But new clustering algorithms allow one to create
more general and more complex classes that may be reg-
ular languages. Very large and complex classes can also
be defined using regular expressions. We present a simple
and more general approach to class-based language mod-
els based on general weighted context-dependent rules
(Kaplan and Kay, 1994; Mohri and Sproat, 1996). Our
approach allows us to deal efficiently with more complex
classes such as weighted regular languages.
We have fully implemented the algorithms just men-
tioned and incorporated them in a general software li-
brary for language modeling, the GRM Library, that in-
cludes many other text and grammar processing function-
alities (Allauzen et al., 2003). In the following, we will
present in detail these algorithms and briefly describe the
corresponding GRM utilities.
2 Preliminaries
Definition 1 A system $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ is a semiring
(Kuich and Salomaa, 1986) if: $(\mathbb{K}, \oplus, \bar{0})$ is a commutative
monoid with identity element $\bar{0}$; $(\mathbb{K}, \otimes, \bar{1})$ is a monoid
with identity element $\bar{1}$; $\otimes$ distributes over $\oplus$; and $\bar{0}$ is an
annihilator for $\otimes$: for all $a \in \mathbb{K}$, $a \otimes \bar{0} = \bar{0} \otimes a = \bar{0}$.
Thus, a semiring is a ring that may lack negation. Two
semirings often used in speech processing are: the log
semiring $\mathcal{L} = (\mathbb{R} \cup \{-\infty, +\infty\}, \oplus_{\log}, +, +\infty, 0)$ (Mohri, 2002),
which is isomorphic to the familiar real or probability
semiring $(\mathbb{R}_+, +, \times, 0, 1)$ via a $-\log$ morphism, with, for
all $a, b \in \mathbb{R} \cup \{-\infty, +\infty\}$:
\[ a \oplus_{\log} b = -\log(\exp(-a) + \exp(-b)) \]
and the conventions that $\exp(-\infty) = 0$ and
$-\log(0) = +\infty$; and the tropical semiring $\mathcal{T} = (\mathbb{R}_+ \cup
\{+\infty\}, \min, +, +\infty, 0)$, which can be derived from the log
semiring using the Viterbi approximation.
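For instance, for weights $a = 1$ and $b = 2$ interpreted as negative log probabilities, $a \oplus_{\log} b = -\log(e^{-1} + e^{-2}) \approx 0.69$, whereas the tropical semiring retains only the weight of the most probable alternative: $\min(1, 2) = 1$.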
Definition 2 A weighted finite-state transducer $T$ over a
semiring $\mathbb{K}$ is an 8-tuple $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$
where: $\Sigma$ is the finite input alphabet of the transducer;
$\Delta$ is the finite output alphabet; $Q$ is a finite set of states;
$I \subseteq Q$ the set of initial states; $F \subseteq Q$ the set of final
states; $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a finite
set of transitions; $\lambda : I \to \mathbb{K}$ the initial weight function;
and $\rho : F \to \mathbb{K}$ the final weight function mapping $F$ to
$\mathbb{K}$.
A weighted automaton $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ is defined
in a similar way by simply omitting the output labels.
We denote by $L(A) \subseteq \Sigma^*$ the set of strings accepted
by an automaton $A$ and similarly by $L(X)$ the strings described
by a regular expression $X$.
Given a transition $e \in E$, we denote by $i[e]$ its input
label, $p[e]$ its origin or previous state, $n[e]$ its destination
state or next state, $w[e]$ its weight, and $o[e]$ its output
label (transducer case). Given a state $q \in Q$, we denote
by $E[q]$ the set of transitions leaving $q$.
A path $\pi = e_1 \cdots e_k$ is an element of $E^*$ with consecutive
transitions: $n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. We
extend $n$ and $p$ to paths by setting $n[\pi] = n[e_k]$ and
$p[\pi] = p[e_1]$. A cycle $\pi$ is a path whose origin and
destination states coincide: $n[\pi] = p[\pi]$. We denote by
$P(q, q')$ the set of paths from $q$ to $q'$ and by $P(q, x, q')$
and $P(q, x, y, q')$ the set of paths from $q$ to $q'$ with input
label $x \in \Sigma^*$ and output label $y$ (transducer case).
These definitions can be extended to subsets $R, R' \subseteq Q$
by:
\[ P(R, x, R') = \bigcup_{q \in R,\, q' \in R'} P(q, x, q') \]
The labeling functions $i$ (and similarly $o$) and the weight
function $w$ can also be extended to paths by defining the label
of a path as the concatenation of the labels of its
constituent transitions, and the weight of a path as the
$\otimes$-product of the weights of its constituent transitions:
$i[\pi] = i[e_1] \cdots i[e_k]$, $w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]$. We
also extend $w$ to any finite set of paths $\Pi$ by setting
$w[\Pi] = \bigoplus_{\pi \in \Pi} w[\pi]$. The output weight associated by
$A$ to each input string $x \in \Sigma^*$ is:
\[ [\![A]\!](x) = \bigoplus_{\pi \in P(I, x, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi]) \]
$[\![A]\!](x)$ is defined to be $\bar{0}$ when $P(I, x, F) = \emptyset$. Similarly,
the output weight associated by a transducer $T$ to a
pair of input-output strings $(x, y)$ is:
\[ [\![T]\!](x, y) = \bigoplus_{\pi \in P(I, x, y, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi]) \]
$[\![T]\!](x, y) = \bar{0}$ when $P(I, x, y, F) = \emptyset$. A successful
path in a weighted automaton or transducer $M$ is a path
from an initial state to a final state. $M$ is unambiguous if
for any string $x \in \Sigma^*$ there is at most one successful path
labeled with $x$. Thus, an unambiguous transducer defines
a function.
For any transducer $T$, we denote by $\Pi_2(T)$ the automaton
obtained by projecting $T$ onto its output, that is, by omitting
its input labels.
Note that the second operation of the tropical semiring
and the log semiring as well as their identity elements are
identical. Thus the weight of a path in an automaton $A$
over the tropical semiring does not change if $A$ is viewed
as a weighted automaton over the log semiring or vice-versa.
3 Counting
This section describes a counting algorithm based on
general weighted automata algorithms. Let $A =
(\Sigma, Q, I, F, E, \lambda, \rho)$ be an arbitrary weighted automaton
over the probability semiring and let $X$ be a regular
expression defined over the alphabet $\Sigma$. We are interested
in counting the occurrences of the sequences $x \in L(X)$
in $A$ while taking into account the weight of the paths
where they appear.
3.1 Definition
When $A$ is deterministic and pushed, or stochastic, it can
be viewed as a probability distribution $P$ over all strings
$\Sigma^*$.¹ The weight $[\![A]\!](x)$ associated by $A$ to each string $x$
is then $P(x)$. Thus, we define the count of the sequence
$x$ in $A$, denoted $c(x)$, as:
\[ c(x) = \sum_{u \in \Sigma^*} |u|_x \, [\![A]\!](u) \]
where $|u|_x$ denotes the number of occurrences of $x$ in the
string $u$; $c(x)$ is thus the expected number of occurrences of $x$
given $A$. More generally, we will define the count of $x$ as
above regardless of whether $A$ is stochastic or not.

[Figure 1: Counting weighted transducer $T$ with $\Sigma = \{a, b\}$: state 0 (initial) and state 1 (final) each carry self-loops $a{:}\epsilon/1$ and $b{:}\epsilon/1$, and a transition X:X/1, realizing the identity on $L(X)$, goes from state 0 to state 1. The transition weights and the final weight at state 1 are all equal to 1.]
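For instance, if $A$ assigns probability $0.75$ to the single string $u = abab$ and probability $0.25$ to $u = bb$, then $c(ab) = 2 \times 0.75 = 1.5$ and $c(bb) = 0.25$, since $ab$ occurs twice in the first string and $bb$ once in the second.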
In most speech processing applications, $A$ may be an
acyclic automaton called a phone or word lattice output
by a speech recognition system. But our algorithm is
general and does not assume $A$ to be acyclic.
3.2 Algorithm
We describe our algorithm for computing the expected
counts of the sequences $x \in L(X)$ and give the proof of
its correctness.
Let $S$ be the formal power series (Kuich and Salomaa,
1986) over the probability semiring defined by
$S = \Sigma^* \cdot x \cdot \Sigma^*$, where $x \in L(X)$.
Lemma 1 For all $u \in \Sigma^*$, $(S, u) = |u|_x$.
Proof. By definition of the multiplication of power series
in the probability semiring:
\[ (S, u) = \sum_{u = u_1 x u_2} (\Sigma^*, u_1)\,(x, x)\,(\Sigma^*, u_2) = \sum_{u = u_1 x u_2} 1 = |u|_x \]
This proves the lemma.
$S$ is a rational power series as a product and closure of
the polynomial power series $\Sigma$ and $x$ (Salomaa and Soittola,
1978; Berstel and Reutenauer, 1988). Similarly,
since $X$ is regular, the weighted transduction defined by
$(\Sigma \times \{\epsilon\})^* (X \times X) (\Sigma \times \{\epsilon\})^*$ is rational. Thus, by the
theorem of Schützenberger (Schützenberger, 1961), there
exists a weighted transducer $T$ defined over the alphabet
$\Sigma$ and the probability semiring realizing that transduction.
Figure 1 shows the transducer $T$ in the particular
case of $\Sigma = \{a, b\}$.
¹There exist general weighted determinization and weight
pushing algorithms that can be used to create a deterministic and
pushed automaton equivalent to an input word or phone lattice
(Mohri, 1997).
Proposition 1 Let $A$ be a weighted automaton over the
probability semiring; then:
\[ [\![\Pi_2(A \circ T)]\!](x) = c(x) \]
Proof. By definition of $T$, for any $u \in \Sigma^*$, $[\![T]\!](u, x) =
(S, u)$, and by Lemma 1, $[\![T]\!](u, x) = |u|_x$. Thus, by
definition of composition:
\[ [\![\Pi_2(A \circ T)]\!](x) = \sum_{u \in \Sigma^*} [\![A]\!](u) \cdot |u|_x = c(x) \]
This ends the proof of the proposition.
The proposition gives a simple algorithm for computing
the expected counts of $X$ in a weighted automaton $A$
based on two general algorithms: composition (Mohri et
al., 1996) and projection of weighted transducers. It also
relies on the transducer $T$, which is easy to construct.
The size of $T$ is in $O(|\Sigma| + |A_X|)$, where $A_X$ is a finite
automaton accepting $X$. With a lazy implementation of
$T$, a single transition can be used instead of $|\Sigma|$, thereby
reducing the size of the representation of $T$ to $O(|A_X|)$.
The weighted automaton $B = \Pi_2(A \circ T)$ contains $\epsilon$-transitions.
A general $\epsilon$-removal algorithm can be used
to compute an equivalent weighted automaton with no $\epsilon$-transitions.
The computation of $[\![B]\!](x)$ for a given $x$ is
done by composing $B$ with an automaton representing $x$
and by using a simple shortest-distance algorithm (Mohri,
2002) to compute the sum of the weights of all the paths
of the result.
For numerical stability, implementations often replace
probabilities with $-\log$ probabilities. The algorithm just
described applies in a similar way by taking $-\log$ of the
weights of $T$ (thus all the weights of $T$ become zero in
that case) and by using the log semiring versions of composition
and $\epsilon$-removal.
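To make the computation concrete, the following sketch (ours, not the GRM Library implementation, which relies on generic composition, projection and $\epsilon$-removal) accumulates the expected counts of all n-grams up to a given order in an acyclic lattice, using forward and backward path sums in place of an explicit composition with $T$; all names are illustrative.

from collections import defaultdict

def expected_ngram_counts(arcs, initial, finals, max_order):
    """Expected n-gram counts c(x) = sum_u |u|_x [[A]](u) in an acyclic
    weighted automaton over the probability semiring.

    arcs:    dict state -> list of (label, probability, next_state)
    initial: single initial state, with initial weight 1
    finals:  dict state -> final probability
    """
    # Topological order of the states reachable from the initial state.
    order, seen = [], set()
    def visit(q):
        if q not in seen:
            seen.add(q)
            for _, _, nxt in arcs.get(q, []):
                visit(nxt)
            order.append(q)
    visit(initial)
    order.reverse()

    # forward[q]: total probability of paths from the initial state to q.
    forward = defaultdict(float)
    forward[initial] = 1.0
    for q in order:
        for _, p, nxt in arcs.get(q, []):
            forward[nxt] += forward[q] * p

    # backward[q]: total probability of paths from q to a final state.
    backward = defaultdict(float)
    for q in reversed(order):
        backward[q] = finals.get(q, 0.0)
        for _, p, nxt in arcs.get(q, []):
            backward[q] += p * backward[nxt]

    # Each occurrence of x, i.e. each path pi labeled x, contributes
    # forward(p[pi]) * w[pi] * backward(n[pi]) to c(x).
    counts = defaultdict(float)
    def extend(q, ngram, prob, start_mass):
        if len(ngram) < max_order:
            for label, p, nxt in arcs.get(q, []):
                x = ngram + (label,)
                counts[x] += start_mass * prob * p * backward[nxt]
                extend(nxt, x, prob * p, start_mass)
    for q in order:
        extend(q, (), 1.0, forward[q])
    return counts

# A two-path lattice: "a b" with probability 0.6, "b b" with 0.4.
arcs = {0: [("a", 0.6, 1), ("b", 0.4, 1)], 1: [("b", 1.0, 2)]}
print(expected_ngram_counts(arcs, 0, {2: 1.0}, 2)[("b",)])   # 1.4

On the toy lattice above, $c(b) = 0.6 \times 1 + 0.4 \times 2 = 1.4$, matching the definition of section 3.1.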
3.3 GRM Utility and Experimental Results
An efficient implementation of the counting algorithm
was incorporated in the GRM library (Allauzen et al.,
2003). The GRM utility grmcount can be used in par-
ticular to generate a compact representation of the expected
counts of the n-gram sequences appearing in a
word lattice (of which a string encoded as an automaton
is a special case), whose order is less than or equal to a given
integer. As an example, the following command line:
grmcount -n3 foo.fsm > count.fsm
creates an encoded representation count.fsm of the n-gram
sequences of order $n \le 3$, which can be used to construct a
trigram model. The encoded representation itself is also
given as an automaton, which we do not describe here.
The counting utility of the GRM library is used in a va-
riety of language modeling and training adaptation tasks.
Our experiments show that grmcount is quite efficient.
We tested this utility with 41,000 weighted automata output
by our speech recognition system for the same number
of speech utterances; these automata contained several
million transitions in total. It took about 1h52m, including
I/O, to compute the accumulated expected counts
of all n-grams of order $n \le 3$ appearing in all these automata
on a single processor of a 1GHz Intel Pentium processor
Linux cluster with 2GB of memory and 256 KB cache.
The time to compute these counts represents only a small
fraction of the total duration of the 41,000 speech utterances
used in our experiment.
4 Representation of n-gram Language
Models with WFAs
Standard smoothed n-gram models, including backoff
(Katz, 1987) and interpolated (Jelinek and Mercer, 1980)
models, admit a natural representation by WFAs in which
each state encodes a conditioning history of length less
than n. The size of that representation is often prohibitive.
Indeed, the corresponding automaton may have
$|\Sigma|^{n-1}$ states and $|\Sigma|^n$ transitions. Thus, even if the vocabulary
size is just 1,000, the representation of a classical
trigram model may require in the worst case up to one
billion transitions. Clearly, this representation is even less
adequate for realistic natural language processing applications
where the vocabulary size is on the order of several
hundred thousand words.
In the past, two methods have been used to deal with
this problem. One consists of expanding that WFA on-
demand. Thus, in some speech recognition systems, the
states and transitions of the language model automaton
are constructed as needed based on the particular input
speech utterances. The disadvantage of that method is
that it cannot benefit from offline optimization techniques
that can substantially improve the efficiency of a rec-
ognizer (Mohri et al., 1998). A similar drawback af-
fects other systems where several information sources are
combined such as a complex information extraction sys-
tem. An alternative method commonly used in many ap-
plications consists of constructing instead an approxima-
tion of that weighted automaton whose size is practical
for offline optimizations. This method is used in many
large-vocabulary speech recognition systems.
In this section, we present a new method for creating
an exact representation of n-gram language models
with WFAs whose size is practical even for very large-vocabulary
tasks and for relatively high n-gram orders.
Thus, our representation does not suffer from the disadvantages
just pointed out for the two classical methods.
We first briefly present the classical definitions of n-gram
language models and several smoothing techniques
commonly used. We then describe a natural representation
of n-gram language models using failure transitions.
This is equivalent to the on-demand construction referred
to above, but it helps us introduce both the approximate
solution commonly used and our solution for an exact offline
representation.
4.1 Classical Definitions
In an n-gram model, the joint probability of a string
$w_1 \cdots w_k$ is given as the product of conditional probabilities:
\[ P(w_1 \cdots w_k) = \prod_{i=1}^{k} P(w_i \mid h_i) \qquad (1) \]
where the conditioning history $h_i$ consists of zero or more
words immediately preceding $w_i$ and is dictated by the
order of the n-gram model.
Let $c(hw)$ denote the count of n-gram $hw$ and let
$\hat{P}(w \mid h)$ be the maximum likelihood probability of $w$
given $h$, estimated from counts. $\hat{P}$ is often adjusted
to reserve some probability mass for unseen n-gram sequences.
Denote by $\tilde{P}(w \mid h)$ the adjusted conditional
probability. Katz or absolute discounting both lead to an
adjusted probability $\tilde{P}$.
For all n-grams $h = w h'$ where $h \in \Sigma^k$ for some $k \ge 1$,
we refer to $h'$ as the backoff n-gram of $h$. Conditional
probabilities in a backoff model are of the form:
\[ P(w \mid h) = \begin{cases} \tilde{P}(w \mid h) & \text{if } c(hw) > 0 \\ \alpha_h\, P(w \mid h') & \text{otherwise} \end{cases} \qquad (2) \]
where $\alpha_h$ is a factor that ensures a normalized model.
Conditional probabilities in a deleted interpolation model
are of the form:
\[ P(w \mid h) = \begin{cases} (1 - \alpha_h)\, \tilde{P}(w \mid h) + \alpha_h\, P(w \mid h') & \text{if } c(hw) > 0 \\ \alpha_h\, P(w \mid h') & \text{otherwise} \end{cases} \qquad (3) \]
where $\alpha_h$ is the mixing parameter between zero and one.
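The normalization factor can be computed directly from the adjusted probabilities; the following sketch (ours, using the standard Katz normalization rather than anything specific to the GRM Library, with illustrative names) computes $\alpha_h$ for a single history so that the conditional probabilities of equation (2) sum to one:

def backoff_weight(tilde_p, backoff_p, seen_words):
    """Backoff factor alpha_h for one history h (equation (2)).

    tilde_p:    dict w -> adjusted probability P~(w|h), for c(hw) > 0
    backoff_p:  function w -> P(w|h') for the backoff history h'
    seen_words: the words w with c(hw) > 0
    """
    # Probability mass reserved for words unseen after h ...
    held_out = 1.0 - sum(tilde_p[w] for w in seen_words)
    # ... spread over the backoff mass of those unseen words.
    unseen = 1.0 - sum(backoff_p(w) for w in seen_words)
    return held_out / unseen

With this choice, $\sum_w P(w \mid h) = \sum_{c(hw)>0} \tilde{P}(w \mid h) + \alpha_h \sum_{c(hw)=0} P(w \mid h') = 1$.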
In practice, as mentioned before, $-\log$ probabilities are
used for numerical stability. Furthermore, due to the
Viterbi approximation used in most speech processing
applications, the weight associated to a string $x$ by a
weighted automaton representing the model is the minimum
weight of a path labeled with $x$. Thus, an n-gram
language model is represented by a WFA over the tropical
semiring.
4.2 Representation with Failure Transitions
Both backoff and interpolated models can be naturally
represented using default or failure transitions. A fail-
ure transition is labeled with a distinct symbol $\phi$. It is the
default transition taken at state $q$ when $q$ does not admit
an outgoing transition labeled with the word considered.
Thus, failure transitions have the semantics of "otherwise".
[Figure 2: Representation of a trigram model with failure transitions: the state for history $w_{i-2} w_{i-1}$ has a regular transition labeled $w_i$ to the state for $w_{i-1} w_i$ and a failure transition $\phi$ to the state for $w_{i-1}$, which in turn backs off via $\phi$ to the history-less state.]
The set of states of the WFA representing a backoff or
interpolated model is defined by associating a state $q_h$ to
each sequence of length less than $n$ found in the corpus:
\[ Q = \{ q_h : |h| < n \text{ and } c(h) > 0 \} \]
Its transition set $E$ is defined as the union of the following
set of failure transitions:
\[ \{ (q_h, \phi, -\log(\alpha_h), q_{h'}) : q_h \in Q,\ h = w h' \} \]
and the following set of regular transitions:
\[ \{ (q_h, w, -\log(P(w \mid h)), n_{hw}) : q_h \in Q,\ c(hw) > 0 \} \]
where $n_{hw}$ is defined by:
\[ n_{hw} = \begin{cases} q_{hw} & \text{if } |hw| < n \\ q_{h'w},\ \text{where } h = w_1 h' & \text{otherwise} \end{cases} \qquad (4) \]
Figure 2 illustrates this construction for a trigram model.
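The construction translates directly into code; the sketch below (ours, with illustrative names, and assuming the probabilities and backoff factors of equations (2)-(4) are given) produces the transition list of this WFA, with states named by their histories:

import math

def build_backoff_wfa(histories, counts, prob, alpha, n):
    """Failure-transition WFA of a backoff n-gram model (section 4.2).

    histories: tuples h with |h| < n and c(h) > 0, oldest word first
    counts:    dict mapping n-gram tuples hw to their counts c(hw)
    prob:      function (w, h) -> P(w|h) from equation (2) or (3)
    alpha:     function h -> backoff factor alpha_h
    Returns transitions (origin, label, weight, destination), where the
    label 'phi' marks failure transitions.
    """
    arcs = []
    for h in histories:
        if len(h) >= 1:
            # failure transition q_h -> q_h', with h = w h'
            arcs.append((h, "phi", -math.log(alpha(h)), h[1:]))
        for hw, c in counts.items():
            # regular transitions for c(hw) > 0; destination per eq. (4)
            if c > 0 and hw[:-1] == h:
                dest = hw if len(hw) < n else hw[1:]
                arcs.append((h, hw[-1], -math.log(prob(hw[-1], h)), dest))
    return arcs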
Treating $\phi$-transitions as regular symbols, this is a
deterministic automaton. Figure 3 shows a complete
Katz backoff bigram model built from counts taken from
the following toy corpus and using failure transitions:

<s> b a a a a </s>
<s> b a a a a </s>
<s> a </s>

where <s> denotes the start symbol and </s> the end symbol
for each sentence. Note that the start symbol <s> does
not label any transition; it simply encodes the history <s>. All
transitions labeled with the end symbol </s> lead to the
single final state of the automaton.
4.3 Approximate Offline Representation
The method commonly used for an offline representation of
an n-gram language model can be easily derived from the
representation using failure transitions by simply replacing
each $\phi$-transition with an $\epsilon$-transition. Thus, a transition
that could only be taken in the absence of any other alternative
in the exact representation can now be taken regardless
of whether there exists an alternative transition.
As a result, the approximate representation may contain paths
whose weight does not correspond to the exact probability
of the string labeling that path according to the model.
[Figure 3: Example of representation of a bigram model with failure transitions, built from the toy corpus above. Among its transitions: the start state <s> has transitions a/1.108 and b/0.693 and a failure transition $\phi$/0.231 to the history-less state, which has transitions a/0.441, b/1.945, and </s>/1.101; the remaining arcs include a/0.287, a/0.405, </s>/1.540, $\phi$/0.356, and $\phi$/4.856.]
Consider for example the start state in figure 3, labeled
with <s>. In a failure transition model, there exists only
one path from the start state to the state labeled $a$, with a
cost of 1.108, since the $\phi$-transition cannot be traversed
with an input of $a$. If the $\phi$-transition is replaced by an
$\epsilon$-transition, there is a second path to the state labeled $a$:
taking the $\epsilon$-transition to the history-less state, then the
$a$-transition out of the history-less state. This path is not
part of the probabilistic model; we shall refer to it as an
invalid path. In this case, there is a problem, because the
cost of the invalid path to the state (the sum of the two
transition costs, 0.672) is lower than the cost of the true
path. Hence the WFA with $\epsilon$-transitions gives a lower
cost (higher probability) to all strings beginning with the
symbol $a$. Note that the invalid path from the state labeled
<s> to the state labeled $b$ has a higher cost than the correct
path, which is not a problem in the tropical semiring.
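The contrast between the two semantics can be made concrete with a small sketch (ours, illustrative): the first function scores a string with true failure semantics, backing off only when a word has no match; the second takes the minimum over all paths when the same arcs are read as freely traversable $\epsilon$-transitions, so invalid paths surface as a lower cost.

def phi_score(arcs, state, words):
    """Cost under failure semantics: back off only on a failed match."""
    cost = 0.0
    for w in words:
        while w not in arcs[state]:
            c, state = arcs[state]["phi"]   # forced backoff
            cost += c
        c, state = arcs[state][w]
        cost += c
    return cost

def eps_score(arcs, state, words, cost=0.0):
    """Minimum cost when 'phi' arcs are read as epsilon-transitions."""
    if not words:
        return cost
    best = float("inf")
    if "phi" in arcs[state]:                # optional backoff move
        c, nxt = arcs[state]["phi"]
        best = eps_score(arcs, nxt, words, cost + c)
    if words[0] in arcs[state]:             # consume the next word
        c, nxt = arcs[state][words[0]]
        best = min(best, eps_score(arcs, nxt, words[1:], cost + c))
    return best

# Fragment of figure 3: arcs as state -> {label: (weight, next state)}.
arcs = {"<s>": {"a": (1.108, "a"), "phi": (0.231, "u")},
        "u":   {"a": (0.441, "a")},
        "a":   {}}
print(phi_score(arcs, "<s>", ["a"]))   # 1.108, the correct cost
print(eps_score(arcs, "<s>", ["a"]))   # 0.672, via the invalid path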
4.4 Exact Offline Representation
This section presents a method for constructing an ex-
act offline representation of an n-gram language model
whose size remains practical for large-vocabulary tasks.
The main idea behind our new construction is to modify
the topology of the WFA so as to remove any path containing
$\epsilon$-transitions whose cost is lower than the correct cost
assigned by the model to the string labeling that path.
Since, as a result, the lowest cost path for each string will
have the correct cost, this guarantees the correctness
of the representation in the tropical semiring.
Our construction consists of two parts: the detection of the
invalid paths of the WFA, and the modification of the
topology by splitting states to remove the invalid paths.
To detect invalid paths, we first determine their initial
non-$\epsilon$ transitions. Let $E_\epsilon$ denote the set of $\epsilon$-transitions
of the original automaton. Let $P_q$ be the set of all paths
$\pi = e_1 \cdots e_k \in (E - E_\epsilon)^k$, $k > 0$, leading to state $q$ such
that for all $i$, $p[e_i]$ is the destination state of
some $\epsilon$-transition.
Lemma 2 For an n-gram language model, the number
of paths in $P_q$ is less than the n-gram order: $|P_q| < n$.
Proof. For all $\pi_i \in P_q$, let $\pi_i = \pi'_i e_i$. By definition,
there is some $e'_i \in E_\epsilon$ such that $n[e'_i] = p[e_i] = q_{h_i}$. By
definition of $\epsilon$-transitions in the model, $|h_i| < n - 1$ for
all $i$. It follows from the definition of regular transitions
that $n[e_i] = q_{h_i w_i} = q$. Hence, $h_i = h_j = h$, i.e., $e_i = e_j = e$,
for all $\pi_i, \pi_j \in P_q$. Then, $P_q = \{\pi e : \pi \in P_{p[e]}\} \cup \{e\}$.
The history-less state has no incoming non-$\epsilon$ paths;
therefore, by recursion, $|P_q| = |P_{p[e]}| + 1 = |hw| < n$.

[Figure 4: Illustration of the invalidity condition: the path $e\pi$ is invalid if $i[e] = \epsilon$, $i[\pi] = i[\pi']$, $\pi \in P_r$, and either (i) $r' = r$ and $w[e\pi] < w[\pi']$, or (ii) $i[e'] = \epsilon$ and $w[e\pi] < w[\pi' e']$.]
We now define transition sets $B_{qq'}$ (initially empty)
by the following procedure: for all states $r \in Q$ and all
$\pi = e_1 \cdots e_k \in P_r$, if there exist another path $\pi'$ and a
transition $e \in E_\epsilon$ such that $n[e] = p[\pi]$, $p[\pi'] = p[e]$,
and $i[\pi'] = i[\pi]$, and either (i) $n[\pi'] = n[\pi]$ and $w[e\pi] <
w[\pi']$, or (ii) there exists $e' \in E_\epsilon$ such that $p[e'] = n[\pi']$
and $n[e'] = n[\pi]$ and $w[e\pi] < w[\pi' e']$, then we add $e_1$ to
the set:
\[ B_{p[\pi]\, p[\pi']} \leftarrow B_{p[\pi]\, p[\pi']} \cup \{e_1\} \]
See figure 4 for an illustration of this condition. Using this procedure, we
can determine the set:
\[ \widehat{E}[q] = \{ e \in E[q] : \exists q',\ e \in B_{qq'} \} \]
This set provides the first non-$\epsilon$ transition of each invalid
path. Thus, we can use these transitions to eliminate invalid
paths.
Proposition 2 The cost of the construction of $\widehat{E}[q]$ for all
$q \in Q$ is $O(n^2 |\Sigma| |Q|)$, where $n$ is the n-gram order.
Proof. For each $q \in Q$ and each $\pi \in P_q$, there are at
most $|\Sigma|$ possible states $q'$ such that for some $e \in E_\epsilon$,
$p[e] = q'$ and $n[e] = q$. It is trivial to see from the proof
of lemma 2 that the maximum length of $\pi$ is $n$. Hence,
the cost of finding all $\pi'$ for a given $\pi$ is $O(n |\Sigma|)$. Therefore,
the total cost is $O(n^2 |\Sigma| |Q|)$.
For all non-empty $\widehat{E}[q]$, we create a new state $\hat{q}$ and
for all $e \in \widehat{E}[q]$ we set $p[e] = \hat{q}$. We create a transition
$(\hat{q}, \epsilon, 0, q)$, and for all $e \in E - E_\epsilon$ such that $n[e] = q$,
we set $n[e] = \hat{q}$. For all $e \in E_\epsilon$ such that $n[e] = q$ and
$|B_{q\,p[e]}| = 0$, we set $n[e] = \hat{q}$. For all $e \in E_\epsilon$ such that
$n[e] = q$ and $|B_{q\,p[e]}| > 0$, we create a new intermediate
backoff state $\bar{q}$ and set $n[e] = \bar{q}$; then, for all $e' \in E[\hat{q}]$, if
$e' \notin B_{q\,p[e]}$, we add a transition $\bar{e} = (\bar{q}, i[e'], w[e'], n[e'])$
to $E$.
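The modification step can be sketched as follows (ours, with illustrative names; it assumes the sets $\widehat{E}[q]$ and $B_{q\,q'}$ have already been computed, and exploits the fact that the model is deterministic, so transitions leaving a state can be identified by their labels):

def split_state(q, arcs, eps_sources, e_hat, b):
    """Split q to remove invalid epsilon paths (section 4.4 sketch).

    arcs:        dict state -> dict label -> (weight, dest); the label
                 'eps' marks backoff transitions
    eps_sources: states with an epsilon transition into q
    e_hat:       labels of the conflict-initiating transitions E^[q]
    b:           dict source -> labels invalid when q is entered by
                 backing off from that source (the sets B_{q,src})
    Returns q_hat (the new target of non-epsilon transitions into q)
    and a dict redirecting each epsilon source to its new target.
    """
    q_hat = (q, "hat")
    # q keeps only its non-conflicting transitions; q_hat carries the
    # conflict-initiating ones plus a free epsilon back into q.
    arcs[q_hat] = {w: a for w, a in arcs[q].items() if w in e_hat}
    arcs[q] = {w: a for w, a in arcs[q].items() if w not in e_hat}
    arcs[q_hat]["eps"] = (0.0, q)

    redirect = {}
    for src in eps_sources:
        bad = b.get(src, set())
        if not bad:
            redirect[src] = q_hat        # no conflict: full view of q
        else:
            q_bar = (q, "bar", src)      # intermediate backoff state
            arcs[q_bar] = {w: a for w, a in arcs[q_hat].items()
                           if w not in bad}
            redirect[src] = q_bar
    return q_hat, redirect

As noted in the next section, the intermediate states created here can often be merged to save space.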
Proposition 3 The WFA over the tropical semiring modified
following the procedure just outlined is equivalent to
the exact online representation with failure transitions.
Proof. Assume that there exists a string $x$ for which the
modified WFA returns a weight $\hat{w}(x)$ less than the correct weight
$w(x)$ that would have been assigned to $x$ by the exact
online representation with failure transitions. We will
call an $\epsilon$-transition $e_i$ within a path $\pi = e_1 \cdots e_k$ invalid
if the next non-$\epsilon$ transition $e_j$, $j > i$, has the label
$w$, and there is a transition $e$ with $p[e] = p[e_i]$ and
$i[e] = w$. Let $\pi$ be a path through the WFA such that
$i[\pi] = x$ and $w[\pi] = \hat{w}(x)$, and $\pi$ has the least number
of invalid $\epsilon$-transitions of all paths labeled with $x$ with
weight $\hat{w}(x)$. Let $e_i$ be the last invalid $\epsilon$-transition taken
in path $\pi$. Let $\pi'$ be the valid path leaving $p[e_i]$ such that
$i[\pi'] = i[e_{i+1} \cdots e_k]$. Then $w[\pi'] > w[e_i \cdots e_k]$, otherwise
there would be a path with fewer invalid $\epsilon$-transitions with
weight $\hat{w}(x)$. Let $r$ be the first state where the paths $\pi'$ and
$e_{i+1} \cdots e_k$ intersect. Then $r = n[e_j]$ for some $j > i$. By
definition, $e_{i+1} \cdots e_j \in P_r$, since the intersection occurs
before any $\epsilon$-transitions are traversed in $\pi$. Then it must
be the case that $e_{i+1} \in B_{n[e_i]\, p[\pi']}$, requiring the path to
be removed from the WFA. This is a contradiction.

[Figure 5: The bigram model of figure 3 encoded exactly with $\epsilon$-transitions. There are now two history-less states, connected by an $\epsilon$/0 transition; the start state backs off via $\epsilon$/0.231 to the one that lacks a transition labeled $a$.]
4.5 GRM Utility and Experimental Results
Note that some of the new intermediate backoff states ($\bar{q}$)
can be fully or partially merged, to reduce the space
requirements of the model. Finding the optimal configu-
ration of these states, however, is an NP-hard problem.
For our experiments, we used a simple greedy approach
to sharing structure, which helped reduce space dramati-
cally.
Figure 5 shows our example bigram model, after ap-
plication of the algorithm. Notice that there are now two
history-less states, which correspond to $q$ and $\hat{q}$ in the algorithm
(no $\bar{q}$ was required). The start state backs off to
$q$, which does not include a transition to the state labeled
$a$, thus eliminating the invalid path.
Table 1 gives the sizes of three models in terms of
transitions and states, for both the failure-transition encoding
and the exact offline ($\epsilon$-transition) encoding of the model. The DARPA North
American Business News (NAB) corpus contains 250
million words, with a vocabulary of 463,331 words. The
Switchboard training corpus has 3.1 million words, and a
vocabulary of 45,643. The number of transitions needed
for the exact offline representation in each case was be-
tween 2 and 3 times the number of transitions used in the
representation with failure transitions, and the number of
states was less than twice the original number of states.
This shows that our technique is practical even for very
large tasks.
Efficient implementations of model building algo-
rithms have been incorporated into the GRM library.
                    phi-representation    exact offline
  Corpus   order    arcs      states      arcs      states
  NAB      3-gram   102752    16838       303686    19033
  SWBD     3-gram   2416      475         5499      573
  SWBD     6-gram   15430     6295        54002     12374

Table 1: Size of models (in thousands) built from the NAB and Switchboard corpora, with failure transitions ($\phi$-representation) versus the exact offline representation.

The GRM utility grmmake produces basic backoff
models, using Katz or absolute discounting (Ney et
al., 1994) methods, in the topology shown in figure 3,
with $\epsilon$-transitions in place of failure transitions.
The utility grmshrink removes transitions
from the model according to the shrinking methods of
Seymore and Rosenfeld (1996) or Stolcke (1998). The
utility grmconvert takes a backoff model produced by
grmmake or grmshrink and converts it into an exact
model using either failure transitions or the algorithm just
described. It also converts the model to an interpolated
model for use in the tropical semiring. As an example,
the following command line:
grmmake -n3 counts.fsm > model.fsm
creates a basic Katz backoff trigram model from the
counts produced by the command line example in the ear-
lier section. The command:
grmshrink -c1 model.fsm > m.s1.fsm
shrinks the trigram model using the weighted difference
method (Seymore and Rosenfeld, 1996) with a threshold
of 1. Finally, the command:
grmconvert -tfail m.s1.fsm > f.s1.fsm
outputs the model represented with failure transitions.
5 General class-based language modeling
Standard class-based or phrase-based language models
are based on simple classes often reduced to a short list
of words or expressions. New spoken-dialog applications
require the use of more sophisticated classes either de-
rived from a series of regular expressions or using general
clustering algorithms. Regular expressions can be used to
define classes with an infinite number of elements. Such
classes can naturally arise, e.g., dates form an infinite set
since the year field is unbounded, but they can be eas-
ily represented or approximated by a regular expression.
Also, representing a class by an automaton can be much
more compact than specifying it as a list, especially
when dealing with classes representing phone numbers
or lists of names or addresses.
This section describes a simple and efficient method
for constructing class-based language models where each
class may represent an arbitrary (weighted) regular lan-
guage.
Let $C_1, C_2, \ldots, C_n$ be a set of $n$ classes and assume
that each class $C_i$ corresponds to a stochastic weighted
automaton $A_i$ defined over the log semiring. Thus, the
weight $[\![A_i]\!](w)$ associated by $A_i$ to a string $w$ can be interpreted
as $-\log$ of the conditional probability $P(w \mid C_i)$.
Each class $C_i$ defines a weighted transduction:
\[ A_i \to C_i \]
This can be viewed as a specific obligatory weighted
context-dependent rewrite rule where the left and right
contexts are not restricted (Kaplan and Kay, 1994; Mohri
and Sproat, 1996). Thus, the transduction corresponding
to the class $C_i$ can be viewed as the application of the following
obligatory weighted rewrite rule:
\[ A_i \to C_i \ / \ \epsilon \ \_\!\_ \ \epsilon \]
The direction of application of the rule, left-to-right or
right-to-left, can be chosen depending on the task.² Thus,
these $n$ classes can be viewed as a set of batch rewrite
rules (Kaplan and Kay, 1994) which can be compiled into
weighted transducers. The utilities of the GRM Library
can be used to compile such a batch set of rewrite rules
efficiently (Mohri and Sproat, 1996).
Let $T$ be the weighted transducer obtained by compiling
the rules corresponding to the classes. The corpus can
be represented as a finite automaton $X$. To apply the rules
defining the classes to the input corpus, we just need to
compose the automaton $X$ with $T$ and project the result
onto the output:
\[ \hat{X} = \Pi_2(X \circ T) \]
$\hat{X}$ can be made stochastic using a pushing algorithm
(Mohri, 1997). In general, the transducer $T$ may not
be unambiguous. Thus, the result of the application of
the class rules to the corpus may not be a single text but
an automaton representing a set of alternative sequences.
However, this is not an issue since we can use the general
counting algorithm previously described to construct
a language model based on a weighted automaton. When
$L = \bigcup_{i=1}^{n} L(A_i)$, the language defined by the classes, is
a code, the transducer $T$ is unambiguous.
Denote now by $\hat{M}$ the language model constructed
from the new corpus $\hat{X}$. To construct our final class-based
language model $M$, we simply have to compose $\hat{M}$
with $T^{-1}$ and project the result onto the output side:
\[ M = \Pi_2(\hat{M} \circ T^{-1}) \]
A more general approach would be to use two transducers
$T_1$ and $T_2$, the first one applied to the corpus
and the second one to the language model. In a probabilistic
interpretation, $T_1$ should represent the probability
distribution $P(C_i \mid w)$ and $T_2$ the probability distribution
$P(w \mid C_i)$. By using $T_1 = T$ and $T_2 = T^{-1}$, we are in fact
making the assumption that the classes are equally probable,
and thus that:
\[ P(C_i \mid w) = P(w \mid C_i) \Big/ \sum_{j=1}^{n} P(w \mid C_j) \]
More generally, the weights of $T_1$ and $T_2$ could be the results
of an iterative learning process. Note however that
we are not limited to this probabilistic interpretation and
that our approach can still be used if $T_1$ and $T_2$ do not
represent probability distributions, since we can always
push $\hat{X}$ and normalize $M$.

²The simultaneous case is equivalent to the left-to-right one here.

[Figure 6: Weighted transducer $T$ obtained from the compilation of the context-dependent rewrite rules, with arcs batman:<movie>/0.510 and returns:returns/0, and a path batman:<movie>/0.916 followed by returns:$\epsilon$/0.]

[Figure 7: Corpora $X$ and $\hat{X}$: the automaton accepting the sentence "batman returns", and the result of its composition with $T$, which contains a path <movie>/0.510 followed by returns/0 and an alternative path <movie>/0.916 followed by $\epsilon$/0.]
Example. We illustrate this construction in the simple
case of the following class containing movie titles:
\[ \langle\text{movie}\rangle = \{ (\text{batman}, 0.6),\ (\text{batman returns}, 0.4) \} \]
The compilation of the rewrite rule defined by this class
and applied left-to-right leads to the weighted transducer
$T$ given by figure 6. Our corpus simply consists of the
sentence "batman returns" and is represented by the automaton
$X$ given by figure 7. The corpus $\hat{X}$ obtained by
composing $X$ with $T$ is also given by figure 7.
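A toy version of this corpus rewriting step (ours; a simplistic string replacement standing in for the actual compilation of the rules into a transducer and composition) reproduces the alternatives of figure 7:

import math

def apply_classes(sentence, classes):
    """Rewrite matches of class phrases into class labels, producing the
    alternative sequences of the corpus X^ with their -log weights.

    classes: dict label -> list of (phrase tuple, probability P(w|C)).
    Only one match per alternative is considered here, which suffices
    for the <movie> example.
    """
    alternatives = []
    for label, entries in classes.items():
        for phrase, p in entries:
            k = len(phrase)
            for i in range(len(sentence) - k + 1):
                if tuple(sentence[i:i + k]) == phrase:
                    rewritten = sentence[:i] + [label] + sentence[i + k:]
                    alternatives.append((rewritten, -math.log(p)))
    return alternatives

classes = {"<movie>": [(("batman",), 0.6), (("batman", "returns"), 0.4)]}
for seq, cost in apply_classes(["batman", "returns"], classes):
    print(seq, round(cost, 3))
# ['<movie>', 'returns'] 0.511
# ['<movie>'] 0.916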
6 Conclusion
We presented several new and efficient algorithms to
deal with more general problems related to the construc-
tion of language models found in new language process-
ing applications and reported experimental results show-
ing their practicality for constructing very large models.
These algorithms and many others related to the construc-
tion of weighted grammars have been fully implemented
and incorporated in a general grammar software library,
the GRM Library (Allauzen et al., 2003).
Acknowledgments
We thank Michael Riley for discussions and for having
implemented an earlier version of the counting utility.
References
Cyril Allauzen, Mehryar Mohri, and Brian
Roark. 2003. GRM Library-Grammar Library.
http://www.research.att.com/sw/tools/grm, AT&T Labs
- Research.
Jean Berstel and Christophe Reutenauer. 1988. Rational Series
and Their Languages. Springer-Verlag: Berlin-New York.
Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jen-
nifer C. Lai, and Robert L. Mercer. 1992. Class-based n-
gram models of natural language. Computational Linguis-
tics, 18(4):467–479.
Stanley Chen and Joshua Goodman. 1998. An empirical study
of smoothing techniques for language modeling. Technical
Report, TR-10-98, Harvard University.
Frederick Jelinek and Robert L. Mercer. 1980. Interpolated
estimation of Markov source parameters from sparse data.
In Proceedings of the Workshop on Pattern Recognition in
Practice, pages 381–397.
Ronald M. Kaplan and Martin Kay. 1994. Regular models
of phonological rule systems. Computational Linguistics,
20(3).
Slava M. Katz. 1987. Estimation of probabilities from sparse
data for the language model component of a speech recognizer.
IEEE Transactions on Acoustics, Speech, and Signal
Processing, 35(3):400–401.
Werner Kuich and Arto Salomaa. 1986. Semirings, Automata,
Languages. Number 5 in EATCS Monographs on Theoreti-
cal Computer Science. Springer-Verlag, Berlin, Germany.
Mehryar Mohri and Richard Sproat. 1996. An Efficient Compiler
for Weighted Rewrite Rules. In 34th Meeting of the
Association for Computational Linguistics (ACL '96), Proceedings
of the Conference, Santa Cruz, California. ACL.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley.
1996. Weighted Automata in Text and Speech Processing.
In Proceedings of the 12th biennial European Conference on
Artificial Intelligence (ECAI-96), Workshop on Extended fi-
nite state models of language, Budapest, Hungary. ECAI.
Mehryar Mohri, Michael Riley, Don Hindle, Andrej Ljolje, and
Fernando C. N. Pereira. 1998. Full expansion of context-
dependent networks in large vocabulary speech recognition.
In Proceedings of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP).
Mehryar Mohri. 1997. Finite-State Transducers in Language
and Speech Processing. Computational Linguistics, 23:2.
Mehryar Mohri. 2002. Semiring Frameworks and Algorithms
for Shortest-Distance Problems. Journal of Automata, Lan-
guages and Combinatorics, 7(3):321–350.
Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On
structuring probabilistic dependences in stochastic language
modeling. Computer Speech and Language, 8:1–38.
Arto Salomaa and Matti Soittola. 1978. Automata-Theoretic
Aspects of Formal Power Series. Springer-Verlag: New
York.
Marcel Paul Schützenberger. 1961. On the definition of a family
of automata. Information and Control, 4.
Kristie Seymore and Ronald Rosenfeld. 1996. Scalable backoff
language models. In Proceedings of the International Con-
ference on Spoken Language Processing (ICSLP).
Andreas Stolcke. 1998. Entropy-based pruning of backoff lan-
guage models. In Proc. DARPA Broadcast News Transcrip-
tion and Understanding Workshop, pages 270–274.
