Compiling Boostexter Rules into a Finite-state Transducer
Srinivas Bangalore
AT&T Labs Research
180 Park Avenue
Florham Park, NJ 07932
Abstract
A number of NLP tasks have been effectively modeled as classification tasks using a variety of classification techniques. Most of these tasks have been pursued in isolation, with the classifier assuming unambiguous input. In order for these techniques to be more broadly applicable, they need to be extended to apply to weighted packed representations of ambiguous input. One approach for achieving this is to represent the classification model as a weighted finite-state transducer (WFST). In this paper, we present a compilation procedure to convert the rules resulting from an AdaBoost classifier into a WFST. We validate the compilation technique by applying the resulting WFST to a call-routing application.
1 Introduction
Many problems in Natural Language Processing (NLP) can be modeled as classification tasks either at the word or at the sentence level. For example, part-of-speech tagging, named-entity identification, supertagging (associating each word with a label that represents the syntactic information of the word given the context of the sentence), and word sense disambiguation are tasks that have been modeled as classification problems at the word level. In addition, there are problems that classify the entire sentence or document into one of a set of categories. These problems are loosely characterized as semantic classification and have been used in many practical applications including call routing and text classification.
Most of these problems have been addressed in isolation assuming unambiguous (one-best) input. Typically, however, in NLP applications these modules are chained together, with each module introducing some amount of error. In order to alleviate the errors introduced by a module, it is typical for a module to provide multiple weighted solutions (ideally as a packed representation) that serve as input to the next module. For example, a speech recognizer provides a lattice of possible recognition outputs that is to be annotated with part-of-speech tags and named entities. Thus classification approaches need to be extended to be applicable to weighted packed representations of ambiguous input represented as a weighted lattice. The research direction we adopt here is to compile the model of a classifier into a weighted finite-state transducer (WFST) so that it can compose with the input lattice.
Finite-state models have been extensively applied to many aspects of language processing, including speech recognition (Pereira and Riley, 1997), phonology (Kaplan and Kay, 1994), morphology (Koskenniemi, 1984), chunking (Abney, 1991; Bangalore and Joshi, 1999), parsing (Roche, 1999; Oflazer, 1999) and machine translation (Vilar et al., 1999; Bangalore and Riccardi, 2000). Finite-state models are attractive mechanisms for language processing since they (a) provide an efficient data structure for representing weighted ambiguous hypotheses, (b) are generally effective for decoding, and (c) are associated with a calculus for composing models, which allows for straightforward integration of constraints from various levels of speech and language processing. Furthermore, software implementing the finite-state calculus is available for research purposes.
In this paper, we describe the compilation process for a particular classifier model into a WFST and validate the accuracy of the compilation process on one-best input in a call-routing task. We view this as a first step toward using a classification model on lattice input. The outline of the paper is as follows. In Section 2, we review the classification approach to resolving ambiguity in NLP tasks, and in Section 3 we discuss the boosting approach to classification. In Section 4 we describe the compilation of the boosting model into a WFST and validate the result of this compilation using a call-routing task.
2 Resolving Ambiguity by Classification
In general, we can characterize all these tagging problems as search problems formulated as shown in Equation (1). We denote the input vocabulary by $\Sigma$, the vocabulary of $N$ tags by $\mathcal{T}$, an $n$-word input sequence by $W$ ($\in \Sigma^n$), and a tag sequence by $T$ ($\in \mathcal{T}^n$). We are interested in $T^*$, the most likely tag sequence out of the possible tag sequences ($T$) that can be associated with $W$:

$$T^* = \arg\max_{T} P(T \mid W) \qquad (1)$$
Following the techniques of Hidden Markov Models (HMM) applied to speech recognition, these tagging problems have been previously modeled indirectly through the transformation of the Bayes rule as in Equation (2). The problem is then approximated for sequence classification by a $k$-th order Markov model as shown in Equation (3).

$$T^* = \arg\max_{T} P(W \mid T)\, P(T) \qquad (2)$$

$$\hat{T} \approx \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1} \ldots t_{i-k}) \qquad (3)$$

Although the HMM approach to tagging can easily be represented as a WFST, it has a drawback: the use of large contexts and richer features results in sparseness, leading to unreliable estimation of the parameters of the model.
An alternate approach to arriving at $T^*$ is to model Equation (1) directly. There are many examples in the recent literature (Breiman et al., 1984; Freund and Schapire, 1996; Roth, 1998; Lafferty et al., 2001; McCallum et al., 2000) which take this approach and are well equipped to handle large numbers of features. The general framework for these approaches is to learn a model from pairs of associations of the form $(x_i, y_i)$, where $x_i$ is a feature representation of $W$ and $y_i$ ($\in \mathcal{T}$) is one of the members of the tag set. Although these approaches have been more effective than HMMs, there have not been many attempts to represent these models as a WFST, with the exception of the work on compiling decision trees (Sproat and Riley, 1996). In this paper, we consider the boosting (Freund and Schapire, 1996) approach (which outperforms decision trees) to Equation (1) and present a technique for compiling the classifier model into a WFST.
3 Boostexter
Boostexter is a machine learning tool based on the boosting family of algorithms first proposed in (Freund and Schapire, 1996). The basic idea of boosting is to build a highly accurate classifier by combining many "weak" or "simple" base learners, each of which may only be moderately accurate. A weak learner or a rule $h$ is a triple $(p, \vec{\alpha}, \vec{\beta})$, which tests a predicate ($p$) of the input ($x$) and assigns a weight $\alpha_i$ ($i = 1, \ldots, N$) to each member ($y_i$) of $\mathcal{T}$ if $p$ is true in $x$, and assigns a weight ($\beta_i$) otherwise. It is assumed that a pool of such weak learners $\mathcal{H} = \{h\}$ can be constructed easily.
From the pool of weak learners, the selection of the weak learners to be combined is performed iteratively. At each iteration $t$, a weak learner $h_t$ is selected that minimizes a prediction error loss function on the training corpus which takes into account the weight $D_t$ assigned to each training example. Intuitively, the weights encode how important it is that $h_t$ correctly classifies each training example. Generally, the examples that were most often misclassified by the preceding base classifiers will be given the most weight so as to force the base learner to focus on the "hardest" examples. As described in (Schapire and Singer, 1999), Boostexter uses confidence-rated classifiers $h_t$ that output a real number $h_t(x, y)$ whose sign ($-1$ or $+1$) is interpreted as a prediction, and whose magnitude $|h_t(x, y)|$ is a measure of "confidence". The iterative algorithm for combining weak learners stops after a prespecified number of iterations or when the training set accuracy saturates.
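To make this concrete, here is a minimal Python sketch (not the Boostexter implementation; the class names, predicates and weight values are purely illustrative) of a pool of $(p, \vec{\alpha}, \vec{\beta})$ weak learners and their confidence-rated combination $\sum_t h_t(x, y)$.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class WeakLearner:
    """A rule h = (p, alpha, beta): a predicate over the input plus
    per-class weights used when the predicate fires (alpha) or not (beta)."""
    predicate: Callable[[List[str]], bool]   # p: test on the input features x
    alpha: Dict[str, float]                  # weight per class y when p(x) is true
    beta: Dict[str, float]                   # weight per class y when p(x) is false

    def score(self, x: List[str], y: str) -> float:
        # Confidence-rated output h(x, y): sign is the prediction,
        # magnitude is the confidence.
        return self.alpha[y] if self.predicate(x) else self.beta[y]

def combined_score(rules: List[WeakLearner], x: List[str], y: str) -> float:
    """f(x, y) = sum_t h_t(x, y), the output of the final classifier."""
    return sum(h.score(x, y) for h in rules)

# Toy example with two classes and two hypothetical weak learners.
classes = ["ACCOUNT_BALANCE", "OTHER"]
rules = [
    WeakLearner(lambda x: "balance" in x,
                {"ACCOUNT_BALANCE": 0.8, "OTHER": -0.8},
                {"ACCOUNT_BALANCE": -0.1, "OTHER": 0.1}),
    WeakLearner(lambda x: "operator" in x,
                {"ACCOUNT_BALANCE": -0.6, "OTHER": 0.6},
                {"ACCOUNT_BALANCE": 0.05, "OTHER": -0.05}),
]
x = ["what", "is", "my", "balance"]
print({y: combined_score(rules, x, y) for y in classes})
```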
3.1 Weak Learners
In the case of text classification applications, the set of possible weak learners is instantiated from simple $n$-grams of the input text ($W$). Thus, if $\Phi_n$ is a function that produces all $n$-grams up to length $n$ of its argument, then the set of predicates for the weak learners is $P = \Phi_n(W)$. For word-level classification problems, which take into account the left and right context, we extend the set of weak learners created from the word features with those created from the left and right context features. Thus, features of the left context ($\phi_i^L$), features of the right context ($\phi_i^R$) and features of the word itself ($\phi_i^w$) constitute the features at position $i$. The predicates for the pool of weak learners are created from this set of features and are typically $n$-grams on the feature representations. Thus the set of predicates resulting from the word-level features is $\mathcal{H}_w = \bigcup_i \Phi_n(\phi_i^w)$, from the left context features is $\mathcal{H}_L = \bigcup_i \Phi_n(\phi_i^L)$, and from the right context features is $\mathcal{H}_R = \bigcup_i \Phi_n(\phi_i^R)$. The set of predicates for the weak learners for word-level classification problems is $\mathcal{H} = \mathcal{H}_w \cup \mathcal{H}_L \cup \mathcal{H}_R$.
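As an illustration of how such a predicate pool might be instantiated, the sketch below generates word, left-context and right-context $n$-gram predicates for each position of an input sentence. The helper names, the two-word context window, and the $n$-gram order are assumptions made for the example, not details taken from Boostexter.

```python
from typing import List, Set, Tuple

def ngrams_up_to(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Phi_n: all n-grams of length 1..n of the argument."""
    out = set()
    for k in range(1, n + 1):
        for i in range(len(tokens) - k + 1):
            out.add(tuple(tokens[i:i + k]))
    return out

def predicate_pool(words: List[str], n: int = 2, window: int = 2):
    """Build the pools H_w, H_L, H_R from the word, left-context and
    right-context features at each position i."""
    H_w, H_L, H_R = set(), set(), set()
    for i, _ in enumerate(words):
        phi_w = words[i:i + 1]               # features of the word itself
        phi_L = words[max(0, i - window):i]  # left-context features
        phi_R = words[i + 1:i + 1 + window]  # right-context features
        H_w |= ngrams_up_to(phi_w, n)
        H_L |= ngrams_up_to(phi_L, n)
        H_R |= ngrams_up_to(phi_R, n)
    return H_w | H_L | H_R                   # H = H_w U H_L U H_R

print(sorted(predicate_pool(["call", "customer", "service"])))
```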
3.2 Decoding
The result of training is a set of selected rules $\{h_1, h_2, \ldots, h_M\}$ ($\subseteq \mathcal{H}$). The output of the final classifier is $f(x, y) = \sum_{t=1}^{M} h_t(x, y)$, i.e. the sum of the confidences of all the classifiers $h_t$. The real-valued predictions of the final classifier $f$ can be converted into probabilities by a logistic function transform; that is,

$$P(y \mid x) = \frac{e^{f(x, y)}}{\sum_{y'} e^{f(x, y')}} \qquad (4)$$
Thus the most likely tag sequence $\hat{T}$ is determined as in Equation (5), where $P(t_i \mid \phi_i^L, \phi_i^R, \phi_i^w)$ is computed using Equation (4).

$$\hat{T} = \arg\max_{T} \prod_{i=1}^{n} P(t_i \mid \phi_i^L, \phi_i^R, \phi_i^w) \qquad (5)$$
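The following sketch illustrates this decoding step on unambiguous input, assuming per-position score tables such as a boosted rule set would produce: Equation (4) turns the summed confidences into probabilities, and since the product in Equation (5) factors over positions, the argmax reduces to picking the highest-probability tag independently at each position.

```python
import math
from typing import Dict, List

def class_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Equation (4): P(y|x) = exp(f(x, y)) / sum over y' of exp(f(x, y'))."""
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def decode(positions: List[Dict[str, float]]) -> List[str]:
    """Equation (5): with independent per-position probabilities, the argmax
    over tag sequences factorizes into an argmax at each position."""
    tags = []
    for scores in positions:
        probs = class_probabilities(scores)
        tags.append(max(probs, key=probs.get))
    return tags

# Toy usage: boosted scores f(x_i, y) for two positions and two tags.
print(decode([{"NN": 1.2, "VB": -0.3}, {"NN": -0.5, "VB": 0.9}]))
```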
To date, decoding using the boosted rule sets is
restricted to cases where the test input is unambigu-
ous such as strings or words (not word graphs). By
compiling these rule sets into WFSTs, we intend to
extend their applicability to packed representations
of ambiguous input such as word graphs.
4 Compilation
We note that the weak learners selected at the end of the training process can be partitioned into three types, based on the features that the learners test:

- $h_w$: tests features of the word
- $h_L$: tests features of the left context
- $h_R$: tests features of the right context
We use the representation of context-dependent rewrite rules (Johnson, 1972; Kaplan and Kay, 1994) and their weighted version (Mohri and Sproat, 1996) to represent these weak learners. The (weighted) context-dependent rewrite rules have the general form

$$\phi \rightarrow \psi \;/\; \lambda \;\_\_\; \rho \qquad (6)$$

where $\phi$, $\psi$, $\lambda$ and $\rho$ are regular expressions on the alphabet of the rules. The interpretation of these rules is as follows: rewrite $\phi$ by $\psi$ when it is preceded by $\lambda$ and followed by $\rho$. Furthermore, $\psi$ can be extended to a rational power series, i.e. a weighted regular expression in which the weights encode preferences over the paths in $\psi$ (Mohri and Sproat, 1996).
Each weak learner can then be viewed as a set of weighted rewrite rules mapping the input word into each member $y_i$ ($\in \mathcal{T}$) with a weight $\alpha_i$ when the predicate of the weak learner is true, and with weight $\beta_i$ when the predicate of the weak learner is false. The translation between the three types of weak learners and the weighted context-dependency rules is shown in Table 1. (For ease of exposition, we show the positive and negative sides of a rule each resulting in a context-dependency rule; they can, however, be represented as a single context-dependency rule, which we omit here due to space constraints.)
We note that these rules apply left to right on an input and do not repeatedly apply at the same point in an input, since the output vocabulary $\mathcal{T}$ would typically be disjoint from the input vocabulary $\Sigma$.
We use the technique described in (Mohri and Sproat, 1996) to compile each weighted context-dependency rule into a WFST. The compilation is accomplished by the introduction of context symbols which are used as markers to identify locations for rewrites of $\phi$ with $\psi$. After the rewrites, the markers are deleted. The compilation process is represented as a composition of five transducers.
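As a concrete illustration of this compilation step, the sketch below uses the Pynini toolkit, whose cdrewrite operation implements a generalization of the Mohri and Sproat rewrite-rule compilation. The paper itself does not name a toolkit, so this choice, the toy alphabet, and the encoding of confidences as negative tropical costs are all assumptions made for the example (Pynini 2.1 or later assumed).

```python
import pynini

# Toy alphabet: lowercase words and a space; class labels are written as
# the single symbols "A" and "B" to keep the example small.
sigma = pynini.union(*"abcdefghijklmnopqrstuvwxyz ", "A", "B")
sigma_star = sigma.closure()

# Positive side of an h_w learner "if WORD == 'balance'": rewrite the word
# into class label "A". The confidence 0.8 is encoded here as a negative
# cost in the default tropical semiring; the paper's additive combination
# of weights would instead call for a suitable semiring (e.g. the log
# semiring), which this toy example glosses over.
tau = pynini.cross("balance", pynini.accep("A", weight=-0.8))

# Compile the weighted rewrite rule phi -> psi / lambda __ rho (here with
# empty contexts) into a WFST via the marker-based construction.
rule = pynini.cdrewrite(tau, "", "", sigma_star)

# Apply the rule transducer to a one-best input string.
out = pynini.shortestpath("check balance" @ rule)
out.project("output")
print(out.string())  # -> "check A"
```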
The WFSTs resulting from the compilation of each selected weak learner ($\lambda_i$) are unioned to create the WFST to be used for decoding. The weights of paths with the same input and output labels are added during the union operation:

$$\Lambda = \bigcup_i \lambda_i \qquad (7)$$
We note that, due to the difference in the nature of the learning algorithms, compiling decision trees results in a composition of WFSTs representing the rules on the path from the root to a leaf node (Sproat and Riley, 1996), while compiling boosted rules results in a union of WFSTs, which is expected to result in smaller transducers.
In order to apply the WFST for decoding, we simply compose the model with the input represented as a WFST ($\lambda_I$) and search for the best path (if we are interested in the single best classification result):

$$y^* = \mathit{BestPath}(\lambda_I \circ \Lambda) \qquad (8)$$
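Continuing the same illustrative setup (again with Pynini as an assumed stand-in for the finite-state toolkit, and with the same caveat about semirings), the sketch below unions two compiled weak-learner rules into a single decoding model as in Equation (7) and applies it to a one-best input as in Equation (8); the input acceptor could equally be a word lattice.

```python
import pynini

sigma_star = pynini.union(*"abcdefghijklmnopqrstuvwxyz ", "A", "B").closure()

def compile_word_rule(word: str, label: str, confidence: float):
    """Compile one h_w-style weak learner into a rewrite-rule WFST,
    encoding its confidence as a negative tropical cost."""
    tau = pynini.cross(word, pynini.accep(label, weight=-confidence))
    return pynini.cdrewrite(tau, "", "", sigma_star)

# Equation (7): union the per-learner WFSTs into the decoding model.
model = pynini.union(
    compile_word_rule("balance", "A", 0.8),
    compile_word_rule("operator", "B", 0.6),
).optimize()

# Equation (8): compose the input with the model and take the best path.
best = pynini.shortestpath(pynini.accep("check balance") @ model)
best.project("output")
print(best.string())  # -> "check A"
```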
We have compiled the rules resulting from Boostexter trained on transcriptions of speech utterances from a call-routing task with a vocabulary size ($|\Sigma|$) of 2912 and 40 classes ($N = 40$). There were a total of 1800 rules, comprising 900 positive rules and their negative counterparts. The WFST resulting from compiling these rules has 14372 states and 5.7 million arcs. The accuracy of the WFST on a random set of 7013 sentences was the same (85% accuracy) as the accuracy of the decoder that accompanies the Boostexter program. This validates the compilation procedure.
$h_w$: if WORD == $w$ then $y_i : \alpha_i$ else $y_i : \beta_i$
    $w \rightarrow \alpha_1 y_1 \oplus \alpha_2 y_2 \oplus \ldots \oplus \alpha_N y_N$
    $(\Sigma - w) \rightarrow \beta_1 y_1 \oplus \beta_2 y_2 \oplus \ldots \oplus \beta_N y_N$

$h_L$: if LeftContext == $w$ then $y_i : \alpha_i$ else $y_i : \beta_i$
    $\Sigma \rightarrow \alpha_1 y_1 \oplus \alpha_2 y_2 \oplus \ldots \oplus \alpha_N y_N \;/\; w \;\_\_$
    $\Sigma \rightarrow \beta_1 y_1 \oplus \beta_2 y_2 \oplus \ldots \oplus \beta_N y_N \;/\; (\Sigma - w) \;\_\_$

$h_R$: if RightContext == $w$ then $y_i : \alpha_i$ else $y_i : \beta_i$
    $\Sigma \rightarrow \alpha_1 y_1 \oplus \alpha_2 y_2 \oplus \ldots \oplus \alpha_N y_N \;/\; \_\_\; w$
    $\Sigma \rightarrow \beta_1 y_1 \oplus \beta_2 y_2 \oplus \ldots \oplus \beta_N y_N \;/\; \_\_\; (\Sigma - w)$

Table 1: Translation of the three types of weak learners into weighted context-dependency rules (each entry gives the weak learner on the first line and the corresponding positive and negative weighted context-dependency rules below it).

5 Conclusions
Classification techniques have been used to effectively resolve ambiguity in many natural language
processing tasks. However, most of these tasks have been solved in isolation and hence assume unambiguous input. In this paper, we extend the utility of classification-based techniques so as to be applicable to packed representations such as word graphs. We do this by compiling the rules resulting from an AdaBoost classifier into a finite-state transducer. The resulting finite-state transducer can then be used as one part of a finite-state decoding chain.
References
S. Abney. 1991. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers.
S. Bangalore and A. K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2).
S. Bangalore and G. Riccardi. 2000. Stochastic finite-state models for spoken language machine translation. In Proceedings of the Workshop on Embedded Machine Translation Systems.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA.
Y. Freund and R. E. Schapire. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156.
C. D. Johnson. 1972. Formal Aspects of Phonological Description. Mouton, The Hague.
R. M. Kaplan and M. Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.
K. K. Koskenniemi. 1984. Two-level morphology: A general computational model for word-form recognition and production. Ph.D. thesis, University of Helsinki.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, San Francisco, CA.
A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML, Stanford, CA.
M. Mohri and R. Sproat. 1996. An efficient compiler for weighted rewrite rules. In Proceedings of ACL, pages 231–238.
K. Oflazer. 1999. Dependency parsing with an extended finite state approach. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Maryland, USA, June.
F. C. N. Pereira and M. D. Riley. 1997. Speech recognition by composition of weighted finite automata. In E. Roche and Y. Schabes, editors, Finite State Devices for Natural Language Processing, pages 431–456. MIT Press, Cambridge, Massachusetts.
E. Roche. 1999. Finite state transducers: parsing free and frozen sentences. In András Kornai, editor, Extended Finite State Models of Language. Cambridge University Press.
D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proceedings of AAAI.
R. E. Schapire and Y. Singer. 1999. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, December.
R. Sproat and M. Riley. 1996. Compilation of weighted finite-state transducers from decision trees. In Proceedings of ACL, pages 215–222.
J. Vilar, V. M. Jiménez, J. Amengual, A. Castellanos, D. Llorens, and E. Vidal. 1999. Text and speech translation by means of subsequential transducers. In András Kornai, editor, Extended Finite State Models of Language. Cambridge University Press.
