A Pylonic Decision-Tree Language Model with Optimal Question Selection
Adrian Corduneanu 
University of Toronto 
73 Saint George St #299 
Toronto, Ontario, M5S 2E5, Canada 
g7adrian@cdf.toronto.edu 
Abstract 
This paper discusses a decision-tree approach to the problem of assigning probabilities to words following a given text. In contrast with previous decision-tree language model attempts, an algorithm for selecting nearly optimal questions is considered. The model is to be tested on a standard task, the Wall Street Journal, allowing a fair comparison with the well-known trigram model.
1 Introduction 
In many applications such as automatic speech recognition, machine translation, spelling correction, etc., a statistical language model (LM) is needed to assign probabilities to sentences. This probability assignment may be used, e.g., to choose one of many transcriptions hypothesized by the recognizer or to make decisions about capitalization. Without any loss of generality, we consider models that operate left-to-right on the sentences, assigning a probability to the next word given its word history. Specifically, we consider statistical LM's which compute probabilities of the type $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$, where $w_i$ denotes the $i$-th word in the text.
Even for a small vocabulary, the space of word histories is so large that any attempt to estimate the conditional probabilities for each distinct history from raw frequencies is infeasible. To make the problem manageable, one partitions the word histories into some classes $C(w_1, w_2, \ldots, w_{n-1})$, and identifies the word probabilities with $P(w_n \mid C(w_1, w_2, \ldots, w_{n-1}))$. Such probabilities are easier to estimate, as each class gets significantly more counts from a training corpus. With this setup, building a language model becomes a classification problem: group the word histories into a small number of classes
while preserving their predictive power. 
Currently, popular N-gram models classify the word histories by their last $N-1$ words. $N$ varies from 2 to 4, and the trigram model $P(w_n \mid w_{n-2}, w_{n-1})$ is commonly used. Although these simple models perform surprisingly well, there is much room for improvement.
The approach used in this paper is to classify the histories by means of a decision tree: to cluster word histories $w_1, w_2, \ldots, w_{n-1}$ for which the distributions of the following word $w_n$ in a training corpus are similar. The decision tree is pylonic in the sense that histories at different nodes in the tree may be recombined in a new node to increase the complexity of questions and avoid data fragmentation.
The method has been tried before (Bahl et al., 1989) with promising results. In the work presented here we have made two major changes to the previous attempts: we use an optimal tree-growing algorithm (Chou, 1991) not known at the time of publication of (Bahl et al., 1989), and we replace the ad-hoc clustering of vocabulary items used by Bahl et al. with a data-driven clustering scheme proposed in (Lucassen and Mercer, 1984).
2 Description of the Model 
2.1 The Decision-Tree Classifier 
The purpose of the decision-tree classifier is to cluster the word history $w_1, w_2, \ldots, w_{n-1}$ into a manageable number of classes $C_i$, and to estimate for each class the next-word conditional distribution $P(w_n \mid C_i)$. The classifier, together with the collection of conditional probabilities, is the resultant LM.
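As a minimal sketch of this setup (the paper does not prescribe an implementation; all attribute names here are hypothetical), the resulting LM answers a probability query by routing the history through the tree to its class and reading the stored distribution:

```python
def next_word_probability(tree, history, word):
    """Sketch of the LM defined by the classifier: route the history
    through the decision tree to its class C_i, then return the
    estimated conditional probability P(w_n | C_i) stored at the leaf."""
    node = tree.root
    while not node.is_leaf:
        # Each internal node asks a question about the word history;
        # the answer selects which child to descend into.
        node = node.children[node.question(history)]
    return node.distribution.get(word, 0.0)
```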
The general methodology of decision-tree construction is well known (e.g., see (Jelinek, 1998)). The following issues need to be addressed for our specific application:

• a tree growing criterion, often called the measure of purity;
• a set of permitted questions (partitions) to be considered at each node;
• a stopping rule, which decides the number of distinct classes.

These are discussed below. Once the tree has been grown, we address one other issue: the estimation of the language model at each leaf of the resulting tree classifier.
2.1.1 The Tree Growing Criterion 
We view the training corpus as a set of ordered pairs of the following word $w_n$ and its word history $(w_1, w_2, \ldots, w_{n-1})$. We seek a classification of the space of all histories (not just those seen in the corpus) such that a good conditional probability $P(w_n \mid C(w_1, w_2, \ldots, w_{n-1}))$ can be estimated for each class of histories. Since several vocabulary items may potentially follow any history, perfect "classification" or prediction of the word that follows a history is out of the question, and the classifier must partition the space of all word histories maximizing the probability $P(w_n \mid C(w_1, w_2, \ldots, w_{n-1}))$ assigned to the pairs in the corpus.
We seek a history classification such that $C(w_1, w_2, \ldots, w_{n-1})$ is as informative as possible about the distribution of the next word. Thus, from an information-theoretical point of view, a natural cost function for choosing questions is the empirical conditional entropy of the training data with respect to the tree:

$$H = -\sum_i \sum_w f(w, C_i) \log f(w \mid C_i),$$

where the $f$'s are relative frequencies computed from the training corpus. Each question in the tree is chosen so as to minimize the conditional entropy, or, equivalently, to maximize the mutual information between the class of a history and the predicted word.
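As an illustration (not from the paper; the helper names are hypothetical), the empirical conditional entropy above can be computed directly from (history, next-word) training pairs and a candidate classification:

```python
import math
from collections import Counter

def conditional_entropy(pairs, classify):
    """Empirical conditional entropy H = -sum_i sum_w f(w, C_i) log f(w | C_i).

    `pairs` is an iterable of (history, next_word) training events and
    `classify` maps a history to its class (e.g., a tree leaf).
    """
    joint = Counter()      # counts of (class, next word) pairs
    marginal = Counter()   # counts of each class
    for history, word in pairs:
        c = classify(history)
        joint[(c, word)] += 1
        marginal[c] += 1
    n = sum(marginal.values())
    h = 0.0
    for (c, word), count in joint.items():
        f_joint = count / n           # f(w, C_i)
        f_cond = count / marginal[c]  # f(w | C_i)
        h -= f_joint * math.log2(f_cond)
    return h
```

A candidate question is then scored by the reduction in $H$ it achieves relative to the unsplit node.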
2.1.2 The Set of Questions and Decision Pylons
Although a tree with general questions can represent any classification of the histories, some restrictions must be made in order to make the selection of an optimal question computationally feasible. We consider elementary questions of the type $w_{-k} \in S$, where $w_{-k}$ refers to the $k$-th position before the word to be predicted, and $S$ is a subset of the vocabulary. However, this kind of elementary question is rather simplistic, as one node in the tree cannot refer to two different history positions. A conjunction of elementary questions can still be implemented over a few nodes, but similar histories become unnecessarily fragmented. Therefore a node in the tree is implemented not as a single elementary question, but as a modified decision tree in itself, called a pylon (Bahl et al., 1989). The topology of the pylon, as in Figure 1, allows us to combine answers from elementary questions without increasing the number of classes. A pylon may be of any size, and it is grown as a standard decision tree.

[Figure 1: The structure of a pylon]
2.1.3 Question Selection Within the Pylon
For each leaf node and position $k$, the problem is to find the subset $S$ of the vocabulary that minimizes the entropy of the split $w_{-k} \in S$. The best question over all $k$'s will eventually be selected. We will use a greedy optimization algorithm developed by Chou (1991). Given a partition $P = \{\beta_1, \beta_2, \ldots, \beta_k\}$ of the vocabulary, the method finds a subset $S$ of $P$ for which the reduction of entropy after the split is nearly optimal.
The algorithm is initialized with a random partition $S \cup \bar{S}$ of $P$. At each iteration every atom $\beta$ is examined and redistributed into a new partition $S' \cup \bar{S}'$, according to the following rule: place $\beta$ into $S'$ when

$$\sum_w f(w \mid w_{-k} \in \beta) \log f(w \mid w_{-k} \in S) \;\ge\; \sum_w f(w \mid w_{-k} \in \beta) \log f(w \mid w_{-k} \in \bar{S}),$$

and into $\bar{S}'$ otherwise, where the $f$'s are word frequencies computed relative to the given leaf. In other words, each atom is assigned to the half of the partition whose next-word distribution is closest to its own in Kullback-Leibler divergence. This selection criterion ensures a decreasing empirical entropy of the tree. The iteration stops when $S = S'$ and $\bar{S} = \bar{S}'$.
If questions on the same level in the pylon are constructed independently with the Chou algorithm, the overall entropy may increase. That is why nodes whose children are merged must be jointly optimized. In order to reduce complexity, questions on the same level in the pylon are asked with respect to the same position in the history.
The Chou algorithm is not accurate when the training data is sparse. For instance, when no history at the leaf has $w_{-k} \in \beta$, the atom is invariably placed in $S'$. Because such a choice of a question is not based on evidence, it is not expected to generalize to unseen data. As the tree grows, data is fragmented among the leaves, and this issue becomes unavoidable. To deal with this problem, we choose the atomic partition $P$ so that each atom gets a history count above a threshold.
The choice of such an atomic partition is a complex problem, as words composing an atom must have similar predictive power. Our approach is to consider a hierarchical classification of the words, and prune it to a level at which each atom gets sufficient history counts. The word hierarchy is generated from training data with an information-theoretical algorithm (Lucassen and Mercer, 1984) detailed in section 2.2.
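A sketch of the pruning step, assuming the hierarchy is a binary tree whose nodes carry a training history `count` and `left`/`right` children (a hypothetical structure, not the paper's code):

```python
def prune_to_atoms(node, threshold):
    """Cut the word hierarchy at the highest nodes below which some
    subtree would fall under `threshold` history counts; the returned
    subtrees form the atomic partition P used for question selection."""
    if node.left is None:                      # single-word leaf: its own atom
        return [node]
    if node.left.count < threshold or node.right.count < threshold:
        return [node]                          # descending would starve an atom
    return prune_to_atoms(node.left, threshold) + prune_to_atoms(node.right, threshold)
```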
2.1.4 The Stopping Rule 
A common problem of all decision trees is the lack of a clear rule for when to stop growing new nodes. The split of a node always brings a reduction in the estimated entropy, but that might not hold for the true entropy. We use a simplified version of cross-validation (Breiman et al., 1984) to test for the significance of the reduction in entropy. If the entropy on a held-out data set is not reduced, or the reduction on the held-out text is less than 10% of the entropy reduction on the training text, the leaf is not split, because the reduction in entropy has failed to generalize to the unseen data.
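A sketch of this test (the helpers `entropy` and `split_entropy` are assumptions, standing for the leaf's empirical entropy before and after applying the candidate question, e.g., via the routine of section 2.1.1):

```python
def should_split(leaf, question, train_data, heldout_data, ratio=0.10):
    """Cross-validated stopping rule: accept a split only if the entropy
    reduction it yields on held-out data is positive and at least
    `ratio` (here 10%) of the reduction measured on training data."""
    train_gain = entropy(leaf, train_data) - split_entropy(leaf, question, train_data)
    heldout_gain = entropy(leaf, heldout_data) - split_entropy(leaf, question, heldout_data)
    return heldout_gain > 0 and heldout_gain >= ratio * train_gain
```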
2.1.5 Estimating the Language Model at Each Leaf
Once an equivalence classification of all histories is constructed, additional training data is used to estimate the conditional probabilities required for each node, as described in (Bahl et al., 1989). Smoothing as well as interpolation with a standard trigram model eliminates the zero probabilities.
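The paper does not specify the exact smoothing scheme; a standard choice consistent with the description would be linear interpolation of the leaf distribution with the trigram model,

$$P(w_n \mid C_i, w_{n-2}, w_{n-1}) = \lambda\, f(w_n \mid C_i) + (1 - \lambda)\, P_{\mathrm{tri}}(w_n \mid w_{n-2}, w_{n-1}),$$

with $0 < \lambda < 1$ estimated on held-out data; since the trigram component assigns every word a nonzero probability, the interpolated model has no zero probabilities.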
2.2 The Hierarchical Classification of Words
The goal is to build a binary tree with the words of the vocabulary as leaves, such that similar words correspond to closely related leaves. A partition of the vocabulary can be derived from such a hierarchy by taking a cut through the tree to obtain a set of subtrees. The reason for keeping a hierarchy instead of a fixed partition of the vocabulary is to be able to dynamically adjust the partition to accommodate training-data fragmentation.
The hierarchical classification of words was built with an entirely data-driven method. The motivation is that even though an expert could exhibit some strong classes by looking at parts of speech and synonyms, it is hard to produce a full hierarchy of a large vocabulary. Perhaps a combination of the expert and data-driven approaches would give the best result. Nevertheless, the algorithm that has been used in deriving the hierarchy can be initialized with classes based on parts of speech or meaning, thus taking account of prior expert information.
The approach is to construct the tree backwards. Starting with single-word classes, each iteration consists of merging the two classes most similar in predicting the word that follows them. The process continues until the entire vocabulary is in one class. The binary tree is then obtained from the sequence of merge operations.
To quantify the predictive power of a partition $P = \{\beta_1, \beta_2, \ldots, \beta_k\}$ of the vocabulary we look at the conditional entropy of the vocabulary with respect to the class of the previous word:

$$H(w \mid P) = \sum_{\beta \in P} p(\beta)\, H(w \mid w_{-1} \in \beta) = -\sum_{\beta \in P} p(\beta) \sum_{w \in V} p(w \mid \beta) \log p(w \mid \beta).$$
At each iteration we merge the two classes that minimize $H(w \mid P') - H(w \mid P)$, where $P'$ is the partition after the merge. In information-theoretical terms, we seek the merge that brings the least reduction in the information provided by $P$ about the distribution of the current word.
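As a sketch (hypothetical names, not the paper's code), the cost of a candidate merge depends only on the two classes' priors and next-word distributions, since merging simply replaces two conditional distributions by their mixture:

```python
import math

def entropy(dist):
    """Entropy of a next-word distribution given as dict word -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def merge_cost(p_i, dist_i, p_j, dist_j):
    """Increase H(w | P') - H(w | P) caused by merging classes i and j,
    where p_* are class priors and dist_* the distributions p(w | beta)."""
    p = p_i + p_j
    mix = {}
    for w, q in dist_i.items():
        mix[w] = mix.get(w, 0.0) + (p_i / p) * q
    for w, q in dist_j.items():
        mix[w] = mix.get(w, 0.0) + (p_j / p) * q
    return p * entropy(mix) - p_i * entropy(dist_i) - p_j * entropy(dist_j)
```

A greedy pass evaluates this cost over all pairs, merges the cheapest pair, and records the merge; the recorded sequence of merges yields the binary tree.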
[Figure 2: Sample classes from a 1000-element partition of a 5000-word vocabulary (each line below is a different class):
IRAN'S, UNION'S, IRAQ'S, INVESTORS', BANKS', PEOPLE'S
FARMER, TEACHER, WORKER, DRIVER, WRITER, SPECIALIST, EXPERT, TRADER
PLUMMETED, PLUNGED, SOARED, TUMBLED, SURGED, RALLIED
FALLING, FALLS, RISEN, FALLEN
MYSELF, HIMSELF, OURSELVES, THEMSELVES
CONSIDERABLY, SIGNIFICANTLY, SUBSTANTIALLY, SOMEWHAT, SLIGHTLY]
The algorithm produced satisfactory results on a 5000-word vocabulary. One can see from the sample classes in Figure 2 that the automatic building of the hierarchy accounts for similarity both in meaning and in part of speech.
3 Evaluation of the Model
The decision tree is being trained and tested on the Wall Street Journal corpus from 1987 to 1989, containing 45 million words. The data is divided into 15 million words for growing the nodes, 15 million for cross-validation, 10 million for estimating probabilities, and 5 million for testing. To compare the results with other similar attempts (Bahl et al., 1989), the vocabulary consists of only the 5000 most frequent words and a special "unknown" word that replaces all the others. The model tries to predict the word following a 20-word history.
At the time this paper was written, the implementation of the presented algorithms was nearly complete, and preliminary results on the performance of the decision tree were expected soon. The evaluation criterion to be used is the perplexity of the test data with respect to the tree. A comparison with the perplexity of a standard back-off trigram model will indicate which model performs better. Although decision-tree letter language models are inferior to their N-gram counterparts (Potamianos and Jelinek, 1998), the situation should be reversed for word language models. In the case of words the vocabulary is significantly larger, making the estimation of N-gram models for $N > 3$ infeasible. However, we expect that due to the good smoothing of the trigram probabilities a combination of the decision-tree and N-gram models will give the best results.
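For reference, the perplexity of a test text $w_1, \ldots, w_N$ with respect to the tree is the standard quantity

$$PP = \exp\left(-\frac{1}{N} \sum_{n=1}^{N} \ln P(w_n \mid C(w_1, \ldots, w_{n-1}))\right),$$

so lower perplexity indicates a better model.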
4 Summary
In this paper we have developed a decision-tree method for building a language model that predicts words given their previous history. We have described a powerful question-search algorithm that guarantees the local optimality of the selection, and which has not been applied before to word language models. We expect that the model will perform significantly better than the standard N-gram approach.
5 Acknowledgments
I would like to thank Prof. Frederick Jelinek and Sanjeev Khudanpur from the Center for Language and Speech Processing, Johns Hopkins University, for their help related to this work and for providing the computer resources. I also wish to thank Prof. Graeme Hirst from the University of Toronto for his useful advice in all the stages of this project.
References

L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. 1989. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:1001-1008.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. 1984. Classification and regression trees. Wadsworth and Brooks, Pacific Grove.

P. A. Chou. 1991. Optimal partitioning for classification and regression trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:340-354.

F. Jelinek. 1998. Statistical methods for speech recognition. The MIT Press, Cambridge.

J. M. Lucassen and R. L. Mercer. 1984. An information theoretic approach to the automatic determination of phonemic baseforms. In Proceedings of the 1984 International Conference on Acoustics, Speech, and Signal Processing, volume III, pages 42.5.1-42.5.4.

G. Potamianos and F. Jelinek. 1998. A study of n-gram and decision tree letter language modeling methods. Speech Communication, 24:171-192.
