Decision Tree Models Applied to 
the Labeling of Text with Parts-of-Speech 
Ezra Black Fred Jelinek John Lafferty 
Robert Mercer Salim Roukos 
IBM Thomas J. Watson Research Center 
ABSTRACT 
We describe work which uses decision trees to estimate 
marginal probabilities in a maximum entropy model for pre- 
dicting the part-of-speech of a word given the context in 
which it appears. Two experiments are presented which ex- 
hibit improvements over the usual hidden Markov model ap- 
proach. 
1. Introduction 
In this paper we describe work which uses decision trees 
to estimate probabilities of words appearing with various 
parts-of-speech, given the context in which the words ap- 
pear. In principle, this approach affords the optimal so- 
lution to the problem of predicting the correct sequence 
of parts-of-speech. In practice, the method is limited by 
the lack of large, hand-labeled training corpora, as well 
as by the difficulty inherent in constructing a set of ques- 
tions to be used in the decision procedure. Nevertheless, 
decision trees provide a powerful mechanism for tackling 
the problem of modeling long-distance dependencies. 
The following sentence is typical of the difficulties facing 
a tagging program: 
The new energy policy announced in December by the Prime Minister will guarantee sufficient oil supplies at one price only.
The usual hidden Markov model, trained as described in the last section of this paper, incorrectly labeled the verb announced as having the active rather than the passive aspect. If, however, a decision procedure is used to resolve the ambiguity, the context may be queried to determine the nature of the verb as well as its agent. We can easily imagine, for example, that if the battery of available questions is rich enough to include such queries as "Is the previous noun inanimate?" and "Does the preposition by appear within three words of the word being tagged?" then such ambiguities may be probabilistically resolved. Thus it is evident that the success of the decision approach will rely on the questions as well as the manner in which they are asked.
In the experiments described in this paper, we constructed a complete set of binary questions to be asked of words, using a mutual information clustering procedure [2]. We then extracted a set of events from a 2-million word corpus of hand-labeled text. Using an algorithm similar to that described in [1], the set of contexts was divided into equivalence classes using a decision procedure which queried the binary questions, splitting the data based upon the principle of maximum mutual information between tags and questions. The resulting tree was then smoothed using the forward-backward algorithm [6] on a set of held-out events, and tested on a set of previously unseen sentences from the hand-labeled corpus.
The results showed a modest improvement over the usual hidden Markov model approach. We present explanations and examples of the results obtained and suggest ideas for obtaining further improvements.
2. Decision Trees 
The problem at hand is to predict a tag for a given word in a sentence, taking into consideration the tags assigned to previous words, as well as the remaining words in the sentence. Thus, if we wish to predict tag $t_n$ for word $w_n$ in a sentence $S = w_1, w_2, \ldots, w_N$, then we must form an estimate of the probability
$$P(t_n \mid t_1, t_2, \ldots, t_{n-1} \;\text{and}\; w_1, w_2, \ldots, w_N).$$
We will refer to a sequence $(t_1, \ldots, t_{n-1};\ w_1, \ldots, w_N)$ as a history. A generic history is denoted as $H$, or as $H = (H_T, H_W)$ when we wish to separate it into its tag and word components. The set of histories is denoted by $\mathcal{H}$, and a pair $(t, H)$ is called an event.
A tag is chosen from a fixed tag vocabulary $V_T$, and words are chosen from a word vocabulary $V_W$. Given a training corpus $\mathcal{E}$ of events, the decision tree method proceeds by placing the observed histories into equivalence classes by asking binary questions about them. Thus, a tree is grown with each node labeled by a question $q : \mathcal{H} \to \{\text{True}, \text{False}\}$. The entropy of tags at a leaf $L$ of the tree $\mathcal{T}$ is given by
$$H(T \mid L) = -\sum_{t \in V_T} P(t \mid L) \log P(t \mid L)$$
and the average entropy of tags in the tree is given by
$$\bar{H}(T) = \sum_{L \in \mathcal{T}} P(L)\, H(T \mid L).$$
The method of growing trees that we have employed adopts a greedy algorithm, described in [1], to minimize the average entropy of tags.
Specifically, the tree is grown in the following manner. Each node $n$ is associated with a subset $\mathcal{E}_n \subset \mathcal{E}$ of training events. For a given node $n$, we compute for each question $q$ the conditional entropy of tags at $n$, given by
$$\bar{H}(T \mid n, q) = P(q(H) = \text{True} \mid n)\, H(T \mid n, q(H) = \text{True}) \;+\; P(q(H) = \text{False} \mid n)\, H(T \mid n, q(H) = \text{False}).$$
The node $n$ is then assigned the question $q$ with the lowest conditional entropy. The reduction in entropy at node $n$ resulting from asking question $q$ is
$$H(T \mid n) - \bar{H}(T \mid n, q).$$
If this reduction is significant, as determined by evaluating the question on held-out data, then two descendant nodes of $n$ are created, corresponding to the equivalence classes of events
$$\{E = (t, H) \mid E \in \mathcal{E}_n,\ q(H) = \text{True}\}$$
and
$$\{E = (t, H) \mid E \in \mathcal{E}_n,\ q(H) = \text{False}\}.$$
The algorithm continues to split nodes by choosing the 
questions which maximize the reduction in entropy, until 
either no further splits are possible, or until a maximum 
number of leaves is obtained. 
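The greedy procedure can be summarized in a short sketch. The Python fragment below is a minimal illustration, not the authors' implementation: questions are modeled as predicates on histories, probabilities are raw relative frequencies, and a fixed gain threshold (min_gain, our stand-in) replaces the held-out significance test and leaf cap described above.

```python
import math
from collections import Counter

def entropy(tags):
    """Empirical entropy (in bits) of the tag distribution in a list of tags."""
    counts = Counter(tags)
    total = len(tags)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(events, question):
    """Average entropy of tags after splitting events on a binary question."""
    true_tags = [t for (t, h) in events if question(h)]
    false_tags = [t for (t, h) in events if not question(h)]
    n = len(events)
    h = 0.0
    if true_tags:
        h += (len(true_tags) / n) * entropy(true_tags)
    if false_tags:
        h += (len(false_tags) / n) * entropy(false_tags)
    return h

def grow_node(events, questions, min_gain=0.01, max_depth=10, depth=0):
    """Greedily grow a decision tree: each node takes the question that
    most reduces tag entropy; stop when the reduction is insignificant."""
    tags = [t for (t, _) in events]
    node = {"dist": Counter(tags), "question": None}
    if depth >= max_depth or len(set(tags)) <= 1:
        return node
    # Assign the question with the lowest conditional entropy.
    best_q = min(questions, key=lambda q: conditional_entropy(events, q))
    gain = entropy(tags) - conditional_entropy(events, best_q)
    if gain < min_gain:   # stand-in for the held-out significance test
        return node
    node["question"] = best_q
    node["true"] = grow_node([e for e in events if best_q(e[1])],
                             questions, min_gain, max_depth, depth + 1)
    node["false"] = grow_node([e for e in events if not best_q(e[1])],
                              questions, min_gain, max_depth, depth + 1)
    return node
```

A full implementation would, as in the text, evaluate each candidate split on held-out events and enforce the maximum number of leaves.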
3. Maximum Entropy Models 
The above algorithm for growing trees has as its objective function the entropy of the joint distribution of tags and histories. More generally, if we suppose that tags and histories arise according to some distribution $\tilde{p}(t, H_T, H_W)$ in textual data, the coding theory point-of-view encourages us to try to construct a model for generating tags and histories according to a distribution $p(t, H_T, H_W)$ which minimizes the Kullback information
$$D(p \,\|\, \tilde{p}) = \sum_{t, H_T, H_W} p(t, H_T, H_W) \log \frac{p(t, H_T, H_W)}{\tilde{p}(t, H_T, H_W)}.$$
Typically, one may be able to obtain estimates for certain marginals of $p$. In the case of tagging, we have estimates of the marginals $q(t, H_T) = \sum_{H_W} p(t, H_T, H_W)$ from the EM algorithm applied to labeled or partially labeled text. The marginals $r(t, H_W) = \sum_{H_T} p(t, H_T, H_W)$ might be estimated using decision trees applied to labelled text. To minimize $D(p \,\|\, \tilde{p})$ subject to knowing these marginals, introducing Lagrange multipliers $\alpha$ and $\beta$ leads us to minimize the function
$$\sum_{t, H_T, H_W} p(t, H_T, H_W) \log \frac{p(t, H_T, H_W)}{\tilde{p}(t, H_T, H_W)} \;+\; \sum_{t, H_T} \alpha(t, H_T) \Bigl(\sum_{H_W} p(t, H_T, H_W) - q(t, H_T)\Bigr) \;+\; \sum_{t, H_W} \beta(t, H_W) \Bigl(\sum_{H_T} p(t, H_T, H_W) - r(t, H_W)\Bigr).$$
Differentiating with respect to $p$ and solving this equation, we find that the maximum entropy solution $p$ takes the form
$$p(t, H_T, H_W) = \gamma\, f(t, H_T)\, g(t, H_W)\, \tilde{p}(t, H_T, H_W)$$
for some normalizing constant $\gamma$. In particular, in the case where we know no better than to take $\tilde{p}$ equal to the uniform distribution, we obtain the solution
$$p(t, H_T, H_W) = \frac{q(t, H_T)\, r(t, H_W)}{q(t)}$$
where the marginal $q(t)$ is assumed to satisfy
$$q(t) = \sum_{H_T} q(t, H_T) = \sum_{H_W} r(t, H_W).$$
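As a check that this closed form satisfies the marginal constraints (a step left implicit above), summing over the word component gives
$$\sum_{H_W} p(t, H_T, H_W) = \frac{q(t, H_T)}{q(t)} \sum_{H_W} r(t, H_W) = \frac{q(t, H_T)}{q(t)}\, q(t) = q(t, H_T),$$
and the symmetric computation over $H_T$ recovers $r(t, H_W)$.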
Note that the usual HMM tagging model is given by
$$p(t_n, H_T, H_W) = \frac{P(t_n, t_{n-2}, t_{n-1})\, P(w_n, t_n)}{P(t_n)}$$
which has the form of a maximum entropy model, even though the marginals $P(w_n, t_n)$ and $P(t_n, t_{n-2}, t_{n-1})$ are modelled as bigram and trigram statistics, estimated according to the maximum likelihood criterion using the EM algorithm.
In principle, growing a decision tree to estimate the full density $p(t_n, H_T, H_W)$ will provide a model with smaller Kullback information. In practice, however, the quantity of training data is severely limited, and the statistics at the leaves will be unreliable. In the model described above we assume that we are able to construct more reliable estimates of the marginals separating the word and tag components of the history, and we then combine these marginals according to the maximum entropy criterion. In the experiments that we performed, such models performed slightly better than those for which the full distribution $p(t_n, H_T, H_W)$ was modeled with a tree.
4. Constructing Questions 
The method of mutual information clustering, described in [2], can be used to obtain a set of binary features to assign to words, which may in turn be employed as binary questions in growing decision trees. Mutual information clustering proceeds by beginning with a vocabulary $V$, and initially assigning each word to a distinct class. At each step, the average mutual information between adjacent classes in training text is computed using a bigram model, and two classes are chosen to be merged based upon the criterion of minimizing the loss in average mutual information that the merge effects. If this process is continued until only one class remains, the result is a binary tree, the leaves of which are labeled by the words in the original vocabulary. By labeling each branch by 0 or 1, we obtain a bit string assigned to each word.
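The merge loop is simple to state; the sketch below is a naive rendering for tiny vocabularies, recomputing the average mutual information from scratch for every candidate merge (the algorithm of [2] uses incremental bookkeeping to make this tractable at scale). All function names here are ours.

```python
import math
from collections import Counter
from itertools import combinations

def avg_mutual_information(corpus, cls):
    """Average mutual information between adjacent classes,
    computed from class bigram counts over a list of words."""
    n = len(corpus) - 1
    pair = Counter((cls[w1], cls[w2]) for w1, w2 in zip(corpus, corpus[1:]))
    left = Counter()
    right = Counter()
    for (c1, c2), c in pair.items():
        left[c1] += c
        right[c2] += c
    return sum((c / n) * math.log2((c * n) / (left[c1] * right[c2]))
               for (c1, c2), c in pair.items())

def mi_cluster(corpus):
    """Greedily merge the pair of classes whose merge loses the least
    average mutual information; record merges to define a binary tree."""
    vocab = sorted(set(corpus))
    cls = {w: w for w in vocab}        # each word starts in its own class
    classes = set(vocab)
    merges = []
    while len(classes) > 1:
        best = None
        for a, b in combinations(sorted(classes), 2):
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            mi = avg_mutual_information(corpus, trial)
            if best is None or mi > best[0]:   # max MI after merge = min loss
                best = (mi, a, b)
        _, a, b = best
        cls = {w: (a if c == b else c) for w, c in cls.items()}
        classes.discard(b)
        merges.append((a, b))
    return merges

def build_bitstrings(vocab, merges):
    """Assign each word the bit string spelled by its path in the merge tree."""
    bits = {w: "" for w in vocab}
    members = {w: [w] for w in vocab}
    for a, b in merges:                # replay merges bottom-up, prepending bits
        for w in members[a]:
            bits[w] = "0" + bits[w]
        for w in members[b]:
            bits[w] = "1" + bits[w]
        members[a] = members[a] + members.pop(b)
    return bits
```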
Like all methods in statistical language modeling, this 
approach is limited by the problems of statistical signif- 
icance imposed by the lack of sufficient training data. 
However, the method provides a powerful way of au- 
tomatically extracting both semantic and syntactic fea- 
tures of large vocabularies. We refer to [2] for examples
of the features which this procedure yields. 
5. Smoothing the Leaf Distributions 
After growing a decision tree according to the proce- 
dures outlined above, we obtain an equivalence class of 
histories together with an empirical distribution of tags 
at each leaf. Because the training data, which is in any 
case limited, is split exponentially in the process of grow- 
ing the tree, many of the leaves are invariably associated 
with a small number of events. Consequently, the em- 
pirical distributions at such leaves may not be reliable, 
and it is desirable to smooth them against more reliable 
statistics. 
One approach is to form the smoothed distributions $\bar{P}(\cdot \mid n)$ from the empirical distributions $P(\cdot \mid n)$ for a node $n$ by setting
$$\bar{P}(t \mid n) = \lambda_n\, P(t \mid n) + (1 - \lambda_n)\, \bar{P}(t \mid \mathrm{parent}(n))$$
where $\mathrm{parent}(n)$ is the parent node of $n$ (with the convention that $\mathrm{parent}(\mathrm{root}) = \mathrm{root}$), and $0 \le \lambda_n \le 1$ can be thought of as the confidence placed in the empirical distribution at the node.
In order to optimize the coefficients $\lambda_n$, we seek to maximize the probability that the correct prediction is made for every event in a corpus $\mathcal{E}_H$ held out from the training corpus used to grow the tree. That is, we attempt to maximize the objective function
$$O = \prod_{(t, H) \in \mathcal{E}_H} \bar{P}(t \mid L(H))$$
as a function of the coefficients $\lambda = (\lambda_1, \lambda_2, \ldots)$, where $L(H)$ is the leaf of the history $H$. While finding the maximizing $\lambda$ is generally an intractable problem, the EM algorithm can be adopted to estimate coefficients which locally maximize the above objective function. Since this is a straightforward application of the EM algorithm we will not present the details of the calculation here.
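Since the calculation is omitted here, the following is only one plausible rendering of the EM update for a single node, under the simplifying assumption that the parent's smoothed distribution is held fixed while the node's weight is fit to held-out events; smooth_leaf and the convergence tolerance are our inventions, not the paper's.

```python
def smooth_leaf(empirical, parent_smoothed, heldout_tags, iters=50, tol=1e-6):
    """Estimate the interpolation weight lambda for one node by EM,
    maximizing held-out likelihood of
        P_bar(t) = lam * empirical[t] + (1 - lam) * parent_smoothed[t].
    `empirical` and `parent_smoothed` map tags to probabilities;
    `heldout_tags` lists the tags observed at this node in held-out data."""
    if not heldout_tags:
        return 0.0        # no evidence at this node: back off to the parent
    lam = 0.5             # uninformative starting point
    for _ in range(iters):
        # E-step: expected fraction of held-out events "explained" by
        # the empirical distribution rather than the parent.
        resp = 0.0
        for t in heldout_tags:
            num = lam * empirical.get(t, 0.0)
            den = num + (1.0 - lam) * parent_smoothed.get(t, 0.0)
            if den > 0.0:
                resp += num / den
        # M-step: the new weight is the average responsibility.
        new_lam = resp / len(heldout_tags)
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam

def smoothed_distribution(empirical, parent_smoothed, lam):
    """Interpolate the node's empirical tag distribution with its parent's."""
    tags = set(empirical) | set(parent_smoothed)
    return {t: lam * empirical.get(t, 0.0)
               + (1.0 - lam) * parent_smoothed.get(t, 0.0)
            for t in tags}
```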
6. Experimental Results 
In this section we report on two experiments in part- 
of-speech labeling using decision trees. In the first ex- 
periment, we created a model for tagging text using a 
portion of the Lancaster treebank. In the second exper- 
iment, we tagged a portion of the Brown corpus using a 
model derived from the University of Pennsylvania cor- 
pus of hand-corrected labeled text. In each case we com- 
pared the standard HMM model to a maximum entropy 
model of the form 
$$P(t_n \mid t_1, t_2, \ldots, t_{n-1} \;\text{and}\; w_1, w_2, \ldots, w_N) = P(t_n \mid t_{n-2}, t_{n-1};\ w_{n-2}, w_{n-1}, w_n, w_{n+1}, w_{n+2})$$
$$= P(t_n \mid w_{n-2}, w_{n-1}, w_n, w_{n+1}, w_{n+2}) \times P(t_n \mid t_{n-2}, t_{n-1})\, P(t_n)^{-1}$$
where the parameters $P(t_n \mid t_{n-2}, t_{n-1})$ were obtained using the usual HMM method, and the parameters $P(t_n \mid w_{n-2}, w_{n-1}, w_n, w_{n+1}, w_{n+2})$ were obtained from a smoothed decision tree as described above. The trees were grown to have from 30,000 to 40,000 leaves.
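Concretely, for a single position the combined model scores each candidate tag by multiplying the two marginals and dividing by the tag prior. The sketch below assumes the component models are available as lookup functions; tree_prob, trigram_prob, and tag_prior are hypothetical names of ours.

```python
def combined_tag_scores(tags, tag_context, word_window,
                        tree_prob, trigram_prob, tag_prior):
    """Score each candidate tag t by the maximum entropy combination
        P(t | word window) * P(t | t_{n-2}, t_{n-1}) / P(t),
    then renormalize the scores into a distribution for convenience."""
    t2, t1 = tag_context                        # the two preceding tags
    scores = {}
    for t in tags:
        p_words = tree_prob(t, word_window)     # decision tree marginal
        p_tags = trigram_prob(t, t2, t1)        # HMM trigram marginal
        prior = tag_prior(t)
        scores[t] = (p_words * p_tags / prior) if prior > 0 else 0.0
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total > 0 else scores
```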
The relevant data of the experiments is tabulated in Ta- 
bles 2 and 3. The word and tag vocabularies were derived 
from the data, as opposed to being obtained from on-line 
dictionaries or other sources. In the case of the Lancaster 
treebank, however, the original set of approximately 350 
tags, many of which were special tags for idioms, was 
compressed to a set of 163 tags. A rough categorization 
of these parts-of-speech appears in Table 1. 
For training the model we had at our disposal approxi- 
mately 1.9 million words of hand-labeled text. This cor- 
pus is approximately half AP newswire text and half En- 
glish Hansard text, and was labeled by the team of Lan- 
caster linguists. To construct our model, we divided the 
data into three sections, to be used for training, smoothing, and testing, consisting of 1,488,271 words, 392,732 words, and 51,384 words respectively.
29 Nouns 
27 Verbs 
20 Pronouns 
17 Determiners 
16 Adverbs 
12 Punctuation 
10 Conjunctions 
8 Adjectives 
4 Prepositions 
20 Other 
Table 1: Lancaster parts-of-speech 
We created an initial lexicon with the word-tag pairs that 
appear in the training, smoothing, and test portions of 
this data. We then filled out this lexicon using a statis- 
tical procedure which combines information from word 
spellings together with information derived from word 
bigram statistics in English text. This technique can be 
used both to discover parts-of-speech for words which do 
not occur in the hand-labeled text, as well as to discover 
additional parts-of-speech for those that do. In both ex- 
periments multiword expressions, such as "nineteenth- 
century" and "stream-of-consciousness," which were as- 
signed a single tag in the hand-labelled text, were broken 
up into single words in the training text, with each word 
receiving no tag. 
The parameters of the HMM model were estimated from 
the training section of the hand-labeled text, without 
any use of the forward-backward algorithm. Subse- 
quently, we used the smoothing section of the data to 
construct an interpolated model as described by Meri- 
aldo [4, 6].
We evaluated the performance of the interpolated hidden 
Markov model by tagging the 2000 sentences which make 
up the testing portion of the data. We then compared 
the resultant tags with those produced by the Lancaster 
team, and found the error rate to be 3.03%. 
We then grew and smoothed a decision tree using the 
same division of training and smoothing data, and com- 
bined the resulting marginals for predicting tags from 
the word context with the marginals for predicting tags 
from the tag context derived from the HMM model. The 
resulting error rate was 2.61%, a 14% reduction from the 
HMM model figure. 
Tag vocabulary size: 163 tags 
Word vocabulary size: 41471 words 
Training data: 1,488,271 words 
Held-out data: 392,732 words 
Test data: 51,384 words 
(2000 sentences) 
Source of data: Hansards 
AP newswire 
Dictionary: no unknown words 
Multiword expressions: broken up 
HMM errors: 1558 (3.03%) 
Decision tree errors: 1341 (2.61%) 
Error reduction: 13.9% 
Table 2: Lancaster Treebank Experiment 
In the case of the experiment with the UPenn corpus, 
the word vocabulary and dictionary were derived from 
the training and smoothing data only, and the dictio- 
nary was not statistically filled out. Thus, there were 
unknown words in the test data. The tag set used in the second experiment consisted of the 48 tags chosen by the UPenn project. For training the model we had
at our disposal approximately 4.4 million words of hand- 
labeled text, using approximately half the Brown corpus, 
with the remainder coming from the Wall Street Jour- 
nal texts labelled by the UPenn team. For testing the 
model we used the remaining half of the Brown corpus, 
which was not used for any other purpose. To construct 
our model, we divided the data into a training section 
of 4,113,858 words, and a smoothing section of 292,731 
words. The error rate on 8,000 sentences from the Brown 
corpus test set was found to be 4.57%. The correspond- 
ing error rate for the model using a decision tree grown 
only on the Brown corpus portion of the training data 
was 4.37%, representing only a 4.31% reduction in the 
error rate. 
7. Conclusions 
In two experiments we have seen how decision trees pro- 
vide modest improvements over HMM's for the problem 
of labeling unrestricted text with parts-of-speech. In ex- 
amining the errors made by the models which incorpo- 
rate the decision tree marginals, we find that the errors 
may be attributed to two primary problems: bad questions and insufficient training data.
Tag vocabulary size: 48 tags 
Word vocabulary size: 86456 words 
Training data: 4,113,858 words 
Held-out data: 292,731 words 
Test data: 212,064 words 
(8000 sentences) 
Source of data: Brown corpus 
Wall Street Journal 
Dictionary: unknown test words 
Multiword expressions: broken up 
HMM errors: 9683 (4.57%) 
Decision tree errors: 9265 (4.37%) 
Error reduction: 4.31% 
Table 3: UPenn Brown Corpus Experiment
Consider the word
lack, for example, which may be either a noun or a verb. 
The mutual information clustering procedure tends to 
classify such words as either nouns or verbs, rather than 
as words which may be both. In the case of lack as 
it appeared in the Lancaster data, the binary features 
emphasized the nominal aspects of the word, relating it
to such words as scarcity, number, amount and portion. 
This resulted in errors when it occurred as a verb in the 
test data. 
Clearly an improvement in the binary questions asked of 
the histories is called for. In a preliminary set of exper- 
iments we augmented the automatically-derived ques- 
tions with a small set of hand-constructed questions 
which were intended to resolve the ambiguity of the la- 
bel for verbs which may have either the active or pas- 
sive aspect. The resulting decision trees, however, did 
not significantly improve the error rate on this partic- 
ular problem, which represents inherently long-distance 
linguistic phenomena. Nevertheless, it appears that the 
basic approach can be made to prosper through a com- 
bination of automatic and linguistic efforts. 
References
1. L. Bahl, P. Brown, P. deSouza, and R. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, pp. 1001-1008, 1989.
2. P. Brown, V. Della Pietra, P. deSouza, and R. Mercer. Class-based n-gram models of natural language. Proceedings of the IBM Natural Language ITL, pp. 283-298, Paris, France, 1990.
3. K. Church. A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, 1988.
4. B. Merialdo. Tagging text with a probabilistic model. IBM Research Report, RC 1597~, 1990.
5. M. Meteer, R. Schwartz, and R. Weischedel. Studies in part of speech labelling. In Proceedings of the February 1991 DARPA Speech and Natural Language Workshop, Asilomar, California.
6. S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35, Number 3, pp. 400-401, 1987.
