i 
A Maximum Entropy Model for Part-Of-Speech Tagging 
Adwait Ratnaparkhi 
University of Pennsylvania 
Dept. of Computer and Information Science 
adwait~gradient, cis. upenn, edu 
Abstract 
This paper presents a statistical model which 
trains from a corpus annotated with Part-Of- 
Speech tags and assigns them to previously unseen 
text with state-of-the-art accuracy(96.6%). The 
model can be classified as a Maximum Entropy 
model and simultaneously uses many contextual 
"features" to predict the POS tag. Furthermore, 
this paper demonstrates the use of specialized fea- 
tures to model difficult tagging decisions, discusses 
the corpus consistency problems discovered during 
the implementation of these features, and proposes 
a training strategy that mitigates these problems. 
Introduction 
Many natural language tasks require the accurate 
assignment of Part-Of-Speech (POS) tags to pre- 
viously unseen text. Due to the availability of 
large corpora which have been manually annotated 
with POS information, many taggers use anno- 
tated text to "learn" either probability distribu- 
tions or rules and use them to automatically assign 
POS tags to unseen text. 
The experiments in this paper were conducted 
on the Wall Street Journal corpus from the Penn 
Treebank project(Marcus et al., 1994), although 
the model can trai~n from any large corpus anno- 
tated with POS tags. Since most realistic natu- 
ral language applications must process words that 
were never seen before in training data, all exper- 
iments in this paper are conducted on test data 
that include unknown words. 
Several recent papers(Brill, 1994, 
Magerman, 1995) have reported 96.5% tag- 
ging accuracy on the Wall St. Journal corpus. 
The experiments in this paper test the hy- 
pothesis that better use of context will improve 
the accuracy. A Maximum Entropy model is 
well-suited for such experiments since it corn- 
bines diverse forms of contextual information 
in a principled manner, and does not impose 
any distributional assumptions on the train- 
ing data. Previous uses of this model include 
language modeling(Lau et al., 1993), machine 
translation(Berger et al., 1996), prepositional 
phrase attachment(Ratnaparkhi et al., 1994), and 
word morphology(Della Pietra et al., 1995). This 
paper briefly describes the maximum entropy 
and maximum likelihood properties of the model, 
features used for POS tagging, and the experi- 
ments on the Penn Treebank Wall St. Journal 
corpus. It then discusses the consistency problems 
discovered during an attempt to use specialized 
features on the word context. Lastly, the results 
in this paper are compared to those from previous 
work on POS tagging. 
The Probability Model 
The probability model is defined over 7-/x 7-, where 
7t is the set of possible word and tag contexts, or 
"histories", and T is the set of allowable tags. The 
model's probability of a history h together with a 
tag t is defined as: 
k 
p(h, t) = Ill -J- jj(h,t) 
j=l 
(1) 
where ~" is a normalization constant, 
{tt, ~1,..., ak} are the positive model parameters 
and {fl,..-,fk} are known as "features", where 
fj(h,t) E {O, 1}. Note that each parameter aj 
corresponds to a feature fj. 
Given a sequence of words {wl,..., Wn} and 
tags {tl,...t,~} as training data, define hi as the 
history available when predicting ti. The param- 
eters {p, al ..... ak} are then chosen to maximize 
133 
the likelihood of the training data using p: 
r(p) = 1-\[ p(h.t,) = 1-\[ 
i=1 i=1 j=l 
This model also can be interpreted under the 
Maximum Entropy formalism, in which the goal is 
to maximize the entropy of a distribution subject 
to certain constraints. Here, the entropy of the 
distribution p is defined as: 
H(p) = - E p(h,t) logp(h,t) 
hE74,tET 
and the constraints are given by: 
Efj = Efj, l < j _< k (2) 
where the model's feature expectation is 
Efj= E p(h,t)fj(h,t) 
hET.l,tET 
and the observed feature expectation is 
i=1 
and where iS(hi, ti) denotes the observed probabil- 
ity of (hi,ti) in the training data. Thus the con- 
straints force the model to match its feature expec- 
tations with those observed in the training data. 
In practice, 7-/ is very large and the model's ex- 
pectation Efj cannot be computed directly, so the 
following approximation(Lau et al., 1993) is used: 
n 
E fj ,~ E15(hi)p(tilhi)fj(hi,ti) 
i=1 
where fi(hi) is the observed probability of the his- 
tory hi in the training set. 
It can be shown (Darroch and Ratcliff, 1972) 
that if p has the form (1) and satisfies the k 
constraints (2), it uniquely maximizes the en- 
tropy H(p) over distributions that satisfy (2), and 
uniquely maximizes the likelihood L(p) over dis- 
tributions of the form (1). The model parameters 
for the distribution p are obtained via Generalized 
Iterative Scaling(Darroch and Ratcliff, 1972). 
Features for POS Tagging 
The joint probability of a history h and tag t 
is determined by those parameters whose corre- 
sponding features are active, i.e., those o~j such 
134 
that fj(h,t) = 1. A feature, given (h,t), may ac- 
tivate on any word or tag in the history h, and 
must encode any information that might help pre- 
dict t, such as the spelling of the current word, or 
the identity of the previous two tags. The specific 
word and tag context available to a feature is given 
in the following definition of a history hi: 
hi   {wi, wi+I, wi-}-2, wi-1, wi-2, ti-1, ti-2} 
For example, 
1 if suffix(w/) = "ing" & ti = VBG 
fj(hl,ti)= 0 otherwise 
If the above feature exists in the feature set of 
the model, its corresponding model parameter will 
contribute towards the joint probability p(hi,ti) 
when wi ends with "±ng" and when ti =VBG 1. 
Thus a model parameter aj effectively serves as 
a "weight" for a certain contextual predictor, in 
this case the suffix "ing", towards the probability 
of observing a certain tag, in this case a VBG. 
The model generates the space of features by 
scanning each pair (hi, ti) in the training data with 
the feature "templates" given in Table 1. Given 
hi as the current history, a feature always asks 
some yes/no question about hi, and furthermore 
constrains ti to be a certain tag. The instantia- 
tions for the variables X, Y, and T in Table 1 are 
obtained automatically from the training data. 
The generation of features for tagging un- 
known words relies on the hypothesized distinction 
that "rare" words 2 in the training set are similar 
to unknown words in test data, with respect to 
how their spellings help predict their tags. The 
rare word features in Table 1, which look at the 
word spellings, will apply to both rare words and 
unknown words in test data. 
For example, Table 2 contains an excerpt from 
training data while Table 3 contains the features 
generated while scanning (ha, t3), in which the cur- 
rent word is about, and Table 4 contains features 
generated while scanning (h4, t4), in which the cur- 
rent word, well-heeled, occurs 3 times in train- 
ing data and is therefore classified as "rare". 
The behavior of a feature that occurs very 
sparsely in the training set is often difficult to pre- 
dict, since its statistics may not be reliable. There- 
fore, the model uses the heuristic that any feature 
1VBG is the Treebank POS tag for Verb Gerund. 
2A "rare" word here denotes a word which occurs 
less than 5 times in the training set. The count of 5 
was chosen by subjective inspection of words in the 
training data. 
Condition Features 
w i is not rare wi = X & ti = T 
wi is rare 
¥ wi 
X is prefix of wi, IXI ~ 4 &ti=T 
X is suffix of wi, IXI < 4 & li   T 
wi contains number ~z ti = T 
wi contains uppercase character ~ ti = T 
wi contains hyphen ~ti=T 
ti-1 = X & ti = T 
ti_2ti_t = XY ~ ti = T 
wi-1 = X & li = T 
wi-2 = X ~ ti = T 
wi+ 1 = X ~s t i = T 
wi+2 = X & ti = T 
Table h Features on the current history hi 
Word: 
Tag: 
Position: 
the stories about well-heeled communities and developers 
DT NNS IN JJ NNS CC NNS 
1 2 3 4 5 6 7 
Table 2: Sample Data 
wi ---- about 
wi-i -~ stories 
wi-2 ---- the 
Wi+l = well-heeled 
wi+2 ---- communities 
ti- I = NNS 
ti-2ti-1 = DT NNS 
Table 3: Features Generated From h3 (for 
& ti = IN 
g5 ti = IN 
& ti = IN 
ti = IN 
~2 ti = IN 
~: ti = IN 
ti = IN 
tagging about) from Table 2 
Wi-1 -~ about   ti = JJ 
wi-2 = stories ~z ti = JJ 
Wi+l = communities & ti = JJ 
wi+2 = and ~ ti = JJ 
ti-1 = IN ~ ti = JJ 
ti-2ti-I ---- I~IS IN ~z ti = JJ 
prefix(wi)=w ~ ii = JJ 
prefix(wi)----we & ti = JJ 
prefix(wi)=wel ~z li = JJ 
prefix(wi)=well & ti = JJ 
sulffix(wi)=d ~ ti = JJ 
suffix(wi)=ed & ti = JJ 
sufx(wi)=led ~ ti = JJ 
suffix(wi)=eled & ti = JJ 
wi contains hyphen & ti = JJ 
Table 4: Features Generated From h4 (for tagging well-heeled) from Table 2 
135 
which occurs less than 10 times in the data is un- 
reliable, and ignores features whose counts are less 
than 10. 3 While there are many smoothing algo- 
rithms which use techniques more rigorous than a 
simple count cutoff, they have not yet been inves- 
tigated in conjunction with this tagger. 
Testing the Model 
The test corpus is tagged one sentence at a time. 
The testing procedure requires a search to enumer- 
ate the candidate tag sequences for the sentence, 
and the tag sequence with the highest probability 
is chosen as the answer. 
Search Algorithm 
The search algorithm, essentially a "beam search", 
uses the conditional tag probability 
p(h,t) 
p(tlh) - p(h,t') 
and maintains, as it sees a new word, the N 
highest probability tag sequence candidates up 
to that point in the sentence. Given a sentence 
a tag sequence candidate 
has conditional probability: 
P(tl . .tnlwl ..wn) = II p(tilh,) 
i=l 
In addition the search procedure optionally 
consults a Tag Dictionary, which, for each known 
word, lists the tags that it has appeared with in 
the training set. If the Tag Dictionary is in effect, 
the search procedure, for known words, generates 
only tags given by the dictionary entry, while for 
unknown words, generates all tags in the tag set. 
Without the Tag Dictionary, the search procedure 
generates all tags in the tag set for every word. 
Let W = {wl...w,~} be a test sentence, and 
let sij be the jth highest probability tag sequence 
up to and including word wi. The search is de- 
scribed below: 
1. Generate tags for wl, find top N, set 81j , 1 _< 
j < N, accordingly. 
2. Initialize i = 2 
(a) Initialize j = 1 
3Except for features that look only at the cur- 
rent word, i.e., features of the form wl ----<word> and 
tl :<TAG>. The count of 10 was chosen by inspection 
of Training and Development data. 
136 
(b) Generate tags for wi, given s(i-1)j as previous 
tag context, and append each tag to s(i-1)j to 
make a new sequence 
(c) j = j + 1, Repeat from (b) ifj _< g 
3. Find N highest probability sequences generated 
by above loop, and set sij, 1 < j _< N, accord- 
ingly. 
4. i = i + 1, Repeat from (a) if i _< n 
5. Return highest probability sequence, s~l 
Experiments 
In order to conduct tagging experiments, the 
Wall St. Journal data has been split into three 
contiguous sections, as shown in Table 5. The 
feature set and search algorithm were tested and 
debugged only on the Training and Development 
sets, and the official test result on the unseen Test 
set is presented in the conclusion of the paper. 
The performances of the "baseline" model on the 
Development Set, both with and without the Tag 
Dictionary, are shown in Table 6. 
All experiments use a beam size of N = 5; 
further increasing the beam size does not signifi- 
cantly increase performance on the Development 
Set but adversely affects the speed of the tagger. 
Even though use of the Tag Dictionary gave an ap- 
parently insignificant (. 12%) improvement in accu- 
racy, it is used in further experiments since it sig- 
nificantly reduces the number of hypotheses and 
thus speeds up the tagger. 
The running time of the parameter estimation 
algorithm is O(NTA), where N is the training set 
size, T is the number of allowable tags, and A is 
the average number of features that are active for a 
given event (h, t). The running time of the search 
procedure on a sentence of length N is O(NTAB), 
where T, A are defined above, and B is the beam 
size. In practice, the model for the experiment 
shown in Table 6 requires approximately 24 hours 
to train, and 1 hour to test 4 on an IBM RS/6000 
Model 380 with 256MB of RAM. 
Specialized Features and 
Consistency 
The Maximum Entropy model allows arbitrary 
binary-valued features on the context, so it can use 
additional specialized, i.e., word-specific, features 
4The search procedure has not been optimized and 
the author expects it to run 3 to 5 times faster after 
optimizations. 
DataSet Sentences Words Unknown Words 
Training 40000 962687 
Development 8000 192826 6107 
Test 5485 133805 3546 
Table 5: WSJ Data Sizes 
Tag DictiOnary 
No Tag Dictionary 
Total Word Accuracy Unknown Word Accuracy 
96.43% 86.23% 
96.31% 86.28% 
Table 6: Baseline Performance on Development Set 
Sentence Accuracy 
47.55% 
47.38% 
Table 7: 
Word 
about 
that 
more 
up 
that 
as 
up 
more 
that 
about 
that 
out 
that 
much 
yen 
chief 
up 
ago 
much 
out 
Correct Tag Model's Tag Frequency 
RB IN 393 
DT IN 389 
RBR J JR 221 
IN RB 187 
WDT IN 184 
RB IN 176 
IN RP 176 
J JR RBR 175 
IN WDT 159 
IN i RB 144 
IN DT 127 
RP 
DT 
IN 
WDT 
126 
123 
JJ RB 118 
NN NNS 117 
NN I JJ 116 
RP IN 114 
IN RB 112 
RB !JJ 111 
IN RP 109 
Top Tagging Mistakes on Training Set for Baseline Model 
137 
to correctly tag the "residue" that the baseline 
features cannot model. Since such features typ- 
ically occur infrequently, the training set consis- 
tency must be good enough to yield reliable statis- 
tics. Otherwise the specialized features will model 
noise and perform poorly on test data. 
Such features can be designed for those words 
which are especially problematic for the model. 
The top errors of the model (over the training set) 
are shown in Table 7; clearly, the model has trou- 
ble with the words that and about, among others. 
As hypothesized in the introduction, better fea- 
tures on the context surrounding that and about 
should correct the tagging mistakes for these two 
words, assuming that the tagging errors are due to 
an impoverished feature set, and not inconsistent 
data. 
Specialized features for a given word are con- 
structed by conjoining certain features in the base- 
line model with a question about the word itself. 
The features which ask about previous tags and 
surrounding words now additionally ask about the 
identity of the current word, e.g., a specialized fea- 
ture for the word about in Table 3 could be: 
1 if wi : about ~ ti-2ti-1 = DT NNS 
fj (hi, ti) = & ti = IN 
0 otherwise 
Table 8 shows the results of an experiment 
in which specialized features are constructed for 
"difficult" words, and are added to the baseline 
feature set. Here, "difficult" words are those that 
are mistagged a certain way at least 50 times when 
the training set is tagged with the baseline model. 
Using the set of 29 difficult words, the model per- 
forms at 96.49% accuracy on the Development Set, 
an insignificant improvement from the baseline ac- 
curacy of 96.43%. Table 9 shows the change in er- 
ror rates on the Development Set for the frequently 
occurring "difficult" words. For most words, the 
specialized model yields little or no improvement, 
and for some, i.e., more and about, the specialized 
model performs worse. 
The lack of improvement implies that either 
the feature set is still impoverished, or that the 
training data is inconsistent. A simple consistency 
test is to graph the POS tag assignments for a 
given word as a function of the article in which 
it occurs. Consistently tagged words should have 
roughly the same tag distribution as the article 
numbers vary. Figure 1 represents each POS tag 
with a unique integer and graphs the POS annota- 
tion of about in the training set as a function of the 
138 
articles (the points are "scattered" to show den- 
sity). As seen in figure 1, about is usually anno- 
tated with tag#l, which denotes IN (preposition), 
or tag#9, which denotes RB (adverb), and the ob- 
served probability of either choice depends heavily 
on the current article-~. Upon further examina- 
tion 5, the tagging distribution for about changes 
precisely when the annotator changes. Figure 2, 
which again uses integers to denote POS tags, 
shows the tag distribution of about as a function of 
annotator, and implies that the tagging errors for 
this word are due mostly to inconsistent data. The 
words ago, chief, down, executive, off, out, up 
and yen also exhibit similar bias. 
Thus specialized features may be less effective 
for those words affected by inter-annotator bias. 
A simple solution to eliminate inter-annotator in- 
consistency is to train and test the model on data 
that has been created by the same annotator. The 
results of such an experiment 6 are shown in Ta- 
ble 10. The total accuracy is higher, implying 
that the singly-annotated training and test sets 
are more consistent, and the improvement due to 
the specialized features is higher than before (.1%) 
but still modest, implying that either the features 
need further improvement or that intra-annotator 
inconsistencies exist in the corpus. 
Comparison With Previous Work 
Most of the recent corpus-based POS taggers in 
the literature are either statistically based, and 
use Markov Model(Weischedel et al., 1993, 
Merialdo, 1994) or Statistical Decision 
Tree(Jelinek et al., 1994, Magerman, 1995)(SDT) 
techniques, or are primarily rule based, 
such as Drill's Transformation Based 
Learner(Drill, 1994)(TBL). The Maximum 
Entropy (MaxEnt) tagger presented in this paper 
combines the advantages of all these methods. It 
uses a rich feature representation, like TBL and 
SDT, and generates a tag probability distribution 
for each word, like Decision Tree and Markov 
Model techniques. 
5The mapping from article to annotator is in the 
file doc/wsj .wht on the Treebank CDROM. 
6The single-annotator training data was obtained 
by extracting those articles tagged by "maryann" in 
the Treebank v.5 CDROM. This training data does 
not overlap with the Development and Test set used 
in the paper. The single-annotator Development Set 
is the portion of the Development Set which has also 
been annotated by "maryann". The word vocabulary 
and tag dictionary are the same as in the baseline 
experiment. 
Number of "Difficult" Words I Development Set Performance 
29 \] 96.49% 
Table 8: Performance of Baseline Model with Specialized Features 
Word ~ Baseline Model Errors # Specialized Model Errors 
that 246 207 
up 186 169 
about 110 120 
out 104 97 
more 88 89 
down 81 84 
off 73 78 
as 50 38 
much 47 40 
chief 46 47 
in 39 39 
executive 37 33 
most 23 34 
ago 22 18 
yen 18 17 
Table 9: Errors on Development Set with Baseline and Specialized Models 
35 
30 
25 
20 
POS Tag 
15 
10 
5 
• I 
I I I I I I I I I ., , .. o , 
'~$1r~. . • mmL.up~ ~ .'mNNmn~. ~ gtPdl= |&.allm~WI.Lqlf 
IDW,t~lIO, r I~"1~~ ~ II, Mlmulm, • IP, il~ ,,lllb, l~ ~ 
I I I I I I I I I 
0 200 400 600 800 1000 1200 1400 1600 1800 
Article# 
2000 
Figure 1: Distribution of Tags for the word "about" vs. Article# 
Training Size(w°rds)I Test571190 Size(w°rds) I Baseline44478 97.04% Specialized 197.13% 
Table 10: Performance of Baseline ~ Specialized Model When Tested on Consistent Subset of Development 
Set 
139 
POS Tag 
35 
30 
25 
2O 
15 
10 
5 
0 
1 
I o. Oho° 
m 
I I 
I 
B ~ m M 
I I I 
2 3 4 
Annotator 
Figure 2: Distribution of Tags for the word "about" vs. Annotator 
(Weischedel et al., 1993) provide the results 
from a battery of "tri-tag" Markov Model exper- 
iments, in which the probability P(W,T) of ob- 
serving a word sequence W = {wl,w2,...,wn} 
together with a tag sequence T = {tl,t2,...,tn} 
is given by: 
P(TIW)P(W) p(tl)p(t21tl) × 
H P(tilti-lti-2) p(wilti 
i=3 
Furthermore, p(wilti) for unknown words is com- 
puted by the following heuristic, which uses a set 
of 35 pre-determined endings: 
p(wilti) p(unknownwordlti ) x 
p(capitalfeature\[ti) x 
p(endings, hypenationlti ) 
This approximation works as well as the 
MaxEnt model, giving 85% unknown word 
accuracy(Weischedel et al., 1993) on the Wall St. 
Journal, but cannot be generalized to handle more 
diverse information sources. Multiplying together 
all the probabilities becomes less convincing of 
an approximation as the information sources be- 
come less independent. In contrast, the Max- 
Ent model combines diverse and non-local infor- 
mation sources without making any independence 
assumptions. 
140 
A POS tagger is one component in the 
SDT based statisticM parsing system described 
in (Jelinek et al., 1994, Magerman, 1995). The 
total word accuracy on Wall St. Jour- 
nal data, 96.5%(Magerman, 1995), is similar to 
that presented in this paper. However, the 
aforementioned SDT techniques require word 
classes(Brown et al., 1992) to help prevent data 
fragmentation, and a sophisticated smoothing al- 
gorithm to mitigate the effects of any fragmenta- 
tion that occurs. Unlike SDT, the MaxEnt train- 
ing procedure does not recursively split the data, 
and hence does not suffer from unreliable counts 
due to data fragmentation. As a result, no word 
classes are required and a trivial count cutoff suf- 
rices as a smoothing procedure in order to achieve 
roughly the same level of accuracy. 
TBL is a non-statistical approach to POS 
tagging which also uses a rich feature repre- 
sentation, and performs at a total word accu- 
racy of 96.5% and an unknown word accuracy of 
85%.(Bri11, 1994). The TBL representation of the 
surrounding word context is almost the same 7 and 
the TBL representation of unknown words is a 
superset s of the unknown word representation in 
this paper. However, since TBL is non-statistical, 
it does not provide probability distributions and 
7(Brill, 1994) looks at words ±3 away from the cur- 
rent, whereas the feature set in this paper uses a win- 
dow of ±2. 
8(Brill, 1994) uses prefix/suffix additions and dele- 
tions, which are not used in this paper. 
! 
unlike MaxEnt, cannot be used as a probabilis- 
tic component in a larger model. MaxEnt can 
provide a probability for each tagging decision, 
which can be used in the probability calculation 
of any structure that is predicted over the POS 
tags, such as noun phrases, or entire parse trees, 
as in (Jelinek et al., 1994, Magerman, 1995). 
Thus MaxEnt has at least one advantage over 
each of the reviewed POS tagging techniques. It is 
better able to use diverse information than Markov 
Models, requires less supporting techniques than 
SDT, and unlike TBL, can be used in a prob- 
abilistic framework. However, the POS tagging 
accuracy on the Penn Wall St. Journal corpus 
is roughly the same for all these modelling tech- 
niques. The convergence of the accuracy rate 
implies that either all these techniques are miss- 
ing the right predictors in their representation 
to get the "residue", or more likely, that any 
corpus based algorithm on the Penn Treebank 
Wall St. Journal corpus will not perform much 
higher than 96.5% due to consistency problems. 
Conclusion 
The Maximum Entropy model is an extremely 
flexible technique for linguistic modelling, since it 
can use a virtually unrestricted and rich feature 
set in the framework of a probability model. The 
implementation in this paper is a state-of-the-art 
POS tagger, as evidenced by the 96.6% accuracy 
on the unseen Test set, shown in Table 11. 
The model with specialized features does not 
perform much better than the baseline model, and 
further discovery or refinement of word-based fea- 
tures is difficult given the inconsistencies in the 
training data. A model trained and tested on data 
from a single annotator performs at .5% higher 
accuracy than the baseline model and should pro- 
duce more consistent input for applications that 
require tagged text. 

References 
\[ARP, 1994\] ARPA. 1994. Proceedings of the Hu- 
man Language Technology Workshop. 
\[Berger et al., 1996\] Adam Berger, Stephen 
A. Della Pietra, and Vincent J. Della Pietra. 
1996. A Maximum Entropy Approach to 
Natural Language Processing. Computational 
Linguistics, 22(1):39-71. 
\[Brill, 1994\] Eric Brill. 1994. Some Advances in 
Transformation-Based Part of Speech Tagging. 
In Proceedings off the Twelfth National Confer- 
ence on Artificial Intelligence, volume 1, pages 
722-727. 
\[Brown et al., 1992\] Peter F Brown, Vincent Del- 
laPietra, Peter V deSouza, Jennifer C Lai, and 
Robert L Mercer. 1992. Class-Based n-gram 
Models of Natural Language. Computational 
Linguistics, 18(4). 
\[Darroch and Ratcliff, 1972\] J. N. Darroch and 
D. Ratcliff. 1972. Generalized Iterative Scaling 
for Log-Linear Models. The Annals of Mathe- 
matical Statistics, 43(5) :1470-1480. 
\[Della Pietra et al., 1995\] Steven Della Pietra, 
Vincent Della Pietra, and John Lafferty. 1995. 
Inducing Features of Random Fields. Techni- 
cal Report CMU-CS95-144, School of Computer 
Science, Carnegie-Mellon University. 
\[Jelinek et al., 1994\] F Jelinek, J Lafferty, 
D Magerman, R Mercer, A Ratnaparkhi, and 
S Roukos. 1994. Decision Tree Parsing using 
a Hidden Derivational Model. In Proceedings 
of the Human Language Technology Workshop 
(ARP, 1994), pages 272-277. 
\[Lau et al., 1993\] Ray Lau, Ronald Rosenfeld, 
and Salim Roukos. 1993. Adaptive Language 
Modeling Using The Maximum Entropy Prin- 
ciple. In Proceedings of the Human Language 
Technology Workshop, pages 108-113. ARPA. 
\[Magerman, 1995\] David M. Magerman. 1995. 
Statistical Decision-Tree Models for Parsing. In 
Proceedings of the 33rd Annual Meeting of the 
ACL. 
\[Marcus et al., 1994\] Mitchell P. Marcus, Beatrice 
Santorini, and Mary Ann Mareinkiewicz. 1994. 
Building a large annotated corpus of English: 
the Penn Treebank. Computational Linguistics, 
19(2):313-330. 
\[Merialdo, 1994\] Bernard Merialdo. 1994. Tag- 
ging English Text with a Probabilistic Model. 
Computational Linguistics, 20(2):155-172. 
\[Ratnaparkhi et al., 1994\] Adwait Ratnaparkhi, 
Jeff Reynar, and Salim Roukos. 1994. A Maxi- 
mum Entropy Model for Prepositional Phrase 
Attachment. In Proceedings of the Human 
Language Technology Workshop (ARP, 1994), 
pages 250-255. 
\[Weischedel et al., 1993\] Ralph Weischedel, Marie 
Meteer, Richard Schwartz, Lance Ramshaw, 
and Jeff Palmucci. 1993. Coping With Ambigu- 
ity and Unknown Words through Probabilistic 
Models. Computational Linguistics, 19(2):359- 
382. 
