A Maximum-Entropy-Inspired Parser * 
Eugene Charniak 
Brown Laboratory for Linguistic Information Processing 
Department of Computer Science 
Brown University, Box 1910, Providence, RI 02912 
ec@cs.brown.edu 
Abstract 
We present a new parser for parsing down to 
Penn tree-bank style parse trees that achieves 
90.1% average precision/recall for sentences of 
length 40 and less, and 89.5% for sentences of 
length 100 and less when trained and tested on
the previously established \[5,9,10,15,17\] "stan- 
dard" sections of the Wall Street Journal tree- 
bank. This represents a 13% decrease in er- 
ror rate over the best single-parser results on 
this corpus \[9\]. The major technical innovation
is the use of a "maximum-entropy-inspired"
model for conditioning and smoothing that lets
us successfully test and combine many different
conditioning events. We also present some
partial results showing the effects of different 
conditioning information, including a surpris- 
ing 2% improvement due to guessing the lexical 
head's pre-terminal before guessing the lexical 
head. 
1 Introduction 
We present a new parser for parsing down 
to Penn tree-bank style parse trees \[16\] that 
achieves 90.1% average precision/recall for sentences
of length ≤ 40, and 89.5% for sentences
of length ≤ 100, when trained and tested on the
previously established \[5,9,10,15,17\] "standard" 
sections of the Wall Street Journal tree-bank. 
This represents a 13% decrease in error rate over 
the best single-parser results on this corpus \[9\]. 
Following \[5,10\], our parser is based upon a 
probabilistic generative model. That is, for all
sentences s and all parses π, the parser assigns a
probability p(s, π) = p(π), the equality holding
when we restrict consideration to π whose yield
is s. Then for any s the parser returns the parse
π that maximizes this probability. That is, the
parser implements the function

$$\arg\max_{\pi} p(\pi \mid s) = \arg\max_{\pi} p(\pi, s) = \arg\max_{\pi} p(\pi).$$

What fundamentally distinguishes probabilistic
generative parsers is how they compute p(π),
and it is to that topic we turn next.

* This research was supported in part by NSF grant
LIS SBR 9720368. The author would like to thank Mark
Johnson and all the rest of the Brown Laboratory for
Linguistic Information Processing.
2 The Generative Model 
The model assigns a probability to a parse by 
a top-down process of considering each constituent
c in π and for each c first guessing the
pre-terminal of c, t(c) (t for "tag"), then the
lexical head of c, h(c), and then the expansion
of c into further constituents e(c). Thus the
probability of a parse is given by the equation

$$p(\pi) = \prod_{c \in \pi} p(t(c) \mid l(c), H(c)) \cdot p(h(c) \mid t(c), l(c), H(c)) \cdot p(e(c) \mid l(c), t(c), h(c), H(c)) \qquad (1)$$

where l(c) is the label of c (e.g., whether it is a 
noun phrase (np), verb-phrase, etc.) and H(c) is
the relevant history of c -- information outside c 
that our probability model deems important in 
determining the probability in question. Much 
of the interesting work is determining what goes 
into H(c). Whenever it is clear to which con- 
stituent we are referring we omit the (c) in, e.g., h(c). 
In this notation the above equation takes 
the following form: 

$$p(\pi) = \prod_{c \in \pi} p(t \mid l, H) \cdot p(h \mid t, l, H) \cdot p(e \mid l, t, h, H)$$

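As an illustration only (it is not the implementation used in the experiments), the product in Equation 1 can be sketched as a sum of log probabilities over the constituents of a tree. The class `Constituent`, the function `parse_log_prob`, and the three callables passed to it are hypothetical names, and the history H is reduced here to the parent constituent, which is a simplifying assumption.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Constituent:
    label: str                      # l(c)
    tag: str                        # t(c): pre-terminal of the lexical head
    head: str                       # h(c): lexical head
    expansion: tuple                # e(c): labels of the children
    children: List["Constituent"] = field(default_factory=list)


def parse_log_prob(root, p_tag, p_head, p_exp):
    """Sum the three log factors of Equation 1 over every constituent."""
    total = 0.0
    stack = [(root, None)]          # (constituent, history H); here H = parent
    while stack:
        c, H = stack.pop()
        total += math.log(p_tag(c.tag, c.label, H))
        total += math.log(p_head(c.head, c.tag, c.label, H))
        total += math.log(p_exp(c.expansion, c.label, c.tag, c.head, H))
        stack.extend((child, c) for child in c.children)
    return total


# Toy usage with constant (made-up) distributions.
np_ = Constituent("np", "nn", "dog", ("det", "nn"))
s = Constituent("s", "vbd", "barked", ("np", "vp"), [np_])
const = lambda *args: 0.5
log_p = parse_log_prob(s, const, const, const)
```

Working in log space avoids underflow when many small factors are multiplied; the parse that maximizes the sum also maximizes the product.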
Next we describe how we assign a probability 
to the expansion e of a constituent. In Sec- 
tion 5 we present some results in which the 
possible expansions of a constituent are fixed 
in advance by extracting a tree-bank grammar
\[3\] from the training corpus. The method that 
gives the best results, however, uses a Markov 
grammar -- a method for assigning probabil- 
ities to any possible expansion using statistics 
gathered from the training corpus \[6,10,15\]. The 
method we use follows that of \[10\]. In this 
scheme a traditional probabilistic context-free 
grammar (PCFG) rule can be thought of as consisting
of a left-hand side with a label l(c) drawn
from the non-terminal symbols of our grammar, 
and a right-hand side that is a sequence of one or 
more such symbols. (We assume that all terminal
symbols are generated by rules of the form
"preterm → word" and we treat these as a special
case.) For us the non-terminal symbols are
those of the tree-bank, augmented by the sym- 
bols aux and auxg, which have been assigned de- 
terministically to certain auxiliary verbs such as 
"have" or "having". For each expansion we dis- 
tinguish one of the right-hand side labels as the 
"middle" or "head" symbol M(c). M(c) is the 
constituent from which the head lexical item h 
is obtained according to deterministic rules that 
pick the head of a constituent from among the 
heads of its children. To the left of M is a sequence
of one or more left labels Li(c) including
the special termination symbol A, which indicates
that there are no more symbols to the left,
and similarly for the labels to the right, Ri(c).
Thus an expansion e(c) looks like:

$$l \rightarrow A\, L_m \ldots L_1\, M\, R_1 \ldots R_n\, A. \qquad (2)$$

The expansion is generated by guessing first M,
then in order L1 through Lm+1 (= A), and similarly
for R1 through Rn+1.
In a pure Markov PCFG we are given the 
left-hand side label l and then probabilistically
generate the right-hand side conditioning on no
information other than l and (possibly) previously
generated pieces of the right-hand side
itself. In the simplest of such models, a zero-order
Markov grammar, each label on the right-hand
side is generated conditioned only on l --
that is, according to the distributions p(Li | l),
p(M | l), and p(Ri | l).
More generally, one can condition on the m 
previously generated labels, thereby obtaining 
an mth-order Markov grammar. So, for ex- 
ample, in a second-order Markov PCFG, L2 
would be conditioned on L1 and M. In our 
complete model, of course, the probability of 
each label in the expansions is also conditioned 
on other material as specified in Equation 1, 
e.g., p(e | l, t, h, H). Thus we would use p(L2 |
L1, M, l, t, h, H). Note that the As on both ends
of the expansion in Expression 2 are conditioned 
just like any other label in the expansion. 
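The generation order just described can be made concrete with a small sketch. It is a hedged illustration rather than the parser's code: `generation_events`, `expansion_prob`, and the callable `cond_prob` are hypothetical names, and only the Markov context (previous labels plus l) is modeled, leaving out t, h, and H.

```python
STOP = "A"  # termination symbol of Expression 2


def generation_events(lhs, left, head, right, order=2):
    """List (label, context) pairs in generation order: M, L1..Lm+1, R1..Rn+1.

    `left` and `right` are listed outward from the head, so left[0] is L1.
    The context holds up to `order` previously generated labels (nearest
    first) followed by the left-hand side label l.
    """
    events = [(head, (lhs,))]
    for side in (left, right):
        prev = [head]               # nearest previously generated label first
        for label in list(side) + [STOP]:
            events.append((label, tuple(prev[:order]) + (lhs,)))
            prev.insert(0, label)
    return events


def expansion_prob(events, cond_prob):
    """Multiply cond_prob(label, context) over all generation events."""
    p = 1.0
    for label, context in events:
        p *= cond_prob(label, context)
    return p


# np -> det adj nn with head nn: L1 = adj, L2 = det, no right labels.
events = generation_events("np", ["adj", "det"], "nn", [], order=2)
```

With `order=2` the label L2 (`det`) receives the context `("adj", "nn", "np")`, that is, L1, M, and l; with `order=0` every context shrinks to `("np",)`, which is the zero-order case.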
3 Maximum-Entropy-Inspired Parsing 
The major problem confronting the author of 
a generative parser is what information to use 
to condition the probabilities required in the 
model, and how to smooth the empirically ob- 
tained probabilities to take the sting out of the 
sparse data problems that are inevitable with 
even the most modest conditioning. For example,
in a second-order Markov grammar we conditioned
the L2 label according to the distribution
p(L2 | L1, M, l, t, h, H). Also, remember
that H is a placeholder for any other information
beyond the constituent c that may be useful
in assigning c a probability.
In the past few years the maximum entropy, 
or log-linear, approach has recommended itself 
to probabilistic model builders for its flexibility 
and its novel approach to smoothing \[1,17\]. A 
complete review of log-linear models is beyond 
the scope of this paper. Rather, we concentrate 
on the aspects of these models that most di- 
rectly influenced the model presented here. 
To compute a probability in a log-linear 
model one first defines a set of "features", 
functions from the space of configurations over 
which one is trying to compute probabilities to 
integers that denote the number of times some 
pattern occurs in the input. In our work we as- 
sume that any feature can occur at most once, 
so features are boolean-valued: 0 if the pattern 
does not occur, 1 if it does. 
In the parser we further assume that fea- 
tures are chosen from certain feature schemata 
and that every feature is a boolean conjunc- 
tion of sub-features. For example, in computing 
the probability of the head's pre-terminal t we 
might want a feature schema f(t, l) that returns 
1 if the observed pre-terminal of c = t and the 
label of c = l, and zero otherwise. This feature 
is obviously composed of two sub-features, one 
recognizing t, the other l. If both return 1, then
the feature returns 1. 
Now consider computing a conditional probability
p(a | H) with a set of features f1 ... fj
that connect a to the history H. In a log-linear
model the probability function takes the following
form:

$$p(a \mid H) = \frac{1}{Z(H)}\, e^{\lambda_1(a,H) f_1(a,H) + \cdots + \lambda_j(a,H) f_j(a,H)} \qquad (3)$$

Here the λi are weights between negative and
positive infinity that indicate the relative importance
of a feature: the more relevant the feature
to the value of the probability, the higher the absolute
value of the associated λ. The function
Z(H), called the partition function, is a normal- 
izing constant (for fixed H), so the probabilities 
over all a sum to one. 
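A minimal sketch of Equation 3 may help fix ideas. It assumes one constant weight per feature (the equation allows the weight to depend on a and H), and the names `loglinear_prob`, `f_nn_np`, and `TAGS`, as well as the weight 1.2, are invented for the example.

```python
import math


def loglinear_prob(a, H, outcomes, features, weights):
    """p(a | H) = exp(sum_i w_i * f_i(a, H)) / Z(H), following Equation 3."""
    def score(x):
        return math.exp(sum(w * f(x, H) for f, w in zip(features, weights)))
    Z = sum(score(x) for x in outcomes)     # partition function Z(H)
    return score(a) / Z


# A boolean feature built as a conjunction of two sub-features.
def f_nn_np(t, H):
    return int(t == "nn" and H["l"] == "np")


TAGS = ["nn", "vb", "jj"]
H = {"l": "np"}
probs = {t: loglinear_prob(t, H, TAGS, [f_nn_np], [1.2]) for t in TAGS}
```

Note that computing Z(H) requires scoring every possible outcome for the given history, which is why the partition function dominates the cost of such models.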
Now for our purposes it is useful to rewrite 
this as a sequence of multiplicative functions 
gi(a, H) for 0 ≤ i ≤ j:

$$p(a \mid H) = g_0(a,H)\, g_1(a,H) \cdots g_j(a,H). \qquad (4)$$

Here g0(a, H) = 1/Z(H) and gi(a, H) =
e^{λi(a,H) fi(a,H)}. The intuitive idea is that each
factor gi is larger than one if the feature in ques- 
tion makes the probability more likely, one if the 
feature has no effect, and smaller than one if it 
makes the probability less likely. 
Maximum-entropy models have two benefits 
for a parser builder. First, as already implicit in 
our discussion, factoring the probability computation
into a sequence of values corresponding
to various "features" suggests that the probability
model should be easily changeable -- just
change the set of features used. This point is 
emphasized by Ratnaparkhi in discussing his 
parser \[17\]. Second, and this is a point we have 
not yet mentioned, the features used in these 
models need have no particular independence of 
one another. This is useful if one is using a log- 
linear model for smoothing. That is, suppose 
we want to compute a conditional probability 
p(a | b, c), but we are not sure that we have
enough examples of the conditioning event b, c
in the training corpus to ensure that the empirically
obtained probability p̂(a | b, c) is accurate.
The traditional way to handle this is also to
compute p̂(a | b), and perhaps p̂(a | c) as well,
and take some combination of these values as
one's best estimate for p(a | b, c). This method
is known as "deleted interpolation" smoothing.
In max-entropy models one can simply include
features for all three events f1(a, b, c), f2(a, b),
and f3(a, c) and combine them in the model ac- 
cording to Equation 3, or equivalently, Equation 
4. The fact that the features are very far from 
independent is not a concern. 
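For comparison, the traditional combination of estimates can be sketched as below. This is a simplified illustration: the mixture weights are fixed by hand, whereas deleted interpolation normally sets them from held-out data, and the names `train`, `rel`, and `interpolated` are hypothetical.

```python
from collections import Counter


def train(events):
    """Count (a, b, c) triples and every marginal needed below."""
    n = Counter()
    for a, b, c in events:
        for key in ((a, b, c), (None, b, c), (a, b, None),
                    (None, b, None), (a, None, c), (None, None, c)):
            n[key] += 1
    return n


def rel(num, den):
    """Relative frequency, defined as 0 when the context was never seen."""
    return num / den if den else 0.0


def interpolated(n, a, b, c, lambdas=(0.6, 0.3, 0.1)):
    """Fixed-weight mixture of the estimates of p(a|b,c), p(a|b) and p(a|c)."""
    l1, l2, l3 = lambdas
    return (l1 * rel(n[a, b, c], n[None, b, c])
            + l2 * rel(n[a, b, None], n[None, b, None])
            + l3 * rel(n[a, None, c], n[None, None, c]))


counts = train([("x", "b1", "c1"), ("y", "b1", "c1"),
                ("x", "b1", "c2"), ("x", "b2", "c1")])
p_x = interpolated(counts, "x", "b1", "c1")
p_y = interpolated(counts, "y", "b1", "c1")
```

Because the weights sum to one and each component is a proper distribution for a seen context, the mixture is itself normalized; no partition function is needed.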
Now let us note that we can get an equation 
of exactly the same form as Equation 4 in the 
following fashion: 

$$p(a \mid b, c, d) = p(a \mid b)\, \frac{p(a \mid b, c)}{p(a \mid b)}\, \frac{p(a \mid b, c, d)}{p(a \mid b, c)} \qquad (5)$$

Note that the first term of the equation gives a 
probability based upon little conditioning infor- 
mation and that each subsequent term is a num- 
ber from zero to positive infinity that is greater 
or smaller than one if the new information be- 
ing considered makes the probability greater or 
smaller than the previous estimate. 
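A small worked instance, with invented values, shows how the terms telescope. Suppose p(a | b) = 0.2, p(a | b, c) = 0.3, and p(a | b, c, d) = 0.45. Then

```latex
p(a \mid b, c, d)
  = \underbrace{0.2}_{p(a \mid b)}
    \times \underbrace{\frac{0.3}{0.2}}_{=\,1.5}
    \times \underbrace{\frac{0.45}{0.3}}_{=\,1.5}
  = 0.45 .
```

Each ratio exceeds one here because, in this made-up case, every added piece of conditioning information raises the estimate.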
As it stands, this last equation is pretty much 
content-free. But let us look at how it works for 
a particular case in our parsing scheme. Con- 
sider the probability distribution for choosing 
the pre-terminal for the head of a constituent. 
In Equation 1 we wrote this as p(t | l, H). As
we discuss in more detail in Section 5, several
different features in the context surrounding c
are useful to include in H: the label, head
pre-terminal and head of the parent of c (denoted
as lp, tp, hp), the label of c's left sibling
(lb for "before"), and the label of the grandparent
of c (lg). That is, we wish to compute
p(t | l, lp, tp, lb, lg, hp). We can now rewrite this
in the form of Equation 5 as follows:

$$p(t \mid l, l_p, t_p, l_b, l_g, h_p) = p(t \mid l)\, \frac{p(t \mid l, l_p)}{p(t \mid l)}\, \frac{p(t \mid l, l_p, t_p)}{p(t \mid l, l_p)}\, \frac{p(t \mid l, l_p, t_p, l_b)}{p(t \mid l, l_p, t_p)}\, \frac{p(t \mid l, l_p, t_p, l_b, l_g)}{p(t \mid l, l_p, t_p, l_b)}\, \frac{p(t \mid l, l_p, t_p, l_b, l_g, h_p)}{p(t \mid l, l_p, t_p, l_b, l_g)} \qquad (6)$$

Here we have sequentially conditioned on 
steadily increasing portions of c's history. In 
many cases this is clearly warranted. For ex- 
ample, it does not seem to make much sense 
to condition on, say, hp without first conditioning
on tp. In other cases, however, we seem
to be conditioning on apples and oranges, so 
to speak. For example, one can well imagine 
that one might want to condition on the par- 
ent's lexical head without conditioning on the 
left sibling, or the grandparent label. One way 
to do this is to modify the simple version shown 
in Equation 6 to allow this: 
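
$$p(t \mid l, l_p, t_p, l_b, l_g, h_p) = p(t \mid l)\, \frac{p(t \mid l, l_p)}{p(t \mid l)}\, \frac{p(t \mid l, l_p, t_p)}{p(t \mid l, l_p)}\, \frac{p(t \mid l, l_p, t_p, l_b)}{p(t \mid l, l_p, t_p)}\, \frac{p(t \mid l, l_p, t_p, l_g)}{p(t \mid l, l_p, t_p)}\, \frac{p(t \mid l, l_p, t_p, h_p)}{p(t \mid l, l_p, t_p)} \qquad (7)$$
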
Note the changes to the last three terms in 
Equation 7. Rather than conditioning each 
term on the previous ones, they are now condi- 
tioned only on those aspects of the history that 
seem most relevant. The hope is that by doing 
this we will have less difficulty with the splitting 
of conditioning events, and thus somewhat less 
difficulty with sparse data. 
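The following sketch evaluates a product of the shape of Equation 7 from unsmoothed relative frequencies. It rests on several assumptions: the last three ratios are taken to condition on l, lp, tp plus one of lb, lg, hp; a ratio whose conditioning event is unseen is simply skipped (the parser smooths instead); and `TagCounts`, `RATIOS`, `eq7`, and the four training events are invented for illustration.

```python
class TagCounts:
    """Relative-frequency estimates of p(t | context) from training events."""

    def __init__(self, events):
        self.events = list(events)  # dicts with keys t, l, lp, tp, lb, lg, hp

    def p(self, t, ctx, keys):
        match = [e for e in self.events if all(e[k] == ctx[k] for k in keys)]
        if not match:
            return None             # conditioning event never observed
        return sum(e["t"] == t for e in match) / len(match)


# (numerator keys, denominator keys) for the five ratio terms.
RATIOS = [
    (("l", "lp"), ("l",)),
    (("l", "lp", "tp"), ("l", "lp")),
    (("l", "lp", "tp", "lb"), ("l", "lp", "tp")),
    (("l", "lp", "tp", "lg"), ("l", "lp", "tp")),
    (("l", "lp", "tp", "hp"), ("l", "lp", "tp")),
]


def eq7(counts, t, ctx):
    value = counts.p(t, ctx, ("l",)) or 0.0
    for num_keys, den_keys in RATIOS:
        num, den = counts.p(t, ctx, num_keys), counts.p(t, ctx, den_keys)
        if num is None or not den:
            continue                # unseen or zero: leave the estimate unchanged
        value *= num / den
    return value


def ev(t, lp, tp, lb, lg, hp):
    return dict(t=t, l="np", lp=lp, tp=tp, lb=lb, lg=lg, hp=hp)


counts = TagCounts([
    ev("nn", "s", "vbd", "none", "top", "said"),
    ev("nns", "s", "vbd", "none", "top", "said"),
    ev("nn", "vp", "vbd", "vbd", "s", "bought"),
    ev("nn", "s", "vbd", "none", "sbar", "rose"),
])
ctx = dict(l="np", lp="s", tp="vbd", lb="none", lg="top", hp="said")
p_nn, p_nns = eq7(counts, "nn", ctx), eq7(counts, "nns", ctx)
```

On these toy counts the two products are 0.375 and 0.75, which do not sum to one; this is the sense in which the partition function is no longer guaranteed to equal one once the terms stop telescoping.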
We make one more point on the connec- 
tion of Equation 7 to a maximum entropy for- 
mulation. Suppose we were, in fact, going 
to compute a true maximum entropy model 
based upon the features used in Equation 7, 
f1(t, l), f2(t, l, lp), f3(t, l, lp, tp), .... This requires
finding the appropriate λi for Equation 3,
which is accomplished using an algorithm such
as iterative scaling \[11\] in which values for the λi
are initially "guessed" and then modified until
they converge on stable values. With no prior
knowledge of values for the λi one traditionally
starts with λi = 0, this being a neutral assumption
that the feature has neither a positive nor
negative impact on the probability in question. 
With some prior knowledge, non-zero values can 
greatly speed up this process because fewer it- 
erations are required for convergence. We com- 
ment on this because in our example we can sub- 
stantially speed up the process by choosing val- 
ues picked so that, when the maximum-entropy 
equation is expressed in the form of Equation 
4, the gi have as their initial values the values 
of the corresponding terms in Equation 7. (Our 
experience is that rather than requiring 50 or so 
iterations, three suffice.) Now we observe that 
if we were to use a maximum-entropy approach 
but run iterative scaling zero times, we would, 
in fact, just have Equation 7. 
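To spell out the initialization, recall that each factor in Equation 4 has the form e^{λi fi}. When the feature fires (fi = 1), requiring the factor to start at the corresponding ratio term fixes the initial weight as the logarithm of that ratio. The numerical value below is invented for illustration.

```latex
g_i^{(0)} = e^{\lambda_i^{(0)}}
\;\Longrightarrow\;
\lambda_i^{(0)} = \ln g_i^{(0)},
\qquad \text{e.g.} \qquad
\lambda_2^{(0)} = \ln \frac{\hat{p}(t \mid l, l_p)}{\hat{p}(t \mid l)}
               = \ln 1.5 \approx 0.405 .
```

The traditional starting point λi = 0 corresponds to a factor of one, that is, to a feature that initially leaves the probability unchanged.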
The major advantage of using Equation 7 is 
that one can generally get away without com- 
puting the partition function Z(H). In the sim- 
ple (content-free) form (Equation 6), it is clear 
that Z(H) = 1. In the more interesting version, 
Equation 7, this is not true in general, but one 
would not expect it to differ much from one, 
and we assume that as long as we are not pub- 
lishing the raw probabilities (as we would be 
doing, for example, in publishing perplexity re- 
sults) the difference from one should be unim- 
portant. As partition-function calculation is 
typically the major on-line computational prob- 
lem for maximum-entropy models, this simpli- 
fies the model significantly. 
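A toy calculation, with invented numbers that exaggerate the effect, illustrates why ignoring the partition function is often harmless. Suppose only two pre-terminals are possible for some history H and the unnormalized products come to 0.375 and 0.75:

```latex
Z(H) = 0.375 + 0.75 = 1.125,
\qquad
\frac{0.375}{1.125} = \frac{1}{3},
\qquad
\frac{0.75}{1.125} = \frac{2}{3}.
```

Dividing by Z(H) rescales both values but does not change which alternative is preferred for that history. The approximation only becomes a concern when scores computed under different histories are compared and the corresponding values of Z(H) differ appreciably.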
Naturally, the distributions required by 
Equation 7 cannot be used without smooth- 
ing. In a pure maximum-entropy model this is 
done by feature selection, as in Ratnaparkhi's 
maximum-entropy parser \[17\]. While we could 
have smoothed in the same fashion, we choose 
instead to use standard deleted interpolation. 
(Actually, we use a minor variant described in \[4\].) 
4 The Experiment 
We created a parser based upon the maximum- 
entropy-inspired model of the last section, 
smoothed using standard deleted interpolation. 
As the generative model is top-down and we 
use a standard bottom-up best-first probabilis- 
tic chart parser \[2,7\], we use the chart parser as 
a first pass to generate candidate possible parses 
to be evaluated in the second pass by our prob- 
abilistic model. For runs with the generative 
model based upon Markov grammar statistics, 
the first pass uses the same statistics, but con- 
ditioned only on standard PCFG information. 
This allows the second pass to see expansions 
not present in the training corpus. 
We use the gathered statistics for all observed 
words, even those with very low counts, though 
obviously our deleted interpolation smoothing 
gives less emphasis to observed probabilities for 
rare words. We guess the preterminals of words 
that are not observed in the training data using 
statistics on capitalization, hyphenation, word 
endings (the last two letters), and the probabil- 
ity that a given pre-terminal is realized using a 
previously unobserved word. 
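The surface cues listed above are easy to extract. The sketch below shows only the feature extraction; the function name `unknown_word_features` and the dictionary layout are assumptions, and how the parser combines the resulting statistics is not specified here.

```python
def unknown_word_features(word):
    """Surface cues for guessing the pre-terminal of an unseen word."""
    return {
        "capitalized": word[:1].isupper(),
        "hyphenated": "-" in word,
        "ending": word[-2:].lower(),    # the last two letters
    }


features = unknown_word_features("conflating")
```

For a word such as "conflating" the ending "ng" is the informative cue, since it is strongly associated with the vbg pre-terminal.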
Parser    LR    LP    CB    0CB   2CB
≤ 40 words (2245 sentences)
Char97    87.5  87.4  1.00  62.1  86.1
Coll99    88.5  88.7  0.92  66.7  87.1
Char00    90.1  90.1  0.74  70.1  89.6
≤ 100 words (2416 sentences)
Char97    86.7  86.6  1.20  59.9  83.2
Coll99    88.1  88.3  1.06  64.0  85.1
Ratna99   86.3  87.5
Char00    89.6  89.5  0.88  67.6  87.7

Figure 1: Parsing results compared with previous work

As noted above, the probability model uses
five smoothed probability distributions, one
each for Li, M, Ri, t, and h. The equation for
the (unsmoothed) conditional probability distri- 
bution for t is given in Equation 7. The other 
four equations can be found in a longer version 
of this paper available on the author's website 
(www.cs.brown.edu/~ec). L and R are conditioned
on three previous labels so we are using
a third-order Markov grammar. Also, the label
of the parent constituent lp is conditioned upon
even when it is not obviously related to the fur- 
ther conditioning events. This is due to the im- 
portance of this factor in parsing, as noted in, 
e.g., \[14\]. 
In keeping with the standard methodology \[5, 
9,10,15,17\], we used the Penn Wall Street Jour- 
nal tree-bank \[16\] with sections 2-21 for train- 
ing, section 23 for testing, and section 24 for 
development (debugging and tuning). 
Performance on the test corpus is measured 
using the standard measures from \[5,9,10,17\]. 
In particular, we measure labeled precision 
(LP) and recall (LR), average number of cross-brackets
per sentence (CB), percentage of sentences
with zero cross brackets (0CB), and percentage
of sentences with ≤ 2 cross brackets
(2CB). Again as standard, we take separate
measurements for all sentences of length ≤ 40
and all sentences of length ≤ 100. Note that
the definitions of labeled precision and recall are 
those given in \[9\] and used in all of the previous 
work. As noted in \[5\], these definitions typically 
give results about 0.4% higher than the more 
obvious ones. The results for the new parser 
as well as for the previous top-three individual 
parsers on this corpus are given in Figure 1. 
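For readers unfamiliar with the bracket-based measures, the sketch below computes labeled precision and recall in their simplest form. It is not the scoring procedure of \[9\], which includes further conventions not reproduced here; the function `labeled_pr` and the example brackets are hypothetical.

```python
from collections import Counter


def labeled_pr(gold, test):
    """gold, test: lists of (label, start, end) brackets.

    Precision = matched / brackets proposed; recall = matched / gold brackets.
    """
    g, t = Counter(gold), Counter(test)
    matched = sum((g & t).values())     # multiset intersection
    precision = matched / sum(t.values()) if t else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    return precision, recall


gold = [("s", 0, 5), ("np", 0, 2), ("vp", 2, 5), ("np", 3, 5)]
test = [("s", 0, 5), ("np", 0, 2), ("vp", 2, 5), ("pp", 3, 5), ("np", 4, 5)]
precision, recall = labeled_pr(gold, test)
```

Here three of the five proposed brackets match the gold standard, so precision is 0.6 and recall is 0.75.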
As is typical, all of the standard measures tell 
pretty much the same story, with the new parser 
outperforming the other three parsers. Looking 
in particular at the precision and recall figures, 
the new parser gives us a 13% error reduction
over the best of the previous work, Coll99 \[9\].
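One way to arrive at the 13% figure from the numbers in Figure 1, taking the average of labeled precision and recall for sentences of length ≤ 40, is the following arithmetic.

```latex
\text{previous best: } \tfrac{1}{2}(88.5 + 88.7) = 88.6 \;\Rightarrow\; \text{error } 11.4,
\qquad
\text{new parser: } \tfrac{1}{2}(90.1 + 90.1) = 90.1 \;\Rightarrow\; \text{error } 9.9,
\qquad
\frac{11.4 - 9.9}{11.4} \approx 0.13 .
```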
5 Discussion 
In the previous sections we have concentrated 
on the relation of the parser to a maximum- 
entropy approach, the aspect of the parser that 
is most novel. However, we do not think this 
aspect is the sole or even the most important 
reason for its comparative success. Here we list 
what we believe to be the most significant con- 
tributions and give some experimental results 
on how well the program behaves without them. 
We take as our starting point the parser 
labeled Char97 in Figure 1 \[5\], as that is the
program from which our current parser derives. 
That parser, as stated in Figure 1, achieves an 
average precision/recall of 87.5. As noted in \[5\], 
that system is based upon a "tree-bank gram- 
mar" -- a grammar read directly off the train- 
ing corpus. This is as opposed to the "Markov- 
grammar" approach used in the current parser. 
Also, the earlier parser uses two techniques not 
employed in the current parser. First, it uses 
a clustering scheme on words to give the sys- 
tem a "soft" clustering of heads and sub-heads. 
(It is "soft" clustering in that a word can be- 
long to more than one cluster with different 
weights -- the weights express the probability 
of producing the word given that one is going 
to produce a word from that cluster.) Second, 
Char97 uses unsupervised learning in that the 
original system was run on about thirty million 
words of unparsed text, the output was taken 
as "correct", and statistics were collected on 
the resulting parses. Without these enhance- 
ments Char97 performs at the 86.6% level for 
sentences of length ≤ 40.
In this section we evaluate the effects of the 
various changes we have made by running var- 
ious versions of our current program. To avoid 
repeated evaluations based upon the testing corpus,
here our evaluation is based upon sentences
of length ≤ 40 from the development corpus.
We note here that this corpus is somewhat
more difficult than the "official" test corpus. 
For example, the final version of our system
achieves an average precision/recall of 90.1% on
the test corpus but an average precision/recall
of only 89.7% on the development corpus. This
is indicated in Figure 2, where the model labeled
"Best" has precision of 89.8% and recall of
89.6% for an average of 89.7%, 0.4% lower than
the results on the official test corpus. This is in
accord with our experience that development-corpus
results are from 0.3% to 0.5% lower than
those obtained on the test corpus.

System                   Precision  Recall
Old                      86.3       86.1
Explicit Pre-Term        88.0       88.1
Marked Coordination      88.6       88.7
Standard Interpolation   88.2       88.3
MaxEnt-Inspired          89.0       89.2
First-order Markov       88.6       87.4
Second-order Markov      89.5       89.3
Best                     89.8       89.6

Figure 2: Labeled precision/recall for length ≤ 40, development corpus

The model labeled "Old" attempts to recreate 
the Char97 system using the current program. 
It makes no use of special maximum-entropy- 
inspired features (though their presence made 
it much easier to perform these experiments), it 
does not guess the pre-terminal before guess- 
ing the lexical head, and it uses a tree-bank 
grammar rather than a Markov grammar. This 
parser achieves an average precision/recall of 
86.2%. This is consistent with the average pre- 
cision/recall of 86.6% for \[5\] mentioned above, 
as the latter was on the test corpus and the for- 
mer on the development corpus. 
Between the Old model and the Best model, 
Figure 2 gives precision/recall measurements for 
several different versions of our parser. One of 
the first and without doubt the most signifi- 
cant change we made in the current parser is to 
move from two stages of probabilistic decisions 
at each node to three. As already noted, Char97 
first guesses the lexical head of a constituent 
and then, given the head, guesses the PCFG 
rule used to expand the constituent in question. 
In contrast, the current parser first guesses the 
head's pre-terminal, then the head, and then the
expansion. It turns out that the usefulness of this
process had already been discovered by Collins
\[10\], who in turn notes (personal communica- 
tion) that it was previously used by Eisner \[12\]. 
However, Collins in \[10\] does not stress the de- 
cision to guess the head's pre-terminal first, and 
it might be lost on the casual reader. Indeed, 
it was lost on the present author until he went 
back after the fact and found it there. In Figure 
2 we show that this one factor improves perfor- 
mance by nearly 2%. 
It may not be obvious why this should make 
so great a difference, since most words are ef- 
fectively unambiguous. (For example, part-of- 
speech tagging using the most probable pre- 
terminal for each word is 90% accurate \[8\].) We 
believe that two factors contribute to this per- 
formance gain. The first is simply that if we first 
guess the pre-terminal, when we go to guess the
head the first thing we can condition upon is
the pre-terminal, i.e., we compute p(h | t). This
quantity is a relatively intuitive one (as, for ex- 
ample, it is the quantity used in a PCFG to re- 
late words to their pre-terminals) and it seems 
particularly good to condition upon here since 
we use it, in effect, as the unsmoothed probabil- 
ity upon which all smoothing of p(h) is based. 
This one "fix" makes slightly over a percent difference
in the results.
The second major reason why first guessing 
the pre-terminal makes so much difference is 
that it can be used when backing off the lexical 
head in computing the probability of the rule 
expansion. For example, when we first guess 
the lexical head we can move from computing 
p(r | l, lp, h) to p(r | l, t, lp, h). So, e.g., even
if the word "conflating" does not appear in the
training corpus (and it does not), the "ng" ending
allows our program to guess with relative
security that the word has the vbg pre-terminal,
and thus the probability of various rule expansions
can be considerably sharpened. For example,
the tree-bank PCFG probability of the rule
"vp → vbg np" is 0.0145, whereas once we condition
on the fact that the lexical head is a vbg
we get a probability of 0.214. 
The second modification is the explicit mark- 
ing of noun and verb-phrase coordination. We 
have already noted the importance of conditioning
on the parent label lp. So, for example,
information about an np is conditioned on the 
parent -- e.g., an s, vp, pp, etc. Note that when 
an np is part of an np coordinate structure the
parent will itself be an np, and similarly for a
vp. But nps and vps can occur with np and
vp parents in non-coordinate structures as well.
For example, in the Penn Treebank a vp with
both main and auxiliary verbs has the structure
shown in Figure 3. Note that the subordinate
vp has a vp parent.

      vp
     /  \
  aux    vp
        /  \
     vbd    np

Figure 3: Verb phrase with both main and auxiliary verbs

Thus np and vp parents of constituents are 
marked to indicate if the parents are a coor- 
dinate structure. A vp coordinate structure 
is defined here as a constituent with two or 
more vp children, one or more of the con- 
stituents comma, cc, conjp (conjunctive phrase), 
and nothing else; coordinate np phrases are de- 
fined similarly. Something very much like this is 
done in \[15\]. As shown in Figure 2, condition- 
ing on this information gives a 0.6% improve- 
ment. We believe that this is mostly due to 
improvements in guessing the sub-constituent's 
pre-terminal and head. Given we are already
at the 88% level of accuracy, we judge a 0.6%
improvement to be very much worthwhile.
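The definition of a coordinate structure given above translates directly into a predicate over a constituent's label and its children's labels. In this sketch the comma pre-terminal is written ",", and the np case is assumed to mirror the vp case exactly; both choices, along with the names `CONNECTORS` and `is_coordinate`, are assumptions made for illustration.

```python
CONNECTORS = {",", "cc", "conjp"}


def is_coordinate(label, child_labels):
    """True if an np/vp has >= 2 same-label children, >= 1 connector, nothing else."""
    if label not in ("np", "vp"):
        return False
    same = sum(1 for c in child_labels if c == label)
    conn = sum(1 for c in child_labels if c in CONNECTORS)
    return same >= 2 and conn >= 1 and same + conn == len(child_labels)


coordinated = is_coordinate("vp", ["vp", "cc", "vp"])
auxiliary = is_coordinate("vp", ["aux", "vp"])
```

The second call corresponds to the auxiliary-verb structure of Figure 3: the child vp has a vp parent, but the parent is not marked as coordinate.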
Next we add the less obvious conditioning 
events noted in our previous discussion of the 
final model -- grandparent label lg and left
sibling label lb. When we do so using our 
maximum-entropy-inspired conditioning, we get 
another 0.45% improvement in average preci- 
sion/recall, as indicated in Figure 2 on the line 
labeled "MaxEnt-Inspired". Note that we also
tried including this information using a stan- 
dard deleted-interpolation model. The results 
here are shown in the line "Standard Interpola- 
tion". Including this information within a stan- 
dard deleted-interpolation model causes a 0.6% 
decrease from the results using the less conven- 
tional model. Indeed, the resulting performance 
is worse than not using this information at all. 
Up to this point all the models considered 
in this section are tree-bank grammar models. 
That is, the PCFG grammar rules are read di- 
rectly off the training corpus. As already noted, 
our best model uses a Markov-grammar ap- 
proach. As one can see in Figure 2, a first- 
order Markov grammar (with all the aforemen- 
tioned improvements) performs slightly worse 
than the equivalent tree-bank-grammar parser. 
However, a second-order grammar does slightly 
better and a third-order grammar does signifi- 
cantly better than the tree-bank parser. 
6 Conclusion 
We have presented a lexicalized Markov gram- 
mar parsing model that achieves (using the now 
standard training/testing/development sections 
of the Penn treebank) an average precision/recall
of 90.1% on sentences of length ≤ 40
and 89.5% on sentences of length ≤ 100.
This corresponds to an error reduction of 13% 
over the best previously published single parser 
results on this test set, those of Collins \[9\]. 
That the previous three best parsers on this 
test \[5,9,17\] all perform within a percentage 
point of each other, despite quite different ba- 
sic mechanisms, led some researchers to won- 
der if there might be some maximum level of 
parsing performance that could be obtained us- 
ing the treebank for training, and to conjec- 
ture that perhaps we were at it. The results 
reported here disprove this conjecture. The re- 
sults of \[13\] achieved by combining the afore- 
mentioned three-best parsers also suggest that 
the limit on tree-bank trained parsers is much 
higher than previously thought. Indeed, it may 
be that adding this new parser to the mix may 
yield still higher results. 
From our perspective, perhaps the two most 
important numbers to come out of this re- 
search are the overall error reduction of 13% 
over the results in \[9\] and the intermediate- 
result improvement of nearly 2% on labeled pre- 
cision/recall due to the simple idea of guess- 
ing the bead's pre-terminal before guessing the 
head. Neither of these results were anticipated 
at the start of this research. 
As noted above, the main methodological 
innovation presented here is our "maximum- 
entropy-inspired" model for conditioning and 
smoothing. Two aspects of this model deserve 
some comment. The first is the slight, but im- 
portant, improvement achieved by using this 
model over conventional deleted interpolation, 
as indicated in Figure 2. We expect that as 
we experiment with other, more semantic con- 
ditioning information, the importance of this as- 
pect of the model will increase. 
More important in our eyes, though, is 
the flexibility of the maximum-entropy-inspired 
model. Though in some respects not quite as 
flexible as true maximum entropy, it is much 
simpler and, in our estimation, has benefits 
when it comes to smoothing. Ultimately it is 
this flexibility that let us try the various condi- 
tioning events, to move on to a Markov gram- 
mar approach, and to try several Markov gram- 
mars of different orders, without significant pro- 
gramming. Indeed, we initiated this line of work 
in an attempt to create a parser that would be 
flexible enough to allow modifications for pars- 
ing down to more semantic levels of detail. It is 
to this project that our future parsing work will 
be devoted. 

References 

1. BERGER, A. L., PIETRA, S. A. D. AND 
PIETRA, V. J. D. A maximum entropy approach to natural language processing. Computational Linguistics 22 1 (1996), 39-71.

2. CARABALLO, S. AND CHARNIAK, E. New 
figures of merit for best-first probabilistic 
chart parsing. Computational Linguistics 24 
(1998), 275-298. 

3. CHARNIAK, E. Tree-bank grammars. In 
Proceedings of the Thirteenth National 
Conference on Artificial Intelligence. AAAI 
Press/MIT Press, Menlo Park, 1996, 1031- 
1036. 

4. CHARNIAK, E. Expected-frequency interpolation. Department of Computer Science, 
Brown University, Technical Report CS96-37, 
1996. 

5. CHARNIAK, E. Statistical parsing with a 
context-free grammar and word statistics. 
In Proceedings of the Fourteenth National 
Conference on Artificial Intelligence. AAAI 
Press/MIT Press, Menlo Park, CA, 1997, 
598-603. 

6. CHARNIAK, E. Statistical techniques for 
natural language parsing. AI Magazine 18 4
(1997), 33-43. 

7. CHARNIAK, E., GOLDWATER, S. AND JOHNSON, M. Edge-based best-first chart parsing. In Proceedings of the Sixth Workshop 
on Very Large Corpora. 1998, 127-133. 

8. CHARNIAK, E., HENDRICKSON, C., JACOBSON, N. AND PERKOWITZ, M. Equations 
for part-of-speech tagging. In Proceedings of 
the Eleventh National Conference on Artificial Intelligence. AAAI Press/MIT Press, 
Menlo Park, 1993, 784-789. 

9. COLLINS, M. Head-Driven Statistical Models for Natural Language Parsing. University 
of Pennsylvania, Ph.D. Dissertation, 1999.

10. COLLINS, M. J. Three generative lexicalised 
models for statistical parsing. In Proceedings 
of the 35th Annual Meeting of the ACL. 1997, 
16-23. 

11. DARROCH, J. N. AND RATCLIFF, D. Generalized iterative scaling for log-linear models. 
Annals of Mathematical Statistics 43 (1972),
1470-1480. 

12. EISNER, J. M. An empirical comparison of
probability models for dependency grammar. 
Institute for Research in Cognitive Science, 
University of Pennsylvania, Technical Report 
IRCS-96-11, 1996. 

13. HENDERSON, J. C. AND BRILL, E. Exploiting diversity in natural language processing: combining parsers. In 1999 Joint Sigdat
Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. ACL, New Brunswick NJ, 1999, 187-194.

14. JOHNSON, M. PCFG models of linguistic 
tree representations. Computational Linguistics 24 4 (1998), 613-632. 

15. MAGERMAN, D.M. Statistical decision-tree 
models for parsing. In Proceedings of the 33rd 
Annual Meeting of the Association for Computational Linguistics. 1995, 276-283. 

16. MARCUS, M. P., SANTORINI, B. AND 
MARCINKIEWICZ, M. A. Building a large 
annotated corpus of English: the Penn treebank. Computational Linguistics 19 (1993), 
313-330. 

17. RATNAPARKHI, A. Learning to parse natural language with maximum entropy models.
Machine Learning 34 1/2/3 (1999), 151-176.
