A Linear Observed Time Statistical Parser Based on Maximum
Entropy Models
Adwait Ratnaparkhi* 
Dept. of Computer and Information Science 
University of Pennsylvania 
200 South 33rd Street 
Philadelphia, PA 19104-6389 
adwait@unagi.cis.upenn.edu
Abstract 
This paper presents a statistical parser for 
natural language that obtains a parsing 
accuracy--roughly 87% precision and 86% 
recall--which surpasses the best previously 
published results on the Wall St. Journal 
domain. The parser itself requires very lit- 
tle human intervention, since the informa- 
tion it uses to make parsing decisions is 
specified in a concise and simple manner, 
and is combined in a fully automatic way 
under the maximum entropy framework. 
The observed running time of the parser on 
a test sentence is linear with respect to the 
sentence length. Furthermore, the parser 
returns several scored parses for a sentence, 
and this paper shows that a scheme to pick 
the best parse from the 20 highest scoring 
parses could yield a dramatically higher ac- 
curacy of 93% precision and recall. 
1 Introduction 
This paper presents a statistical parser for natural 
language that finds one or more scored syntactic 
parse trees for a given input sentence. The parsing 
accuracy--roughly 87% precision and 86% recall-- 
surpasses the best previously published results on 
the Wall St. Journal domain. The parser consists of 
the following three conceptually distinct parts: 
1. A set of procedures that use certain actions to
incrementally construct parse trees. 
2. A set of maximum entropy models that com- 
pute probabilities of the above actions, and ef- 
fectively "score" parse trees. 
3. A search heuristic which attempts to find the
highest scoring parse tree for a given input sen-
tence.
* The author acknowledges the support of ARPA
grant N66001-94C-6043.
The maximum entropy models used here are simi- 
lar in form to those in (Ratnaparkhi, 1996; Berger, 
Della Pietra, and Della Pietra, 1996; Lau, Rosen- 
feld, and Roukos, 1993). The models compute the 
probabilities of actions based on certain syntactic 
characteristics, or features, of the current context. 
The features used here are defined in a concise and 
simple manner, and their relative importance is de- 
termined automatically by applying a training pro- 
cedure on a corpus of syntactically annotated sen- 
tences, such as the Penn Treebank (Marcus, San- 
torini, and Marcinkiewicz, 1994). Although creat- 
ing the annotated corpus requires much linguistic 
expertise, creating the feature set for the parser it- 
self requires very little linguistic effort. 
Also, the search heuristic is very simple, and its 
observed running time on a test sentence is linear 
with respect to the sentence length. Furthermore, 
the search heuristic returns several scored parses for 
a sentence, and this paper shows that a scheme to
pick the best parse from the 20 highest scoring parses 
could yield a dramatically higher accuracy of 93% 
precision and recall. 
Sections 2, 3, and 4 describe the tree-building 
procedures, the maximum entropy models, and the 
search heuristic, respectively. Section 5 describes 
experiments with the Penn Treebank and section 
6 compares this paper with previously published 
works. 
2 Procedures for Building Trees 
The parser uses four procedures, TAG, CHUNK, 
BUILD, and CHECK, that incrementally build parse 
trees with their actions. The procedures are applied in three left-to-right passes over the input sentence; the first pass applies TAG, the second pass applies CHUNK, and the third pass applies BUILD and CHECK. The passes, the procedures they apply, and the actions of the procedures are summarized in table 1 and described below.

Pass | Procedure | Actions | Description
First Pass | TAG | A POS tag in tag set | Assign POS tag to word
Second Pass | CHUNK | Start X, Join X, Other | Assign chunk tag to POS tag and word
Third Pass | BUILD | Start X, Join X, where X is a constituent label in label set | Assign current tree to start a new constituent, or to join the previous one
Third Pass | CHECK | Yes, No | Decide if current constituent is complete
Table 1: Tree-Building Procedures of Parser

The actions of the procedures are designed so that 
any possible complete parse tree T for the input sen- 
tence corresponds to exactly one sequence of actions; 
call this sequence the derivation of T. Each procedure,
when given a derivation $d = \{a_1 \ldots a_n\}$, predicts
some action $a_{n+1}$ to create a new derivation
$d' = \{a_1 \ldots a_{n+1}\}$. Typically, the procedures postulate
many different values for $a_{n+1}$, which cause the
parser to explore many different derivations when 
parsing an input sentence. But for demonstration 
purposes, figures 1-7 trace one possible derivation 
for the sentence "I saw the man with the telescope", 
using the part-of-speech (POS) tag set and constituent
label set of the Penn Treebank.
2.1 First Pass 
The first pass takes an input sentence, shown in fig- 
ure 1, and uses TAG to assign each word a POS tag. 
The result of applying TAG to each word is shown in 
figure 2. 
2.2 Second Pass 
The second pass takes the output of the first pass 
and uses CHUNK to determine the "flat" phrase 
chunks of the sentence, where a phrase is "flat" if 
and only if it is a constituent whose children consist 
solely of POS tags. Starting from the left, CHUNK 
assigns each (word,POS tag) pair a "chunk" tag, ei- 
ther Start X, Join X, or Other. Figure 3 shows the 
result after the second pass. The chunk tags are then 
used for chunk detection, in which any consecutive 
sequence of words $w_m \ldots w_n$ ($m \le n$) is grouped
into a "flat" chunk X if $w_m$ has been assigned Start
X and $w_{m+1} \ldots w_n$ have all been assigned Join X.
The result of chunk detection, shown in figure 4, is 
a forest of trees and serves as the input to the third 
pass. 
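The chunk detection step described above can be illustrated with a minimal Python sketch. The function name `detect_chunks` and the tuple encoding of trees are assumptions made for illustration, not the parser's actual implementation. Treating a Join tag that does not continue an open chunk of the same label as a bare leaf is also an assumption.

```python
def detect_chunks(tagged):
    """Group (word, pos, chunk_tag) triples into a forest.

    A leaf is the pair (pos, word); a flat chunk is (label, [leaves]).
    """
    forest = []
    open_label = None  # label of the chunk currently being extended, if any
    for word, pos, chunk_tag in tagged:
        leaf = (pos, word)
        kind, _, label = chunk_tag.partition(" ")
        if kind == "Start":
            forest.append((label, [leaf]))
            open_label = label
        elif kind == "Join" and label == open_label:
            forest[-1][1].append(leaf)
        else:
            # "Other", or a Join with no matching open chunk (assumption:
            # such a word is left as a bare leaf).
            forest.append(leaf)
            open_label = None
    return forest


sentence = [
    ("I", "PRP", "Start NP"), ("saw", "VBD", "Other"),
    ("the", "DT", "Start NP"), ("man", "NN", "Join NP"),
    ("with", "IN", "Other"),
    ("the", "DT", "Start NP"), ("telescope", "NN", "Join NP"),
]
forest = detect_chunks(sentence)
for tree in forest:
    print(tree)
```

On the chunk tags of figure 3, this yields a forest of five trees, matching the shape of figure 4.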
Procedure | Actions | Similar Shift-Reduce Parser Action
CHECK | No | shift
CHECK | Yes | reduce α, where α is CFG rule of proposed constituent
BUILD | Start X, Join X | Determines α for subsequent reduce operations
Table 2: Comparison of BUILD and CHECK to operations of a shift-reduce parser
2.3 Third Pass 
The third pass always alternates between the use 
of BUILD and CHECK, and completes any remain- 
ing constituent structure. BUILD decides whether 
a tree will start a new constituent or join the in- 
complete constituent immediately to its left. Ac- 
cordingly, it annotates the tree with either Start X, 
where X is any constituent label, or with Join X, 
where X matches the label of the incomplete con- 
stituent to the left. BUILD always processes the 
leftmost tree without any Start X or Join X an- 
notation. Figure 5 shows an application of BUILD 
in which the action is Join VP. After BUILD, con- 
trol passes to CHECK, which finds the most recently 
proposed constituent, and decides if it is complete. 
The most recently proposed constituent, shown in 
figure 6, is the rightmost sequence of trees $t_m \ldots t_n$
($m \le n$) such that $t_m$ is annotated with Start X
and $t_{m+1} \ldots t_n$ are annotated with Join X. If CHECK
decides yes, then the proposed constituent takes its 
place in the forest as an actual constituent, on which 
BUILD does its work. Otherwise, the constituent is 
not finished and BUILD processes the next tree in 
the forest, $t_{n+1}$. CHECK always answers no if the proposed constituent is a "flat" chunk, since such constituents must be formed in the second pass. Figure 7 shows the result when CHECK looks at the proposed constituent in figure 6 and decides No. The third pass terminates when CHECK is presented a constituent that spans the entire sentence.

I saw the man with the telescope
Figure 1: Initial Sentence

PRP  VBD  DT   NN   IN    DT   NN
I    saw  the  man  with  the  telescope
Figure 2: The result after First Pass

Start NP  Other  Start NP  Join NP  Other  Start NP  Join NP
PRP       VBD    DT        NN       IN     DT        NN
I         saw    the       man      with   the       telescope
Figure 3: The result after Second Pass

(NP (PRP I))  (VBD saw)  (NP (DT the) (NN man))  (IN with)  (NP (DT the) (NN telescope))
Figure 4: The result of chunk detection

Start S: (NP (PRP I))  Start VP: (VBD saw)  Join VP: (NP (DT the) (NN man))  (IN with)  (NP (DT the) (NN telescope))
Figure 5: An application of BUILD in which Join VP is the action

Start S: (NP (PRP I))  ?: [Start VP: (VBD saw)  Join VP: (NP (DT the) (NN man))]  (IN with)  (NP (DT the) (NN telescope))
Figure 6: The most recently proposed constituent (shown under ?)

Start S: (NP (PRP I))  Start VP: (VBD saw)  Join VP: (NP (DT the) (NN man))  ?: (IN with)  (NP (DT the) (NN telescope))
Figure 7: An application of CHECK in which No is the action, indicating that the proposed constituent in figure 6 is not complete. BUILD will now process the tree marked with ?
Table 2 compares the actions of BUILD and CHECK 
to the operations of a standard shift-reduce parser. 
The No and Yes actions of CHECK correspond to the
shift and reduce actions, respectively. The important
difference is that while a shift-reduce parser
creates a constituent in one step (reduce α), the procedures
BUILD and CHECK create it over several steps
in smaller increments.
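The alternation of BUILD and CHECK can be made concrete with a small Python sketch that replays a scripted action sequence instead of consulting a model. The `Tree` class, the `third_pass` and `bracket` helpers, and the script itself are hypothetical. The script continues the derivation of figures 5 to 7 with one possible attachment of the prepositional phrase, which is an assumption beyond what the figures show. The sketch assumes a well-formed script and stops once a completed constituent spans the whole forest.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Tree:
    label: str
    children: list = field(default_factory=list)  # subtrees, or [word] for a POS leaf
    note: Optional[tuple] = None  # ("Start", X) or ("Join", X), set by BUILD


def third_pass(forest, script):
    """Replay alternating BUILD and CHECK actions on a forest of trees."""
    actions = iter(script)
    i = 0  # index of the leftmost tree without a Start/Join annotation
    while True:
        forest[i].note = next(actions)  # BUILD annotates the current tree
        # Proposed constituent: the rightmost run t_m..t_n that begins with
        # Start X and continues with Join X, ending at the current tree.
        m = i
        while forest[m].note[0] == "Join":
            m -= 1
        label = forest[m].note[1]
        if next(actions) == "No":  # CHECK: constituent is not finished
            i += 1
            continue
        # CHECK said Yes: the run becomes one new, unannotated tree.
        forest[m:i + 1] = [Tree(label, forest[m:i + 1])]
        if len(forest) == 1:  # the constituent spans the entire sentence
            return forest[0]
        i = m


def bracket(t):
    """Render a tree in bracketed notation."""
    if isinstance(t, str):
        return t
    return "(" + t.label + " " + " ".join(bracket(c) for c in t.children) + ")"


forest = [
    Tree("NP", [Tree("PRP", ["I"])]),
    Tree("VBD", ["saw"]),
    Tree("NP", [Tree("DT", ["the"]), Tree("NN", ["man"])]),
    Tree("IN", ["with"]),
    Tree("NP", [Tree("DT", ["the"]), Tree("NN", ["telescope"])]),
]
script = [("Start", "S"), "No", ("Start", "VP"), "No", ("Join", "VP"), "No",
          ("Start", "PP"), "No", ("Join", "PP"), "Yes",
          ("Join", "VP"), "Yes", ("Join", "S"), "Yes"]
parse = third_pass(forest, script)
print(bracket(parse))
```

Each Yes answer plays the role of a reduce, while the Start and Join annotations accumulated beforehand determine which trees are reduced and under which label.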
3 Probability Model 
This paper takes a "history-based" approach (Black 
et al., 1993) where each tree-building procedure uses 
a probability model $p(a|b)$, derived from $p(a, b)$, to
weight any action a based on the available context, 
or history, b. First, we present a few simple cate- 
gories of contextual predicates that capture any in- 
formation in b that is useful for predicting a. Next, 
the predicates are used to extract a set of features 
from a corpus of manually parsed sentences. Finally, 
those features are combined under the maximum en- 
tropy framework, yielding p(a, b). 
3.1 Contextual Predicates 
Contextual predicates are functions that check for 
the presence or absence of useful information in a 
context b and return true or false accordingly. The 
comprehensive guidelines, or templates, for the con- 
textual predicates of each tree building procedure 
are given in table 3. The templates use indices 
relative to the tree that is currently being modi- 
fied. For example, if the current tree is the 5th 
tree, cons(-2) looks at the constituent label, head 
word, and start/join annotation of the 3rd tree in 
the forest. The actual contextual predicates are gen- 
erated automatically by scanning the derivations of 
the trees in the manually parsed corpus with the 
templates. For example, an actual contextual pred- 
icate based on the template cons(0) might be "Does 
cons(0) = { NP, he } ?" Constituent head words 
are found, when necessary, with the algorithm in 
(Magerman, 1995). 
Contextual predicates which look at head words, 
or especially pairs of head words, may not be re- 
liable predictors for the procedure actions due to 
their sparseness in the training sample. Therefore, 
for each lexically based contextual predicate, there 
also exist one or more corresponding less specific, 
or "backed-off", contextual predicates which look 
at the same context, but omit one or more words. 
For example, the contexts cons(0, 1*), cons(0*, 1), and
cons(0*, 1*) are the same as cons(0, 1) but omit references
to the head word of the 1st tree, the 0th
tree, and both the 0th and 1st tree, respectively.
The backed-off contextual predicates should allow 
the model to provide reliable probability estimates 
when the words in the history are rare. Backed-off 
predicates are not enumerated in table 3, but their 
existence is indicated with a * and †.
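As an illustration of how a lexical predicate and its backed-off variants might be generated from one template, here is a minimal Python sketch for cons(m, n). The function names, the string encoding of predicates, and the padding symbol for out-of-range trees are assumptions. The sketch also simplifies the template by ignoring the start/join annotation.

```python
from itertools import product


def cons(forest, current, offset):
    """(label, head word) of the tree at `offset` relative to the current tree."""
    i = current + offset
    if 0 <= i < len(forest):
        return forest[i]
    return ("<none>", "<none>")  # padding for positions outside the forest (assumption)


def cons_pair_predicates(forest, current, m, n):
    """Yield cons(m, n) followed by its three backed-off variants."""
    (label_m, head_m) = cons(forest, current, m)
    (label_n, head_n) = cons(forest, current, n)
    for keep_m, keep_n in product((True, False), repeat=2):
        left = f"{label_m},{head_m}" if keep_m else label_m
        right = f"{label_n},{head_n}" if keep_n else label_n
        name = f"cons({m}{'' if keep_m else '*'},{n}{'' if keep_n else '*'})"
        yield f"{name}={{{left}}}&{{{right}}}"


# Forest of (constituent label, head word) pairs; the current tree is index 1.
forest = [("NP", "I"), ("VBD", "saw"), ("NP", "man")]
preds = list(cons_pair_predicates(forest, 1, 0, 1))
for p in preds:
    print(p)
```

A starred index marks a position whose head word has been dropped, so the last predicate depends only on the two labels and can fire even when both head words are rare.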
3.2 Maximum Entropy Framework 
The contextual predicates derived from the tem- 
plates of table 3 are used to create the features nec- 
essary for the maximum entropy models. The pred- 
icates for TAG, CHUNK, BUILD, and CHECK are used 
to scan the derivations of the trees in the corpus to 
form the training samples $\mathcal{T}_{TAG}$, $\mathcal{T}_{CHUNK}$, $\mathcal{T}_{BUILD}$, and
$\mathcal{T}_{CHECK}$, respectively. Each training sample has the
form $\mathcal{T} = \{(a_1, b_1), (a_2, b_2), \ldots, (a_N, b_N)\}$, where $a_i$
is an action of the corresponding procedure and $b_i$
is the list of contextual predicates that were true in
the context in which $a_i$ was decided.
Model | Category | Description | Templates Used
TAG | See (Ratnaparkhi, 1996) | |
CHUNK | chunkandpostag(n)* | The word, POS tag, and chunk tag of nth leaf. Chunk tag omitted if n > 0. | chunkandpostag(0), chunkandpostag(-1), chunkandpostag(-2), chunkandpostag(1), chunkandpostag(2)
CHUNK | chunkandpostag(m, n)* | chunkandpostag(m) & chunkandpostag(n) | chunkandpostag(-1, 0), chunkandpostag(0, 1)
BUILD | cons(n) | The head word, constituent (or POS) label, and start/join annotation of the nth tree. Start/join annotation omitted if n > 0. | cons(0), cons(-1), cons(-2), cons(1), cons(2)
BUILD | cons(m, n)* | cons(m) & cons(n) | cons(-1, 0), cons(0, 1)
BUILD | cons(m, n, p)† | cons(m), cons(n), & cons(p). | cons(0, -1, -2), cons(0, 1, 2), cons(-1, 0, 1)
BUILD | punctuation | The constituent we could join (1) contains a "[" and the current tree is a "]"; (2) contains a "," and the current tree is a ","; (3) spans the entire sentence and current tree is "." | bracketsmatch, iscomma, endofsentence
CHECK | checkcons(n)* | The head word, constituent (or POS) label of the nth tree, and the label of proposed constituent. begin and last are first and last child (resp.) of proposed constituent. | checkcons(last), checkcons(begin)
CHECK | checkcons(m, n)* | checkcons(m) & checkcons(n) | checkcons(i, last), begin < i < last
CHECK | production | Constituent label of parent (X), and constituent or POS labels of children ($X_1 \ldots X_n$) of proposed constituent | production = $X \rightarrow X_1 \ldots X_n$
CHECK | surround(n)* | POS tag and word of the nth leaf to the left of the constituent, if n < 0, or to the right of the constituent, if n > 0 | surround(1), surround(2), surround(-1), surround(-2)
Table 3: Contextual Information Used by Probability Models (* = all backed-off contexts are used, † = only backed-off contexts that include head word of current tree, i.e., 0th tree, are used)

The training samples are respectively used to create the models $p_{TAG}$, $p_{CHUNK}$, $p_{BUILD}$, and $p_{CHECK}$, all of which have the form:

$$p(a, b) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(a,b)} \qquad (1)$$

where a is some action, b is some context, $\pi$ is a normalization constant, $\alpha_j$ are the model parameters, $0 < \alpha_j < \infty$, and $f_j(a, b) \in \{0, 1\}$ are called features, $j = \{1 \ldots k\}$. Features encode an action a' as well as some contextual predicate cp that a tree-building procedure would find useful for predicting the action a'. Any contextual predicate cp derived from table 3 which occurs 5 or more times in a training sample with a particular action a' is used to construct a feature $f_j$:

$$f_j(a,b) = \begin{cases} 1 & \text{if } cp(b) = \text{true and } a = a' \\ 0 & \text{otherwise} \end{cases}$$

for use in the corresponding model. Each feature $f_j$ corresponds to a parameter $\alpha_j$, which can be viewed as a "weight" that reflects the importance of the feature.
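The feature construction rule, with its count cutoff of 5, can be sketched in a few lines of Python. The function names, the representation of a training event as an (action, set of true predicates) pair, and the toy sample are assumptions made for illustration.

```python
from collections import Counter


def build_features(sample, cutoff=5):
    """Map each (predicate, action) pair seen at least `cutoff` times to an index j.

    `sample` is a list of (action, predicates) events, where `predicates`
    holds the contextual predicates that were true in that context.
    """
    counts = Counter((cp, a) for a, preds in sample for cp in set(preds))
    kept = sorted(pair for pair, c in counts.items() if c >= cutoff)
    return {pair: j for j, pair in enumerate(kept)}


def active_features(features, action, preds):
    """Indices j with f_j(action, b) = 1, given the predicates true in b."""
    return [features[(cp, action)] for cp in preds if (cp, action) in features]


# Toy training sample (hypothetical predicates and counts).
sample = (
    [("Start NP", {"pos=DT"})] * 5
    + [("Join NP", {"pos=DT"})] * 2
    + [("Join NP", {"pos=NN"})] * 6
)
features = build_features(sample)
print(features)
print(active_features(features, "Start NP", {"pos=DT"}))
print(active_features(features, "Join NP", {"pos=DT"}))
```

The pair ("pos=DT", "Join NP") occurs only twice in the toy sample, so it falls below the cutoff and yields no feature.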
The parameters $\{\alpha_1 \ldots \alpha_k\}$ are found automatically
with Generalized Iterative Scaling (Darroch
and Ratcliff, 1972), or GIS. The GIS procedure, as 
well as the maximum entropy and maximum likeli- 
hood properties of the distribution of form (1), are 
described in detail in (Ratnaparkhi, 1997). In gen- 
eral, the maximum entropy framework puts no lim- 
itations on the kinds of features in the model; no 
special estimation technique is required to combine 
features that encode different kinds of contextual 
predicates, like punctuation and cons(0, 1, 2). As a 
result, experimenters need only worry about what 
features to use, and not how to use them. 
We then use the models $p_{TAG}$, $p_{CHUNK}$, $p_{BUILD}$, and
$p_{CHECK}$ to define a function score, which the search
procedure uses to rank derivations of incomplete and 
complete parse trees. For each model, the corre- 
sponding conditional probability is defined as usual: 
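$$p(a|b) = \frac{p(a, b)}{\sum_{a' \in A} p(a', b)}$$
For notational convenience, define q as follows:
$$q(a|b) = \begin{cases} p_{TAG}(a|b) & \text{if } a \text{ is an action from TAG} \\ p_{CHUNK}(a|b) & \text{if } a \text{ is an action from CHUNK} \\ p_{BUILD}(a|b) & \text{if } a \text{ is an action from BUILD} \\ p_{CHECK}(a|b) & \text{if } a \text{ is an action from CHECK} \end{cases}$$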
Let $deriv(T) = \{a_1, \ldots, a_n\}$ be the derivation of a
parse T, where T is not necessarily complete, and
where each $a_i$ is an action of some tree-building
procedure. By design, the tree-building procedures
guarantee that $\{a_1, \ldots, a_n\}$ is the only derivation for
the parse T. Then the score of T is merely the product
of the conditional probabilities of the individual
actions in its derivation:
$$score(T) = \prod_{a_i \in deriv(T)} q(a_i|b_i)$$
where $b_i$ is the context in which $a_i$ was decided.
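The conditional probability and the score can be computed directly from the product form, as in the following Python sketch. The function names, the feature index, and the parameter values are hypothetical. The normalization constant is omitted because it cancels in the conditional.

```python
import math


def joint(alphas, active):
    """Unnormalized p(a, b): the product of alpha_j over the active features."""
    return math.prod(alphas[j] for j in active)


def conditional(alphas, features, actions, preds):
    """p(a | b) for every action a, given the predicates `preds` true in b."""
    weights = {}
    for a in actions:
        active = [features[(cp, a)] for cp in preds if (cp, a) in features]
        weights[a] = joint(alphas, active)
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def score(step_probs):
    """Score of a parse: the product of q(a_i | b_i) over its derivation."""
    return math.prod(step_probs)


# Hypothetical feature index and parameter values.
features = {("pos=DT", "Start NP"): 0, ("pos=NN", "Join NP"): 1}
alphas = [3.0, 4.0]
probs = conditional(alphas, features, ["Start NP", "Join NP", "Other"], {"pos=DT"})
print(probs)
print(score([0.6, 0.5]))
```

An action with no active feature gets weight 1, the value of an empty product. For long derivations, summing log probabilities instead of multiplying would avoid floating-point underflow; that is a common implementation choice, not something stated in the text.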
4 Search 
The search heuristic attempts to find the best parse 
T*, defined as: 
$$T^* = \arg\max_{T \in trees(S)} score(T)$$
where trees(S) are all the complete parses for an 
input sentence S. 
The heuristic employs a breadth-first search 
(BFS) which does not explore the entire frontier, 
but rather, explores only at most the top K scoring
incomplete parses in the frontier, and terminates
when it has found M complete parses, or when all 
the hypotheses have been exhausted. Furthermore, 
if $\{a_1 \ldots a_n\}$ are the possible actions for a given
procedure on a derivation with context b, and they
are sorted in decreasing order according to $q(a_i|b)$,
we only consider exploring those actions $\{a_1 \ldots a_m\}$
that hold most of the probability mass, where m is
defined as follows:
$$m = \max_m \sum_{i=1}^{m} q(a_i|b) < Q$$
and where Q is a threshold less than 1. The search 
also uses a Tag Dictionary constructed from train- 
ing data, described in (Ratnaparkhi, 1996), that re- 
duces the number of actions explored by the tag- 
ging model. Thus there are three parameters for the 
search heuristic, namely K, M, and Q, and all experiments
reported in this paper use K = 20, M = 20,
and Q = .95.¹ Table 4 describes the top K BFS and
the semantics of the supporting functions.
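The probability-mass threshold can be sketched as a small Python function. It reads the definition of m literally, as the largest prefix of the sorted actions whose cumulative probability stays below Q. The function name is hypothetical, and always keeping at least one action is an assumption added here so that a derivation can still advance when the top action alone reaches Q.

```python
def prune_actions(dist, Q=0.95):
    """Keep the top-ranked actions whose cumulative probability stays below Q."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for action, p in ranked:
        # Assumption: the best action is always kept, even if p >= Q.
        if kept and mass + p >= Q:
            break
        kept.append((action, p))
        mass += p
    return kept


dist = {"Start NP": 0.7, "Join NP": 0.2, "Other": 0.06, "Start VP": 0.04}
kept = prune_actions(dist)
print(kept)
print(prune_actions({"Yes": 0.99, "No": 0.01}))
```

With Q = .95, the first call keeps the two most probable actions, and the second keeps only the dominant one.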
It should be emphasized that if K > 1, the parser 
does not commit to a single POS or chunk assign- 
ment for the input sentence before building con- 
stituent structure. All three of the passes described 
in section 2 are integrated in the search, i.e., when 
parsing a test sentence, the input to the second pass 
consists of K of the best distinct POS tag assign- 
ments for the input sentence. Likewise, the input 
to the third pass consists of K of the best distinct 
chunk and POS tag assignments for the input sen- 
tence. 
The top K BFS described above exploits the ob- 
served property that the individual steps of correct 
derivations tend to have high probabilities, and thus 
avoids searching a large fraction of the search space. 
Since, in practice, it only does a constant amount of work to advance each step in a derivation, and since derivation lengths are roughly proportional to the sentence length, we would expect it to run in linear observed time with respect to sentence length. Figure 8 confirms our assumptions about the linear observed running time.

¹ The parameters K, M, and Q were optimized on "held out" data separate from the training and test sets.

advance: d × Q → d_1 ... d_m      /* Applies relevant tree building procedure to d and returns list of new derivations whose action probabilities pass the threshold Q */
insert: d × h → void              /* inserts d in heap h */
extract: h → d                    /* removes and returns derivation in h with highest score */
completed: d → {true, false}      /* returns true if and only if d is a complete derivation */

M = 20
K = 20
Q = .95
C = <empty heap>                  /* Heap of completed parses */
h_0 = <input sentence>            /* h_i contains derivations of length i */
while ( |C| < M )
    if ( ∀i, h_i is empty )
        then break
    i = max{i | h_i is non-empty}
    sz = min(K, |h_i|)
    for j = 1 to sz
        d_1 ... d_p = advance(extract(h_i), Q)
        for q = 1 to p
            if (completed(d_q))
                then insert(d_q, C)
                else insert(d_q, h_{i+1})
Table 4: Top K BFS Search Heuristic

[Scatter plot: Seconds (vertical axis) against Sentence Length (horizontal axis, 10 to 70).]
Figure 8: Observed running time of top K BFS on Section 23 of Penn Treebank WSJ, using one 167MHz UltraSPARC processor and 256MB RAM of a Sun Ultra Enterprise 4000.
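The search loop of table 4 can be written as a short runnable Python sketch. Everything around the loop is an assumption made for illustration: the function name `top_k_bfs`, the representation of a derivation as a tuple of actions, the toy two-word tagging task, and the choice to apply the threshold Q inside `advance` rather than pass it as an argument. Python's `heapq` is a min-heap, so scores are stored negated.

```python
import heapq
from collections import defaultdict


def top_k_bfs(initial, advance, completed, K=20, M=20):
    """Top K BFS: `advance(d)` returns (probability, new derivation) pairs."""
    heaps = defaultdict(list)  # heaps[i] holds derivations of length i
    heapq.heappush(heaps[0], (-1.0, initial))
    done = []  # C: heap of completed parses
    while len(done) < M:
        lengths = [i for i, h in heaps.items() if h]
        if not lengths:
            break  # all hypotheses exhausted
        i = max(lengths)
        for _ in range(min(K, len(heaps[i]))):
            neg_score, d = heapq.heappop(heaps[i])  # highest-scoring derivation
            for p, d_new in advance(d):
                item = (neg_score * p, d_new)
                if completed(d_new):
                    heapq.heappush(done, item)
                else:
                    heapq.heappush(heaps[i + 1], item)
    return [(-neg, d) for neg, d in sorted(done)]  # best parse first


# Toy task (hypothetical probabilities): choose one tag per word.
TOY = {"time": {"NN": 0.7, "VB": 0.3}, "flies": {"VBZ": 0.6, "NNS": 0.4}}
words = ["time", "flies"]


def advance(d):
    return [(p, d + (tag,)) for tag, p in TOY[words[len(d)]].items()]


def completed(d):
    return len(d) == len(words)


parses = top_k_bfs((), advance, completed, K=2, M=2)
for s, d in parses:
    print(round(s, 2), d)
```

Each pass of the outer loop expands at most K derivations of the current maximum length, so when `advance` does a bounded amount of work, the total work grows with the derivation length rather than with the size of the full search space.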
5 Experiments 
The maximum entropy parser was trained on sec- 
tions 2 through 21 (roughly 40000 sentences) of 
the Penn Treebank Wall St. Journal corpus, release 
2 (Marcus, Santorini, and Marcinkiewicz, 1994), and 
tested on section 23 (2416 sentences) for compar- 
ison with other work. All trees were stripped of 
their semantic tags (e.g., -LOC, -BNF, etc.), coreference information (e.g., *-1), and quotation marks (`` and '') for both training and testing. The PARSEVAL (Black et al., 1991) measures compare a proposed parse P with the corresponding correct treebank parse T as follows:
$$\text{Recall} = \frac{\#\ \text{correct constituents in } P}{\#\ \text{constituents in } T}$$
$$\text{Precision} = \frac{\#\ \text{correct constituents in } P}{\#\ \text{constituents in } P}$$
A constituent in P is "correct" if there exists a con- 
stituent in T of the same label that spans the same 
words. Table 5 shows results using the PARSEVAL 
measures, as well as results using the slightly more 
forgiving measures of (Collins, 1996) and (Magerman,
1995). Table 5 shows that the maximum entropy
parser performs better than the parsers presented
in (Collins, 1996) and (Magerman, 1995)²,
which have the best previously published parsing
accuracies on the Wall St. Journal domain.
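The two PARSEVAL measures are simple to compute once each parse is reduced to its labeled spans. The Python sketch below assumes a parse is a list of (label, start, end) triples over word positions; this representation, the function name, and the example spans are illustrative assumptions.

```python
from collections import Counter


def parseval(proposed, gold):
    """Return (precision, recall) for two parses given as (label, start, end) spans."""
    p, g = Counter(proposed), Counter(gold)
    correct = sum((p & g).values())  # spans with the same label and the same words
    return correct / sum(p.values()), correct / sum(g.values())


# "I saw the man with the telescope", word positions 0..7.
gold = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7),
        ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]
# A proposed parse with one extra NP spanning "the man with the telescope".
proposed = gold + [("NP", 2, 7)]
precision, recall = parseval(proposed, gold)
print(precision, recall)
```

Here the extra constituent lowers precision to 6/7 while recall stays at 1, since every treebank constituent is still recovered.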
It is often advantageous to produce the top N 
parses instead of just the top 1, since additional in- 
formation can be used in a secondary model that re- 
orders the top N and hopefully improves the quality 
of the top ranked parse. Suppose there exists a "per- 
fect" reranking scheme that, for each sentence, magi- 
cally picks the best parse from the top N parses pro- 
duced by the maximum entropy parser, where the 
best parse has the highest average precision and re- 
call when compared to the treebank parse. The per- 
formance of this "perfect" scheme is then an upper 
bound on the performance of any reranking scheme 
that might be used to reorder the top N parses. Figure 9 shows that the "perfect" scheme would achieve roughly 93% precision and recall, which is a dramatic increase over the top 1 accuracy of 87% precision and 86% recall. Figure 10 shows that the "Exact Match", which counts the percentage of times the proposed parse P is identical (excluding POS tags) to the treebank parse T, rises substantially to about 53% from 30% when the "perfect" scheme is applied. For this reason, research into reranking schemes appears to be a promising step towards the goal of improving parsing accuracy.

² Results for SPATTER on section 23 are reported in (Collins, 1996).

Parser | Precision | Recall
Maximum Entropy° | 86.8% | 85.6%
Maximum Entropy* | 87.5% | 86.3%
(Collins, 1996)* | 85.7% | 85.3%
(Magerman, 1995)* | 84.3% | 84.0%
Table 5: Results on 2416 sentences of section 23 (0 to 100 words in length) of the WSJ Treebank. Evaluations marked with ° ignore quotation marks. Evaluations marked with * collapse the distinction between ADVP and PRT, and ignore all punctuation.
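The "perfect" reranking scheme amounts to an oracle that consults the treebank parse. A minimal Python sketch follows, assuming each parse is a list of (label, start, end) spans; the function names and the two candidate parses are hypothetical.

```python
from collections import Counter


def avg_precision_recall(proposed, gold):
    """Average of labeled precision and recall for two span lists."""
    p, g = Counter(proposed), Counter(gold)
    correct = sum((p & g).values())
    return (correct / sum(p.values()) + correct / sum(g.values())) / 2


def oracle_pick(top_n, gold):
    """Pick the candidate closest to the treebank parse (first one wins ties)."""
    return max(top_n, key=lambda parse: avg_precision_recall(parse, gold))


gold = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7),
        ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]
candidate_a = gold + [("NP", 2, 7)]  # top-ranked parse with one spurious constituent
candidate_b = list(gold)             # lower-ranked parse that matches the treebank
best = oracle_pick([candidate_a, candidate_b], gold)
print(best == candidate_b)
```

Because the oracle needs the correct parse, it only bounds what a real reranker could achieve; a practical reranker would have to rely on a secondary model instead.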
6 Comparison With Previous Work 
The two parsers which have previously reported the 
best accuracies on the Penn Treebank Wall St. Jour- 
nal are the bigram parser described in (Collins, 1996) 
and the SPATTER parser described in (Jelinek et 
al., 1994; Magerman, 1995). The parser presented 
here outperforms both the bigram parser and the 
SPATTER parser, and uses different modelling tech- 
nology and different information to drive its deci- 
sions. 
The bigram parser is a statistical CKY-style chart 
parser, which uses cooccurrence statistics of head- 
modifier pairs to find the best parse. The max- 
imum entropy parser is a statistical shift-reduce 
style parser that cannot always access head-modifier 
pairs. For example, the checkcons(m,n) predicate 
of the maximum entropy parser may use two words 
such that neither is the intended head of the proposed
constituent that the CHECK procedure must
judge. And unlike the bigram parser, the maximum 
entropy parser cannot use head word information 
besides "flat" chunks in the right context. 
The bigram parser uses a backed-off estimation 
scheme that is customized for a particular task, 
whereas the maximum entropy parser uses a gen- 
eral purpose modelling technique. This allows the 
maximum entropy parser to easily integrate vary- 
ing kinds of features, such as those for punctua- 
tion, whereas the bigram parser uses hand-crafted 
punctuation rules. Furthermore, the customized es- 
timation framework of the bigram parser must use 
information that has been carefully selected for its 
value, whereas the maximum entropy framework robustly integrates any kind of information, obviating the need to screen it first.

[Plot: % Accuracy (vertical axis, 85 to 95) against N (horizontal axis, 0 to 20), with one series for Precision and one for Recall.]
Figure 9: Precision & recall of a "perfect" reranking scheme for the top N parses of section 23 of the WSJ Treebank, as a function of N. Evaluation ignores quotation marks.

[Plot: % Accuracy (vertical axis, 25 to 55) against N (horizontal axis, 0 to 20).]
Figure 10: Exact match of a "perfect" reranking scheme for the top N parses of section 23 of the WSJ Treebank, as a function of N. Evaluation ignores quotation marks.

The SPATTER parser is a history-based parser 
that uses decision tree models to guide the opera- 
tions of a few tree building procedures. It differs 
from the maximum entropy parser in how it builds 
trees and more critically, in how its decision trees 
use information. The SPATTER decision trees use 
predicates on word classes created with a statistical 
clustering technique, whereas the maximum entropy 
parser uses predicates that contain merely the words 
themselves, and thus lacks the need for a (typically 
expensive) word clustering procedure. Furthermore, 
the top K BFS search heuristic appears to be much 
simpler than the stack decoder algorithm outlined 
in (Magerman, 1995). 
7 Conclusion
The maximum entropy parser presented here 
achieves a parsing accuracy which exceeds the best 
previously published results, and parses a test sen- 
tence in linear observed time, with respect to the 
sentence length. It uses simple and concisely specified
predicates which can be added or modified quickly
with little human effort under the maximum entropy 
framework. Lastly, this paper clearly demonstrates 
that schemes for reranking the top 20 parses deserve 
research effort since they could yield vastly better 
accuracy results. 
8 Acknowledgements 
Many thanks to Mike Collins and Professor Mitch 
Marcus from the University of Pennsylvania for their 
helpful comments on this work. 

References 
Berger, Adam, Stephen A. Della Pietra, and Vin- 
cent J. Della Pietra. 1996. A Maximum Entropy 
Approach to Natural Language Processing. Com- 
putational Linguistics, 22(1):39-71. 
Black, Ezra et al. 1991. A Procedure for Quan- 
titatively Comparing the Syntactic Coverage of 
English Grammars. In Proceedings of the Febru- 
ary 1991 DARPA Speech and Natural Language 
Workshop, pages 306-311. 
Black, Ezra, Fred Jelinek, John Lafferty, David M. 
Magerman, Robert Mercer, and Salim Roukos. 
1993. Towards History-based Grammars: Using 
Richer Models for Probabilistic Parsing. In Pro- 
ceedings of the 31st Annual Meeting of the ACL, 
Columbus, Ohio. 
Collins, Michael John. 1996. A New Statistical 
Parser Based on Bigram Lexical Dependencies. 
In Proceedings of the 34th Annual Meeting of the 
ACL. 
Darroch, J. N. and D. Ratcliff. 1972. Generalized 
Iterative Scaling for Log-Linear Models. The An- 
nals of Mathematical Statistics, 43(5):1470-1480. 
Jelinek, Fred, John Lafferty, David M. Magerman, 
Robert Mercer, Adwait Ratnaparkhi, and Salim 
Roukos. 1994. Decision Tree Parsing using a Hid- 
den Derivational Model. In Proceedings of the Hu- 
man Language Technology Workshop, pages 272- 
277. ARPA. 
Lau, Ray, Ronald Rosenfeld, and Salim Roukos. 
1993. Adaptive Language Modeling Using The 
Maximum Entropy Principle. In Proceedings of 
the Human Language Technology Workshop, pages 
108-113. ARPA. 
Magerman, David M. 1995. Statistical Decision- 
Tree Models for Parsing. In Proceedings of the 
33rd Annual Meeting of the ACL. 
Marcus, Mitchell P., Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1994. Building a large 
annotated corpus of English: the Penn Treebank. 
Computational Linguistics, 19(2):313-330. 
Ratnaparkhi, Adwait. 1996. A Maximum Entropy 
Part of Speech Tagger. In Conference on Em- 
pirical Methods in Natural Language Processing, 
University of Pennsylvania, May 17-18. 
Ratnaparkhi, Adwait. 1997. A Simple Introduction 
to Maximum Entropy Models for Natural Lan- 
guage Processing. Technical Report 97-08, Insti- 
tute for Research in Cognitive Science, University 
of Pennsylvania. 
