Figures of Merit for 
Best-First Probabilistic Chart Parsing 
Sharon A. Caraballo and Eugene Charniak 
Brown University 
{sc, ec}@cs.brown.edu
Abstract 
Best-first parsing methods for natural language 
try to parse efficiently by considering the most 
likely constituents first. Some figure of merit is 
needed by which to compare the likelihood of con- 
stituents, and the choice of this figure has a sub- 
stantial impact on the efficiency of the parser. 
While several parsers described in the literature 
have used such techniques, there is no published 
data on their efficacy, much less attempts to judge 
their relative merits. We propose and evaluate 
several figures of merit for best-first parsing. 
Introduction 
Chart parsing is a commonly-used algorithm for 
parsing natural language texts. The chart is a data 
structure which contains all of the constituents 
which may occur in the sentence being parsed. 
At any point in the algorithm, there exist con- 
stituents which have been proposed but not ac- 
tually included in a parse. These proposed con- 
stituents are stored in a data structure called the 
keylist. When a constituent is removed from the 
keylist, the system considers how this constituent 
can be used to extend its current structural hy- 
pothesis. In general this can lead to the creation of 
new, more encompassing constituents which them- 
selves are then added to the keylist. When we are 
finished processing one constituent, a new one is 
chosen to be removed from the keylist, and so on. 
Traditionally, the keylist is represented as a stack, 
so that the last item added to the keylist is the 
next one removed. 
Best-first chart parsing is a variation of chart 
parsing which attempts to find the most likely 
parses first, by adding constituents to the chart 
in order of the likelihood that they will appear in 
a correct parse, rather than simply popping con- 
stituents off of a stack. Some figure of merit is 
assigned to potential constituents, and the con- 
stituent maximizing this value is the next to be 
added to the chart. 
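The difference between a stack keylist and a best-first keylist can be sketched with a priority queue. This is only a minimal illustration, not the parser used in our experiments; the `Keylist` class, the constituent tuples, and the merit values are hypothetical.

```python
import heapq
import itertools


class Keylist:
    """Agenda that always pops the constituent with the highest figure of merit."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier pushes win ties

    def push(self, constituent, merit):
        # heapq is a min-heap, so the merit is stored negated.
        heapq.heappush(self._heap, (-merit, next(self._counter), constituent))

    def pop(self):
        neg_merit, _, constituent = heapq.heappop(self._heap)
        return constituent, -neg_merit

    def __len__(self):
        return len(self._heap)


# Hypothetical constituents (label, start, end) with made-up merit values.
keylist = Keylist()
keylist.push(("NP", 0, 2), 0.020)
keylist.push(("VP", 2, 5), 0.300)
keylist.push(("PP", 3, 5), 0.004)

order = []
while keylist:
    constituent, merit = keylist.pop()
    order.append(constituent[0])
print(order)  # ['VP', 'NP', 'PP']
```

A stack would have returned PP, VP, NP (last in, first out); the priority queue returns the constituents in order of decreasing merit instead.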
In best-first probabilistic chart parsing a prob- 
abilistic measure is used. In this paper we con- 
sider probabilities primarily based on probabilistic 
context-free grammars, though in principle other, 
more complicated schemes could be used. 
Ideally, we would like to use as our figure
of merit the conditional probability of a constituent
given the entire sentence, in order to
choose a constituent that not only appears likely in
isolation, but also maximizes the likelihood of the
sentence as a whole; that is, we would like to pick the
constituent that maximizes the following quantity:
$$p(N^i_{j,k} \mid t_{0,n})$$
where $t_{0,n}$ is the sequence of the $n$ tags, or parts
of speech, in the sentence (numbered $t_0, \ldots, t_{n-1}$),
and $N^i_{j,k}$ is a nonterminal of type $i$ covering terms
$t_j \ldots t_{k-1}$. However, we cannot calculate this
quantity, since in order to do so, we would need 
to completely parse the sentence. In this paper, 
we examine the performance of several proposed 
figures of merit that approximate it in one way or 
another. 
In our experiments, we use only tag sequences 
for parsing. More accurate probability estimates 
should be attainable using lexical information. 
Figures of Merit 
Straight $\beta$
It seems reasonable to base a figure of merit on
the inside probability $\beta$ of the constituent. Inside
probability is defined as the probability of the
words or tags in the constituent given that the con- 
stituent is dominated by a particular nonterminal 
symbol. This seems to be a reasonable basis for 
comparing constituent probabilities, and has the 
additional advantage that it is easy to compute 
during chart parsing. 
The inside probability of the constituent $N^i_{j,k}$
is defined as
$$\beta(N^i_{j,k}) \equiv p(t_{j,k} \mid N^i)$$
where $N^i$ represents the $i$th nonterminal symbol.
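As an illustration of why inside probability is easy to compute during chart parsing, the sketch below fills in $\beta$ bottom-up for a toy grammar. It assumes a grammar in Chomsky normal form and an exhaustive CKY-style loop, neither of which is claimed to match our parser; the rule tables `LEXICAL` and `BINARY` and the function `inside_probabilities` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy PCFG over tags. LEXICAL holds rules N -> tag,
# BINARY holds rules N -> A B. Probabilities are invented for illustration.
LEXICAL = {("DET", "det"): 1.0, ("N", "n"): 1.0, ("NP", "n"): 0.4,
           ("V", "v"): 1.0, ("VP", "v"): 0.3}
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "DET", "N"): 0.6,
          ("VP", "V", "NP"): 0.7}


def inside_probabilities(tags):
    """Return beta[(label, j, k)] = p(t_j ... t_{k-1} | label)."""
    n = len(tags)
    beta = defaultdict(float)
    # Constituents covering a single tag.
    for j, tag in enumerate(tags):
        for (lhs, rhs_tag), prob in LEXICAL.items():
            if rhs_tag == tag:
                beta[lhs, j, j + 1] += prob
    # Longer constituents, built from two adjacent shorter ones.
    for width in range(2, n + 1):
        for j in range(n - width + 1):
            k = j + width
            for split in range(j + 1, k):
                for (lhs, left, right), prob in BINARY.items():
                    left_p = beta.get((left, j, split), 0.0)
                    right_p = beta.get((right, split, k), 0.0)
                    if left_p and right_p:
                        beta[lhs, j, k] += prob * left_p * right_p
    return beta


beta = inside_probabilities(["det", "n", "v"])
print(round(beta["NP", 0, 2], 4))  # 0.6
print(round(beta["S", 0, 3], 4))   # 0.18
```

Each $\beta$ value is a sum of products of a rule probability and the $\beta$ values of the children, so it is available as soon as the children are in the chart.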
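In terms of our earlier discussion, our "ideal"
figure of merit can be rewritten as:
$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(N^i_{j,k}, t_{0,j}, t_{j,k}, t_{k,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, N^i_{j,k}, t_{k,n})\, p(t_{j,k} \mid t_{0,j}, N^i_{j,k}, t_{k,n})}{p(t_{0,n})}
\end{aligned}$$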
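We apply the usual independence assumption
that given a nonterminal, the tag sequence it
generates depends only on that nonterminal, giving
$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &\approx \frac{p(t_{0,j}, N^i_{j,k}, t_{k,n})\, p(t_{j,k} \mid N^i_{j,k})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, N^i_{j,k}, t_{k,n})\, \beta(N^i_{j,k})}{p(t_{0,n})}
\end{aligned}$$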
The first term in the numerator is just the
definition of the outside probability $\alpha$ of the
constituent. The outside probability $\alpha$ of a constituent
$N^i_{j,k}$ is defined as the probability of that
constituent and the rest of the words in the sentence
(or rest of the tags in the tag sequence, in our
case):
$$\alpha(N^i_{j,k}) \equiv p(t_{0,j}, N^i_{j,k}, t_{k,n})$$
We can therefore rewrite our ideal figure of merit
as
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{\alpha(N^i_{j,k})\, \beta(N^i_{j,k})}{p(t_{0,n})}$$
In this equation, we can see that $\alpha(N^i_{j,k})$ and
$p(t_{0,n})$ represent the influence of the surrounding
words. Thus using $\beta$ alone assumes that $\alpha$ and
$p(t_{0,n})$ can be ignored.
We will refer to this figure of merit as
straight $\beta$.
Normalized $\beta$
One side effect of omitting the $\alpha$ and $p(t_{0,n})$
terms in the $\beta$-only figure above is that inside
probability alone tends to prefer shorter constituents
to longer ones, as the inside probability
of a longer constituent involves the product of
more probabilities. This can result in a "thrash- 
ing" effect, where the system parses short con- 
stituents, even very low probability ones, while 
avoiding combining them into longer constituents. 
To avoid thrashing, typically some technique is 
used to normalize the inside probability for use as 
a figure of merit. One approach is to take the
geometric mean of the inside probability, to obtain
a "per-word" inside probability. (In the "ideal"
model, the $p(t_{0,n})$ term acts as a normalizing
factor.)
The per-word inside probability of the constituent
$N^i_{j,k}$ is calculated as
$$\sqrt[k-j]{\beta(N^i_{j,k})}$$
We will refer to this figure as normalized $\beta$.
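The effect of the geometric mean can be seen in a small numeric sketch. The two constituents and their inside probabilities below are invented for illustration, and the function names are hypothetical.

```python
def straight_beta(beta, j, k):
    """Figure of merit that uses the inside probability unchanged."""
    return beta


def normalized_beta(beta, j, k):
    """Geometric mean of the inside probability over the k - j tags covered."""
    return beta ** (1.0 / (k - j))


# (inside probability, start j, end k); the values are made up.
short = (0.01, 3, 4)    # an unlikely constituent covering one tag
longer = (0.002, 0, 5)  # a constituent covering five tags

print(straight_beta(*short) > straight_beta(*longer))      # True
print(normalized_beta(*short) > normalized_beta(*longer))  # False
print(round(normalized_beta(*longer), 3))                  # 0.289
```

Under straight $\beta$ the one-tag constituent is popped first even though it is improbable for its length; after normalization the five-tag constituent, at roughly 0.29 per tag, is preferred.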
Normalized $\alpha_L\beta$
In the previous section, we showed that our ideal
figure of merit can be written as
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{\alpha(N^i_{j,k})\, \beta(N^i_{j,k})}{p(t_{0,n})}$$
However, the $\alpha$ term, representing outside
probability, cannot be calculated directly during a
parse, since we need the full parse of the sentence
to compute it. In some of our figures of merit, we
use the quantity $p(N^i_{j,k}, t_{0,j})$, which is closely
related to outside probability. We call this quantity
the left outside probability, and denote it $\alpha_L$.
The following recursive formula can be used to
compute $\alpha_L$. Let $\mathcal{E}^i_{j,k}$ be the set of all completed
edges, or rule expansions, in which the nonterminal
$N^i_{j,k}$ appears. For each edge $e$ in $\mathcal{E}^i_{j,k}$, we
compute the product of $\alpha_L$ of the nonterminal
appearing on the left-hand side (lhs) of the rule, the
probability of the rule itself, and $\beta$ of each
nonterminal $N^r_{q,s}$ appearing to the left of $N^i_{j,k}$ in the
rule. Then $\alpha_L(N^i_{j,k})$ is the sum of these products:
$$\alpha_L(N^i_{j,k}) = \sum_{e \in \mathcal{E}^i_{j,k}} \alpha_L\!\left(N^{\mathrm{lhs}(e)}_{\mathrm{start}(e),\mathrm{end}(e)}\right) p(\mathrm{rule}(e)) \prod_{N^r_{q,s}} \beta(N^r_{q,s})$$
This formula can be infinitely recursive,
depending on the properties of the grammar.
A method for calculating $\alpha_L$ more efficiently
can be derived from the calculations given in
(Jelinek and Lafferty, 1991).
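A single application of the recursion can be sketched as follows. The sketch assumes that $\alpha_L$ of each left-hand-side constituent is already known, so it does not address the infinite-recursion case or the more efficient method mentioned above; the function `left_outside` and the edge tuples are hypothetical.

```python
from math import prod


def left_outside(edges):
    """One step of the alpha_L recursion for a single constituent.

    Each edge is a tuple:
      (alpha_L of the lhs constituent,
       probability of the rule,
       list of beta values of the siblings to the left of the constituent)
    """
    return sum(alpha_l_lhs * rule_prob * prod(left_betas)
               for alpha_l_lhs, rule_prob, left_betas in edges)


# Two hypothetical completed edges in which the constituent appears:
# in the first it is the leftmost child, in the second it has one left sibling.
edges = [
    (0.5, 0.4, []),     # prod([]) == 1, so this contributes 0.5 * 0.4
    (0.2, 0.1, [0.3]),  # contributes 0.2 * 0.1 * 0.3
]
print(round(left_outside(edges), 4))  # 0.206
```

Because each term depends on $\alpha_L$ of a parent, any change to a parent's value has to be pushed down to its children, which is where the cost of maintaining $\alpha_L$ arises.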
A simple extension to the normalized $\beta$ model
allows us to estimate the per-word probability of
all tags in the sentence through the end of the
constituent under consideration. This allows us to
take advantage of information already obtained in
a left-right parse. We calculate this quantity as
follows:
$$\sqrt[k]{\alpha_L(N^i_{j,k})\, \beta(N^i_{j,k})}$$
We are again taking the geometric mean to
avoid thrashing by compensating for the $\alpha_L\beta$
quantity's preference for shorter constituents, as
explained in the previous section.
We refer to this figure of merit as
normalized $\alpha_L\beta$.
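The only change from normalized $\beta$ is that the root is taken over all $k$ tags up to the end of the constituent rather than over the $k - j$ tags it covers. A minimal sketch, with a hypothetical function name and invented values:

```python
def normalized_alpha_l_beta(alpha_l, beta, k):
    """Per-word probability of tags t_0 ... t_{k-1}: the k-th root of alpha_L * beta."""
    return (alpha_l * beta) ** (1.0 / k)


# Hypothetical constituent ending at k = 4 with alpha_L = 0.01 and beta = 0.1.
merit = normalized_alpha_l_beta(0.01, 0.1, 4)
print(round(merit, 4))  # 0.1778
```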
Trigram estimate 
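An alternative way to rewrite the "ideal" figure of
merit is as follows:
$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, t_{k,n})\, p(N^i_{j,k} \mid t_{0,j}, t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{0,j}, t_{k,n})\, p(t_{j,k} \mid t_{0,j}, t_{k,n})}
\end{aligned}$$
Once again applying the usual independence
assumption that given a nonterminal, the tag
sequence it generates depends only on that
nonterminal, we can rewrite the figure of merit as follows:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{0,j}, t_{k,n})\, \beta(N^i_{j,k})}{p(t_{j,k} \mid t_{0,j}, t_{k,n})}$$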
To derive an estimate of this quantity for
practical use as a figure of merit, we make some
additional independence assumptions. We assume that
$p(N^i_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(N^i)$, that is, that the
probability of a nonterminal is independent of the tags
before and after it in the sentence. We also use
a trigram model for the tags themselves, giving
$p(t_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(t_{j,k} \mid t_{j-2,j})$. Then we have:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i)\, \beta(N^i_{j,k})}{p(t_{j,k} \mid t_{j-2,j})}$$
We can calculate $\beta(N^i_{j,k})$ as usual. The $p(N^i)$
term is estimated from our PCFG as the sum of
the counts for all rules having $N^i$ as their
left-hand side, divided by the sum of the counts for
all rules. The $p(t_{j,k} \mid t_{j-2,j})$ term is just the
probability of the tag sequence $t_j \ldots t_{k-1}$ according to a
trigram model. [1] (Technically, this is not a trigram
model but a tritag model, since we are considering
sequences of tags, not words.) We refer to this
model as the trigram estimate.
[1] Our results show that the $p(N^i)$ term can be
omitted without much effect.
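The trigram estimate can be sketched directly from its three ingredients. The tritag table, the `<s>` padding used for the first two positions, and all function names and numbers below are assumptions made for illustration; a real model would use smoothed probabilities learned from a corpus.

```python
def tritag_sequence_prob(tags, j, k, tritag):
    """p(t_j ... t_{k-1} | t_{j-2}, t_{j-1}) as a product of tritag probabilities."""
    padded = ["<s>", "<s>"] + list(tags)  # assumed padding for sentence start
    prob = 1.0
    for m in range(j, k):
        # padded[m], padded[m + 1] are the two tags preceding tag m.
        key = (padded[m], padded[m + 1], padded[m + 2])
        prob *= tritag[key]
    return prob


def trigram_estimate(p_nonterminal, beta, tags, j, k, tritag):
    """p(N^i) * beta(N^i_{j,k}) / p(t_{j,k} | t_{j-2,j})."""
    return p_nonterminal * beta / tritag_sequence_prob(tags, j, k, tritag)


# Hypothetical tritag probabilities for one three-tag sentence.
tritag = {("<s>", "<s>", "det"): 0.5,
          ("<s>", "det", "n"): 0.8,
          ("det", "n", "v"): 0.6}
tags = ["det", "n", "v"]

# An NP over tags 0..1 with assumed p(NP) = 0.3 and beta = 0.6.
merit = trigram_estimate(0.3, 0.6, tags, 0, 2, tritag)
print(round(merit, 4))  # 0.45
```

Dividing by the tritag probability of the covered tags plays the normalizing role that the geometric mean plays in normalized $\beta$: a long constituent is no longer penalized merely for covering many tags.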
Prefix estimate 
We also derived an estimate of the ideal figure of
merit which takes advantage of statistics on the
first $j - 1$ tags of the sentence as well as $t_{j,k}$.
This estimate represents the probability of the
constituent in the context of the preceding tags.
$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{k,n})\, p(N^i_{j,k}, t_{0,j} \mid t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{k,n})\, p(t_{0,k} \mid t_{k,n})} \\
&= \frac{p(N^i_{j,k}, t_{0,j} \mid t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{0,k} \mid t_{k,n})}
\end{aligned}$$
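We again make the independence assumption
that $p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n}) \approx \beta(N^i_{j,k})$.
Additionally, we assume that $p(N^i_{j,k}, t_{0,j})$ and $p(t_{0,k})$ are
independent of $p(t_{k,n})$, giving
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k}, t_{0,j})\, \beta(N^i_{j,k})}{p(t_{0,k})}$$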
The denominator, $p(t_{0,k})$, is once again
calculated from a tritag model. The $p(N^i_{j,k}, t_{0,j})$ term
is just $\alpha_L$, defined above in the discussion of the
normalized $\alpha_L\beta$ model. Thus this figure of merit
can be written as
$$\frac{\alpha_L(N^i_{j,k})\, \beta(N^i_{j,k})}{p(t_{0,k})}$$
We will refer to this as the prefix estimate.
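A minimal sketch of the prefix estimate follows. It assumes $\alpha_L$ and $\beta$ are supplied by the parser and computes only the tritag denominator; the `<s>` padding, the tritag table, and all names and numbers are hypothetical.

```python
def prefix_prob(tags, k, tritag):
    """p(t_0 ... t_{k-1}) under a tritag model with assumed <s> padding."""
    padded = ["<s>", "<s>"] + list(tags)
    prob = 1.0
    for m in range(k):
        prob *= tritag[(padded[m], padded[m + 1], padded[m + 2])]
    return prob


def prefix_estimate(alpha_l, beta, tags, k, tritag):
    """alpha_L(N^i_{j,k}) * beta(N^i_{j,k}) / p(t_{0,k})."""
    return alpha_l * beta / prefix_prob(tags, k, tritag)


# Hypothetical tritag probabilities and constituent values.
tritag = {("<s>", "<s>", "det"): 0.5,
          ("<s>", "det", "n"): 0.8,
          ("det", "n", "v"): 0.6}
tags = ["det", "n", "v"]

# A VP over tag 2 (j = 2, k = 3) with assumed alpha_L = 0.05 and beta = 0.3.
merit = prefix_estimate(0.05, 0.3, tags, 3, tritag)
print(round(merit, 4))  # 0.0625
```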
The Experiment 
We used as our grammar a probabilistic
context-free grammar learned from the Brown
corpus (see (Francis and Kučera, 1982),
Carroll and Charniak (1992a) and (1992b), and
(Charniak and Carroll, 1994)). We parsed 500
sentences of length 3 to 30 (including punctuation)
from the Penn Treebank Wall Street Journal
corpus using a best-first parsing method and each
of the following estimates for $p(N^i_{j,k} \mid t_{0,n})$ as the
figure of merit:
1. straight $\beta$
2. normalized $\beta$
3. normalized $\alpha_L\beta$
4. trigram estimate
5. prefix estimate
The probability $p(N^i)$ in the trigram estimate
was determined from the same training data from
which our grammar was learned initially. Our
tritag probabilities for the trigram and prefix
estimates were learned from this data as well, using
the deleted interpolation method for smoothing.
For each figure of merit, we compared the per- 
formance of best-first parsing using that figure of 
merit to exhaustive parsing. By exhaustive pars- 
ing, we mean continuing to parse until there are 
no more constituents available to be added to the 
chart. We parse exhaustively to determine the to- 
tal probability of a sentence, that is, the sum of the 
probabilities of all parses found for that sentence. 
We then computed several quantities for best- 
first parsing with each figure of merit at the point 
where the best-first parsing method has found 
parses contributing at least 95% of the probability 
mass of the sentence. 
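The measurement procedure can be sketched as follows. The sketch assumes the best-first parser logs, for each complete parse it finds, the number of edges created so far and the probability of that parse; the function name, the log format, and the numbers are hypothetical.

```python
def percent_edges_at_mass(parse_events, total_prob, total_edges, threshold=0.95):
    """Percentage of the exhaustive parse's edges used when the best-first
    parser has found parses covering `threshold` of the probability mass.

    parse_events: (edges_created_so_far, parse_probability) pairs, in the
                  order in which the best-first parser found the parses.
    total_prob:   sum of the probabilities of all parses (exhaustive parse).
    total_edges:  number of edges created by the exhaustive parse.
    """
    found = 0.0
    for edges_so_far, parse_prob in parse_events:
        found += parse_prob
        if found >= threshold * total_prob:
            return 100.0 * edges_so_far / total_edges
    return None  # the threshold was never reached


# Made-up log for one sentence whose parses have total probability 0.001.
events = [(120, 0.0006), (150, 0.0003), (400, 0.00008)]
result = percent_edges_at_mass(events, total_prob=0.001, total_edges=1000)
print(result)  # 40.0
```

In this invented example the first two parses cover only 90% of the mass, so the parser has to continue to the third parse, by which point it has created 400 of the 1000 edges of the exhaustive parse.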
Results 
The chart below presents the following measures 
for each figure of merit: 
1. %E: The percentage of edges, or rule expan- 
sions, in the exhaustive parse that have been 
used by the best-first parse to get 95% of the 
probability mass. Edge creation is generally 
considered the best measure of CFG parser ef- 
fort. 
2. %non-0 E: The percentage of nonzero-length
edges used by the best-first parse to get 95% of
the probability mass. Zero-length edges are required
by our parser as a book-keeping measure, and as such
are virtually uneliminable. We anticipated that
removing them from consideration would highlight the
"true" differences in the figures of merit.
3. %popped: The percentage of constituents in the 
exhaustive parse that were used by the best-first 
parse to get 95% of the probability mass. 
Figure of Merit               %E     %non-0 E   %popped
straight $\beta$              97.6   97.5       93.8
normalized $\beta$            34.7   31.6       61.5
normalized $\alpha_L\beta$    39.7   36.4       57.3
trigram estimate              25.2   21.7       44.3
prefix estimate               21.8   17.4       38.3
The statistics converged to their final values 
quickly. The edge-count percentages were gener- 
ally within .01 of their final values after processing 
only 200 sentences, so the results were quite stable 
by the end of our 500-sentence test corpus. 
We gathered statistics for each sentence length 
from 3 to 30. Sentence length was limited to a 
maximum of 30 because of the huge number of 
edges that are generated in doing a full parse of 
long sentences; using this grammar, sentences in 
this length range have produced up to 130,000 
edges. Figure 1 shows a graph of %non-0 E, that 
is, the percent of nonzero-length edges needed to 
get 95% of the probability mass, for each sentence 
length. 
We also measured the total CPU time (in sec- 
onds) needed to get 95% of the probability mass 
for each of the 500 sentences. The results are pre- 
sented in the following chart: 
Figure of Merit               CPU time (seconds)
straight $\beta$              3966
normalized $\beta$            1631
normalized $\alpha_L\beta$    68660
trigram estimate              1547
prefix estimate               26520
Figure 2 shows the average CPU time to get 
95% of the probability mass for each estimate and 
each sentence length. Each estimate averaged be- 
low 1 second on sentences of fewer than 7 words. 
(The y-axis has been restricted so that the
normalized $\beta$ and trigram estimates can be better
compared.)
Previous work 
The literature shows many implementations of 
best-first parsing, but none of the previous work 
shares our goal of explicitly comparing figures of 
merit. 
Bobrow (1990) and Chitrao and Grishman 
(1990) introduced statistical agenda-based parsing 
techniques. Chitrao and Grishman implemented 
a best-first probabilistic parser and noted the 
parser's tendency to prefer shorter constituents. 
They proposed a heuristic solution of penalizing 
shorter constituents by a fixed amount per word. 
Miller and Fox (1994) compare the perfor- 
mance of parsers using three different types of 
grammars, and show that a probabilistic context- 
free grammar using inside probability (unnormal- 
ized) as a figure of merit outperforms both a 
context-free grammar and a context-dependent 
grammar. 
Kochman and Kupin (1991) propose a fig- 
ure of merit closely related to our prefix estimate. 
They do not actually incorporate this figure into 
a best-first parser. 
Magerman and Marcus (1991) use the geomet- 
ric mean to compute a figure of merit that is in- 
dependent of constituent length. Magerman and 
Weir (1992) use a similar model with a different 
parsing algorithm. 
[Figure 1: Nonzero-length edges for 95% of Probability Mass. Horizontal axis: Sentence Length. Curves: straight beta, normalized beta, normalized alphaL beta, trigram estimate, prefix estimate.]
[Figure 2: Average CPU Time for 95% of Probability Mass. Horizontal axis: Sentence Length. Curves: straight beta, normalized beta, normalized alphaL beta, trigram estimate, prefix estimate.]
Conclusions 
From the edge count statistics, it is clear that
straight $\beta$ is a poor figure of merit. Figure 1 also
demonstrates that its performance generally wors- 
ens as sentence length increases. 
The best performance in terms of edge counts
among the figures we tested came from the model that
used the most information available from the sentence,
the prefix estimate. However, so far, the additional
running time needed for the computation of $\alpha_L$
terms has exceeded the time saved by processing
fewer edges, as is made clear in the CPU time
statistics, where the two models that use $\alpha_L$
(normalized $\alpha_L\beta$ and the prefix estimate) perform
substantially worse than even the straight $\beta$ figure.
While chart parsing and calculations of $\beta$ can
be done in $O(n^3)$ time, we have been unable to
find an algorithm to compute the $\alpha_L$ terms faster
than $O(n^5)$. When a constituent is removed from
the keylist, it only affects the $\beta$ values of its
ancestors in the parse trees; however, $\alpha_L$ values are
propagated to all of the constituent's siblings to
the right and all of its descendants. Recomputing
the $\alpha_L$ terms when a constituent is removed
from the keylist can be done in $O(n^3)$ time, and
since there are $O(n^2)$ possible constituents, the
total time needed to compute the $\alpha_L$ terms in this
manner is $O(n^5)$.
The best performer in running time was the 
parser using the trigram estimate as a figure of 
merit. This figure has the additional advantage 
that it can be easily incorporated into existing 
best-first parsers using a figure of merit based on 
inside probability. From the CPU time statistics, 
it can be seen that the running time begins to show 
a real improvement over the normalized $\beta$ model
on sentences of length 25 or greater, and the trend 
suggests that the improvement would be greater 
for longer sentences. 
It is also interesting to note that while the 
models using figures of merit normalized by the 
geometric mean performed similarly to the other 
models on shorter sentences, the superior perfor- 
mance of the other models becomes more pro- 
nounced as sentence length increases. From Figure 
1, we can see that the models using the geometric 
mean appear to level off with respect to an exhaus- 
tive parse when used to parse sentences of length 
greater than about 15. The other two estimates 
seem to continue improving with greater sentence 
length. In fact, the measurements presented here 
almost certainly underestimate the true benefits of 
the better models. We restricted sentence length 
to a maximum of 30 words, in order to keep the 
number of edges in the exhaustive parse to a prac- 
tical size; however, since the percentage of edges 
needed by the best-first parse decreases with
increasing sentence length, we assume that the
improvement would be even more dramatic for
sentences longer than 30 words.

References
Robert J. Bobrow. 1990. Statistical agenda
parsing. In DARPA Speech and Language
Workshop, pages 222-224.
Glenn Carroll and Eugene Charniak.
1992a. Learning probabilistic dependency grammars
from labeled text. In Working Notes, Fall
Symposium Series, pages 25-32. AAAI.
Glenn Carroll and Eugene Charniak.
1992b. Two experiments on learning probabilistic
dependency grammars from corpora. In
Workshop Notes, Statistically-Based NLP Techniques,
pages 1-13. AAAI.
Eugene Charniak and Glenn Carroll. 1994.
Context-sensitive statistics for improved grammatical
language models. In Proceedings of the
Twelfth National Conference on Artificial Intelligence,
pages 728-733.
Mahesh V. Chitrao and Ralph Grishman.
1990. Statistical parsing of messages. In
DARPA Speech and Language Workshop, pages
263-266.
W. Nelson Francis and Henry Kučera.
1982. Frequency Analysis of English Usage:
Lexicon and Grammar. Houghton Mifflin.
Frederick Jelinek and John D. Lafferty.
1991. Computation of the probability of initial
substring generation by stochastic context-free
grammars. Computational Linguistics, 17:315-323.
Fred Kochman and Joseph Kupin. 1991.
Calculating the probability of a partial parse of
a sentence. In DARPA Speech and Language
Workshop, pages 237-240.
David M. Magerman and Mitchell P. Marcus.
1991. Parsing the Voyager domain using
Pearl. In DARPA Speech and Language Workshop,
pages 231-236.
David M. Magerman and Carl Weir. 1992.
Efficiency, robustness and accuracy in Picky
chart parsing. In Proceedings of the 30th ACL
Conference, pages 40-47.
Scott Miller and Heidi Fox. 1994. Automatic
grammar acquisition. In Proceedings
of the Human Language Technology Workshop,
pages 268-271.
