New Figures of Merit for 
Best-First Probabilistic Chart Parsing 
Sharon A. Caraballo* 
Brown University 
Eugene Charniak*
Brown University 
Best-first parsing methods for natural language try to parse efficiently by considering the most 
likely constituents first. Some figure of merit is needed by which to compare the likelihood of 
constituents, and the choice of this figure has a substantial impact on the efficiency of the parser. 
While several parsers described in the literature have used such techniques, there is little published 
data on their efficacy, much less attempts to judge their relative merits. We propose and evaluate 
several figures of merit for best-first parsing, and we identify an easily computable figure of merit 
that provides excellent performance on various measures and two different grammars. 
1. Introduction 
Chart parsing is a commonly used algorithm for parsing natural language texts. The 
chart is a data structure that contains all of the constituents for which subtrees have 
been found, that is, constituents for which a derivation has been found and which may 
therefore appear in some complete parse of the sentence. The agenda is a structure that 
stores a list of constituents for which a derivation has been found but which have not 
yet been combined with other constituents. Initially, the agenda contains the terminal 
symbols from the sentence to be parsed. A constituent is removed from the agenda 
and added to the chart, and the system considers how this constituent can be used 
to extend its current structural hypothesis by combining with other constituents in 
the chart according to the grammar rules. (We will often refer to these expansions of 
rules as "edges".) In general this can lead to the creation of new, more encompassing 
constituents, which themselves are then added to the agenda. When one constituent 
has been processed, a new one is chosen to be removed from the agenda, and so 
on. Traditionally, the agenda is represented as a stack, so that the last item added 
to the agenda is the next one removed. Chart parsing is described extensively in the 
literature; for one such discussion see Section 1.4 of Charniak (1993).
Best-first probabilistic chart parsing is a variation of chart parsing that attempts 
to find the most likely parses first, by adding constituents to the chart in order of 
the likelihood that they will appear in a correct parse, rather than simply popping 
constituents off of a stack. Some probabilistic figure of merit is assigned to the con- 
stituents on the agenda, and the constituent maximizing this value is the next to be 
added to the chart. 
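The agenda discipline just described can be sketched as a priority queue keyed on the figure of merit. The following Python fragment is an illustrative sketch only, not the parser used in the experiments reported here: the function name `best_first` and its parameters `combine`, `merit`, and `is_goal` are hypothetical, and constituents are assumed to be hashable values.

```python
import heapq
from itertools import count


def best_first(initial, combine, merit, is_goal, max_pops=10_000):
    """Move constituents from agenda to chart in order of decreasing merit.

    initial  -- constituents placed on the agenda at the start
    combine  -- combine(c, chart) returns new constituents built from c
    merit    -- figure of merit; larger means "add to the chart sooner"
    is_goal  -- predicate that recognises a complete parse
    """
    tie = count()  # tie-breaker so the heap never compares constituents
    agenda = [(-merit(c), next(tie), c) for c in initial]
    heapq.heapify(agenda)
    seen = set(initial)
    chart = []
    for _ in range(max_pops):
        if not agenda:
            break
        _, _, c = heapq.heappop(agenda)   # constituent maximising the merit
        chart.append(c)
        if is_goal(c):
            return c, chart
        for new in combine(c, chart):
            if new not in seen:
                seen.add(new)
                heapq.heappush(agenda, (-merit(new), next(tie), new))
    return None, chart
```

Replacing the heap with a plain stack, and ignoring `merit`, would recover the traditional agenda described earlier.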
In this paper we consider probabilities primarily based on probabilistic context- 
free grammars, though in principle, other, more complicated schemes could be used. 
The purpose of this work is to compare how well several figures of merit select 
* Computer Science Department, Box 1910, Brown University, Providence, RI 02912. E-mail: {sc, 
ec}@cs.brown.edu 
© 1998 Association for Computational Linguistics
Computational Linguistics Volume 24, Number 2 
Figure 1
Constituent $N^i_{j,k}$ in a sentence $t_{0,n}$.
constituents to be moved from the agenda to the chart. Ideally, we would like to use 
as our figure of merit the conditional probability of that constituent, given the entire 
sentence, in order to choose a constituent that not only appears likely in isolation, 
but is most likely given the sentence as a whole; that is, we would like to pick the 
constituent that maximizes the following quantity: 
$$p(N^i_{j,k} \mid t_{0,n})$$
where $t_{0,n}$ is the sequence of the $n$ tags, or parts of speech, in the sentence (numbered
$t_0, \ldots, t_{n-1}$), and $N^i_{j,k}$ is a nonterminal of type $i$ covering terms $t_j \ldots t_{k-1}$. (See Figure 1.)
In our experiments, we use only tag sequences (as given in the test data) for pars- 
ing. More accurate probability estimates should be attainable using lexical information 
in future experiments, as more detail usually leads to better statistics, but lexicalized 
figures of merit are beyond the scope of the research described here. 
Note that our "ideal" figure is simply a heuristic, since there is no guarantee 
that a constituent that scores well on this measure will appear in the correct parse 
of a sentence. For example, there may be a very large number of low-probability 
derivations of $N^i_{j,k}$, which are combined here to give a high value, but a parse of the
sentence can only include one of these derivations, making it unlikely that $N^i_{j,k}$ appears
in the most probable parse of the sentence. On the other hand, there is no reason to 
believe that such cases are common in practice. 
We cannot calculate $p(N^i_{j,k} \mid t_{0,n})$, since in order to do so, we would need to
completely parse the sentence. In this paper, we examine the performance of several
proposed figures of merit that approximate it in one way or another, using two different
grammars. We identify a figure of merit that gives superior results on all of our
performance measures and on both grammars.
Section 2 of this paper describes the method we used to determine the effectiveness 
of figures of merit, that is, to compare how well they choose constituents to be moved 
from the agenda to the chart. Section 2.1 explains the experiment, Section 2.2 describes 
the measures we used to compare the performance of the figures of merit, and Section 
2.3 describes a model we used to represent the performance of a traditional parser 
using a simple stack as an agenda. 
In Section 3, we describe and compare three simple and easily computable figures 
of merit based on inside probability. Sections 3.1 through 3.3 describe each figure 
in detail, and Section 3.4 presents the results of an experiment comparing these three 
figures. Sections 4 and 5 have a similar structure to Section 3, with Section 4 evaluating 
two figures of merit using statistics on the left-side context of the constituent, and 
Caraballo and Charniak Figures of Merit 
Section 5 evaluating three additional figures of merit using statistics on the context on 
both sides of the constituent. Section 6 contains a table summarizing the results from 
Sections 3, 4, and 5. 
In Section 7, we use another grammar in the experiment, to verify that our results 
are not an artifact of the grammar used for parsing. Section 8 describes previous work 
in this area, and Section 9 presents our conclusions and recommendations. 
There are also three appendices to this paper. Appendix A gives our method for 
computing inside probability estimates while maintaining parser speed. Appendix B 
explains how we obtained our boundary statistics used in Section 5. Appendix C 
presents data comparing the parsing accuracy obtained by each of our parsers as the 
number of edges they create increases. 
2. Comparing Figures of Merit 
2.1 The Experiment 
We used as our first grammar a probabilistic context-free grammar learned from the
Brown corpus (see Francis and Kučera [1982] for a description of the Brown Corpus,
and Carroll and Charniak [1992a, 1992b], and Charniak and Carroll [1994] for
grammar and training details). This grammar contains about 5,000 rules using 32
terminal and nonterminal symbols. We parsed 500 sentences of length 3 to 30 (including
punctuation) from the Penn Treebank Wall Street Journal corpus (Marcus, Santorini,
and Marcinkiewicz 1993) using a best-first parsing method and various estimates for
$p(N^i_{j,k} \mid t_{0,n})$ as the figure of merit.
For each figure of merit, we compared the performance of best-first parsing using 
that figure of merit to exhaustive parsing. By exhaustive parsing, we mean continuing 
to parse until there are no more constituents available to be added to the chart. We 
parse exhaustively to determine the total probability of a sentence, that is, the sum of 
the probabilities of all parses found for that sentence. 
We then computed several quantities for best-first parsing with each figure of merit 
at the point where the best-first parsing method has found parses contributing at least 
95% of the probability mass of the sentence. The 95% figure is simply a convenience; 
see Appendix C for a discussion of speed versus accuracy. 
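The 95% stopping point can be made concrete with a short sketch. The Python fragment below is an illustration under stated assumptions rather than the code used in the experiments: the helper names `edges_to_reach_mass` and `percent_of_exhaustive` are hypothetical, and the best-first parser is assumed to report, for each parse it finds, the number of edges created so far together with that parse's probability.

```python
def edges_to_reach_mass(parses, total_prob, fraction=0.95):
    """Return the edge count at which the parses found so far first account
    for `fraction` of the sentence's total probability, or None if never.

    parses     -- (edges_created_so_far, parse_probability) in discovery order
    total_prob -- sum of the probabilities of all parses (exhaustive parse)
    """
    target = fraction * total_prob
    found = 0.0
    for edges, prob in parses:
        found += prob
        if found >= target:
            return edges
    return None


def percent_of_exhaustive(used, exhaustive):
    """Express a best-first count as a percentage of the exhaustive count."""
    return 100.0 * used / exhaustive
```

For example, if the exhaustive parse of a sentence creates 1,000 edges and the best-first parser reaches 95% of the probability mass after 250 edges, the sentence contributes 25% on the edge measure.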
2.2 Measures Used 
We compared the figures of merit using the following measures: 
1. %E: The percentage of edges, or rule expansions, in the exhaustive parse
that have been used by the best-first parse to get 95% of the probability
mass. Edge creation is a good measure of CFG parser effort, since it is
independent of platform and implementation.
2. %non-0 E: The percentage of nonzero-length edges used by the best-first
parse to get 95%. Zero-length edges are required by our parser as a
bookkeeping measure, and, as such, virtually cannot be eliminated. We
anticipated that removing them from consideration would highlight the
"true" differences in the figures of merit.
3. %popped: The percentage of constituents in the exhaustive parse that
were used by the best-first parse to get 95% of the probability mass. This
measure was included to confirm that a figure of merit that is efficient in
terms of edge creation is also efficient in terms of constituent creation.
4. CPU time: The total CPU time (in seconds) needed to get 95% of the
probability mass for all of the 500 sentences.
Figure 2
$\beta$ includes only words within the constituent.
The statistics converged to their final values quickly. The edge-count percentages 
were generally within .01 of their final values after processing only 200 sentences, so 
the results were quite stable by the end of our 500-sentence test corpus. 
We gathered statistics for each sentence length from 3 to 30. Sentence length was 
limited to a maximum of 30 because of the huge number of edges that are generated 
in doing a full parse of long sentences; using this grammar, sentences in this length 
range have produced up to 130,000 edges. 
2.3 The "Stack" Model 
As a basis for comparison, we measured the CPU time for a non-best-first version of 
the parser to completely parse all 500 sentences. The CPU time needed by this version 
of the parser was 4,882 seconds. For a best-first version of the parser to be useful, it 
must be able to find the most probable parse (or a reasonably good parse, depending 
on the application) in less than this amount of time. Here, for the best-first parsers, we 
will use for convenience the time needed to get 95% of the sentence's total probability 
mass. 
3. Simple Figures of Merit 
3.1 Straight $\beta$
It seems reasonable to base a figure of merit on the inside probability $\beta$ of the
constituent. Inside probability is defined as the probability of the words or tags in the
constituent given that the constituent is dominated by a particular nonterminal symbol;
see Figure 2. This seems to be a reasonable basis for comparing constituent probabilities,
and has the additional advantage that it is easy to compute during chart parsing.
Appendix A gives details of our on-line computation of $\beta$.
The inside probability of the constituent $N^i_{j,k}$ is defined as:
$$\beta(N^i_{j,k}) \equiv p(t_{j,k} \mid N^i)$$
where $N^i$ represents the $i$th nonterminal symbol.
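To make the definition concrete, the sketch below computes inside probabilities bottom-up for a toy grammar. It is a minimal illustration, not the on-line method of Appendix A: it assumes a grammar in Chomsky normal form (the grammars used in our experiments are not restricted in this way), and the function name `inside_probabilities` and the dictionary layouts for `lexical` and `binary` rules are hypothetical.

```python
from collections import defaultdict


def inside_probabilities(tags, lexical, binary):
    """Return beta[(A, j, k)] = p(t_j ... t_{k-1} | A).

    tags    -- the tag sequence t_0 ... t_{n-1}
    lexical -- {(A, tag): prob} for rules A -> tag
    binary  -- {(A, B, C): prob} for rules A -> B C
    """
    n = len(tags)
    beta = defaultdict(float)
    # Width-1 spans come directly from the lexical rules.
    for j, tag in enumerate(tags):
        for (a, t), p in lexical.items():
            if t == tag:
                beta[(a, j, j + 1)] += p
    # Wider spans sum over every rule and every split point m.
    for width in range(2, n + 1):
        for j in range(0, n - width + 1):
            k = j + width
            for (a, b, c), p in binary.items():
                for m in range(j + 1, k):
                    left = beta.get((b, j, m), 0.0)
                    right = beta.get((c, m, k), 0.0)
                    if left and right:
                        beta[(a, j, k)] += p * left * right
    return beta
```

Each entry multiplies one rule probability by the inside probabilities of the rule's children, which is why longer constituents accumulate more factors.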
Figure 3
$\alpha$ includes the entire context of the constituent.
In terms of our earlier discussion, our "ideal" figure of merit can be rewritten as: 
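$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(N^i_{j,k}, t_{0,j}, t_{j,k}, t_{k,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, N^i_{j,k}, t_{k,n})\, p(t_{j,k} \mid t_{0,j}, N^i_{j,k}, t_{k,n})}{p(t_{0,n})}
\end{aligned}$$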
We apply the usual independence assumption that given a nonterminal, the tag 
sequence it generates depends only on that nonterminal, giving: 
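$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &\approx \frac{p(t_{0,j}, N^i_{j,k}, t_{k,n})\, p(t_{j,k} \mid N^i_{j,k})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, N^i_{j,k}, t_{k,n})\, \beta(N^i_{j,k})}{p(t_{0,n})}
\end{aligned}$$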
The first term in the numerator is just the definition of the outside probability $\alpha$ of
the constituent. Outside probability $\alpha$ of a constituent $N^i_{j,k}$ is defined as the probability
of that constituent and the rest of the words in the sentence (or rest of the tags in the
tag sequence, in our case); see Figure 3.
$$\alpha(N^i_{j,k}) \equiv p(t_{0,j}, N^i_{j,k}, t_{k,n}).$$
We can therefore rewrite our ideal figure of merit as: 
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{\alpha(N^i_{j,k})\, \beta(N^i_{j,k})}{p(t_{0,n})}$$
In this equation, we can see that $\alpha(N^i_{j,k})$ and $p(t_{0,n})$ represent the influence of the
surrounding words. Thus using $\beta$ alone assumes that $\alpha$ and $p(t_{0,n})$ can be ignored.
We will refer to this figure of merit as straight $\beta$.
3.2 Normalized $\beta$
One side effect of omitting the $\alpha$ and $p(t_{0,n})$ terms in the straight $\beta$ figure above is
that inside probability alone tends to prefer shorter constituents to longer ones, as the
inside probability of a longer constituent involves the product of more probabilities.
This can result in a "thrashing" effect as noted in Chitrao and Grishman (1990), where 
the system parses short constituents, even very low-probability ones, while avoiding 
combining them into longer constituents. To avoid thrashing, some technique is used 
to normalize the inside probability for use as a figure of merit. One approach is to take 
the geometric mean of the inside probability, to obtain a per-word inside probability. 
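(In the "ideal" model, the $p(t_{0,n})$ term acts as a normalizing factor.)
The per-word inside probability of the constituent $N^i_{j,k}$, which covers the $k - j$ tags
$t_j \ldots t_{k-1}$, is calculated as:
$$\sqrt[k-j]{\beta(N^i_{j,k})}$$
We will refer to this figure as normalized $\beta$.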
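The effect of the geometric mean can be seen in a small sketch. The function name `normalized_beta` is hypothetical, and the numbers in the usage note are invented for illustration; they are not taken from our experiments.

```python
def normalized_beta(beta, j, k):
    """Per-tag (geometric mean) inside probability of a constituent that
    covers tags t_j ... t_{k-1} and has inside probability `beta`."""
    if k <= j:
        raise ValueError("constituent must cover at least one tag")
    return beta ** (1.0 / (k - j))
```

Suppose a one-tag constituent has inside probability 0.04 and a four-tag constituent has inside probability 0.0016. Straight inside probability ranks the short constituent first, whereas the per-tag values are 0.04 and roughly 0.2, so the normalized figure ranks the longer constituent first.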
3.3 Trigram Estimate 
An alternative way to rewrite the "ideal" figure of merit is as follows: 
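$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, t_{k,n})\, p(N^i_{j,k} \mid t_{0,j}, t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{0,j}, t_{k,n})\, p(t_{j,k} \mid t_{0,j}, t_{k,n})}
\end{aligned}$$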
Once again applying the usual independence assumption that given a nonterminal, 
the tag sequence it generates depends only on that nonterminal, we can rewrite the 
figure of merit as follows: 
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{0,j}, t_{k,n})\, \beta(N^i_{j,k})}{p(t_{j,k} \mid t_{0,j}, t_{k,n})}$$
To derive an estimate of this quantity for practical use as a figure of merit, we make
some additional independence assumptions. We assume that $p(N^i_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(N^i)$,
that is, that the probability of a nonterminal is independent of the tags before and
after it in the sentence. We also use a trigram model for the tags themselves, giving
$p(t_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(t_{j,k} \mid t_{j-2}, t_{j-1})$. Then we have:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i)\, \beta(N^i_{j,k})}{p(t_{j,k} \mid t_{j-2}, t_{j-1})}.$$
We can calculate $\beta(N^i_{j,k})$ as usual.
The $p(N^i)$ term is estimated from our PCFG and the training data from which the
grammar was learned. We estimate $p(N^i)$ as the sum of the counts for all rules having
$N^i$ as their left-hand side, divided by the sum of the counts for all rules.$^1$
The $p(t_{j,k} \mid t_{j-2}, t_{j-1})$ term is just the probability of the tag sequence $t_j \ldots t_{k-1}$
according to a trigram model. (Technically, this is not a trigram model but a tritag
model, since we are considering sequences of tags, not words.) Our tritag probabilities
$p(t_a \mid t_{a-2}, t_{a-1})$ were learned from the training data used for the grammar, using
1 Our results show that the $p(N^i)$ term can be omitted from this figure of merit without much effect.
Table 1
Results for the $\beta$ estimates.
Figure of Merit       %E     %non-0 E   %popped   CPU Time (s)
straight $\beta$      97.6   97.5       93.8      3,966
normalized $\beta$    34.7   31.6       61.5      1,631
trigram estimate      25.2   21.7       44.3      1,547
"stack"               --     --         --        4,882
[Plot omitted. x-axis: Sentence Length; curves: straight $\beta$, normalized $\beta$, trigram estimate.]
Figure 4
Nonzero-length edges for 95% of the probability mass for the $\beta$ estimates.
the deleted interpolation method for smoothing. Our figure of merit uses: 
$$p(t_{j,k} \mid t_{j-2}, t_{j-1}) \approx \prod_{a=j}^{k-1} p(t_a \mid t_{a-2}, t_{a-1})$$
We refer to this figure of merit as the trigram estimate. 
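The trigram estimate can be sketched as follows. This is an illustrative fragment, not our implementation: the function names are hypothetical, the smoothed tritag model is assumed to be supplied as a callable `tritag(t, t_minus_2, t_minus_1)`, and padding the start of the sentence with a `"<s>"` symbol is an assumption made here so that the first two tags have a defined context.

```python
def tritag_sequence_prob(tags, j, k, tritag, start="<s>"):
    """Product over a = j .. k-1 of p(t_a | t_{a-2}, t_{a-1})."""
    padded = [start, start] + list(tags)   # tags[a] sits at padded[a + 2]
    prob = 1.0
    for a in range(j, k):
        prob *= tritag(padded[a + 2], padded[a], padded[a + 1])
    return prob


def trigram_estimate(p_nonterminal, beta, tags, j, k, tritag):
    """p(N^i) * beta(N^i_{j,k}) / p(t_{j,k} | t_{j-2}, t_{j-1})."""
    return p_nonterminal * beta / tritag_sequence_prob(tags, j, k, tritag)
```

Because the denominator shrinks with the length of the constituent in the same way the inside probability does, this figure needs no separate length normalization.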
3.4 Results 
The results for the three figures of merit introduced in the last section according to 
the measurements given in Section 2.2 are shown in Table 1 (the time to fully parse 
using the "stack" model is included for easy reference). 
Figure 4 expands the %non-0 E data to show the percent of nonzero-length edges 
needed to get 95% of the probability mass for each sentence length. 
Straight $\beta$ performs quite poorly on this measure. In order to find 95% of the
probability mass for a sentence, a parser using this figure of merit typically needs to
do over 90% of the work. On the other hand, normalized $\beta$ and the trigram estimate
both result in substantial savings of work. However, while these two models produce 
[Plot omitted. x-axis: Sentence Length; curves: "stack", straight $\beta$, normalized $\beta$, trigram estimate.]
Figure 5
Average CPU time for 95% of the probability mass for the $\beta$ estimates.
near-equivalent performance for short sentences, for longer sentences, with length 
greater than about 15 words, the trigram estimate gains a clear advantage. In fact, the 
performance of normalized $\beta$ appears to level off in this range, while the amount of
work done using the trigram estimate shows a continuing downward trend. 
Figure 5 shows the average CPU time to get 95% of the probability mass for 
each estimate and each sentence length. Each estimate averaged below 1 second on 
sentences of fewer than 7 words. (The y-axis has been restricted so that the normalized
$\beta$ and trigram estimates can be better compared.)
Note that while straight $\beta$ does perform better than the "stack" model in CPU time,
the two models approach equivalent performance as sentence length increases, which 
is what would be expected from the edge count measures. The other two models 
provide a real time savings over the "stack" model, as can be seen from Figure 5 
and from the total CPU times given earlier. Through most of the length range, the 
CPU time needed by the normalized $\beta$ and the trigram estimate is quite close, but at
the upper end of the range we can see better performance by the trigram estimate. 
(This improvement comes later than in the edge count statistics because of the small 
additional amount of overhead work needed to use the trigram estimate.) 
4. Figures Involving Left Outside Probability 
4.1 Normalized $\alpha_L\beta$
Earlier, we showed that our ideal figure of merit can be written as:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{\alpha(N^i_{j,k})\, \beta(N^i_{j,k})}{p(t_{0,n})}$$
However, the $\alpha$ term, representing outside probability, cannot be calculated
directly during a parse, since we need the full parse of the sentence to compute it. In
some of our figures of merit, we use the quantity $p(N^i_{j,k}, t_{0,j})$, which is closely related
to outside probability. We call this quantity the left outside probability, and denote it
$\alpha_L$ (see Figure 6).
Figure 6
Left outside context.
The following recursive formula can be used to compute $\alpha_L$. Let $\mathcal{E}^i_{j,k}$ be the set of
all edges, or rule expansions, in which the nonterminal $N^i_{j,k}$ appears. For each edge $e$
in $\mathcal{E}^i_{j,k}$, we compute the product of $\alpha_L$ of the nonterminal appearing on the left-hand
side (lhs) of the rule, the probability of the rule itself, and $\beta$ of each nonterminal $N^q_{r,s}$
appearing to the left of $N^i_{j,k}$ in the rule. Then $\alpha_L(N^i_{j,k})$ is the sum of these products:
$$\alpha_L(N^i_{j,k}) = \sum_{e \in \mathcal{E}^i_{j,k}} \alpha_L\bigl(N^{\mathrm{lhs}(e)}_{\mathrm{start}(e),\mathrm{end}(e)}\bigr)\, p(\mathrm{rule}(e)) \prod_{N^q_{r,s}} \beta(N^q_{r,s}).$$
Given a complete parse of the sentence, the formula above gives an exact value
for $\alpha_L$. During parsing, the set $\mathcal{E}^i_{j,k}$ is not complete, and so the formula gives an
approximation of $\alpha_L$.
This formula can be infinitely recursive, depending on the properties of the grammar.
A method for calculating $\alpha_L$ more efficiently can be derived from the calculations
given in Jelinek and Lafferty (1991).
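A single application of the summation above can be sketched as follows. The fragment is deliberately limited: it assumes the left outside probability of each parent constituent is already known, it does not address the recursion or the more efficient calculation mentioned above, and the function name `left_outside` and the tuple layout of `edges` are hypothetical.

```python
from math import prod


def left_outside(edges):
    """Sum, over the edges in which a constituent appears, of
    alpha_L(parent) * p(rule) * product of beta over its left siblings.

    edges -- iterable of (alpha_left_of_parent, rule_prob, left_sibling_betas)
    """
    return sum(
        alpha_parent * rule_prob * prod(left_betas)  # prod(()) is 1
        for alpha_parent, rule_prob, left_betas in edges
    )
```

For a constituent that is the leftmost child in one edge (no left siblings) and the third child in another, the two contributions are simply added; any edge not yet in the chart is missing from the sum, which is the source of the approximation during parsing.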
A simple extension to the normalized $\beta$ model allows us to estimate the per-word
probability of all tags in the sentence through the end of the constituent under
consideration. This allows us to take advantage of information already obtained in a
left-right parse. We calculate this quantity as follows:
$$\sqrt[k]{\alpha_L(N^i_{j,k})\, \beta(N^i_{j,k})}$$
We are again taking the geometric mean to avoid thrashing by compensating for
the $\alpha_L\beta$ quantity's preference for shorter constituents, as explained in the previous
section.
We refer to this figure of merit as normalized $\alpha_L\beta$.
4.2 Prefix Estimate 
We also derived an estimate of the ideal figure of merit that takes advantage of statistics 
on the first $j - 1$ tags of the sentence as well as $t_{j,k}$. This estimate represents the
Table 2
Results for the $\alpha_L\beta$ estimates.
Figure of Merit              %E     %non-0 E   %popped   CPU Time (s)
normalized $\alpha_L\beta$   39.7   36.4       57.3      68,660
prefix estimate              21.8   17.4       38.3      26,520
probability of the constituent in the context of the preceding tags. 
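$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{k,n})\, p(N^i_{j,k}, t_{0,j} \mid t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{k,n})\, p(t_{0,k} \mid t_{k,n})} \\
&= \frac{p(N^i_{j,k}, t_{0,j} \mid t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{0,k} \mid t_{k,n})}
\end{aligned}$$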
We again make the independence assumption that $p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n}) \approx \beta(N^i_{j,k})$.
Additionally, we assume that $p(N^i_{j,k}, t_{0,j})$ and $p(t_{0,k})$ are independent of $p(t_{k,n})$, giving:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k}, t_{0,j})\, \beta(N^i_{j,k})}{p(t_{0,k})}$$
The denominator, $p(t_{0,k})$, is once again calculated from a tritag model. The $p(N^i_{j,k}, t_{0,j})$
term is just $\alpha_L$, defined above in the discussion of the normalized $\alpha_L\beta$ model. Thus
this figure of merit can be written as:
$$\frac{\alpha_L(N^i_{j,k})\, \beta(N^i_{j,k})}{p(t_{0,k})}$$
We will refer to this as the prefix estimate.
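The two figures of merit of this section differ only in how the product of left outside and inside probability is normalized, as the following sketch shows. The function names are hypothetical, and the inputs are assumed to have been computed elsewhere (the left outside probability, the inside probability, and the tritag probability of the prefix $t_{0,k}$).

```python
def normalized_alpha_beta(alpha_left, beta, k):
    """k-th root of alpha_L * beta: a per-tag probability for the k tags
    t_0 ... t_{k-1} up to the end of the constituent."""
    if k < 1:
        raise ValueError("constituent must end after at least one tag")
    return (alpha_left * beta) ** (1.0 / k)


def prefix_estimate(alpha_left, beta, prefix_prob):
    """alpha_L * beta divided by p(t_{0,k}) from the tritag model."""
    if prefix_prob <= 0.0:
        raise ValueError("p(t_{0,k}) must be positive")
    return alpha_left * beta / prefix_prob
```

The first is a geometric mean; the second is an estimate of a conditional probability, which is the distinction drawn between the two families of models in the results that follow.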
4.3 Results 
The results for the figures of merit introduced in the previous section according to the 
measurements given in Section 2.2 are shown in Table 2. 
Figure 7 shows a graph of %non-0 E for each sentence length for the two $\alpha_L$ models
and the related $\beta$ models.
Figure 7 illustrates two main points. First, the deterioration of the performance of 
the geometric-mean-based models with sentence length can be seen clearly. Second, 
when we consider only the two conditional-probability models, we can see that the 
additional information obtained from context in the prefix estimate gives a substantial 
improvement in this measure as compared to the trigram estimate. 
However, the CPU time needed to compute the $\alpha_L$ term exceeds the time saved
by processing fewer edges. Note that using this estimate, the parser took over 26,000 
seconds to get 95% of the probability mass, while the "stack" model can exhaustively 
parse the test data in less than 5,000 seconds. Figure 8 shows the average CPU time 
for each sentence length. 
While chart parsing and calculations of $\beta$ can be done in $O(n^3)$ time (see Appendix A),
we have been unable to find an algorithm to compute the $\alpha_L$ terms faster
[Plot omitted. x-axis: Sentence Length; curves: normalized $\beta$, normalized $\alpha_L\beta$, trigram estimate, prefix estimate.]
Figure 7
Nonzero-length edges for 95% of the probability mass for the $\alpha_L\beta$ estimates.
[Plot omitted. x-axis: Sentence Length; curves: "stack", normalized $\beta$, normalized $\alpha_L\beta$, trigram estimate, prefix estimate.]
Figure 8
Average CPU time for 95% of the probability mass for the $\alpha_L\beta$ estimates.
Figure 9
Left boundary context.
than $O(n^5)$. When a constituent is removed from the agenda, it only affects the $\beta$
values of its ancestors in the parse trees; however, $\alpha_L$ values are propagated to all of the
constituent's siblings to the right and all of its descendants. Recomputing the $\alpha_L$ terms
when a constituent is removed from the agenda can be done in $O(n^3)$ time, and since
there are $O(n^2)$ possible constituents, the total time needed to compute the $\alpha_L$ terms
in this manner is $O(n^5)$.
5. Figures Using Boundary Statistics 
5.1 Left Boundary Trigram Estimate 
Although the $\alpha_L$-based models seem impractical, the edge-count and constituent-count
statistics show that contextual information is useful. We can derive an estimate similar 
to the prefix estimate but containing a much simpler model of the context as follows: 
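$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j}, t_{k,n})\, p(N^i_{j,k} \mid t_{0,j}, t_{k,n})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j}, t_{k,n})}{p(t_{0,j}, t_{k,n})\, p(t_{j,k} \mid t_{0,j}, t_{k,n})}
\end{aligned}$$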
Once again applying the usual independence assumption that given a nonterminal, 
the tag sequence it generates depends only on that nonterminal, we can rewrite the 
figure of merit as follows: 
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{0,j}, t_{k,n})\, \beta(N^i_{j,k})}{p(t_{j,k} \mid t_{0,j}, t_{k,n})}$$
As usual, we use a trigram model for the tags, giving
$p(t_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(t_{j,k} \mid t_{j-2}, t_{j-1})$.
Now, we assume that $p(N^i_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(N^i_{j,k} \mid t_{j-1})$, that is, that the probability
of a nonterminal is dependent on the tag immediately before it in the sentence (see
Figure 9). Then we have:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{j-1})\, \beta(N^i_{j,k})}{p(t_{j,k} \mid t_{j-2}, t_{j-1})}$$
We can calculate $\beta(N^i_{j,k})$ and the tritag probabilities as usual. The $p(N^i_{j,k} \mid t_{j-1})$
probabilities are estimated from our training data by parsing the training data and
Figure 10 
Boundary context. 
counting the occurrences of the nonterminal and the tag weighted by their probability 
in the parse. (Further details are provided in Appendix B.) 
We will refer to this figure as the left boundary trigram estimate. 
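The weighted counting used for the boundary statistic can be sketched as below. This is only an illustration of the idea: the actual procedure is the one given in Appendix B, and the normalization shown here (dividing each weighted count by the number of times the preceding tag occurs) is an assumption, as are the function name `left_boundary_stats` and the layout of its arguments.

```python
from collections import defaultdict


def left_boundary_stats(observations, tag_counts):
    """Estimate p(N | t_{j-1}) from weighted constituent occurrences.

    observations -- iterable of (prev_tag, nonterminal, weight), where weight
                    is the probability of that constituent in the parse
    tag_counts   -- {tag: number of occurrences in the training data}
                    (assumed normalizer; see the note above)
    """
    weighted = defaultdict(float)
    for prev_tag, nonterminal, weight in observations:
        weighted[(nonterminal, prev_tag)] += weight
    return {
        (nonterminal, prev_tag): total / tag_counts[prev_tag]
        for (nonterminal, prev_tag), total in weighted.items()
    }
```

Because the table depends only on the training data, it can be built once and consulted by simple lookup during parsing.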
5.2 Boundary Trigram Estimate 
We can derive a similar estimate using context on both sides of the constituent as 
follows: 
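$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j})\, p(N^i_{j,k} \mid t_{0,j})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j})\, p(t_k \mid t_{0,j}, N^i_{j,k}, t_{j,k})\, p(t_{k+1,n} \mid t_{0,j}, N^i_{j,k}, t_{j,k}, t_k)}{p(t_{0,j})\, p(t_{j,k} \mid t_{0,j})\, p(t_k \mid t_{0,k})\, p(t_{k+1,n} \mid t_{0,k+1})} \\
&= \frac{p(N^i_{j,k} \mid t_{0,j})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j})\, p(t_k \mid t_{0,j}, N^i_{j,k}, t_{j,k})\, p(t_{k+1,n} \mid t_{0,k+1}, N^i_{j,k})}{p(t_{j,k} \mid t_{0,j})\, p(t_k \mid t_{0,k})\, p(t_{k+1,n} \mid t_{0,k+1})}
\end{aligned}$$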
Once again applying the usual independence assumption that given a nonterminal, 
the tag sequence it generates depends only on that nonterminal and also assuming 
that the probability of tk+l,n depends only on the previous tags, we can rewrite the 
figure of merit as follows: 
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{0,j})\, \beta(N^i_{j,k})\, p(t_k \mid t_{0,k}, N^i_{j,k})}{p(t_{j,k+1} \mid t_{0,j})}$$
Now we add some new independence assumptions. We assume that the proba- 
bility of the nonterminal depends only on the immediately preceding tag, and that 
the probability of the tag immediately following the nonterminal depends only on the 
nonterminal (see Figure 10), giving: 
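$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{j-1})\, \beta(N^i_{j,k})\, p(t_k \mid N^i_{j,k})}{p(t_{j,k+1} \mid t_{0,j})}$$
As usual, we use a trigram model for the tags, giving
$p(t_{j,k} \mid t_{0,j}, t_{k,n}) \approx p(t_{j,k} \mid t_{j-2}, t_{j-1})$. Then we have:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{j-1})\, \beta(N^i_{j,k})\, p(t_k \mid N^i_{j,k})}{p(t_{j,k+1} \mid t_{j-2}, t_{j-1})}$$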
We can calculate $\beta(N^i_{j,k})$ and the tritag probabilities as usual. The $p(N^i_{j,k} \mid t_{j-1})$ and
$p(t_k \mid N^i_{j,k})$ probabilities are estimated from our training data by parsing the training
data and counting the occurrences of the nonterminal and the tag weighted by their
probability in the parse.$^2$ Again, see Appendix B for details of how these estimates
were obtained.
We will refer to this figure as the boundary trigram estimate. 
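The boundary trigram estimate is a ratio of products of small probabilities, so a practical implementation might work with logarithms. The sketch below makes that choice for illustration; nothing in this paper specifies it, and the function name `log_boundary_trigram` and its argument layout are hypothetical.

```python
import math


def log_boundary_trigram(p_left, beta, p_right, tritag_probs):
    """Log of  p(N | t_{j-1}) * beta(N) * p(t_k | N)  divided by the tritag
    probability of t_j ... t_k.

    p_left       -- p(N^i_{j,k} | t_{j-1})
    beta         -- inside probability of the constituent
    p_right      -- p(t_k | N^i_{j,k})
    tritag_probs -- the factors p(t_a | t_{a-2}, t_{a-1}) for a = j .. k
    """
    numerator = math.log(p_left) + math.log(beta) + math.log(p_right)
    denominator = sum(math.log(p) for p in tritag_probs)
    return numerator - denominator
```

Since the logarithm is monotonic, ordering the agenda by this value selects the same constituent as ordering by the estimate itself.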
5.3 Boundary Statistics Only 
We also wished to examine whether contextual information by itself is sufficient as a 
figure of merit. We can derive an estimate based only on easily computable contextual 
information as follows: 
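$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &= \frac{p(N^i_{j,k}, t_{0,n})}{p(t_{0,n})} \\
&= \frac{p(t_{0,j})\, p(N^i_{j,k} \mid t_{0,j})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j})\, p(t_k \mid t_{0,j}, N^i_{j,k}, t_{j,k})\, p(t_{k+1,n} \mid t_{0,j}, N^i_{j,k}, t_{j,k}, t_k)}{p(t_{0,j})\, p(t_{j,k} \mid t_{0,j})\, p(t_k \mid t_{0,k})\, p(t_{k+1,n} \mid t_{0,k+1})} \\
&= \frac{p(N^i_{j,k} \mid t_{0,j})\, p(t_{j,k} \mid N^i_{j,k}, t_{0,j})\, p(t_k \mid t_{0,j}, N^i_{j,k}, t_{j,k})\, p(t_{k+1,n} \mid t_{0,k+1}, N^i_{j,k})}{p(t_{j,k} \mid t_{0,j})\, p(t_k \mid t_{0,k})\, p(t_{k+1,n} \mid t_{0,k+1})}
\end{aligned}$$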
Most of the independence assumptions we make are the same as in the boundary 
trigram estimate. We assume that the probability of the nonterminal depends only on 
the previous tag, that the probability of the immediately following tag depends only 
on the nonterminal, and that the probability of the tags following that depend only 
on the previous tags. However, we make one independence assumption that differs 
from all of our previous estimates. Rather than assuming that the probability of the 
tags within the constituent depends on the nonterminal, giving an inside probability 
term, we assume that the probability of these tags depends only on the previous tags. 
Then we have 
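$$\begin{aligned}
p(N^i_{j,k} \mid t_{0,n}) &\approx \frac{p(N^i_{j,k} \mid t_{0,j})\, p(t_{j,k} \mid t_{0,j})\, p(t_k \mid N^i_{j,k})\, p(t_{k+1,n} \mid t_{0,k+1})}{p(t_{j,k} \mid t_{0,j})\, p(t_k \mid t_{0,k})\, p(t_{k+1,n} \mid t_{0,k+1})} \\
&= \frac{p(N^i_{j,k} \mid t_{0,j})\, p(t_k \mid N^i_{j,k})}{p(t_k \mid t_{0,k})}
\end{aligned}$$
In the denominator, we take $p(t_k \mid t_{0,k}) \approx p(t_k)$, giving:
$$p(N^i_{j,k} \mid t_{0,n}) \approx \frac{p(N^i_{j,k} \mid t_{0,j})\, p(t_k \mid N^i_{j,k})}{p(t_k)}$$
which, taking $p(N^i_{j,k} \mid t_{0,j}) \approx p(N^i_{j,k} \mid t_{j-1})$ as assumed above, is simply the product
of the two boundary statistics described in the previous section.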
We refer to this estimate as boundary statistics only. 
2 Actually, in our implementation, the $p(t_k)$ in the denominator is included in the following-tag statistic,
$p(t_k \mid N^i)$, for which we use $\frac{p(t_k \mid N^i)}{p(t_k)}$. Then at run time we only use the trigram probabilities for $t_{j,k}$.
Table 3
Results for the boundary estimates.
Figure of Merit                  %E     %non-0 E   %popped   CPU Time (s)
boundary statistics only         53.2   50.8       59.6      2,759
left boundary trigram estimate   22.1   18.4       39.6      1,700
boundary trigram estimate        18.2   13.9       31.2      1,111
[Plot omitted. x-axis: Sentence Length; curves: trigram estimate, prefix estimate, left boundary trigram, boundary trigram, boundary stats only.]
Figure 11
Nonzero-length edges for 95% of the probability mass for the boundary estimates.
5.4 Results 
The results for the figures of merit introduced in the previous section according to the 
measurements given in Section 2.2 are shown in Table 3. 
Figure 11 shows a graph of %non-0 E for each sentence length for the boundary 
models and the trigram and prefix estimates. This graph shows that the contextual 
information gained from using $\alpha_L$ in the prefix estimate is almost completely included
in just the previous tag, as illustrated by the left boundary trigram estimate. Adding 
right contextual information in the boundary trigram estimate gives us the best per- 
formance on this measure of any of our figures of merit. 
We can consider the left boundary trigram estimate to be an approximation of the 
prefix estimate, where the effect of the left context is approximated by the effect of the 
single tag to the left. Similarly, the boundary trigram estimate is an approximation to 
an estimate involving the full context, i.e., an estimate involving the outside probability 
$\alpha$. However, the parser cannot compute the outside probability of a constituent during
a parse, and so in order to use context on both sides of the constituent, we need to use 
something like our boundary statistics. Our results suggest that a single tag before or 
after the constituent can be used as a reasonable approximation to the full context on 
[Plot omitted. x-axis: Sentence Length; curves: "stack", trigram estimate, prefix estimate, left boundary trigram, boundary trigram, boundary stats only.]
Figure 12
Average CPU time for 95% of the probability mass for the boundary estimates.
that side of the constituent. Figure 12 shows the average CPU time for each sentence 
length. 
Since the boundary trigram estimate has none of the overhead associated with the 
prefix estimate, it is the best performer in terms of CPU time as well. We can also 
see that using just the boundary statistics, which can be precomputed and require no 
extra processing during parsing, still results in a substantial improvement over the 
non-best-first "stack" model. 
As another method of comparison between the two best-performing estimates, 
the context-dependent boundary trigram model and the context-independent trigram 
model, we compared the number of edges needed to find the first parse for average- 
length sentences. The average length of a sentence in our test data is about 22 words. 
Figure 13 shows the percentage of sentences of length 18 through 26 for which a 
parse could be found within 2,500 edges. For this experiment, we used a separate 
test set from the Wall Street Journal corpus, containing approximately 570 sentences in 
the desired length range. This measure also shows a real advantage of the boundary 
trigram estimate over the trigram estimate. 
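The measure used in this comparison, the fraction of sentences for which a first parse is found within a given number of edges, can be sketched as follows. This is only an illustrative sketch: the function name, the use of None for sentences with no parse before the cutoff, and the sample edge counts are assumptions made for the example and are not data from our experiments.

```python
from bisect import bisect_right


def fraction_parsed_within(edges_to_first_parse, budgets):
    """For each edge budget, return the fraction of sentences whose first
    parse was found using at most that many edges.

    edges_to_first_parse: one entry per sentence; None means that no parse
    was found before the experiment's cutoff (an assumed convention).
    """
    total = len(edges_to_first_parse)
    if total == 0:
        raise ValueError("need at least one sentence")
    # Sort the solved sentences once so each budget is a binary search.
    solved = sorted(e for e in edges_to_first_parse if e is not None)
    return [bisect_right(solved, b) / total for b in budgets]


# Hypothetical edge counts for five sentences (not data from the paper).
counts = [400, 1200, 2600, None, 900]
curve = fraction_parsed_within(counts, [500, 1000, 1500, 2000, 2500])
print(curve)
```

Plotting such a curve for each figure of merit over the same budgets gives the kind of comparison described here; unparsed sentences remain in the denominator, so the curve never overstates coverage.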
[Figure 13 plot: x-axis "# Edges" (500 to 2,500); y-axis from 0.0 to 1.0; curves: boundary trigram, trigram.]
Figure 13
% of the 18- to 26-word sentences finding a parse in a fixed number of edges.
6. Results Summary
Table 4 summarizes the results obtained for each figure of merit.
Table 4
Results for all figures of merit.
Figure of Merit                   %E    %non-0 E   %popped   CPU Time
"stack" model                                                4,882
straight $\beta$                  97.6  97.5       93.8      3,966
normalized $\beta$                34.7  31.6       61.5      1,631
trigram estimate                  25.2  21.7       44.3      1,547
normalized $\alpha_L\beta$        39.7  36.4       57.3      68,660
prefix estimate                   21.8  17.4       38.3      26,520
boundary statistics only          53.2  50.8       59.6      2,759
left boundary trigram estimate    22.1  18.4       39.6      1,700
boundary trigram estimate         18.2  13.9       31.2      1,111
7. Comparing Figures of Merit Using a Treebank Grammar
7.1 Background
To verify that our results are not an artifact of the particular grammar we chose for
testing, we also tested using a treebank grammar introduced in Charniak (1996). This
grammar was trained in a straightforward way by reading the grammar directly (with
minor modifications) from a portion of the Penn Treebank Wall Street Journal data
comprised of about 300,000 words. The boundary statistics were counted directly from the
training data as well. The treebank grammar is much larger and more ambiguous than
our original grammar, containing about 16,000 rules and 78 terminal and nonterminal
symbols, and it was impractical to parse sentences to exhaustion using our existing
hardware, so the figures based on 95% of the probability mass could not be computed.
We were able to use this grammar to compare the number of edges needed to find
the first parse using the trigram and boundary trigram estimates.
[Figure 14 plot: x-axis "# Edges" (0 to 20,000); y-axis from 0.0 to 1.0; curves: boundary trigram, trigram.]
Figure 14 
% of the 18- to 26-word sentences finding a parse in a fixed number of edges for a treebank 
grammar. 
7.2 Results 
Figure 14 shows the percentage of sentences of length 18 through 26 for which a parse 
could be found within 20,000 edges. Again, we used a test set of approximately 570 
sentences of the appropriate length from the Wall Street Journal corpus. Although the 
x-axis covers a much wider range than in Figure 13, the relationship between the two 
estimates is quite similar. 
8. Previous Work 
In an earlier version of this paper (Caraballo and Charniak 1996), we presented the 
results for several of these models using our original grammar. The treebank grammar 
was introduced in Charniak (1996), and the parser in that paper is a best-first parser
using the boundary trigram figure of merit. 
The literature shows many implementations of best-first parsing, but none of the 
previous work shares our goal of explicitly comparing figures of merit. 
Bobrow (1990) and Chitrao and Grishman (1990) introduced statistical agenda- 
based parsing techniques. Chitrao and Grishman implemented a best-first probabilistic 
parser and noted the parser's tendency to prefer shorter constituents. They proposed 
a heuristic solution of penalizing shorter constituents by a fixed amount per word. 
Miller and Fox (1994) compare the performance of parsers using three different 
types of grammars, and show that a probabilistic context-free grammar using inside 
probability (unnormalized) as a figure of merit outperforms both a context-free
grammar and a context-dependent grammar.
Kochman and Kupin (1991) propose a figure of merit closely related to our prefix 
estimate. They do not actually incorporate this figure into a best-first parser. 
Magerman and Marcus (1991) use the geometric mean to compute a figure of 
merit that is independent of constituent length. Magerman and Weir (1992) use a 
similar model with a different parsing algorithm. 
9. Conclusions 
We have presented and evaluated several figures of merit for best-first parsing. The 
best performer according to all of our measures was the parser using the boundary 
trigram estimate as a figure of merit, and this result holds for two different grammars. 
This figure has the additional advantage that it can be easily incorporated into existing 
best-first parsers using a figure of merit based on inside probability. (As mentioned
earlier, the efficient online computation of $\beta$ is described in Appendix A.) We strongly
recommend this figure of merit as the basis for best-first statistical parsers. 
The measurements presented here almost certainly underestimate the true benefits 
of this model. We restricted sentence length to a maximum of 30 words, in order to 
keep the number of edges in the exhaustive parse to a practical size; however, since the 
percentage of edges needed by the best-first parse decreases with increasing sentence 
length, we assume that the improvement would be even more dramatic for sentences 
longer than 30 words. 
Appendix A: Efficient On-Line Computation of $\beta$
We compute estimates of the inside probability $\beta$ for each proposed constituent
incrementally as new constituents are added to the chart. Initially, $\beta$ is set to 1 for each
terminal symbol, since our input is given as a stream of tags, which are our terminals.
When a new proposed constituent is added to the agenda, its $\beta$ estimate is set to its
current inside probability according to the constituents already in the chart. However,
as more constituents are added to the chart, we may find a new way to build up a
proposed constituent, i.e., additional evidence for that proposed constituent, so we
need to update the $\beta$ for the proposed constituent (and also for affected constituents
already in the chart, since these may in turn affect other proposed constituents).
These updates can be quite expensive in terms of CPU time. However, many of
the updates are quite small, and do not affect the relative ordering of the proposed
constituents on the agenda. Instead of propagating every change to $\beta$, then, we only
want to propagate those changes that we expect to have an effect on this ordering.
What we have done is to have each constituent store not only its $\beta$ value, but also
an increment. Increases to the inside probability are added not to $\beta$ itself, but to this
increment, until the increment exceeds some threshold. Experimentally we have found
that we can avoid propagating increments until they exceed 1% of the current value
of $\beta$ with very little effect on the parser's selection of constituents from the agenda.
This thresholding on the propagation of $\beta$ allows us to update the $\beta$ values
on-line while still keeping the performance of the parser as $O(n^3)$ empirically.
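The thresholding scheme described in this appendix might be sketched as follows. This is a simplified illustration under stated assumptions, not the code of our parser: the class name Constituent, the parents list, and the per-parent factor (standing in for the rule probability multiplied by the inside probabilities of sibling constituents) are hypothetical, as is the choice to fold the pending increment into the stored value at the moment it is propagated. The sketch also assumes that the parent links contain no cycles.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.01  # propagate once the pending increment exceeds 1% of beta


@dataclass
class Constituent:
    label: str
    beta: float             # inside-probability estimate currently in use
    increment: float = 0.0  # increases not yet folded into beta
    # (parent, factor) pairs; factor is a hypothetical stand-in for the rule
    # probability times the betas of the sibling constituents.
    parents: list = field(default_factory=list)

    def add_evidence(self, delta: float) -> int:
        """Record an increase of delta; return how many nodes were updated."""
        self.increment += delta
        if self.increment <= THRESHOLD * self.beta:
            return 0  # too small to be worth propagating yet
        flushed, self.increment = self.increment, 0.0
        self.beta += flushed
        updated = 1
        for parent, factor in self.parents:
            updated += parent.add_evidence(flushed * factor)
        return updated


# Hypothetical two-node example (values chosen only for illustration).
s = Constituent("S", beta=1e-4)
np_ = Constituent("NP", beta=1e-2, parents=[(s, 0.5)])
first = np_.add_evidence(5e-5)   # below 1% of 1e-2: stored, not propagated
second = np_.add_evidence(1e-4)  # pending 1.5e-4 exceeds the threshold
print(first, second, np_.beta, s.beta)
```

In this example the first small increase is held back, and the second pushes the accumulated increment over the threshold, so both the NP and its parent S are updated in a single pass.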
Appendix B: Estimation of Boundary Statistics 
Our figures of merit incorporating boundary statistics use the figures $p(N^i_{j,k} \mid t_{j-1})$ to
represent the effect of the left context and $\frac{p(t_k \mid N^i_{j,k})}{p(t_k)}$ to represent the effect of the right
context. For our experiments with the first grammar, which was learned from training 
data taken from the Brown corpus, we estimated these statistics from the same training 
data. 
First, we parsed the training data according to our grammar. (It was necessary 
to do this, rather than using the hand-annotated parses of the training data, because 
our grammar does not use the same set of nonterminals as the corpus; see Carroll
and Charniak [1992a, 1992b] and Charniak and Carroll [1994] for details.) Since we
use the tags as our input, the probability of a nonterminal appearing with a particular 
previous tag is the same as the probability of that nonterminal appearing in any 
sentence containing that tag. 
We can then count the probability-weighted occurrences of a nonterminal given 
the previous tag as follows: 
$$C(N^i_{j,k}, t_{j-1}) = \sum_{w_{0,n} \text{ containing } t_{j-1}} p(N^i_{j,k} \mid w_{0,n}) = \sum_{w_{0,n} \text{ containing } t_{j-1}} \frac{\alpha(N^i_{j,k})\,\beta(N^i_{j,k})}{p(w_{0,n})}$$
That is, for each sentence that contains the previous tag $t_{j-1}$, we increment our count
by the probability of the nonterminal $N^i_{j,k}$ occurring immediately following $t_{j-1}$ in that
sentence.
Since we have a complete parse, the inside and outside probabilities and the 
sentence probability can be easily computed. We can also obtain the count $C(t_{j-1})$
simply by counting the number of sentences in which that tag appears in position
$j-1$. We then obtain the conditional probability for the left boundary statistic as
follows:
$$p(N^i_{j,k} \mid t_{j-1}) = \frac{C(N^i_{j,k}, t_{j-1})}{C(t_{j-1})}$$
The right boundary statistic is computed in the corresponding way. 
For the experiment using the treebank grammar, these statistics were obtained by 
counting directly from the Wall Street Journal treebank corpus, just as the grammar
rules and trigram statistics were. 
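The counting procedure for the left boundary statistic can be illustrated with a short sketch. It assumes that the posterior of each constituent (the product of its outside and inside probabilities divided by the sentence probability) has already been computed from the parsed training data. The function name, the input layout, the pooling of counts over positions, and the sample values are assumptions made for the example only.

```python
from collections import Counter, defaultdict


def left_boundary_stats(constituent_posteriors, tag_occurrences):
    """Estimate p(nonterminal | previous tag) from probability-weighted counts.

    constituent_posteriors: iterable of (nonterminal, previous_tag, posterior),
        one per constituent in a parsed training sentence, where posterior
        stands for alpha * beta / p(sentence).
    tag_occurrences: iterable of tags, one entry per occurrence of a tag
        immediately before a possible constituent start (assumed layout).
    """
    joint = defaultdict(float)
    for nonterminal, prev_tag, posterior in constituent_posteriors:
        joint[(nonterminal, prev_tag)] += posterior  # weighted count C(N, t)
    tag_counts = Counter(tag_occurrences)            # plain count C(t)
    return {
        (nonterminal, prev_tag): weight / tag_counts[prev_tag]
        for (nonterminal, prev_tag), weight in joint.items()
        if tag_counts[prev_tag] > 0
    }


# Hypothetical posteriors and tags (not values from the paper).
posteriors = [("NP", "verb", 0.9), ("NP", "verb", 0.6), ("VP", "noun", 0.5)]
tags = ["verb", "verb", "verb", "noun"]
stats = left_boundary_stats(posteriors, tags)
print(stats)
```

The right boundary statistic could be collected in the same way by keying the weighted counts on the tag that follows the constituent rather than the tag that precedes it.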
Appendix C: Speed vs. Accuracy 
As an additional verification of our results, we gathered data on speed versus accuracy. 
For this experiment, we used the probabilistic context-free grammar learned from the 
Brown corpus and the average-length test sentences described in Section 5.4. For each 
figure of merit, we computed the average precision and recall of the best parse found 
as compared to the number of edges created. We computed unlabeled precision and 
recall only, since our grammar uses a different set of nonterminals from those used in 
the test data. 
Precision is defined as the percentage of the constituents proposed by our parser 
that are actually correct according to the treebank. For each edge count, we measured 
the precision of the best parse of each sentence found within that number of edges. 
Figure 15 is a graph of the average precision for the $\beta$ figures of merit from Section 3,
plotted against edge counts. 
The fluctuations at the low edge counts are due to the small amount of data at this 
level. At a low edge count, very few sentences have actually been parsed, and since 
these sentences tend to be short and simple, the parses are likely to be correct. The 
sentences that could not be parsed do not contribute to the measurement of precision. 
As more sentences are parsed, precision settles at about 47%, the highest precision 
attainable by our particular test grammar, and remains there as edge counts increase. 
[Figure 15 plot: x-axis "# Edges" (10,000 to 30,000); y-axis from 0.4 to 0.6; curves: "stack", straight beta, normalized beta, trigram estimate.]
Figure 15
Precision of the best parse found in a fixed number of edges for the $\beta$ estimates.
[Figure 16 plot: x-axis "# Edges" (2,000 to 8,000); y-axis from 0.0 to 0.6; curves: "stack", straight beta, normalized beta, trigram estimate.]
Figure 16
Recall of the best parse found in a fixed number of edges for the $\beta$ estimates.
[Figure 17 plot: y-axis from 0.0 to 0.6; curves: normalized beta, normalized alpha_L beta, trigram estimate, prefix estimate.]
Figure 17
Recall of the best parse found in a fixed number of edges for the $\alpha_L\beta$ estimates.
[Figure 18 plot: x-axis "# Edges"; curves: trigram estimate, prefix estimate, left boundary trigram, boundary trigram, boundary stats only.]
Figure 18 
Recall of the best parse found in a fixed number of edges for the boundary estimates. 
This level of precision is independent of the figure of merit used, so measurement of 
precision does not help evaluate our figures of merit. 
A much more useful measure is recall. Recall is defined as the percentage of 
constituents in the treebank test data that are found by our parser. Again, we measured 
the recall of the best parse of each sentence found within each number of edges. 
Figure 16 shows the results for the figures of merit from Section 3. 
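The unlabeled precision and recall measures defined in this appendix can be sketched as follows. The sketch assumes that each constituent is represented only by its (start, end) span, since labels are ignored, and that repeated spans are counted once; the function name and the sample spans are hypothetical.

```python
def unlabeled_precision_recall(proposed_spans, gold_spans):
    """Return (precision, recall) over unlabeled constituent spans.

    Each span is a (start, end) pair. Duplicate spans are collapsed,
    which is an assumption of this sketch.
    """
    proposed, gold = set(proposed_spans), set(gold_spans)
    if not proposed or not gold:
        raise ValueError("both span sets must be non-empty")
    correct = len(proposed & gold)
    precision = correct / len(proposed)  # share of proposed spans in treebank
    recall = correct / len(gold)         # share of treebank spans proposed
    return precision, recall


# Hypothetical spans for one five-word sentence (not data from the paper).
proposed = [(0, 5), (0, 2), (2, 5), (2, 4)]
gold = [(0, 5), (0, 2), (2, 5), (3, 5), (0, 1)]
precision, recall = unlabeled_precision_recall(proposed, gold)
print(precision, recall)
```

Here three of the four proposed spans appear in the treebank analysis, and three of the five treebank spans are recovered.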
Straight beta clearly shows little or no improvement over the "stack" parser using 
no figure of merit at all. The other figures of merit increase quickly to about 64%, 
the maximum recall attainable with our test grammar. The "stack" parser and the 
one using straight beta, on the other hand, do not reach this maximum level until 
about 50,000 edges. We have no explanation for the relatively poor performance of the 
parser using the trigram estimate compared to the other best-first parsers, as shown 
in Figures 16, 17, and 18. Figure 17 shows the recall values for the $\alpha_L\beta$ figures of merit
from Section 4, and Figure 18 shows recall for the boundary figures of merit from 
Section 5. Since precision is not a useful measure, we have not included precision data 
for these figures of merit. 
These data confirm that the parser using the boundary trigram figure of merit 
performs better than any of the others. Recall using this figure of merit is consistently
higher than with any of the others at low edge counts, and it reaches the maximum value
in fewer than 2,000 edges, with the nearest competitors approaching the maximum at 
about 3,000 edges. 
Acknowledgments 
The authors are very grateful to Heidi Fox 
for obtaining the speed vs. accuracy data 
discussed in Appendix C. We also wish to 
thank the anonymous reviewers for their 
comments and suggestions. This research 
was supported in part by NSF grant 
IRI-9319516 and by ONR grant 
N0014-96-1-0549. 
References 
Bobrow, Robert J. 1990. Statistical agenda 
parsing. In DARPA Speech and Language 
Workshop, pages 222-224. 
Caraballo, Sharon and Eugene Charniak. 
1996. Figures of merit for best-first 
probabilistic chart parsing. In Proceedings 
of the Conference on Empirical Methods in 
Natural Language Processing, pages 
127-132. 
Carroll, Glenn and Eugene Charniak. 1992a. 
Learning probabilistic dependency 
grammars from labeled text. In Working 
Notes, Fall Symposium Series, pages 
25-32. AAAI. 
Carroll, Glenn and Eugene Charniak. 1992b. 
Two experiments on learning probabilistic 
dependency grammars from corpora. In 
Workshop Notes, Statistically-Based NLP 
Techniques, pages 1-13. AAAI. 
Charniak, Eugene. 1993. Statistical Language 
Learning. MIT Press. 
Charniak, Eugene. 1996. Tree-bank 
grammars. In Proceedings of the Thirteenth 
National Conference on Artificial Intelligence, 
pages 1031-1036. AAAI. 
Charniak, Eugene and Glenn Carroll. 1994. 
Context-sensitive statistics for improved 
grammatical language models. In 
Proceedings of the Twelfth National Conference 
on Artificial Intelligence, pages 728-733. 
Chitrao, Mahesh V. and Ralph Grishman. 
1990. Statistical parsing of messages. In 
DARPA Speech and Language Workshop, 
pages 263-266. 
Francis, W. Nelson and Henry Kučera. 1982.
Frequency Analysis of English Usage: Lexicon 
and Grammar. Houghton Mifflin. 
Jelinek, Frederick and John D. Lafferty. 1991. 
Computation of the probability of initial 
substring generation by stochastic 
context-free grammars. Computational 
Linguistics, 17:315-323.
Kochman, Fred and Joseph Kupin. 1991. 
Calculating the probability of a partial 
parse of sentence. In DARPA Speech and 
Language Workshop, pages 237-240. 
Magerman, David M. and Mitchell P. 
Marcus. 1991. Parsing the Voyager 
domain using Pearl. In DARPA Speech and 
Language Workshop, pages 231-236. 
Magerman, David M. and Carl Weir. 1992. 
Efficiency, robustness and accuracy in 
Picky chart parsing. In Proceedings of the 
30th Annual Meeting, Association for 
Computational Linguistics, pages 40-47.
Association for Computational
Linguistics.
Marcus, Mitchell P., Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building 
a large annotated corpus of English: The 
Penn treebank. Computational Linguistics, 
19:313-330. 
Miller, Scott and Heidi Fox. 1994. Automatic 
grammar acquisition. In Proceedings of the Human Language Technology Workshop, 
pages 268-271. 
