<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0212">
  <Title>Figures of Merit for Best-First Probabilistic Chart Parsing</Title>
  <Section position="1" start_page="0" end_page="132" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Best-first parsing methods for natural language try to parse efficiently by considering the most likely constituents first. Some figure of merit is needed by which to compare the likelihood of constituents, and the choice of this figure has a substantial impact on the efficiency of the parser.</Paragraph>
    <Paragraph position="1"> While several parsers described in the literature have used such techniques, there is no published data on their efficacy, much less attempts to judge their relative merits. We propose and evaluate several figures of merit for best-first parsing.</Paragraph>
    <Paragraph position="2"> Introduction Chart parsing is a commonly-used algorithm for parsing natural language texts. The chart is a data structure which contains all of the constituents which may occur in the sentence being parsed.</Paragraph>
    <Paragraph position="3"> At any point in the algorithm, there exist constituents which have been proposed but not actually included in a parse. These proposed constituents are stored in a data structure called the keylist. When a constituent is removed from the keylist, the system considers how this constituent can be used to extend its current structural hypothesis. In general this can lead to the creation of new, more encompassing constituents which themselves are then added to the keylist. When we are finished processing one constituent, a new one is chosen to be removed from the keylist, and so on.</Paragraph>
    <Paragraph position="4"> Traditionally, the keylist is represented as a stack, so that the last item added to the keylist is the next one removed.</Paragraph>
    <Paragraph position="5"> Best-first chart parsing is a variation of chart parsing which attempts to find the most likely parses first, by adding constituents to the chart in order of the likelihood that they will appear in a correct parse, rather than simply popping constituents off of a stack. Some figure of merit is assigned to potential constituents, and the constituent maximizing this value is the next to be added to the chart.</Paragraph>
    <Paragraph position="6">  In best-first probabilistic chart parsing a probabilistic measure is used. In this paper we consider probabilities primarily based on probabilistic context-free grammars, though in principle other, more complicated schemes could be used.</Paragraph>
    <Paragraph position="7"> Ideally, we would like to use as our figure of merit the conditional probability of that constituent, given the entire sentence, in order to choose a constituent that not only appears likely in isolation, but maximizes the likelihood of the sentence as a whole; that is, we would like to pick the constituent that maximizes the following quantity:</Paragraph>
    <Paragraph position="9"> where to,n is the sequence of the n tags, or parts of speech, in the sentence (numbered to,..., tn- 1), and Nj, k is a nonterminal of type i covering terms tj...tk_l. However, we cannot calculate this quantity, since in order to do so, we would need to completely parse the sentence. In this paper, we examine the performance of several proposed figures of merit that approximate it in one way or another.</Paragraph>
    <Paragraph position="10"> In our experiments, we use only tag sequences for parsing. More accurate probability estimates should be attainable using lexical information.</Paragraph>
    <Section position="1" start_page="127" end_page="129" type="sub_section">
      <SectionTitle>
Figures of Merit
Straight $\beta$
</SectionTitle>
      <Paragraph position="0"> It seems reasonable to base a figure of merit on the inside probability fl of the constituent. Inside probability is defined as the probability of the words or tags in the constituent given that the constituent is dominated by a particular nonterminal symbol. This seems to be a reasonable basis for comparing constituent probabilities, and has the additional advantage that it is easy to compute during chart parsing.</Paragraph>
      <Paragraph position="1"> The inside probability of the constituent N~, k is defined as /3(Nj, k) ~ p(tj,klN i) where N i represents the ith nonterminal symbol. null in terms of our earlier discussion, our &amp;quot;ideal&amp;quot; figure of merit can be rewritten as: i p( Nj,k lto,,d p(Nj, , to,n) p(to, ) p(Nij, k, to,j, t j, k, tk, n) p(to,**) p(to,j,Nj, k,tk,~)p(tj,klto,j, ' Nj,a, ta,n) p(to, ) We apply the usual independence assumption that given a nonterminal, the tag sequence it generates depends only on that nonterminal, giving p(to,j, i i N;, k, tk,n)p(tj,k INj,k) P( N;,k lto,,d</Paragraph>
      <Paragraph position="3"> The first term in the numerator is just the definition of the outside probability a of the constituent. Outside probability a of a constituent Nj, k is defined as the probability of that constituent and the rest of the words in the sentence (or rest of the tags in the tag sequence, in our case).</Paragraph>
      <Paragraph position="5"> We can therefore rewrite our ideal figure of merit as</Paragraph>
      <Paragraph position="7"> p(to, ) In this equation, we can see that a(Nj,k) and p(to,~) represent the influence of the surrounding words. Thus using j3 alone assumes that a and P(tom) can be ignored.</Paragraph>
      <Paragraph position="8"> We will refer to this figure of merit as straight ft.</Paragraph>
      <Paragraph position="9"> Normalized /~ One side effect from omitting the a and p(to,,~) terms in the m-only figure above is that inside probability alone tends to prefer shorter constituents to longer ones, as the inside probability of a longer constituent involves the product of  more probabilities. This can result in a &amp;quot;thrashing&amp;quot; effect, where the system parses short constituents, even very low probability ones, while avoiding combining them into longer constituents. To avoid thrashing, typically some technique is used to normalize the inside probability for use as a figure of merit. One approach is to take the geometric mean of the inside probability, to obtain a &amp;quot;per-word&amp;quot; inside probability. (In the &amp;quot;ideal&amp;quot; model, the p(to,~) term acts as a normalizing factor.) null The per-word inside probability of the constituent Nj, k is calculated as We will refer to this figure as normalized/3.</Paragraph>
      <Paragraph position="10"> Normalized aLf~ In the previous section, we showed that our ideal figure of merit can be written as</Paragraph>
      <Paragraph position="12"> However, the a term, representing outside probability, cannot be calculated directly during a parse, since we need the full parse of the sentence to compute it. In some of our figures of merit, we use the quantity p(Nj,k, t0,j), which is closely related to outside probability. We call this quantity the left outside probability, and denote it ai.</Paragraph>
      <Paragraph position="13"> The following recursive formula can be used to compute aL. Let g~,k be the set of all completed edges, or rule expansions, in which the nonterminal Nj, k appears. For each edge e in gj,k, we compute the the product of aL of the nonterminal appearing on the left-hand side (lhs) of the rule, the probability of the rule itself, and /33 of each non-terminal N~s appearing to the left of Nj, a in the rule. Then aL(N),k) is the sum of these products:</Paragraph>
      <Paragraph position="15"> eE$~, k N:., This formula can be infinitely recursive, depending on the properties of the grammar. A method for calculating aL more efficiently can be derived from the calculations given in (3elinek and Lafferty, 1991).</Paragraph>
      <Paragraph position="16"> A simple extension to the normalized fl model allows us to estimate the per-word probability of all tags in the sentence through the end of the constituent under consideration. This allows us to take advantage of information already obtained in a left-right parse. We calculate this quantity as follows: k O~ i i L ( N;,k ) J3( N;,k )&amp;quot; We are again~ taking the geometric mean to avoid thrashing by compensating for the aj3 quantity's preference for shorter constituents, as explained in the previous section.</Paragraph>
      <Paragraph position="17"> We refer to this figure of merit as normalized O~Lfl.</Paragraph>
      <Paragraph position="18"> Trigram estimate An alternative way to rewrite the &amp;quot;ideal&amp;quot; figure of merit is as followS:</Paragraph>
      <Paragraph position="20"> Once again applying the usual independence assumption that given a nonterminal, the tag sequence it generates depends only on that nonterminal, we can rewrite the figure of merit as follows: p(tj,k Ito,j, tk,.) To derive an estimate of this quantity for practical use as a figure of merit, we make some additional independence assumptions. We assume that p(N),klto,j, tk,~) ~ p(N~,k), that is, that the probability of a nonterminal is independent of the tags before and after it in the sentence. We also use a trigram model for the tags themselves, giving p(tj,klto,j, tk,n) ,~ p(tj,kltj_2,j). Then we have:</Paragraph>
      <Paragraph position="22"> p(Nj, ktto,,~) .~. p(tj,kltj_2,j)&amp;quot; We can calculate ~(Nj, k) as usual. The p(N ~) term is estimated from our PCFG as the sum of the counts for all rules having N i as their left-hand side, divided by the sum of the counts for all rules. The p(tj,kltj_2,j) term is just the probability of the tag sequence tj... tk- 1 according to a trigram model. 1 (Technically, this is not a trigram model but a tritag model, since we are considering sequences of tags, not words.) We refer to this model as the trigram estimate.</Paragraph>
      <Paragraph position="23"> 1Our results show that the p(N i) term can be omitted without much effect.</Paragraph>
    </Section>
    <Section position="2" start_page="129" end_page="129" type="sub_section">
      <SectionTitle>
Prefix estimate
</SectionTitle>
      <Paragraph position="0"> We also derived an estimate of the ideal figure of merit which takes advantage of statistics on the first j - 1 tags of the sentence as well as tj,k.</Paragraph>
      <Paragraph position="1"> This estimate represents the probability of the constituent in the context of the preceding tags.</Paragraph>
      <Paragraph position="3"> We again make the independence assumption that p(tj,kINj, k,to,j, tk,~) ~ fl(Nj, k). Additionally, we assume that i P(N~,k,to,i) and p(to,k) are independent of p(tk,n), giving</Paragraph>
      <Paragraph position="5"> The denominator, p(t0,k), is once again calculated from a tritag model. The p(N),k, t0,j) term is just O~L, defined above in the discussion of the normalized O~Lfl model. Thus this figure of merit can be written as</Paragraph>
      <Paragraph position="7"> We will refer to this as the prefix estimate.</Paragraph>
      <Paragraph position="8"> The Experiment We used as our grammar a probabilistic context-free grammar learned from the Brown corpus (see (Francis and K@era, 1982), Carroll and Charniak (1992a) and (1992b), and (Charniak and Carroll, 1994)). We parsed 500 sentences of length 3 to 30 (including punctuation) from the Penn Treebank Wall Street Journal corpus using a best-first parsing method and each of the following estimates for p(Nj, klto,~) as the figure of merit:  1. straight 2. normalized \[3 3. normalized O~Lfl 4. trigram estimate 5. prefix estimate  The probability p(N i) in the trigram estimate was determined from the same training data from which our grammar was learned initially. Our tritag probabilities for the trigram and prefix estimates were learned from this data as well, using the deleted interpolation method for smoothing. For each figure of merit, we compared the performance of best-first parsing using that figure of merit to exhaustive parsing. By exhaustive parsing, we mean continuing to parse until there are no more constituents available to be added to the chart. We parse exhaustively to determine the total probability of a sentence, that is, the sum of the probabilities of all parses found for that sentence. We then computed several quantities for best-first parsing with each figure of merit at the point where the best-first parsing method has found parses contributing at least 95% of the probability mass of the sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="129" end_page="132" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> The chart below presents the following measures for each figure of merit:  1. %E: The percentage of edges, or rule expansions, in the exhaustive parse that have been used by the best-first parse to get 95% of the probability mass. Edge creation is generally considered the best measure of CFG parser effort. null 2. %non-0 E: The percentage of nonzero-length edges used by the best-first parse to get 95%. Zero-length edges are required by our parser as a book-keeping measure, and as such are virtually un-elimitable. We anticipated that removing them from consideration would highlight the &amp;quot;true&amp;quot; differences in the figures of merit. 3. %popped: The percentage of constituents in the exhaustive parse that were used by the best-first parse to get 95% of the probability mass.</Paragraph>
      <Paragraph position="1">  The statistics converged to their final values quickly. The edge-count percentages were generally within .01 of their final values after processing only 200 sentences, so the results were quite stable by the end of our 500-sentence test corpus.</Paragraph>
      <Paragraph position="2"> We gathered statistics for each sentence length from 3 to 30. Sentence length was limited to a maximum of 30 because of the huge number of edges that are generated in doing a full parse of  long sentences; using this grammar, sentences in this length range have produced up to 130,000 edges. Figure 1 shows a graph of %non-0 E, that is, the percent of nonzero-length edges needed to get 95% of the probability mass, for each sentence length.</Paragraph>
      <Paragraph position="3"> We also measured the total CPU time (in seconds) needed to get 95% of the probability mass for each of the 500 sentences. The results are presented in the following chart:  Figure 2 shows the average CPU time to get 95% of the probability mass for each estimate and each sentence length. Each estimate averaged below 1 second on sentences of fewer than 7 words. (The y-axis has been restricted so that the normalized /3 and trigram estimates can be better compared.) null Previous work The literature shows many implementations of best-first parsing, but none of the previous work shares our goal of explicitly comparing figures of merit.</Paragraph>
      <Paragraph position="4"> Bobrow (1990) and Chitrao and Grishman (1990) introduced statistical agenda-based parsing techniques. Chitrao and Grishman implemented a best-first probabilistic parser and noted the parser's tendency to prefer shorter constituents. They proposed a heuristic solution of penalizing shorter constituents by a fixed amount per word. Miller and Fox (1994) compare the performance of parsers using three different types of grammars, and show that a probabilistic context-free grammar using inside probability (unnormalized) as a figure of merit outperforms both a context-free grammar and a context-dependent grammar.</Paragraph>
      <Paragraph position="5"> Kochman and Kupin (1991) propose a figure of merit closely related to our prefix estimate. They do not actually incorporate this figure into a best-first parser.</Paragraph>
      <Paragraph position="6"> Magerman and Marcus (1991) use the geometric mean to compute a figure of merit that is independent of constituent length. Magerman and Weir (1992) use a similar model with a different parsing algorithm.</Paragraph>
      <Paragraph position="7">  From the edge count statistics, it is clear that straight ,3 is a poor figure of merit. Figure 1 also demonstrates that its performance generally worsens as sentence length increases.</Paragraph>
      <Paragraph position="8"> The best performance in terms of edge counts of the figures we tested was the model which used the most information available from the sentence, the prefix model. However, so far, the additional running time needed for the computation of O' L terms has exceeded the time saved by processing fewer edges, as is made clear in the CPU time statistics, where these two models perform substantially worse than even the straight j3 figure. While chart parsing and calculations of j3 can be done in O(n 3) time, we have been unable to find an algorithm to compute the o~ L terms faster than O(nS). When a constituent is removed from the keylist, it only affects the j3 values of its ancestors in the parse trees; however, ~L values are propagated to all of the constituent's siblings to the right and all of its descendants. Recomputing the o~ L terms when a constituent is removed from the keylist can be done in O(n 3) time, and since there are O(n 2) possible constituents, the total time needed to compute the ol L terms in this manner is O(n5).</Paragraph>
      <Paragraph position="9"> The best performer in running time was the parser using the trigram estimate as a figure of merit. This figure has the additional advantage that it can be easily incorporated into existing best-first parsers using a figure of merit based on inside probability. From the CPU time statistics, it can be seen that the running time begins to show a real improvement over the normalized j3 model on sentences of length 25 or greater, and the trend suggests that the improvement would be greater for longer sentences.</Paragraph>
      <Paragraph position="10"> It is also interesting to note that while the models using figures of merit normalized by the geometric mean performed similarly to the other models on shorter sentences, the superior performance of the other models becomes more pronounced as sentence length increases. From Figure 1, we can see that the models using the geometric mean appear to level off with respect to an exhaustive parse when used to parse sentences of length greater than about 15. The other two estimates seem to continue improving with greater sentence length. In fact, the measurements presented here almost certainly underestimate the true benefits of the better models. We restricted sentence length to a maximum of 30 words, in order to keep the number of edges in the exhaustive parse to a practical size; however, since the percentage of edges needed by the best-first parse decreases with increasing sentence length, we assume that the ira- null provement would be even more dramatic for sentences longer than 30 words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>