<?xml version="1.0" standalone="yes"?> <Paper uid="J98-2004"> <Title>New Figures of Merit for Best-First Probabilistic Chart Parsing</Title> <Section position="4" start_page="277" end_page="280" type="metho"> <SectionTitle> 3. Simple Figures of Merit </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="277" end_page="278" type="sub_section"> <SectionTitle> 3.1 Straight β </SectionTitle> <Paragraph position="0"> It seems reasonable to base a figure of merit on the inside probability β of the constituent. Inside probability is defined as the probability of the words or tags in the constituent given that the constituent is dominated by a particular nonterminal symbol; see Figure 2. This is a reasonable basis for comparing constituent probabilities, and it has the additional advantage of being easy to compute during chart parsing.</Paragraph> <Paragraph position="1"> Appendix A gives details of our on-line computation of β.</Paragraph> <Paragraph position="2"> The inside probability of the constituent N^i_{j,k} is defined as β(N^i_{j,k}) = p(t_{j,k} | N^i), where N^i represents the ith nonterminal symbol.</Paragraph> <Paragraph position="3"> Caraballo and Charniak Figures of Merit</Paragraph> <Paragraph position="5"> α includes the entire context of the constituent.</Paragraph> <Paragraph position="6"> In terms of our earlier discussion, our &quot;ideal&quot; figure of merit can be rewritten as:</Paragraph> <Paragraph position="8"> p(N^i_{j,k} | t_{0,n}) = p(t_{0,j}, N^i_{j,k}, t_{k,n}) p(t_{j,k} | t_{0,j}, N^i_{j,k}, t_{k,n}) / p(t_{0,n}). We apply the usual independence assumption that given a nonterminal, the tag sequence it generates depends only on that nonterminal, giving: p(N^i_{j,k} | t_{0,n}) ≈ p(t_{0,j}, N^i_{j,k}, t_{k,n}) β(N^i_{j,k}) / p(t_{0,n}).</Paragraph> <Paragraph position="10"> The first term in the numerator is just the definition of the outside probability α of the constituent. 
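The inside recursion behind β can be made concrete. The following is a minimal sketch, not the authors' implementation: it computes inside probabilities bottom-up for an invented toy PCFG in Chomsky normal form over a tag string (all grammar symbols and probabilities here are made up for illustration).

```python
from collections import defaultdict

def inside_probs(tags, unary, binary):
    """Compute beta[(i, k, N)] = p(tags[i:k] | N) for a CNF PCFG.

    unary:  {(N, tag): prob}   for rules N -> tag
    binary: {(N, B, C): prob}  for rules N -> B C
    """
    n = len(tags)
    beta = defaultdict(float)
    # Width-1 spans come straight from the unary (preterminal) rules.
    for i, t in enumerate(tags):
        for (N, tag), p in unary.items():
            if tag == t:
                beta[(i, i + 1, N)] += p
    # Wider spans sum over split points and binary rules.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (N, B, C), p in binary.items():
                    beta[(i, k, N)] += p * beta[(i, j, B)] * beta[(j, k, C)]
    return dict(beta)

# Toy grammar: S -> A A (1.0), A -> 'x' (0.4), A -> 'y' (0.6)
unary = {("A", "x"): 0.4, ("A", "y"): 0.6}
binary = {("S", "A", "A"): 1.0}
b = inside_probs(["x", "y"], unary, binary)
```

Here β of the S spanning both tags is 1.0 × 0.4 × 0.6 = 0.24, illustrating how β shrinks as spans grow.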
Outside probability α of a constituent N^i_{j,k} is defined as the probability of that constituent and the rest of the words in the sentence (or the rest of the tags in the tag sequence, in our case); see Figure 3.</Paragraph> <Paragraph position="11"> α(N^i_{j,k}) = p(t_{0,j}, N^i_{j,k}, t_{k,n}).</Paragraph> <Paragraph position="12"> We can therefore rewrite our ideal figure of merit as: p(N^i_{j,k} | t_{0,n}) = α(N^i_{j,k}) β(N^i_{j,k}) / p(t_{0,n}).</Paragraph> <Paragraph position="14"> In this equation, we can see that α(N^i_{j,k}) and p(t_{0,n}) represent the influence of the surrounding words. Thus using β alone assumes that α and p(t_{0,n}) can be ignored.</Paragraph> <Paragraph position="15"> We will refer to this figure of merit as straight β.</Paragraph> </Section> <Section position="2" start_page="278" end_page="279" type="sub_section"> <SectionTitle> 3.2 Normalized β </SectionTitle> <Paragraph position="0"> One side effect of omitting the α and p(t_{0,n}) terms in the straight β figure above is that inside probability alone tends to prefer shorter constituents to longer ones, since the inside probability of a longer constituent involves the product of more probabilities. This can result in a &quot;thrashing&quot; effect, as noted in Chitrao and Grishman (1990), where the system parses short constituents, even very low-probability ones, while avoiding combining them into longer constituents. To avoid thrashing, some technique is needed to normalize the inside probability for use as a figure of merit. One approach is to take the geometric mean of the inside probability, obtaining a per-word inside probability. (In the &quot;ideal&quot; model, the p(t_{0,n}) term acts as a normalizing factor.) 
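A tiny numeric sketch (all values invented for illustration) of why raw inside probability favors short constituents and how the geometric mean compensates:

```python
# Raw inside probabilities shrink with span length, so a mediocre
# 2-tag constituent can outscore a better 5-tag one.
short_beta, short_len = 0.01, 2    # per-word quality 0.1
long_beta, long_len = 1e-4, 5      # per-word quality about 0.158

assert short_beta > long_beta      # raw beta prefers the short span

# The geometric mean yields a per-word score instead.
norm = lambda beta, length: beta ** (1.0 / length)
assert norm(long_beta, long_len) > norm(short_beta, short_len)
```

Under the per-word score the longer, higher-quality constituent wins, which is exactly the anti-thrashing behavior the normalization is after.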
The per-word inside probability of the constituent N^i_{j,k} is calculated as β(N^i_{j,k})^{1/(k-j)}. We will refer to this figure as normalized β.</Paragraph> </Section> <Section position="3" start_page="279" end_page="280" type="sub_section"> <SectionTitle> 3.3 Trigram Estimate </SectionTitle> <Paragraph position="0"> An alternative way to rewrite the &quot;ideal&quot; figure of merit is as follows: p(N^i_{j,k} | t_{0,n}) = p(N^i_{j,k} | t_{0,j}, t_{k,n}) p(t_{j,k} | N^i_{j,k}, t_{0,j}, t_{k,n}) / p(t_{j,k} | t_{0,j}, t_{k,n}).</Paragraph> <Paragraph position="2"> Once again applying the usual independence assumption that given a nonterminal, the tag sequence it generates depends only on that nonterminal, we can rewrite the figure of merit as follows: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{0,j}, t_{k,n}) β(N^i_{j,k}) / p(t_{j,k} | t_{0,j}, t_{k,n}).</Paragraph> <Paragraph position="4"> To derive an estimate of this quantity for practical use as a figure of merit, we make some additional independence assumptions. We assume that p(N^i_{j,k} | t_{0,j}, t_{k,n}) ≈ p(N^i), that is, that the probability of a nonterminal is independent of the tags before and after it in the sentence. We also use a trigram model for the tags themselves, giving p(t_{j,k} | t_{0,j}, t_{k,n}) ≈ p(t_{j,k} | t_{j-2}, t_{j-1}). Then we have: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i) β(N^i_{j,k}) / p(t_{j,k} | t_{j-2}, t_{j-1}).</Paragraph> <Paragraph position="6"> We can calculate β(N^i_{j,k}) as usual.</Paragraph> <Paragraph position="7"> The p(N^i) term is estimated from our PCFG and the training data from which the grammar was learned. We estimate p(N^i) as the sum of the counts for all rules having N^i as their left-hand side, divided by the sum of the counts for all rules. The p(t_{j,k} | t_{j-2}, t_{j-1}) term is just the probability of the tag sequence t_j ... t_{k-1} according to a trigram model. (Technically, this is not a trigram model but a tritag model, since we are considering sequences of tags, not words.) 
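The two ingredients of the trigram estimate are easy to sketch. Below is a hedged illustration, not the authors' code: rule names, counts, and probabilities are invented, and the tritag probability is assumed to be supplied by a separate model.

```python
from collections import Counter

def nonterminal_prior(rule_counts):
    """Estimate p(N) as (count of rules with N on the lhs) / (count of all rules)."""
    lhs_counts = Counter()
    total = 0
    for (lhs, _rhs), c in rule_counts.items():
        lhs_counts[lhs] += c
        total += c
    return {N: c / total for N, c in lhs_counts.items()}

def trigram_fom(p_nt, beta, tritag_seq_prob):
    """Trigram-estimate figure of merit: p(N) * beta(N) / p(t_{j,k} | t_{j-2}, t_{j-1})."""
    return p_nt * beta / tritag_seq_prob

# Invented rule counts from a hypothetical treebank grammar.
rule_counts = {("NP", ("DT", "NN")): 30, ("VP", ("VB", "NP")): 20, ("S", ("NP", "VP")): 50}
prior = nonterminal_prior(rule_counts)
```

With these counts, p(NP) = 0.3, p(VP) = 0.2, and p(S) = 0.5, and the figure of merit is a single multiply-divide per constituent.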
Our tritag probabilities p(t_a | t_{a-2}, t_{a-1}) were learned from the training data used for the grammar, using the deleted interpolation method for smoothing.</Paragraph> <Paragraph position="8"> Table 1 (results for the β estimates) gives, for each figure of merit, %E, %non-0 E, %popped, and CPU time: straight β: 97.6, 97.5, 93.8, 3,966; normalized β: 34.7, 31.6, 61.5, 1,631. [Figure 4: Nonzero-length edges for 95% of the probability mass for the β estimates; curves shown for straight β, normalized β, and the trigram estimate.] Our figure of merit uses: p(N^i) β(N^i_{j,k}) / p(t_{j,k} | t_{j-2}, t_{j-1}).</Paragraph> <Paragraph position="10"> We refer to this figure of merit as the trigram estimate.</Paragraph> </Section> </Section> <Section position="5" start_page="280" end_page="281" type="metho"> <SectionTitle> 3.4 Results </SectionTitle> <Paragraph position="0"> The results for the three figures of merit introduced in the last section according to the measurements given in Section 2.2 are shown in Table 1 (the time to fully parse using the &quot;stack&quot; model is included for easy reference).</Paragraph> <Paragraph position="1"> Figure 4 expands the %non-0 E data to show the percent of nonzero-length edges needed to get 95% of the probability mass for each sentence length.</Paragraph> <Paragraph position="2"> Straight β performs quite poorly on this measure. In order to find 95% of the probability mass for a sentence, a parser using this figure of merit typically needs to do over 90% of the work. On the other hand, normalized β and the trigram estimate both result in substantial savings of work. However, while these two models produce near-equivalent performance for short sentences, for longer sentences, with length greater than about 15 words, the trigram estimate gains a clear advantage. 
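The deleted-interpolation smoothing mentioned above for the tritag model can be sketched generically (the interpolation weights and counts below are invented; the authors' actual weights were set by deleted interpolation on held-out data):

```python
def interpolated_tritag(t2, t1, t, counts3, counts2, counts1, total, lambdas):
    """p(t | t2, t1) as a weighted mix of tri-, bi-, and unigram relative frequencies.

    counts3: trigram counts, counts2: bigram counts, counts1: unigram counts,
    total: number of tag tokens, lambdas: (l3, l2, l1) weights summing to 1.
    """
    l3, l2, l1 = lambdas
    # max(..., 1) guards against unseen contexts in this toy sketch.
    tri = counts3.get((t2, t1, t), 0) / max(counts2.get((t2, t1), 0), 1)
    bi = counts2.get((t1, t), 0) / max(counts1.get(t1, 0), 1)
    uni = counts1.get(t, 0) / total
    return l3 * tri + l2 * bi + l1 * uni

# Invented counts for illustration.
counts3 = {("a", "b", "c"): 2}
counts2 = {("a", "b"): 4, ("b", "c"): 3}
counts1 = {"b": 10, "c": 5}
p = interpolated_tritag("a", "b", "c", counts3, counts2, counts1, total=50, lambdas=(0.6, 0.3, 0.1))
```

Here p = 0.6(2/4) + 0.3(3/10) + 0.1(5/50) = 0.4; the mixture keeps unseen trigrams from zeroing out the figure of merit.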
In fact, the performance of normalized β appears to level off in this range, while the amount of work done using the trigram estimate shows a continuing downward trend.</Paragraph> <Paragraph position="3"> Figure 5 shows the average CPU time to get 95% of the probability mass for each estimate and each sentence length. Each estimate averaged below 1 second on sentences of fewer than 7 words. (The y-axis has been restricted so that the normalized β and trigram estimates can be better compared.)</Paragraph> <Paragraph position="4"> Note that while straight β does perform better than the &quot;stack&quot; model in CPU time, the two models approach equivalent performance as sentence length increases, which is what would be expected from the edge count measures. The other two models provide a real time savings over the &quot;stack&quot; model, as can be seen from Figure 5 and from the total CPU times given earlier. Through most of the length range, the CPU time needed by the normalized β and the trigram estimate is quite close, but at the upper end of the range we can see better performance by the trigram estimate.</Paragraph> <Paragraph position="5"> (This improvement comes later than in the edge count statistics because of the small additional amount of overhead work needed to use the trigram estimate.)</Paragraph> </Section> <Section position="6" start_page="281" end_page="284" type="metho"> <SectionTitle> 4. Figures Involving Left Outside Probability </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="281" end_page="282" type="sub_section"> <SectionTitle> 4.1 Normalized α_Lβ </SectionTitle> <Paragraph position="0"> Earlier, we showed that our ideal figure of merit can be written as: p(N^i_{j,k} | t_{0,n}) = α(N^i_{j,k}) β(N^i_{j,k}) / p(t_{0,n}).</Paragraph> <Paragraph position="2"> However, the α term, representing outside probability, cannot be calculated directly during a parse, since we need the full parse of the sentence to compute it. [Figure 6: Left outside context.]</Paragraph> <Paragraph position="3"> In some of our figures of merit, we use the quantity p(N^i_{j,k}, t_{0,j}), which is closely related to outside probability. We call this quantity the left outside probability, and denote it α_L (see Figure 6).</Paragraph> <Paragraph position="4"> The following recursive formula can be used to compute α_L. Let E^i_{j,k} be the set of all edges, or rule expansions, in which the nonterminal N^i_{j,k} appears. For each edge e in E^i_{j,k}, we compute the product of α_L of the nonterminal appearing on the left-hand side (lhs) of the rule, the probability of the rule itself, and β of each nonterminal N^l_{r,s} appearing to the left of N^i_{j,k} in the rule. Then α_L(N^i_{j,k}) is the sum of these products: α_L(N^i_{j,k}) = Σ_{e ∈ E^i_{j,k}} α_L(lhs(e)) p(rule(e)) Π β(N^l_{r,s}), where the product ranges over the nonterminals N^l_{r,s} to the left of N^i_{j,k} in e.</Paragraph> <Paragraph position="6"> Given a complete parse of the sentence, the formula above gives an exact value for α_L. During parsing, the set E^i_{j,k} is not complete, and so the formula gives an approximation of α_L.</Paragraph> <Paragraph position="7"> This formula can be infinitely recursive, depending on the properties of the grammar. A method for calculating α_L more efficiently can be derived from the calculations given in Jelinek and Lafferty (1991).</Paragraph> <Paragraph position="8"> A simple extension to the normalized β model allows us to estimate the per-word probability of all tags in the sentence through the end of the constituent under consideration. This allows us to take advantage of information already obtained in a left-right parse. 
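The recursive computation of α_L described above can be sketched directly. This is an illustrative sketch only: the edge data layout is invented, and it assumes a cycle-free edge set (the text notes the recursion can be infinite for some grammars).

```python
def alpha_L(node, edges, beta, memo=None):
    """Left outside probability via the recursion in the text.

    edges[node] is a list of (lhs, rule_prob, left_siblings) triples, one per
    rule expansion in which `node` appears; left_siblings are the constituents
    to the node's left in that rule.  (Data layout invented for this sketch.)
    """
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    if node == "ROOT":
        return 1.0  # alpha_L of the start symbol is 1
    total = 0.0
    for lhs, rule_prob, left_sibs in edges.get(node, []):
        prod = alpha_L(lhs, edges, beta, memo) * rule_prob
        for sib in left_sibs:
            prod *= beta[sib]  # inside probability of each left sibling
        total += prod
    memo[node] = total
    return total

# Toy chart: ROOT -> A B with rule probability 1.0, and beta(A) = 0.3,
# so alpha_L(B) = alpha_L(ROOT) * 1.0 * beta(A) = 0.3.
beta = {"A": 0.3}
edges = {"B": [("ROOT", 1.0, ["A"])]}
```

Memoization keeps the sum-of-products from being recomputed, but does not by itself handle grammars where the recursion is genuinely infinite.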
We calculate this quantity as follows: (α_L(N^i_{j,k}) β(N^i_{j,k}))^{1/k}. We are again taking the geometric mean to avoid thrashing by compensating for the α_Lβ quantity's preference for shorter constituents, as explained in the previous section.</Paragraph> <Paragraph position="9"> We refer to this figure of merit as normalized α_Lβ.</Paragraph> </Section> <Section position="2" start_page="282" end_page="283" type="sub_section"> <SectionTitle> 4.2 Prefix Estimate </SectionTitle> <Paragraph position="0"> We also derived an estimate of the ideal figure of merit that takes advantage of statistics on the first j - 1 tags of the sentence as well as t_{j,k}. This estimate represents the probability of the constituent in the context of the preceding tags. [Table 2 rows: normalized α_Lβ: 39.7, 36.4, 57.3, 68,660; prefix estimate: 21.8, 17.4, 38.3, 26,520.] We derive it as follows: p(N^i_{j,k} | t_{0,n}) = p(N^i_{j,k}, t_{0,n}) / p(t_{0,n}) = p(t_{k,n}) p(N^i_{j,k}, t_{0,j} | t_{k,n}) p(t_{j,k} | N^i_{j,k}, t_{0,j}, t_{k,n}) / (p(t_{k,n}) p(t_{0,k} | t_{k,n})) = p(N^i_{j,k}, t_{0,j} | t_{k,n}) p(t_{j,k} | N^i_{j,k}, t_{0,j}, t_{k,n}) / p(t_{0,k} | t_{k,n}). We again make the independence assumption that p(t_{j,k} | N^i_{j,k}, t_{0,j}, t_{k,n}) ≈ β(N^i_{j,k}). Additionally, we assume that p(N^i_{j,k}, t_{0,j}) and p(t_{0,k}) are independent of t_{k,n}, giving:</Paragraph> <Paragraph position="2"> p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k}, t_{0,j}) β(N^i_{j,k}) / p(t_{0,k}). The denominator, p(t_{0,k}), is once again calculated from a tritag model. The p(N^i_{j,k}, t_{0,j}) term is just α_L, defined above in the discussion of the normalized α_Lβ model. Thus this figure of merit can be written as: α_L(N^i_{j,k}) β(N^i_{j,k}) / p(t_{0,k}).</Paragraph> <Paragraph position="4"> We will refer to this as the prefix estimate.</Paragraph> </Section> <Section position="3" start_page="283" end_page="284" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> The results for the figures of merit introduced in the previous section according to the measurements given in Section 2.2 are shown in Table 2.</Paragraph> <Paragraph position="1"> First, the leveling off of the geometric-mean-based models with sentence length can be seen clearly. 
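Once α_L and β are available, the two α_L-based figures of merit defined above reduce to simple arithmetic; a minimal sketch (argument values below are invented):

```python
def normalized_alpha_beta(alpha_l, beta, k):
    """Normalized alpha_L-beta: the geometric mean (alpha_L * beta)^(1/k),
    a per-word score over the first k tags of the sentence."""
    return (alpha_l * beta) ** (1.0 / k)

def prefix_estimate(alpha_l, beta, p_prefix):
    """Prefix estimate: alpha_L * beta / p(t_{0,k}), where p_prefix is the
    tritag-model probability of the first k tags."""
    return alpha_l * beta / p_prefix
```

For example, with α_L = 0.02, β = 0.1, and a tritag prefix probability of 0.005, the prefix estimate is 0.4; the cost is not in these formulas but in computing α_L itself.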
Second, when we consider only the two conditional-probability models, we can see that the additional information obtained from context in the prefix estimate gives a substantial improvement in this measure as compared to the trigram estimate.</Paragraph> <Paragraph position="2"> However, the CPU time needed to compute the α_L term exceeds the time saved by processing fewer edges. Note that using this estimate, the parser took over 26,000 seconds to get 95% of the probability mass, while the &quot;stack&quot; model can exhaustively parse the test data in less than 5,000 seconds. Figure 8 shows the average CPU time for each sentence length.</Paragraph> <Paragraph position="3"> While chart parsing and calculations of β can be done in O(n^3) time (see Appendix A), we have been unable to find an algorithm to compute the α_L terms faster than O(n^5).</Paragraph> <Paragraph position="5"> When a constituent is removed from the agenda, it only affects the β values of its ancestors in the parse trees; however, α_L values are propagated to all of the constituent's siblings to the right and all of its descendants. Recomputing the α_L terms when a constituent is removed from the agenda can be done in O(n^3) time, and since there are O(n^2) possible constituents, the total time needed to compute the α_L terms in this manner is O(n^5).</Paragraph> </Section> </Section> <Section position="7" start_page="284" end_page="287" type="metho"> <SectionTitle> 5. Figures Using Boundary Statistics </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="284" end_page="286" type="sub_section"> <SectionTitle> 5.1 Left Boundary Trigram Estimate </SectionTitle> <Paragraph position="0"> Although the α_L-based models seem impractical, the edge-count and constituent-count statistics show that contextual information is useful. 
We can derive an estimate similar to the prefix estimate but containing a much simpler model of the context as follows:</Paragraph> <Paragraph position="2"> p(N^i_{j,k} | t_{0,n}) = p(t_{0,j}, t_{k,n}) p(N^i_{j,k} | t_{0,j}, t_{k,n}) p(t_{j,k} | N^i_{j,k}, t_{0,j}, t_{k,n}) / (p(t_{0,j}, t_{k,n}) p(t_{j,k} | t_{0,j}, t_{k,n})). Once again applying the usual independence assumption that given a nonterminal, the tag sequence it generates depends only on that nonterminal, we can rewrite the figure of merit as follows: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{0,j}, t_{k,n}) β(N^i_{j,k}) / p(t_{j,k} | t_{0,j}, t_{k,n}). As usual, we use a trigram model for the tags, giving p(t_{j,k} | t_{0,j}, t_{k,n}) ≈ p(t_{j,k} | t_{j-2}, t_{j-1}).</Paragraph> <Paragraph position="3"> Now, we assume that p(N^i_{j,k} | t_{0,j}, t_{k,n}) ≈ p(N^i_{j,k} | t_{j-1}), that is, that the probability of a nonterminal depends only on the tag immediately before it in the sentence, giving:</Paragraph> <Paragraph position="5"> p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{j-1}) β(N^i_{j,k}) / p(t_{j,k} | t_{j-2}, t_{j-1}). We can calculate β(N^i_{j,k}) and the tritag probabilities as usual. The p(N^i_{j,k} | t_{j-1}) probabilities are estimated from our training data by parsing the training data and counting the occurrences of the nonterminal and the tag weighted by their probability in the parse. [Figure 10: Boundary context.]</Paragraph> <Paragraph position="6"> (Further details are provided in Appendix B.) 
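The weighted counting used to estimate p(N | t_{j-1}) can be sketched as follows. This is an illustrative sketch, not the authors' code: the observation format is invented, with one (previous tag, nonterminal, weight) triple per constituent occurrence in the parsed training data.

```python
from collections import defaultdict

def boundary_nonterminal_probs(observations):
    """Estimate p(N | previous tag) from (prev_tag, nonterminal, weight)
    triples, where weight is the constituent's probability in the parse."""
    joint = defaultdict(float)
    context = defaultdict(float)
    for prev_tag, nt, w in observations:
        joint[(prev_tag, nt)] += w
        context[prev_tag] += w
    # Normalize the weighted joint counts by the weighted context counts.
    return {pair: v / context[pair[0]] for pair, v in joint.items()}

# Invented observations: two NPs and one ADJP seen after a determiner tag.
obs = [("DT", "NP", 0.9), ("DT", "NP", 0.6), ("DT", "ADJP", 0.5)]
p = boundary_nonterminal_probs(obs)
```

Here p(NP | DT) = 1.5 / 2.0 = 0.75 and p(ADJP | DT) = 0.25; weighting by parse probability lets uncertain constituents contribute fractionally.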
We will refer to this figure as the left boundary trigram estimate.</Paragraph> </Section> <Section position="2" start_page="286" end_page="287" type="sub_section"> <SectionTitle> 5.2 Boundary Trigram Estimate </SectionTitle> <Paragraph position="0"> We can derive a similar estimate using context on both sides of the constituent as follows: p(N^i_{j,k} | t_{0,n}) = p(N^i_{j,k} | t_{0,j}) p(t_{j,k} | N^i_{j,k}, t_{0,j}) p(t_k | t_{0,k}, N^i_{j,k}) p(t_{k+1,n} | t_{0,k+1}, N^i_{j,k}) / (p(t_{j,k+1} | t_{0,j}) p(t_{k+1,n} | t_{0,k+1})).</Paragraph> <Paragraph position="2"> Once again applying the usual independence assumption that given a nonterminal, the tag sequence it generates depends only on that nonterminal, and also assuming that the probability of t_{k+1,n} depends only on the previous tags, we can rewrite the figure of merit as follows: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{0,j}) β(N^i_{j,k}) p(t_k | t_{0,k}, N^i_{j,k}) / p(t_{j,k+1} | t_{0,j}). Now we add some new independence assumptions. We assume that the probability of the nonterminal depends only on the immediately preceding tag, and that the probability of the tag immediately following the nonterminal depends only on the nonterminal (see Figure 10), giving:</Paragraph> <Paragraph position="4"> p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{j-1}) β(N^i_{j,k}) p(t_k | N^i_{j,k}) / p(t_{j,k+1} | t_{0,j}). As usual, we use a trigram model for the tags, giving p(t_{j,k+1} | t_{0,j}) ≈ p(t_{j,k+1} | t_{j-2}, t_{j-1}). Then we have: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{j-1}) β(N^i_{j,k}) p(t_k | N^i_{j,k}) / p(t_{j,k+1} | t_{j-2}, t_{j-1}).</Paragraph> <Paragraph position="6"> We can calculate β(N^i_{j,k}) and the tritag probabilities as usual. The p(N^i_{j,k} | t_{j-1}) and p(t_k | N^i_{j,k}) probabilities are estimated from our training data by parsing the training data and counting the occurrences of the nonterminal and the tag weighted by their probability in the parse. 
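The final boundary trigram estimate is again a product of precomputed quantities; a hedged sketch (argument values invented) of the combination:

```python
def boundary_trigram_fom(p_nt_given_prev, beta, p_tag_given_nt, p_tags_tritag):
    """Boundary trigram estimate:
    p(N | t_{j-1}) * beta(N) * p(t_k | N) / p(t_{j,k+1} | t_{j-2}, t_{j-1}),
    where p_tags_tritag is the tritag probability of tags t_j through t_k."""
    return p_nt_given_prev * beta * p_tag_given_nt / p_tags_tritag
```

For example, with p(N | t_{j-1}) = 0.5, β = 0.01, p(t_k | N) = 0.2, and a tritag probability of 0.004, the estimate is 0.25; relative to the left boundary version, the only extra work per constituent is one lookup of p(t_k | N) and one more multiply.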
Again, see Appendix B for details of how these estimates were obtained.</Paragraph> <Paragraph position="7"> We will refer to this figure as the boundary trigram estimate.</Paragraph> </Section> <Section position="3" start_page="287" end_page="287" type="sub_section"> <SectionTitle> 5.3 Boundary Statistics Only </SectionTitle> <Paragraph position="0"> We also wished to examine whether contextual information by itself is sufficient as a figure of merit. We can derive an estimate based only on easily computable contextual information as follows:</Paragraph> <Paragraph position="2"> p(N^i_{j,k} | t_{0,n}) = p(t_{0,j}) p(N^i_{j,k} | t_{0,j}) p(t_{j,k} | N^i_{j,k}, t_{0,j}) p(t_k | t_{0,j}, N^i_{j,k}, t_{j,k}) p(t_{k+1,n} | t_{0,j}, N^i_{j,k}, t_{j,k}, t_k) / (p(t_{0,j}) p(t_{j,k} | t_{0,j}) p(t_k | t_{0,k}) p(t_{k+1,n} | t_{0,k+1})) = p(N^i_{j,k} | t_{0,j}) p(t_{j,k} | N^i_{j,k}, t_{0,j}) p(t_k | t_{0,j}, N^i_{j,k}, t_{j,k}) p(t_{k+1,n} | t_{0,k+1}, N^i_{j,k}) / (p(t_{j,k} | t_{0,j}) p(t_k | t_{0,k}) p(t_{k+1,n} | t_{0,k+1})). Most of the independence assumptions we make are the same as in the boundary trigram estimate. We assume that the probability of the nonterminal depends only on the previous tag, that the probability of the immediately following tag depends only on the nonterminal, and that the probability of the tags following that depends only on the previous tags. However, we make one independence assumption that differs from all of our previous estimates. Rather than assuming that the probability of the tags within the constituent depends on the nonterminal, giving an inside probability term, we assume that the probability of these tags depends only on the previous tags.</Paragraph> <Paragraph position="3"> Then we have: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{j-1}) p(t_k | N^i_{j,k}) / p(t_k | t_{0,k}).</Paragraph> <Paragraph position="5"> In the denominator, we take p(t_k | t_{0,k}) ≈ p(t_k), giving: p(N^i_{j,k} | t_{0,n}) ≈ p(N^i_{j,k} | t_{j-1}) p(t_k | N^i_{j,k}) / p(t_k), which is simply the product of the two boundary statistics described in the previous section.</Paragraph> <Paragraph position="6"> We refer to this estimate as boundary statistics only.</Paragraph> <Paragraph position="7"> [Figure: Nonzero-length edges for 95% of the probability mass for the boundary estimates.]</Paragraph> </Section> </Section> </Paper>