<?xml version="1.0" standalone="yes"?> <Paper uid="J98-2004"> <Title>New Figures of Merit for Best-First Probabilistic Chart Parsing</Title> <Section position="10" start_page="292" end_page="296" type="concl"> <SectionTitle> 9. Conclusions </SectionTitle> <Paragraph position="0"> We have presented and evaluated several figures of merit for best-first parsing. The best performer according to all of our measures was the parser using the boundary trigram estimate as a figure of merit, and this result holds for two different grammars.</Paragraph> <Paragraph position="1"> This figure has the additional advantage that it can be easily incorporated into existing best-first parsers that use a figure of merit based on inside probability. (As mentioned earlier, the efficient online computation of β is described in Appendix A.) We strongly recommend this figure of merit as the basis for best-first statistical parsers.</Paragraph> <Paragraph position="2"> The measurements presented here almost certainly underestimate the true benefits of this model. We restricted sentence length to a maximum of 30 words in order to keep the number of edges in the exhaustive parse to a practical size; however, since the percentage of edges needed by the best-first parse decreases with increasing sentence length, we expect the improvement to be even more dramatic for sentences longer than 30 words.</Paragraph> <Paragraph position="3"> Appendix A: Efficient On-Line Computation of β We compute estimates of the inside probability β for each proposed constituent incrementally as new constituents are added to the chart. Initially, β is set to 1 for each terminal symbol, since our input is given as a stream of tags, which are our terminals. When a new proposed constituent is added to the agenda, its β estimate is set to its current inside probability according to the constituents already in the chart. However, as more constituents are added to the chart, we may find a new way to build up a proposed constituent, i.e., additional evidence for that proposed constituent, so we need to update the β for the proposed constituent (and also for affected constituents already in the chart, since these may in turn affect other proposed constituents).</Paragraph> <Paragraph position="4"> These updates can be quite expensive in terms of CPU time. However, many of the updates are quite small and do not affect the relative ordering of the proposed constituents on the agenda. Instead of propagating every change to β, then, we only want to propagate those changes that we expect to have an effect on this ordering.</Paragraph> <Paragraph position="5"> To this end, each constituent stores not only its β value but also an increment. Increases to the inside probability are added not to β itself, but to this increment, until the increment exceeds some threshold. Experimentally, we have found that we can avoid propagating increments until they exceed 1% of the current value of β with very little effect on the parser's selection of constituents from the agenda.</Paragraph> <Paragraph position="6"> This thresholding on the propagation of β allows us to update the β values online while still keeping the observed performance of the parser at O(n^3).</Paragraph>
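As a concrete illustration of this scheme, here is a minimal Python sketch of deferred increment propagation. It is our own illustration rather than the authors' implementation: the names (Constituent, add_inside_mass), the parents back-pointers, and the contribution_to_parent callback (which converts a change in a child's β into the change it induces in a parent's β, e.g., by scaling with the rule probability and the sibling constituents' β values) are all assumptions.

```python
class Constituent:
    """A (proposed) constituent with a lazily propagated inside-probability estimate."""

    def __init__(self, beta=0.0):
        self.beta = beta        # currently propagated estimate of beta
        self.increment = 0.0    # accumulated, not-yet-propagated increases to beta
        self.parents = []       # constituents whose beta estimates depend on this one


THRESHOLD = 0.01  # propagate once the pending increment exceeds 1% of beta


def add_inside_mass(c, delta, contribution_to_parent):
    """Record new evidence worth `delta` of inside probability for constituent c.

    The increase is buffered in c.increment; only when the buffer exceeds
    THRESHOLD * c.beta is it folded into c.beta and pushed to the parents.
    """
    c.increment += delta
    if c.increment > THRESHOLD * c.beta:
        flushed = c.increment
        c.beta += flushed
        c.increment = 0.0
        for parent in c.parents:
            # Translate the flushed change in c's beta into the change it
            # induces in the parent's beta, then propagate recursively.
            add_inside_mass(parent,
                            contribution_to_parent(c, parent, flushed),
                            contribution_to_parent)


# Tiny usage example with made-up numbers: a 0.0005 increase to the child's
# beta is only 0.5% of its current value, so it is buffered, not propagated.
np_ = Constituent(beta=0.02)
nn = Constituent(beta=0.1)
nn.parents.append(np_)
add_inside_mass(nn, 0.0005, lambda child, parent, d: 0.2 * d)
assert np_.beta == 0.02 and nn.increment == 0.0005
```

Buffering small increases this way bounds the staleness of any agenda entry's β estimate to roughly 1% while avoiding a cascade of recomputations for every tiny change.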
<Paragraph position="7"> Appendix B: Estimation of Boundary Statistics Our figures of merit incorporating boundary statistics use the figure $p(N^i_{j,k} \mid t_{j-1})$ to represent the effect of the left context and $p(t_{k+1} \mid N^i_{j,k}) / p(t_{k+1} \mid t_k)$ to represent the effect of the right context. For our experiments with the first grammar, which was learned from training data taken from the Brown corpus, we estimated these statistics from the same training data.</Paragraph> <Paragraph position="8"> First, we parsed the training data according to our grammar. (It was necessary to do this, rather than using the hand-annotated parses of the training data, because our grammar does not use the same set of nonterminals as the corpus; see Carroll and Charniak [1992a, 1992b] and Charniak and Carroll [1994] for details.) Since we use the tags as our input, the probability of a nonterminal appearing after a particular previous tag is the same as the probability of that nonterminal appearing in any sentence containing that tag.</Paragraph> <Paragraph position="9"> We can then count the probability-weighted occurrences of a nonterminal given the previous tag as follows:</Paragraph> <Paragraph position="10"> $$C(N^i, t_{j-1}) \;=\; \sum_{s} \sum_{j,k \,:\, s_{j-1} = t_{j-1}} \frac{\alpha(N^i_{j,k})\,\beta(N^i_{j,k})}{p(s)}$$ where $s_{j-1}$ denotes the tag in position $j-1$ of sentence $s$.</Paragraph> <Paragraph position="11"> That is, for each sentence that contains the previous tag $t_{j-1}$, we increment our count by the probability of the nonterminal $N^i_{j,k}$ occurring immediately following $t_{j-1}$ in that sentence.</Paragraph> <Paragraph position="12"> Since we have a complete parse, the inside and outside probabilities and the sentence probability can be easily computed. We can also obtain the count $C(t_{j-1})$ simply by counting the number of sentences in which that tag appears in position $j-1$. We then obtain the conditional probability for the left boundary statistic as follows:</Paragraph> <Paragraph position="13"> $$p(N^i \mid t_{j-1}) = \frac{C(N^i, t_{j-1})}{C(t_{j-1})}$$ </Paragraph> <Paragraph position="14"> The right boundary statistic is computed in the corresponding way.</Paragraph> <Paragraph position="15"> For the experiment using the treebank grammar, these statistics were obtained by counting directly from the Wall Street Journal treebank corpus, just as the grammar rules and trigram statistics were.</Paragraph>
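To make the counting scheme concrete, the following Python sketch accumulates the probability-weighted counts for the left boundary statistic. The data layout is assumed rather than taken from the paper: each parsed training sentence is a pair (tags, spans), where spans maps (nonterminal, j, k) to that constituent's outside and inside probabilities (α, β), and p(s) is read off the root span. The function name and the root_symbol parameter are likewise hypothetical, and as a simplification the sketch counts token occurrences of the previous tag rather than the paper's per-sentence counts.

```python
from collections import defaultdict


def estimate_left_boundary(parsed_sentences, root_symbol="S"):
    """Estimate p(N | t_{j-1}) from parsed training data.

    `parsed_sentences` is assumed to be a list of (tags, spans) pairs, where
    `tags` is the sentence's tag sequence and `spans` maps (nonterminal, j, k)
    to (alpha, beta) for the constituent covering positions j..k.
    """
    weighted = defaultdict(float)  # C(N, t_{j-1}): probability-weighted counts
    tag_count = defaultdict(int)   # C(t_{j-1}): occurrences of each previous tag

    for tags, spans in parsed_sentences:
        # p(s) is the inside probability of the root constituent.
        sent_prob = spans[(root_symbol, 0, len(tags))][1]
        # A tag can act as a "previous tag" anywhere except sentence-final position.
        for tag in tags[:-1]:
            tag_count[tag] += 1
        for (nt, j, k), (alpha, beta) in spans.items():
            if j == 0:
                continue  # no previous tag at the start of the sentence
            # alpha * beta / p(s) is the probability that nt spans (j, k)
            # in a parse of this sentence, i.e., that it follows tags[j-1].
            weighted[(nt, tags[j - 1])] += alpha * beta / sent_prob

    return {(nt, t): c / tag_count[t] for (nt, t), c in weighted.items()}
```

The right boundary statistic would be accumulated in the corresponding way, keying each span on the tag that follows it (position k) rather than the tag that precedes it.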
<Paragraph position="16"> Appendix C: Speed vs. Accuracy As an additional verification of our results, we gathered data on speed versus accuracy. For this experiment, we used the probabilistic context-free grammar learned from the Brown corpus and the average-length test sentences described in Section 5.4. For each figure of merit, we computed the average precision and recall of the best parse found as a function of the number of edges created. We computed unlabeled precision and recall only, since our grammar uses a different set of nonterminals from those used in the test data.</Paragraph> <Paragraph position="17"> Precision is defined as the percentage of the constituents proposed by our parser that are actually correct according to the treebank. For each edge count, we measured the precision of the best parse of each sentence found within that number of edges. Figure 15 is a graph of the average precision for the β figures of merit from Section 3, plotted against edge counts.</Paragraph> <Paragraph position="18"> The fluctuations at the low edge counts are due to the small amount of data at this level. At a low edge count, very few sentences have actually been parsed, and since these sentences tend to be short and simple, the parses are likely to be correct. The sentences that could not be parsed do not contribute to the measurement of precision. As more sentences are parsed, precision settles at about 47%, the highest precision attainable by our particular test grammar, and remains there as edge counts increase.</Paragraph> <Paragraph position="20"> This level of precision is independent of the figure of merit used, so measurement of precision does not help evaluate our figures of merit.</Paragraph> <Paragraph position="21"> A much more useful measure is recall. Recall is defined as the percentage of constituents in the treebank test data that are found by our parser. Again, we measured the recall of the best parse of each sentence found within each number of edges. Figure 16 shows the results for the figures of merit from Section 3.</Paragraph> <Paragraph position="22"> Straight β clearly shows little or no improvement over the "stack" parser using no figure of merit at all. The other figures of merit increase quickly to about 64%, the maximum recall attainable with our test grammar; the "stack" parser and the one using straight β, on the other hand, do not reach this maximum level until about 50,000 edges. We have no explanation for the relatively poor performance of the parser using the trigram estimate compared to the other best-first parsers, as shown in Figures 16, 17, and 18. Figure 17 shows the recall values for the α_L β figures of merit from Section 4, and Figure 18 shows recall for the boundary figures of merit from Section 5. Since precision is not a useful measure, we have not included precision data for these figures of merit.</Paragraph> <Paragraph position="23"> These data confirm that the parser using the boundary trigram figure of merit performs better than any of the others. Recall using this figure of merit is consistently higher than that of any other at low edge counts, and it reaches the maximum value in fewer than 2,000 edges, while the nearest competitors approach the maximum at about 3,000 edges.</Paragraph> </Section> </Paper>