<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1022"> <Title>Multilevel Coarse-to-fine PCFG Parsing</Title> <Section position="5" start_page="170" end_page="173" type="evalu"> <SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> In all experiments the system is trained on the Penn tree-bank, sections 2-21. Section 23 is used for testing and section 24 for development. The input to the parser is the gold-standard part-of-speech tags, not the words.</Paragraph>
<Paragraph position="1"> The point of parsing at multiple levels of granularity is to prune the results of the coarse levels before going on to finer levels. In particular, it is necessary for any pruning scheme to retain the true (gold-standard WSJ) constituents in the face of the pruning. To gain an idea of what is possible, consider Figure 3. According to the graph, at the zeroth level of parsing and a pruning threshold of 10^-4, the probability that a gold constituent is deleted due to pruning is slightly more than 0.001 (or 0.1%). At level three it is slightly more than 0.01 (or 1.0%).</Paragraph>
<Paragraph position="2"> The companion figure, Figure 4, shows the retention rate of the non-gold (incorrect) constituents. Again, at pruning threshold 10^-4 and parsing level 0 we retain about 0.3 (30%) of the bad constituents (so we pruned 70%), whereas at level 3 we retain about 0.004 (0.4%). Note that in the current paper we do not actually prune at level 3; instead we return the Viterbi parse. We include pruning results here in anticipation of future work in which level 3 would be a precursor to still more fine-grained parsing.</Paragraph>
<Paragraph position="3"> As noted in Section 2, there is some (implicit) debate in the literature on whether to use estimates of the outside probability in Equation 1 or instead to compute the exact upper bound. The idea is that an exact upper bound gives one an admissible search heuristic, but at a cost, since it is a less accurate estimator of the true outside probability. (Note that even the upper bound does not, in general, keep all of the gold constituents, since a non-perfect model will assign some of them low probability.) As is clear from Figure 3, the estimate works very well indeed.</Paragraph>
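To make the pruning mechanics concrete, the following is a minimal sketch of how a per-level threshold test of this kind could be applied to chart items. It is illustrative only: the function name, data layout, and helper structure are our assumptions, not the authors' code, and it presumes that inside probabilities and coarse-level outside estimates have already been computed as in Equation 1. The thresholds are the development-set values the authors report choosing in the next paragraph.

```python
# Illustrative sketch of per-level coarse-to-fine pruning (not the
# authors' implementation; names and data layout are assumptions).
# A chart item survives a level if its estimated posterior,
#   inside(c) * outside_estimate(c) / p(sentence),
# meets that level's threshold. Level 3 is final, so nothing is pruned.

THRESHOLDS = {0: 5e-4, 1: 1e-5, 2: 1e-4}  # dev-set values quoted in the text

def prune_chart(chart, level, sentence_prob):
    """Return the chart items kept at `level`.

    `chart` maps (start, end, label) -> (inside, outside_est),
    with both probabilities assumed precomputed by the parser.
    """
    threshold = THRESHOLDS.get(level)
    if threshold is None:               # level 3: no pruning
        return dict(chart)
    return {
        item: probs
        for item, probs in chart.items()
        if probs[0] * probs[1] / sentence_prob >= threshold
    }
```

Only the survivors are projected to the next, finer level, which is why the per-level totals reported in Figure 5 grow so slowly from level to level.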
<Paragraph position="4"> On the basis of this graph, we set the lowest allowable constituent probability at ≥ 5 × 10^-4, ≥ 10^-5, and ≥ 10^-4 for levels 0, 1, and 2, respectively. No pruning is done at level 3, since there is no level 4. After setting the pruning parameters on the development set we proceed to parse the test set (WSJ section 23). Figure 5 shows the resulting pruning statistics. The total number of constituents created at level 0, for all sentences combined, is 8.82 × 10^6. Of those, 7.55 × 10^6 (or 86.5%) are pruned before going on to level 1. At level 1, the 1.3 million left over from level 0 expand to a total of 9.18 × 10^6.</Paragraph>
<Paragraph position="5"> 70.8% of these in turn are pruned, and so forth.</Paragraph>
<Paragraph position="6"> The percent pruned at, e.g., level 1 in Figure 3 is much higher than that shown here because it considers all of the possible level-1 constituents, not just those left unpruned after level 0.</Paragraph>
<Paragraph position="7"> There is no pruning at level 3; there we simply return the Viterbi parse. We also show that with pruning we generate a total of 40.4 × 10^6 constituents. For comparison, exhaustively parsing using the tree-bank grammar yields a total of 392 × 10^6 constituents. This is the factor-of-10 workload reduction mentioned in Section 1.</Paragraph>
<Paragraph position="8"> There are two points of interest. The first is that each level of pruning is worthwhile; we do not get most of the effect from one or the other level. The second point is that we get significant pruning at level 0. The reader may remember that level 0 distinguishes only between the root node and the rest. We initially expected that it would be too coarse to distinguish good from bad constituents at this level, but it proved as useful as the other levels. The explanation is that this level does use the full tree-bank preterminal tags, and in many cases these alone are sufficient to make certain constituents very unlikely. For example, what is the probability of any constituent of length two or greater ending in a preposition? The answer is: very low. The same holds for constituents of length two or greater ending in modal verbs or determiners.</Paragraph>
<Paragraph position="9"> Not quite so improbable, but nevertheless less likely than most, would be constituents ending in verbs, or ending just short of the end of the sentence.</Paragraph>
[Figure 6 (table): Level | Time for Level | Running Total - parse times for WSJ section 23, with and without pruning.]
<Paragraph position="10"> Figure 6 shows how much time is spent at each level of the algorithm, along with a running total of the time spent up to that point. (This is for all sentences in the test set, length ≤ 100.) The number for the unpruned parser is again about ten times that for the pruned version, but the number for the standard CKY version is probably too high. Because our CKY implementation is quite slow, we ran the unpruned version on many machines and summed the results. In all likelihood at least some of these machines were overloaded, a fact that our local job distributor would not notice. We suspect that the real number is significantly lower, though still much higher than that of the pruned version.</Paragraph>
<Paragraph position="11"> Finally, Figure 7 shows that our pruning is accomplished without loss of accuracy. The results with pruning include four sentences that did not receive any parses at all. These sentences received zeros for both precision and recall and presumably lowered the results somewhat. We allowed ourselves to look at the first of these, which turned out to contain the phrase: (NP ... (INTJ (UH oh) (UH yes)) ...) The training data does not include interjections consisting of two "UH"s, and thus a gold parse cannot be constructed. Note that a different binarization scheme (e.g., the one used in Klein and Manning (2003b)) would have smoothed over this problem. In our case the unpruned version is able to patch together a lot of very unlikely constituents to produce a parse, but not a very good one. Thus we attribute the problem not to pruning, but to binarization.</Paragraph>
<Paragraph position="12"> We also show the results for the most similar Klein and Manning (2003b) experiment. Our results are slightly better. We attribute the difference to the fact that we have the gold tags and they do not, but their binarization scheme does not run into the problems that we encountered.</Paragraph> </Section> </Paper>