<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1035">
  <Title>Inside-Outside Estimation of a Lexicalized PCFG for German</Title>
  <Section position="7" start_page="272" end_page="274" type="evalu">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> For the evaluation, a total of 600 randomly selected clauses were manually annotated by two labelers. Using a chart browser, the labellers filled the appropriate cells with category names of NCs and those of maximal VP projections (cf. Figure 7 for an example of NC-labelling).</Paragraph>
    <Paragraph position="1"> Subsequent alignment of the labelers decisions resulted in a total of 1353 labelled NC categories (with four different cases). The total of 584 labelled VP categories subdivides into 21 different verb frames with 340 different lemma heads. The dominant frames are active transitive (164 occurrences) and active intransitive (117 occurrences). They represent almost half of the annotated frames. Thirteen frames occur less than ten times, five of which just once.</Paragraph>
    <Section position="1" start_page="272" end_page="273" type="sub_section">
      <SectionTitle>
6.1 Methodology
</SectionTitle>
      <Paragraph position="0"> To evaluate iterative training, we extracted maximum probability (Viterbi) trees for the 600 clause test set in each iteration of parsing. For extraction of a maximal probability parse in unlexicalized training, we used Schmid's lopar parser (Schmid, 1999). Trees were mapped to a database of parser generated markup guesses, and we measured precision and recall against the manually annotated category names and spans. Precision gives the ratio of correct guesses over all guesses, and recall the ratio of correct guesses over the number of phrases identified by human annotators. Here, we render only the precision/recall results on pairs of category names and spans, neglecting less interesting measures on spans alone. For the figures of adjusted recall, the number of unparsed misses has been subtracted from the number of possibilities.</Paragraph>
      <Paragraph position="1">  In the following, we focus on the combination of the best unlexicalized model and the lexicalized model that is grounded on the former.</Paragraph>
    </Section>
    <Section position="2" start_page="273" end_page="273" type="sub_section">
      <SectionTitle>
6.2 NC Evaluation
</SectionTitle>
      <Paragraph position="0"> Figure 8 plots precision/recall for the training runs described in section 5.1, with lexicalized parsing starting after 0, 2, or 60 unlexicalized iterations. The best results are achieved by starting with lexicalized training after two iterations of unlexicalized training. Of a total of 1353 annotated NCs with case, 1103 are correctly recognized in the best unlexicalized model and 1112 in the last lexicalized model. With a number of 1295 guesses in the unlexicalized and 1288 guesses in the final lexicalized model, we gain 1.2% in precision (85.1% vs. 86.3%) and 0.6% in recall (81.5% vs. 82.1%) through lexicalized training. Adjustment to parsed clauses yields 88% vs. 89.2% in recall. As shown in Figure 8, the gain is achieved already within the first iteration; it is equally distributed between corrections of category boundaries and labels.</Paragraph>
      <Paragraph position="1"> The comparatively small gain with lexicalized training could be viewed as evidence that the chunking task is too simple for lexical information to make a difference. However, we find about 7% revised guesses from the unlexicalized to the first lexicalized model. Currently, we do not have a clear picture of the newly introduced errors.</Paragraph>
      <Paragraph position="2"> The plots labeled &amp;quot;00&amp;quot; are results for lexicalized training starting from a random initial grammar. The precision measure of the first lexicalized model falls below that of the unlexicalized random model (74%), only recovering through lexicalized training to equalize the precision measure of the random model (75.6%).</Paragraph>
      <Paragraph position="3"> This indicates that some degree of unlexicalized initialization is necessary, if a good lexica\]ized model is to be obtained.</Paragraph>
      <Paragraph position="4"> (Skut and Brants, 1998) report 84.4% recall and 84.2% for NP and PP chunking without case labels. While these are numbers for a simpler problem and are slightly below ours, they are figures for an experiment on unrestricted sentences. A genuine comparison has to await extension of our model to free text.</Paragraph>
    </Section>
    <Section position="3" start_page="273" end_page="274" type="sub_section">
      <SectionTitle>
6.3 Verb Frame Evaluation
</SectionTitle>
      <Paragraph position="0"> Figure 9 gives results for verb frame recognition under the same training conditions. Again, we achieve best results by lexicalising the second unlexicalized model. Of a total of 584 annotated verb frames, 384 are correctly recognized in the best unlexicalized model and 397 through subsequent lexicalized training. Precision for the best unlexicalized model is 68.4%. This is raised by 2% to 70.4% through lexicalized training; recall is 65.7%/68%; adjustment by 41 unparsed misses makes for 70.4%/72.8% in recall. The rather small improvements are in contrast to 88 differences in parser markup, i.e. 15.7%, between the unlexicalized and second lexicalized model. The main gain is observed within the first two iterations (cf. Figure 9; for readability, we dropped the recall curves when more or less parallel to the precision curves).</Paragraph>
      <Paragraph position="1"> Results for lexicalized training without prior unlexicalized training are better than in the NC evaluation, but fall short of our best results by more than 2%.</Paragraph>
      <Paragraph position="2"> The most notable observation in verb frame evaluation is the decrease of precision of frame recognition in unlexicalized training from the second iteration onward. After several dozen it- null erations, results are 5% below a random model and 14% below the best model. The primary reason for the decrease is the mistaken revision of adjoined PPs to argument PPs. E.g.</Paragraph>
      <Paragraph position="3"> the required number of 164 transitive frames is missed by 76, while the parser guesses 64 VPt.nap frames in the final iteration against the annotator's baseline of 12. In contrast, lexicalized training generally stabilizes w.r.t, frame recognition results after only few iterations.</Paragraph>
      <Paragraph position="4"> The plot labeled &amp;quot;lex 60&amp;quot; gives precision for a lexicalized training starting from the unlexicalized model obtained with 60 iterations, which measured by linguistic criteria is a very poor state. As far as we know, lexicalized EM estimation never recovers from this bad state.</Paragraph>
    </Section>
    <Section position="4" start_page="274" end_page="274" type="sub_section">
      <SectionTitle>
6.4 Evaluation of non-PP Frames
</SectionTitle>
      <Paragraph position="0"> Because examination of individual cases showed that PP attachments are responsible for many errors, we did a separate evaluation of non-PP frames. We filtered out all frames labelled with a PP argument from both the maximal probability parses and the manually annotated frames (91 filtered frames), measuring precision and recall against the remaining 493 labeller annotated non-PP frames.</Paragraph>
      <Paragraph position="1"> For the best lexicalized model, we find somewhat but not excessively better results than those of the evaluation of the entire set of frames. Of 527 guessed frames in parser markup, 382 are correct, i.e. a precision of 72.5%. The recall figure of 77.5~0 is considerably better since overgeneration of 34 guesses is neglected.</Paragraph>
      <Paragraph position="2"> The differences with respect to different starting points for lexicalization emulate those in the evaluation of all frames.</Paragraph>
      <Paragraph position="3"> The rather spectacular looking precision and recall differences in unlexicalized training confirm what was observed for the full frame set. From the first trained unlexicalized model throughout unlexicalized training, we find a steady increase in precision (70% first trained model to 78% final model) against a sharp drop in recall (78% peek in the second model vs.</Paragraph>
      <Paragraph position="4"> 50% in the final). Considering our above remarks on the difficulties of frame recognition in unlexicalized training, the sharp drop in recall is to be expected: Since recall measures the correct parser guesses against the annotator's baseline, the tendency to favor PP arguments over PP-adjuncts leads to a loss in guesses when PP-frames are abandoned. Similarly, the rise in precision is mainly explained by the decreasing number of guesses when cutting out non-PP frames. For further discussion of what happens with individual frames, we refer the reader to (Beil et al., 1998).</Paragraph>
      <Paragraph position="5"> One systematic result in these plots is that performance of lexicalized training stabilizes after a few iterations. This is consistent with what happens with rule parameters for individual verbs, which are close to their final values within five iterations.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>