<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1060">
  <Title>Experiments in Parallel-Text Based Grammar Induction</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> For evaluation, we ran the PCFG resulting from training with the Viterbi algorithm10 on parts of the Wall Street Journal (WSJ) section of the Penn Treebank and compared the tree structure of the most probable parse for each test sentence against the gold-standard treebank annotation. (Note that one does not necessarily expect an induced grammar to match a treebank annotation, but the annotation can at least serve as a basis for comparison.) The evaluation criteria we apply are unlabeled bracketing precision and recall (and crossing brackets). We follow an evaluation criterion that Klein and Manning (2002, footnote 3) discuss for the evaluation of a not fully supervised grammar induction approach based on a binary grammar topology: bracket multiplicity (i.e., non-branching projections) is collapsed into a single set of brackets, since what matters is the constituent structure that was induced.11 For comparison, we provide baseline results for a uniform left-branching structure and a uniform right-branching structure (which encodes some non-trivial information about English syntax). As an upper bound on the performance a binary grammar can achieve on the WSJ, we present the scores for a minimal binarized extension of the gold-standard annotation.</Paragraph>
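The bracketing measures just described can be made concrete in a short sketch. The span representation, the collapsing of bracket multiplicity, and the left-/right-branching baselines below are an illustrative reconstruction of ours, not the paper's actual EVALB-based setup:

```python
# Illustrative sketch (not the paper's EVALB setup): unlabeled bracketing
# precision and recall with bracket multiplicity collapsed, plus uniform
# left- and right-branching baselines. A bracket is a span (i, j) over
# words i..j-1.

def collapse(brackets):
    """Collapse non-branching projections: identical spans count once
    (sets do this for free), and trivial one-word spans are dropped."""
    return {(i, j) for (i, j) in brackets if j - i > 1}

def bracket_scores(test, gold):
    """Unlabeled bracketing precision and recall after collapsing."""
    test, gold = collapse(test), collapse(gold)
    hits = len(test & gold)
    precision = hits / len(test) if test else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def left_branching(n):
    """Spans of a uniform left-branching binary tree over n words."""
    return {(0, j) for j in range(2, n + 1)}

def right_branching(n):
    """Spans of a uniform right-branching binary tree over n words."""
    return {(i, n) for i in range(0, n - 1)}
```

For a four-word sentence whose gold structure happens to be right-branching, the right-branching baseline scores perfect precision and recall, while the left-branching baseline shares only the whole-sentence span with the gold brackets.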
    <Paragraph position="1"> The results we can report at this point are based on a comparatively small training set.12 So, it may be too early for conclusive results. (An issue that arises with the small training set is that smoothing techniques would be required to avoid overtraining, but these tend to dominate the test application, so the effect of the parallel-corpus based information cannot be seen so clearly.) But we think that the results are rather encouraging.</Paragraph>
    <Paragraph position="2"> As the table in figure 5 shows, the PCFG we induced based on the parallel-text-derived weight factors reaches an F1-score of 57.5 for unlabeled precision and recall on sentences up to length 10.13 We show the scores for an experiment without smoothing, trained on c. 3,000 sentences. Since no smoothing was applied, the resulting coverage (with low-probability rules removed) on the test set is about 80%. It took 74 iterations of the inside-outside algorithm to train the weight-factor-trained grammar; the final version has 1005 rules. 11Note that we removed null elements from the WSJ but left punctuation in place. We used the EVALB program to obtain the measures; however, we preprocessed the bracketings to reflect the criteria discussed here.</Paragraph>
    <Paragraph position="3"> 12This is not due to scalability issues of the system; we expect to be able to run experiments on considerably larger training sets. Since no manual annotation is required, the available resources are practically unlimited.</Paragraph>
    <Paragraph position="4"> 13For sentences up to length 30, the F1-score drops to 28.7.</Paragraph>
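The inside-outside training mentioned above rests on the inside pass, which computes the probability that a nonterminal derives a given span. A minimal CKY-style inside pass for a PCFG in Chomsky normal form is sketched below; the toy grammar and sentence are invented for illustration and are unrelated to the induced X-bar grammar:

```python
# Minimal inside pass for a binary (CNF) PCFG. beta[(i, j, A)] is the
# probability that nonterminal A derives words[i:j]. This is the quantity
# the inside-outside (EM) loop re-estimates rule probabilities from.
from collections import defaultdict

def inside(words, lexical, binary, root="S"):
    """lexical: {(A, word): prob}; binary: {(A, B, C): prob} for A -> B C."""
    n = len(words)
    beta = defaultdict(float)
    # Base case: preterminal spans of width 1.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1, A)] += p
    # Recursive case: combine two adjacent subspans via A -> B C.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for (A, B, C), p in binary.items():
                for k in range(i + 1, j):
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k, j, C)]
    return beta[(0, n, root)]

# Invented toy grammar, purely to exercise the code.
lex = {("N", "dogs"): 1.0, ("V", "bark"): 1.0}
bin_rules = {("S", "N", "V"): 1.0}
prob = inside(["dogs", "bark"], lex, bin_rules)
```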
    <Paragraph position="5"> For comparison, we induced another PCFG based on the same X-bar topology without using the weight-factor mechanism. This grammar ended up with 1145 rules after 115 iterations. Its F1-score is only 51.3 (while its coverage is the same as for the weight-factor-trained grammar).</Paragraph>
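The F1-scores compared here are the harmonic mean of unlabeled bracketing precision and recall; a small helper (ours, for reference only) makes the formula explicit:

```python
# F1 as the harmonic mean of precision and recall. The scores 57.5 and
# 51.3 in the text come from the paper's experiments, not from this code.
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```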
    <Paragraph position="6"> Figure 6 shows the complete set of (singular) &quot;NP rules&quot; emerging from the weight-factor-trained grammar, which are remarkably well-behaved, in particular when we compare them to the corresponding rules from the PCFG induced in the standard way (figure 7). (XP categories are written as ⟨POS-TAG⟩-P and X head categories as ⟨POS-TAG⟩-0, so the most probable NP productions in figure 6 are NP → N PP, NP → N, NP → ADJP N, NP → NP PP, and NP → N PropNP.) Of course we are comparing an unsupervised technique with a mildly supervised technique; but the results indicate that the relatively subtle information discussed in section 2 is indeed very useful.</Paragraph>
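A rule table like the one in figure 6 can be inspected by sorting the productions for a category by probability. The five NP productions below come from the text; the probabilities are invented placeholders, since the paper's actual values are only in the figure:

```python
# Hypothetical rule table; probabilities are made up for illustration.
rules = {
    ("NP", ("N", "PP")): 0.30,
    ("NP", ("N",)): 0.25,
    ("NP", ("ADJP", "N")): 0.20,
    ("NP", ("NP", "PP")): 0.15,
    ("NP", ("N", "PropNP")): 0.10,
}

def top_productions(rules, lhs, k=5):
    """The k most probable productions for a given left-hand side."""
    cands = [(rhs, p) for (a, rhs), p in rules.items() if a == lhs]
    return sorted(cands, key=lambda x: -x[1])[:k]
```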
  </Section>
</Paper>