<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0708">
  <Title>MDL-based DCG Induction for NP Identification</Title>
  <Section position="7" start_page="65" end_page="66" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> For our experiments we used material supplied by the CoNLL99 workshop organisers. This consisted of 48,224 fully parsed training sentences, and a disjoint set of 984 testing sentences. Both sets were randomly drawn from the parsed section of the Wall Street Journal. The test set came in two versions, differing from each other in how the sentences were marked:up. The first version consisted of sentences with NP bracketings marked (results using this test set are given in table 1).</Paragraph>
    <Paragraph position="1"> The second version had NP bracketings marked, and within each marked NP, there was an internal parse (results for this version are in table 2). These parses were labelled with Penn Nonterminals. Each test sentence was trivially rooted with an S symbol (necessary for the evaluation software). To make this clearer, if an  For computational reasons, we could not deal with all sentences in the training set. and when learning ! rules, we limited ourselves to ser/tences with a maximum length of 15 tokens. During evaluation, we used sentences with a maximum length of 30 tokens. This reduced the training set to 10,249 parsed sentences, and the test set to 739 sentences. Finally, we retagged the CoNLL99 material with the Claws2 tagset (required by TSG). Evaluation was carried out by Parseval (which reports unlabelled bracketing results: we do not report labelled results as TSG does not use the Penn Non-terminal set) \[16\]. Note that evaluation is based upon bracketing, and not word accuracy. For example, if we failed to include one word in a NP that contains four other words, we would have a bracketing accuracy of 0. On the other hand, a word accuracy result would be 80%.</Paragraph>
    <Paragraph position="2"> Asa comparison, we evaluated TSG upon the testing material. This is experiment 5 in tables 1 and 2. The other four experiments differed from each other in terms of what the learner was trained upon:  1. Just tagged sentences.</Paragraph>
    <Paragraph position="3"> 2. Tagged Sentences with NP bracketings marked. We reduced WSJ parses to include just their NP bracketings. null 3. Tagged sentences with NPs bracketing annotated  with an internal parse. Again, we mapped WSJ parses to reduced parses containing just annotated NPs.</Paragraph>
    <Paragraph position="4"> 4. Tagged sentences with a full Wall Street Journal parse.</Paragraph>
    <Paragraph position="5"> For each experiment, we report the size of the final grammar, the percentage of testing sentences covered (assigned a full parse), crossing rates, recall and precision results with respect to testing sentences with NPs bracketed and those containing annotated NPs. For the bracketing task, we mapped full parses, produced by nmdels, to parses just containing NP bracketing. For the aimotation task. we mapped full parses to parses containing just NPs with an internal annotation. Note that within our grammatical framework, the best mapping is not clear (since parses produced by our models have categories using multiple bar levels, whilst WSJ parses ouly use a single level). As a guess, we treated bar 1 and bar 2 nominal categories as being NPs. This means that our precision results are lowered, since in general, we produce more NPs than would be predicted by a WSJ parse.</Paragraph>
    <Paragraph position="6"> In each case. we evaluate the performance of a model in terms of the highest ranked parse, and secondly, in terms of the &amp;quot;best' parse, out of the top 10 parses produced. Here &amp;quot;best' means the parse produced that is closest, in terms of a weighted sum crossing rates, precision and recall, to the manually selected parse. This final set of results gives an indication of how well our system would perform if it had a much better parse selection mechanism. Best figures are marked in parentheses. null Figure 1 gives our results for the bracketing task, whilst figure 2 gives our results for the annotation task. Model size and coverage results were id~-ltical for both tests, so the second table omits them.</Paragraph>
    <Paragraph position="7">  Firstly, when compared with other work on NP recovery, our results are poor. As was mentioned in the search section, this is largely due to our system being based upon a language model that has well known limitations. Furthermore, as was argued in the iutroduction, full NPs are by definition harder to identify than base NPs, so we would expect our results to be worse. Secondly, we see that the bracketing task is easier than the annotation task: generally, the results in table 1 are better than the results in table 2. Given the fact that the annotation search space is larger than the bracketing space, this should come as no surprise. Turning now to the individual experiments, we see that parsed corpora (experiments 2, 3 and 4) is all informative constraint upon NP induction. Rules learnt using parsed corpora better capture regularities than do rules learnt from just raw text (experiment 1). This is shown by the increased coverage results of experiments 2.3 and 4 over  1. In terms of crossing rates, recall and precision, no clear story has emerged. Surprisingly, there seems to be  minimal difference in coverage when using either annotated NPs or full parses. This could be due to a number of reasons, such as WSJ NPs being more reliably annotated than other phrases, simple artifactual problems  with the learner, the evaluation metrics being too coarse to show any real differences, etc. Further, qualitative investigation should determine whether there are any differences in the parses that TSG alone cannot assign to sentences.</Paragraph>
    <Paragraph position="8"> Due to time constraints, we did not measure statistical significance tests between the various experiments. A later version of this paper (available from the author, osborne@let.rug.nl) will report these tests.</Paragraph>
  </Section>
class="xml-element"></Paper>