
<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2004">
  <Title>The Effect of Corpus Size in Combining Supervised and Unsupervised Training for Disambiguation</Title>
  <Section position="7" start_page="28" end_page="30" type="evalu">
    <SectionTitle>
5 Evaluation and Discussion
</SectionTitle>
    <Paragraph position="0"> Evaluation results are shown in Table 4. The lines marked LBD evaluate the performance of LBD separately (without Collins' parser).</Paragraph>
    <Paragraph position="1"> LBD is significantly better than the baseline for PP attachment (p &lt; 0.001, all tests are kh2 tests). LBD is also better than baseline for RC attachment, but this result is not significant due to the small size of the data set (264). Note that the baseline for PP attachment is 51.4% as indicated in the table (upper right corner of PP table), but that the base-line for RC attachment is 73.1%. The difference between 73.1% and 76.1% (upper right corner of RC table) is due to the fact that for RC attachment LBD proper is embedded in a decision list. The decision list alone, with an 4This list contains 136 entries and was semiautomatically computed from the Reuters corpus: Antecedents of who relative clauses were extracted, and the top 200 were filtered manually.</Paragraph>
    <Paragraph position="2">  is the size of the test set. The baselines are 73.1% (RC) and 51.4% (PP). The combined method performs better for small training sets. There is no significant difference between 10%, 50% and 100% for the combination method (p &lt; 0.05).</Paragraph>
    <Paragraph position="3"> unlabeled corpus of size 0, achieves a performance of 76.1%.</Paragraph>
    <Paragraph position="4"> The bottom five lines of each table evaluate combinations of a parameter set trained on a subset of WSJ (0.05% - 50%) and a particular size of the unlabeled corpus (100% 0%). In addition, the third column gives the performance of Collins' parser without LBD.</Paragraph>
    <Paragraph position="5"> Recall that test set size (second column) varies because we discard a test instance if Collins' parser does not recognize that there is an ambiguity (e.g., because of a parse failure). As expected, performance increases as the size of the training set grows, e.g., from 58.0% to 82.8% for PP attachment.</Paragraph>
    <Paragraph position="6"> The combination of Collins and LBD is consistently better than Collins for RC attachment (not statistically significant due to the size of the data set). However, this is not the case for PP attachment. Due to the good performance of Collins' parser for even small training sets, the combination is only superior for the two smallest training sets (significant for the smallest set, p &lt; 0.001).</Paragraph>
    <Paragraph position="7"> The most surprising result of the experiments is the small difference between the three unlabeled corpora. There is no clear pattern in the data for PP attachment and only a small effect for RC attachment: an increase between 1% and 2% when corpus size is increased from 10% to 100%.</Paragraph>
    <Paragraph position="8"> We performed an analysis of a sample of incorrectly attached PPs to investigate why unlabeled corpus size has such a small effect. We found that the noisiness of the statistics extracted from Reuters were often responsible for attachment errors. The noisiness is caused by our filtering strategy (ambiguous PPs are not used, resulting in undercounting), by the approximation of counts by Lucene (Lucene overcounts and undercounts as discussed in Section 3) and by minipar parse errors. Parse errors are particularly harmful in cases like the impact it would have on prospects, where, due to the extraction of the NP impact, minipar attaches the PP to the verb. We did not filter out these more complex ambiguous cases. Finally, the two corpora are from distinct sources and from distinct time periods (early nineties vs. mid-nineties). Many topicand time-specific dependencies can only be mined from more similar corpora.</Paragraph>
    <Paragraph position="9"> The experiments reveal interesting differences between PP and RC attachment.</Paragraph>
    <Paragraph position="10"> The dependencies used in RC disambiguation rarely occur in an ambiguous context (e.g., most subject-verb dependencies can be reliably extracted). In contrast, a large proportion of the dependencies needed in PP disambiguation (verb-prep and noun-prep dependencies) do occur in ambiguous contexts. Another difference is that RC attachment is syntactically more complex. It interacts with agreement, passive and long-distance depen- null dencies. The algorithm proposed for RC applies grammatical constraints successfully. A final difference is that the baseline for RC is much higher than for PP and therefore harder to beat.5 An innovation of our disambiguation system is the use of a search engine, lucene, for serving up dependency statistics. The advantage is that counts can be computed quickly and dynamically. New text can be added on an ongoing basis to the index. The updated dependency statistics are immediately available and can benefit disambiguation performance.</Paragraph>
    <Paragraph position="11"> Such a system can adapt easily to new topics and changes over time. However, this architecture negatively affects accuracy. The unsupervised approach of (Hindle and Rooth, 1993) achieves almost 80% accuracy by using partial dependency statistics to disambiguate ambiguous sentences in the unlabeled corpus.</Paragraph>
    <Paragraph position="12"> Ambiguous sentences were excluded from our index to make index construction simple and efficient. Our larger corpus (about 6 times as large as Hindle et al.'s) did not compensate for our lower-quality statistics.</Paragraph>
  </Section>
class="xml-element"></Paper>