<?xml version="1.0" standalone="yes"?> <Paper uid="P04-2010"> <Title>A Machine Learning Approach to German Pronoun Resolution</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> Before evaluating the actual system, we compared the performance of boosting to that of C4.5, as reported in (Strube et al., 2002). Trained on the same corpus and evaluated with 10-fold cross-validation, boosting significantly outperforms C4.5 on both personal and possessive pronouns (see Table 2). These results support the intuition that ensemble methods are superior to single classifiers. To put the performance of our system into perspective, we established a baseline and an upper bound for the task. The baseline chooses as the antecedent the closest non-pronominal markable that agrees in number and gender with the pronoun.</Paragraph> <Paragraph position="1"> The upper bound is the system's performance on the manually annotated (gold standard) data without the semantic features.</Paragraph> <Paragraph position="2"> For the baseline, accuracy is significantly higher for the gold standard data than for the two test sets (see Table 3). This shows that agreement is the most important feature, which, if annotated correctly, resolves almost half of the pronouns.</Paragraph> <Paragraph position="3"> The classification results for the gold standard data, which are much lower than the ones in Table 2, also demonstrate the importance of the semantic features. As for the test sets, while the classifier significantly outperformed the baseline on the HTC set, it yielded no improvement on the Spiegel set.
This shows the limitations of an algorithm trained on overly restricted data.</Paragraph> <Paragraph position="4"> Among the selection heuristics, resolving pronominal antecedents proved consistently more effective than ignoring them, while the results for the closest-first and best-first strategies were mixed. They imply, however, that the best-first approach should be chosen if the classifier performs above a certain threshold; otherwise the closest-first approach is safer.</Paragraph> <Paragraph position="5"> Overall, the fact that 67.2% of the pronouns were correctly resolved in the automatically annotated HTC test set, while the upper bound is 82.0%, validates the approach taken for this system.</Paragraph> </Section> </Paper>