File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/03/p03-1023_evalu.xml
Size: 5,794 bytes
Last Modified: 2025-10-06 13:58:57
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1023"> <Title>Coreference Resolution Using Competition Learning Approach</Title> <Section position="10" start_page="10" end_page="10" type="evalu"> <SectionTitle> 5 Evaluation and Discussion </SectionTitle> <Paragraph position="0"> Our coreference resolution approach is evaluated on the standard MUC-6 (1995) and MUC-7 (1998) data set. For MUC-6, 30 &quot;dry-run&quot; documents annotated with coreference information could be used as training data. There are also 30 annotated training documents from MUC-7. For testing, we utilize the 30 standard test documents from MUC-6 and the 20 standard test documents from MUC-7.</Paragraph> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 5.1 Baseline Systems </SectionTitle> <Paragraph position="0"> In the experiment we compared our approach with the following research works: 1. Strube's S-list algorithm for pronoun resolution (Stube, 1998).</Paragraph> <Paragraph position="1"> 2. Ng and Cardie's machine learning approach to coreference resolution (Ng and Cardie, 2002a). 3. Connolly et al.'s machine learning approach to anaphora resolution (Connolly et al., 1997). Among them, S-List, a version of centering algorithm, uses well-defined heuristic rules to rank the antecedent candidates; Ng and Cardie's approach employs the standard single-candidate model and &quot;Best-First&quot; rule to select the antecedent; Connolly et al.'s approach also adopts the twin-candidate model, but their approach lacks of candidate filtering strategy and uses greedy linear search to select the antecedent (See &quot;Related work&quot; for details).</Paragraph> <Paragraph position="2"> We constructed three baseline systems based on the above three approaches, respectively. For comparison, in the baseline system 2 and 3, we used the similar feature set as in our system (see table 1).</Paragraph> </Section> <Section position="2" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 5.2 Results and Discussion </SectionTitle> <Paragraph position="0"> Table 2 and 3 show the performance of different approaches in the pronoun and non-pronoun resolution, respectively. In these tables we focus on the abilities of different approaches in resolving an anaphor to its antecedent correctly. The recall measures the number of correctly resolved anaphors over the total anaphors in the MUC test data set, and the precision measures the number of correct anaphors over the total resolved anaphors. The F-measure F=2*RP/(R+P) is the harmonic mean of precision and recall.</Paragraph> <Paragraph position="1"> The experimental result demonstrates that our competition learning approach achieves a better performance than the baseline approaches in resolving pronominal anaphors. As shown in Table 2, our approach outperforms Ng and Cardie's single-candidate based approach by 3.7 and 5.4 in F-measure for MUC-6 and MUC-7, respectively.</Paragraph> <Paragraph position="2"> Besides, compared with Strube's S-list algorithm, our approach also achieves gains in the F-measure by 3.2 (MUC-6), and 1.6 (MUC-7). In particular, our approach obtains significant improvement (21.1 for MUC-6, and 13.1 for MUC-7) over Con- null Compared with the gains in pronoun resolution, the improvement in non-pronoun resolution is slight. As shown in Table 3, our approach resolves non-pronominal anaphors with the recall of 51.3 (39.7) and the precision of 90.4 (87.6) for MUC-6 (MUC-7). 
<Paragraph position="1"> The experimental results demonstrate that our competition learning approach achieves better performance than the baseline approaches in resolving pronominal anaphors. As shown in Table 2, our approach outperforms Ng and Cardie's single-candidate based approach by 3.7 and 5.4 in F-measure for MUC-6 and MUC-7, respectively.</Paragraph>
<Paragraph position="2"> Besides, compared with Strube's S-list algorithm, our approach also achieves gains in F-measure of 3.2 (MUC-6) and 1.6 (MUC-7). In particular, our approach obtains a significant improvement (21.1 for MUC-6 and 13.1 for MUC-7) over Connolly et al.'s approach. Compared with the gains in pronoun resolution, the improvement in non-pronoun resolution is slight. As shown in Table 3, our approach resolves non-pronominal anaphors with a recall of 51.3 (39.7) and a precision of 90.4 (87.6) for MUC-6 (MUC-7). Compared with Ng and Cardie's approach, our approach improves by only 0.3 (0.6) in recall and 0.5 (1.2) in precision. The reason may be that in non-pronoun resolution, the coreference of an anaphor and its candidate is usually determined only by a few strongly indicative features such as alias, apposition, and string matching (this explains why we obtain high precision but low recall in non-pronoun resolution). Therefore, most of the positive candidates are coreferential to the anaphors even though they are not the &quot;best&quot;. As a result, we see only a comparatively slight difference between the performances of the two approaches.</Paragraph>
<Paragraph position="3"> Although Connolly et al.'s approach also adopts the twin-candidate model, it achieves poor performance in both pronoun resolution and non-pronoun resolution. The main reason is the absence of a candidate filtering strategy in their approach (this is why the recall equals the precision in the tables). Without candidate filtering, the recall may rise, since correct antecedents are not wrongly eliminated. Nevertheless, the precision drops considerably due to the numerous invalid NPs in the candidate set. As a result, their approach obtains a significantly lower F-measure.</Paragraph>
<Paragraph position="4"> Table 4 summarizes the overall performance of the different approaches to coreference resolution. Unlike Tables 2 and 3, here we focus on whether a coreferential chain can be correctly identified. For this purpose, we obtain the recall, precision, and F-measure using the standard MUC scoring program (Vilain et al., 1995) for the coreference resolution task. Here, recall means the correctly resolved chains over all coreferential chains in the data set, and precision means the correctly resolved chains over all resolved chains.</Paragraph>
<Paragraph position="5"> In line with the previous experiments, we see a reasonable improvement in the performance of coreference resolution: compared with the baseline approach based on the single-candidate model, the F-measure of our approach increases from 69.4 to 71.3 for MUC-6, and from 58.7 to 60.2 for MUC-7.</Paragraph>
</Section>
</Section>
</Paper>