<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4009">
<Title>Competitive Self-Trained Pronoun Interpretation</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle>5 Results</SectionTitle>
<Paragraph position="0"> Reporting on the results of a self-trained system means evaluating the system against annotated data only once, since any system reconfiguration and re-evaluation based on the feedback received would constitute a form of indirectly supervised training. Thus we had to select a configuration as representing our &quot;reportable&quot; system before doing any evaluation. To allow for the closest comparison with our supervised system, we opted to train the system with the same number of pronouns that we had in our supervised training set (2773), and sought to have approximately the same ratio of positive to negative training instances, which meant randomly including one-fifth of the pronouns in the raw data that had more than one possible antecedent (see step 1d). Later we report on post-hoc experiments to assess the effect of training data size on performance.</Paragraph>
<Paragraph position="1"> The self-trained system was trained fourteen times, once using each of fourteen different segments of the TDT-2 data that we had arbitrarily apportioned at the inception of the project. The scores reported below and in Table 1 for the self-trained system are averages of the fourteen corresponding evaluations. The final results are as follows: The self-trained system beats the competitive Hobbs baseline system by 4.6% and comes within 2.3% of the supervised system trained on the same number of manually-annotated pronouns.2 Convergence for the self-trained system was fairly rapid, taking between 8 and 14 iterations. The number of changes in the current model's predictions started off relatively high in early iterations (averaging approximately 305 pronouns, or 11% of the dataset) and then steadily declined (usually, but not always, monotonically) until convergence. Post-hoc analysis showed that the iterative phase contributed a gradual (although again not completely monotonic) improvement in performance during the course of learning.</Paragraph>
<Paragraph position="2"> 2 All results are reported here in terms of accuracy, that is, the number of pronouns correctly resolved divided by the total number of pronouns read in from the key. An antecedent is considered correct if the ACE keys place the pronoun and antecedent in the same coreference class. In the case of 64 of the 762 pronouns in the evaluation set, none of the antecedents input to the learning algorithms were coreferential. Thus, 91.6% accuracy is the best that these algorithms could have achieved.</Paragraph>
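To make the iterative procedure and the accuracy ceiling concrete, the following is a minimal illustrative Python sketch, not the implementation evaluated above; the callables train_fn and predict_fn, the pronoun objects with an id attribute, and the other names are assumptions introduced only for exposition. The loop stops once the model's predictions for the pronoun set no longer change from one pass to the next, mirroring the convergence criterion described above.

    # A minimal, illustrative sketch (not the system evaluated above).
    # train_fn, predict_fn, and pronoun objects with an .id attribute are
    # assumed to be supplied by the caller.
    def self_train(pronouns, initial_labels, train_fn, predict_fn, max_iterations=50):
        """Retrain on the model's own predictions until they stop changing."""
        labels = dict(initial_labels)              # pronoun id -> chosen antecedent
        for iteration in range(1, max_iterations + 1):
            model = train_fn(pronouns, labels)     # e.g. fit a MaxEnt model on current guesses
            new_labels = {p.id: predict_fn(model, p) for p in pronouns}
            changes = sum(1 for pid, ant in new_labels.items() if ant != labels[pid])
            labels = new_labels
            if changes == 0:                       # predictions stable: converged
                return model, labels, iteration
        return model, labels, max_iterations

    # Accuracy as defined in footnote 2, and the ceiling it implies: 64 of the
    # 762 evaluation pronouns have no coreferential candidate antecedent.
    def accuracy(num_correct, num_pronouns):
        return num_correct / num_pronouns

    ceiling = accuracy(762 - 64, 762)              # 0.916..., the 91.6% figure above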
<Paragraph position="3"> In Kehler et al. (2004) we describe two ways in which our supervised system was augmented to use predicate-argument frequencies, one which used them in a post-processor and another which modeled them with features alongside our morphosyntactic ones. In our self-trained system, the first of these methods improved performance to 75.1% (compared to 76.8% for the supervised system) and the second to 74.1% (compared to 75.7% for the supervised system).</Paragraph>
<Paragraph position="4"> We then performed a set of post-hoc experiments to measure the effect of training data size on performance for the self-trained system. The results are given in Table 1, which shows a gradual increase in performance as the number of pronouns grows. The final row includes the results when all of the &quot;unambiguous&quot; pronouns in each TDT segment are utilized (again, along with approximately one-fifth of the ambiguous pronouns), which amounted to between 7,212 and 11,245 total pronouns.3 (Note that since most pronouns have more than one possible antecedent, the number of pronoun-antecedent training examples fed to MaxEnt is considerably higher than the numbers of pronouns shown in the table.) Perhaps one of the more striking facts is how well the algorithm performs with relatively few pronouns, which suggests that the generality of the features used allows for fairly reliable estimation without much data.</Paragraph>
</Section>
</Paper>