<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2302">
<Title>Another Evaluation of Anaphora Resolution Algorithms and a Comparison with GETARUNS' Knowledge Rich Approach</Title>
<Section position="7" start_page="6" end_page="53" type="evalu">
<SectionTitle> 4. Evaluation and General Discussion </SectionTitle>
<Paragraph position="0"> Evaluating anaphora resolution systems calls for a reformulation of the usual parameters of Precision and Recall as introduced in the IR/IE field. There, evaluation proceeds at two levels: a first stage in which systems are measured for their capacity to retrieve/extract relevant items from the corpus/web (coverage-recall), and a second stage in which systems are evaluated for their capacity to match the content of the query (accuracy-precision). In IR/IE the items to be matched are usually words or phrases, and pattern-matching procedures are the norm.</Paragraph>
<Paragraph position="1"> However, for AR systems this is not sufficient, and heavy NLP techniques are needed to obtain valuable results. As Mitkov also notes, this phase jeopardizes the capacity of AR systems to reach satisfactory accuracy scores simply because of its intrinsic weakness: none of the off-the-shelf parsers currently available exceeds 90% accuracy.</Paragraph>
<Paragraph position="2"> To clarify these issues, we present two tables below. In the first we report data related to the vexed question of whether pleonastic &quot;it&quot; should be regarded as part of the anaphora resolution task or rather as part of a separate classification task, as suggested in a number of papers by Mitkov. In the former case, these occurrences should contribute to the overall anaphora resolution evaluation metrics; in the latter case they should be computed separately, as a classification over all occurrences of &quot;it&quot; in the current dataset, and discarded from the overall count.</Paragraph>
<Paragraph position="3"> Even though we do not fully agree with Mitkov's position, we find it useful to deal with &quot;it&quot; separately, due to its high inherent ambiguity. Besides, it is true that the AR task is not like an Information Retrieval task.</Paragraph>
<Paragraph position="4"> In Table 1 below we report figures for &quot;it&quot; in order to evaluate the three algorithms with respect to the classification task. Then in Table 2 we report general data for which we computed the two types of accuracy reported in the literature. In Table 1 we split results for &quot;it&quot; into Wrong Reference vs. Wrong Classification: following Mitkov, we only counted anaphora-related cases and disregarded those occurrences of &quot;it&quot; which were wrongly classified as expletives. There are 189 occurrences of expletive &quot;it&quot; in the text, so at first we computed coverage and accuracy with the usual formulas reported below. Then, in one case, we subtracted the wrongly classified cases from the total number of &quot;it&quot; found by the system (following Mitkov, who claims that wrongly classified occurrences of &quot;it&quot; found by the system should not count); in the other case, this number was subtracted from the total number of &quot;it&quot; to be found in the text. Only for MARS did we then compute these different measures of Coverage and Accuracy. If we regard this approach as worth pursuing, we come up with two Adjusted Accuracy measures, related to the total numbers of anaphors as revised by the two subtractions indicated above.</Paragraph>
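As an informal illustration only, not as code belonging to any of the systems evaluated, the adjusted measures described above could be derived from raw counts as sketched below; all function and variable names are our own, and the counts would be those reported in Table 1.

```python
# Illustrative sketch (our own notation): coverage/accuracy for "it", plus the
# two adjusted measures obtained by discounting occurrences of "it" that were
# wrongly classified as expletives.

def coverage(found: int, total: int) -> float:
    """Anaphors the system attempted to resolve / anaphors present in the text."""
    return found / total

def accuracy(correct: int, found: int) -> float:
    """Correctly resolved anaphors / anaphors found (attempted)."""
    return correct / found

def adjusted_accuracies(correct: int, found: int, total: int, misclassified: int):
    """Two adjusted measures, following the two subtractions described above:
    (1) wrongly classified "it" removed from the number found by the system;
    (2) wrongly classified "it" removed from the total number in the text."""
    adj_1 = correct / (found - misclassified)
    adj_2 = correct / (total - misclassified)
    return adj_1, adj_2
```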
<Paragraph position="5"> We manually counted all third person pronominal expressions and came up with a figure of 982, which is confirmed by only one of the three systems considered: JavaRAP. The pronouns considered are the following, lower case and upper case included: Possessives - his, its, her, hers, their, theirs; Personals - he, she, it, they, him, her, it, them (where &quot;it&quot; and &quot;her&quot; have to be disambiguated); Reflexives - himself, itself, herself, themselves. There are 16 different wordforms. As can be seen from the table below, apart from JavaRAP, none of the other systems considered comes close to 100% coverage.</Paragraph>
<Paragraph position="6"> In computing general measures for Precision and Recall we have three quantities (see also Poesio &amp; Kabadjov): * total number of anaphors present in the text; * anaphors identified by the system; * correctly resolved anaphors.</Paragraph>
<Paragraph position="7"> The formulas related to Accuracy/Success Rate, or Precision, are as follows: Accuracy1 = number of successfully resolved anaphors / number of all anaphors; Accuracy2 = number of successfully resolved anaphors / number of anaphors found (i.e. attempted to be resolved). For Recall, which should correspond to Coverage, the formula is: R = number of anaphors found / number of all anaphors to be resolved (present in the text). Finally, the formula for the F-measure is 2*P*R/(P+R), where P is chosen as Accuracy2.</Paragraph>
<Paragraph position="8"> In absolute terms, the best accuracy figures have been obtained by GETARUNS, followed by JavaRAP. It is thus still thanks to the classic Recall formula that this result stands out clearly. We also produced another table, which can however only be worked out for our system, since it uses a distributed approach. We separated pronominal expressions according to their contribution at the different levels of anaphora resolution considered: clause level, utterance level and discourse level. At clause level, only those pronouns which must be bound locally are checked, as is the case with reflexive pronouns, possessives and some cases of expletive 'it': both arguments and adjuncts may contribute the appropriate antecedent. At utterance level, when the sentence is complex or contains more than one clause, personal subject/object pronouns may also be bound (if only preferentially so).</Paragraph>
<Paragraph position="9"> Finally, those pronouns which do not find an antecedent are regarded as discourse-level pronouns.</Paragraph>
<Paragraph position="10"> We collapsed under CLAUSE all pronouns bound at clause and utterance level; DISCOURSE contains only sentence-external pronouns. Expletives have been counted in a separate column.</Paragraph>
<Paragraph position="11"> As can easily be noticed, the highest percentage of pronouns found is at clause level; this is however not the best performance of the system, which on the contrary performs better at discourse level.</Paragraph>
<Paragraph position="12"> Expletives contribute by far the highest correct result. We also correctly found 47 'there' expletives and 6 correctly classified pronominal 'there', which have however been left unbound. The system also found 48 occurrences of deictic discourse-bound &quot;this&quot; and &quot;that&quot;, which corresponds to full coverage.</Paragraph>
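Purely as an illustration of the evaluation measures defined above (Accuracy1, Accuracy2, Recall and F-measure), and not as code from any of the systems compared, the computation could be sketched as follows; the function and argument names are our own.

```python
# Illustrative sketch of the evaluation measures defined above (our notation).

def evaluation_measures(total_anaphors: int, anaphors_found: int, correctly_resolved: int):
    accuracy1 = correctly_resolved / total_anaphors   # over all anaphors in the text
    accuracy2 = correctly_resolved / anaphors_found   # over anaphors attempted
    recall = anaphors_found / total_anaphors          # coverage
    p = accuracy2                                     # P is chosen as Accuracy2
    f_measure = 2 * p * recall / (p + recall)
    return accuracy1, accuracy2, recall, f_measure

# For third person pronouns, total_anaphors would be 982 (the manual count above);
# the other two counts would come from each system's output as reported in the tables.
```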
<Paragraph position="13"> Finally, as to nominal expressions: the History List (HL) has been incremented with up to 2243 new entities. The system identified 2773 entities from the HL by matching their linguistic description. The overall number of resolution actions taken by the Discourse Level algorithm is 1861; this includes both nominal and pronominal expressions. However, since only 366 can be pronouns, the remaining roughly 1500 resolution actions have been carried out on nominal expressions present in the HL. If we compare these results to those computed by GuiTAR, which assigns semantic indices to Named Entities regardless of their anaphoric status, we can see that the whole text is made up of 12731 NEs.</Paragraph>
<Paragraph position="14"> GuiTAR finds 1585 cases of identity relations between an NE and an antecedent. However, GuiTAR always introduces new indices and creates local antecedent-referring expression chains rather than repeating the index of the chain head. In this way, it is difficult if not impossible to compute how many times the text corefers/cospecifies the same referring expression. On the contrary, in our case this can easily be computed by counting how many times the same semantic index is repeated in a &quot;resolution&quot; or &quot;identity&quot; action of the anaphora resolution algorithm. For instance, the Jury is coreferred/cospecified 12 times; Price Daniel also 12 times; and so on.</Paragraph>
</Section>
</Paper>
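By way of a closing illustration only, and under our own assumptions about how resolution actions might be recorded (the record layout and field names below are hypothetical, not GETARUNS' or GuiTAR's actual data structures), counting how often the same semantic index recurs could be sketched as follows.

```python
from collections import Counter

# Hypothetical sketch (our own data layout): each "resolution"/"identity" action
# carries the semantic index of the antecedent it resolves to; counting how often
# the same index recurs gives the number of times an entity is coreferred/cospecified.
resolution_actions = [
    {"action": "identity", "semantic_index": "id_jury"},
    {"action": "resolution", "semantic_index": "id_jury"},
    {"action": "resolution", "semantic_index": "id_price_daniel"},
    # ... one record per resolution/identity action taken by the algorithm
]

cospecification_counts = Counter(a["semantic_index"] for a in resolution_actions)
for semantic_index, count in cospecification_counts.most_common():
    print(semantic_index, count)
```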