<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1006"> <Title>Evaluation tool for rule-based anaphora resolution methods</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The evaluation workbench for anaphora resolution </SectionTitle> <Paragraph position="0"> In order to secure a &quot;fair&quot;, consistent and accurate evaluation environment, and to address the problems identified above, we have developed an evaluation workbench for anaphora resolution which allows the comparison of anaphora resolution approaches sharing common principles (e.g. similar pre-processing or resolution strategy). The workbench enables the &quot;plugging in&quot; and testing of anaphora resolution algorithms on the basis of the same pre-processing tools and data. This development is a time-consuming task, given that we have to re-implement most of the algorithms, but we expect it to yield a clearer assessment of the advantages and disadvantages of the different approaches. Developing our own evaluation environment (and even re-implementing some of the key algorithms) also avoids the practical difficulty of obtaining the code of the original programs.</Paragraph> <Paragraph position="1"> Another advantage of the evaluation workbench is that all approaches incorporated can operate either in a fully automatic mode or on human annotated corpora. We believe that this is a consistent way forward because it would not be fair to compare the success rate of an approach which operates on texts that have been perfectly analysed by humans with the success rate of an anaphora resolution system which has to process the text at different levels before activating its anaphora resolution algorithm. In fact, the evaluations of many anaphora resolution approaches have focused on the accuracy of resolution algorithms and have not taken into consideration the errors which inevitably occur in the pre-processing stage. 
In the real world, fully automatic resolution must deal with a number of hard pre-processing problems such as morphological analysis/POS tagging, named entity recognition, unknown word recognition, NP extraction, parsing, identification of pleonastic pronouns, selectional constraints, etc. Each one of these tasks introduces errors and thus contributes to a drop in the performance of the anaphora resolution system.1 As a result, the vast majority of anaphora resolution approaches rely on some kind of pre-editing of the text which is fed to the resolution algorithm, and some of the methods have only been manually simulated.</Paragraph> <Paragraph position="2"> By way of illustration, Hobbs' naive approach (1976; 1978) was not implemented in its original version. In (Dagan and Itai, 1990; Dagan and Itai, 1991; Aone and Bennett, 1995; Kennedy and Boguraev, 1996) pleonastic pronouns are removed manually2, whereas in (Mitkov, 1998b; Ferrandez et al., 1997) the outputs of the part-of-speech tagger and the NP extractor/partial parser are post-edited similarly to Lappin and Leass (1994) where the output of the Slot Unification [...] pre-editing such as the removal of sentences for which the parser failed to produce a reasonable parse, cases where the antecedent was not an NP etc.; Kennedy and Boguraev (1996) manually removed 30 occurrences of pleonastic pronouns (which could not be recognised by their pleonastic recogniser) as well as 6 occurrences of &quot;it&quot; which referred to a VP or prepositional constituent.</Paragraph> <Paragraph position="3"> [...] make use of annotated corpora and thus do not perform any pre-processing. One of the very few systems3 that is fully automatic is MARS, the latest version of Mitkov's knowledge-poor approach implemented by Evans. 
Recent work on this project has demonstrated that fully automatic anaphora resolution is more difficult than previous work has suggested (Orasan et al., 2000). (Footnote 3: Apart from MUC coreference resolution systems, which operated in a fully automatic mode.)</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Pre-processing tools </SectionTitle> <Paragraph position="0"> Parser. The current version of the evaluation workbench employs one of the high performance &quot;super-taggers&quot; for English, Conexor's FDG Parser (Tapanainen and Järvinen, 1997). This super-tagger gives morphological information and, in most cases, the syntactic roles of words. It also performs a surface syntactic parsing of the text using dependency links that show the head-modifier relations between words.</Paragraph> <Paragraph position="1"> This kind of information is used for extracting complex NPs.</Paragraph> <Paragraph position="2"> The table below shows the output of the FDG parser run over the sentence &quot;This is an input file.&quot;</Paragraph> <Paragraph position="3"> [Table: FDG parser output for the sentence &quot;This is an input file.&quot;]</Paragraph> <Paragraph position="4"> Noun phrase extractor. Although FDG does not identify the noun phrases in the text, the dependencies established between words have played an important role in building a noun phrase extractor. In the example above, the dependency relations help identify the sequence &quot;an input file&quot;. Every noun phrase is associated with features identified by FDG (number, part of speech, grammatical function), together with the linear position of the verb of which it is an argument and the number of the sentence in which it appears.</Paragraph> <Paragraph position="5"> The result of the NP extractor is an XML annotated file. 
We chose this format for several reasons: it is easily read, it allows a unified treatment of the files used for training and of those used for evaluation (which are already annotated in XML format) and it is also useful if the file submitted for analysis to FDG already contains an XML annotation; in the latter case, keeping the FDG format together with the previous XML annotation would make the input file harder to process. It also keeps the implementation of the actual workbench independent of the pre-processing tools, meaning that any shallow parser can be used instead of FDG, as long as its output is converted to an agreed XML format.</Paragraph> <Paragraph position="6"> An example of the overall output of the pre-processing tools is given below.</Paragraph> <Paragraph position="8"> Example 2: File obtained as a result of the pre-processing stage (includes previous coreference annotation) for the text &quot;This is an input file. It is used for evaluation.&quot;</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Shared resources </SectionTitle> <Paragraph position="0"> The three algorithms implemented receive as input a representation of the input file. This representation is generated by running an XML parser over the file resulting from the pre-processing phase. A list of noun phrases is explicitly kept in the file representation. Each entry in this list consists of a record containing:
+ the word form
+ the lemma of the word or of the head of the noun phrase
+ the starting position in the text
+ the ending position in the text
+ the part of speech
+ the grammatical function
+ the index of the sentence that contains the referent
+ the index of the verb whose argument this referent is
Each of the algorithms implemented for the workbench enriches this set of data with information relevant to its particular needs. 
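The shared record listed above could be modelled roughly as follows. This is only a sketch for illustration, not the workbench's own code: the class name, the field names, the sample tag values and the extras dictionary used for algorithm-specific additions are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class NounPhraseRecord:
    """One entry in the shared noun-phrase list (all names are hypothetical)."""

    word_form: str
    lemma: str                 # lemma of the word, or of the head of the NP
    start: int                 # starting position in the text
    end: int                   # ending position in the text
    pos: str                   # part of speech
    function: str              # grammatical function
    sentence_index: int        # sentence that contains the referent
    verb_index: Optional[int]  # verb whose argument this referent is (None is an assumed convention)
    # Assumption: each resolver stores its own additions here,
    # leaving the shared fields untouched.
    extras: Dict[str, Any] = field(default_factory=dict)


# Illustrative values for the NP "an input file"; the tags are made up.
np_record = NounPhraseRecord(
    word_form="an input file",
    lemma="file",
    start=2,
    end=4,
    pos="N",
    function="comp",
    sentence_index=0,
    verb_index=1,
)
np_record.extras["score"] = 0  # a resolver-specific value added on top
```

Keeping the additions in a separate mapping is one possible design; it lets several resolvers annotate the same list without changing the common fields.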
Kennedy and Boguraev (1996), for example, need additional information about whether a certain discourse referent is embedded or not, plus a pointer to the COREF class associated with the referent, while Mitkov's approach needs a score associated with each noun phrase.</Paragraph> <Paragraph position="1"> Apart from the pre-processing tools, the implementation of the algorithms included in the workbench is built upon a common programming interface, which allows some basic processing functions to be shared as well. An example is the morphological filter applied over the set of possible antecedents of an anaphor.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Usability of the workbench </SectionTitle> <Paragraph position="0"> The evaluation workbench is easy to use. The user is presented with a friendly graphical interface that helps minimise the effort involved in preparing the tests. The only information she/he has to enter is the address (machine and directory) of the FDG parser and the file annotated with coreferential links to be processed. The results can be either specific to each method or specific to the file submitted for processing, and are displayed separately for each method. These include lists of the pronouns and their identified antecedents in the context in which they appear, as well as information as to whether they were correctly resolved or not. In addition, the values obtained for the four evaluation measures (see section 3.2) and several statistical results characteristic of each method (e.g. average number of candidates for antecedents per anaphor) are computed.</Paragraph> <Paragraph position="1"> Separately, the statistical values related to the annotated file are displayed in a table. We should note that (even though this is not the intended usage of the workbench) a user can also submit unannotated files for processing. 
In this case, the algorithms display the antecedent found for each pronoun, but no automatic evaluation can be carried out due to the lack of annotated testing data.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Envisaged extensions </SectionTitle> <Paragraph position="0"> While the workbench is based on the FDG shallow parser at the moment, we plan to update the environment in such a way that two different modes will be available: one making use of a shallow parser (for approaches operating on partial analysis) and one employing a full parser (for algorithms making use of full analysis).</Paragraph> <Paragraph position="1"> Future versions of the workbench will include access to semantic information (WordNet) to accommodate approaches incorporating such types of knowledge.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Comparative evaluation of knowledge-poor anaphora resolution approaches </SectionTitle> <Paragraph position="0"> The first phase of our project included a comparison of knowledge-poor approaches which share a common pre-processing philosophy. We selected for comparative evaluation three approaches extensively cited in the literature: Kennedy and Boguraev's parser-free version of Lappin and Leass' RAP (Kennedy and Boguraev, 1996), Baldwin's pronoun resolution method (Baldwin, 1997) and Mitkov's knowledge-poor pronoun resolution approach (Mitkov, 1998b). All three of these algorithms share a similar pre-processing methodology: they do not rely on a parser to process the input and instead use POS taggers and NP extractors; nor do any of the methods make use of semantic or real-world knowledge. We re-implemented all three algorithms based on their original description and personal consultation with the authors to avoid misinterpretations. 
Since the original version of CogNiac is non-robust and resolves only anaphors that obey certain rules, for fairer, comparable results we implemented the &quot;resolve-all&quot; version as described in (Baldwin, 1997). Although for the current experiments we have only included three knowledge-poor anaphora resolvers, it has to be emphasised that the current implementation of the workbench does not restrict in any way the number or the type of the anaphora resolution methods included. Its modularity allows any such method to be added to the system, as long as the pre-processing tools necessary for that method are available.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Brief outline of the three approaches </SectionTitle> <Paragraph position="0"> All three approaches fall into the category of factor-based algorithms which typically employ a number of factors (preferences, in the case of these three approaches) after morphological agreement checks.</Paragraph> <Paragraph position="1"> Kennedy and Boguraev. Kennedy and Boguraev (1996) describe an algorithm for anaphora resolution based on Lappin and Leass' (1994) approach but without employing deep syntactic parsing. Their method has been applied to personal pronouns, reflexives and possessives. The general idea is to construct coreference equivalence classes that have an associated value based on a set of ten factors. An attempt is then made to resolve every pronoun to one of the previously introduced discourse referents by taking into account the salience value of the class to which each possible antecedent belongs.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Baldwin's CogNiac </SectionTitle> <Paragraph position="0"> CogNiac (Baldwin, 1997) is a knowledge-poor approach to anaphora resolution based on a set of high-confidence rules which are successively applied over the pronoun under consideration. 
The rules are ordered according to their importance and relevance to anaphora resolution. The processing of a pronoun stops when one rule is satisfied. The original version of the algorithm is non-robust: a pronoun is resolved only if one of the rules applies. The author also describes a robust extension of the algorithm, which employs two further weak rules that are applied if all the others fail.</Paragraph> <Paragraph position="1"> Mitkov's approach. Mitkov's approach (Mitkov, 1998b) is a robust anaphora resolution method for technical texts which is based on a set of boosting and impeding indicators applied to each candidate for antecedent. The boosting indicators assign a positive score to an NP, reflecting a positive likelihood that it is the antecedent of the current pronoun. In contrast, the impeding ones apply a negative score to an NP, reflecting a lack of confidence that it is the antecedent of the current pronoun. A score is calculated based on these indicators and the discourse referent with the highest aggregate value is selected as antecedent.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Evaluation measures used </SectionTitle> <Paragraph position="0"> The workbench incorporates an automatic scoring system operating on an XML input file where the correct antecedents for every anaphor have been marked. The annotation scheme recognised by the system at this moment is MUC, but support for the MATE annotation scheme is currently under development as well.</Paragraph> <Paragraph position="1"> We have implemented four measures for evaluation: precision and recall as defined by Aone and Bennett (1995)4 as well as success rate and critical success rate as defined in (Mitkov, 2000a). 
These four measures are calculated as follows:
+ Precision = number of correctly resolved anaphors / number of anaphors attempted to be resolved
+ Recall = number of correctly resolved anaphors / number of all anaphors identified by the system
+ Success rate = number of correctly resolved anaphors / number of all anaphors
+ Critical success rate = number of correctly resolved anaphors / number of anaphors with more than one antecedent after a morphological filter was applied
The last measure is an important criterion for evaluating the efficiency of a factor-based anaphora resolution algorithm in the &quot;critical cases&quot; where agreement constraints alone cannot point to the antecedent. (Footnote 4: This definition is slightly different from the one used in (Baldwin, 1997) and (Gaizauskas and Humphreys, 2000). For more discussion on this see (Mitkov, 2000a; Mitkov, 2000b).)</Paragraph> <Paragraph position="2"> It is logical to assume that good anaphora resolution approaches should have high critical success rates which are close to the overall success rates. In fact, in most cases it is really the critical success rate that matters: high critical success rates naturally imply high overall success rates.</Paragraph> <Paragraph position="3"> Besides the evaluation system, the workbench also incorporates a basic statistical calculator which addresses (to a certain extent) the question as to how reliable or realistic the obtained performance figures are, the latter depending on the nature of the data used for evaluation. Some evaluation data may contain anaphors which are more difficult to resolve, such as anaphors that are (slightly) ambiguous and require real-world knowledge for their resolution, or anaphors that have a high number of competing candidates, or that have their antecedents far away both in terms of sentences/clauses and in terms of number of &quot;intervening&quot; NPs etc. 
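As an aside, the four measures defined earlier in this section can be sketched in a few lines. This is a hypothetical illustration rather than the workbench's scoring code: the names Counts, ratio and evaluate, and the example tallies, are invented. One assumption should be stated loudly: the critical success rate is computed here over the critical anaphors only, following the reading of it as a success rate for the harder cases.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Counts:
    """Tallies for one resolver on one annotated file (names are hypothetical)."""

    correct: int           # correctly resolved anaphors
    attempted: int         # anaphors the resolver attempted to resolve
    identified: int        # anaphors identified by the system
    total: int             # all anaphors in the annotated file
    correct_critical: int  # assumption: correct resolutions among critical anaphors
    critical: int          # anaphors with more than one candidate after the morphological filter


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def evaluate(c: Counts) -> Dict[str, Optional[float]]:
    """Compute the four measures from the tallies."""
    return {
        "precision": ratio(c.correct, c.attempted),
        "recall": ratio(c.correct, c.identified),
        "success_rate": ratio(c.correct, c.total),
        "critical_success_rate": ratio(c.correct_critical, c.critical),
    }


# Invented tallies, for illustration only.
example = Counts(correct=6, attempted=8, identified=10, total=12,
                 correct_critical=3, critical=4)
scores = evaluate(example)
```

For a robust resolver that attempts every anaphor the system identifies, attempted equals identified, so precision and recall coincide. Figures of this kind say nothing by themselves about how hard the evaluation data is.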
Therefore, we suggest that in addition to the evaluation results, information should be provided in the evaluation data as to how difficult the anaphors are to resolve.5 To this end, we are working towards the development of suitable and practical measures for quantifying the average &quot;resolution complexity&quot; of the anaphors in a certain text. For the time being, we believe that simple statistics such as the number of anaphors with more than one candidate, and more generally, the average number of candidates per anaphor, or statistics showing the average distance between the anaphors and their antecedents, could serve as initial quantifying measures (see Table 2).</Paragraph> <Paragraph position="4"> We believe that these statistics would be more indicative of how &quot;easy&quot; or &quot;difficult&quot; the evaluation data is, and should be provided in addition to information on the numbers and types of anaphors occurring in the evaluation data (e.g. intrasentential vs. intersentential) and on coverage (e.g. personal, possessive, reflexive pronouns in the case of pronominal anaphora). (Footnote 5: To a certain extent, the critical success rate defined above addresses this issue in the evaluation of anaphora resolution algorithms by providing the success rate for the anaphors that are more difficult to resolve.)</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Evaluation results </SectionTitle> <Paragraph position="0"> We have used a corpus of technical texts manually annotated for coreference.</Paragraph> <Paragraph position="1"> We have decided on this genre because both Kennedy and Boguraev and Mitkov report results obtained on technical texts. The corpus contains 28,272 words, with 19,305 noun phrases and 422 pronouns, out of which 362 are anaphoric. 
The files that were used are: &quot;Beowulf HOW TO&quot; (referred to in Table 1 as BEO), &quot;Linux CD-Rom HOW TO&quot; (CDR), &quot;Access HOW TO&quot; (ACC), &quot;Windows Help file&quot; (WIN). The evaluation files were pre-processed to remove irrelevant information that might alter the quality of the evaluation (tables, sequences of code, tables of contents, tables of references).</Paragraph> <Paragraph position="2"> The texts were annotated for full coreferential chains using a slightly modified version of the MUC annotation scheme. All instances of identity-of-reference direct nominal anaphora were annotated. The annotation was performed by two people in order to minimize human errors in the testing data (see (Mitkov et al., 2000) for further details).</Paragraph> <Paragraph position="3"> Table 1 describes the values obtained for the success rate and precision6 of the three anaphora resolvers on the evaluation corpus. The overall success rate calculated for the 422 pronouns found in the texts was 56.9% for Mitkov's method, 49.72% for CogNiac and 61.6% for Kennedy and Boguraev's method. (Footnote 6: Note that, since the three approaches are robust, recall is equal to precision.)</Paragraph> <Paragraph position="4"> Table 2 presents statistical results on the evaluation corpus, including distribution of pronouns, referential distance, average number of candidates for antecedent per pronoun and types of anaphors.7</Paragraph> <Paragraph position="5"> As expected, the results reported in Table 1 do not match the original results published by Kennedy and Boguraev (1996), Baldwin (1997) and Mitkov (1998b), where the algorithms were tested on different data, employed different pre-processing tools, resorted to different degrees of manual intervention and thus provided no common ground for any reliable comparison.</Paragraph> <Paragraph position="6"> By contrast, the evaluation workbench enables a uniform and balanced comparison of the algorithms in that (i) the evaluation is done 
on the same data and (ii) each algorithm employs the same pre-processing tools and performs the resolution in fully automatic fashion. Our experiments also confirm the finding of Orasan, Evans and Mitkov (2000) that fully automatic resolution is more difficult than previously thought, with all three algorithms performing below their originally reported levels.</Paragraph> </Section> </Section> </Paper>