<?xml version="1.0" standalone="yes"?> <Paper uid="J05-3004"> <Title>Comparing Knowledge Sources for Nominal Anaphora Resolution</Title> <Section position="3" start_page="369" end_page="369" type="metho"> <SectionTitle> 2. The Knowledge Gap and Other Problems for Lexico-semantic Resources </SectionTitle> <Paragraph position="0"> A number of previous studies (Harabagiu 1997; Kameyama 1997; Vieira and Poesio 2000; Harabagiu, Bunescu, and Maiorano 2001; Strube, Rapp, and Mueller 2002; Modjeska 2002; Gardent, Manuelian, and Kow 2003) point to the importance of lexical and world knowledge for the resolution of full NP anaphora and the lack of such knowledge in existing ontologies (Section 2.1). In addition to this knowledge gap, we summarize other, methodological problems with the use of ontologies in anaphora resolution (Section 2.2).</Paragraph> <Section position="1" start_page="369" end_page="369" type="sub_section"> <SectionTitle> 2.1 The Knowledge Gap for Nominal Anaphora with Full Lexical Heads </SectionTitle> <Paragraph position="0"> In the following, we discuss previous studies on the automatic resolution of coreference, bridging and comparative anaphora, concentrating on work that yields insights into the use of lexical and semantic knowledge.</Paragraph> <Paragraph position="1"> 2.1.1 Coreference. The prevailing current approaches to coreference resolution are evaluated on MUC-style (Hirschman and Chinchor 1997) annotated text and treat pronominal and full NP anaphora, named-entity coreference, and non-anaphoric coreferential links that can be stipulated by appositions and copula. The performance of these approaches on definite NPs is often substantially worse than on pronouns and/or named entities (Connolly, Burger, and Day 1997; Strube, Rapp, and Mueller 2002; Ng and Cardie 2002b; Yang et al. 2003). For example, for a coreference resolution algorithm on German texts, Strube, Rapp, and Mueller (2002) report an F-measure of 33.9% for definite NPs that contrasts with 82.8% for personal pronouns.</Paragraph> <Paragraph position="2"> Several reasons for this performance difference have been established. First, whereas pronouns are mostly anaphoric in written text, definite NPs do not have to be so, inducing the problem of whether a definite NP is anaphoric in addition to determining an antecedent from among a set of potential antecedents (Fraurud 1990; Vieira and Poesio 2000).</Paragraph> <Paragraph position="3"> Second, the antecedents of definite NP anaphora can occur at considerable distance from the anaphor, whereas antecedents to pronominal anaphora tend to be relatively close (Preiss, Gasperin, and Briscoe 2004; McCoy and Strube 1999). An automatic system can therefore more easily restrict its antecedent set for pronominal anaphora.</Paragraph> <Paragraph position="4"> Third, it is in general believed that pronouns are used to refer to entities in focus, whereas entities that are not in focus are referred to by definite descriptions (Hawkins 1978; Ariel 1990; Gundel, Hedberg, and Zacharski 1993), because the head nouns of anaphoric definite NPs provide the reader with lexico-semantic knowledge. Antecedent accessibility is therefore additionally restricted via semantic compatibility and does not need to rely on notions of focus or salience to the same extent as for pronouns. Given this lexical richness of common noun anaphors, many resolution algorithms for coreference have incorporated manually controlled lexical hierarchies, such as WordNet. 
They use, for example, a relatively coarse-grained notion of semantic compatibility between a few high-level concepts in WordNet (Soon, Ng, and Lim 2001), or more detailed hyponymy and synonymy links between anaphor and antecedent head nouns (Vieira and Poesio 2000; Harabagiu, Bunescu, and Maiorano 2001; Ng and Cardie 2002b, among others).</Paragraph> </Section> </Section> <Section position="4" start_page="369" end_page="371" type="metho"> <Paragraph position="0"> 6 A two-stage process in which the first stage identifies anaphoricity of the NP and the second the antecedent for anaphoric NPs (Uryupina 2003; Ng 2004) can alleviate this problem. In this article, we focus on the second stage, namely, antecedent selection.</Paragraph> <Paragraph position="1"> However, several researchers have pointed out that the incorporated information is still insufficient. Harabagiu, Bunescu, and Maiorano (2001) (see also Kameyama 1997) report that evaluation of previous systems has shown that &quot;more than 30% of the missed coreference links are due to the lack of semantic consistency information between the anaphoric noun and its antecedent noun&quot; (page 59). Vieira and Poesio (2000) report results on anaphoric definite NPs in the WSJ that stand in a synonymy or hyponymy relation to their antecedents (as in Example (1)). Using WordNet links to retrieve the appropriate knowledge proved insufficient, as only 35.0% of the synonymy relations and 56.0% of the hyponymy relations needed were encoded in WordNet as direct or inherited links.</Paragraph> <Paragraph position="2"> The semantic knowledge used might also not necessarily improve on string matching: Soon, Ng, and Lim's (2001) final, automatically derived decision tree does not incorporate their semantic-compatibility feature and instead relies heavily on string matching and aliasing, thereby leaving open how much information in a lexical hierarchy can improve over string matching.</Paragraph> <Paragraph position="3"> In this article, we concentrate on the last of these three problems (insufficient lexical knowledge). We investigate whether the knowledge gap for definite NP coreference can be overcome by using corpora as knowledge sources, as well as whether the incorporation of lexical knowledge sources improves over simple head noun matching.</Paragraph> <Paragraph position="4"> 2.1.2 Other-Anaphora. Previous work on comparative anaphora (Modjeska 2002) shows that lexico-semantic knowledge plays a larger role than grammatical salience for other-anaphora. In this article, we show that the semantic knowledge provided via synonymy and hyponymy links in WordNet is insufficient for the resolution of other-anaphora, although the head of the antecedent is normally a synonym or hyponym of the head of the anaphor in other-anaphora (Section 4.4).</Paragraph> <Paragraph position="5"> 2.1.3 Bridging. Vieira and Poesio (2000) report that 62.0% of the meronymy relations (see Example (2)) needed for bridging resolution in their corpus were not encoded in WordNet. Gardent, Manuelian, and Kow (2003) identified bridging descriptions in a French corpus, of which 187 (52%) exploited meronymic relations. Almost 80% of these were not found in WordNet. Hahn, Strube, and Markert (1996) report experiments on 109 bridging cases from German information technology reports, using a hand-crafted, domain-specific knowledge base of 449 concepts and 334 relations. They state that 42 (38.5%) links between anaphor and antecedents were missing in their knowledge base, a high proportion given the domain-specific task.
In this article, we will not address bridging, although we will discuss the extension of our work to bridging in Section 6.</Paragraph> <Section position="1" start_page="370" end_page="371" type="sub_section"> <SectionTitle> 2.2 Methodological Problems for the Use of Ontologies in Anaphora Resolution </SectionTitle> <Paragraph position="0"> Over the years, several major problems have been identified with the use of ontologies for anaphora resolution. In the following we provide a summary of the different issues raised, using the examples in the Introduction.</Paragraph> <Paragraph position="1"> 7 Whenever we refer to &quot;hyponymy/meronymy (relations/links)&quot; in WordNet, we include both direct and inherited links.</Paragraph> <Paragraph position="2"> 8 From this point on, we will often use the terms anaphor and antecedent instead of head of anaphor and head of antecedent if the context is non-ambiguous.</Paragraph> <Paragraph position="3"> 2.2.1 Problem 1: Knowledge Gap. As discussed above, even in large ontologies the lack of knowledge can be severe, and this problem increases for non-hyponymy relations. None of the examples in Section 1 are covered by synonymy, hyponymy, or meronymy links in WordNet; for example, hoods are not encoded as parts of jackets, and homes are not encoded as a hyponym of facilities. In addition, building, extending, and maintaining ontologies by hand is expensive.</Paragraph> <Paragraph position="4"> 2.2.2 Problem 2: Context-Dependent Relations. Whereas the knowledge gap might be reduced as (semi)automatic efforts to enrich ontologies become available (Hearst 1992; Berland and Charniak 1999; Poesio et al. 2002), the second problem is intrinsic to fixed context-independent ontologies: How much and which knowledge should they include? Thus, Hearst (1992) raises the issue of whether underspecified, context- or point-of-view-dependent hyponymy relations (like the context-dependent link between costs and repercussions in Example (3)) should be included in a fixed ontology, in addition to universally true hyponymy relations. Some other hyponymy relations that we encountered in our studies whose inclusion in ontologies is debatable are age:(risk) factor, coffee:export, pilots:union, country:member.</Paragraph> <Paragraph position="5"> 2.2.3 Problem 3: Information Encoding. Knowledge might be encoded in many different ways in a lexical hierarchy, and this can pose a problem for anaphora resolution (Humphreys et al. 1997; Poesio, Vieira, and Teufel 1997). For example, although magazine and periodical are not linked in WordNet via synonymy/hyponymy, the gloss records magazine as a periodic publication. Thus, the desired link might be derived through the analysis of the gloss together with derivation of periodical from periodic. However, such extensive mining of the ontology (as performed, e.g., by Harabagiu, Bunescu, and Maiorano [2001]) can be costly. In addition, different information sources must be weighed (e.g., is a hyponymy link preferred over a gloss inclusion?) and combined (should hyponyms/hyperonyms/sisters of gloss expressions be considered recursively?). Extensive combinations also increase the risk of false positives.</Paragraph> <Paragraph position="6"> 2.2.4 Problem 4: Sense Proliferation. Using all senses of anaphor and potential antecedents in the search for relations might yield a link between an incorrect antecedent candidate and the anaphor due to an inappropriate sense selection. 
On the other hand, considering only the most frequent sense for anaphor and antecedent (as is done in Soon, Ng, and Lim [2001]) might lead to wrong antecedent assignment if a minority sense is intended in the text. So, for example, the most frequent sense of hood in WordNet is criminal, whereas the sense used in Example (2) is headdress. The alternatives are either weighing senses according to different domains or a more costly sense disambiguation procedure before anaphora resolution (Preiss 2002).</Paragraph> </Section> </Section> <Section position="5" start_page="371" end_page="373" type="metho"> <SectionTitle> 3. The Alternative: Corpus-Based Knowledge Extraction </SectionTitle> <Paragraph position="0"> There have been a considerable number of efforts to extract lexical relations from corpora in order to build new knowledge sources and enrich existing ones without time-consuming hand-modeling. This includes the extraction of hyponymy and synonymy relations (Hearst 1992; Caraballo 1999, among others) as well as meronymy (Berland and Charniak 1999; Meyer 2001).</Paragraph> <Paragraph position="1"> 9 Even without extensive mining, this risk can be high: Vieira and Poesio (2000) report a high number of false positives for one of their data sets, although they use only WordNet-encoded links.</Paragraph> <Paragraph position="2"> One approach to the extraction of instances of a particular lexical relation is the use of patterns that express lexical relations structurally explicitly in a corpus (Hearst 1992; Berland and Charniak 1999; Caraballo 1999; Meyer 2001), and this is the approach we focus on here. As an example, the list-context pattern NP_X {,} and other NP_Y indicates that X is a hyponym of Y (Hearst 1992), and it can therefore be postulated that two noun phrases that occur in such a pattern in a corpus should be linked in an ontology via a hyponymy link. Applications of the extracted relations to anaphora resolution are less frequent. However, Poesio et al. (2002) and Meyer and Dale (2002) have used patterns for the corpus-based acquisition of meronymy relations; these patterns are subsequently exploited for bridging resolution.</Paragraph> <Paragraph position="3"> Although automatic acquisition can help bridge the knowledge gap (see Problem 1 in Section 2.2.1), the incorporation of the acquired knowledge into a fixed ontology yields other problems. Most notably, it has to be decided which knowledge should be included in ontologies, because pattern-based acquisition will also find spurious, subjective, and context-dependent knowledge (see Problem 2 in Section 2.2.2). There is also the problem of pattern ambiguity, since patterns do not necessarily have a one-to-one correspondence to lexical relations (Meyer 2001). Following our work in Markert, Nissim, and Modjeska (2003), we argue that for the task of antecedent ranking, these problems can be circumvented by not constructing a fixed ontology at all. Instead, we use the pattern-based approach to find lexical relationships holding between anaphor and antecedent in corpora on the fly. For instance, in Example (3), we do not need to know whether costs are always repercussions (and should therefore be linked via hyponymy in an ontology) but only that they are more likely to be viewed as repercussions than the other antecedent candidates. We therefore adapt the pattern-based approach in the following way for antecedent selection.</Paragraph> <Paragraph position="4"> Step 1: Relation Identification.
We determine which lexical relation usually holds between anaphor and antecedent head nouns for a particular anaphoric phenomenon. For example, in other-anaphora, a hyponymy/similarity relation between anaphor and antecedent is exploited (homes are facilities) or stipulated by the context (costs are viewed as repercussions).</Paragraph> <Paragraph position="5"> Step 2: Pattern Selection. We select patterns that express this lexical relation structurally explicitly. For example, the pattern NP_X {,} and other NP_Y (see above).</Paragraph> <Paragraph position="6"> Step 3: Pattern Instantiation. If the lexical relation between anaphor and antecedent head nouns is strong, then it is likely that the anaphor and antecedent also frequently co-occur in the selected explicit patterns. We extract all potential antecedents for each anaphor and instantiate the explicit patterns for all anaphor/antecedent pairs. In Example (4) the pattern can be instantiated with ordinances and other facilities, Moon Township and other facilities, homes and other facilities, handicapped and other facilities, and miles and other facilities.</Paragraph> <Paragraph position="7"> 10 There is also a long history in the extraction of other lexical knowledge, which is also potentially useful for anaphora resolution, for example, of selectional restrictions/preferences. In this article we focus on the lexical relations that can hold between antecedent and anaphor head nouns.</Paragraph> <Paragraph position="8"> Step 4: Antecedent Assignment. The instantiation of a pattern can be searched in any corpus to determine its frequency. We follow the rationale that the most frequent of these instantiated patterns determines the most likely antecedent. Therefore, should the head noun of an antecedent candidate and the anaphor co-occur in a pattern although they do not stand in the lexical relationship considered (because of pattern ambiguity, noise in the corpus, or spurious occurrences), this need not prove a problem as long as the correct antecedent candidate co-occurs more frequently with the anaphor.</Paragraph> <Paragraph position="9"> As the patterns can be elaborate, most manually controlled and linguistically processed corpora are too small to determine the pattern frequencies reliably. Therefore, the size of the corpora used in some previous approaches leads to data sparseness (Berland and Charniak 1999), and the extraction procedure can therefore require extensive smoothing. Thus, as a further extension, we suggest using the largest corpus available, the Web, in the above procedure. The instantiation for the correct antecedent, homes and other facilities in Example (4), for instance, does not occur at all in the BNC but yields over 1,500 hits on the Web. The competing instantiations (listed in Step 3) yield 0 hits in the BNC and fewer than 20 hits on the Web.</Paragraph> <Paragraph position="10"> In the remainder of this article, we present two comparative case studies on coreference and other-anaphora that evaluate the ontology- and corpus-based approaches in general and our extensions in particular.</Paragraph> </Section> <Section position="6" start_page="373" end_page="396" type="metho"> <SectionTitle> 4.
Case Study I: Other-Anaphora </SectionTitle> <Paragraph position="0"> We now describe our first case study for antecedent selection in other-anaphora.</Paragraph> <Section position="1" start_page="373" end_page="374" type="sub_section"> <SectionTitle> 4.1 Corpus Description and Annotation </SectionTitle> <Paragraph position="0"> We use Modjeska's (2003) annotated corpus of other-anaphors from the WSJ. All examples in this section are from this corpus. Modjeska restricts the notion of other-anaphora to anaphoric NPs with full lexical heads modified by other or another (Examples (3)-(4)), thereby excluding idiomatic non-referential uses (e.g., on the other hand), reciprocals such as each other, ellipsis, and one-anaphora. The excluded cases either are non-anaphoric or do not have a full lexical head and would therefore require a mostly non-lexical approach to resolution. Modjeska's corpus also excludes other-anaphors with structurally available antecedents: In list contexts such as Example (5), the antecedent is normally given as the left conjunct of the list: (5) [. . .] AZT can relieve dementia and other symptoms in children [. . .] A similar case is the construction Xs other than Ys. For a computational treatment of other-NPs with structural antecedents, see Bierner (2001).</Paragraph> <Paragraph position="1"> The original corpus collected and annotated by Modjeska (2003) contains 500 instances of other-anaphors with NP antecedents in a five-sentence window. In this study we use the 408 (81.6%) other-anaphors in the corpus that have NP antecedents within a two-sentence window (the current or previous sentence).</Paragraph> <Paragraph position="2"> An antecedent candidate is manually annotated as correct if it is the latest mention of the entity to which the anaphor provides the set complement. The tag lenient was used to annotate previous mentions of the same entity. In Example (6), all other bidders refers to all bidders excluding United Illuminating Co., whose latest mention is it. In this article, lenient antecedents are underlined. All other potential antecedents (e.g., offer in Example (6)) are called distractors.</Paragraph> <Paragraph position="3"> (6) United Illuminating Co. raised its proposed offer to one it valued at $2.29 billion from $2.19 billion, apparently topping all other bidders.</Paragraph> <Paragraph position="4"> The antecedent can be a set of separately mentioned entities, like May and July in Example (7). For such split antecedents (Modjeska 2003), the latest mention of each set member is annotated as correct, so that there can be more than one correct antecedent to an anaphor.</Paragraph> <Paragraph position="5"> (7) The May contract, which also is without restraints, ended with a gain of 0.45 cent to 14.26 cents. The July delivery rose its daily permissible limit of 0.50 cent a pound to 14.00 cent, while other contract months showed near-limit advances.</Paragraph> </Section> <Section position="2" start_page="374" end_page="376" type="sub_section"> <SectionTitle> 4.2 Antecedent Extraction and Preprocessing </SectionTitle> <Paragraph position="0"> For each anaphor, all previously occurring NPs in the two-sentence window were automatically extracted exploiting the WSJ parse trees.
NPs containing a possessive NP modifier (e.g., Spain's economy) were split into a possessor phrase (Spain) and a possessed entity (Spain's economy).</Paragraph> <Paragraph position="1"> Modjeska (2003) identifies several syntactic positions that cannot serve as antecedents of other-anaphors. We automatically exclude only NPs preceding an appositive other-anaphor from the candidate antecedent set. In &quot;Mary Elizabeth Ariail, another social-studies teacher,&quot; the NP Mary Elizabeth Ariail cannot be the antecedent of another social-studies teacher, as the two phrases are coreferential and cannot provide a set complement to each other.</Paragraph> <Paragraph position="2"> 13 We concentrate on this majority of cases to focus on the comparison of different sources of lexical knowledge without involving discourse segmentation or focus tracking. In Section 5 we expand the window size to allow equally high coverage for definite NP coreference.</Paragraph> <Paragraph position="3"> 14 The occurrence of split antecedents also motivated the distinction between correct and lenient antecedents in the annotation. Anaphors with split antecedents have several antecedent candidates annotated as correct. All other anaphors have only one antecedent candidate annotated as correct, with previous mentions of the same entity marked as lenient.</Paragraph> <Paragraph position="4"> 15 We thank Natalia Modjeska for the extraction and for making the resulting sets of candidate antecedents available to us.</Paragraph> <Paragraph position="5"> The resulting set of potential NP antecedents for an anaphor ana (with a unique identifier anaid) is called A_anaid. The final number of extracted antecedents for the whole data set is 4,272, with an average of 10.5 antecedent candidates per anaphor. After extraction, all modification was eliminated, and only the rightmost noun of compounds was retained, as modification results in data sparseness for the corpus-based methods, and compounds are often not recorded in WordNet.</Paragraph> <Paragraph position="6"> For the same reasons we automatically resolved named entities (NEs). They were classified into the ENAMEX MUC-7 categories (Chinchor 1997) PERSON, ORGANIZATION, and LOCATION, using the software ANNIE (GATE 2; http://gate.ac.uk). We then automatically obtained more fine-grained distinctions for the NE categories LOCATION and ORGANIZATION, whenever possible. We classified LOCATIONs into COUNTRY, (US) STATE, CITY, RIVER, LAKE, and OCEAN in the following way. First, small gazetteers for these subcategories were extracted from the Web. Second, if an entity marked as LOCATION by ANNIE occurred in exactly one of these gazetteers (e.g., Texas in the (US) STATE gazetteer), it received the corresponding specific label; if it occurred in none or in several of the gazetteers (e.g., Mississippi occurred in both the state and the river gazetteer), then the label was left at the LOCATION level. We further classified an ORGANIZATION entity by using its internal makeup as follows.</Paragraph> <Paragraph position="7"> We extracted all single-word hyponyms of the noun organization from WordNet and used the members of this set, OrgSet, as the target categories for the fine-grained distinctions.
If an entity was classified by ANNIE as ORGANIZATION and it had an element <ORG> of OrgSet as its final lemmatized word (e.g., Deutsche Bank) or contained the pattern <ORG> of (for example, Bank of America), it was subclassified as <ORG> (here, BANK). In cases of ambiguity, again, no subclassification was carried out. No further distinctions were developed for the category PERSON. We used regular expression matching to classify numeric and time entities into DAY, MONTH, and YEAR, as well as DOLLAR or simply NUMBER. This subclassification of the standard categories provides us with additional lexical information for antecedent selection. Thus, in Example (8), a finer-grained classification of South Carolina into STATE provides more useful information than resolving both South Carolina and Greenville County as LOCATION only: (8) Use of Scoring High is widespread in South Carolina and common in Greenville County. . . . Experts say there isn't another state in the country where . . .</Paragraph> <Paragraph position="11"> Finally, all antecedent candidates and anaphors were lemmatized. The procedure of extraction and preprocessing results in the following antecedent sets and anaphors for Examples (3) and (4): A_3 = {..., addition, cost, result, exposure, member, measure} and ana = repercussion, and A_4 = {..., ordinance, Moon Township [= location], home, handicapped, mile} and ana = facility.</Paragraph> <Paragraph position="12"> Table 1 shows the distribution of antecedent NP types in the other-anaphora data set. NE resolution is clearly important as 205 of 468 (43.8%) of correct antecedents are NEs.</Paragraph> <Paragraph position="13"> 16 In this article the anaphor ID corresponds to the example numbers.</Paragraph> <Paragraph position="14"> 17 Note that there are more correct antecedents than anaphors because the data include split antecedents.</Paragraph> </Section> <Section position="3" start_page="376" end_page="379" type="sub_section"> <SectionTitle> 4.3 Evaluation Measures and Baselines </SectionTitle> <Paragraph position="0"> For each anaphor, each algorithm selects at most one antecedent as the correct one. If this antecedent provides the appropriate set complement to the anaphor (i.e., is marked in the gold standard as correct or lenient), the assignment is evaluated as correct. Otherwise, it is evaluated as wrong. We use the following evaluation measures: Precision is the number of correct assignments divided by the number of assignments, recall is the number of correct assignments divided by the number of anaphors, and F-measure is based on equal weighting of precision and recall. In addition, we also give the coverage of each algorithm as the number of assignments divided by the number of anaphors. This last measure is included to indicate how often the algorithm has any knowledge to go on, whether correct or false. For algorithms in which the coverage is 100%, precision, recall, and F-measure all coincide.</Paragraph> <Paragraph position="1"> We developed two simple rule-based baseline algorithms. The first, a recency-based baseline (baselineREC), always selects the antecedent candidate closest to the anaphor. The second (baselineSTR) takes into account that the lemmatized head of an other-anaphor is sometimes the same as that of its antecedent, as in the pilot's claim . . . other bankruptcy claims.
For each anaphor, baselineSTR string-compares its last (lemmatized) word with the last (lemmatized) word of each of its potential antecedents. If the strings match, the corresponding antecedent is chosen as the correct one. If several antecedents produce a match, the baseline chooses the most recent one among them. If no antecedent produces a match, no antecedent is assigned.</Paragraph> <Paragraph position="4"> We tested two variations of this baseline. The first (baselineSTR_O) uses only the original antecedents for string matching, disregarding named-entity resolution. If string-comparison returns no match, a back-off version of this baseline chooses the antecedent closest to the anaphor among all antecedent candidates, thereby yielding 100% coverage. The second variation (baselineSTR_R) uses the replacements for named entities for string matching; again, a back-off version adds a recency back-off. This baseline performs slightly better, as now cases such as that in Example (8) (South Carolina . . . another state, in which South Carolina is resolved to STATE) can also be resolved. The results of all baselines are summarized in Table 2. Results of the 100% coverage back-off algorithms are indicated by a marked Precision entry in all tables. The sets of anaphors covered by the string-matching baselines baselineSTR_O and baselineSTR_R are called StrSet_O and StrSet_R, respectively. These sets do not include the cases assigned by the recency back-off in the back-off versions.</Paragraph> <Paragraph position="5"> For our WordNet and corpus-based algorithms we additionally deleted pronouns from the antecedent sets, since they are lexically not very informative and are also not encoded in WordNet. This removes 49 (10.5%) of the 468 correct antecedents (see Table 1); however, we can still resolve some of the anaphors with pronoun antecedents if they also have a lenient non-pronominal antecedent, as in Example (6).</Paragraph> <Paragraph position="6"> After pronoun deletion, the total number of antecedents in our data set is 3,875 for 408 anaphors, of which 419 are correct antecedents, 160 are lenient, and 3,296 are distractors.</Paragraph> <Paragraph position="7"> 4.4 WordNet as a Knowledge Source for Other-Anaphora Resolution 4.4.1 Descriptive Statistics. As most antecedents are hyponyms or synonyms of their anaphors in other-anaphora, for each anaphor ana we look up which elements of its antecedent set A_anaid are hyponyms/synonyms of ana in WordNet, considering all senses of anaphor and candidate antecedent. In Example (4), for example, we look up whether ordinance, Moon Township, home, handicapped, and mile are hyponyms or synonyms of facility in WordNet. Similarly, in Example (9), we look up whether Will Quinlan [= PERSON], gene, and risk are hyponyms/synonyms of child.</Paragraph> <Paragraph position="8"> (9) Will Quinlan had not inherited a damaged retinoblastoma suppressor gene and, therefore, faced no more risk than other children ...</Paragraph> <Paragraph position="9"> As proper nouns (e.g., Will Quinlan) are often not included in WordNet, we also look up whether the NE category of an NE antecedent is a hyponym/synonym of the anaphor (e.g., whether person is a synonym/hyponym of child) and vice versa (e.g., whether child is a synonym/hyponym of person). This last inverted look-up is necessary, as the NE category of the antecedent is often too general to preserve the normal hyponymy relationship to the anaphor.
Indeed, in Example (9), it is the inverted look-up that captures the correct hyponymy relation between person and child. If the single look-up for common nouns or any of the three look-ups for proper nouns is successful, we say that a hyp/syn relation between candidate antecedent and anaphor holds in WordNet. Note that each noun in WordNet stands in a hyp/syn relation to itself. Table 3 summarizes how many correct/lenient antecedents and distractors stand in a hyp/syn relation to their anaphor in WordNet.</Paragraph> <Paragraph position="17"> Correct/lenient antecedents stand in a hyp/syn relation to their anaphor significantly more often than distractors do (p < 0.001, t-test). The use of WordNet hyponymy/synonymy relations to distinguish between correct/lenient antecedents and distractors is therefore plausible. However, Table 3 also shows two limitations of relying on WordNet in resolution algorithms. First, 57% of correct and lenient antecedents are not linked via a hyp/syn relation to their anaphor in WordNet. This will affect coverage and recall (see also Section 2.2.1). Examples from our data set that are not covered are home:facility, cost:repercussion, age:(risk) factor, pension:benefit, coffee:export, and pilot(s):union, including both missing universal hyponymy links and context-stipulated ones. Second, the raw frequency (296) of distractors that stand in a hyp/syn relation to their anaphor is higher than the combined raw frequency for correct/lenient antecedents (248) that do so, which can affect precision. This is due to both sense proliferation (Section 2.2.4) and anaphors that require more than just lexical knowledge about antecedent and anaphor heads to select a correct antecedent over a distractor. In Example (10), the distractor product stands in a hyp/syn relationship to the anaphor commodity and--disregarding other factors--is a good antecedent candidate.</Paragraph> <Paragraph position="18"> (10) . . . the move is designed to more accurately reflect the value of products and to put steel on a more equal footing with other commodities.</Paragraph> <Paragraph position="19"> 4.4.2 The WordNet-Based Algorithm. The WordNet-based algorithm resolves each anaphor ana to a hyponym or synonym in A_anaid, if possible. If several antecedent candidates are hyponyms or synonyms of ana, it uses a tiebreaker based on string match and recency. When no candidate antecedent is a hyponym or synonym of ana, string match and recency can be used as a possible back-off.</Paragraph> <Paragraph position="20"> String comparison for tiebreaker and back-off can again use the original or the replaced antecedents, yielding two versions, algoWN_O (original antecedents) and algoWN_R (replaced antecedents).</Paragraph> <Paragraph position="21"> Anaphors that do have a string-matching antecedent candidate will already be covered by the WordNet look-up prior to back-off in almost all cases: Back-off string matching will take effect only if the anaphor/antecedent head noun is not in WordNet at all. Therefore, the described back-off will most of the time just amount to a recency back-off.</Paragraph> <Paragraph position="22"> 22 The algorithm algoWN_R follows the same procedure apart from the variation in string matching.</Paragraph> <Paragraph position="23"> The exact procedure for the version algoWN_O given an anaphor ana is as follows: (i) for each antecedent a in A_anaid, look up whether a hyp/syn relation between a and ana holds in WordNet; if this is the case, push a into a set WNSet_anaid;
(ii) if WNSet_anaid contains exactly one antecedent, select it as the antecedent for ana and stop; (iii) if WNSet_anaid contains several antecedents, check whether any of them string-matches ana: if exactly one matches ana, select this one and stop; if several match ana, select the closest to ana within these matching antecedents and stop; if none match, select the closest to ana within WNSet_anaid and stop; (iv) if WNSet_anaid is empty, make no assignment and stop.</Paragraph> <Paragraph position="24"> The back-off version replaces (iv) with string match and recency as a back-off: (iv') if no antecedent can be assigned, use the string-matching and recency back-off described for the baselines to assign an antecedent to ana and stop.</Paragraph> <Paragraph position="25"> Both versions, algoWN_O and algoWN_R, achieved the same results, namely, a coverage of 65.2%, precision of 56.8%, and recall of 37.0%, yielding an F-measure of 44.8%. The low coverage and recall confirm our predictions in Section 4.4.1. Using the back-off versions achieves a coverage of 100% and a precision/recall/F-measure of 44.4%.</Paragraph> </Section> <Section position="4" start_page="379" end_page="384" type="sub_section"> <SectionTitle> 4.5 Corpora as Knowledge Sources for Other-Anaphora Resolution </SectionTitle> <Paragraph position="0"> In Section 3 we suggested the use of shallow lexico-semantic patterns for obtaining anaphor-antecedent relations from corpora. In our first experiment we use the Web, which with its approximately 8,058M pages is the largest corpus available to the NLP community. In our second experiment we use the same technique on the BNC, a smaller (100 million words) but virtually noise-free and balanced corpus of contemporary English.</Paragraph> <Paragraph position="1"> Web. The list-context pattern NP_X {,} and other NP_Y expresses a hyponymy/synonymy relationship, with X being hyponyms/synonyms of Y (see also Example (5) and [Hearst 1992]). This is only one of the possible structures that express hyponymy/synonymy. Others involve such, including, and especially (Hearst 1992) or appositions and coordination. We derive our patterns from the list-context because it corresponds relatively unambiguously to hyponymy/synonymy relations (in contrast to coordination, which often links sister concepts instead of a hyponym and its hyperonym, as in tigers and lions, or even completely unrelated concepts). In addition, it is quite frequent (for example, and other occurs more frequently on the Web than such as and other than). Future work has to explore which patterns have the highest precision and/or recall and how different patterns can be combined effectively without increasing the risk of false positives (see also Section 2.2.3).</Paragraph> <Paragraph position="2"> The pattern is instantiated with the head nouns of each anaphor/antecedent candidate pair (see the instantiations in Table 4). The three instantiations for NEs are parallel to the three hyp/syn relation look-ups in the WordNet experiment in Section 4.4.1. We submit these instantiations as queries to the Google search engine, making use of the Google API technology.</Paragraph> <Paragraph position="3"> BNC. For BNC patterns and instantiations, we exploit the BNC's part-of-speech tagging. On the one hand, we restrict the pattern variables to nouns to avoid noise, and on the other hand, we allow occurrence of modification to improve coverage. In the patterns, X and Y are variables, and the words and other are constants.</Paragraph> <Paragraph position="4"> 25 All common-noun instantiations are marked by a superscript c and all proper-noun instantiations by a superscript p.</Paragraph> <Paragraph position="5"> An instantiation for (B1), for example, also matches &quot;homes and the other four facilities.&quot; Otherwise the instantiations are produced parallel to the Web (see Table 4). We search the instantiations in the BNC using the IMS Corpus Query Workbench (Christ 1995).</Paragraph> <Paragraph position="6"> Each instantiation, when searched in a corpus, returns a frequency; for the Web queries, the maximum over these frequencies is the score associated with each antecedent (given an anaphor ana), which we will also simply refer to as the antecedent's Web score.
For the BNC, we call the corresponding maximum score BM_a and refer to it as the antecedent's BNC score. This simple maximum score is biased toward antecedent candidates whose head nouns occur more frequently overall. In a previous experiment we used mutual information to normalize Web scores (Markert, Nissim, and Modjeska 2003). However, the results achieved with normalized and non-normalized scores showed no significant difference. Other normalization methods might yield significant improvements over simple maximum scoring and can be explored in future work.</Paragraph> <Paragraph position="7"> Table 5 summarizes the Web and BNC score distributions for correct/lenient antecedents and distractors, including the minimum and maximum score, mean score and standard deviation, median, and number of zero scores, scores of one, and scores greater than one.</Paragraph> <Paragraph position="8"> Web scores resulting from simple pattern-based search produce on average significantly higher scores for correct/lenient antecedents (mean: 2,416.68/807.63; median: 68/68.5) than for distractors (mean: 290.97; median: 1). Moreover, the method produces significantly fewer zero scores for correct/lenient antecedents (19.6%/22.5%) than for distractors (42.3%).</Paragraph> <Paragraph position="9"> Therefore the pattern-based Web method is a good candidate for distinguishing correct/lenient antecedents and distractors in anaphora resolution. In addition, the median for correct/lenient antecedents is relatively high (68/68.5), which ensures a relatively large amount of data upon which to base decisions. Only 19.6% of correct antecedents have scores of zero, which indicates that the method might have high coverage (compared to the missing 57% of hyp/syn relations for correct antecedents in WordNet; Section 4.4).</Paragraph> <Paragraph position="10"> 26 The star operator indicates zero or more occurrences of a variable. The variable D can be instantiated by any determiner; the variable A can be instantiated by any adjective or cardinal number.</Paragraph> <Paragraph position="11"> 27 Difference in means was calculated via a t-test; for medians we used chi-square, and for zero counts a t-test for proportions. The significance level used was 5%.</Paragraph> <Paragraph position="12"> Although the means of the BNC score distributions of correct/lenient antecedents are significantly higher than that of the distractors, this is due to a few outliers; more interestingly, the median for the BNC score distributions is zero for all antecedent groups. This will affect precision for a BNC-based algorithm because of the small amount of data decisions are based on. In addition, although the number of zero scores for correct/lenient antecedents (85.9%/86.9%) is significantly lower than for distractors (97.5%), the number of zero scores is well above 80% for all antecedent groups. Thus, the coverage and recall of a BNC-based algorithm will be very low. Although the BNC scores are in general much lower than Web scores and although the Web scores distinguish better between correct/lenient antecedents and distractors, we observe that Web and BNC scores still correlate significantly, with correlation coefficients between 0.20 and 0.35, depending on antecedent group.</Paragraph> <Paragraph position="13"> To summarize, the pattern-based method yields correlated results on different corpora, but it is expected to depend on large corpora to be really successful.
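To make the scoring step concrete, the following sketch shows how an antecedent candidate's score can be computed as the maximum frequency over its pattern instantiations (Steps 3 and 4 of Section 3). It is a minimal illustration only: the function corpus_count is a placeholder for whatever frequency look-up is available (a Web hit count or a BNC concordancer query), and all names are ours rather than part of the original implementation.

```python
# Minimal sketch of the pattern-based scoring; `corpus_count` is an assumed
# placeholder for a frequency look-up (e.g., a search-engine hit count or a
# tagged-corpus query), not an API used in the original work.

def instantiations(antecedent_head, anaphor_head):
    """Instantiate the list-context pattern 'X and other Ys' for one candidate pair.
    The real queries also combine singular and plural noun forms
    (e.g., 'question OR questions and other issues'); that is omitted here."""
    return [f"{antecedent_head} and other {anaphor_head}"]

def score(antecedent_head, anaphor_head, corpus_count):
    """Score of a candidate = maximum frequency over all its instantiations."""
    return max(corpus_count(query)
               for query in instantiations(antecedent_head, anaphor_head))

def select_antecedent(candidate_heads, anaphor_head, corpus_count):
    """Choose the highest-scoring candidate; make no assignment if all scores are zero."""
    scored = [(score(head, anaphor_head, corpus_count), head)
              for head in candidate_heads]
    best_score, best_head = max(scored)  # ties broken arbitrarily here; the
                                         # algorithms below use string match and recency
    return best_head if best_score > 0 else None

# Example (4): select_antecedent(["ordinance", "home", "mile"], "facility", corpus_count)
```

On the Web, corpus_count would return a hit count for the quoted query; on the BNC, it would return a frequency obtained from a corpus query tool.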
The Web-based algorithm resolves each anaphor ana to the antecedent candidate in A_anaid with the highest Web score above zero.</Paragraph> <Paragraph position="14"> If several potential antecedents achieve the same Web score, it uses a tiebreaker based on string match and recency. If no antecedent candidate achieves a Web score above zero, string match and recency can be used as a back-off. String comparison for tiebreaker and back-off can again use the original or the replaced antecedents, yielding two versions, algoWeb_O (original antecedents) and algoWeb_R (replaced antecedents).</Paragraph> <Paragraph position="15"> The exact procedure for the version algoWeb_O for an anaphor ana is as follows: (i) compute the Web score for each antecedent in A_anaid; (ii) if exactly one antecedent achieves the highest Web score above zero, select it as the antecedent for ana and stop; (iii) if several antecedents achieve the highest Web score above zero, check whether any of them string-matches ana: if exactly one matches ana, select this one and stop; if several match ana, select the closest to ana within these matching antecedents and stop; if none match, select the closest to ana within these highest-scoring antecedents and stop; (iv) if no antecedent achieves a Web score above zero, make no assignment and stop.</Paragraph> <Paragraph position="16"> The back-off version replaces (iv) with string match and recency as a back-off: (iv') if no antecedent can be assigned, use the string-matching and recency back-off to assign an antecedent to ana and stop (parallel to the back-off in the WordNet algorithms).</Paragraph> <Paragraph position="17"> Note that these algorithms can prefer an antecedent candidate that does not string-match the anaphor over one that does. This happens when the Web score of an antecedent candidate that does not match the anaphor is higher than the Web scores of matching antecedent candidates. In particular, there is no guarantee that matching antecedent candidates are included among the highest-scoring candidates at all. Given the high precision of baselineSTR, we might want to exclude the possibility that the Web algorithms overrule string matching. Instead we can use string matching prior to Web scoring, use the Web scores only when there are no matching antecedent candidates, and use recency as the final back-off. This variation then achieves the same results on the sets StrSet_O and StrSet_R as the WordNet algorithms and the string-matching baselines. In combination with the possibility of using original or replaced antecedents for string matching, this yields four algorithm variations overall (see Table 6). The results (see Table 7) do not show any significant differences according to the variation explored.</Paragraph> <Paragraph position="18"> The BNC-based algorithms follow the same procedures as the Web-based algorithms, using the BNC scores instead of Web scores. The results (see Table 8) are disappointing because of data sparseness (see above). No variation yields considerable improvement, as in most cases the BNC-based variations just apply a string-matching baseline, either as a back-off or prior to checking BNC scores, depending on the variation used.</Paragraph> </Section> <Section position="5" start_page="384" end_page="388" type="sub_section"> <SectionTitle> 4.6 Discussion and Error Analysis </SectionTitle> <Paragraph position="0"> The performances of the best versions of all algorithms for other-anaphora are summarized in Table 9.</Paragraph> <Paragraph position="1"> We compare algorithm performances using two tests throughout this article. We used a t-test to measure the difference between two algorithms in the proportion of correctly resolved anaphors. However, there are many examples which are easy (for example, string-matching examples) and that therefore most or all algorithms will resolve correctly, as well as many that are too hard for all algorithms.
Therefore, we also compare two algorithms using McNemar's test, which relies only on the part of the data set in which the algorithms do not give the same answer. If not otherwise stated, all significance claims hold at the 5% level for both the t-test and McNemar's test.</Paragraph> <Paragraph position="2"> The algorithm baselineSTR significantly outperforms baselineREC in precision, showing that the &quot;same predicate match&quot; is quite accurate even though not very frequent (coverage is only 30.9%). The WordNet-based and Web-based algorithms achieve a final precision that is significantly better than the baselines' as well as algoBNC's. Most interestingly, the Web-based algorithms significantly outperform the WordNet-based algorithms, confirming our predictions based on the descriptive statistics. The Web approach, for example, resolves Examples (3), (4), (6), and (11) (which WordNet could not resolve) in addition to Examples (8) and (9), which both the Web-based and the WordNet-based algorithms resolve.</Paragraph> <Paragraph position="3"> As expected, the WordNet-based algorithms suffer from the problems discussed in Section 2.2. In particular, Problem 1 proved to be quite severe, as algoWN achieved a coverage of only 65.2%. Missing links in WordNet also affect precision if a good distractor has a link to the anaphor in WordNet, whereas the correct antecedent does not (Example (10)). Missing links are both universal relations that should be included in an ontology (such as home:facility) and context-dependent links (e.g., age:(risk) factor, costs:repercussions; see Problem 2 in Section 2.2.2). Further mining of WordNet beyond following hyponymy/synonymy links might alleviate Problem 1 but is more costly and might lead to false positives (Problem 3). To a lesser degree, the WordNet algorithms also suffer from sense proliferation (Problem 4), as all senses of both anaphor and antecedent candidates were considered. Therefore, some hyp/syn relations based on a sense not intended in the text were found, leading to wrong-antecedent selection and lowering precision. In Example (11), for instance, there is no hyponymy link between the head noun of the correct antecedent (question) and the head noun of the anaphor (issue), whereas there is a hyponymy link between issue and person [= Mr. Dallara] (using the sense of issue as offspring) as well as a synonymy link between number and issue. While in this case considering the most frequent sense of the anaphor issue as indicated in WordNet would help, this would backfire in other cases in our data set in which issue is mostly used in the minority sense of stock, share. Obviously, prior word sense disambiguation would be the most principled but also a more costly solution.</Paragraph> <Paragraph position="4"> (11) While Mr. Dallara and Japanese officials say the question of investors access to the U.S. and Japanese markets may get a disproportionate share of the public's attention, a number of other important economic issues [...]</Paragraph> <Paragraph position="5"> The Web-based method does not suffer as much from these problems. The linguistically motivated patterns we use reduce long-distance dependencies between anaphor and antecedent to local dependencies. By looking up these patterns on the Web we make use of a large amount of data that is very likely to encode strong semantic links via these local dependencies and to do so frequently.
This holds both for universal hyponymy relations (addressing Problem 1) and relations that are not necessarily to be included in an ontology (addressing Problem 2). The problem of whether to include subjective and context-dependent relations in an ontology (Problem 2) is circumvented by using Web scores only in comparison to Web scores of other antecedent candidates.</Paragraph> <Paragraph position="6"> In addition, the Web-based algorithm needs no hand-processing or hand-modeling whatsoever, thereby avoiding the manual effort of building ontologies. Moreover, the local dependencies we use reduce the need for prior word sense disambiguation (Problem 4), as the anaphor and the antecedent constrain each other's sense within the context of the pattern. Furthermore, the Web scores are based on frequency, which biases the Web-based algorithms toward frequent senses as well as sense pairs that occur together frequently. Thus, the Web algorithm has no problem resolving issue to question in Example (11) because of the high frequency of the query question OR questions and other issues. Problem 3 is still not addressed, however, as any corpus can encode the same semantic relations via different patterns. Combining patterns might therefore yield problems similar to those presented by combining information sources in an ontology.</Paragraph> <Paragraph position="7"> Our pattern-based method, though, seems to work on very large corpora only. Unlike the Web-based algorithms, the BNC-based ones make use of POS tagging and observe sentence boundaries, thus reducing the noise intrinsic to an unprocessed corpus like the Web. Moreover, the instantiations used in algoBNC allow for modification to occur (see Table 4), thus increasing chances of a match. Nevertheless, the BNC-based algorithms performed much worse than the Web-based ones: Only 4.2% of all pattern instantiations were found in the BNC, yielding very low coverage and recall (see Table 5).</Paragraph> <Paragraph position="8"> Figure 1. Decision tree for error classification.</Paragraph> <Paragraph position="9"> Even the best algorithm still incurs 194 errors (47.6% of 408). Because in several cases there is more than one reason for a wrong assignment, we use the decision tree in Figure 1 for error classification. By using this decision tree, we can, for example, exclude from further analysis those cases that none of the algorithms could resolve because of their intrinsic design.</Paragraph> <Paragraph position="10"> As can be seen in Table 10, quite a large number of errors result from deleting pronouns as well as not dealing with split antecedents (44 cases, or 22.7% of all mistakes). Out of these 44, 30 involve split antecedents. In 19 of these 30 cases, one of the several correct antecedents has indeed been chosen by our algorithm, but all the correct antecedents need to be found to allow for the resolution to be counted as correct.</Paragraph> <Paragraph position="11"> Given the high number of NE antecedents in our corpus (43.8% of correct, 25% of all antecedents; see Table 1), NE resolution is crucial.
In 11.3% of the cases, the algorithm selects a distractor instead of the correct antecedent because the NER module either leaves the correct antecedent unresolved (which could then lead to very few or zero hits in Google) or resolves the named entity to the wrong NE category. String matching is a minor cause of errors (under 10%). This is because, apart from its being generally reliable, there is also a possible string match in only about 30% of the cases (see Table 2).</Paragraph> <Paragraph position="12"> 31 Percentages of errors are rounded to the first decimal; rounding errors account for the coverage of 99.9% of errors instead of 100%.</Paragraph> <Paragraph position="13"> Many mistakes, instead, occur because other-anaphora can express heavily context-dependent and very unconventional relations, such as the description of dolls as winners in Example (12).</Paragraph> <Paragraph position="14"> (12) Coleco bounced back with the introduction of the Cabbage Patch dolls. [...] But as the craze died, Coleco failed to come up with another winner. [...]</Paragraph> <Paragraph position="15"> In such cases, the relation between the anaphor and antecedent head nouns is not frequent enough to be found in a corpus even as large as the Web. This is mirrored in the high percentage of zero-score errors (24.7% of all mistakes). Although the Web algorithm suffers from a knowledge gap to a smaller degree than WordNet, there is still a substantial number of cases in which we cannot find the right lexical relation.</Paragraph> <Paragraph position="16"> Errors of type other are normally due to good distractors that achieve higher Web scores than the correct antecedent. A common reason is that the wished-for relation is attested but rare, and therefore other candidates yield higher scores. This is similar to zero-score errors. Furthermore, the elimination of modification, although useful to reduce data sparseness, can sometimes lead to the elimination of information that could help disambiguate among several candidate antecedents. Lastly, lexical information, albeit crucial and probably more important than syntactic information (Modjeska 2002), is not sufficient for the resolution of other-anaphora. The integration of other features, such as grammatical function, NP form, and discourse structure, could probably help when very good distractors cannot be ruled out by purely lexical methods (Example (10)). The integration of the Web feature in a machine-learning algorithm using several other features has yielded good results (Modjeska, Markert, and Nissim 2003).</Paragraph> <Paragraph position="17"> 32 Using different or simply more patterns might yield some hits for anaphor-antecedent pairs that return a zero score when instantiated in the pattern we use in this article.</Paragraph> <Paragraph position="18"> 5. Case Study II: Definite NP Coreference</Paragraph> <Paragraph position="19"> The Web-based method we have described outperforms WordNet as a knowledge source for antecedent selection in other-anaphora resolution. However, it is not clear how far the method and the achieved comparative results generalize to other kinds of full NP anaphora. In particular, we are interested in the following questions: - Is the knowledge gap encountered in WordNet for other-anaphora equally severe for other kinds of full NP anaphora?
A partial (mostly affirmative) answer to this is given by previous researchers, who put the knowledge gap for coreference at 30-50% and for bridging at 38-80%, depending on language, domain, and corpus (see Section 2).</Paragraph> <Paragraph position="20"> - Do the Web-based method and the specific search patterns we use generalize to other kinds of anaphora? - Do different anaphoric phenomena require different lexical knowledge sources?</Paragraph> <Paragraph position="21"> As a contribution, we investigate the performance of the knowledge sources discussed for other-anaphora in the resolution of coreferential NPs with full lexical heads, concentrating on definite NPs (see Example (1)). The automatic resolution of such anaphors has been the subject of quite significant interest in the past years, but results are much less satisfactory than those obtained for the resolution of pronouns (see Section 2).</Paragraph> <Paragraph position="22"> The relation between the head nouns of coreferential definite NPs and their antecedents is again, in general, one of hyponymy or synonymy, making an extension of our approach feasible. However, other-anaphors are especially apt at conveying context-specific or subjective information by forcing the reader via the other-expression to accommodate specific viewpoints. This might not hold for definite NPs.</Paragraph> <Section position="6" start_page="388" end_page="389" type="sub_section"> <SectionTitle> 5.1 Corpus Collection </SectionTitle> <Paragraph position="0"> We extracted definite NP anaphors and their candidate antecedents from the MUC-6 coreference corpus, including both the original training and test material, for a total of 60 documents. The documents were automatically preprocessed in the following way: All meta-information about each document indicated in XML (such as WSJ category and date) was discarded, and the headline was included and counted as one sentence. Whenever headlines contained three dashes, everything after the dashes was discarded.</Paragraph> <Paragraph position="1"> We then converted the MUC coreference chains into an anaphor-antecedent annotation concentrating on anaphoric definite NPs. All definite NPs which are in, but not at the beginning of, a coreference chain are potential anaphors. We excluded definite NPs with proper noun heads (such as the United States) from this set, since these do not depend on an antecedent for interpretation and are therefore not truly anaphoric. We also excluded appositives, which provide coreference structurally and are therefore not anaphoric. Otherwise, we strictly followed the MUC annotation for coreference in our extraction, although it is not entirely consistent and not necessarily comprehensive (van Deemter and Kibble 2000). This extraction method yielded a set of 565 anaphoric definite NPs.</Paragraph> <Paragraph position="2"> 33 We thank an anonymous reviewer for pointing out that this role for coreference is more likely to be provided by demonstratives than definite NPs.</Paragraph> <Paragraph position="3"> 34 Proper-noun heads are approximated by capitalization in the exclusion procedure.</Paragraph> <Paragraph position="4"> For each extracted anaphor in a coreference chain C we regard the NP in C that is closest to the anaphor as the correct antecedent, whereas all other previous mentions in C are regarded as lenient. NPs that occur before the anaphor but are not marked as being in the same coreference chain are distractors.
Since anaphors with split antecedents are not annotated in MUC, anaphors cannot have more than one correct antecedent. In Example (13), the NPs with the head nouns Pact, contract, and settlement are marked as coreferent in MUC. In our annotation, the settlement is an anaphor with a correct antecedent headed by contract and a lenient antecedent Pact. Other NPs prior to the anaphor (e.g., Canada or the IWA-Canada union) are distractors.</Paragraph> <Paragraph position="4"> (13) Forest Products Firms Tentatively Agree On Pact in Canada. A group of large British Columbia forest products companies has reached a tentative, three-year labor contract with about 18,000 members of the IWA-Canada union, ... The settlement involves ...</Paragraph> <Paragraph position="5"> Compared to the other-anaphora study, we expanded our window size from two to five sentences (the current and the four previous sentences) and excluded all anaphors with no correct or lenient antecedent within this window, thus yielding a final set of 477 anaphors (84.4% of 565). This larger window size is motivated by the fact that a window size of two would cover only 62.3% of all anaphors (352 out of 565).</Paragraph> </Section> <Section position="7" start_page="389" end_page="391" type="sub_section"> <SectionTitle> 5.2 Antecedent Extraction, Preprocessing, and Baselines </SectionTitle> <Paragraph position="0"> All NPs prior to the anaphor within the five-sentence window were extracted as antecedent candidates.</Paragraph> <Paragraph position="1"> We further processed anaphors and antecedents as in Case Study I (see Section 4.2): Modification was stripped and all NPs were lemmatized. In this experiment, named entities were resolved using Curran and Clark's (2003) NE tagger rather than GATE.</Paragraph> <Paragraph position="2"> The identified named entities were further subclassified into finer-grained entities, as described for Case Study I. The final number of extracted antecedents for the whole data set of 477 anaphors is 14,233, with an average of 29.84 antecedent candidates per anaphor. This figure is much higher than the average number of antecedent candidates for other-anaphors (10.5) because of the larger window size used. The data set includes 473 correct antecedents, 803 lenient antecedents, and 12,957 distractors. Table 11 shows the distribution of NP types for correct and lenient antecedents and for distractors.</Paragraph> <Paragraph position="3"> There are fewer correct antecedents (473) than anaphors (477) because the MUC annotation also includes anaphors whose antecedent is not an NP but, for example, a nominal modifier in a compound. Thus, in Example (14), the bankruptcy code is annotated in MUC as coreferential to bankruptcy-law, a modifier in bankruptcy-law protection.</Paragraph> <Paragraph position="4"> 35 All examples in the coreference study are from the MUC-6 corpus. 36 This extraction was conducted manually, to put this study on an equal footing with Case Study I. It presupposes perfect NP chunking. A further discussion of this issue can be found in Section 6. 37 Curran and Clark's (2003) tagger was not available to us during the first case study. Both NE taggers are state-of-the-art taggers trained on newspaper text.</Paragraph> <Paragraph position="5"> (14) [...] on hold when Eastern filed for bankruptcy-law protection March 9. ... If it doesn't go quickly enough, the judge said he may invoke a provision of the bankruptcy code [...]
In our scheme we extract the bankruptcy code as anaphoric, but our method of extracting candidate antecedents does not include bankruptcy-law. Therefore, there are four anaphors in our data set with no correct/lenient antecedent extracted. These cannot be resolved by any of the suggested approaches.</Paragraph> <Paragraph position="6"> We use the same evaluation measures as for other-anaphora as well as the same significance tests for precision (see Table 12 and cf. Table 2). The recency baseline performs worse than for other-anaphora. String matching improves dramatically on simple recency. It also appears more effective than on our other-anaphora data set, achieving higher coverage, precision, and recall. This confirms the high value that previous researchers have assigned to string matching for coreference resolution (Soon, Ng, and Lim 2001; Strube, Rapp, and Mueller 2002, among others).</Paragraph> <Paragraph position="7"> As the MUC data set does not include split antecedents, an anaphor ana usually agrees in number with its antecedent. Therefore, we also explored variations of all algorithms that, as a first step, delete from A_ana all candidate antecedents that do not agree in number with ana.</Paragraph> <Paragraph position="8"> The algorithms then proceed as usual. Algorithms that use number checking are marked with an additional n in the subscript. Using number checking leads to small but consistent gains for all baselines.</Paragraph> <Paragraph position="9"> As in Case Study I, we deleted pronouns for the WordNet- and corpus-based methods, thereby removing 70 of the 473 correct antecedents (14.8%; see Table 11). After pronoun deletion, the total number of antecedents in our data set is 12,940 for 477 anaphors, of which 403 are correct antecedents, 658 are lenient antecedents, and 11,879 are distractors.</Paragraph> <Paragraph position="10"> 38 The number feature can have the values singular, plural, or unknown. All NE antecedent candidates received the value singular, as this was by far the most common occurrence in the data set. Information about the grammatical number of anaphors and common-noun antecedent candidates was calculated and retained as additional information during the lemmatization process. If lemmatization to both a plural and a singular noun (as determined by WordNet and CELEX) was possible (for example, the word talks could be lemmatized to talk or talks), the value unknown was used. An anaphor and an antecedent candidate were said to agree in number if they had the same value or if at least one of the two values was unknown.</Paragraph> </Section> <Section position="8" start_page="391" end_page="392" type="sub_section"> <SectionTitle> 5.3 WordNet for Antecedent Selection in Definite NP Coreference </SectionTitle> <Paragraph position="0"> We hypothesize that again most antecedents are hyponyms or synonyms of their anaphors in definite NP coreference (see Examples (1) and (13)). Therefore we use the same look-up for hyp/syn relations that was used for other-anaphora (see Section 4.4), including the specifications for common noun and proper name look-ups. Parallel to Table 3, Table 13 summarizes how many correct and lenient antecedents and distractors stand in a hyp/syn relation to their anaphor in WordNet.</Paragraph> <Paragraph position="1"> As already observed for other-anaphora, correct and lenient antecedents stand in a hyp/syn relation to their anaphor significantly more often than distractors do (t-test, p < 0.001).
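For readers who want to experiment with this kind of look-up, the following is a minimal sketch using NLTK's WordNet interface; it only approximates the hyp/syn test described in Section 4.4 and omits the original specifications for sense handling and proper-name look-up.

# Rough approximation of a hyp/syn test between two head nouns.
# Assumes NLTK is installed and the WordNet data has been downloaded
# (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def in_hyp_syn_relation(antecedent_head: str, anaphor_head: str) -> bool:
    """True if some sense of the antecedent is a synonym of, or a (transitive)
    hyponym of, some sense of the anaphor in WordNet's noun hierarchy."""
    ana_synsets = set(wn.synsets(anaphor_head, pos=wn.NOUN))
    for ant_synset in wn.synsets(antecedent_head, pos=wn.NOUN):
        if ant_synset in ana_synsets:                    # synonymy: shared synset
            return True
        # hypernym closure of the antecedent sense = all of its ancestors
        ancestors = set(ant_synset.closure(lambda s: s.hypernyms()))
        if ancestors & ana_synsets:                      # antecedent lies below anaphor
            return True
    return False

# e.g. in_hyp_syn_relation("dog", "animal") should return True.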
Hyp/syn relations in WordNet might be better at capturing the relation between antecedents and anaphors for definite NP coreference than for other-anaphora: A higher percentage of correct and lenient antecedents of definite NP coreference (71.96%/67.78%) stand in a hyp/syn relation to their anaphors than is the case for other-anaphora (43.0%/42.5%). At the same time, though, there is no difference in the percentage of distractors that stand in a hyp/syn relation to their anaphors (9% for other-anaphora, 8.80% for definite NP coreference). For our WordNet algorithms, this is likely to translate directly into higher coverage and recall and potentially into higher precision than in Case Study I. Still, about 30% of correct antecedents are not in a hyp/syn relation to their anaphor in the current case study, confirming results by Harabagiu, Bunescu, and Maiorano (2001), who also look at MUC-style corpora.</Paragraph> <Paragraph position="2"> This gap, though, is alleviated by a fairly high number of lenient antecedents, whose resolution can make up for a missing link between anaphor and correct antecedent.</Paragraph> <Paragraph position="3"> The WordNet-based algorithms are defined exactly as in Section 4.4, with the additional two algorithms that include number checking. Results are summarized in Table 14.</Paragraph> <Paragraph position="4"> All variations of the WordNet algorithms perform significantly better than the corresponding versions of the string-matching baseline (i.e., each algoWN variation outperforms its baselineSTR counterpart), showing that WordNet adds lexical knowledge beyond string matching. As expected from the descriptive statistics discussed above, the results are better than those obtained by the WordNet algorithms for other-anaphora, even if we disregard the additional morphosyntactic number constraint.</Paragraph> <Paragraph position="5"> [...] whereas we concentrate on definite NPs only, so that the results are not exactly the same. 41 The possibility of resolving to lenient antecedents follows an approach similar to that of Ng and Cardie (2002b), who suggest a "best-first" coreference resolution approach instead of a "most recent first" approach.</Paragraph> </Section> <Section position="9" start_page="392" end_page="393" type="sub_section"> <SectionTitle> 5.4 The Corpus-Based Approach for Definite NP Coreference </SectionTitle> <Paragraph position="0"> Following the assumption that most antecedents are hyponyms or synonyms of their anaphors in definite NP coreference, we use the same list-context pattern and instantiations that were used for other-anaphora, allowing us to evaluate whether they are transferable. The corpora we use are again the Web and the BNC.</Paragraph> <Paragraph position="1"> As with other-anaphora, the Web scores do well in distinguishing between correct/lenient antecedents and distractors, with significantly higher means/medians for correct/lenient antecedents (median 472/617 vs. 2 for distractors), as well as significantly fewer zero scores (8% for correct/lenient vs. 41% for distractors). This indicates transferability of the Web-based approach to coreference.
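The Web scoring itself can be pictured roughly as follows. This is a schematic sketch only: the exact pattern wording, the pluralization, the use of inverted queries, and the hit-count function are all simplified, and page_count is a stand-in for whichever search-engine or corpus-count API is available.

# Schematic list-context ("X and other Ys") Web scoring for one anaphor.
# page_count(query) is assumed to return the number of pages matching the
# quoted phrase in some very large corpus or search engine -- a stand-in only.

def page_count(query: str) -> int:
    raise NotImplementedError("plug in a search-engine or corpus-count API here")

def list_context_score(antecedent_head: str, anaphor_head: str) -> int:
    """Instantiate a list-context pattern with the candidate antecedent and the
    anaphor head and use the resulting hit count as a compatibility score."""
    query = f'"{antecedent_head} and other {anaphor_head}s"'   # naive pluralization
    return page_count(query)

def best_candidate(anaphor_head: str, candidate_heads):
    scores = {c: list_context_score(c, anaphor_head) for c in candidate_heads}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # a zero score means no lexical evidence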
Compared to other-anaphora, the number of zero scores is lower for correct/lenient antecedent types, so that we expect better overall results, similar to our expectations for the WordNet algorithm.</Paragraph> <Paragraph position="2"> The BNC scores can also distinguish between correct/lenient antecedents and distractors, since the number of zero scores for correct/lenient antecedents (68.98%/58.05%) is significantly lower than for distractors (96.97%). Although more than 50% of correct/lenient antecedents receive a zero score, there are fewer zero scores than for other-anaphora (for which more than 80% of correct/lenient antecedents received zero scores). However, BNC scores are again in general much lower than Web scores, as measured by means, medians, and zero scores. Nevertheless, Web scores and BNC scores correlate significantly, with the correlations reaching higher coefficients (0.53 to 0.65, depending on antecedent group) than they did in the case study for other-anaphora.</Paragraph> <Paragraph position="3"> The corpus-based algorithms for coreference resolution are parallel to those described for other-anaphora and are marked by the same subscripts. The variations that include number checking are again marked by a subscript n. Tables 15 and 16 report the results for all the Web and BNC algorithms, respectively.</Paragraph> </Section> <Section position="10" start_page="393" end_page="396" type="sub_section"> <SectionTitle> 5.5 Discussion and Error Analysis </SectionTitle> <Paragraph position="0"> 5.5.1 Algorithm Comparison. Using the original or the replaced antecedent for string matching (versions v1 vs. v2, v1n vs. v2n, v3 vs. v4, and v3n vs. v4n) never results in interesting differences in any of the approaches discussed. Also, number matching provides consistent improvements. Therefore, from this point on, our discussion will disregard the variations that use original antecedents only (v1, v1n, v3, and v3n) as well as the algorithms that do not use number matching (v2, v4). We will also concentrate on the final precision [?] of the full-coverage algorithms. The set of anaphors that are covered by the best string-matching baseline, prior to recency back-off, will again be denoted by StrSet_v2n. Again, both a t-test and McNemar's test will be used when statements about significance are made.</Paragraph> <Paragraph position="1"> The results for the string-matching baselines and for the lexical methods are higher for definite coreferential NPs than for other-anaphora. This is largely a result of the higher number of string-matching antecedent/anaphor pairs in coreference, the higher precision of string matching, and, to a lesser degree, the lower number of unusual redescriptions.</Paragraph> <Paragraph position="2"> As for other-anaphora, the WordNet-based algorithms beat the corresponding baselines. The first striking result is that the Web algorithm variation algoWeb_v2n, which relies only on the highest Web scores and is therefore allowed to overrule string matching, does not outperform the corresponding string-matching baseline baselineSTR_v2n and performs significantly worse than the corresponding WordNet algorithm algoWN_v2n. This contrasts with the results for other-anaphora.
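As a reminder of what the McNemar test mentioned above actually computes, the following is a minimal sketch of the statistic applied to the paired per-anaphor decisions of two resolvers; the exact variant (continuity correction vs. exact binomial) used in the original experiments is not specified here, so this is an illustration rather than a reproduction.

# McNemar's test statistic (with continuity correction) on paired decisions.
# Each list holds True/False: did the algorithm pick a correct (or lenient)
# antecedent for anaphor i?  Only the discordant pairs enter the statistic.

def mcnemar_chi2(results_a, results_b):
    b = sum(1 for x, y in zip(results_a, results_b) if x and not y)  # A right, B wrong
    c = sum(1 for x, y in zip(results_a, results_b) if y and not x)  # B right, A wrong
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)   # ~ chi-squared with 1 degree of freedom

# A value above 3.84 indicates significance at the 5% level and above 6.63 at the
# 1% level (critical values of the chi-squared distribution with one degree of freedom).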
When the results were examined in detail, it emerged that for a considerable number of anaphors in StrSet_v2n, the highest Web score was indeed achieved by a distractor with a high-frequency head noun, although the correct or lenient antecedent could instead have been found by a simple string match to the anaphor. This problem is much more severe than for other-anaphora because of (1) the larger window size, which includes more distractors, and (2) the higher a priori precision of the string-matching baseline, which means that overruling string matching leads to wrong results more frequently. Typical examples involve named-entity recognition and inverted queries. Thus, in Example (15), the anaphor the union is coreferent with the first occurrence of the union, a case easily resolved by string matching. However, the distractor organization [= Chrysler Canada] achieves a higher Web score, because of the score of the inverted query union OR unions and other organizations.</Paragraph> <Paragraph position="3"> (15) [...] The union struck Chrysler Canada Tuesday after rejecting a company offer on pension adjustments. The union said the size of the adjustments was inadequate.</Paragraph> <Paragraph position="4"> Several potential solutions to this problem exist, such as normalization of Web scores or penalizing inverted queries. The solution we have adopted in algoWeb_v4n is to use Web scores only after string matching, thereby making the Web-based approach more comparable to the WordNet approach. Therefore, baselineSTR_v2n, the WordNet-based algorithm, and algoWeb_v4n (as well as algoBNC_v4n) all coincide in their decisions for anaphors in StrSet_v2n and only differ in the decisions made for anaphors that do not have a matching antecedent candidate. Indeed, algoWeb_v4n performs significantly better than the baselines at the 1% level, and results rise from a precision [?] of 60.6% for algoWeb_v2n to 71.3% for algoWeb_v4n. It also significantly outperforms the best BNC results, thus showing that overcoming data sparseness is more important than working with a controlled, tagged, and representative corpus. Furthermore, algoWeb_v4n shows better performance than WordNet in the final algorithm variation (71.3% vs. 66.2%).</Paragraph> <Paragraph position="5"> According to a t-test, however, this last difference is not significant. McNemar's test, which concentrates on the part of the data in which the methods differ, shows instead significance at the 1% level. Indeed, one of the problems in comparing algorithm results for coreference is that such a large number of anaphors are covered by simple string matching, leaving only 146 anaphors for which the methods can differ at all. On this set, the string-matching baseline assigns the correct antecedent to 13 anaphors (8.9%) by using a recency back-off, the best WordNet method to 55 anaphors (37.67%), and the best Web method to 72 anaphors (49.31%). Therefore the Web-based method is a better complement to string matching than WordNet, which is reflected in the results of McNemar's test.</Paragraph> <Paragraph position="6"> 42 Remember that this problem does not affect the WordNet-based algorithm, which always achieves the same results as the string-matching baseline on StrSet_v2n. Both the correct antecedent and the organization [= Chrysler Canada] distractor stand in a hyp/syn relation to the anaphor, and string matching is then used as a tiebreaker. 43 In general, the WordNet methods achieve higher precision, with the Web method achieving higher recall.</Paragraph>
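The overall decision order of algoWeb_v4n as just described can be sketched as follows. The helper functions (web_score, agrees_in_number) and attribute names are placeholders for the components introduced earlier; the handling of lenient antecedents and named entities is omitted, and recency is used here as a stand-in tiebreaker.

# Rough sketch of the back-off order described above:
# 1. prefer a string-matching antecedent candidate,
# 2. otherwise take the number-compatible candidate with the highest non-zero Web score,
# 3. otherwise fall back to recency.

def resolve(anaphor, candidates, web_score, agrees_in_number):
    """candidates are NP candidates in document order (most recent last)."""
    matching = [c for c in candidates if c.head_lemma == anaphor.head_lemma]
    if matching:
        return matching[-1]                        # most recent string match

    compatible = [c for c in candidates if agrees_in_number(anaphor, c)]
    scored = [(web_score(c, anaphor), c) for c in compatible]
    positive = [(s, c) for s, c in scored if s > 0]
    if positive:
        best = max(s for s, _ in positive)
        tied = [c for s, c in positive if s == best]
        return tied[-1]                            # recency as a simple tiebreaker

    return candidates[-1] if candidates else None  # recency back-off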
Anaphor-antecedent relations that were not covered in WordNet but that did not prove a problem for the Web algorithm were again both general hyponymy relations, such as retailer:organization, bill:legislation, and month:time, and more subjective relations like (wage) cuts:concessions and legislation:attack.</Paragraph> <Paragraph position="7"> Even the best algorithm, algoWeb_v4n, still selects the wrong antecedent for a given anaphor in 137 of 477 cases (28.7%). Again, we use the decision tree in Figure 1 to classify errors. Design errors now do not include split antecedents but do include errors that occur because the condition of number agreement was violated, pronoun deletion errors, and the four cases in which the antecedent is not an NP and therefore was not extracted in the first place (see Section 5.1 and Example (14)). Table 17 reports the frequency of each error type. Unlike for other-anaphora, the design and NE errors together account for under 15% of the mistakes. Also rare are zero-score errors (only 8%). Compared to the proportion of zero-score errors in other-anaphora (24.7%), this low figure suggests that other-anaphora is more prone to exploit rare, unusual, and context-dependent redescriptions than full NP coreference. Nevertheless, it is still possible to find non-standard redescriptions in coreference that yield zero scores, such as the use of transaction to refer to move in Example (16).</Paragraph> <Paragraph position="8"> (16) Conseco Inc., in a move to generate about $200 million in tax deductions, said it induced five of its top executives to exercise stock options to purchase about 3.6 million common shares of the financial-services concern. As a result of the transaction, ...</Paragraph> <Paragraph position="9"> Much more substantial is the weight of errors due to string matching, tiebreaker decisions, and the presence of good distractors (the main reason for errors of type other), which together account for over three-quarters of all mistakes. String matching is quite successful for coreference (baselineSTR_v2n [...]); however, because algoWeb_v4n never overrules string matching, the errors of baselineSTR_v2n are preserved here and account for 24.1% of all mistakes.</Paragraph> <Paragraph position="10"> Tiebreaker errors are quite frequent too (24.8%), as our far-from-sophisticated tiebreaker was needed in nearly half of the cases (224 times; 47.0%). The remaining errors (29.2%) are due to the presence of good distractors that score higher than the correct/lenient antecedent. In Example (17), for instance, a distractor with a higher Web score (comment) prevents the algorithm from selecting the correct antecedent (investigation) for the anaphor the inquiry.</Paragraph> <Paragraph position="11"> (17) Mr. Adams couldn't be reached for comment. Though the investigation has barely begun, persons close to the board said Messrs.
Lavin and Young will get a "hard look" as to whether they were involved, and are both considered a "natural focus" of the inquiry.</Paragraph> <Paragraph position="12"> Example (18) shows how stripping modification might have eliminated information crucial to identifying the correct antecedent: Only the head noun process was retained from the anaphor arbitration process, so that the surface link between anaphor and antecedent (arbitration) was lost and the distractor securities industry, reduced to industry, was instead selected.</Paragraph> <Paragraph position="13"> (18) The securities industry has favored arbitration because it keeps brokers and dealers out of court. But consumer advocates say customers sometimes unwittingly sign away their right to sue. "We don't necessarily have a beef with the arbitration process," says Martin Meehan, [...]</Paragraph> </Section> </Section> </Paper>