<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1045"> <Title>Applying Co-Training to Reference Resolution</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Features for Reference Resolution </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Previous Work </SectionTitle> <Paragraph position="0"> Driven by the necessity to provide robust systems for the MUC system evaluations, researchers began to look for those features which were particularly important for the task of reference resolution. While most features for pronoun resolution have been described in the literature for decades, researchers only recently began to look for robust and cheap features, i.e., those which perform well over several domains and can be annotated (semi-)automatically. Also, the relative quantitative contribution of each of these features came into focus only after the advent of corpus-based and statistical methods. In the following, we describe a few earlier contributions with respect to the features used.</Paragraph> <Paragraph position="1"> Decision tree algorithms were used for reference resolution by Aone and Bennett (1995, C4.5), McCarthy and Lehnert (1995, C4.5) and Soon et al. (2001, C5.0). This approach requires the definition of a set of training features describing pairs of anaphors and their antecedents, as illustrated in the sketch below.</Paragraph> <Paragraph position="2"> Aone and Bennett (1995), working on reference resolution in Japanese newspaper articles, use 66 features. They do not mention all of these explicitly but emphasize the features POS-tag, grammatical role, semantic class and distance.</Paragraph> <Paragraph position="3"> The set of semantic classes they use appears to be rather elaborate and highly domain-dependent.</Paragraph> <Paragraph position="4"> Aone and Bennett (1995) report that their best classifier achieved an F-measure of about 77% after training on 250 documents. They mention that it was important for the training data to contain transitive positives, i.e., all possible coreference relations within an anaphoric chain.</Paragraph> <Paragraph position="5"> McCarthy and Lehnert (1995) describe a reference resolution component which they evaluated on the MUC-5 English Joint Venture corpus. They distinguish between features which focus on individual noun phrases (e.g., does the noun phrase contain a name?) and features which focus on the anaphoric relation (e.g., do both share a common NP?). Soon et al. (2001) criticized the features used by McCarthy and Lehnert (1995) as highly idiosyncratic and applicable only to one particular domain. McCarthy and Lehnert (1995) achieved results of about 86% F-measure (evaluated according to Vilain et al. (1995)) on the MUC-5 data set. However, only a defined subset of all possible reference resolution cases was considered relevant in the MUC-5 task description, e.g., only entity references. For this case, the domain-dependent features may have been particularly important, making it difficult to compare the results of this approach to others working on less restricted domains.</Paragraph> <Paragraph position="6"> Soon et al. (2001) use the twelve features shown in Table 1.

Table 1: Features used by Soon et al. (2001)
- distance in sentences between anaphor and antecedent
- antecedent is a pronoun?
- anaphor is a pronoun?
- weak string identity between anaphor and antecedent
- anaphor is a definite noun phrase?
- anaphor is a demonstrative pronoun?
- number agreement between anaphor and antecedent
- semantic class agreement between anaphor and antecedent
- gender agreement between anaphor and antecedent
- anaphor and antecedent are both proper names?
- an alias feature (used for proper names and acronyms)
- an appositive feature

They show a part of their decision tree in which the weak string identity feature (i.e., identity after determiners have been removed) appears to be the most important one. They also report on the relative contribution of the features: the three features weak string identity, alias (which maps named entities in order to resolve dates, person names, acronyms, etc.) and appositive seem to cover most of the cases, while the other nine features contribute only 2.3% F-measure for MUC-6 texts and 1% F-measure for MUC-7 texts. Soon et al. (2001) include all noun phrases returned by their NP identifier and report an F-measure of 62.6% for MUC-6 data and 60.4% for MUC-7 data. They only used pairs of anaphors and their closest antecedents as positive examples in training, but evaluated according to Vilain et al. (1995).</Paragraph>
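<Paragraph> For concreteness, the following minimal sketch (ours, not taken from any of the cited systems) shows the pair-based decision-tree setup; the feature values are invented, and scikit-learn's entropy-criterion tree stands in for C4.5/C5.0:

# Illustrative sketch: decision-tree learning over anaphor-antecedent
# pairs in the style of the systems cited above (which used C4.5/C5.0).
from sklearn.tree import DecisionTreeClassifier

# One instance per (antecedent, anaphor) pair. Features (invented):
# [sentence distance, weak string identity, number agreement,
#  both proper names, appositive]
X = [
    [0, 1, 1, 0, 0],  # same sentence, weak string match
    [3, 0, 0, 0, 0],  # distant, no agreement
    [1, 0, 1, 1, 0],  # adjacent sentences, two agreeing proper names
    [5, 0, 0, 1, 0],
]
y = [1, 0, 1, 0]      # 1 = coreferent, 0 = not coreferent

clf = DecisionTreeClassifier(criterion="entropy")  # information gain, as in C4.5
clf.fit(X, y)
print(clf.predict([[0, 1, 1, 0, 0]]))  # -> [1]
</Paragraph>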
<Paragraph position="7"> Cardie and Wagstaff (1999) describe an unsupervised clustering approach to noun phrase coreference resolution in which features are assigned to single noun phrases only. They use the features shown in Table 2, all of which are obtained automatically without any manual tagging.</Paragraph>

Table 2: Features used by Cardie and Wagstaff (1999)
- position (NPs are numbered sequentially)
- pronoun type (nom., acc., possessive, ambiguous)
- article (indefinite, definite, none)
- appositive (yes, no)
- number (singular, plural)
- proper name (yes, no)
- semantic class (based on WordNet: time, city, animal, human, object; based on a separate algorithm: number, money, company)
- gender (masculine, feminine, either, neuter)
- animacy (anim, inanim)

<Paragraph position="8"> The feature semantic class used by Cardie and Wagstaff (1999) seems to be a domain-dependent one which can only be used for the MUC domain and similar ones.</Paragraph> <Paragraph position="9"> Cardie and Wagstaff (1999) report a performance of 53.6% F-measure (evaluated according to Vilain et al. (1995)).</Paragraph>
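<Paragraph> As an illustration of the clustering idea (a sketch of ours, not Cardie and Wagstaff's actual method: their distance metric weights eleven features and uses hard incompatibilities, which we reduce here to number/gender compatibility and positional distance):

def compatible(np1, np2):
    # toy stand-in for Cardie and Wagstaff's incompatibility checks
    return (np1["number"] == np2["number"]
            and np1["gender"] in (np2["gender"], "either"))

def cluster(nps, max_dist=10):
    clusters = []  # each cluster is a list of NP dicts in document order
    for np_ in nps:
        for c in reversed(clusters):  # prefer the most recent cluster
            last = c[-1]
            if (max_dist >= np_["position"] - last["position"]
                    and compatible(np_, last)):
                c.append(np_)
                break
        else:
            clusters.append([np_])
    return clusters

nps = [
    {"position": 1, "head": "Mr. Smith", "number": "sg", "gender": "masculine"},
    {"position": 2, "head": "the company", "number": "sg", "gender": "neuter"},
    {"position": 3, "head": "he", "number": "sg", "gender": "masculine"},
]
print([[m["head"] for m in c] for c in cluster(nps)])
# -> [['Mr. Smith', 'he'], ['the company']]
</Paragraph>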
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Our Features </SectionTitle> <Paragraph position="0"> We consider the features we use for our weakly supervised approach to be domain-independent.</Paragraph> <Paragraph position="1"> We distinguish between features assigned to noun phrases and features assigned to the potential coreference relation. They are listed in Table 3 together with their respective possible values. In the literature on reference resolution it is claimed that the antecedent's grammatical function and its realization are important. Hence we introduce the features ante gram func and ante npform. The identity in grammatical function of a potential anaphor and antecedent is captured in the feature syn par. Since in German gender and semantic class do not necessarily coincide (i.e., objects are not necessarily neuter as they are in English), we also provide a semantic-class feature which captures the difference between human, concrete, and abstract objects. This basically corresponds to the gender attribute in English. The feature wdist captures the distance in words between anaphor and antecedent, the feature ddist the distance in sentences, and the feature mdist the number of markables (NPs) between anaphor and antecedent. The string ident and substring match features were also used by other researchers (Soon et al., 2001), while the features ante med and ana med were used by Strube et al. (2002) in order to improve the performance for definite NPs. The minimum edit distance (MED) computes the similarity of strings by taking into account the minimum number of editing operations (substitutions s, insertions i, deletions d) needed to transform one string into the other (Wagner and Fischer, 1974). The MED is computed from these editing operations and the length of the potential antecedent m or the length of the anaphor n.</Paragraph>
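<Paragraph> The computation can be illustrated as follows (a sketch, with function names of our choosing; the dynamic program is Wagner and Fischer's, and the scaling by m and n corresponds to the ante med and ana med entries as reconstructed in Table 3):

def med(a, b):
    # Wagner-Fischer (1974): d[i][j] = minimum number of substitutions,
    # insertions and deletions needed to turn a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def ante_med(antecedent, anaphor):   # scaled by antecedent length m
    m = len(antecedent)
    return 100 * (m - med(antecedent, anaphor)) / m

def ana_med(antecedent, anaphor):    # scaled by anaphor length n
    n = len(anaphor)
    return 100 * (n - med(antecedent, anaphor)) / n

print(ante_med("Heidelberger Schloss", "Schloss"))  # -> 35.0
</Paragraph>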
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Co-Training </SectionTitle> <Paragraph position="0"> Co-Training (Blum and Mitchell, 1998) is a meta-learning algorithm which exploits unlabeled data in addition to labeled training data for classifier learning. A Co-Training classifier is complex in the sense that it consists of two simple classifiers (most often Naive Bayes, e.g. by Blum and Mitchell (1998) and Pierce and Cardie (2001)). Initially, these classifiers are trained in the conventional way using a small set of size L of labeled training data. In this process, each of the two classifiers is trained on a different subset of features of the training data. These feature subsets are commonly referred to as different views that the classifiers have on the data, i.e., each classifier describes a given instance in terms of different features. The Co-Training algorithm is supposed to bootstrap by gradually extending the training data with self-labeled instances. It utilizes the two classifiers by letting them in turn label the p best positive and n best negative instances from a set of size P of unlabeled training data (referred to in the literature as the pool). Instances labeled by one classifier are then added to the other's training data, and vice versa. After each turn, both classifiers are re-trained on their augmented training sets, and the pool is refilled with 2p + 2n unlabeled training instances drawn at random. This process is repeated either for a given number of iterations I or until all the unlabeled data has been labeled. In particular, the definition of the two data views appears to be a crucial factor which can strongly influence the behaviour of Co-Training. A number of requirements for these views are mentioned in the literature, e.g., that they have to be disjoint or even conditionally independent (but cf. Nigam and Ghani (2000)). Another important factor is the ratio between p and n, i.e., the number of positive and negative instances added in each iteration. These values are commonly chosen in such a way as to reflect the empirical class distribution of the respective instances.</Paragraph>
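<Paragraph> The loop can be summarized as in the following sketch (our paraphrase of the procedure described above; Naive Bayes learners, list-of-lists instances with binary features, and index-based views are simplifying assumptions, and the labeled seed is assumed to contain both classes):

import random
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def co_train(X_lab, y_lab, X_unlab, views, p=1, n=3, iterations=10):
    # views: two lists of feature column indices (the two "views")
    train = [(list(X_lab), list(y_lab)), (list(X_lab), list(y_lab))]
    pool = list(X_unlab[:2 * (p + n)])
    rest = list(X_unlab[2 * (p + n):])
    for _ in range(iterations):
        for k in (0, 1):
            if p + n > len(pool):
                return train
            Xk, yk = train[k]
            clf = BernoulliNB().fit(np.array(Xk)[:, views[k]], yk)
            # rank pool instances by P(positive) under view k
            probs = clf.predict_proba(np.array(pool)[:, views[k]])[:, 1]
            order = np.argsort(probs)
            chosen = ([(pool[i], 1) for i in order[-p:]]    # p best positive
                      + [(pool[i], 0) for i in order[:n]])  # n best negative
            # self-labeled instances augment the *other* classifier's data
            for x, lab in chosen:
                train[1 - k][0].append(x)
                train[1 - k][1].append(lab)
            for x, _ in chosen:
                pool.remove(x)
        # refill the pool with 2p + 2n random unlabeled instances
        random.shuffle(rest)
        pool.extend(rest[:2 * (p + n)])
        rest = rest[2 * (p + n):]
    return train
</Paragraph>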
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Text Corpus </SectionTitle> <Paragraph position="0"> Our corpus consists of 250 short German texts (total 36924 tokens, 9399 NPs, 2179 anaphoric NPs) about sights, historic events and persons in Heidelberg.</Paragraph> <Paragraph position="1"> The average length of the texts was 149 tokens. The texts were POS-tagged using TnT (Brants, 2000). A basic identification of markables (i.e., NPs) was obtained by using the NP-Chunker Chunkie (Skut and Brants, 1998). The POS-tagger was also used for assigning attributes to markables (e.g. the NP form).</Paragraph>

Table 3: Features with their respective possible values
Document-level features
1. doc id: document number (1 ... 250)
NP-level features
2. ante gram func: grammatical function of antecedent (subject, object, other)
3. ante npform: form of antecedent (definite NP, indefinite NP, personal pronoun, demonstrative pronoun, possessive pronoun, proper name)
4. ante agree: agreement in person, gender, number
5. ante semanticclass: semantic class of antecedent (human, concrete object, abstract object)
6. ana gram func: grammatical function of anaphor (subject, object, other)
7. ana npform: form of anaphor (definite NP, indefinite NP, personal pronoun, demonstrative pronoun, possessive pronoun, proper name)
8. ana agree: agreement in person, gender, number
9. ana semanticclass: semantic class of anaphor (human, concrete object, abstract object)
Coreference-level features
10. wdist: distance between anaphor and antecedent in words (1 ... n)
11. ddist: distance between anaphor and antecedent in sentences (0, 1, >1)
12. mdist: distance between anaphor and antecedent in markables (NPs) (1 ... n)
13. syn par: anaphor and antecedent have the same grammatical function (yes, no)
14. string ident: anaphor and antecedent consist of identical strings (yes, no)
15. substring match: one string contains the other (yes, no)
16. ante med: minimum edit distance to anaphor: 100 * (m - (s + i + d)) / m
17. ana med: minimum edit distance to antecedent: 100 * (n - (s + i + d)) / n

<Paragraph position="2"> The automatic annotation was followed by a manual correction and annotation phase in which further tags were assigned to the markables. In this phase manual coreference annotation was performed as well. In our annotation, coreference is represented in terms of a member attribute on markables (i.e., noun phrases). Markables with the same value in this attribute are considered coreferring expressions.</Paragraph> <Paragraph position="3"> The annotation was performed by two students. The reliability of the annotations was checked using the kappa statistic (Carletta, 1996).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Coreference resolution as binary classification </SectionTitle> <Paragraph position="0"> The problem of coreference resolution can easily be formulated in such a way as to be amenable to Co-Training. The most straightforward definition turns the task into a binary classification: given a pair of potential anaphor and potential antecedent, classify as positive if the antecedent is in fact the closest antecedent, and as negative otherwise. Note that the restriction of this rule to the closest antecedent means that transitive antecedents (i.e., those occurring further upwards in the text than the direct antecedent) are treated as negative in the training data. We favour this definition because it strengthens the predictive power of the word distance between potential anaphor and potential antecedent (as expressed in the wdist feature).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Test and Training Data Generation </SectionTitle> <Paragraph position="0"> From our annotated corpus, we created one initial training and test data set. For each text, a list of noun phrases in document order was generated. This list was then processed from end to beginning, the phrase at the current position being considered as a potential anaphor. Beginning with the directly preceding position, each noun phrase which appeared before was combined with the potential anaphor and both entities were considered a potential antecedent-anaphor pair. If applied to a text with n noun phrases, this algorithm produces a total of n(n-1)/2 noun phrase pairs. However, a number of filters can reasonably be applied at this point. An antecedent-anaphor pair is discarded
- if the anaphor is an indefinite NP,
- if one entity is embedded into the other, e.g., if the potential anaphor is the head of the potential antecedent NP (or vice versa),
- if [...] singular or plural in its agreement feature,(1)
- if both entities have different values in their agreement features.(2)
(1) [...] because in a real-world setting, information about a pronoun's semantic class obviously is not available prior to its resolution.
(2) This filter applies only if the anaphor is a pronoun. This restriction is necessary because German allows for cases where an antecedent is referred back to by a non-pronoun anaphor which has a different grammatical gender.</Paragraph> <Paragraph position="1"> For some texts, these heuristics removed up to 50% of the potential antecedent-anaphor pairs, all of which would have been negative cases. We regard these cases as irrelevant because they do not contribute any knowledge to the classifier. After application of these filters, the remaining candidate pairs were labeled as follows (the whole procedure is summarized in the sketch after this list):
- Pairs of anaphors and their direct (i.e., closest) antecedents were labeled P. This means that each anaphoric expression produced exactly one positive instance.
- Pairs of anaphors and their indirect (transitive) antecedents were labeled TP.
- Pairs of anaphors and those non-antecedents which occurred between the anaphor and the direct antecedent were labeled N. The number of negative instances that each expression produced thus depended on the number of non-antecedents occurring between the anaphor and its direct antecedent (if any).
- Pairs of anaphors and non-antecedents were labeled DN (distant N) if at least one true antecedent occurred in between.</Paragraph>
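<Paragraph> The following sketch (ours; markables are reduced to small dicts, the agreement feature to sg/pl, chain ids play the role of the member attribute, and the filters are reduced to the checks listed above) summarizes the generation and labeling procedure:

def embedded(a, b):
    # one markable's token span contains the other's
    return ((b["span"][0] >= a["span"][0] and a["span"][1] >= b["span"][1])
            or (a["span"][0] >= b["span"][0] and b["span"][1] >= a["span"][1]))

def generate_pairs(nps):
    # nps: markables in document order; each has 'indefinite' (bool),
    # 'pronoun' (bool), 'agree' ('sg'/'pl'/other), 'span' ((start, end))
    # and 'chain' (coreference chain id, or None if non-anaphoric)
    pairs = []
    for i in range(len(nps) - 1, 0, -1):           # anaphors, end to beginning
        ana = nps[i]
        if ana["indefinite"]:                      # filter: indefinite anaphor
            continue
        antecedents = [j for j in range(i)
                       if nps[j]["chain"] is not None
                       and nps[j]["chain"] == ana["chain"]]
        direct = max(antecedents) if antecedents else None
        for j in range(i - 1, -1, -1):             # candidates, nearest first
            ante = nps[j]
            if embedded(ante, ana):                # filter: embedding
                continue
            if ana["agree"] not in ("sg", "pl") or ante["agree"] not in ("sg", "pl"):
                continue                           # filter: unusable agreement value
            if ana["pronoun"] and ana["agree"] != ante["agree"]:
                continue                           # filter: mismatch (pronouns only)
            if ana["chain"] is not None and ana["chain"] == ante["chain"]:
                label = "P" if j == direct else "TP"
            else:                                  # N before, DN beyond the direct antecedent
                label = "N" if direct is None or j > direct else "DN"
            pairs.append((j, i, label))
    return pairs
</Paragraph>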
<Paragraph position="2"> This produced 250 data sets with a total of 92750 instances of potential antecedent-anaphor pairs (2074 P, 70021 N, 6014 TP and 14641 DN).</Paragraph> <Paragraph position="3"> From this set the last 50 texts were used as a test set. From this test set, all instances with class DN and TP were removed, resulting in a test set of 11033 instances. Removing DNs and TPs was motivated by the fact that initial experimentation with C4.5 had indicated that a four-way classification gives no advantage over a two-way classification. In addition, this kind of test set approximates the decisions made by a simple resolution algorithm that looks for an antecedent from the current position upwards until it finds one or reaches the beginning of the text.</Paragraph> <Paragraph position="4"> Hence, our results are only indirectly comparable with the ones obtained by an evaluation according to Vilain et al. (1995). However, in this paper we only compare results of this direct binary antecedent-anaphor pair decision.</Paragraph> <Paragraph position="5"> The remaining 200 texts were split into two sets of 50 and 150 texts, respectively. From the first, our labeled training set was produced by removing all instances with class DN and TP. The second set was used as our unlabeled training set. From this set, no instances were removed because no knowledge whatsoever about the data can be assumed in a realistic setting.</Paragraph> </Section> </Section> </Paper>