<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1014">
  <Title>Improving Machine Learning Approaches to Coreference Resolution</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Baseline Coreference System
</SectionTitle>
    <Paragraph position="0"> Our baseline coreference system attempts to duplicate both the approach and the knowledge sources employed in Soon et al. (2001). More specifically, it employs the standard combination of classification and clustering described above.</Paragraph>
    <Paragraph position="1"> Building an NP coreference classifier. We use the C4.5 decision tree induction system (Quinlan, 1993) to train a classifier that, given a description of two NPs in a document, NPa2 and NPa3 , decides whether or not they are coreferent. Each training instance represents the two NPs under consideration and consists of the 12 Soon et al. features, which are described in Table 1. Linguistically, the features can be divided into four groups: lexical, grammatical, semantic, and positional.2 The classification associated with a training instance is one of COREFERENT or NOT COREFERENT depending on whether the NPs co-refer in the associated training text. We follow the procedure employed in Soon et al. to cre- null features values computed entirely automatically.</Paragraph>
    <Paragraph position="2"> ate the training data: we rely on coreference chains from the MUC answer keys to create (1) a positive instance for each anaphoric noun phrase, NPa3 , and its closest preceding antecedent, NPa2 ; and (2) a negative instance for NPa3 paired with each of the intervening NPs, NPa2a5a4a7a6 , NPa2a5a4a9a8 ,a10a11a10a12a10 , NPa3a14a13a7a6 . This method of negative instance selection is further described in Soon et al. (2001); it is designed to operate in conjunction with their method for creating coreference chains, which is explained next.</Paragraph>
    <Paragraph position="3"> Applying the classifier to create coreference chains. After training, the decision tree is used by a clustering algorithm to impose a partitioning on all NPs in the test texts, creating one cluster for each set of coreferent NPs. As in Soon et al., texts are processed from left to right. Each NP encountered, NPa3 , is compared in turn to each preceding NP, NPa2 , from right to left. For each pair, a test instance is created as during training and is presented to the coreference classifier, which returns a number between 0 and 1 that indicates the likelihood that the two NPs are coreferent.3 NP pairs with class values above 0.5 are considered COREFERENT; otherwise the pair is considered NOT COREFERENT. The process terminates as soon as an antecedent is found for NPa3 or the beginning of the text is reached.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Baseline Experiments
</SectionTitle>
      <Paragraph position="0"> We evaluate the Duplicated Soon Baseline system using the standard MUC-6 (1995) and MUC-7 (1998) coreference corpora, training the coreference classifier on the 30 &amp;quot;dry run&amp;quot; texts, and applying the coreference resolution algorithm on the 20-30 &amp;quot;formal evaluation&amp;quot; texts. The MUC-6 corpus produces a training set of 26455 instances (5.4% positive) from 4381 NPs and a test set of 28443 instances (5.2% positive) from 4565 NPs. For the MUC-7 corpus, we obtain a training set of 35895 instances (4.4% positive) from 5270 NPs and a test set of 22699 instances (3.9% positive) from 3558 NPs.</Paragraph>
      <Paragraph position="1"> Results are shown in Table 2 (Duplicated Soon Baseline) where performance is reported in terms of recall, precision, and F-measure using the model-theoretic MUC scoring program (Vilain et al., 1995).  a16a22a21 , where p is the number of positive instances and t is the total number of instances contained in the corresponding leaf node.</Paragraph>
      <Paragraph position="2">  NPa24 ; else I.</Paragraph>
      <Paragraph position="3"> Grammatical PRONOUN 1* Y if NPa23 is a pronoun; else N. PRONOUN 2* Y if NPa24 is a pronoun; else N.</Paragraph>
      <Paragraph position="4"> DEFINITE 2 Y if NPa24 starts with the word &amp;quot;the;&amp;quot; else N. DEMONSTRATIVE 2 Y if NPa24 starts with a demonstrative such as &amp;quot;this,&amp;quot; &amp;quot;that,&amp;quot; &amp;quot;these,&amp;quot; or &amp;quot;those;&amp;quot; else N.</Paragraph>
      <Paragraph position="5"> NUMBER* C if the NP pair agree in number; I if they disagree; NA if number information for one or both NPs cannot be determined.</Paragraph>
      <Paragraph position="6"> GENDER* C if the NP pair agree in gender; I if they disagree; NA if gender information for one or both NPs cannot be determined.</Paragraph>
      <Paragraph position="7"> BOTH PROPER NOUNS* C if both NPs are proper names; NA if exactly one NP is a proper name; else I.</Paragraph>
      <Paragraph position="8"> APPOSITIVE* C if the NPs are in an appositive relationship; else I. Semantic WNCLASS* C if the NPs have the same WordNet semantic class; I if they don't; NA if the semantic class information for one or both NPs cannot be determined. ALIAS* C if one NP is an alias of the other; else I.</Paragraph>
      <Paragraph position="9"> Positional SENTNUM* Distance between the NPs in terms of the number of sentences.  features. Non-relational features test some property P of one of the NPs under consideration and take on a value of YES or NO depending on whether P holds. Relational features test whether some property P holds for the NP pair under consideration and indicate whether the NPs are COMPATIBLE or INCOMPATIBLE w.r.t. P; a value of NOT APPLICABLE is used when property P does not apply. *'d features are in the hand-selected feature set (see Section 4) for at least one classifier/data set combination. The system achieves an F-measure of 66.3 and 61.2 on the MUC-6 and MUC-7 data sets, respectively. Similar, but slightly worse performance was obtained using RIPPER (Cohen, 1995), an information-gain-based rule learning system. Both sets of results are at least as strong as the original Soon results (row one of Table 2), indicating indirectly that our Baseline system is a reasonable duplication of that system.4 In addition, the trees produced by Soon and by our Duplicated Soon Baseline are essentially the same, differing only in two places where the Baseline system imposes additional conditions on coreference.</Paragraph>
      <Paragraph position="10"> The primary reason for improvements over the original Soon system for the MUC-6 data set appears to be our higher upper bound on recall (93.8% vs. 89.9%), due to better identification of NPs. For MUC-7, our improvement stems from increases in precision, presumably due to more accurate feature value computation.</Paragraph>
      <Paragraph position="11"> 4In all of the experiments described in this paper, default settings for all C4.5 parameters are used. Similarly, all RIPPER parameters are set to their default value except that classification rules are induced for both the positive and negative instances. 3 Modifications to the Machine Learning Framework This section studies the effect of three changes to the general machine learning framework employed by Soon et al. with the goal of improving precision in the resulting coreference resolution systems.</Paragraph>
      <Paragraph position="12"> Best-first clustering. Rather than a right-to-left search from each anaphoric NP for the first coreferent NP, we hypothesized that a right-to-left search for a highly likely antecedent might offer more precise, if not generally better coreference chains. As a result, we modify the coreference clustering algorithm to select as the antecedent of NPa3 the NP with the highest coreference likelihood value from among preceding NPs with coreference class values above 0.5.</Paragraph>
      <Paragraph position="13"> Training set creation. For the proposed best-first clustering to be successful, however, a different method for training instance selection would be needed: rather than generate a positive training example for each anaphoric NP and its closest antecedent, we instead generate a positive training examples for its most confident antecedent. More specifically, for a non-pronominal NP, we assume that the most confident antecedent is the closest non- null are provided. Results in boldface indicate the best results obtained for a particular data set and classifier combination. pronominal preceding antecedent. For pronouns, we assume that the most confident antecedent is simply its closest preceding antecedent. Negative examples are generated as in the Baseline system.5 String match feature. Soon's string match feature (SOON STR) tests whether the two NPs under consideration are the same string after removing determiners from each. We hypothesized, however, that splitting this feature into several primitive features, depending on the type of NP, might give the learning algorithm additional flexibility in creating coreference rules. Exact string match is likely to be a better coreference predictor for proper names than it is for pronouns, for example. Specifically, we replace the SOON STR feature with three features -- PRO STR, PN STR, and WORDS STR -- which restrict the application of string matching to pronouns, proper names, and non-pronominal NPs, respectively. (See the first entries in Table 3.) Although similar feature splits might have been considered for other features (e.g. GENDER and NUM-BER), only the string match feature was tested here.</Paragraph>
      <Paragraph position="14"> Results and discussion. Results on the learning framework modifications are shown in Table 2 (third block of results). When used in combination, the modifications consistently provide statistically significant gains in precision over the Baseline system 5This new method of training set creation slightly alters the class value distribution in the training data: for the MUC-6 corpus, there are now 27654 training instances of which 5.2% are positive; for the MUC-7 corpus, there are now 37870 training instances of which 4.2% are positive.</Paragraph>
      <Paragraph position="15"> without any loss in recall.6 As a result, we observe reasonable increases in F-measure for both classifiers and both data sets. When using RIPPER, for example, performance increases from 64.3 to 67.2 for the MUC-6 data set and from 60.8 to 63.2 for MUC-7. Similar, but weaker, effects occur when applying each of the learning framework modifications to the Baseline system in isolation. (See the indented Learning Framework results in Table 2.) Our results provide direct evidence for the claim (Mitkov, 1997) that the extra-linguistic strategies employed to combine the available linguistic knowledge sources play an important role in computational approaches to coreference resolution. In particular, our results suggest that additional performance gains might be obtained by further investigating the interaction between training instance selection, feature selection, and the coreference clustering algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 NP Coreference Using Many Features
</SectionTitle>
    <Paragraph position="0"> This section describes the second major extension to the Soon approach investigated here: we explore the effect of including 41 additional, potentially useful knowledge sources for the coreference resolution classifier (Table 3). The features were not derived empirically from the corpus, but were based on common-sense knowledge and linguistic intuitions 6Chi-square statistical significance tests are applied to changes in recall and precision throughout the paper. Unless otherwise noted, reported differences are at the 0.05 level or higher. The chi-square test is not applicable to F-measure.</Paragraph>
    <Paragraph position="1"> regarding coreference. Specifically, we increase the number of lexical features to nine to allow more complex NP string matching operations. In addition, we include four new semantic features to allow finer-grained semantic compatibility tests. We test for ancestor-descendent relationships in Word-Net (SUBCLASS), for example, and also measure the WordNet graph-traversal distance (WNDIST) between NPa3 and NPa2 . Furthermore, we add a new positional feature that measures the distance in terms of the number of paragraphs (PARANUM) between the two NPs.</Paragraph>
    <Paragraph position="2"> The most substantial changes to the feature set, however, occur for grammatical features: we add 26 new features to allow the acquisition of more sophisticated syntactic coreference resolution rules. Four features simply determine NP type, e.g. are both NPs definite, or pronouns, or part of a quoted string? These features allow other tests to be conditioned on the types of NPs being compared. Similarly, three new features determine the grammatical role of one or both of the NPs. Currently, only tests for clausal subjects are made. Next, eight features encode traditional linguistic (hard) constraints on coreference. For example, coreferent NPs must agree both in gender and number (AGREEMENT); cannot SPAN one another (e.g. &amp;quot;government&amp;quot; and &amp;quot;government officials&amp;quot;); and cannot violate the BINDING constraints. Still other grammatical features encode general linguistic preferences either for or against coreference. For example, an indefinite NP (that is not in apposition to an anaphoric NP) is not likely to be coreferent with any NP that precedes it (ARTICLE). The last subset of grammatical features encodes slightly more complex, but generally non-linguistic heuristics. For instance, the CONTAINS PN feature effectively disallows coreference between NPs that contain distinct proper names but are not themselves proper names (e.g. &amp;quot;IBM executives&amp;quot; and &amp;quot;Microsoft executives&amp;quot;).</Paragraph>
    <Paragraph position="3"> Two final features make use of an in-house naive pronoun resolution algorithm (PRO RESOLVE) and a rule-based coreference resolution system (RULE RESOLVE), each of which relies on the original and expanded feature sets described above.</Paragraph>
    <Paragraph position="4"> Results and discussion. Results using the expanded feature set are shown in the All Features block of Table 2. These and all subsequent results also incorporate the learning framework changes from Section 3. In comparison, we see statistically significant increases in recall, but much larger decreases in precision. As a result, F-measure drops precipitously for both learning algorithms and both data sets. A closer examination of the results indicates very poor precision on common nouns in comparison to that of pronouns and proper nouns. (See the indented All Features results in Table 2.7) In particular, the classifiers acquire a number of low-precision rules for common noun resolution, presumably because the current feature set is insufficient. For instance, a rule induced by RIPPER classifies two NPs as coreferent if the first NP is a proper name, the second NP is a definite NP in the subject position, and the two NPs have the same semantic class and are at most one sentence apart from each other. This rule covers 38 examples, but has 18 exceptions. In comparison, the Baseline system obtains much better precision on common nouns (i.e. 53.3 for MUC-6/RIPPER and 61.0 for MUC7/RIPPER with lower recall in both cases) where the primary mechanism employed by the classifiers for common noun resolution is its high-precision string matching facility. Our results also suggest that data fragmentation is likely to have contributed to the drop in performance (i.e. we increased the number of features without increasing the size of the training set). For example, the decision tree induced from the MUC-6 data set using the Soon feature set (Learning Framework results) has 16 leaves, each of which contains 1728 instances on average; the tree induced from the same data set using all of the 53 features, on the other hand, has 86 leaves with an average of 322 instances per leaf.</Paragraph>
    <Paragraph position="5"> Hand-selected feature sets. As a result, we next evaluate a version of the system that employs manual feature selection: for each classifier/data set combination, we discard features used primarily to induce low-precision rules for common noun resolution and re-train the coreference classifier using the reduced feature set. Here, feature selection does not depend on a separate development corpus and 7For each of the NP-type-specific runs, we measure overall coreference performance, but restrict NPa24 to be of the specified type. As a result, recall and F-measure for these runs are not particularly informative.</Paragraph>
    <Paragraph position="6">  AGREEMENT* C if the NPs agree in both gender and number; I if they disagree in both gender and number; else NA.</Paragraph>
    <Paragraph position="7"> stic ANIMACY* C if the NPs match in animacy; else I.</Paragraph>
    <Paragraph position="8"> MAXIMALNP* I if both NPs have the same maximal NP projection; else C. con- PREDNOM* C if the NPs form a predicate nominal construction; else I. stra- SPAN* I if one NP spans the other; else C.</Paragraph>
    <Paragraph position="9"> ints BINDING* I if the NPs violate conditions B or C of the Binding Theory; else C. CONTRAINDICES* I if the NPs cannot be co-indexed based on simple heuristics; else C. For instance, two non-pronominal NPs separated by a preposition cannot be co-indexed. SYNTAX* I if the NPs have incompatible values for the BINDING, CONTRAINDICES, SPAN or  MAXIMALNP constraints; else C.</Paragraph>
    <Paragraph position="10"> ling. INDEFINITE* I if NPa24 is an indefinite and not appositive; else C. prefs PRONOUN I if NPa23 is a pronoun and NPa24 is not; else C.  heuristics null CONSTRAINTS* C if the NPs agree in GENDER and NUMBER and do not have incompatible values for CONTRAINDICES, SPAN, ANIMACY, PRONOUN, and CONTAINS PN; I if the NPs have incompatible values for any of the above features; else NA. CONTAINS PN I if both NPs are not proper names but contain proper names that mismatch on every word; else C.</Paragraph>
    <Paragraph position="11"> DEFINITE 1 Y if NPa23 starts with &amp;quot;the;&amp;quot; else N. EMBEDDED 1* Y if NPa23 is an embedded noun; else N.</Paragraph>
    <Paragraph position="12"> EMBEDDED 2 Y if NPa24 is an embedded noun; else N.</Paragraph>
    <Paragraph position="13">  one classifier/data set combination.</Paragraph>
    <Paragraph position="14"> is guided solely by inspection of the features associated with low-precision rules induced from the training data. In current work, we are automating this feature selection process, which currently employs a fair amount of user discretion, e.g. to determine a precision cut-off. Features in the hand-selected set for at least one of the tested system variations are *'d in Tables 1 and 3.</Paragraph>
    <Paragraph position="15"> In general, we hypothesized that the hand-selected features would reclaim precision, hopefully without losing recall. For the most part, the experimental results support this hypothesis. (See the Hand-selected Features block in Table 2.) In comparison to the All Features version, we see statistically significant gains in precision and statistically significant, but much smaller, drops in recall, producing systems with better F-measure scores. In addition, precision on common nouns rises substantially, as expected. Unfortunately, the hand-selected features precipitate a large drop in precision for pronoun resolution for the MUC-7/C4.5 data set. Additional analysis is required to determine the reason for this.</Paragraph>
    <Paragraph position="16"> Moreover, the Hand-selected Features produce the highest scores posted to date for both the MUC-6 and MUC-7 data sets: F-measure increases w.r.t.</Paragraph>
    <Paragraph position="17"> the Baseline system from 64.3 to 70.4 for MUC6/RIPPER, and from 61.2 to 63.4 for MUC-7/C4.5.</Paragraph>
    <Paragraph position="18"> In one variation (MUC-7/RIPPER), however, the Hand-selected Features slightly underperforms the Learning Framework modifications (F-measure of 63.1 vs. 63.2) although changes in recall and precision are not statistically significant. Overall, our results indicate that pronoun and especially common noun resolution remain important challenges for coreference resolution systems. Somewhat disappointingly, only four of the new grammatical features corresponding to linguistic constraints and preferences are selected by the symbolic learning algorithms investigated: AGREEMENT, ANIMACY, BINDING, and MAXIMALNP.</Paragraph>
    <Paragraph position="19"> Discussion. In an attempt to gain additional insight into the difference in performance between our system and the original Soon system, we compare the decision tree induced by each for the MUC-6</Paragraph>
    <Paragraph position="21"> feature set on the MUC-6 data set.</Paragraph>
    <Paragraph position="22"> data set.8 For our system, we use the tree induced on the hand-selected features (Figure 1). The two trees are fairly different. In particular, our tree makes use of many of the features that are not present in the original Soon feature set. The root feature for Soon, for example, is the general string match feature (SOON STR); splitting the SOON STR feature into three primitive features promotes the ALIAS feature to the root of our tree, on the other hand. In addition, given two non-pronominal, matching NPs (SOON STR NONPRO=C), our tree requires an additional test on ANIMACY before considering the two NPs coreferent; the Soon tree instead determines two NPs to be coreferent as long as they are the same string. Pronoun resolution is also performed quite differently by the two trees, although both consider two pronouns coreferent when their strings match.</Paragraph>
    <Paragraph position="23"> Finally, intersentential and intrasentential pronominal references are possible in our system while inter-sentential pronominal references are largely prohibited by the Soon system.</Paragraph>
  </Section>
class="xml-element"></Paper>