<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1020"> <Title>Machine Learning for Coreference Resolution: From Local Classification to Global Ranking</Title> <Section position="4" start_page="157" end_page="160" type="metho"> <SectionTitle> 3 A Ranking Approach to Coreference </SectionTitle> <Paragraph position="0"> Our ranking approach operates by first dividing the available training texts into two disjoint subsets: a training subset and a held-out subset. More specifically, we first train each of our k pre-selected coreference systems on the documents in the training subset, and then use these resolvers to generate k candidate partitions for each text in the held-out subset, from which a ranking model will be learned. Given a test text, we use our k coreference systems to create k candidate partitions as in training, and select the highest-ranked partition according to the ranking model to be the final partition.3 The rest of this section describes how we select these k learning-based coreference systems and acquire the ranking model.</Paragraph> <Section position="1" start_page="157" end_page="159" type="sub_section"> <SectionTitle> 3.1 Selecting Coreference Systems </SectionTitle> <Paragraph position="0"> A learning-based coreference system can be defined by four elements: the learning algorithm used to train the coreference classifier, the method of creating training instances for the learner, the feature set used to represent a training or test instance, and the clustering algorithm used to coordinate the coreference classification decisions. Selecting a coreference system, then, is a matter of instantiating these elements with specific values.</Paragraph> <Paragraph position="1"> Now we need to define the set of allowable values for each of these elements. In particular, we want to define them in such a way that the resulting coreference systems can potentially generate good candidate partitions. 
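The train-then-rank pipeline just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: the system trainers, partition representation, and ranker learner are all placeholder callables.

```python
def train_and_rank(train_docs, heldout_docs, test_doc, system_trainers, learn_ranker):
    """Sketch of the two-stage pipeline: train k resolvers, learn a ranker,
    then pick the top-ranked candidate partition for a test text."""
    # 1. Train each of the k pre-selected coreference systems on the training subset.
    trained = [train(train_docs) for train in system_trainers]
    # 2. Generate k candidate partitions per held-out text and learn a ranking model.
    ranker_data = [(doc, [model(doc) for model in trained]) for doc in heldout_docs]
    score = learn_ranker(ranker_data)
    # 3. At test time, create k candidate partitions and keep the highest-ranked one.
    candidates = [model(test_doc) for model in trained]
    return max(candidates, key=score)
```

Here `learn_ranker` stands in for the SVM-based ranker trained in Section 3.2; any function mapping held-out candidates to a scoring function fits the sketch.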
Given that machine learning approaches to the problem have been promising, our choices will be guided by previous learning-based coreference systems, as described below.</Paragraph> <Paragraph position="2"> Training instance creation methods. A training instance represents two NPs, NP_i and NP_j, having a class value of COREFERENT or NOT COREFERENT depending on whether the NPs co-refer in the associated text. We consider three previously proposed methods of creating training instances.</Paragraph> <Paragraph position="3"> In McCarthy and Lehnert's method, a positive instance is created for each anaphoric NP paired with each of its antecedents, and a negative instance is created by pairing each NP with each of its preceding non-coreferent noun phrases. Hence, the number of instances created by this method is quadratic in the number of NPs in the associated text. The large number of instances can potentially make the training process inefficient.</Paragraph> <Paragraph position="4"> In an attempt to reduce the training time, Soon et al.'s method creates a smaller number of training instances than McCarthy and Lehnert's. Specifically, a positive instance is created for each anaphoric NP, NP_j, and its closest antecedent, NP_i; and a negative instance is created for NP_j paired with each of the intervening NPs, NP_{i+1}, NP_{i+2}, ..., NP_{j-1}. Unlike Soon et al., Ng and Cardie's method generates a positive instance for each anaphoric NP and its most confident antecedent. For a non-pronominal NP, the most confident antecedent is assumed to be its closest non-pronominal antecedent. For pronouns, the most confident antecedent is simply its closest preceding antecedent. Negative instances are generated as in Soon et al.'s method.</Paragraph> <Paragraph position="5"> Feature sets. 
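Soon et al.'s instance creation scheme above can be sketched in a few lines. This is an illustrative reconstruction under assumed representations: `nps` holds a document's NPs in textual order, and `chain_of` maps an NP to its gold coreference chain id (None for non-coreferent NPs).

```python
def soon_instances(nps, chain_of):
    """Soon et al.-style instances: one positive per anaphoric NP and its
    closest antecedent, negatives for each intervening NP."""
    instances = []  # (np_i, np_j, label) triples
    for j, np_j in enumerate(nps):
        # Scan leftward for the closest antecedent of the (anaphoric) NP np_j.
        i = next((i for i in range(j - 1, -1, -1)
                  if chain_of(nps[i]) is not None
                  and chain_of(nps[i]) == chain_of(np_j)), None)
        if i is None:
            continue  # np_j is not anaphoric: no instances are generated
        instances.append((nps[i], np_j, "COREFERENT"))
        # Negative instances: np_j paired with each intervening NP.
        for k in range(i + 1, j):
            instances.append((nps[k], np_j, "NOT COREFERENT"))
    return instances
```

McCarthy and Lehnert's scheme would instead pair each anaphoric NP with every antecedent and every preceding non-coreferent NP, which is what makes it quadratic in the number of NPs.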
We employ two feature sets for representing an instance, as described below.</Paragraph> <Paragraph position="6"> Soon et al.'s feature set consists of 12 surface-level features, each of which is computed based on one or both NPs involved in the instance. The features can be divided into four groups: lexical, grammatical, semantic, and positional. Space limitations preclude a description of these features. Details can be found in Soon et al. (2001).</Paragraph> <Paragraph position="7"> Ng and Cardie expand Soon et al.'s feature set from 12 features to a deeper set of 53 to allow more complex NP string matching operations as well as finer-grained syntactic and semantic compatibility tests. See Ng and Cardie (2002b) for details.</Paragraph> <Paragraph position="8"> Learning algorithms. We consider three learning algorithms, namely, the C4.5 decision tree induction system (Quinlan, 1993), the RIPPER rule learning algorithm (Cohen, 1995), and maximum entropy classification (Berger et al., 1996). The classification model induced by each of these learners returns a number between 0 and 1 that indicates the likelihood that the two NPs under consideration are coreferent. In this work, NP pairs with class values above 0.5 are considered COREFERENT; otherwise, the pair is considered NOT COREFERENT.</Paragraph> <Paragraph position="9"> Clustering algorithms. We employ three clustering algorithms, as described below.</Paragraph> <Paragraph position="10"> The closest-first clustering algorithm selects as the antecedent of NP_j its closest preceding coreferent NP. If no such NP exists, then NP_j is assumed to be non-anaphoric (i.e., no antecedent is selected).</Paragraph> <Paragraph position="11"> On the other hand, the best-first clustering algorithm selects as the antecedent of NP_j the closest NP with the highest coreference likelihood value from its set of preceding coreferent NPs. If this set is empty, then no antecedent is selected for NP_j. 
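The two linking rules above differ only in how they pick among preceding coreferent NPs; a minimal sketch, assuming `likelihood(np_i, np_j)` is the pairwise classifier's output in [0, 1] with 0.5 as the coreference threshold, as stated in the text:

```python
def closest_first(nps, likelihood):
    """Link each NP to its closest preceding NP classified as coreferent."""
    links = {}
    for j, np_j in enumerate(nps):
        for i in range(j - 1, -1, -1):  # scan right-to-left: closest NP first
            if likelihood(nps[i], np_j) > 0.5:
                links[np_j] = nps[i]
                break  # the closest coreferent NP wins
    return links

def best_first(nps, likelihood):
    """Link each NP to the preceding coreferent NP with the highest likelihood."""
    links = {}
    for j, np_j in enumerate(nps):
        cands = [(likelihood(nps[i], np_j), i) for i in range(j)]
        cands = [(p, i) for p, i in cands if p > 0.5]
        if cands:
            p, i = max(cands)  # highest likelihood; on ties, the closest (largest i)
            links[np_j] = nps[i]
    return links
```

With the same classifier, the two rules can thus produce different partitions, which is exactly why they yield distinct candidate systems.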
Since the most likely antecedent is chosen for each NP, best-first clustering may produce partitions with higher precision than closest-first clustering.</Paragraph> <Paragraph position="12"> Finally, in aggressive-merge clustering, each NP is merged with all of its preceding coreferent NPs.</Paragraph> <Paragraph position="13"> Since more merging occurs in comparison to the previous two algorithms, aggressive-merge clustering may yield partitions with higher recall.</Paragraph> <Paragraph position="14"> Table 1 summarizes the previous work on coreference resolution that employs the learning algorithms, clustering algorithms, feature sets, and instance creation methods discussed above. With three learners, three training instance creation methods, two feature sets, and three clustering algorithms, we can produce 54 coreference systems in total.</Paragraph> <Paragraph position="15"> [Table 1 caption fragment: "... clustering algorithms, the feature sets, and the training instance creation methods discussed in Section 3.1."]</Paragraph> </Section> <Section position="2" start_page="159" end_page="160" type="sub_section"> <SectionTitle> 3.2 Learning to Rank Candidate Partitions </SectionTitle> <Paragraph position="0"> We train an SVM-based ranker for ranking candidate partitions by means of Joachims' (2002) SVMlight package, with all the parameters set to their default values. To create training data, we first generate 54 candidate partitions for each text in the held-out subset as described above and then convert each partition into a training instance consisting of a set of partition-based features and method-based features.</Paragraph> <Paragraph position="1"> Partition-based features are used to characterize a candidate partition and can be derived directly from the partition itself. 
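The 54 system configurations from Section 3.1 arise as the cross product 3 x 3 x 2 x 3 and can be enumerated directly; the names below are taken from the text, while the tuple representation is just an illustrative assumption.

```python
from itertools import product

learners = ["C4.5", "RIPPER", "MaxEnt"]
instance_methods = ["McCarthy & Lehnert", "Soon et al.", "Ng & Cardie"]
feature_sets = ["Soon et al. (12 features)", "N&C (53 features)"]
clusterers = ["closest-first", "best-first", "aggressive-merge"]

# Each tuple fully specifies one learning-based coreference system.
systems = list(product(learners, instance_methods, feature_sets, clusterers))
assert len(systems) == 54
```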
Following previous work on using global features of candidate structures to learn a ranking model (Collins, 2002), the global (i.e., partition-based) features we consider here are simple functions of the local features that capture the relationship between NP pairs.</Paragraph> <Paragraph position="2"> Specifically, we define our partition-based features in terms of the features in the Ng and Cardie (N&C) feature set (see Section 3.1) as follows. First, let us assume that F_i is the i-th nominal feature in N&C's feature set and V_ij is the j-th possible value of F_i. Next, for each i and j, we create two partition-based features, P_ij^C and P_ij^N. P_ij^C is computed over the set of coreferent NP pairs (with respect to the candidate partition), denoting the probability of encountering F_i = V_ij in this set when the pairs are represented as attribute-value vectors using N&C's features. On the other hand, P_ij^N is computed over the set of non-coreferent NP pairs (with respect to the candidate partition), denoting the probability of encountering F_i = V_ij in this set when the pairs are represented as attribute-value vectors using N&C's features. One partition-based feature, for instance, would denote the probability that two NPs residing in the same cluster have incompatible gender values.</Paragraph> <Paragraph position="3"> Intuitively, a good NP partition would have a low probability value for this feature. So, having these partition-based features can potentially help us distinguish good and bad candidate partitions.</Paragraph> <Paragraph position="4"> Method-based features, on the other hand, are used to encode the identity of the coreference system that generated the candidate partition under consideration. Specifically, we have one method-based feature representing each pre-selected coreference system. 
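The partition-based probabilities defined above can be sketched as relative frequencies over the two pair sets. This is a minimal illustration under an assumed pair representation: each NP pair is a dict mapping a nominal feature name to its value.

```python
from collections import Counter

def partition_features(coref_pairs, non_coref_pairs):
    """For each (feature, value), estimate its probability among the coreferent
    ("C") and non-coreferent ("N") NP pairs of one candidate partition."""
    feats = {}
    for tag, pairs in (("C", coref_pairs), ("N", non_coref_pairs)):
        counts = Counter()
        for vec in pairs:                # vec: {feature_name: value}
            for f, v in vec.items():
                counts[(f, v)] += 1
        n = max(len(pairs), 1)           # guard against an empty pair set
        for (f, v), c in counts.items():
            feats[(f, v, tag)] = c / n   # relative frequency of F_i = V_ij
    return feats
```

For example, the probability that two NPs in the same cluster have incompatible gender values is the "C"-tagged entry for the gender feature, which a good partition should keep low.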
The feature value is 1 if the corresponding coreference system generated the candidate partition and 0 otherwise. These features enable the learner to distinguish good and bad partitions based on the systems that generated them, and are particularly useful when some coreference systems perform consistently better than the others.</Paragraph> <Paragraph position="5"> Now, we need to compute the &quot;class value&quot; for each training instance, which is a positive integer denoting the rank of the corresponding partition among the 54 candidates generated for the training document under consideration. Recall from the introduction that we want to train our ranking model so that partitions scored higher by the target coreference scoring program are ranked higher. To this end, we compute the rank of each candidate partition as follows. First, we apply the target scoring program to score each candidate partition against the correct partition derived from the training text. We then assign rank i to the i-th lowest scored partition.4 Effectively, the learning algorithm learns what a good partition is from the scoring program.</Paragraph> </Section> </Section> </Paper>
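The class-value computation at the end of Section 3.2 can be sketched directly: score every candidate against the gold partition and assign rank i to the i-th lowest scored one. Here `score` is a stand-in for the target coreference scoring program, not the actual scorer.

```python
def assign_ranks(candidates, gold, score):
    """Return ranks[i], the class value of candidates[i]: rank 1 for the
    lowest-scored partition up to rank len(candidates) for the highest."""
    order = sorted(range(len(candidates)), key=lambda i: score(candidates[i], gold))
    ranks = [0] * len(candidates)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks
```

Because the ranks come from the scoring program itself, the ranker is optimized toward whatever evaluation metric that program implements.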