<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1640">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. Partially Supervised Coreference Resolution for Opinion Summarization through Structured Rule Learning</Title>
  <Section position="5" start_page="337" end_page="338" type="metho">
    <SectionTitle>
3 Source Coreference Resolution
</SectionTitle>
    <Paragraph position="0"> In this section we introduce the problem of source coreference resolution in the context of opinion summarization and argue for the need for novel methods for the task.</Paragraph>
    <Paragraph position="1"> The task of source coreference resolution is to decide which mentions of opinion sources refer to the same entity. Much like traditional coreference resolution, we employ a learning approach; however, our approach differs from traditional coreference resolution in its definition of the learning task. Motivated by the desire to utilize unlabeled examples (discussed later), we define training as an integrated task in which pairwise NP coreference decisions are learned together with the clustering function as opposed to treating each NP pair as a training example. Thus, our training phase takes as input a set of documents with manually annotated opinion sources together with coreference annotations for the sources; it outputs a classifier that can produce source coreference chains for previously unseen documents containing marked (manually or automatically) opinion sources. More specifically, the source coreference resolution training phase proceeds through the following steps:  1. Source-to-NP mapping: We preprocess  each document by running a tokenizer, sentence splitter, POS tagger, parser, and an NP finder. Subsequently, we augment the set of NPs found by the NP finder with the help of a system for named entity detection. We then map the sources to the NPs. Since there is no one-to-one correspondence, we use a set of heuristics to create the mapping. More details about why heuristics are needed and the process used to map sources to NPs can be found in Stoyanov and Cardie (2006).</Paragraph>
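The actual source-to-NP mapping heuristics are deferred to Stoyanov and Cardie (2006) and are not given here; the following is purely an illustrative stand-in (exact span match first, then maximal character overlap), not the paper's rules:

```python
def map_sources_to_nps(sources, nps):
    """Hypothetical source-to-NP mapping heuristic (NOT the rules from
    Stoyanov and Cardie (2006)): prefer an exact span match, otherwise
    take the NP whose span overlaps the source the most.
    Spans are (start, end) character offsets."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    mapping = {}
    for src in sources:
        if src in nps:                      # exact span match
            mapping[src] = src
            continue
        best = max(nps, key=lambda np: overlap(src, np))
        if overlap(src, best) > 0:          # partial overlap fallback
            mapping[src] = best
    return mapping
```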
    <Paragraph position="2">  2. Feature vector creation: We extract a feature vector for every pair of NPs from the pre-processed corpus. We use the features introduced by Ng and Cardie (2002) for the task of coreference resolution.</Paragraph>
    <Paragraph position="3"> 3. Classifier construction: Using the feature  vectors from step 2, we construct a training set containing one training example per document. Each training example consists of the feature vectors for all pairs of NPs in the document, including those that do not map to sources, together with the available coreference information for the source noun phrases (i.e. the noun phrases to which sources are mapped). The training instances are provided as input to a learning algorithm (see Section 5), which constructs a classifier that can take the instances associated with a new (previously unseen) document and produce a clustering over all NPs in the document.</Paragraph>
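The one-example-per-document construction of step 3 can be sketched as follows. This is an illustrative sketch, not the authors' code; the pairwise feature extractor stands in for the Ng and Cardie (2002) features, and labels are defined only for source-source pairs:

```python
from itertools import combinations

def make_training_example(doc_nps, source_chains, feature_fn):
    """Build one training example for a whole document.

    doc_nps: all NPs in the document (source and non-source).
    source_chains: dict mapping source NPs to a coreference chain id;
    non-source NPs are absent (no coreference information for them).
    feature_fn: pairwise feature extractor (stand-in for the Ng and
    Cardie (2002) feature set).
    Returns (np_i, np_j, features, label) tuples, where label is
    True/False for source-source pairs and None for unlabeled pairs.
    """
    example = []
    for a, b in combinations(doc_nps, 2):
        feats = feature_fn(a, b)
        if a in source_chains and b in source_chains:
            label = source_chains[a] == source_chains[b]
        else:
            label = None  # pair involves an unlabeled non-source NP
        example.append((a, b, feats, label))
    return example
```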
    <Paragraph position="4"> The testing phase employs steps 1 and 2 as described above, but replaces step 3 by a straightforward application of the learnt classifier. Since we are interested in coreference information only for the source NPs, we simply discard the non-source NPs from the resulting clustering.</Paragraph>
    <Paragraph position="5"> The approach to source coreference resolution described here would be identical to traditional coreference resolution when provided with training examples containing coreference information for all NPs. However, opinion corpora in general, and our corpus in particular, contain no coreference information about general NPs. Nevertheless, after manual sources are mapped to NPs in step 1 above, our approach can rely on the available coreference information for the source NPs. Due to the high cost of coreference annotation, we desire methods that can work in the presence of only this limited amount of coreference information. A possible workaround for the absence of full NP coreference information is to train a traditional coreference system only on the labeled part of the data (indeed that is one of the baselines against which we compare). However, we believe that an effective approach to source coreference resolution has to utilize the unlabeled noun phrases because links between sources might be realized through non-source mentions. This problem is illustrated in figure 1. The underlined Moussaoui is coreferent with all of the Moussaoui references marked as sources, but, because it is used in an objective sentence rather than as the source of an opinion, the reference would be omitted from the Moussaoui source chain. Unfortunately, this proper noun phrase might be critical in establishing the coreference of the final source reference he with the other mentions of the source Moussaoui.</Paragraph>
    <Paragraph position="6"> As mentioned previously, in order to utilize the unlabeled data, our approach differs from traditional coreference resolution, which uses NP pairs as training instances. We instead follow the framework of supervised clustering (Finley and Joachims, 2005; Li and Roth, 2005) and consider each document as a training example. As in supervised clustering, this framework has the additional advantage that the learning algorithm can consider the clustering algorithm when making decisions about pairwise classification, which could lead to improvements in the classifier. In the next section we describe our approach to classifier construction for step 3 and compare our problem to traditional weakly supervised clustering, characterizing it as an instance of the novel problem of partially supervised clustering.</Paragraph>
  </Section>
  <Section position="6" start_page="338" end_page="339" type="metho">
    <SectionTitle>
4 Partially Supervised Clustering
</SectionTitle>
    <Paragraph position="0"> In our desire to perform effective source coreference resolution we arrive at the following learning problem - the learning algorithm is presented with a set of partially specified examples of clusterings and acquires a function that can accurately cluster an unseen set of items, while taking advantage of the unlabeled information in the examples.</Paragraph>
    <Paragraph position="1"> This setting is to be contrasted with semi-supervised clustering (or clustering with constraints), which has received much research attention (e.g. Demiriz et al. (1999), Wagstaff and Cardie (2000), Basu (2005), Davidson and Ravi (2005)). Semi-supervised clustering can be defined as the problem of clustering a set of items in the presence of limited supervisory information such as pairwise constraints (e.g. two items must/cannot be in the same cluster) or labeled points. In contrast to our setting, in the semi-supervised case there is no training phase - the algorithm receives all examples (labeled and unlabeled) at the same time together with some distance or cost function and attempts to find a clustering that optimizes a given measure (usually based on the distance or cost function).</Paragraph>
    <Paragraph position="2"> Source coreference resolution might alternatively be approached as a supervised clustering problem. Traditionally, approaches to supervised clustering have treated the pairwise link decisions as a classification problem. These approaches first learn a distance metric that optimizes the pairwise decisions, and then follow the pairwise classification with a clustering step. However, these traditional approaches have no obvious way of utilizing the available unlabeled information.</Paragraph>
    <Paragraph position="3"> In contrast, we follow recent approaches to supervised clustering that propose ways to learn the distance measure in the context of the clustering decisions (Li and Roth, 2005; Finley and Joachims, 2005; McCallum and Wellner, 2003).</Paragraph>
    <Paragraph position="4"> This provides two advantages for the problem of source coreference resolution. First, it allows the algorithm to take advantage of the complexity of the rich structural dependencies introduced by the clustering problem. Viewed traditionally as a hurdle, the structural complexity of clustering may be beneficial in the partially supervised case. We believe that provided with a few partially specified clustering examples, an algorithm might be able to generalize from the structural dependencies to infer correctly the whole clustering of the items.</Paragraph>
    <Paragraph position="5"> In addition, considering pairwise decisions in the context of the clustering can arguably lead to more accurate classifiers.</Paragraph>
    <Paragraph position="6"> Unfortunately, none of the supervised clustering approaches is readily applicable to the partially supervised case. However, by adapting the formal supervised clustering definition, which we do next, we can develop approaches to partially supervised clustering that take advantage of the unlabeled portions of the data.</Paragraph>
    <Paragraph position="7"> Formal definition. For partially supervised clustering we extend the formal definition of supervised clustering given by Finley and Joachims (2005). In the fully supervised setting, an algorithm is given a set S of n training examples</Paragraph>
    <Paragraph position="9"> (x1, y1), ..., (xn, yn) from X × Y, where X is the set of all possible sets of items and Y is the set of all possible clusterings of these sets. For a training example (x, y), x = {x1, x2, ..., xk} is a set of k items and y = {y1, y2, ..., yr} is a clustering of the items in x with each yi ⊆ x. Additionally, each item can be in no more than one cluster (∀ i ≠ j, yi ∩ yj = ∅) and in the fully supervised case each item is in at least one cluster (x = ∪i yi).</Paragraph>
    <Paragraph position="10"> The goal of the learning algorithm is to acquire a function h : X → Y that can accurately cluster a (previously unseen) set of items.</Paragraph>
    <Paragraph position="11"> In the context of source coreference resolution the training set contains one example for each document. The items in each training example are the NPs and the clustering over the items is the equivalence relation defined by the coreference information. For source coreference resolution, however, clustering information is unavailable for the non-source NPs. Thus, to be able to deal with this unlabeled component of the data we arrive at the setting of partially supervised clustering, in which we relax the condition that each item is in at least one cluster (x = ∪i yi) and replace it with the condition x ⊇ ∪i yi. The items with no linking information (items in x \ ∪i yi) constitute the unlabeled (unsupervised) component of the partially supervised clustering.</Paragraph>
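The conditions of this definition can be checked directly. A minimal illustrative sketch (not from the paper): clusters must be pairwise disjoint and contained in the item set x, and the items outside the union of the clusters form the unlabeled component:

```python
def is_valid_partial_clustering(x, y):
    """Check the partially supervised clustering conditions:
    clusters are pairwise disjoint and each cluster is a subset of x.
    Equality of x and the union of clusters would give the fully
    supervised case."""
    x = set(x)
    covered = set()
    for cluster in y:
        cluster = set(cluster)
        if cluster.intersection(covered):  # an item in two clusters
            return False
        if not cluster.issubset(x):        # cluster has an unknown item
            return False
        covered.update(cluster)
    return True

def unlabeled_items(x, y):
    """Items with no linking information: x minus the union of the yi."""
    covered = set()
    for cluster in y:
        covered.update(cluster)
    return set(x) - covered
```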
  </Section>
  <Section position="7" start_page="339" end_page="340" type="metho">
    <SectionTitle>
5 Structured Rule Learner
</SectionTitle>
    <Paragraph position="0"> We develop a novel method for partially supervised clustering, which is motivated by the success of a rule learner (RIPPER) for coreference resolution (Ng and Cardie, 2002). We extend RIPPER so that it can learn rules in the context of single-link clustering, which both suits our task (i.e. pronouns link to their single antecedent) and has exhibited good performance for coreference resolution (Ng and Cardie, 2002). We begin with a brief overview of RIPPER followed by a description of the modifications that we implemented. For ease of presentation, we assume that we are in the fully supervised case. We end this section by describing the changes for the partially supervised case.</Paragraph>
    <Paragraph position="2"> // Figure 2 pseudocode (reconstructed from extraction fragments):
// Keep instances from the same document together
while (there are positive uncovered instances) {
  for (every unused feature f) {
    if (f is a nominal feature) {
      for (every possible value v of f) {
        mark all instances that have value v for f with +;
        compute the transitive closure of the positive instances
          // (including instances marked + by previous rules);
        compute the infoGain for the feature/value combination;
      }
    } else {
      create one bag for each feature value and split the instances into bags;
      do a forward and a backward pass over the bags, keeping a running
        clustering, and compute the information gain for each value;
    }
  }
}
procedure pruneRule(r, pruneData) {
  for (all antecedents a in the rule) {
    apply all antecedents in r up to a to pruneData;
    compute the transitive closure of the positive instances;
  }
}</Paragraph>
    <Paragraph position="8"> RIPPER was introduced by Cohen (1995) as an extension of an existing rule induction algorithm. Cohen (1995) showed that RIPPER produces error rates competitive with C4.5, while exhibiting better running times. RIPPER consists of two phases - a ruleset is grown and then optimized.</Paragraph>
    <Paragraph position="9"> The ruleset creation phase begins by randomly splitting the training data into a rule-growing set (2/3 of the training data) and a pruning set (the remaining 1/3). A rule is then grown on the former set by repeatedly adding the antecedent (the feature value test) with the largest information gain until the accuracy of the rule becomes 1.0 or there are no remaining potential antecedents. Next the rule is applied to the pruning data and any rule-final sequence that reduces the accuracy of the rule is removed.</Paragraph>
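The grow loop just described can be sketched as follows. This is a hedged illustration, not Cohen's implementation; it assumes the FOIL-style information gain commonly used with RIPPER, and rules are simple conjunctions of feature == value tests:

```python
import math

def grow_rule(instances, labels, features):
    """Illustrative sketch of RIPPER's rule-growing loop: repeatedly
    add the antecedent (feature == value test) with the largest
    FOIL-style information gain until the rule is pure on the grow set
    or no antecedent remains.  instances: list of dicts; labels: bools."""
    rule = []                      # list of (feature, value) antecedents
    covered = list(range(len(instances)))
    while True:
        pos = sum(labels[i] for i in covered)
        neg = len(covered) - pos
        if neg == 0:               # accuracy 1.0 on the grow set
            break
        best, best_gain = None, 0.0
        for f in features:
            if any(a[0] == f for a in rule):
                continue           # feature already used in this rule
            for v in {instances[i][f] for i in covered}:
                sub = [i for i in covered if instances[i][f] == v]
                p = sum(labels[i] for i in sub)
                n = len(sub) - p
                if p == 0:
                    continue
                # FOIL gain: positives kept times log-odds improvement
                gain = p * (math.log2(p / (p + n)) - math.log2(pos / (pos + neg)))
                if gain > best_gain:
                    best, best_gain = (f, v), gain
        if best is None:
            break                  # no remaining antecedent helps
        rule.append(best)
        covered = [i for i in covered if instances[i][best[0]] == best[1]]
    return rule
```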
    <Paragraph position="10"> The optimization phase uses the full training set to first grow a replacement rule and a revised rule for each rule in the ruleset. For each rule, the algorithm then considers the original rule, the replacement rule, and the revised rule, and keeps the rule with the smallest description length in the context of the ruleset. After all rules are considered, RIPPER attempts to grow residual rules that cover data not already covered by the ruleset. Finally, RIPPER deletes any rules from the ruleset whose removal reduces the overall minimum description length of the data plus the ruleset. RIPPER performs two rounds of this optimization phase.</Paragraph>
    <Section position="1" start_page="340" end_page="340" type="sub_section">
      <SectionTitle>
5.2 The StRip Algorithm
</SectionTitle>
      <Paragraph position="0"> The property of partially supervised clustering that we want to explore is the structured nature of the decisions. That is, each decision of whether two items (say a and b) belong to the same cluster has an implication for all items a′ that belong to a's cluster and all items b′ that belong to b's cluster.</Paragraph>
      <Paragraph position="1"> We target modifications to RIPPER that will allow StRip (for Structured RIPPER) to learn rules that produce good clusterings in the context of single-link clustering. We extend RIPPER so that every time it makes a decision about a rule, it considers the effect of the rule on the overall clustering of items (as opposed to considering the instances that the rule classifies as positive/negative in isolation). More precisely, we precede every computation of rule performance (e.g. information gain or description length) by a transitive closure (i.e. single-link clustering) of the data w.r.t. the pairwise classifications. Following the transitive closure, all pairs of items that are in the same cluster are considered covered by the rule for performance computation.</Paragraph>
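The transitive-closure step can be sketched with a Union-Find structure (an illustrative sketch, not the authors' code): after taking the closure of the pairwise positive links, every pair of items in the same single-link cluster counts as covered, not only the directly linked pairs:

```python
class UnionFind:
    """Union-Find with path halving, used to take the transitive
    closure (single-link clustering) of pairwise positive decisions."""
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def covered_pairs(items, positive_links):
    """Pairs considered covered after transitive closure: every pair of
    items ending up in the same cluster, not just the direct links."""
    uf = UnionFind(items)
    for a, b in positive_links:
        uf.union(a, b)
    return {(a, b) for i, a in enumerate(items)
            for b in items[i + 1:] if uf.find(a) == uf.find(b)}
```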
      <Paragraph position="2"> The StRip algorithm is given in figure 2, with modifications to the original RIPPER algorithm shown in bold. Due to space limitations the optimization stage of the algorithm is omitted. Our modifications to the optimization stage of RIPPER are in the spirit of the rest of the StRip algorithm.</Paragraph>
      <Paragraph position="3"> Partially supervised case. So far we have described StRip only for the fully supervised case. We use a very simple modification to handle the partially supervised setting: we exclude the unlabeled pairs when computing the performance of the rules. Thus, the unlabeled items do not count as correct or incorrect classifications when acquiring or pruning a rule, although they do participate in the transitive closure. Links in the unlabeled data are inferred entirely through the indirect links between items in the labeled component that they introduce. In the example of figure 1, the two problematic unlabeled links are the link between the source mention &amp;quot;he&amp;quot; and the underlined non-source NP &amp;quot;Mr. Moussaoui&amp;quot; and the link from the underlined &amp;quot;Mr. Moussaoui&amp;quot; to any source mention of Moussaoui. While StRip will not reward any rule (or rule set) that covers these two links directly, such rules will be rewarded indirectly since they put the source he in the chain for the source Moussaoui.</Paragraph>
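A small illustrative sketch (not the authors' code) of this scoring change: pairs carry labels True, False, or None, and pairs labeled None join the transitive closure but are skipped when counting correct and incorrect classifications:

```python
def rule_score(pairs, closure_covered):
    """Partially supervised scoring sketch for StRip.

    pairs: (item_a, item_b, label) with label True/False for labeled
    source-source pairs and None for unlabeled pairs.
    closure_covered: set of item pairs covered after transitive closure.
    Unlabeled pairs earn no reward and no penalty.
    """
    correct = wrong = 0
    for a, b, label in pairs:
        if label is None:
            continue  # unlabeled component: excluded from the count
        predicted = (a, b) in closure_covered or (b, a) in closure_covered
        if predicted == label:
            correct += 1
        else:
            wrong += 1
    return correct, wrong
```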
      <Paragraph position="4"> StRip running time. StRip's running time is generally comparable to that of RIPPER. We compute the transitive closure using a Union-Find structure, which runs in time O(n log* n) and for practical purposes can be considered linear (O(n)). However, when computing the best information gain for a nominal feature, StRip has to make a pass over the data for each value that the feature takes, while RIPPER can split the data into bags and perform the computation in one pass.</Paragraph>
    </Section>
  </Section>
</Paper>