<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1008">
  <Title>Combining Sample Selection and Error-Driven Pruning for Machine Learning of Coreference Rules</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Negative Sample Selection
</SectionTitle>
    <Paragraph position="0"> As noted above, skewed class distributions arise when generating all valid instances from the training texts. A number of methods for handling skewed distributions have been proposed in the machine learning literature, most of which modify the learn- null feature values computed entirely automatically.</Paragraph>
    <Paragraph position="1"> Algorithm NEG-SELECT(NEG: set of all possible negative instances) for a9a33a10a34a12a22a14a35a16a36a18a21a20a12a22a14a37a23a25a18a4a26a39a38 NEG do if NPa8 a3 is anaphoric then if NPa7 a3 precedes a40 (NPa8 a3 ) then</Paragraph>
    <Paragraph position="3"> ing algorithm to incorporate a loss function with a much larger penalty for minority class errors than for instances from the majority classes (e.g. Gordon and Perlis (1989), Pazzani et al. (1994)). We investigate here a different approach to handling skewed class distributions -- negative sample selection, i.e. the selection of a smaller subset of negative instances from the set of available negative instances. In the case of NP coreference, we hypothesize that reducing the number of negative instances will improve recall but potentially reduce precision: intuitively, the existence of fewer negative instances should allow RIPPER to more liberally induce positive rules.</Paragraph>
    <Paragraph position="4"> We propose a method for negative sample selection that, for each anaphoric NP, NPa8 a3 , retains only those negative instances for non-coreferent NPs that lie between NPa8 a3 and its farthest preceding antecedent, a40 (NPa8 a3 ). The algorithm for negative sample selection, NEG-SELECT, is shown in Figure 1. NEG-SELECT takes as input the set of all possible negative instances in the training texts, i.e. the set of valid instances a9a33a10a34a12a22a14a35a16a36a18a21a20a12a22a14a37a23a25a18a4a26 such that NPa7 a3 and NPa8 a3 are not in the same coreference chain.</Paragraph>
    <Paragraph position="5"> The intuition behind this approach is very simple.</Paragraph>
    <Paragraph position="6"> Let a45 (NPa8 a3 ) be the set of preceding antecedents of NPa8 a3 , and a46 (NPa7 a3 ,NPa8 a3 ) be the set consisting of NPs NPa7 a3 , NPa10a7a48a47a50a49 a26 a3 ,a51a37a51a52a51 , NPa8 a3 . Recall that the goal during clustering is to compute, for each NP NPa8 a3 , the set a45 (NPa8 a3 ) from which the element with the highest confidence is selected as the antecedent of NPa8 a3 .</Paragraph>
    <Paragraph position="7">  relational features test some property P of one of the NPs under consideration and take on a value of YES or NO depending on whether P holds. Relational features test whether some property P holds for the NP pair under consideration and indicate whether the NPs are COMPATIBLE or INCOMPATIBLE w.r.t. P; a value of NOT APPLICABLE is used when property P does not apply. (2) NPa8 a3 is compared to each preceding NP from right to left by the clustering algorithm, it follows that the set of negative instances whose classifications the classifier needs to determine in order to compute a45 (NPa8 a3 ) is a superset of the set of instances</Paragraph>
    <Paragraph position="9"> coreferent preceding NPs in a46 (a40 (NPa8 a3 ),NPa8 a3 ). Consequently, null</Paragraph>
    <Paragraph position="11"> tive) instances whose classifications will be required during clustering. In principle, to perform the classifications accurately, the classifier needs to be trained on the corresponding set of negative instances from the training set, which is</Paragraph>
    <Paragraph position="13"> is now the a56 th NP in training document a6 . NEG-SELECT is designed essentially to compute this set.</Paragraph>
    <Paragraph position="14"> Next, we examine the effects of this minimalist approach to negative sample selection.</Paragraph>
    <Paragraph position="15"> Evaluation. We evaluate the coreference system with negative sample selection on the MUC-6 and MUC-7 coreference data sets in each case, training the coreference classifier on the 30 &amp;quot;dry run&amp;quot; texts, and applying the coreference resolution algorithm on the 20-30 &amp;quot;formal evaluation&amp;quot; texts. Results are shown in rows 1 and 2 of Table 2 where performance is reported in terms of recall, precision, and F-measure using the model-theoretic MUC scoring program (Vilain et al., 1995). The Baseline system employs no sample selection, i.e. all available training examples are used. Row 2 shows the performance of the Baseline after incorporating NEG-SELECT. With negative sample selection, the percentage of positive instances rises from 2% to 8% for the MUC-6 data set and from 2% to 7% for the MUC-7 data set. For both data sets, we see statistically significant increases in recall and statistically significant, but much larger drops in precision.9 The resulting F-measure scores, however, increase nontrivially from 52.4 to 55.2 (for MUC-6), and from 41.3 to 46.0 (for MUC-7).10</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Positive Sample Selection
</SectionTitle>
    <Paragraph position="0"> Since not all of the coreference relationships derived from coreference chains are equally easy to identify, training a classifier using all possible coreference relationships can potentially lead to the induction of inaccurate rules. Given the observation that one antecedent is sufficient to resolve an anaphor, it may be desirable to learn only from easy positive instances. Similar observations are made by Harabagiu et al. (2001), who point out that intelligent selection of positive instances can potentially minimize the amount of knowledge required to perform coreference resolution accurately. They assume that the easiest types of coreference relationships to resolve are those that occur with high frequencies in the data. Consequently, they mine by hand three sets of coreference rules for covering positive instances from the training data by finding the coreference knowledge satisfied by the largest number of anaphor-antecedent pairs. While the Harabagiu et al. algorithm attempts to mine easy coreference rules from the data by hand, neither the rule creation process nor stopping conditions are precisely defined. In addition, a lot of human intervention is required to derive the rules.</Paragraph>
    <Paragraph position="1"> In this section, we describe an automatic positive sample selection algorithm that coarsely mimics the Harabagiu et al. algorithm by finding a confident antecedent for each anaphor. Overall, our goal is to avoid the inclusion of hard training instances by automating the process of deriving easy coreference rules from the data.</Paragraph>
    <Paragraph position="2"> The Algorithm. The positive sample selection algorithm, POS-SELECT, is shown in Figure 2. It assumes the existence of a rule learner, L, that produces an ordered set of positive rules. POS-SELECT 9Chi-square statistical significance tests are applied to changes in recall and precision throughout the paper. Unless otherwise noted, reported differences are at the 0.05 level or higher. The chi-square test is not applicable to F-measure. 10The F-measure score computed by the MUC scoring program is the harmonic mean of recall and precision.</Paragraph>
    <Paragraph position="3"> Algorithm POS-SELECT(L: positive rule learner, T: set of training instances)</Paragraph>
    <Paragraph position="5"> first uses L to induce a ruleset on the training instances and picks the first rule from the ruleset. For any training instance a9 a10a34a12a22a14 a16a36a18 a20a12a22a14 a23a28a18 a26 correctly covered by this rule, an antecedent NPa7 a3 has been identified for the anaphor NPa8 a3 . As a result, all (positive and negative) training instances formed with NPa8 a3 as the anaphor are no longer needed and are subsequently removed from the training data.11 The process is repeated until L cannot induce a rule to cover the remaining positive instances. The output of POS-SELECT is a set of positive rules selected during each iteration of the algorithm. Hence, positive sample selection in POS-SELECT is implicit in the sense that it is embedded within the rule induction process.</Paragraph>
    <Paragraph position="6"> Evaluation. Results are shown in rows 3 and 4 of Table 2. As in the previous experiments, the rule learner is RIPPER. We run the system twice, first 11We speculate that retaining the negative instances would hurt performance, but this remains to be verified.</Paragraph>
    <Paragraph position="7">  only, the system achieves an F-measure of 64.1 (for MUC-6) and 53.8 (for MUC-7). When POS-SELECT and NEG-SELECT are used in combination, however, the system achieves an F-measure of 69.3 (for MUC-6) and 57.2 (for MUC-7).</Paragraph>
    <Paragraph position="8"> Discussion. The experimental results are largely consistent with our hypothesis. System performance improves dramatically with positive sample selection using POS-SELECT both in the absence and presence of negative sample selection. Without negative sample selection, F-measure increases from 52.4 to 64.1 (for MUC-6), and from 41.3 to 53.8 (for MUC-7). Similarly, with negative sample selection, F-measure increases from 55.2 to 69.3 (for MUC-6), and from 46.0 to 57.2 (for MUC-7). In addition, our results indicate that applying both negative and positive sample selection leads to better performance than applying positive sample selection alone: F-measure increases from 64.1 to 69.3, and from 53.8 to 57.2 for the MUC-6 and MUC-7 data sets, respectively. Nevertheless, reducing the number of negative instances (via negative sample selection) improves recall but damages precision: we see statistically significant gains in recall and statistically significant drops in precision for both data sets. In particular, precision drops precipitously from 78.0 to 55.1 for the MUC-7 data set. We hypothesize that POS-SELECT does not guarantee that hard positive instances will be avoided and that the inclusion of these hard instances is responsible for the poorer precision of the system. Anaphors that do not have easy antecedents can never be removed automatically via the induction of new rules using POS-SELECT. In fact, RIPPER will possibly induce rules to handle these hard instances as long as such kind of anaphors occur sufficiently frequently in the data set relative to the number of negative instances.12 Although it might be beneficial to acquire these rules at the classification level (according to the learning algorithm), they can be detrimental to system performance at the clustering level, especially if the rules cover a large number of examples with a lot of exceptions. Consequently, it is necessary to know which rules are worthy of keeping at the clustering level and not the classification level. We will address this issue in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Pruning the Coreference Ruleset
</SectionTitle>
    <Paragraph position="0"> As noted in the introduction, machine learning approaches to coreference resolution that rely only on pairwise NP coreference classifiers will not necessarily enforce the transitivity constraint inherent in the coreference relation. Although approaches to coreference resolution that rely only on clustering could easily enforce transitivity (as in Cardie and Wagstaff (1999)), they have not performed as well as state-of-the-art approaches to coreference. In this section, we propose a method for resolving this conflict: we introduce an error-driven rule pruning algorithm that considers rules induced by the coreference classifier and discards those that cause the ruleset to perform poorly with respect to the global, clustering-level coreference scoring function.</Paragraph>
    <Paragraph position="1"> The Algorithm. The error-driven pruning algorithm is inspired by the backward elimination algorithm commonly used for feature selection (see Blum and Langley (1997)) and is shown in Figure 3. The algorithm, RULE-SELECT, takes as input a ruleset learned from a training corpus for performing coreference resolution, a pruning corpus (disjoint from the training corpus), and a clustering-level 12More precisely, RIPPER will induce a new rule if the rule is more than 50% accurate and the resulting description length is fewer than 64 bits larger than the smallest description length obtained so far.</Paragraph>
    <Paragraph position="2"> Algorithm RULE-SELECT(R: ruleset, P: pruning corpus, S: scoring function) BestScore a64a68a61 score of the coreference system using R on P w.r.t. S; r a64a68a61 NIL; repeat r := the rule in R whose removal yields a ruleset with which the coreference system achieves the best score b on P w.r.t. S.</Paragraph>
    <Paragraph position="3"> if b a69 BestScore then  coreference scoring function that is the same as the one being used for evaluating the final output of the system.13 At each iteration, RULE-SELECT greedily discards the rule whose removal yields a rule-set with which the coreference system performs the best (with respect to the coreference scoring function) on the pruning corpus. As a hill-climbing procedure, the algorithm terminates when removal of any of the rules in the ruleset fails to improve performance. In contrast to most existing algorithms for coreference resolution, RULE-SELECT establishes a tighter connection between the classification- and clustering-level decisions for coreference resolution and ensures that system performance is optimized with respect to the coreference scoring function. We hypothesize that this optimization of the coreference classifier will improve performance of the resulting coreference system, in particular by increasing its precision.</Paragraph>
    <Paragraph position="4"> Evaluation and Discussion. Results are shown in row 5 of Table 2. In the Pruning experiment, the MUC-7 formal evaluation corpus is the pruning corpus for the MUC-6 run; the MUC-6 formal evaluation corpus is the pruning corpus for the MUC-7 13Importantly, RULE-SELECT assumes no knowledge of the inner workings of the scoring function.</Paragraph>
    <Paragraph position="5"> run. In addition, the quantity that RULE-SELECT optimizes for a given ruleset is the F-measure returned by the MUC scoring function.14 In comparison to the Combined results, we see an improvement of 0.2% (for MUC-6) and 6.2% (for MUC-7) in F-measure. In particular, we see statistically significant gains in precision (from 55.1 to 73.6) and statistically significant, but much smaller, drops in recall (from 59.5 to 54.2) for the MUC-7 data set.</Paragraph>
    <Paragraph position="6"> In general, our results support the hypothesis that rule pruning can be used to improve system performance; moreover, the technique is especially effective at enhancing the precision of the system. However, performance gains may be negligible when pruning is used in systems with high precision, as can be seen from the results for the MUC-6 data set.</Paragraph>
    <Paragraph position="7"> To determine whether performance improvements are instead attributable to the availability of additional &amp;quot;training&amp;quot; data provided by the pruning corpus, we train a classifier (using the same setting as the Combined experiments) on both the training and the pruning corpora. The performance of the system using this unpruned ruleset is shown in the last row of Table 2. In comparison to the Combined results, F-measure drops from 69.3 to 67.6 (for MUC-6), and rises from 57.2 and 57.8 (for MUC-7). These results indicate that the RULE-SELECT algorithm has made a more effective use of the additional data than the learning algorithm without rule pruning by exploiting the feedback provided by the scoring function. null</Paragraph>
  </Section>
class="xml-element"></Paper>