<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1020">
  <Title>Learning Noun Phrase Anaphoricity to Improve Coreference Resolution: Issues in Representation and Optimization</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Machine Learning Framework for
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Coreference Resolution
</SectionTitle>
      <Paragraph position="0"> The coreference system to which our automatically computed anaphoricity information will be applied implements the standard machine learning approach to coreference resolution combining classification and clustering. Below we will give a brief overview of this standard approach. Details can be found in Soon et al. (2001) or Ng and Cardie (2002b).</Paragraph>
      <Paragraph position="1"> Training an NP coreference classifier. After a pre-processing step in which the NPs in a document are automatically identified, a learning algorithm is used to train a classifier that, given a description of two NPs in the document, decides whether they are COREFERENT or NOT COREFERENT.</Paragraph>
      <Paragraph position="2"> Applying the classifier to create coreference chains. Test texts are processed from left to right.</Paragraph>
      <Paragraph position="3"> Each NP encountered, NPj, is compared in turn to each preceding NP, NPi. For each pair, a test instance is created as during training and is presented to the learned coreference classifier, which returns a number between 0 and 1 that indicates the likelihood that the two NPs are coreferent. The NP with the highest coreference likelihood value among the preceding NPs with coreference class values above 0.5 is selected as the antecedent of NPj; otherwise, no antecedent is selected for NPj.</Paragraph>
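The best-first antecedent selection just described admits a compact sketch. Below is a minimal Python version, with an illustrative toy classifier standing in for the learned model (the NP strings and scores are invented for the example, not taken from the paper's system):

```python
# Sketch of best-first antecedent selection: for each NP, pick the
# preceding NP with the highest coreference likelihood, provided that
# likelihood exceeds 0.5; otherwise select no antecedent.
def select_antecedent(np_j, preceding_nps, classifier):
    """Return the antecedent of np_j, or None if no candidate scores above 0.5."""
    best, best_score = None, 0.5
    for np_i in preceding_nps:
        score = classifier(np_i, np_j)
        if score > best_score:
            best, best_score = np_i, score
    return best

# Toy classifier for illustration: fixed scores per NP pair.
scores = {("Clinton", "he"): 0.9, ("the bill", "he"): 0.1}
clf = lambda np_i, np_j: scores.get((np_i, np_j), 0.0)
print(select_antecedent("he", ["Clinton", "the bill"], clf))  # Clinton
```

Initializing best_score to 0.5 encodes the requirement that a candidate's coreference likelihood must exceed 0.5 before it can be selected at all.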
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> In Section 2, we examined how to construct locally- and globally-optimized anaphoricity models. Recall that, for each of these two types of models, the resulting (non-)anaphoricity information can be used by a learning-based coreference system either as hard bypassing constraints or as a feature. Hence, given a coreference system that implements the two-step learning approach shown above, we will be able to evaluate the four different combinations of computing and using anaphoricity information for improving the coreference system described in the introduction. Before presenting evaluation details, we will describe the experimental setup.</Paragraph>
    <Paragraph position="1"> Coreference system. In all of our experiments, we use our learning-based coreference system (Ng and Cardie, 2002b).</Paragraph>
    <Paragraph position="2"> Features for anaphoricity determination. In both the locally-optimized and the globally-optimized approaches to anaphoricity determination described in Section 2, an instance is represented by 37 features that are specifically designed for distinguishing anaphoric and non-anaphoric NPs. Space limitations preclude a description of these features; see Ng and Cardie (2002a) for details.</Paragraph>
    <Paragraph position="3"> Learning algorithms. For training coreference classifiers and locally-optimized anaphoricity models, we use both RIPPER and MaxEnt as the underlying learning algorithms. However, for training globally-optimized anaphoricity models, RIPPER is always used in conjunction with Method 1 and MaxEnt with Method 2, as described in Section 2.2.</Paragraph>
    <Paragraph position="4"> In terms of setting learner-specific parameters, we use default values for all RIPPER parameters unless otherwise stated. For MaxEnt, we always train the feature-weight parameters with 100 iterations of the improved iterative scaling algorithm (Della Pietra et al., 1997), using a Gaussian prior to prevent overfitting (Chen and Rosenfeld, 2000).</Paragraph>
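As a rough illustration of how a Gaussian prior regularizes MaxEnt training: the prior adds an L2 penalty, so the gradient of the penalized log-likelihood pulls every feature weight toward zero. The paper trains with improved iterative scaling; the sketch below uses plain gradient ascent on the same objective purely for brevity, and sigma2, lr, and iters are assumed values, not the paper's settings:

```python
import math

# Sketch of MaxEnt (binary logistic) training with a Gaussian prior.
# The -w/sigma2 term is the gradient contribution of the prior, which
# prevents overfitting by shrinking weights toward zero.
def train_maxent(instances, labels, n_feats, sigma2=1.0, lr=0.1, iters=100):
    w = [0.0] * n_feats
    for _ in range(iters):
        grad = [-w[k] / sigma2 for k in range(n_feats)]   # prior term
        for x, y in zip(instances, labels):
            z = sum(w[k] * x[k] for k in range(n_feats))
            p = 1.0 / (1.0 + math.exp(-z))                # P(coreferent | x)
            for k in range(n_feats):
                grad[k] += (y - p) * x[k]                 # data term
        w = [w[k] + lr * grad[k] for k in range(n_feats)]
    return w

# Two toy one-hot instances: feature 0 signals class 1, feature 1 class 0.
w = train_maxent([[1.0, 0.0], [0.0, 1.0]], [1, 0], 2)
```

Without the prior term, the weights on separable data like this would grow without bound; the Gaussian prior caps them at a finite optimum.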
    <Paragraph position="5"> Data sets. We use the Automatic Content Extraction (ACE) Phase II data sets.2 We choose ACE rather than the more widely-used MUC corpus (MUC-6, 1995; MUC-7, 1998) simply because  ACE provides much more labeled data for both training and testing. However, our system was set up to perform coreference resolution according to the MUC rules, which are fairly different from the ACE guidelines in terms of the identification of markables as well as evaluation schemes. Since our goal is to evaluate the effect of anaphoricity information on coreference resolution, we make no attempt to modify our system to adhere to the rules specifically designed for ACE.</Paragraph>
    <Paragraph position="6"> The coreference corpus is composed of three data sets made up of three different news sources: Broadcast News (BNEWS), Newspaper (NPAPER), and Newswire (NWIRE). Statistics collected from these data sets are shown in Table 1. For each data set, we train an anaphoricity classifier and a coreference classifier on the (same) set of training texts and evaluate the coreference system on the test texts.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="2" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, we will compare the effectiveness of four approaches to anaphoricity determination (see the introduction) in improving our baseline coreference system.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Coreference Without Anaphoricity
</SectionTitle>
      <Paragraph position="0"> As mentioned above, we use our coreference system as the baseline system, where no explicit anaphoricity determination system is employed. Results using RIPPER and MaxEnt as the underlying learners are shown in rows 1 and 2 of Table 2, where performance is reported in terms of recall, precision, and F-measure using the model-theoretic MUC scoring program (Vilain et al., 1995). With RIPPER, the system achieves an F-measure of 56.3 for BNEWS, 61.8 for NPAPER, and 51.7 for NWIRE. The performance of MaxEnt is comparable to that of RIPPER for the BNEWS and NPAPER data sets but slightly worse for the NWIRE data set.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
5.2 Coreference With Anaphoricity
</SectionTitle>
      <Paragraph position="0"> [Table 2 caption: information on which Learner is used to train the coreference classifier, as well as performance results in terms of Recall, Precision, F-measure and the corresponding conservativeness parameter, are provided whenever appropriate. The strongest result obtained for each data set is boldfaced. In addition, results that represent statistically significant gains and drops with respect to the baseline are marked with an asterisk (*) and a dagger (+), respectively.]</Paragraph>
      <Paragraph position="1"> The Constraint-Based, Locally-Optimized (CBLO) Approach. As mentioned before, in constraint-based approaches, the automatically computed non-anaphoricity information is used as hard bypassing constraints, with which the coreference system attempts to resolve only NPs that the anaphoricity classifier determines to be anaphoric.</Paragraph>
      <Paragraph position="2"> As a result, we hypothesized that precision would increase in comparison to the baseline system. In addition, we expect that recall will drop owing to the anaphoricity classifier's misclassifications of truly anaphoric NPs. Consequently, overall performance is not easily predictable: F-measure will improve only if gains in precision can compensate for the loss in recall.</Paragraph>
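The bypassing behavior can be sketched in a few lines of Python (function and variable names are illustrative, not from the actual system):

```python
# Sketch of the hard bypassing constraint (CBLO/CBGO): the resolver
# skips every NP the anaphoricity classifier labels non-anaphoric,
# which is why precision can rise while recall falls whenever truly
# anaphoric NPs are misclassified.
def resolve_with_constraints(nps, is_anaphoric, find_antecedent):
    """is_anaphoric and find_antecedent are assumed classifier interfaces."""
    links = {}
    for np in nps:
        if is_anaphoric(np):          # bypass non-anaphoric NPs entirely
            ante = find_antecedent(np)
            if ante is not None:
                links[np] = ante
    return links

# Illustrative run: only "he" is deemed anaphoric, so "a car" is bypassed.
links = resolve_with_constraints(
    ["John", "a car", "he"],
    lambda np: np == "he",
    lambda np: "John")
```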
      <Paragraph position="3"> Results are shown in rows 3-6 of Table 2. Each row corresponds to a different combination of learners employed in training the coreference and anaphoricity classifiers.3 As mentioned in Section 2.2, locally-optimized approaches are a special case of their globally-optimized counterparts, with the conservativeness parameter set to the default value of one for RIPPER and 0.5 for MaxEnt.</Paragraph>
      <Paragraph position="4"> In comparison to the baseline, we see large gains in precision at the expense of recall. Moreover, CBLO does not seem to be very effective in improving the baseline, in part due to the dramatic loss in recall. In particular, although we see improvements in F-measure in five of the 12 experiments in this group, only one of them is statistically significant.</Paragraph>
      <Paragraph position="5"> The Feature-Based, Locally-Optimized (FBLO) Approach. The experimental setting employed here is essentially the same as that in CBLO, except that anaphoricity information is incorporated into the coreference system as a feature rather than as constraints. Specifically, each training/test coreference instance i(NPi,NPj) (created from NPj and a preceding NP NPi) is augmented with a feature whose value is the anaphoricity of NPj as computed by the anaphoricity classifier.</Paragraph>
      <Paragraph position="6"> In general, we hypothesized that FBLO would perform better than the baseline: the addition of an anaphoricity feature to the coreference instance representation might give the learner additional flexibility in creating coreference rules. Similarly, we expect FBLO to outperform its constraint-based counterpart: since anaphoricity information is represented as a feature in FBLO, the coreference learner can incorporate the information selectively rather than as universal hard constraints.</Paragraph>
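A sketch of this instance augmentation, assuming a hypothetical anaphoricity_of interface and treating the existing pairwise feature list as opaque:

```python
# Sketch of the feature-based encoding (FBLO/FBGO): each pairwise
# coreference instance i(NP_i, NP_j) gains one extra feature holding
# the computed anaphoricity of NP_j.  The pairwise feature list stands
# in for whatever representation the coreference system already uses.
def augment_instance(pair_features, np_j, anaphoricity_of):
    """anaphoricity_of returns True/False here; a probability would give
    the real-valued variant mentioned later in the text."""
    return pair_features + [float(anaphoricity_of(np_j))]

aug = augment_instance([0.0, 1.0, 0.0], "he", lambda np: True)
```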
      <Paragraph position="7"> Results using the FBLO approach are shown in rows 7-10 of Table 2. Somewhat unexpectedly, this approach is not effective in improving the baseline: F-measure increases significantly in only two of the 12 cases. Perhaps more surprisingly, we see significant drops in F-measure in five cases. [Footnote: a significance test (1989) is applied to determine if the differences in the F-measure scores between two coreference systems are statistically significant at the 0.05 level or higher.]</Paragraph>
      <Paragraph position="8"> [Table 3 caption fragment: ... anaphoricity determination on the three ACE held-out development data sets. Information on which Learner (RIPPER or MaxEnt) is used to train the coreference classifier, as well as performance results in terms of Recall, Precision, F-measure and the corresponding conservativeness parameter, are provided whenever appropriate. The strongest result obtained for each data set is boldfaced.]</Paragraph>
      <Paragraph position="9"> To get a better idea of why F-measure decreases, we examine the relevant coreference classifiers induced by RIPPER. We find that the anaphoricity feature is used in a somewhat counter-intuitive manner: some of the induced rules posit a coreference relationship between NPj and a preceding NP NPi even though NPj is classified as non-anaphoric. These results seem to suggest that the anaphoricity feature is an irrelevant feature from a machine learning point of view.</Paragraph>
      <Paragraph position="10"> In comparison to CBLO, the results are mixed: there does not appear to be a clear winner in any of the three data sets. Nevertheless, it is worth noticing that the CBLO systems can be characterized as having high precision/low recall, whereas the reverse is true for FBLO systems in general. As a result, even though CBLO and FBLO systems achieve similar performance, the former is the preferred choice in applications where precision is critical.</Paragraph>
      <Paragraph position="11"> Finally, we note that there are other ways to encode anaphoricity information in a coreference system. For instance, it is possible to represent anaphoricity as a real-valued feature indicating the probability of an NP being anaphoric rather than as a binary-valued feature. Future work will examine alternative encodings of anaphoricity.</Paragraph>
      <Paragraph position="12"> The Constraint-Based, Globally-Optimized (CBGO) Approach. As discussed above, we optimize the anaphoricity model for coreference performance via the conservativeness parameter. In particular, we will use this parameter to maximize the F-measure score for a particular data set and learner combination using held-out development data. To ensure a fair comparison between global and local approaches, we do not rely on additional development data in the former; instead we use</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2/3 of the original training texts for acquiring the anaphoricity and coreference classifiers and the
</SectionTitle>
    <Paragraph position="0"> remaining 1/3 for development for each of the data sets. As far as parameter tuning is concerned, we tested values of 1, 2, . . . , 10 as well as their reciprocals for cr, and 0.05, 0.1, . . . , 1.0 for t.</Paragraph>
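The two grids just listed can be written out directly; the tune() helper below is an assumed sketch of the selection step (the paper picks the value that maximizes F-measure on the held-out development texts):

```python
# Conservativeness-parameter grids from the text: integers 1..10 plus
# their reciprocals for cr, and a 0.05-step grid for the threshold t.
cr_values = sorted({float(n) for n in range(1, 11)} |
                   {1.0 / n for n in range(1, 11)})
t_values = [round(0.05 * i, 2) for i in range(1, 21)]  # 0.05, 0.10, ..., 1.0

def tune(dev_fmeasure, values):
    """Assumed sketch: pick the value maximizing F-measure on dev data."""
    return max(values, key=dev_fmeasure)
```

Note that 1 is its own reciprocal, so the cr grid contains 19 distinct values, not 20.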
    <Paragraph position="1"> In general, we hypothesized that CBGO would outperform both the baseline and the locally-optimized approaches, since coreference performance is being explicitly maximized. Results using CBGO, which are shown in rows 11-14 of Table 2, are largely consistent with our hypothesis. The best results on all three data sets are achieved using this approach. In comparison to the baseline, we see statistically significant gains in F-measure in nine of the 12 experiments in this group. Improvements stem primarily from large gains in precision accompanied by smaller drops in recall. Perhaps more importantly, CBGO never produces results that are significantly worse than those of the baseline systems on these data sets, unlike CBLO and FBLO. Overall, these results suggest that CBGO is more robust than the locally-optimized approaches in improving the baseline system.</Paragraph>
    <Paragraph position="2"> As can be seen, CBGO fails to produce statistically significant improvements over the baseline in three cases. The relatively poorer performance in these cases can potentially be attributed to the underlying learner combination. Fortunately, we can use the development data not only for parameter tuning but also for predicting the best learner combination. Table 3 shows the performance of the coreference system using CBGO on the development data, along with the value of the conservativeness parameter used to achieve the results in each case. Using the notation Learner1/Learner2 to denote the fact that Learner1 and Learner2 are used to train the underlying coreference classifier and anaphoricity classifier respectively, we can see that the RIPPER/RIPPER combination achieves the best performance on the BNEWS development set, whereas MaxEnt/RIPPER works best for the other two. Hence, if we rely on the development data to pick the best learner combination for use in testing, the resulting coreference system will outperform the baseline on all three data sets and yield the best-performing system on all but the NPAPER data set, achieving an F-measure of 60.8 (row 11), 63.2 (row 11), and 54.5 (row 13) for the BNEWS, NPAPER, and NWIRE data sets, respectively. [Figure 1 caption fragment: ... coreference system for the NPAPER development data using RIPPER/RIPPER.] Moreover, the high correlation between the relative coreference performance achieved by different learner combinations on the development data and that on the test data also reflects the stability of CBGO.</Paragraph>
    <Paragraph position="3"> In comparison to the locally-optimized approaches, CBGO achieves better F-measure scores in almost all cases. Moreover, the learned conservativeness parameter in CBGO always has a larger value than the default value employed by CBLO.</Paragraph>
    <Paragraph position="4"> This provides empirical evidence that the CBLO anaphoricity classifiers are too liberal in classifying NPs as non-anaphoric.</Paragraph>
    <Paragraph position="5"> To examine the effect of the conservativeness parameter on the performance of the coreference system, we plot in Figure 1 the recall, precision, F-measure curves against cr for the NPAPER development data using the RIPPER/RIPPER learner combination. As cr increases, recall rises and precision drops. This should not be surprising, since (1) increasing cr causes fewer anaphoric NPs to be mis-classified and allows the coreference system to find a correct antecedent for some of them, and (2) decreasing cr causes more truly non-anaphoric NPs to be correctly classified and prevents the coreference system from attempting to resolve them. The best F-measure in this case is achieved when cr=4.</Paragraph>
    <Paragraph position="6"> The Feature-Based, Globally-Optimized (FBGO) Approach. The experimental setting employed here is essentially the same as that in the CBGO setting, except that anaphoricity information is incorporated into the coreference system as a feature rather than as constraints.</Paragraph>
    <Paragraph position="7"> Specifically, each training/test instance i(NPi,NPj) is augmented with a feature whose value is the computed anaphoricity of NPj. The development data is used to select the anaphoricity model (and hence the parameter value) that yields the best-performing coreference system. This model is then used to compute the anaphoricity value for the test instances. As mentioned before, we use the same parametric anaphoricity model as in CBGO for achieving global optimization.</Paragraph>
    <Paragraph position="8"> Since the parametric model is designed with a constraint-based representation in mind, we hypothesized that global optimization in this case would not be as effective as in CBGO. Nevertheless, we expect that this approach is still more effective in improving the baseline than the locally-optimized approaches.</Paragraph>
      <Paragraph position="9"> Results using FBGO are shown in rows 15-18 of Table 2. As expected, FBGO is less effective than CBGO in improving the baseline, underperforming its constraint-based counterpart in 11 of the 12 cases. In fact, FBGO is able to significantly improve the corresponding baseline in only four cases. Somewhat surprisingly, FBGO is by no means superior to the locally-optimized approaches with respect to improving the baseline. These results seem to suggest that global optimization is effective only if we have a &quot;good&quot; parameterization that is able to take into account how anaphoricity information will be exploited by the coreference system. Nevertheless, as discussed before, effective global optimization with a feature-based representation is not easy to accomplish.</Paragraph>
  </Section>
  <Section position="8" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6 Analyzing Anaphoricity Features
</SectionTitle>
    <Paragraph position="0"> So far we have focused on computing and using anaphoricity information to improve the performance of a coreference system. In this section, we examine which anaphoricity features are important in order to gain linguistic insights into the problem.</Paragraph>
    <Paragraph position="1"> Specifically, we measure the informativeness of a feature by computing its information gain (see p.22 of Quinlan (1993) for details) on our three data sets for training anaphoricity classifiers. Overall, the most informative features are HEAD MATCH (whether the NP under consideration has the same head as one of its preceding NPs), STR MATCH (whether the NP under consideration is the same string as one of its preceding NPs), and PRONOUN (whether the NP under consideration is a pronoun).</Paragraph>
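Information gain for a binary feature over the binary anaphoric/non-anaphoric label can be computed as in the following sketch (the sample data at the end is purely illustrative):

```python
import math

# Sketch of information gain, the informativeness measure cited from
# Quinlan (1993): label entropy minus the expected entropy after
# splitting on the feature.
def entropy(labels):
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def info_gain(feature, labels):
    """feature and labels are parallel lists of 0/1 values."""
    n = len(labels)
    gain = entropy(labels)
    for v in (0, 1):
        subset = [y for x, y in zip(feature, labels) if x == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

# A perfectly predictive feature (the ideal case for something like
# STR MATCH) recovers the full label entropy.
print(info_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

An uninformative feature, by contrast, leaves the class distribution unchanged in each split and so has a gain of zero.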
    <Paragraph position="2"> The high discriminating power of HEAD MATCH and STR MATCH is a probable consequence of the fact that an NP is likely to be anaphoric if there is a lexically similar noun phrase preceding it in the text. The informativeness of PRONOUN can also be expected: most pronominal NPs are anaphoric.</Paragraph>
      <Paragraph position="3"> Features that determine whether the NP under consideration is a PROPER NOUN, whether it is a BARE SINGULAR or a BARE PLURAL, and whether it begins with an &quot;a&quot; or a &quot;the&quot; (ARTICLE) are also highly informative. This is consistent with our intuition that the (in)definiteness of an NP plays an important role in determining its anaphoricity.</Paragraph>
  </Section>
</Paper>