<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1025">
  <Title>Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution</Title>
  <Section position="7" start_page="195" end_page="197" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="195" end_page="196" type="sub_section">
      <SectionTitle>
4.1 Performance Metrics
</SectionTitle>
      <Paragraph position="0"> We report in the following tables the MUC score (Vilain et al., 1995). Scores in Table 2 are computed for all noun phrases appearing in either the key or the system response, whereas Tables 3 and 4 refer to scoring only those phrases which appear in both the key and the response. We therefore discard those responses not present in the key, as we are interested in establishing the upper limit of the improvements given by our semantic features. That is, we want to define a baseline against which to establish the contribution of the semantic information sources explored here for coreference resolution.</Paragraph>
      <Paragraph position="1"> In addition, we report the accuracy score for all three types of ACE mentions, namely pronouns, common nouns and proper names. Accuracy is the percentage of REs of a given mention type correctly resolved divided by the total number of REs of the same type given in the key. A RE is said to be correctly resolved when both it and its direct antecedent are placed by the key in the same coreference class.</Paragraph>
      <Paragraph position="2"> 6During prototyping we experimented unpairing the arguments from the predicates, which yielded worse results. This is supported by the PropBank arguments always being defined with respect to a target predicate. Binarizing the features -- i.e. do REi and REj have the same argument or predicate label with respect to their closest predicate? -- also gave worse results.</Paragraph>
    </Section>
    <Section position="2" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
4.2 Feature Selection
</SectionTitle>
      <Paragraph position="0"> For determining the relevant feature sets we follow an iterative procedure similar to the wrapper approach for feature selection (Kohavi &amp; John, 1997) using the development data. The feature subset selection algorithm performs a hill-climbing search along the feature space. We start with a model based on all available features. Then we train models obtained by removing one feature at a time. We choose the worst performing feature, namely the one whose removal gives the largest improvement based on the MUC score F-measure, and remove it from the model. We then train classifiers removing each of the remaining features separately from the enhanced model. The process is iteratively run as long as significant improvement is observed.</Paragraph>
    </Section>
    <Section position="3" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> Table 2 compares the results between our duplicated Soon baseline and the original system. We assume that the slight improvements of our system are due to the use of current pre-processing components and another classifier. Tables 3 and 4 show a comparison of the performance between our baseline system and the ones incremented with semantic features. Performance improvements are highlighted in bold7.</Paragraph>
    </Section>
    <Section position="4" start_page="196" end_page="197" type="sub_section">
      <SectionTitle>
4.4 Discussion
</SectionTitle>
      <Paragraph position="0"> The tables show that semantic features improve system recall, rather than acting as a 'semantic filter' improving precision. Semantics therefore seems to trigger a response in cases where more shallow features do not seem to suffice (see examples (1-2)).</Paragraph>
      <Paragraph position="1"> Different feature sources account for different RE type improvements. WordNet and Wikipedia features tend to increase performance on common 7All changes in F-measure are statistically significant at the 0.05 level or higher. We follow Soon et al. (2001) in performing a simple one-tailed, paired sample t-test between the baseline system's MUC score F-measure and each of the other systems' F-measure scores on the test documents.</Paragraph>
      <Paragraph position="2"> nouns, whereas SRL improves pronouns. Word-Net features are able to improve by 14.3% and 7.7% the accuracy rate for common nouns on the BNEWS and NWIRE datasets (+34 and +37 correctly resolved common nouns out of 238 and 484 respectively), whereas employing Wikipedia yields slightly smaller improvements (+13.0% and +6.6% accuracy increase on the same datasets). Similarly, when SRL features are added to the baseline system, we register an increase in the accuracy rate for pronouns, ranging from 0.7% in BNEWS and NWIRE up to 4.2% in the MERGED dataset (+26 correctly resolved pronouns out of 620).</Paragraph>
      <Paragraph position="3"> If semantics helps for pronouns and common nouns, it does not affect performance on proper names, where features such as string matching and alias suffice. This suggests that semantics plays a role in pronoun and common noun resolution, where surface features cannot account for complex preferences and semantic knowledge is required.</Paragraph>
      <Paragraph position="4"> The best accuracy improvement on pronoun resolution is obtained on the MERGED dataset. This is due to making more data available to the classifier, as the SRL features are very sparse and inherently suffer from data fragmentation. Using a larger dataset highlights the importance of SRL, whose features are never removed in any feature selection process8. The accuracy on common nouns shows that features induced from Wikipedia are competitive with the ones from WordNet. The performance gap on all three datasets is quite small, which indicates the usefulness of using an encyclopedic knowledge base as a replacement for a lexical taxonomy. As a consequence of having different knowledge sources accounting for the resolution of different RE types, the best results are obtained by (1) combining features generated from different sources; (2) performing feature selection. When combining different feature sources, we register an accuracy improvement on pronouns and common nouns, as well as an increase in F-measure due to a higher recall.</Paragraph>
      <Paragraph position="5"> Feature selection always improves results. This is due to the fact that our full feature set is ex8To our knowledge, most of the recent work in coreference resolution on the ACE data keeps the document source separated for evaluation. However, we believe that document source independent evaluation provides useful insights on the robustness of the system (cf. the CoNLL 2005 shared task cross-corpora evaluation).</Paragraph>
      <Paragraph position="6">  ness of the knowledge sources we included overlapping features (i.e. using best and average similarity/relatedness measures at the same time), as well as features capturing the same phenomenon from different point of views (i.e. using multiple measures at the same time). In order to yield the desired performance improvements, it turns out to be essential to filter out irrelevant features.</Paragraph>
      <Paragraph position="7"> Table 5 shows the relevance of the best performing features on the BNEWS section. As our feature selection mechanism chooses the best set of features by removing them (see Section 4.2), we evaluate the contributions of the remaining features as follows. We start with a baseline system using all the features from Soon et al. (2001) that were not removed in the feature selection process (i.e. DIS-TANCE). We then train classifiers combining the current feature set with each feature in turn. We then choose the best performing feature based on the MUC score F-measure and add it to the model. We iterate the process until all features are added to the baseline system. The table indicates that all knowledge sources are relevant for coreference resolution, as it includes SRL, WordNet and Wikipedia features.</Paragraph>
      <Paragraph position="8"> The Wikipedia features rank high, indicating again that it provides a valid knowledge base.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>