
<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1021">
  <Title>Improving Pronoun Resolution Using Statistics-Based Semantic Compatibility Information</Title>
  <Section position="6" start_page="168" end_page="171" type="evalu">
    <SectionTitle>
4 Evaluation and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="168" end_page="168" type="sub_section">
      <SectionTitle>
4.1 Experiment Setup
</SectionTitle>
      <Paragraph position="0"> In our study we were only concerned about the third-person pronoun resolution. With an attempt to examine the effectiveness of the semantic feature on different types of pronouns, the whole resolution was divided into neutral pronoun (it &amp; they) resolution and personal pronoun (he &amp; she) resolution.</Paragraph>
      <Paragraph position="1"> The experiments were done on the newswire domain, using MUC corpus (Wall Street Journal articles). The training was done on 150 documents from MUC-6 coreference data set, while the testing was on the 50 formal-test documents of MUC-6 (30) and MUC-7 (20). Throughout the experiments, default learning parameters were applied to the C5 algorithm. The performance was evaluated based on success, the ratio of the number of correctly resolved anaphors over the total number of anaphors.</Paragraph>
      <Paragraph position="2"> An input raw text was preprocessed automatically by a pipeline of NLP components. The noun phrase identification and the predicate-argument extraction were done based on the results of a chunk tagger, which was trained for the shared task of CoNLL-2000 and achieved 92% accuracy (Zhou et al., 2000). The recognition of NEs as well as their semantic categories was done by a HMM based NER, which was trained for the MUC NE task and obtained high F-scores of 96.9% (MUC-6) and 94.3% (MUC-7) (Zhou and Su, 2002).</Paragraph>
      <Paragraph position="3"> For each anaphor, the markables occurring within the current and previous two sentences were taken as the initial candidates. Those with mismatched number and gender agreements were filtered from the candidate set. Also, pronouns or NEs that disagreed in person with the anaphor were removed in advance. For the training set, there are totally 645 neutral pronouns and 385 personal pronouns with non-empty candidate set, while for the testing set, the number is 245 and 197.</Paragraph>
    </Section>
    <Section position="2" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
4.2 The Corpus and the Web
</SectionTitle>
      <Paragraph position="0"> The corpus for the predicate-argument statistics computation was from the TIPSTER's Text Research Collection (v1994). Consisting of 173,252 Wall Street Journal articles from the year 1988 to 1992, the data set contained about 76 million words.</Paragraph>
      <Paragraph position="1"> The documents were preprocessed using the same POS tagging and NE-recognition components as in the pronoun resolution task. Cass (Abney, 1996), a robust chunker parser was then applied to generate the shallow parse trees, which resulted in 353,085 possessive-noun tuples, 759,997 verb-object tuples and 1,090,121 subject-verb tuples.</Paragraph>
      <Paragraph position="2"> We examined the capacity of the web and the corpus in terms of zero-count ratio and count number. On average, among the predicate-argument tuples that have non-zero corpus-counts, above 93% have also non-zero web-counts. But the ratio is only around 40% contrariwise. And for the predicate- null on the seen predicate-argument tuples argument tuples that could be seen in both data sources, the count from the web is above 2000 times larger than that from the corpus.</Paragraph>
      <Paragraph position="3"> Although much less sparse, the web counts are significantly noisier than the corpus count since no tagging, chunking and parsing could be carried out on the web pages. However, previous study (Keller and Lapata, 2003) reveals that the large amount of data available for the web counts could outweigh the noisy problems. In our study we also carried out a correlation analysis3 to examine whether the counts from the web and the corpus are linearly related, on the predicate-argument tuples that can be seen in both data sources. From the results listed in Table 3, we observe moderately high correlation, with coefficients ranging from 0.5 to 0.7 around, between the counts from the web and the corpus, for both neutral pronoun (N-Pron) and personal pronoun (P-Pron) resolution tasks.</Paragraph>
    </Section>
    <Section position="3" start_page="169" end_page="170" type="sub_section">
      <SectionTitle>
4.3 System Evaluation
</SectionTitle>
      <Paragraph position="0"> Table 2 summarizes the performance of the systems with different combinations of statistics sources and learning frameworks. The systems without the se3All the counts were log-transformed and the correlation coefficients were evaluated based on Pearsons' r.</Paragraph>
      <Paragraph position="1"> mantic feature were used as the baseline. Under the single-candidate (SC) model, the baseline system obtains a success of 65.7% and 86.8% for neutral pronoun and personal pronoun resolution, respectively. By contrast, the twin-candidate (TC) model achieves a significantly (p [?] 0.05, by two-tailed ttest) higher success of 73.9% and 91.9%, respectively. Overall, for the whole pronoun resolution, the baseline system under the TC model yields a success 81.9%, 6.8% higher than SC does4. The performance is comparable to most state-of-the-art pronoun resolution systems on the same data set.</Paragraph>
      <Paragraph position="2"> Web-based feature vs. Corpus-based feature The third column of the table lists the results using the web-based compatibility feature for neutral pronouns. Under both SC and TC models, incorporation of the web-based feature significantly boosts the performance of the baseline: For the best system in the SC model and the TC model, the success rate is improved significantly by around 4.9% and 5.3%, respectively. A similar pattern of improvement could be seen for the corpus-based semantic feature. However, the increase is not as large as using the web-based feature: Under the two learning models, the success rate of the best system with the corpus-based feature rises by up to 2.0% and 2.8% respectively, about 2.9% and 2.5% less than that of the counterpart systems with the web-based feature. The larger size and the better counts of the web against the corpus, as reported in Section 4.2, 4The improvement against SC is higher than that reported in (Yang et al., 2003). It should be because we now used 150 training documents rather than 30 ones as in the previous work. The TC model would benefit from larger training data set as it uses more features (more than double) than SC.</Paragraph>
      <Paragraph position="3">  should contribute to the better performance.</Paragraph>
      <Paragraph position="4"> Single-candidate model vs. Twin-Candidate model The difference between the SC and the TC model is obvious from the table. For the N-Pron and P-Pron resolution, the systems under TC could outperform the counterpart systems under SC by above 5% and 8% success, respectively. In addition, the utility of the statistics-based semantic feature is more salient under TC than under SC for N-Pron resolution: the best gains using the corpus-based and the web-based semantic features under TC are 2.9% and 5.3% respectively, higher than those under the SC model using either un-normalized semantic features (1.6% and 3.3%), or normalized semantic features (2.0% and 4.9%). Although under SC, the normalized semantic feature could result in a gain close to under TC, its utility is not stable: with metric frequency, using the normalized feature performs even worse than using the un-normalized one. These results not only affirm the claim by Yang et al. (2003) that the TC model is superior to the SC model for pronoun resolution, but also indicate that TC is more reliable than SC in applying the statistics-based semantic feature, for N-Pron resolution.</Paragraph>
      <Paragraph position="5"> Web+TC vs. Other combinations The above analysis has exhibited the superiority of the web over the corpus, and the TC model over the SC model. The experimental results also reveal that using the the web-based semantic feature together with the TC model is able to further boost the resolution performance for neutral pronouns. The system with such a Web+TC combination could achieve a high success of 79.2%, defeating all the other possible combinations. Especially, it considerably outperforms (up to 11.5% success) the system with the Corpus+SC combination, which is commonly adopted in previous work (e.g., Kehler et al. (2004)).</Paragraph>
      <Paragraph position="6"> Personal pronoun resolution vs. Neutral pronoun resolution Interestingly, the statistics-based semantic feature has no effect on the resolution of personal pronouns, as shown in the table 2. We found in the learned decision trees such a feature did not occur (SC) or only occurred in bottom nodes (TC). This should be because personal pronouns have strong restriction on the semantic category (i.e., human) of the candidates. A non-human candidate, even with a high predicate-argument statistics, could</Paragraph>
      <Paragraph position="8"> under TC model for N-pron resolution (features ended with &amp;quot; 1&amp;quot; are for the first candidate C1 and those with &amp;quot; 2&amp;quot; are for C2.) not be used as the antecedent (e.g. company said in the sentence &amp;quot;. . . the company . . . he said . . . &amp;quot;). In fact, our analysis of the current data set reveals that most P-Prons refer back to a P-Pron or NE candidate whose semantic category (human) has been determined. That is, simply using features NE and Pron is sufficient to guarantee a high success, and thus the relatively weak semantic feature would not be taken in the learned decision tree for resolution.</Paragraph>
    </Section>
    <Section position="4" start_page="170" end_page="171" type="sub_section">
      <SectionTitle>
4.4 Feature Analysis
</SectionTitle>
      <Paragraph position="0"> In our experiment we were also concerned about the importance of the web-based compatibility feature (using frequency metric) among the feature set. For this purpose, we divided the features into groups, and then trained and tested on one group at a time.</Paragraph>
      <Paragraph position="1"> Table 4 lists the feature groups and their respective results for N-Pron resolution under the TC model.</Paragraph>
      <Paragraph position="2">  The second column is for the systems with only the current feature group, while the third column is with the features combined with the existing feature set. We see that used in isolation, the semantic compatibility feature is able to achieve a success up to 61% around, just 4% lower than the best indicative feature FirstNP. In combination with other features, the performance could be improved by as large as 18% as opposed to being used alone.</Paragraph>
      <Paragraph position="3"> Figure 1 shows the top portion of the pruned decision tree for N-Pron resolution under the TC model. We could find that: (i) When comparing two candidates which occur in the same sentence as the anaphor, the web-based semantic feature would be examined in the first place, followed by the lexical property of the candidates. (ii) When two non-pronominal candidates are both in previous sentences before the anaphor, the web-based semantic feature is still required to be examined after FirstNP and ParaStruct. The decision tree further indicates that the web-based feature plays an important role in N-Pron resolution.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>