<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0302">
  <Title>Toward Opinion Summarization: Linking the Sources</Title>
  <Section position="8" start_page="11" end_page="13" type="evalu">
    <SectionTitle>
7 Evaluation
</SectionTitle>
    <Paragraph position="0"> For evaluation we randomly split the MPQA corpus into a training set consisting of 400 documents  and a test set consisting of the remaining 135 documents. We use the same test set for all evaluations, although not all runs were trained on all 400 training documents as discussed below.</Paragraph>
    <Paragraph position="1"> The purpose of our evaluation is to create a strong baseline utilizing the best settings for the NP coreference approach. As such, we try the two reportedly best machine learning techniques for pairwise classification - RIPPER (for Repeated Incremental Pruning to Produce Error Reduction) (Cohen, 1995) and support vector machines (SVMs) in the SVMlight implementation (Joachims, 1998). Additionally, to exclude possible effects of parameter selection, we try many different parameter settings for the two classifiers. For RIPPER we vary the order of classes and the positive/negative weight ratio. For SVMs we vary C (themargintradeoff)andthetypeandparameter of the kernel. In total, we use 24 different settings for RIPPER and 56 for SVMlight.</Paragraph>
    <Paragraph position="2"> Additionally, Ng and Cardie reported better results when the training data distribution is balanced through instance selection. For instance selection they adopt the method of Soon et al.</Paragraph>
    <Paragraph position="3"> (2001), which selects for each NP the pairs with the n preceding coreferent instances and all intervening non-coreferent pairs. Following Ng and Cardie (2002), we perform instance selection with n = 1 (soon1 in the results) and n = 2 (soon2).</Paragraph>
    <Paragraph position="4"> With the three different instance selection algorithms (soon1,soon2, and none), the total number of settings is 72 for RIPPER and 168 for SVMa.</Paragraph>
    <Paragraph position="5"> However, not all SVM runs completed in the time limit that we set - 200 min, so we selected half of the training set (200 documents) at random and trained all classifiers on that set. We made sure to run to completion on the full training set those SVM settings that produced the best results on the smaller training set.</Paragraph>
    <Paragraph position="6"> Table 2 lists the results of the best performing runs. The upper half of the table gives the results for the runs that were trained on 400 documents and the lower half contains the results for the 200-document training set. We evaluated using the two widely used performance measures for coreference resolution - MUC score (Vilain et al., 1995) and B3 (Bagga and Baldwin, 1998). In addition, we used performance metrics (precision, recall and F1) on the identification of the positive class. We compute the latter in two different ways - either by using the pairwise decisions as the classifiers outputs them or by performing the clustering of the source NPs and then considering a pairwise decision to be positive if the two source NPs belong to the same cluster. The second option (marked actual in Table 2) should be more representative of a good clustering, since coreference decisions are important only in the context of the clusters that they create.</Paragraph>
    <Paragraph position="7"> Table 2 shows the performance of the best RIPPER and SVM runs for each of the four evaluation metrics. The table also lists the rank for each run among the rest of the runs.</Paragraph>
    <Section position="1" start_page="12" end_page="13" type="sub_section">
      <SectionTitle>
7.1 Discussion
</SectionTitle>
      <Paragraph position="0"> The absolute B3 and MUC scores for source coreference resolution are comparable to reported state-of-the-art results for NP coreference resolutions. Results should be interpreted cautiously, however, due to the different characteristics of our data. Our documents contained 35.34 source NPs per document on average, with coreference chains consisting of only 2.77 NPs on average. The low average number of NPs per chain may be producing artificially high score for the B3 and MUC scores as the modest results on positive class identification indicate.</Paragraph>
      <Paragraph position="1"> From the relative performance of our runs, we observe the following trends. First, SVMs trained on the full training set outperform RIPPER trained on the same training set as well as the corresponding SVMs trained on the 200-document training set. The RIPPER runs exhibit the opposite behavior - RIPPER outperforms SVMs on the 200-document training set and RIPPER runs trained on the smaller data set exhibit better performance.</Paragraph>
      <Paragraph position="2"> Overall, the single best performance is observed by RIPPER using the smaller training set.</Paragraph>
      <Paragraph position="3"> Another interesting observation is that the B3 measure correlates well with good &amp;quot;actual&amp;quot; performance on positive class identification. In contrast, good MUC performance is associated with runs that exhibit high recall on the positive class. This confirms some theoretical concerns that MUC score does not reward algorithms that recognize well the absence of links. In addition, the results confirm our conjecture that &amp;quot;actual&amp;quot; precision and recall are more indicative of the true performance of coreference algorithms.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>