<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1640">
  <Title>Partially Supervised Coreference Resolution for Opinion Summarization through Structured Rule Learning</Title>
  <Section position="8" start_page="340" end_page="342" type="evalu">
    <SectionTitle>
6 Evaluation and Results
</SectionTitle>
    <Paragraph position="0"> This section describes the source coreference data set, the baselines, our implementation of StRip, and the results of our experiments.</Paragraph>
    <Section position="1" start_page="340" end_page="341" type="sub_section">
      <SectionTitle>
6.1 Data set
</SectionTitle>
      <Paragraph position="0"> For evaluation we use the MPQA corpus (Wiebe et al., 2005).4 The corpus consists of 535 documents from the world press. All documents in the collection are manually annotated with phrase-level opinion information following the annotation scheme of Wiebe et al. (2005). Discussion of the annotation scheme is beyond the scope of this paper; for our purposes it suffices to say that the annotations include the source of each opinion and coreference information for the sources (e.g. source coreference chains). The corpus contains no additional noun phrase coreference information. null For our experiments, we randomly split the data set into a training set consisting of 400 documents and a test set consisting of the remaining 135 documents. We use the same test set for all experi- null ments, although some learning runs were trained on 200 training documents (see next Subsection).</Paragraph>
      <Paragraph position="1"> The test set contains a total of 4736 source NPs (average of 35.34 source NPs per document) split into 1710 total source NP chains (average of 12.76 chains per document) for an average of 2.77 source NPs per chain.</Paragraph>
    </Section>
    <Section position="2" start_page="341" end_page="341" type="sub_section">
      <SectionTitle>
6.2 Implementation
</SectionTitle>
      <Paragraph position="0"> We implemented the StRip algorithm by modifying JRip - the java implementation of RIPPER included in the WEKA toolkit (Witten and Frank, 2000). The WEKA implementation follows the original RIPPER specification. We changed the implementation to incorporate the modifications suggested by the StRip algorithm; we also modified the underlying data representations and data handling techniques for efficiency. Also due to efficiency considerations, we train StRip only on the 200-document training set.</Paragraph>
    </Section>
    <Section position="3" start_page="341" end_page="341" type="sub_section">
      <SectionTitle>
6.3 Competitive baselines
</SectionTitle>
      <Paragraph position="0"> We compare the results of the new method to three fully supervised baseline systems, each of which employs the same traditional coreference resolution approach. In particular, we use the aforementioned algorithm proposed by Ng and Cardie (2002), which combines a pairwise NP coreference classifier with single-link clustering.</Paragraph>
      <Paragraph position="1"> For one baseline, we train the coreference resolution algorithm on the MPQA src corpus -- the labeled portion of the MPQA corpus (i.e. NPs from the source coreference chains) with unlabeled instances removed.</Paragraph>
      <Paragraph position="2"> The second and third baselines investigate whether the source coreference resolution task can benefit from NP coreference resolution training data from a different domain. Thus, we train the traditional coreference resolution algorithm on the MUC6 and MUC7 coreference-annotated corpora5 that contain documents similar in style to those in the MPQA corpus (e.g. newspaper articles), but emanate from different domains.</Paragraph>
      <Paragraph position="3"> For all baselines we targeted the best possible systems by trying two pairwise NP classifiers (RIPPER and an SVM in the SV Mlight implementation (Joachims, 1998)), many different parameter settings for the classifiers, two different feature sets, two different training set sizes (the 5We train each baseline using both the development set and the test set from the corresponding MUC corpus.</Paragraph>
      <Paragraph position="4"> full training set and a smaller training set consisting of half of the documents selected at random), and three different instance selection algorithms6.</Paragraph>
      <Paragraph position="5"> This variety of classifier and training data settings was motivated by reported differences in performance of coreference resolution approaches w.r.t. these variations (Ng and Cardie, 2002). More details on the different parameter settings and instance selection algorithms as well as trends in the performance of different settings can be found in Stoyanov and Cardie (2006). In the experiments below we report the best performance of each of the two learning algorithms on the MPQA test data.</Paragraph>
    </Section>
    <Section position="4" start_page="341" end_page="342" type="sub_section">
      <SectionTitle>
6.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> In addition to the baselines described above, we evaluate StRip both with and without unlabeled data. That is, we train on the MPQA corpus StRip using either all NPs or just opinion source NPs.</Paragraph>
      <Paragraph position="1"> We use the B3 (Bagga and Baldwin, 1998) evaluation measure as well as precision, recall, and F1 measured on the (positive) pairwise decisions.</Paragraph>
      <Paragraph position="2"> B3 is a measure widely used for evaluating coreference resolution algorithms. The measure computes the precision and recall for each NP mention in a document, and then averages them to produce combined results for the entire output. More precisely, given a mention i that has been assigned to chain ci, the precision for mention i is defined as the number of correctly identified mentions in ci divided by the total number of mentions in ci.</Paragraph>
      <Paragraph position="3"> Recall for i is defined as the number of correctly identified mentions in ci divided by the number of mentions in the gold standard chain for i.</Paragraph>
      <Paragraph position="4"> Results are shown in Table 1. The first six rows of results correspond to the fully supervised baseline systems trained on different corpora --MUC6, MUC7, and MPQA src. The seventh row of results shows the performance of StRip using only labeled data. The final row of the table shows the results for partially supervised learning with unlabeled data. The table lists results from the best performing run for each algorithm.</Paragraph>
      <Paragraph position="5"> Performance among the baselines trained on the MUC data is comparable. However, the two base-line runs trained on the MPQA src corpus (i.e. results rows five and six) show slightly better performance on the B3 metric than the baselines trained  NPs, while MPQA full contains the unlabeled NPs.</Paragraph>
      <Paragraph position="6"> on the MUC data, which indicates that for our task the similarity of the documents in the training and test sets appears to be more important than the presence of complete supervisory information. (Improvements over the RIPPER runs trained on the MUC corpora are statistically significant7, while improvements over the SVM runs are not.) Table 1 also shows that StRip outperforms the baselines on both performance metrics. StRip's performance is better than the baselines when trained on MPQA src (improvement not statistically significant, p &gt; 0.20) and even better when trained on the full MPQA corpus, which includes the unlabeled NPs (improvement over the baselines and the former StRip run statistically significant). These results confirm our hypothesis that StRip improves due to two factors: first, considering pairwise decisions in the context of the clustering function leads to improvements in the classifier; and, second, StRip can take advantage of the unlabeled portion of the data.</Paragraph>
      <Paragraph position="7"> StRip's performance is all the more impressive considering the strength of the SVM and RIPPER baselines, which which represent the best runs across the 336 different parameter settings tested for SV Mlight and 144 different settings tested for RIPPER. In contrast, all four of the StRip runs using the full MPQA corpus (we vary the loss ratio for false positive/false negative cost) outperform those baselines.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>