<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1662">
  <Title>Sentence Ordering with Manifold-based Classification in Multi-Document Summarization</Title>
  <Section position="8" start_page="529" end_page="532" type="evalu">
    <SectionTitle>
6 Experiments and Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="529" end_page="529" type="sub_section">
      <SectionTitle>
6.1 Data
</SectionTitle>
      <Paragraph position="0"> We used the DUC04 document dataset. The dataset contains 50 document clusters and each cluster includes 20 content-related documents.</Paragraph>
      <Paragraph position="1"> For each cluster, 4 manual summaries are provided.</Paragraph>
    </Section>
    <Section position="2" start_page="529" end_page="529" type="sub_section">
      <SectionTitle>
6.2 Evaluation Measure
</SectionTitle>
      <Paragraph position="0"> The proposed method in this paper consists of two main steps: sentence classification and sentence ordering. For classification, we used pointwise entropy (Dash et al., 2000) to measure the quality of the classification result due to lack of enough labeled data. For a nxm matrix M, whose row vectors are normalized as 1, its pointwise entropy is defined in 9).</Paragraph>
      <Paragraph position="2"> Intuitively, if Mi,j is close to 0 or 1, E(M) tends towards 0, which corresponds to clearer distinctions between classes; otherwise E(M) tends towards 1, which means there are no clear boundaries between classes. For comparison between different matrices, E(M) needs to be averaged over nxm.</Paragraph>
      <Paragraph position="3"> For sentence ordering, we used Kendall's t coefficient (Lapata, 2003), as defined in 10),</Paragraph>
      <Paragraph position="5"> where, NI is number of inversions of consecutive sentences needed to transform output of the algorithm to manual summaries. The measure ranges from -1 for inverse ranks to +1 for identical ranks, and can also be seen as a kind of edit similarity between two ranks: smaller values for lower similarity, and bigger values for higher similarity.</Paragraph>
    </Section>
    <Section position="3" start_page="529" end_page="530" type="sub_section">
      <SectionTitle>
6.3 Evaluation of Classification
</SectionTitle>
      <Paragraph position="0"> For sentence classification, we need to estimate the parameter t. We randomly chose 5 document clusters and one manual summary from the four.</Paragraph>
      <Paragraph position="1">  with t for each cluster and the values of t maximizing the margin are different for different clusters. For the 5 clusters, the estimated t is 16, 8, 14, 12 and 21 respectively. So we need to  estimate the best t for each cluster.</Paragraph>
      <Paragraph position="2"> After estimation of t, EM was used to estimate the membership probabilities. Table 1 gives the average pointwise entropy for top 10% to top 100% sentences in each cluster, where sentences were ordered by their membership probabilities. The values were averaged over 20 runs, and for each run, 10 document clusters and one summary were randomly selected, and the entropy was averaged over the summaries.</Paragraph>
      <Paragraph position="3">  In Table 1, the column E_Semi shows entropies of the semi-supervised classification. It indicates that the entropy increases as more sentences are considered. This is not surprising since the sentences are ordered by their membership probabilities in a cluster, which can be seen as a kind of measure for closeness between sentences and cluster centroids, and the boundaries between clusters become dim with more sentences considered.</Paragraph>
      <Paragraph position="4"> To compare the performance between this semi-supervised classification and a standard supervised method like Support Vector Machines (SVM), Table 1 also lists the average entropy of a SVM (E_SVM) over the runs. Similarly, we found that the entropy also increases as sentences increase. Table 2 also gives the significance sign over the runs, where *, ** and ~ represent p-values &lt;=0.01, (0.01, 0.05] and &gt;0.05, and indicate that the entropy of the semi-supervised classification is lower, significantly lower, or almost the same as that of SVM respectively.</Paragraph>
      <Paragraph position="5"> Table 1 demonstrates that when the top 10% or 20% sentences are considered, the performance between the two algorithms shows no difference. The reason may be that these top sentences are closer to cluster centroids in both cases, and the cluster boundaries in both algorithms are clear in terms of these sentences.</Paragraph>
      <Paragraph position="6"> For the top 30% sentences, the entropy for semi-supervised classification is lower than that for a SVM, and for the top 40%, the difference becomes significantly lower. The reason may go to the substantial assumptions behind the two algorithms. SVM, based on local comparison, is successful only when more labeled data is available. With only one sentence labeled as in our case, the semi-supervised method, based on global distribution, makes use of a large amount of unlabeled data to reveal the underlying manifold structure. Thus, the performance is much better than that of a SVM when more sentences are considered.</Paragraph>
      <Paragraph position="7"> For the top 50% to 70% sentences, E_Semi is still lower, but not by much. The reason may be that some noisy documents are starting to be included. For the top 80% to 100% sentences, the performance shows no difference again. The reason may be that the lower ranking sentences may belong to other classes than those represented by summary sentences, and with these sentences included, the cluster boundaries become unclear in both cases.</Paragraph>
    </Section>
    <Section position="4" start_page="530" end_page="532" type="sub_section">
      <SectionTitle>
6.4 Evaluation of Ordering
</SectionTitle>
      <Paragraph position="0"> We used the same classification results to test the performance of our ordering algorithm HO as well as MO and PO. Table 2 lists the Kendall's t coefficient values for the three algorithms (t_1).</Paragraph>
      <Paragraph position="1"> The value was averaged over 20 runs, and for each run, 10 summaries were randomly selected and the t score was averaged over summaries.</Paragraph>
      <Paragraph position="2"> Since a summary sentence tends to generalize  some sentences in the documents, we also tried to combine two or three consecutive sentences into one, and tested their ordering performance (t_2 and t_3) respectively.</Paragraph>
      <Paragraph position="3">  sentences harms the performance. To see why, we checked the classification results, and found that the pointwise entropies for two and three sentence combinations (for the top 40% sentence in each cluster) increase 12.4% and 18.2% respectively. This means that the cluster structure becomes less clear with two or three sentence combinations, which would lead to less similar sentences being clustered with summary sentences. This result also suggests that if the summary sentence subsumes multiple sentences in the documents, they tend to be not consecutive. Fig. 8 shows change of t scores with different number of sentences used for ordering, where x axis denotes top (1-x)*100% sentences in each cluster. The score was averaged over 20 runs, and for each run, 10 summaries were randomly selected and evaluated.</Paragraph>
      <Paragraph position="4">  &gt;=0.7) used for ordering, the performance decreases. The reason may be that with fewer and fewer sentences used, the result is deficient training data for the ordering. On the other hand, with more sentences used (x &lt;0.6), the performance also decreases. The reason may be that as more sentences are used, the noisy sentences could dominate the ordering. That's why we considered only the top 40% sentences in each cluster as training data for sentence reordering here.</Paragraph>
      <Paragraph position="5"> As an example, the following is a summary for a cluster of documents about Central American storms, in which the ordering is given manually. 1) A category 5 storm, Hurricane Mitch roared across the northwest Caribbean with 180 mph winds across a 350-mile front that devastated the mainland and islands of Central America. 2) Although the force of the storm diminished, at least 8,000 people died from wind, waves and flood damage.</Paragraph>
      <Paragraph position="6">  3) The greatest losses were in Honduras where some 6,076 people perished.</Paragraph>
      <Paragraph position="7"> 4) Around 2,000 people were killed in Nicaragua, 239 in El Salvador, 194 in Guatemala, seven in Costa Rica and six in Mexico. 5) At least 569,000 people were homeless across Central America. 6) Aid was sent from many sources (European Union, the UN, US and Mexico).</Paragraph>
      <Paragraph position="8"> 7) Relief efforts are hampered by extensive damage.  Compared with the manual ordering, our algorithm HO outputs the ordering [1, 3, 4, 2, 5, 6, 7]. In contrast, PO and MO created the orderings [1, 3, 4, 5, 6, 7, 2] and [1, 3, 2, 6, 4, 5, 7] respectively. In HO's output, sentence 2 was put in the wrong position. To check why this was so, we found that sentences in cluster 2 and cluster 3 (clusters containing sentence 2 or sentence 3) were very similar, and the size of cluster 3 was bigger than that of cluster 2. Also we found that sentences in cluster 4 mostly followed those in cluster 3. This may explain why the ordering [1, 3, 4] occurred. Due to the link between cluster 2 and cluster 1 or 3, sentence 2 followed sentence 4 in the ordering. In PO, sentence 2 was put at the end of the ordering, since it only considered the most recent selection when determining next, so cluster 1 would not be considered when determining the  4th position. This suggests that consideration of selection history does in fact help to group those related sentences more closely, although sentence 2 was ranked lower than expected in the example. In MO, we found sentence 2 was put immediately behind sentence 3. The reason was that, after sentence 1 and 3 were selected, the in-edges of the node representing cluster 2 became 0 in the cluster directed graph, and its in-out edge difference became the biggest among all nodes in the graph, so it was chosen. For similar reasons, sentence 6 was put behind sentence 2. This suggests that it may be difficult to consider the selection history in MO, since its selection is mainly based on the current status of clusters.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>