<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1071">
  <Title>Information Fusion in the Context of Multi-Document Summarization</Title>
  <Section position="6" start_page="554" end_page="555" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluation of multi-document summarization is difficult. First, we have not yet found an existing collection of human written summaries of multiple documents which could serve as a gold standard. We have begun a joint project with the Columbia Journalism School which will provide such data in the future. Second, methods used for evaluation of extraction-based systems are not applicable for a system which involves text regeneration. Finally, the manual effort needed to develop test beds and to judge sys- null tem output is far more extensive than for single document summarization; consider that a human judge would have to read many input articles (our largest test set contained 27 input articles) to rate the validity of a summary.</Paragraph>
    <Paragraph position="1"> Consequently, the evaluation that we performed to date is limited. We performed a quantitative evaluation of our content-selection component. In order to prevent noisy input from the theme construction component from skewing the evaluation, we manually constructed 26 themes, each containing 4 sentences on average. Far more training data is needed to tune the generation portion. While we have tuned the system to perform with minor errors on the manual set of themes we have created (the missing article in the fourth sentence of the summary in Figure 1 is an example), we need more robust input data from the theme construction component, which is still under development, to train the generator before beginning large scale testing. One problem in improving output is determining how to recover from errors in tools used in early stages of the process, such as the tagger and the parser.</Paragraph>
    <Section position="1" start_page="555" end_page="555" type="sub_section">
      <SectionTitle>
5.1 Intersection Component
</SectionTitle>
      <Paragraph position="0"> The evaluation task for the content selection stage is to measure how well we identify common phrases throughout multiple sentences.</Paragraph>
      <Paragraph position="1"> Our algorithm was compared against intersections extracted by human judges from each theme, producing 39 sentence-level predicate-argument structures. Our intersection algorithm identified 29 (74%) predicate-argument structures and was able to identify correctly 69% of the subjects, 74% of the main verbs, and 65% of the other constituents in our list of model predicate-argument structures. We present system accuracy separately for each category, since identifying a verb or a subject is, in most cases, more important than identifying other sentence constituents.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>