<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0509">
  <Title>A Survey for Multi-Document Summarization</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Document Sets
</SectionTitle>
    <Paragraph position="0"> First, we describe how we collected our 100 multi-document sets. We found that the topics of the DUC multi-document data are somewhat biased because the data are pre-filtered for evaluation purposes; that is, DUC document sets are carefully chosen as described in the guidelines. Such pre-filtering is useful for evaluation, but it does not necessarily reflect the distribution of user needs or the distribution of topics in the news. We wanted to obtain more balanced document sets, so we adopted the procedure described below. The entire experiment was done using a Japanese newspaper corpus (Mainichi 1998 and 1999).</Paragraph>
    <Paragraph position="1">
• Select an article randomly from the corpus (the seed).
• Choose keywords from each article. Keywords are all nouns with frequency greater than 1, except for some special types of nouns.
• Use the Dice coefficient to retrieve articles similar to the seed article. Gather all documents that have a coefficient greater than 0.5.</Paragraph>
    <Paragraph position="2">
• Select article sets that have more than 3 articles. About 300 such sets were obtained; from them, we selected 100 document sets, preferring sets with more documents and avoiding overlapping topics.</Paragraph>
    <Paragraph position="3"> The average number of articles in a document set was 4.7, and the average number of sentences in a document was 12.9. Annotators read the articles in each set and identified any articles whose topic differed from that of the rest of the set. Such articles, which turned out to be very few in number, were excluded from the following experiments.</Paragraph>
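    <Paragraph position="4"> As a rough illustration only, and not the implementation used in this work, the selection procedure of this section might be sketched in Python as follows. The sketch assumes that each article has already been reduced to a list of its nouns, that the Dice coefficient is computed over keyword sets, and that the seed counts toward the size of a set. The function names and the excluded parameter are hypothetical; none of these choices is specified in the text above.

```python
from collections import Counter


def keywords(nouns, excluded=frozenset()):
    """Keep nouns occurring more than once, minus an assumed exclusion set."""
    counts = Counter(nouns)
    return {n for n, c in counts.items() if c > 1 and n not in excluded}


def dice(a, b):
    """Set-based Dice coefficient: 2 * |intersection| / (|a| + |b|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a.intersection(b)) / (len(a) + len(b))


def build_set(seed_id, article_nouns, threshold=0.5, min_articles=3):
    """Gather articles similar to the seed; return None if the set is too small.

    article_nouns maps an article id to the list of nouns in that article
    (an assumed input format; noun extraction itself is not shown here).
    """
    kw = {aid: keywords(nouns) for aid, nouns in article_nouns.items()}
    members = [seed_id] + [
        aid
        for aid in article_nouns
        if aid != seed_id and dice(kw[seed_id], kw[aid]) > threshold
    ]
    # "More than 3 articles": assumed here to include the seed itself.
    return members if len(members) > min_articles else None
```

    A seed could be drawn with random.choice(list(article_nouns)) and passed to build_set; repeating this for many seeds and discarding the None results would yield candidate sets, from which the final sets are then chosen by hand as described above.</Paragraph>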
  </Section>
</Paper>