File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-2020_intro.xml
Size: 2,284 bytes
Last Modified: 2025-10-06 14:03:43
<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2020">
<Title>Topic-Focused Multi-document Summarization Using an Approximate Oracle Score</Title>
<Section position="4" start_page="152" end_page="152" type="intro">
<SectionTitle>
2 The Data
</SectionTitle>
<Paragraph position="0"> The 2005 Document Understanding Conference (DUC 2005) data used in our experiments is partitioned into 50 topic sets, each containing 25-50 documents. A topic for each set was intended to mimic a real-world complex question-answering task for which the answer could not be given in a short &quot;nugget.&quot; For each topic, four human summarizers were asked to provide a 250-word summary of the topic. Topics were labeled as either &quot;general&quot; or &quot;specific&quot;. We present an example of each category.</Paragraph>
<Paragraph position="1">
Set d408c
Granularity: Specific
Title: Human Toll of Tropical Storms
Narrative: What has been the human toll in death or injury of tropical storms in recent years? Where and when have each of the storms caused human casualties? What are the approximate total number of casualties attributed to each of the storms?

Set d436j
Granularity: General
Title: Reasons for Train Wrecks
Narrative: What causes train wrecks and what can be done to prevent them? Train wrecks are those events that result in actual damage to the trains themselves, not just accidents where people are killed or injured.
</Paragraph>
<Paragraph position="2"> For each topic, the goal is to produce a 250-word summary. The basic unit we extract from a document is a sentence.</Paragraph>
<Paragraph position="3"> To prepare the data for processing, we segment each document into sentences using a POS (part-of-speech) tagger, NLProcessor (http://www.infogistics.com/posdemo.htm). The newswire documents in the DUC 2005 data have markers indicating the regions of the document, including titles, bylines, and text portions. All of the extracted sentences in this study are taken from the text portions of the documents only.</Paragraph>
<Paragraph position="4"> We define a &quot;term&quot; to be any &quot;non-stop word.&quot; Our stop list contains the 400 most frequently occurring English words.</Paragraph>
</Section>
</Paper>
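<!--
The preprocessing described above can be sketched in a few lines of Python. This is
a minimal illustration under stated assumptions, not the pipeline used in the paper:
NLProcessor is a commercial segmenter, so a naive regex splitter stands in for it;
the stop-list file name ("stoplist.txt") is hypothetical; and the <TEXT> region
markers are an assumption based on standard DUC newswire markup.

import re

def load_stoplist(path="stoplist.txt"):
    # Stop list: the 400 most frequently occurring English words, one per line.
    # The file name is a hypothetical stand-in.
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def text_portions(sgml):
    # Keep only the text regions of a newswire document, skipping titles,
    # bylines, and other marked regions (assumes standard <TEXT> markers).
    return re.findall(r"<TEXT>(.*?)</TEXT>", sgml, flags=re.DOTALL)

def split_sentences(text):
    # Naive regex sentence segmentation; a stand-in for NLProcessor.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def terms(sentence, stoplist):
    # A "term" is any non-stop word in the sentence.
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in stoplist]

# Usage: extract candidate sentences and their terms from one document.
stoplist = load_stoplist()
with open("duc2005/d408c/doc01") as f:  # hypothetical path
    doc = f.read()
for portion in text_portions(doc):
    for sentence in split_sentences(portion):
        print(terms(sentence, stoplist))
-->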