File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-3805_metho.xml
Size: 6,098 bytes
Last Modified: 2025-10-06 14:11:01
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3805"> <Title>A Study of Two Graph Algorithms in Topic-driven Summarization</Title> <Section position="4" start_page="0" end_page="29" type="metho"> <SectionTitle> 3 Data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="29" type="sub_section"> <SectionTitle> 3.1 Topics </SectionTitle> <Paragraph position="0"> We work with a list of topics from the test data in the DUC 2005 challenge. A topic has an identi er, category (general/speci c), title and a sequence of statements or questions, for example:</Paragraph> </Section> <Section position="2" start_page="29" end_page="29" type="sub_section"> <SectionTitle> d307bspecific New Hydroelectric Projects </SectionTitle> <Paragraph position="0"> What hydroelectric projects are plannedor in progress and what problems are associated with them? We apply MiniPar to the titles and contents of the topics, and to all documents. The output is post-processed to produce dependency pairs only for open-class words. The dependency pairs bypass prepositions and subordinators/coordinators between clauses, linking the corresponding open-class words. After post-processing, the topic will be represented like this:</Paragraph> </Section> </Section> <Section position="5" start_page="29" end_page="29" type="metho"> <SectionTitle> QUESTION NUMBER: d307bLIST OF WORDS: </SectionTitle> <Paragraph position="0"> associate, hydroelectric, in, plan, problem, progress, project, new, themLIST OF PAIRS: relation(project, hydroelectric)relation(project, new) relation(associate, problem)relation(plan, project) relation(in, progress)relation(associate, them) The parser does not always produce perfect parses. In this example it did not associate the phrase in progress with the noun projects, so we missed the connection between projects and progress.</Paragraph> <Paragraph position="1"> In the next step, we expand each open-class word in the topic with all its WordNet synsets and one-step hypernyms and hyponyms. We have two variants of the topic le: with all open-class words from the topic description Topicsall, and only with nouns and verbs TopicsNV .</Paragraph> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 3.2 Documents </SectionTitle> <Paragraph position="0"> For each topic, we summarize a collection of up to 50 news items. In our experiments, we build a le with all documents for a given topic, one sentence per line, cleaned of XML tags. We process each le with MiniPar, and post-process the output similarly to the topics. For documents we keep the list of dependency relations but not a separate list of words.</Paragraph> <Paragraph position="1"> This processing also gives one le per topic, each sentence followed by its list of dependency relations.</Paragraph> </Section> <Section position="2" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 3.3 Summary Content Units </SectionTitle> <Paragraph position="0"> The DUC 2005 summary evaluation included an analysis based on Summary Content Units. SCUs are manually-selected topic-speci c summary-worthy phrases which the summarization systems are expected to include in their output (Nenkova and Passonneau, 2004; Copeck and Szpakowicz, 2005).</Paragraph> <Paragraph position="1"> The SCUs for 20 of the test topics became available after the challenge. 
<Section position="2" start_page="29" end_page="29" type="sub_section">
<SectionTitle> 3.2 Documents </SectionTitle>
<Paragraph position="0"> For each topic, we summarize a collection of up to 50 news items. In our experiments, we build a file with all documents for a given topic, one sentence per line, cleaned of XML tags. We process each file with MiniPar, and post-process the output in the same way as the topics. For documents we keep the list of dependency relations but not a separate list of words.</Paragraph>
<Paragraph position="1"> This processing also gives one file per topic, with each sentence followed by its list of dependency relations.</Paragraph>
</Section>
<Section position="3" start_page="29" end_page="29" type="sub_section">
<SectionTitle> 3.3 Summary Content Units </SectionTitle>
<Paragraph position="0"> The DUC 2005 summary evaluation included an analysis based on Summary Content Units. SCUs are manually selected, topic-specific, summary-worthy phrases which the summarization systems are expected to include in their output (Nenkova and Passonneau, 2004; Copeck and Szpakowicz, 2005).</Paragraph>
<Paragraph position="1"> The SCUs for 20 of the test topics became available after the challenge. We use the SCU data to measure the performance of our graph-matching and path-search algorithms: the total number, the weight and the number of unique SCUs per summary, and the number of negative SCU sentences, explicitly marked as not relevant to the summary.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="29" end_page="29" type="metho">
<SectionTitle> 4 Algorithms </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="29" end_page="29" type="sub_section">
<SectionTitle> 4.1 Topic-sentence graph matching (GM) </SectionTitle>
<Paragraph position="0"> We treat a sentence and a topic as graphs. The nodes are the open-class words in the sentence or topic (we also refer to them as keywords), and the edges are the dependency relations extracted from MiniPar's output. In order to maximize the matching score, we replace a word wS in the sentence with wQ from the query if wS appears in the WordNet expansion of wQ.</Paragraph>
<Paragraph position="1"> To score a match between a sentence and a topic, we compute and then combine two partial scores: SN, the node match score, which measures the node (keyword) overlap between the two text units (a keyword counts as many times as the number of dependency pairs it appears with in the document sentence); and SE, the edge match score, which measures the edge (dependency relation) overlap.</Paragraph>
<Paragraph position="2"> The overall score is S = SN + WeightFactor * SE, where WeightFactor takes values in {0, 1, 2, ..., 15, 20, 50, 100}. Varying the weight factor allows us to find the combinations of node and edge match scores which work best for sentence extraction in summarization.</Paragraph>
<Paragraph position="3"> When WeightFactor = 0, the sentence scores correspond to keyword counts.</Paragraph>
</Section>
<Section position="2" start_page="29" end_page="29" type="sub_section">
<SectionTitle> 4.2 Path search for topic keyword pairs (PS) </SectionTitle>
<Paragraph position="0"> Here too we look at sentences as graphs. We take only the list of words from the topic representation.</Paragraph>
<Paragraph position="1"> For each pair of those words, we check whether they both appear in the sentence and are connected in the sentence graph. We use the list of WordNet-expanded terms again, to maximize matching. The final score for the sentence has two components: the node match score SN, and SP, the number of word pairs from the topic description connected by a path in the sentence graph. The final score is S = SN + WeightFactor * SP, where WeightFactor, in the same range as previously, is meant to boost the contribution of the path score towards the final score of the sentence.</Paragraph>
</Section>
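<Paragraph position="1"> A minimal Python sketch of the two scoring schemes, assuming that a sentence and a topic are each given as a list of dependency pairs over open-class words; the helper names (node_score, edge_score, path_score, connected) are illustrative, and the WordNet-based replacement of sentence words is omitted for brevity:
from collections import deque

def node_score(sentence_pairs, topic_words):
    # SN: a matched keyword counts once per dependency pair it occurs in.
    return sum(1 for a, b in sentence_pairs for w in (a, b) if w in topic_words)

def edge_score(sentence_pairs, topic_pairs):
    # SE: dependency-relation overlap, pairs compared without regard to order.
    sentence_edges = {frozenset(p) for p in sentence_pairs}
    return sum(1 for p in topic_pairs if frozenset(p) in sentence_edges)

def connected(sentence_pairs, u, v):
    # Breadth-first search over the undirected sentence graph.
    graph = {}
    for a, b in sentence_pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {u}, deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

def path_score(sentence_pairs, topic_words):
    # SP: topic keyword pairs joined by some path in the sentence graph.
    words = sorted(set(topic_words))
    return sum(1 for i, u in enumerate(words) for v in words[i + 1:]
               if connected(sentence_pairs, u, v))

def gm_score(sentence_pairs, topic_words, topic_pairs, weight_factor):
    # Graph matching (GM): S = SN + WeightFactor * SE.
    return (node_score(sentence_pairs, topic_words)
            + weight_factor * edge_score(sentence_pairs, topic_pairs))

def ps_score(sentence_pairs, topic_words, weight_factor):
    # Path search (PS): S = SN + WeightFactor * SP.
    return (node_score(sentence_pairs, topic_words)
            + weight_factor * path_score(sentence_pairs, topic_words))
With weight_factor = 0, both scores reduce to the keyword count SN, as noted in Section 4.1.</Paragraph>
</Section>
</Paper>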