<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2027">
  <Title>Machine Learning Approach To Augmenting News Headline Generation</Title>
  <Section position="2" start_page="0" end_page="155" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper we present an approach to headline generation for a single document. This headline generation task was added to the annual summarisation evaluation in the Document Understanding Conference (DUC) 2003. It was also included in the DUC 2004 evaluation plan where summary quality was automatically judged using a set of n-gram word overlap metrics called ROUGE (Lin and Hovy, 2003).</Paragraph>
    <Paragraph position="1"> Eighteen research groups participated in the headline generation task at DUC 2004 (Task 1: very short summary generation). The Topiary system was the top-performing headline system at DUC 2004. It generated headlines by combining a set of topic descriptors with a compressed version of the lead sentence, e.g.</Paragraph>
    <Paragraph position="2"> KURDISH TURKISH SYRIA: Turkey sent 10,000 troops to southeastern border. These topic descriptors were automatically identified using a statistical approach called Unsupervised Topic Discovery (UTD) (Zajic et al., 2004). The disadvantage of this technique is that it identifies meaningful topic descriptors only if it is trained on the corpus containing the news stories that are to be summarised. In addition, the corpus must contain clusters of related news stories to ensure that reliable co-occurrence statistics are generated.</Paragraph>
    <Paragraph position="3"> In this paper we compare the UTD method with an alternative topic label identifier that can be trained on an auxiliary news corpus, and observe the effect of these labels on summary quality when combined with compressed lead sentences. Our topic labeling technique combines linguistic and statistical information about terms using the C5.0 (Quinlan, 1998) machine learning algorithm to predict which words in the source text should be included in the resultant gist alongside the compressed lead sentence. We compare the performance of this system, HybridTrim, with the Topiary system and a number of other baseline gisting systems on a collection of news documents from the DUC 2004 corpus (DUC, 2003).</Paragraph>
  </Section>
</Paper>