<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1011">
  <Title>Automatic Title Generation for Spoken Broadcast News</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Creating a title for a document is a complex task. Generating a title for a spoken document is even more challenging, because we must also cope with the word errors introduced by speech recognition.</Paragraph>
    <Paragraph position="1"> Historically, the title generation task is closely connected to traditional summarization, since a title can be thought of as an extremely short summary. Traditional summarization has emphasized the extractive approach, selecting sentences or paragraphs from the document to form a summary. The weaknesses of this approach are its inability to take advantage of a training corpus and to produce summaries at very small compression ratios. It is therefore not well suited to the title generation task.</Paragraph>
    <Paragraph position="2"> More recently, some researchers have moved toward &quot;learning approaches&quot; that take advantage of training data. Witbrock and Mittal [1] used a Naive Bayesian approach to learn the correlation between document words and title words. However, they limited their statistics to cases where the document word and the title word are the same surface string. Hauptmann and Jin [2] extended this approach by relaxing that restriction. Treating title generation as a variant of the machine translation problem, Kennedy and Hauptmann [3] tried the iterative Expectation-Maximization algorithm. To avoid the difficulty of organizing selected title words into a human-readable sentence, Hauptmann [2] used the K nearest neighbors method to generate titles. In this paper, we put all these methods together and compare their performance over 1000 speech recognition documents.</Paragraph>
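As an illustration of the word-correlation idea behind these learning approaches, the sketch below estimates, from a toy training corpus, the probability that a title word appears given that a document word appears. All names and the toy corpus are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter

def train_correlations(corpus):
    """Estimate P(title word t appears | document word d appears)
    from (document_words, title_words) training pairs."""
    pair_counts = Counter()  # co-occurrence counts of (doc word, title word)
    doc_counts = Counter()   # number of documents containing each doc word
    for doc_words, title_words in corpus:
        for d in set(doc_words):
            doc_counts[d] += 1
            for t in set(title_words):
                pair_counts[(d, t)] += 1
    return {(d, t): c / doc_counts[d] for (d, t), c in pair_counts.items()}

# Toy corpus (illustrative): each pair is (document words, title words).
corpus = [
    (["stocks", "fell", "sharply", "today"], ["market", "drop"]),
    (["stocks", "rose", "today"], ["market", "gains"]),
]
probs = train_correlations(corpus)
```

Restricting the table to pairs where the document word and title word are the same surface string would recover the limited variant of [1]; keeping all pairs corresponds to the relaxed variant of [2].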
    <Paragraph position="3"> We decompose the title generation problem into two parts: learning and analysis from the training corpus and generating a sequence of title words to form the title.</Paragraph>
    <Paragraph position="4"> For learning and analysis of the training corpus, we compare five different learning methods: a Naive Bayesian approach with limited vocabulary; a Naive Bayesian approach with full vocabulary; K nearest neighbors; an iterative Expectation-Maximization approach; and a term frequency-inverse document frequency (TF-IDF) method. More details of each approach are presented in Section 2.</Paragraph>
    <Paragraph position="5"> For the generation part, we decompose the issues involved as follows: choosing appropriate title words, deciding how many title words are appropriate for the document title, and finding the correct sequence of title words that forms a readable title 'sentence'.</Paragraph>
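The first two of these generation issues can be sketched as follows: score each candidate title word by its summed learned correlation with the document words, then keep the k highest-scoring candidates. Ordering them by score is only a crude stand-in for the word-sequencing step; the correlation table and all names are illustrative assumptions:

```python
def generate_title(doc_words, correlations, k=2):
    """Pick k candidate title words for a document.

    correlations maps (doc_word, title_word) to an estimated
    P(title_word | doc_word), as produced by a learning phase.
    """
    scores = {}
    for d in set(doc_words):
        for (dw, tw), p in correlations.items():
            if dw == d:
                scores[tw] = scores.get(tw, 0.0) + p
    # Keep the k best-scoring title words; a real system would
    # order them with a language model rather than by score.
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    return " ".join(chosen)

# Illustrative correlation table, e.g. from a prior training phase.
correlations = {
    ("stocks", "market"): 0.9,
    ("fell", "drop"): 0.8,
    ("today", "market"): 0.3,
}
title = generate_title(["stocks", "fell", "today"], correlations)
```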
    <Paragraph position="6"> The outline of this paper is as follows: Section 1 has given an introduction to the title generation problem. The details of the experiment and the analysis of results are presented in Section 2.</Paragraph>
    <Paragraph position="7"> Section 3 discusses our conclusions drawn from the experiment and suggests possible improvements.</Paragraph>
  </Section>
</Paper>