<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1011">
  <Title>Automatic Title Generation for Spoken Broadcast News</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. THE CONTRASTIVE TITLE
GENERATION EXPERIMENT
</SectionTitle>
    <Paragraph position="0"> In this section we describe the experiment and present the results.</Paragraph>
    <Paragraph position="1"> Section 2.1 describes the data. Section 2.2 discusses the evaluation method. Section 2.3 gives a detailed description of all the methods that were compared. Results and analysis are presented in Section 2.4.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Data Description
</SectionTitle>
      <Paragraph position="0"> In our experiment, the training set, consisting of 21190 perfectly transcribed documents, was obtained from the CNN web site during 1999. Included with each training document text was a human-assigned title. The test set, consisting of 1006 CNN TV news story documents from the same year (1999), was randomly selected from the Informedia Digital Video Library. Each test document has a closed-captioned transcript, an alternative transcript generated with the CMU Sphinx speech recognition system using a 64,000-word broadcast news language model, and a human-assigned title.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> First, we evaluate the titles generated by the different approaches using the F1 metric. For an automatically generated title Tauto, F1 is measured against the corresponding human-assigned title Thuman as follows: F1 = (2 x precision x recall) / (precision + recall). Here, precision is the number of words shared by Tauto and Thuman divided by the number of words in Tauto, and recall is the number of shared words divided by the number of words in Thuman. Note that this metric ignores the sequential order of the generated title words.</Paragraph>
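The F1 computation above can be sketched as follows; `title_f1` is a hypothetical helper (not from the paper) that treats titles as bags of words, matching the order-insensitive definition given:

```python
from collections import Counter

def title_f1(auto_title, human_title):
    """F1 between a generated and a reference title, ignoring word order.

    precision = overlap / len(T_auto), recall = overlap / len(T_human),
    where overlap counts words common to both titles (with multiplicity).
    """
    auto = Counter(auto_title.lower().split())
    human = Counter(human_title.lower().split())
    overlap = sum(min(auto[w], human[w]) for w in auto)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(auto.values())
    recall = overlap / sum(human.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a 6-word hypothesis sharing 4 words with a 4-word reference scores precision 4/6 and recall 4/4, giving F1 = 0.8.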
      <Paragraph position="1"> To measure how well a generated title matches the original human-assigned title in terms of word order, we also counted the number of correct title words in the hypothesis titles that appeared in the same order as in the reference titles.</Paragraph>
      <Paragraph position="2"> We restricted all approaches to generating only six title words, the average number of title words in the training corpus. Stop words were removed from the training and testing documents as well as from the titles.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Description of the Compared Title
Generation Approaches
</SectionTitle>
      <Paragraph position="0"> The five different title generation methods are:
1. Naive Bayesian approach with limited vocabulary (NBL). This approach tries to capture the correlation between the words in a document and the words in its title. For each document word DW, it counts the occurrences of title words identical to DW and applies these statistics to the test documents to generate titles.
2. Naive Bayesian approach with full vocabulary (NBF). This approach relaxes the constraint of the previous one and counts all document-word-title-word pairs. These full statistics are then applied to generate titles for the test documents.
3. Term frequency, inverse document frequency approach (TF.IDF). TF is the frequency of a word in the document, and IDF is the logarithm of the total number of documents divided by the number of documents containing the word. The document words with the highest TF.IDF scores were chosen as title word candidates.</Paragraph>
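The TF.IDF selection in method 3 can be sketched as below; `tfidf_title_words` is an illustrative helper (not the authors' code) that scores each document word by raw term frequency times log-IDF and keeps the top n as title word candidates:

```python
import math
from collections import Counter

def tfidf_title_words(doc_words, corpus, n=6):
    """Pick the n document words with the highest TF.IDF as title candidates.

    doc_words: token list for one document; corpus: list of token lists.
    Stop words are assumed to have been removed already, as in the paper.
    """
    N = len(corpus)
    df = Counter()                      # document frequency of each word
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_words)            # term frequency in this document
    scores = {w: tf[w] * math.log(N / df[w]) for w in tf if df[w]}
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [w for w, _ in ranked[:n]]
```

Words appearing in every document get IDF log(1) = 0 and are effectively filtered out, which is the intended behavior of the weighting.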
      <Paragraph position="1"> 4. K nearest neighbor approach (KNN). This algorithm is similar to the KNN algorithm used for topic classification. It searches the training set for the document closest to the new document and assigns that training document's title to the new document.</Paragraph>
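A minimal sketch of the nearest-neighbor idea in method 4, assuming cosine similarity over term-frequency vectors as the closeness measure (the paper does not specify the distance function):

```python
import math
from collections import Counter

def knn_title(test_doc, training_docs):
    """Assign the title of the single closest training document (1-NN).

    training_docs: list of (token_list, title) pairs.
    Similarity is the cosine between raw term-frequency vectors.
    """
    def cosine(a, b):
        va, vb = Counter(a), Counter(b)
        dot = sum(va[w] * vb[w] for w in va)
        na = math.sqrt(sum(c * c for c in va.values()))
        nb = math.sqrt(sum(c * c for c in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    best_doc, best_title = max(training_docs,
                               key=lambda pair: cosine(test_doc, pair[0]))
    return best_title
```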
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Iterative Expectation-Maximization approach (EM). It
</SectionTitle>
    <Paragraph position="0"> views documents as written in a 'verbal' language and their titles as written in a 'concise' language. It builds a translation model between the 'verbal' language and the 'concise' language from the documents and titles in the training corpus, and 'translates' each test document into a title.</Paragraph>
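The translation-model estimation can be sketched with IBM Model 1-style EM over (document words, title words) pairs; this is a simplified illustration of the 'verbal'-to-'concise' idea, not the authors' exact model:

```python
from collections import defaultdict
from itertools import product

def train_translation_model(pairs, iterations=10):
    """Estimate t(title_word | doc_word) by EM, IBM Model 1 style.

    pairs: list of (doc_words, title_words) tuples from the training set.
    Returns a dict mapping (title_word, doc_word) to a probability.
    """
    # Uniform initialization over co-occurring word pairs.
    t = defaultdict(float)
    for doc, title in pairs:
        for d, w in product(doc, title):
            t[(w, d)] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)   # expected pair counts (E-step)
        total = defaultdict(float)   # expected doc-word totals
        for doc, title in pairs:
            for w in title:
                z = sum(t[(w, d)] for d in doc)   # normalization
                for d in doc:
                    c = t[(w, d)] / z
                    count[(w, d)] += c
                    total[d] += c
        for (w, d) in count:                       # M-step
            t[(w, d)] = count[(w, d)] / total[d]
    return t
```

Decoding a test document then amounts to picking the title words with the highest translation scores given the document's words.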
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 The sequentializing process for title word
candidates
</SectionTitle>
      <Paragraph position="0"> To generate an ordered set of candidates, equivalent to what we would expect to read from left to right, we built a statistical trigram language model with the SLM toolkit (Clarkson, 1997) using the 40,000 titles in the training set. This language model was used to determine the most likely order of the title word candidates generated by the NBL, NBF, EM and TF.IDF methods.</Paragraph>
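The reordering step can be sketched as an exhaustive search over permutations of the candidates, keeping the sequence the language model scores highest; `lm_score` here is a stand-in for the trigram model trained on the titles (the SLM toolkit itself is not reproduced):

```python
from itertools import permutations

def best_order(candidates, lm_score):
    """Return the candidate permutation the language model scores highest.

    lm_score: any function mapping a word sequence to a (log-)score.
    Exhaustive search is feasible for the 6-word titles used here (6! = 720).
    """
    return max(permutations(candidates), key=lm_score)
```

As a toy illustration, with a score that simply counts known bigrams, the candidates are rearranged into the most fluent order.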
    </Section>
  </Section>
</Paper>