
<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1046">
  <Title>Statistical Models for Topic Segmentation</Title>
  <Section position="6" start_page="360" end_page="362" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We will present results for broadcast news data and for identifying chapter boundaries labelled by authors.</Paragraph>
    <Section position="1" start_page="360" end_page="361" type="sub_section">
      <SectionTitle>
5.1 HUB-4 Corpus Performance
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the results of segmenting the test portion of the HUB-4 coqgus, which consisted of transcribed broadcasts divided into segments by the LDC. We measured performance by comparing our segmentation to the gold standard annotation produced by the LDC.</Paragraph>
      <Paragraph position="1"> The row labelled Random guess shows the performance of a baseline algorithm that randomly guessed boundary locations with probability equal to the fraction of possible boundary sites that were boundaries in the gold standard. The row TextTiling shows the performance of the publicly available version of that algorithm (Hearst, 1994). Optimization is the algorithm we proposed in (Reynar, 1994). Word frequency and Max. Ent.</Paragraph>
      <Paragraph position="2"> Model are the algorithms we described above. Our word frequency algorithm does better than chance, TextTiling and our previous work and our maximum entropy model does better still. See (Reynar, 1998) for graphs showing the effects of trading precision for recall with these models.</Paragraph>
      <Paragraph position="3">  We also tested our models on speech-recognized broadca.sts from the 1997 TREC spoken document retrieval corpus. We did not have sufficient data to train the maximum entropy model, but our word frequency algorithm achieved precision of 0.36 and recall of 0.52, considerably better, than the baseline of 0.19 precision and recall. Using manually produced transcripts of the same data naturally yielded better performance--precision was 0.50 and.</Paragraph>
      <Paragraph position="4"> recall 0.58.</Paragraph>
      <Paragraph position="5"> Our performance on broadcast data was surprisingly good considering we trained the word frequency model from newswire data.</Paragraph>
      <Paragraph position="6"> Given a large corpus of broadcast data, we expect our algorithms would perform even better.</Paragraph>
      <Paragraph position="7"> We were curious, however, how much of the performance was attributable to having numerous parameters (3 per word) in the G model and how much comes from the nature of the model. To address this, we discarded the or, ~, and B parameters particular to each word and instead used the same parameter values for each word-namely, those assigned to unknown words through our smoothing process. This reduced the number of parameters from 3 .per word to only 3 parameters total. Performance of this hobbled version of our word frequency algorithm was so good on the HUB-4 English corpuswachieving precision of 0.42 and recall of 0.50---that we tested it on Spanish broadcast news data from the HUB-4 corpus. Even for that corpus we found much better than baseline performance. Baseline for Spanish was precision and recall of 0.28, yet our 3-parameter word frequency model achieved 0.50 precision and recall of 0.62. To reiterate, we used our word frequency model with a total of 3 parameters trained from English newswire text to segment Spanish broadcast news data We believe that the G model, which captures the notion of burstiness very well, is a good model for segmentation. However, the more important lesson from this work is that the concept of burstiness alone can be used to segment texts.</Paragraph>
      <Paragraph position="8"> Segmentation performance is better when models have accurate measures of the likelihood of 0, 1 and 2 or more occurrences of a word. However, the mere fact that content words are bursty and are relatively unlikely to appear in neighboring regions of a document unless those two regions are about the same topic is sufficient to segment many texts. This explains our ability to segment Spanish broadcast news using a 3 parameter model trained from English newswire data.</Paragraph>
    </Section>
    <Section position="2" start_page="361" end_page="362" type="sub_section">
      <SectionTitle>
5.2 Recovering Authorial Structure
</SectionTitle>
      <Paragraph position="0"> Authors endow some types of documents with structure as they write. They may divide documents into chapters, chapters into sections, sections into subsections and so forth. We exploited these structures to evalUate topic segmentation techniques by comparing algorithmic determinations of structure to the author's original divisions. This method of evaluation is especially useful because numerous documents are now available in electronic form.</Paragraph>
      <Paragraph position="1"> We tested our word frequency algorithm on four randomly selected texts from Project Gutenberg.</Paragraph>
      <Paragraph position="2"> The four texts were Thomas Paine's pamphlet Common Sense which was published in 1791, the first .volume of Decline and Fall of the Roman Empire by Edward Gibbon, G.K. Chesterton's book Orthodoxy. and Herman Melville's classic Moby Dick. We permitted the algorithm to guess boundaries only between paragraphs, which were marked by blank lines in each document.</Paragraph>
      <Paragraph position="3"> To assess performance, we set the number of boundaries to be guessed to the number the authors themselves had identified. As a result, this evaluation focuses solely on the algorithm's ability to rank candidate boundaries and not on its adeptness at determining how many boundaries to select. To evaluate performance, we computed the accuracy of the algorithm's guesses compared to the chapter boundaries the authors identified. The documents we used for this evaluation may have contained legitimate topic boundaries which did not correspond to chapter boundaries, but we scored guesses at those boundaries incorrect.</Paragraph>
      <Paragraph position="4"> Table 4 presents results for the four works. Our algorithm performed better than randomly assigning boundaries for each of the documents except the pamphlet Common Sense. Performance on the other three works was significantly better than chance and ranged from an improvement of a factor of three in accuracy over the baseline to a factor of nearly 9 for the lengthy Decline and Fall of the Roman Empire.</Paragraph>
    </Section>
    <Section position="3" start_page="362" end_page="362" type="sub_section">
      <SectionTitle>
5.3 IR Task Performance
</SectionTitle>
      <Paragraph position="0"> The data from the HUB-4 corpus was also used for the TREC Spoken document retrieval task.</Paragraph>
      <Paragraph position="1"> We tested the utility of our segmentations by comparing IR performance when we indexed documents, the segments annotated by the LDC and the segments identified by our algorithms.</Paragraph>
      <Paragraph position="2"> We modified SMART (Buckley, 1985) to perform better normalization for variations in document length (Singhal et al., 1996) prior to conducting our IR experiments.</Paragraph>
      <Paragraph position="3"> This IR task is atypical in that there is only 1 relevant document in the collection for each query. Consequently, performance is measured by determining the average rank determined by the IR system for the document relevant to each query. Perfect performance would be an average rank of 1, hence lower average ranks are better. Table 5 presents our results. Note that indexing the segments identified by our algorithms was better than indexing entire documents and that our best algorithm even outperformed indexing the gold standard annotation produced by the  numbers are better.</Paragraph>
      <Paragraph position="4"> Conclusion We described two new algorithms for topic segmentation. The first, based solely on word frequency, performs better than previous algorithms on broadcast news data. It performs well on speech recognized English despite recognition errors. Most surprisingly, a version of our first model that requires little training data could segment Spanish broadcast news documents as well---even with parameters estimated from English documents. Our second technique, a statistical model that combined numerous clues about segmentation, performs better than the first, but requires segmented training data.</Paragraph>
      <Paragraph position="5"> We showed an improvement on a simple IR task to demonstrate the potential of topic segmentation algorithms for improving IR. Other potential uses of these algorithms include better language modeling by building topic-based language models, improving NLP algorithms (e.g.</Paragraph>
      <Paragraph position="6"> coreference resolution), summarization, hypertext linking (Salton and Buckley, 1992), automated essay grading (Burstein et al., 1997) and topic detection and tracking (TDT program committee, 1998). Some of these are discussed in (Reynar, 1998), and others will be addressed in future work.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>