<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1046">
<Title>Statistical Models for Topic Segmentation</Title>
<Section position="3" start_page="357" end_page="357" type="intro">
<SectionTitle> 2 Previous Work </SectionTitle>
<Paragraph position="0"> Much research has been devoted to the task of structuring text, that is, dividing texts into units based on information within the text. This work falls roughly into two categories. Topic segmentation focuses on identifying topically coherent blocks of text ranging from several sentences to several paragraphs in length (e.g. see Hearst, 1994). The prime motivation for identifying such units is to improve performance on language-processing or IR tasks. Discourse segmentation, on the other hand, is often finer-grained and focuses on identifying relations between utterances (e.g. Grosz and Sidner, 1986 or Hirschberg and Grosz, 1992).</Paragraph>
<Paragraph position="1"> Many topic segmentation algorithms have been proposed in the literature. There is not enough space to review them all here, so we will focus on describing a representative sample that covers most of the features used to predict the location of boundaries. See (Reynar, 1998) for a more thorough review.</Paragraph>
<Paragraph position="2"> Youmans devised a technique called the Vocabulary Management Profile based on the location of first uses of word types. He posited that large clusters of first uses frequently followed topic boundaries, since new topics generally introduce new vocabulary items (Youmans, 1991).</Paragraph>
<Paragraph position="3"> Morris and Hirst developed an algorithm (Morris and Hirst, 1991) based on lexical cohesion relations (Halliday and Hasan, 1976). They used Roget's 1977 Thesaurus to identify synonyms and other cohesion relations.</Paragraph>
<Paragraph position="4"> Kozima defined a measure called the Lexical Cohesion Profile (LCP) based on spreading activation within a semantic network derived from a machine-readable dictionary. He identified topic boundaries where the LCP score was low (Kozima, 1993).</Paragraph>
<Paragraph position="5"> Hearst developed a technique called TextTiling that automatically divides expository texts into multi-paragraph segments using the vector space model from IR (Hearst, 1994). Topic boundaries were positioned where the similarity between the blocks of text before and after the boundary was low.</Paragraph>
<Paragraph position="6"> In previous work (Reynar, 1994), we described a method of finding topic boundaries using an optimisation algorithm based on word repetition that was inspired by a visualization technique known as dotplotting (Helfman, 1994).</Paragraph>
<Paragraph position="7"> Ponte and Croft predicted topic boundaries using a model of likely topic length and a query expansion technique called Local Content Analysis that maps sets of words into a space of concepts (Ponte and Croft, 1997).</Paragraph>
<Paragraph position="8"> Richmond, Smith and Amitay designed an algorithm for topic segmentation that weighted words based on their frequency within a document and subsequently used these weights in a formula based on the distance between repetitions of word types (Richmond et al., 1997).</Paragraph>
<Paragraph position="9"> Beeferman, Berger and Lafferty used the relative performance of two statistical language models and cue words to identify topic boundaries (Beeferman et al., 1997).</Paragraph>
</Section>
</Paper>