<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1046"> <Title>Statistical Models for Topic Segmentation</Title>
<Section position="4" start_page="357" end_page="359" type="metho"> <SectionTitle> 3 New Clues for Topic Segmentation </SectionTitle>
<Paragraph position="0"> Prior work on topic segmentation has exploited many different hints about where topic boundaries lie. The algorithms we present use many cues from the literature as well as novel ones. Our approach is statistical in nature and weights evidence based on its utility in segmenting a training corpus. As a result, we do not use clues to form hard and fast rules. Instead, they all contribute evidence used to either increase or decrease the likelihood of proposing a topic boundary between two regions of text.</Paragraph>
<Section position="1" start_page="357" end_page="357" type="sub_section"> <SectionTitle> 3.1 Domain-specific Cue Phrases </SectionTitle>
<Paragraph position="0"> Many discourse segmentation techniques (e.g. Hirschberg and Litman, 1993) as well as some topic segmentation algorithms rely on cue words and phrases (e.g. Beeferman et al., 1997), but the types of cue words used vary greatly. Those we employ are highly domain specific. Taking an example from the broadcast news domain, where we will demonstrate the effectiveness of our algorithms, the phrase joining us is a good indicator that a topic shift has just occurred because news anchors frequently say things such as joining us to discuss the crisis in Kosovo is Congressman... when beginning new stories.</Paragraph>
<Paragraph position="1"> Consequently, our algorithms use the presence of phrases such as this one to boost the probability of a topic boundary having occurred.</Paragraph>
<Paragraph position="2"> Table 1: A sampling of the cue phrases we employ: joining us; good evening; brought to you by; this just in; welcome back; <person name> <station>; this is <person name>.</Paragraph>
<Paragraph position="3"> Some cue phrases are more complicated and contain word sequences of particular types. Not surprisingly, the phrase this is is common in broadcast news. When it is followed by a person's name, however, it serves as a good clue that a topic is about to end. This is <person name> is almost always said when a reporter is signing off after finishing an on-location report. Generally such signoffs are followed by the start of new news stories. A sampling of the cue phrases we use is found in Table 1. Since our training corpus was relatively small, we identified these by hand, but on a different corpus we induced them automatically (Reynar, 1998). The results we present later in the paper rely solely on manually identified cue phrases.</Paragraph>
<Paragraph position="4"> Identifying complex cue phrases involves pattern matching and determining whether particular word sequences belong to various classes. To address this, we built a named entity recognition system in the spirit of those used for the Message Understanding Conference evaluations (e.g. Bikel et al., 1997). Our named entity recognizer used a maximum entropy model, built with Adwait Ratnaparkhi's tools (Ratnaparkhi, 1996), to label word sequences as either person, place, company or none of the above based on local cues including the surrounding words and whether honorifics (e.g. Mrs. or Gen.) or corporate designators (e.g. Corp. or Inc.) were present.</Paragraph>
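To make the template matching concrete, the following is a minimal sketch (our illustration, not the paper's code) of how cue phrases, including templated ones such as this is <person name>, could be matched against a token stream once the named entity recognizer has labeled each token. The tag set follows the labels described above; all function and variable names are our own assumptions.

```python
# A minimal sketch (ours, not the paper's code) of cue-phrase matching.
# Tokens are assumed lowercased; ne_tags holds one label per token
# ("person", "place", "company" or "none"), as produced by a named
# entity recognizer like the one described above.

BOUNDARY_CUES = [
    ["joining", "us"],
    ["good", "evening"],
    ["brought", "to", "you", "by"],
    ["this", "just", "in"],
    ["welcome", "back"],
    ["this", "is", "<person>"],  # template slot filled by the NE tagger
]

def cue_matches(tokens, ne_tags, start, cue):
    """True if the cue pattern matches tokens[start:start+len(cue)]."""
    for offset, item in enumerate(cue):
        i = start + offset
        if i >= len(tokens):
            return False
        if item == "<person>":
            if ne_tags[i] != "person":
                return False
        elif tokens[i] != item:
            return False
    return True

def find_cues(tokens, ne_tags):
    """Yield (position, cue) for every cue-phrase occurrence in the text."""
    for i in range(len(tokens)):
        for cue in BOUNDARY_CUES:
            if cue_matches(tokens, ne_tags, i, cue):
                yield i, cue
```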
Our algorithm's labelling accuracy of 96.0% by token was sufficient for our purposes, but performance is not directly comparable to the MUC competitors'. Though we trained from the same data, we preprocessed the data to remove punctuation and capitalization so the model could be applied to broadcast news data that lacked these helpful clues. We separately identified television network acronyms using simple regular expressions.</Paragraph> </Section> <Section position="2" start_page="357" end_page="357" type="sub_section"> <SectionTitle> 3.2 Word Bigram Frequency </SectionTitle> <Paragraph position="0"> Many topic segmentation algorithms in the literature use word frequency (e.g. Hearst, 1994; Reynar, 1994; Beeferman et al., 1997). An obvious extension to using word frequency is to use the frequency of multi-word phrases. Such phrases are useful because they approximate word sense disambiguation techniques. Algorithms that rely exclusively on word frequency might be fooled into suggesting that two stretches of text containing the word plant were part of the same story simply because of the rarity of plant and the low odds that two adjacent stories contained it due to chance. However, if plant in one section participated in bigrams such as wild plant, native plant and woody plant but in the other section was only in the bigrams chemical plant, manufacturing plant and processing plant, the lack of overlap between sets of bigrams could be used to decrease the probability that the two sections of text were in the same story. We limited the bigrams we used to those containing two content words.</Paragraph> </Section> <Section position="3" start_page="357" end_page="357" type="sub_section"> <SectionTitle> 3.3 Repetition of Named Entities </SectionTitle> <Paragraph position="0"> The named entities we identified for use in cue phrases are also good indicators of whether two sections are likely to be in the same story or not.</Paragraph> <Paragraph position="1"> Companies, people and places figure prominently in many documents, particularly those in the domain of broadcast news. The odds that different stories discuss the same entities are generally low.</Paragraph> <Paragraph position="2"> There are obviously exceptions--the President of the U.S. may figure in many stories in a single broadcast--but nonetheless the presence of the same entities in two blocks of text suggest that they are likely to be part of the same story.</Paragraph> </Section> <Section position="4" start_page="357" end_page="359" type="sub_section"> <SectionTitle> 3.4 Pronoun Usage </SectionTitle> <Paragraph position="0"> In her dissertation, Levy described a study of the impact of the type of referring expressions used, the location of first mentions of people and the gestures speakers make upon the cohesiveness of discourse (Levy, 1984). She found a strong correlation between the types of referring expressions people used, in particular how explicit they were, and the degree of cohesiveness with the preceding context. Less cohesive utterances generally contained more explicit referring expressions, such as definite noun phrases or phrases consisting of a possessive followed by a noun, while more cohesive utterances more.</Paragraph> <Paragraph position="1"> frequently contained zeroes and pronouns.</Paragraph> <Paragraph position="2"> We will use the converse of Levy's observation about pronouns to gauge the likelihood of a topic shift. 
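As a rough illustration of the overlap evidence from Sections 3.2 and 3.3 above, the sketch below (our own, not the paper's code) reduces both cues to simple counts over the two regions around a putative boundary; the stopword list is an invented stand-in for whatever content-word test was actually used.

```python
# A rough sketch (ours) of the overlap evidence from Sections 3.2 and 3.3.
# The stopword list is a stand-in for the paper's content-word test.

STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "that"}

def content_bigrams(tokens):
    """Adjacent word pairs in which both words are content words."""
    return {
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1 not in STOPWORDS and w2 not in STOPWORDS
    }

def shared_bigrams(region1, region2):
    """Bigrams common to both regions; a high count argues against a boundary."""
    return len(content_bigrams(region1) & content_bigrams(region2))

def shared_entities(entities1, entities2):
    """Named entities (as strings) common to both regions."""
    return len(set(entities1) & set(entities2))
```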
<Paragraph position="2"> Since Levy generally found pronouns in utterances that exhibited a high degree of cohesion with the prior context, we assume that the presence of a pronoun among the first words immediately following a putative topic boundary provides some evidence that no topic boundary actually exists there.</Paragraph> </Section> </Section>
<Section position="5" start_page="359" end_page="360" type="metho"> <SectionTitle> 4 Our Algorithms </SectionTitle>
<Paragraph position="0"> We designed two algorithms for topic segmentation. The first is based solely on word frequency and the second combines the results of the first with other sources of evidence. Both of these algorithms are applied to text following some preprocessing including tokenization, conversion to lowercase and the application of a lemmatizer (Karp et al., 1992).</Paragraph>
<Section position="1" start_page="359" end_page="360" type="sub_section"> <SectionTitle> 4.1 Word Frequency Algorithm </SectionTitle>
<Paragraph position="0"> Our word frequency algorithm uses Katz's G model (Katz, 1996). The G model stipulates that words occur in documents either topically or nontopically. The model defines topical words as those that occur more than 1 time, while non-topical words occur only once. Counterexamples of these uses of topical and nontopical, of course, abound.</Paragraph>
<Paragraph position="1"> We use the G model, shown below, to determine the probability that a particular word, w, occurred k times in a document. We trained the model from a corpus of 78 million words of Wall Street Journal text and smoothed the parameters using Dan Melamed's implementation of Good-Turing smoothing (Gale and Sampson, 1995) and additional ad hoc smoothing to account for unknown words.</Paragraph>
<Paragraph position="2"> P(k) = (1 − α_w) δ_{k,0} + α_w (1 − γ_w) δ_{k,1} + α_w γ_w (1 / (B_w − 1)) ((B_w − 2) / (B_w − 1))^{k−2}, where the third term applies only for k ≥ 2 and:</Paragraph>
<Paragraph position="3"> α_w is the probability that a document contains at least 1 occurrence of word w.</Paragraph>
<Paragraph position="4"> γ_w is the probability that w is used topically in a document given that it occurs at all.</Paragraph>
<Paragraph position="5"> B_w is the average number of occurrences in documents with more than 1 occurrence of w.</Paragraph>
<Paragraph position="6"> δ_{x,y} is a function with value 1 if x = y and 0 otherwise.</Paragraph>
<Paragraph position="7"> The simplest way to view the G model is to decompose it into 3 separate terms that are summed. The first term is the probability of zero occurrences of a word, the second is the probability of one occurrence and the third is the probability of any number of occurrences greater than one.</Paragraph>
<Paragraph position="8"> To detect topic boundaries, we used the model to answer a simple question: is it more or less likely that the words following a putative topic boundary were generated independently of those before it? Given a potential topic boundary, we call the text before the boundary region 1 and the text after it region 2. For the sake of our algorithm, the size of these regions was fixed at 230 words--the average size of a topic segment in our training corpus, 30 files from the HUB-4 Broadcast News Corpus annotated with topic boundaries by the LDC (HUB-4, 1996). Since the G model, unlike language models used for speech recognition, computes the probability of a bag of words rather than a word sequence, we can use it to compute the probability of some text given knowledge of what words have occurred before that text. We computed two probabilities with the model.</Paragraph>
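Before defining those two probabilities, note that the per-word G model above is simple enough to state directly in code. The following is a minimal sketch (ours, not the authors'), with the equation as reconstructed above and parameter names following the definitions given.

```python
# A minimal sketch (ours) of the three-term G model: the first branch is
# the probability of zero occurrences, the second of exactly one, and the
# geometric tail covers any count greater than one.

def g_model_prob(k, alpha, gamma, B):
    """P(word occurs k times in a document) under Katz's G model.

    alpha: P(the document contains the word at all)
    gamma: P(the word is used topically, given that it occurs)
    B:     mean occurrence count in documents where the word occurs more
           than once (at least 2 by definition, keeping the tail valid)
    """
    if k == 0:
        return 1.0 - alpha
    if k == 1:
        return alpha * (1.0 - gamma)
    p = 1.0 / (B - 1.0)  # geometric tail over k >= 2 with mean B
    return alpha * gamma * p * (1.0 - p) ** (k - 2)
```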
<Paragraph position="9"> P_same is the probability that region 1 and region 2 discuss the same subject matter, and hence that there is no topic boundary between them. P_diff is the probability that they discuss different subjects and are separated by a topic boundary. P_same, therefore, is the probability of seeing the words in region 2 given the context, called C, of region 1. P_diff is the probability of seeing the words in region 2 independent of the words in region 1. The formulae follow directly: P_same is the product over the words w in region 2 of P(w occurs k_w times | C), while P_diff is the product over the same words of P(w occurs k_w times), where k_w is the number of occurrences of w in region 2. Boundaries were placed where P_diff was greater than P_same by a certain threshold. The threshold was used to trade precision for recall and vice versa when identifying topic boundaries. The most natural threshold is a very small nonzero value, which is equivalent to placing a boundary wherever P_diff is greater than P_same.</Paragraph>
<Paragraph position="10"> Computing P_diff is straightforward, but P_same requires computing conditional probabilities of the number of occurrences of each word in region 2 given the number in region 1. The formulae for the conditional probabilities are shown in Table 2. We do not have space to derive these formulae here, but they can be found in (Reynar, 1998). M is a normalizing term required to make the conditional probabilities sum to 1. In the table, x+ means x occurrences or more.</Paragraph> </Section>
<Section position="2" start_page="360" end_page="360" type="sub_section"> <SectionTitle> 4.2 A Maximum Entropy Model </SectionTitle>
<Paragraph position="0"> Our second algorithm is a maximum entropy model that uses these features: * Did our word frequency algorithm suggest a topic boundary? * Which domain cues (such as Joining us or This is <person>) were present? * How many content word bigrams were common to both regions adjoining the putative topic boundary? * What fraction of the words in the region after the putative boundary were first uses? * Were pronouns used in the first five words after the putative topic boundary?</Paragraph>
<Paragraph position="1"> We trained this model from 30 files of HUB-4 data that was disjoint from our test data.</Paragraph> </Section> </Section>
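To tie the pieces together, here is a schematic sketch (our own framing, not the authors' code) of the boundary decision rule in log space; g_prob and cond_prob are assumed callables standing in for the G model and the Table 2 conditionals, which are estimated elsewhere.

```python
# A schematic sketch (ours) of the boundary decision rule of Section 4.1.
# g_prob and cond_prob stand in for the G model and Table 2 conditionals.

import math
from collections import Counter

REGION_SIZE = 230  # average topic-segment length in the training corpus

def log_p_diff(region2, g_prob):
    """log P_diff: region 2 generated independently of region 1.

    g_prob(w, k) -> P(word w occurs k times in a segment).
    """
    counts = Counter(region2)
    return sum(math.log(g_prob(w, k)) for w, k in counts.items())

def log_p_same(region1, region2, cond_prob):
    """log P_same: region 2 generated given region 1 as context.

    cond_prob(w, k2, k1) -> P(k2 occurrences of w | k1 in the context).
    """
    c1, c2 = Counter(region1), Counter(region2)
    return sum(math.log(cond_prob(w, k2, c1[w])) for w, k2 in c2.items())

def is_boundary(region1, region2, g_prob, cond_prob, threshold=0.0):
    """Place a boundary when region 2 looks independent of its context."""
    score = log_p_diff(region2, g_prob) - log_p_same(region1, region2, cond_prob)
    return score > threshold
```
</Paper>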