<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1029">
<Title>Fine-Grained Hidden Markov Modeling for Broadcast-News Story Segmentation</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1. INTRODUCTION</SectionTitle>
<Paragraph position="0"> Current technology makes the automated capture, storage, indexing, and categorization of broadcast news feasible, allowing for the development of computational systems that provide for the intelligent browsing and retrieval of news stories [Maybury, Merlino & Morey, '97; Kubala, et al., '00]. To be effective, such systems must be able to partition the undifferentiated input signal into the appropriate sequence of news-story segments.</Paragraph>
<Paragraph position="1"> In this paper we discuss an approach to segmentation based on the use of a fine-grained Hidden Markov Model [Rabiner, '89] to model the generation of the words produced during a news program. We present the model topology and the textual features used. Critical to this approach is the application of non-parametric estimation techniques, employed to obtain robust estimates for both transition and observation probabilities. Visualization methods developed for the analysis of system performance are also presented. (A schematic sketch of HMM decoding in this style appears after this section.)</Paragraph>
<Paragraph position="2"> Typically, approaches to news-story segmentation have been based on extracting features of the input stream whose values at boundaries between stories are likely to differ from those observed within the span of individual stories. In [Beeferman, Berger & Lafferty, '99], boundary decisions are based on how well predictions made by a long-range exponential language model compare to those made by a short-range trigram model. [Ponte and Croft, '97] utilize Local Context Analysis [Xu and Croft, '96] to enrich each sentence with related words, and then use dynamic programming to find an optimal boundary sequence based on a measure of word-occurrence similarity between pairs of enriched sentences. In [Greiff, Hurwitz & Merlino, '99], a naive Bayes classifier is used to make a boundary decision at each word of the transcript. In [Yamron, et al., '98], a fully connected Hidden Markov Model is built from automatically induced topic clusters, with one node for each topic; observation probabilities for each node are estimated using smoothed unigram statistics. (A sketch of such a smoothed unigram estimate follows this section.)</Paragraph>
<Paragraph position="3"> The approach reported in this paper goes further along the lines of fine-grained modeling in two respects: 1) differences in the feature patterns likely to be observed at different points in the development of a news story are exploited, in contrast to approaches that focus only on boundary/no-boundary differences; and 2) the story-length distribution profile, unique to each news source, is modeled in greater detail (for example, see the histogram of story lengths for ABC World News Tonight shown in the top graph of Figure 3, below). (A worked illustration of how model topology shapes the story-length distribution follows this section.)</Paragraph>
</Section>
</Paper>
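The following is a minimal, illustrative sketch of the kind of HMM decoding the paper builds on, not the authors' system: the states, probabilities, smoothing floor, and boundary convention are all assumptions made for exposition. A chain of "position-in-story" states is decoded over the word stream with the Viterbi algorithm, and a story boundary is hypothesized wherever the best path enters the assumed story-initial state.

```python
import math

def viterbi(words, states, log_trans, log_obs, log_init):
    """Return the most likely state sequence for an observed word stream.

    log_trans[s1][s2] : log P(s2 | s1)
    log_obs[s][w]     : log P(w | s), smoothed upstream
    log_init[s]       : log P(s) at the first word
    """
    FLOOR = math.log(1e-8)  # crude floor for unseen words/transitions
    V = [{s: log_init[s] + log_obs[s].get(words[0], FLOOR) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this word.
            prev, score = max(
                ((p, V[-1][p] + log_trans[p].get(s, FLOOR)) for p in states),
                key=lambda x: x[1])
            col[s] = score + log_obs[s].get(w, FLOOR)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace the best path back from the final word.
    s = max(V[-1], key=V[-1].get)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return list(reversed(path))

# Hypothetical usage: states "start", "middle", "end" for positions within
# a story; a boundary is placed before each word decoded as "start".
```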
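The description of [Yamron, et al., '98] above mentions observation probabilities estimated from smoothed unigram statistics. Below is a hedged sketch of one standard way to do such smoothing (linear interpolation with a background model); the interpolation weight and the function name are illustrative assumptions, not a reconstruction of that system.

```python
import math
from collections import Counter

def unigram_logprob(word, topic_counts: Counter, background_counts: Counter,
                    lam: float = 0.5):
    """log P(word | topic): the topic's maximum-likelihood unigram estimate,
    interpolated with a corpus-wide background model so that words unseen
    in the topic still receive nonzero probability."""
    p_topic = topic_counts[word] / max(sum(topic_counts.values()), 1)
    p_bg = background_counts[word] / max(sum(background_counts.values()), 1)
    p = lam * p_topic + (1 - lam) * p_bg
    return math.log(max(p, 1e-12))  # guard against words unseen everywhere
```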
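The remark about modeling the story-length distribution profile can be made concrete with a small worked example (illustrative numbers, not the paper's data). A single self-looping state per topic forces a geometric length distribution, whose mode is at one word; a left-to-right chain of k sub-states yields a negative-binomial distribution, whose mode sits well away from length one, closer to the peaked shape of a real histogram such as the one in Figure 3.

```python
from math import comb

def geometric_pmf(n, p_stay):
    """P(story length == n) for one state with self-loop probability p_stay."""
    return (1 - p_stay) * p_stay ** (n - 1)

def chain_pmf(n, k, p_stay):
    """P(story length == n) for a chain of k identical sub-states: a
    negative binomial, since the story ends on the k-th state exit."""
    if n < k:
        return 0.0
    return comb(n - 1, k - 1) * (1 - p_stay) ** k * p_stay ** (n - k)

# Both models below have a mean length of 200 words, but the single state
# is most likely to end after 1 word, while the 4-state chain has its mode
# near 150 words -- the fine-grained topology fits peaked length histograms.
print(geometric_pmf(1, 199 / 200))   # mode of the geometric: length 1
print(chain_pmf(150, 4, 196 / 200))  # chain probability at its mode
```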