<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1203"> <Title>Combining Optimal Clustering and Hidden Markov Models for Extractive Summarization</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Multi-document summarization (MDS) is the summarization of a collection of related documents (Mani (1999)). Its applications include the summarization of a news story from different sources, where the documents are related by the theme or topic of the story. Another application is the tracking of news stories from a single source over different time frames. In this case, documents are related by topic over time.</Paragraph>
<Paragraph position="1"> Multi-document summarization is also an extension of single-document summarization. One of the most robust and domain-independent summarization approaches is extraction-based, or shallow, summarization (Mani (1999)).</Paragraph>
<Paragraph position="2"> In extraction-based summarization, salient sentences are automatically extracted to form a summary directly (Kupiec et al. (1995), Myaeng & Jang (1999), Jing et al. (2000), Nomoto & Matsumoto (2001, 2002), Zha (2002), Osborne (2002)), or extraction is followed by a synthesis stage to generate a more natural summary (McKeown & Radev (1999), Hovy & Lin (1999)).</Paragraph>
<Paragraph position="3"> Summarization therefore involves some theme or topic identification, followed by the extraction of salient segments in a document.</Paragraph>
<Paragraph position="4"> Story segmentation and document and sentence classification can often be accomplished by unsupervised clustering methods, with little or no requirement for human-labeled data (Deerwester (1991), White & Cardie (2002), Jing et al. (2000)).</Paragraph>
<Paragraph position="5"> Unsupervised methods, or hybrids of supervised and unsupervised methods, for extractive summarization have been found to yield promising results that are comparable or superior to those of supervised methods (Nomoto & Matsumoto (2001, 2002)). In these works, vector space models are used, and document or sentence vectors are clustered together according to some similarity measure (Deerwester (1991), Dagan et al. (1997)).</Paragraph>
<Paragraph position="6"> The disadvantage of clustering methods lies in their ad hoc nature. Since sentence vectors are treated as independent sample points, sentence order information is lost. Various heuristics and revision strategies have been applied to the general sentence selection schema to take text cohesion into consideration (White & Cardie (2002), Mani & Bloedorn (1999), Aone et al. (1999), Zha (2002), Barzilay et al. (2001)). We would like to preserve the natural linear cohesion of sentences in a text as a baseline prior to the application of any revision strategies.</Paragraph>
<Paragraph position="7"> To compensate for the ad hoc nature of vector space models, probabilistic approaches have regained interest in information retrieval in recent years (Knight & Marcu (2000), Berger & Lafferty (1999), Miller et al. (1999)). These recent probabilistic methods in information retrieval are largely inspired by the success of probabilistic models in machine translation in the early 90s (Brown et al.), and regard information retrieval as a noisy channel problem. Hidden Markov Models were proposed for this task by Miller et al. (1999) and have been shown to outperform tf-idf in TREC information retrieval tasks. The advantage of probabilistic models is that they provide a more rigorous and robust framework for modeling query-document relations than ad hoc information retrieval. Nevertheless, such probabilistic IR models still require annotated training data.</Paragraph>
<Paragraph position="8"> In this paper, we propose an iterative unsupervised training method for multi-document extractive summarization that combines a vector space model with a probabilistic model. We iteratively classify news articles, then paragraphs within articles, and finally sentences within paragraphs into common story themes, using modified K-means (MKM) clustering and segmental K-means (SKM) decoding. We obtain an initial clustering of article classes by MKM, which determines the inherent number of theme classes of all the news articles. Next, we use SKM to classify paragraphs and then sentences. SKM iterates between a k-means clustering step and a Viterbi decoding step to obtain a final classification of sentences into theme classes. Our MKM-SKM paradigm combines a vector space clustering model with a probabilistic framework, preserving some of the natural sentence cohesion without requiring annotated data. Our method also avoids any arbitrary or ad hoc setting of parameters.</Paragraph>
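[Editor's illustration] For concreteness, the following is a minimal Python sketch of the kind of segmental K-means alternation described above: a k-means step re-estimates theme centroids, and a Viterbi step reassigns the sentence sequence under a self-transition bonus that favors cohesion. It is a sketch under simplified assumptions (negative squared-distance emission scores, and a fixed "stay_bonus" standing in for learned transition probabilities), not the authors' implementation.

import numpy as np

def viterbi_assign(X, centroids, stay_bonus=0.5):
    """Viterbi-decode a sequence of sentence vectors into theme classes."""
    T, K = len(X), len(centroids)
    # Emission score: negative squared distance to each centroid, shape (T, K).
    emit = -((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        for k in range(K):
            # Reward staying in the same theme class as the previous sentence.
            prev = score[t - 1] + np.where(np.arange(K) == k, stay_bonus, 0.0)
            back[t, k] = prev.argmax()
            score[t, k] = prev.max() + emit[t, k]
    path = np.zeros(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 2, -1, -1):  # trace back the best state sequence
        path[t] = back[t + 1, path[t + 1]]
    return path

def segmental_kmeans(X, K, iters=20, seed=0):
    """Alternate a k-means centroid update with Viterbi reassignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        labels = viterbi_assign(X, centroids)   # decoding step
        for k in range(K):                      # clustering step
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Toy usage: ten "sentence vectors" drawn around two theme centers.
X = np.vstack([np.random.randn(5, 3) + 2.0, np.random.randn(5, 3) - 2.0])
print(segmental_kmeans(X, K=2)[0])

With stay_bonus = 0 this reduces to plain sequential k-means assignment; increasing it trades per-sentence fit for sequence cohesion, which is the property the MKM-SKM paradigm aims to preserve.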
<Paragraph position="9"> In Section 2, we introduce the modified K-means algorithm as a better alternative to conventional K-means for document clustering. In Section 3, we present the stochastic framework of theme classification and sentence extraction. We describe the training algorithm in Section 4, where details of the model parameters and Viterbi scoring are presented. Our sentence selection algorithm is described in Section 5. Section 6 describes our evaluation experiments. We discuss the results and conclude in Section 7.</Paragraph> </Section></Paper>