<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1203"> <Title>Combining Optimal Clustering and Hidden Markov Models for Extractive Summarization</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Story Segmentation Using Modified K-means (MKM) Clustering </SectionTitle> <Paragraph position="0"> The first step in multi-document summarization is to segment and classify documents that have a common theme or story. Vector space models can be used to compare documents (Ando et al. (2000), Deerwester et al. (1991)). K-means clustering is commonly used to cluster related document or sentence vectors together. A typical k-means clustering algorithm for summarization is as follows:
1. Arbitrarily choose K vectors as initial centroids;
2. Assign vectors closest to each centroid to its cluster;
3. Update each centroid using all vectors assigned to its cluster;
4. Iterate until the average intra-cluster distance falls below a threshold.
We have found three problems with the standard k-means algorithm for sentence clustering. First, the initial number of clusters, K, has to be set arbitrarily by a human. Second, the initial partition of a cluster is arbitrarily set by thresholding; hence, the initial set of centroids is arbitrary. Finally, during clustering, each centroid is selected as the sentence that has the least average distance to the other sentences in its cluster. All these characteristics of k-means can lead to a non-optimal cluster configuration at the final stage.</Paragraph> <Paragraph position="1"> To avoid the above problems, we propose using the modified K-means (MKM) clustering algorithm (Wilpon & Rabiner (1985)), coupled with virtual document centroids. MKM starts from a global centroid and splits the clusters top down until they stabilize:
1. Compute the centroid of the entire training set;
2. Assign vectors closest to each centroid to its cluster;
3. Update each centroid using all vectors assigned to its cluster;
4. Iterate steps 2-4 until vectors stop moving between clusters;
5. Stop if the clusters have stabilized and output the final clusters; otherwise go to step 6;
6. Split the cluster with the largest intra-cluster distance into two by finding the pair of vectors with the largest distance in that cluster. Use these two vectors as new centroids, and repeat steps 2-5.</Paragraph> <Paragraph position="2"> In addition, we do not use any existing document in the collection as the selected centroid. Rather, we introduce virtual centroids that contain the expected value of all documents in a cluster. An element of the centroid is the average weight of the same index term over all documents within that cluster: c_j = (1/|C|) * sum_{d in C} w_{d,j}, where w_{d,j} is the weight of index term j in document d and |C| is the number of documents in cluster C.</Paragraph> <Paragraph position="4"> The vectors are document vectors in this step. The number of clusters is determined after the clusters have stabilized. The resulting cluster configuration is closer to optimal and better balanced than that obtained with conventional k-means clustering. Using the MKM algorithm with virtual centroids, we segment the collection of news articles into clusters of related articles. Articles covering the same story from different sources now carry the same theme label.</Paragraph> <Paragraph position="5"> Articles from the same source over different time periods also carry the same theme label.
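To make the MKM procedure concrete, the following is a minimal Python sketch, not the authors' implementation. The function name mkm_cluster, the use of Euclidean distance, and the stopping threshold max_intra_dist are assumptions; the virtual centroid of each cluster is the element-wise average of its document vectors, as described above.

import numpy as np

def mkm_cluster(doc_vectors, max_intra_dist=0.5, max_iter=100):
    """Modified K-means: split clusters top-down from a single global centroid.

    doc_vectors    -- (N, L) array of tf-idf document vectors
    max_intra_dist -- splitting stops once every cluster's average distance to
                      its centroid is below this threshold (assumed criterion)
    Returns (labels, centroids).
    """
    X = np.asarray(doc_vectors, dtype=float)
    centroids = [X.mean(axis=0)]                  # step 1: global centroid
    labels = np.zeros(len(X), dtype=int)
    while True:
        # Steps 2-4: reassign vectors and recompute virtual centroids until stable.
        for _ in range(max_iter):
            C = np.asarray(centroids)
            dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            converged = np.array_equal(new_labels, labels)
            labels = new_labels
            # Virtual centroid: element-wise average term weight over the cluster.
            centroids = [X[labels == k].mean(axis=0) if np.any(labels == k) else c
                         for k, c in enumerate(centroids)]
            if converged:
                break
        # Step 5: stop when every cluster is tight enough.
        C = np.asarray(centroids)
        intra = np.array([np.linalg.norm(X[labels == k] - C[k], axis=1).mean()
                          if np.any(labels == k) else 0.0
                          for k in range(len(C))])
        worst = int(intra.argmax())
        if intra[worst] <= max_intra_dist:
            return labels, C
        # Step 6: split the loosest cluster at its two most distant members.
        members = X[labels == worst]
        pair = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
        i, j = np.unravel_index(int(pair.argmax()), pair.shape)
        centroids[worst] = members[i]
        centroids.append(members[j])

In the setting above, doc_vectors would hold the tf-idf vectors of the news articles, and the returned labels give each article's theme/story label.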
In the next stage, we iteratively re-classify each paragraph, and then re-classify each sentence in each paragraph, into final theme classes.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Stochastic Process of Theme Classification </SectionTitle> <Paragraph position="0"> After we have obtained the story label of each article, we need to classify the paragraphs and then the sentences according to these labels. Each paragraph in an article is assigned the cluster number of that article, as we assume all paragraphs in the same article share the same story theme.</Paragraph> <Paragraph position="1"> We suggest that the entire text generation process can be considered a stochastic process that starts in some theme class, generates sentences one after another for that theme class, then goes to the next theme class and generates its sentences, and so on, until it reaches the final theme class in the document and finishes generating sentences in that class. This is an approximation of the authoring process, where a writer thinks of a certain structure for his/her article, starts from the first section, writes the sentences in that section, proceeds to the next section, and so on, until s/he finishes the last sentence in the last section.</Paragraph> <Paragraph position="2"> Given a document of sentences, the task of summary extraction involves discovering the underlying theme class transitions at the sentence boundaries, classifying each sentence according to these theme concepts, and then extracting the salient sentences in each theme class cluster.</Paragraph> <Paragraph position="3"> We want to find argmax_C P(C|D), the theme class sequence C that is most probable given the document D. Note that the total number of theme classes is far smaller than the total number of sentences in a document, so the mapping is not one-to-one. Our task is similar to discourse parsing (Marcu (1997)), where discourse structures are extracted from the text. In our case, we are carrying out discourse tagging, whereby we assign class labels or tags to each sentence in the document. We use a Hidden Markov Model (HMM) for this stochastic process, where the theme classes are assumed to be hidden states.</Paragraph> <Paragraph position="4"> We make the following assumptions: * The probability of a sentence given its past depends only on its theme class (emission probabilities); * The probability of a theme class depends only on the theme classes of the previous N sentences (transition probabilities).</Paragraph> <Paragraph position="5"> The above assumptions lead to a Hidden Markov Model with M states representing M different theme classes. Each state can generate different sentences according to some probability distribution--the emission probabilities. These states are hidden, as we only observe the generated sentences in the text. Every theme/state can transit to any other theme/state, or to itself, according to some probabilities--the transition probabilities.</Paragraph> <Paragraph position="7"> Our theme tagging task then becomes a search problem for the HMM: given the observation sequence D = (s(1), s(2), ..., s(t), ..., s(T)) and the model λ, how do we choose a corresponding state sequence C = (c(s(1)), c(s(2)), ..., c(s(t)), ..., c(s(T))) that best explains the sentence sequence? To train the model parameters λ, we need to solve another HMM problem: how do we adjust the model parameters λ = (A, B, π), the transition, emission and initial probabilities, to maximize the likelihood of the observation sentence sequences given our model?
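As a minimal illustration of the two assumptions above (a sketch, not the authors' code), the joint log score of a class sequence C and sentence sequence D factorizes into an initial term, transition terms, and emission terms. The sketch assumes the first-order case (N = 1), which the paper adopts later, and the names log_pi, log_A, and log_emission are hypothetical placeholders for π, A, and B.

def log_joint_score(classes, sentences, log_pi, log_A, log_emission):
    """Log P(C, D) for one document under a first-order class model.

    classes      -- theme-class ids c(1)..c(T)
    sentences    -- sentence feature vectors s(1)..s(T)
    log_pi       -- dict: class -> log initial probability
    log_A        -- dict: (previous class, class) -> log transition probability
    log_emission -- function (class, sentence vector) -> log P(s | c)
    """
    score = log_pi[classes[0]] + log_emission(classes[0], sentences[0])
    for t in range(1, len(sentences)):
        score += log_A[(classes[t - 1], classes[t])]      # transition term
        score += log_emission(classes[t], sentences[t])   # emission term
    return score

Decoding (Section 4.2) amounts to searching for the class sequence that maximizes this score for a given document.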
In a supervised training paradigm, we could obtain human-labeled class-sentence pairs and carry out relative frequency counts to train the emission and transition probabilities. However, hand labeling a large collection of texts with theme classes is very tedious. One main reason is that there is a considerable amount of disagreement between humans on the manual annotation of themes and topics: how many themes should there be? Where should each theme start and end? It is therefore desirable to decode the hidden theme or topic states using an unsupervised training method, without manually annotated data. Consequently, we only need to cluster and label the initial documents according to cluster number. In the HMM framework, we then improve upon this initial clustering by iteratively estimating λ = (A, B, π) and maximizing P(C|D) using a Viterbi decoder.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Sentence Feature Vector and Similarity Measure </SectionTitle> <Paragraph position="0"> Prior to the training process, we need to define the sentence feature vector and the similarity measure for comparing two sentence vectors.</Paragraph> <Paragraph position="1"> As we consider a document D to be a sequence of sentences, the sentences themselves are represented as feature vectors s(t) of length L, where t is the position of the sentence in the document and L is the size of the vocabulary. Each element of the vector s(t) corresponds to an index term in the sentence, weighted by its term frequency (tf) and inverse document frequency (idf), where tf is defined as the frequency of the word in that particular sentence, and idf is the inverse frequency of the word in the larger collection, computed as log(N/df), where df is the number of sentences the word appears in and N is the total number of sentences in the training corpus. We treat sentences as documents in computing tf and idf because we are comparing sentence against sentence.</Paragraph> <Paragraph position="2"> In the initial clustering and the subsequent Viterbi training process, sentence feature vectors need to be compared to the centroid of each cluster. Various similarity measures and metrics include the cosine measure, Dice coefficient, Jaccard coefficient, inclusion coefficient, Euclidean distance, KL divergence, information radius, etc. (Manning & Schütze (1999), Dagan et al. (1997), Salton and McGill (1983)). We chose the cosine similarity measure for its ease of computation: cos(s, c) = (s · c) / (||s|| ||c||), the inner product of the two vectors divided by the product of their norms.</Paragraph> <Paragraph position="4"> In this section, we describe an iterative training process for estimating our HMM parameters.</Paragraph> <Paragraph position="5"> We consider the output of the MKM clustering process in Section 2 as an initial segmentation of the text into class sequences. To improve upon this initial segmentation, we use an iterative Viterbi training method that is similar to segmental K-means (SKM) clustering for speech processing (Rabiner & Juang (1993)). All articles in the same story cluster are processed as follows:
1. Initialization: All paragraphs in the same story class are clustered again. All sentences in the same paragraph then share the same class label as that paragraph. This is the initial class-sentence segmentation. Initial class transitions are counted.</Paragraph> <Paragraph position="6"> 2. (Re-)clustering: Sentence vectors with their class labels are repartitioned into K clusters (K is obtained previously from the MKM step) using the K-means algorithm.
This step is iterated until the clusters stabilize.
3. (Re-)estimation of probabilities: The centroid of each cluster is estimated, and the emission probabilities are updated from the new clusters.</Paragraph> <Paragraph position="7"> 4. (Re-)classification by decoding: The model parameters updated in the re-estimation step are used to rescore the (unlabeled) training documents, assigning a class sequence to each sentence sequence using Viterbi decoding. The class transitions are updated from this output.</Paragraph> <Paragraph position="8"> 5. Iteration: Stop if the convergence conditions are met; otherwise repeat steps 2-4.</Paragraph> <Paragraph position="9"> The segmental clustering algorithm is iterated until the decoding likelihood converges. The final trained Viterbi decoder is then used to tag unannotated data sets into class-sentence pairs. In Sections 4.1 and 4.2, we discuss steps 3 and 4 in more detail.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Estimation of Probabilities </SectionTitle> <Paragraph position="0"> We need to train the parameters of our HMM such that the model best describes the training data.</Paragraph> <Paragraph position="1"> During the iterative process, the probabilities are estimated from class-sentence pair sequences obtained either from the initialization stage or from the re-classification stage.</Paragraph> <Paragraph position="2"> Transition probabilities: text cohesion and text segmentation. Text cohesion (Halliday and Hasan (1996)) is an important concept in summarization, as it underlines the theme of a text segment based on connectivity patterns between sentences (Mani (2002)). When an author writes from theme to theme in a linear text, s/he generates sentences that are tightly linked together within a theme. When s/he proceeds to the next theme, the sentences generated next are quite separate from those of the previous theme, but are themselves tightly linked again.</Paragraph> <Paragraph position="3"> As mentioned in the introduction, most extraction-based summarization approaches give some consideration to the linearity between sentences in a text. For example, Mani (1999) uses spreading activation weights between sentence links, and Barzilay et al. (2001) use a cohesion constraint that leads to improved summary quality. Aone et al. (1999) use linguistic knowledge such as aliases, synonyms, and morphological variations to link lexical items together across sentences.</Paragraph> <Paragraph position="4"> Term distribution has been studied by many NLP researchers. Manning & Schütze (1999) give a good overview of the various probability distributions used to describe how a term appears in a text; these distributions are in general non-Gaussian. Our Hidden Markov Model provides a unified framework for incorporating text cohesion and term distribution information in the transition probabilities of the theme classes. The class of a sentence depends on the class labels of the previous N sentences, so the linearity of the text is preserved in our model. In the preliminary experiments, we set N to one; that is, we use a bigram class model.</Paragraph> <Paragraph position="5"> Emission probabilities: distribution of terms. For the emission probabilities, there are a number of possible formulations. We cannot use relative frequency counts of sentences (the number of times a sentence occurs in a cluster divided by the total number of sentences in the cluster), since most sentences occur only once in the entire corpus.
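Before turning to the emission probabilities in detail, the bigram class transition model described above can be sketched as a simple relative-frequency estimate over the class-labeled sentence sequences produced at the initialization or re-classification stage. This is a minimal sketch, not the authors' code; the helper names and the add-alpha smoothing are assumptions, since the paper does not specify how unseen transitions are handled.

from collections import defaultdict

def estimate_transitions(class_sequences, num_classes, alpha=1.0):
    """Relative-frequency bigram transition estimates P(c_t | c_{t-1}).

    class_sequences -- list of documents, each a list of theme-class ids
    num_classes     -- K, the number of theme classes from the MKM step
    alpha           -- add-alpha smoothing constant (assumption, not from the paper)
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in class_sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1.0
    A = {}
    for prev in range(num_classes):
        total = sum(counts[prev].values()) + alpha * num_classes
        for curr in range(num_classes):
            A[(prev, curr)] = (counts[prev][curr] + alpha) / total
    return A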
Looking at the sentence feature vector, we take the view that the probability of a sentence vector being generated by a particular cluster is the product of the probabilities of the index terms in the sentence occurring in that cluster according to some distribution, and that these term distribution probabilities are independent of each other.</Paragraph> <Paragraph position="6"> For a sentence vector of length L, where L is the total size of the vocabulary, each element--an index term--follows some probability density function (pdf). In speech processing, spectral features are assumed to follow independent Gaussian distributions. In language processing, several models have been proposed for term distribution, including the Poisson distribution, the two-Poisson model for content and non-content words (Bookstein and Swanson (1975)), the negative binomial (Mosteller and Wallace (1984), Church and Gale (1995)), and Katz's k-mixture (Katz (1996)). We adopt two schemes for comparison: (1) the unigram distribution of each index term in the clusters; and (2) the Poisson distribution as the pdf for modeling the term emission probabilities, P(k; λ) = λ^k e^{-λ} / k!, where k is the number of occurrences of the term.</Paragraph> <Paragraph position="8"> At each estimation step of the training process, the λ of the Poisson distribution is estimated from the centroid of each theme cluster. (Footnote 1: Strictly speaking, we ought to re-estimate the IDF in the k-mixture during each iteration by using the re-estimated clusters from the k-means step as the documents. However, we simplify the process by using the pre-computed IDF from all training documents.)</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Viterbi Decoding: Re-classification with Sentence Cohesion </SectionTitle> <Paragraph position="0"> After each re-estimation, we use a Viterbi decoder to find the best class sequence given a document's sentence sequence. The "time sequence" corresponds to the sequence of sentences in a document, whereas the states are the theme classes.</Paragraph> <Paragraph position="1"> At each node of the trellis, the probability of a sentence given a class state is computed from the transition probabilities and the emission probabilities. After Viterbi backtracking, the best class sequence for the document is found and the sentences are relabeled with the class tags.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Salient Sentence Extraction </SectionTitle> <Paragraph position="0"> The SKM algorithm is iterated until the decoding likelihood converges. The final trained Viterbi decoder is then used to tag unannotated data sets into class-sentence pairs.
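For concreteness, the decoding step just mentioned can be sketched as a log-space Viterbi search. This is a minimal sketch, not the authors' implementation; the dense-matrix representation and function name are assumptions, and the per-sentence emission scores log_B could come, for example, from the unigram or Poisson term models above.

import numpy as np

def viterbi_decode(log_pi, log_A, log_B):
    """Best theme-class sequence for one document (log-space Viterbi).

    log_pi -- (K,)   log initial class probabilities
    log_A  -- (K, K) log transition probabilities, log_A[i, j] = log P(j | i)
    log_B  -- (T, K) log emission scores, log_B[t, c] = log P(s(t) | c)
    Returns a list of T class ids.
    """
    T, K = log_B.shape
    delta = np.full((T, K), -np.inf)   # best score ending in class c at sentence t
    psi = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (K, K): previous -> current
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(K)] + log_B[t]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]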
We can then extract salient sentences from each class to be included in a summary, or for question answering.</Paragraph> <Paragraph position="1"> To evaluate the effectiveness of our method as a foundation for extractive summarization, we extract sentences from each theme class in each document using four features, namely: (1) the position of the sentence, p = 1/n, where n is the position of the sentence in the document--the further it is from the title, the less important it is supposed to be; (2) the cosine similarity of the sentence with the centroid of its class, ps1; (3) its similarity with the first sentence in the article, ps2; and (4) the so-called Z model (Zechner (1996), Nomoto & Matsumoto (2000)), where the mass z of a sentence is computed as the sum of the tf-idf values of the index terms in that sentence, and the center of mass is chosen as the salient sentence to be included in a summary.</Paragraph> <Paragraph position="3"> The above features are linearly combined to yield a final saliency score for every sentence: w(s) = w1*p + w2*ps1 + w3*ps2 + w4*z. Our features are similar to those in an existing system (Radev 2002), with the difference being in the centroid computation (and cluster definition), which results from our stochastic system.</Paragraph>
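As an illustration of this scoring scheme, a minimal sketch (not the authors' implementation): the helper names and the equal default weights are assumptions, since the paper does not report the weight values it uses.

import numpy as np

def cosine(u, v):
    """Cosine similarity between two tf-idf vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0.0 else 0.0

def saliency_scores(sent_vectors, positions, class_centroid, first_sent_vector,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four features for the sentences of one theme class.

    sent_vectors      -- tf-idf sentence vectors
    positions         -- 1-based position n of each sentence in its document
    class_centroid    -- centroid vector of the sentences' theme class
    first_sent_vector -- tf-idf vector of the article's first sentence
    weights           -- (w1, w2, w3, w4); equal weights are an assumption
    """
    w1, w2, w3, w4 = weights
    scores = []
    for s, n in zip(sent_vectors, positions):
        p = 1.0 / n                          # position feature
        ps1 = cosine(s, class_centroid)      # similarity to the class centroid
        ps2 = cosine(s, first_sent_vector)   # similarity to the first sentence
        z = float(np.sum(s))                 # Z-model mass: sum of tf-idf weights
        scores.append(w1 * p + w2 * ps1 + w3 * ps2 + w4 * z)
    return scores

</Section> </Paper>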