<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1203">
  <Title>Combining Optimal Clustering and Hidden Markov Models for Extractive Summarization</Title>
  <Section position="7" start_page="0" end_page="0" type="concl">
    <SectionTitle>
7 Conclusion and Discussion
</SectionTitle>
    <Paragraph position="0"> We have presented a stochastic HMM framework with modified K-means and segmental K-means algorithms for extractive summarization. Our method uses an unsupervised, probabilistic approach to find class centroids, class sequences, and class boundaries in linear, unrestricted texts, yielding salient sentences and topic segments for summarization and question-answering tasks.</Paragraph>
    <Paragraph position="1"> We define a class to be a group of connected sentences that corresponds to one or more topics in the text. Such topics can be answers to a user query, or simply a single concept to be included in the summary. We define a Markov model in which the states correspond to the different classes and the observations are continuous sequences of sentences in a document. Transition probabilities are the class transitions estimated from a training corpus. Emission probabilities are the probabilities of an observed sentence given a specific class, following a Poisson distribution. Unlike conventional methods, which treat texts as independent sentences to be clustered together, our method incorporates text-cohesion information in the class transition probabilities. Unlike other HMM- and noisy-channel-based probabilistic approaches to information retrieval, our method is unsupervised and does not require annotated data.</Paragraph>
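The two probability tables described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the Poisson rate per class and the sentence-level count feature (e.g., content-word overlap with a class centroid) are assumptions, and all function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def emission_prob(sentence_feature, class_rate):
    """Emission probability of a sentence given a class, modeled as a
    Poisson over an integer sentence feature (an illustrative assumption)."""
    return poisson_pmf(sentence_feature, class_rate)

def transition_probs(class_sequences, n_classes):
    """Estimate class-transition probabilities from class-label
    sequences observed in a training corpus (bigram counts)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for seq in class_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # back off to uniform when a class was never seen as a source
        probs.append([c / total if total else 1.0 / n_classes for c in row])
    return probs
```

With these two tables, decoding a document into a class sequence reduces to standard Viterbi decoding over sentences.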
    <Paragraph position="2"> We also propose a modified K-means clustering algorithm to avoid the ad hoc choice of an initial cluster set required by the conventional K-means algorithm. For unsupervised training, we use a segmental K-means training method to iteratively improve the clusters. Experimental results show that the content-based performance of our system is 22.8% above that of an existing extractive summarization system and 46.3% above that of a simple top-N sentence selection system. Although evaluating on the training set is not a closed evaluation, since the training is unsupervised, we will also evaluate on test data not included in the training set, as our trained decoder can classify sentences in unseen texts. Our framework serves as a foundation for the future incorporation of other statistical and linguistic information as vector features, such as part-of-speech tags, name aliases, synonyms, and morphological variations.</Paragraph>
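The segmental K-means loop described above alternates an optimal (Viterbi-style) segmentation step with a re-estimation step. The following is a minimal sketch on a 1-D feature sequence under simplifying assumptions: segments are scored by squared distance to their mean rather than by the paper's Poisson emission model, and the function name and equal-length initialization are illustrative, not the authors' implementation.

```python
def segmental_kmeans(xs, k, iters=20):
    """Alternate (a) an optimal contiguous segmentation of xs into k
    ordered segments, found by dynamic programming, with (b)
    re-estimation of each segment's mean, until the boundaries stop
    changing. Returns (boundaries, means)."""
    n = len(xs)
    # initialize with an equal-length segmentation
    bounds = [i * n // k for i in range(k + 1)]
    means = [sum(xs[bounds[j]:bounds[j + 1]]) / (bounds[j + 1] - bounds[j])
             for j in range(k)]
    for _ in range(iters):
        # cost[j][i]: best cost of segmenting xs[:i] into segments 0..j,
        # where segment j is scored against means[j]
        INF = float("inf")
        cost = [[INF] * (n + 1) for _ in range(k)]
        back = [[0] * (n + 1) for _ in range(k)]
        for i in range(1, n + 1):
            cost[0][i] = sum((x - means[0]) ** 2 for x in xs[:i])
        for j in range(1, k):
            for i in range(j + 1, n + 1):
                for s in range(j, i):
                    c = cost[j - 1][s] + sum((x - means[j]) ** 2
                                             for x in xs[s:i])
                    if c < cost[j][i]:
                        cost[j][i], back[j][i] = c, s
        # backtrace to recover segment boundaries
        new_bounds, i = [n], n
        for j in range(k - 1, 0, -1):
            i = back[j][i]
            new_bounds.append(i)
        new_bounds.append(0)
        new_bounds.reverse()
        if new_bounds == bounds:
            break
        bounds = new_bounds
        means = [sum(xs[bounds[j]:bounds[j + 1]]) / (bounds[j + 1] - bounds[j])
                 for j in range(k)]
    return bounds, means
```

In the full framework, the segmentation step corresponds to Viterbi decoding under the HMM, and the re-estimation step updates the class centroids and emission parameters.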
  </Section>
</Paper>