<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4004">
<Title>Induction of Word and Phrase Alignments for Automatic Document Summarization</Title>
<Section position="8" start_page="526" end_page="527" type="concl">
<SectionTitle> 6. Conclusion and Discussion </SectionTitle>
<Paragraph position="0"> Currently, summarization systems are limited to using either hand-annotated data or weak alignment models at the granularity of sentences, which serve as suitable training data only for sentence extraction systems. To train more advanced systems, such as document compression models or next-generation abstraction models, we need to better understand the lexical correspondences between documents and their human-written abstracts. Our work is motivated by the desire to leverage the vast number of <document, abstract> pairs that are freely available on the Internet and in other collections, and to create word- and phrase-aligned <document, abstract> corpora automatically.</Paragraph>
<Paragraph position="1"> This article presents a statistical model for learning such alignments in a completely unsupervised manner. The model is based on an extension of a hidden Markov model in which multiple emissions are made in the course of one transition. We have described efficient algorithms in this framework, all based on dynamic programming.</Paragraph>
<Paragraph position="2"> Using this framework, we have experimented with complex models of movement and lexical correspondence. Unlike the approaches taken in machine translation, where only very simple models are used, we have shown how to efficiently and effectively leverage such disparate knowledge sources as WordNet, syntax trees, and identity models.</Paragraph>
<Paragraph position="3"> We have empirically demonstrated that our model is able to learn the complex structure of <document, abstract> pairs. Our system outperforms competing approaches, including the standard machine translation alignment models (Brown et al. 1993; Vogel, Ney, and Tillmann 1996) and the state-of-the-art Cut and Paste summary alignment technique (Jing 2002).</Paragraph>
<Paragraph position="4"> We have identified two major sources of error in our alignment procedure: null-generated summary words and lexical identity. Clearly, more work needs to be done to fix these problems. One approach that we believe would be particularly fruitful is to add a fifth model to the linearly interpolated rewrite model, based on lists of synonyms automatically extracted from large corpora. Additionally, investigating the possibility of including some sort of weak coreference knowledge in the model might help with the second class of errors.</Paragraph>
<Paragraph position="5"> One obvious aspect of our method that may reduce its general usefulness is the computation time. In fact, we found that despite the efficient dynamic programming algorithms available for this model, the state space and output alphabet are simply so large and complex that we were forced to first map documents down to extracts before we could process them (and even so, computation took roughly 1,000 processor hours). Though we have not pursued it in this work, we believe there is room for improvement computationally as well. One obvious first approach would be to run a simpler model for the first iteration (for example, Model 1 from machine translation (Brown et al. 1993), which tends to be very recall oriented) and use it to seed subsequent iterations of the more complex model. By doing so, one could recreate the extracts at each iteration, using the previous iteration's parameters to make better and shorter extracts. Similarly, one might allow summary words to align only to words found in their corresponding extract sentences, which would significantly speed up training and, combined with the recreated extracts, might not hurt performance.</Paragraph>
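For concreteness, the following is a minimal sketch, in Python, of the forward dynamic program for a hidden Markov model that emits a multi-word phrase on each transition, in the spirit of the model described above. The names `start`, `trans`, and `emit` and the phrase-length cap are illustrative assumptions, not the article's implementation, which additionally handles null-generated words and a richer jump distribution.

```python
from collections import defaultdict

def forward(summary, states, start, trans, emit, max_phrase=3):
    """Total probability of the summary under a phrase-emitting HMM (sketch).

    alpha[j][s] = probability of generating summary[:j] and ending in
    document state s. Each transition into s emits a phrase of
    1..max_phrase summary words, so the recursion ranges over phrase
    lengths as well as state pairs; this enlarged search space is the
    source of the computational cost discussed above.
    """
    n = len(summary)
    alpha = [defaultdict(float) for _ in range(n + 1)]
    # Initialization: the first transition emits summary[0:k] from state s.
    for s in states:
        for k in range(1, min(max_phrase, n) + 1):
            alpha[k][s] += start(s) * emit(tuple(summary[:k]), s)
    # Recursion: extend by one transition (and one emitted phrase) at a time.
    for j in range(1, n):
        for s_prev, p in alpha[j].items():
            for s in states:
                p_jump = p * trans(s_prev, s)
                for k in range(1, min(max_phrase, n - j) + 1):
                    alpha[j + k][s] += p_jump * emit(tuple(summary[j:j + k]), s)
    # All summary words must have been emitted by the final transition.
    return sum(alpha[n].values())
```

A Viterbi variant replaces the sums with maxima, and the posteriors needed for EM training follow from the analogous backward pass.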
<Paragraph position="6"> A final option, but one that we do not advocate, would be to give up on phrases and train the model in a word-to-word fashion. This could be coupled with heuristic phrasal creation, as is done in machine translation (Och and Ney 2000), but doing so completely loses the probabilistic interpretation that makes this model so pleasing.</Paragraph>
<Paragraph position="7"> Aside from computational considerations, the most obvious future effort along the lines of this model is to incorporate it into a full document summarization system. Since this can be done in many ways, including training extraction systems, compression systems, headline generation systems, and even abstraction systems, we left this to future work so that we could focus specifically on the alignment task in this article.</Paragraph>
<Paragraph position="8"> Nevertheless, the true usefulness of this model will be borne out by its application to real summarization tasks.</Paragraph>
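Since both the seeding suggestion and the word-to-word option above rest on IBM Model 1 (Brown et al. 1993), here is a toy sketch of its EM updates. The function name, the uniform initialization, and the fixed iteration count are assumptions for illustration, not the implementation used in the article.

```python
from collections import defaultdict

def train_model1(pairs, iterations=5):
    """EM training for IBM Model 1 (sketch).

    pairs: list of (document_tokens, summary_tokens).
    Returns t[(s, d)], an estimate of P(summary word s | document word d).
    """
    NULL = "<null>"  # permits null-generated summary words
    t = defaultdict(lambda: 1.0)  # uniform start; EM sharpens it
    for _ in range(iterations):
        count = defaultdict(float)  # expected (s, d) alignment counts
        total = defaultdict(float)  # expected counts per document word
        for doc, summ in pairs:
            src = [NULL] + list(doc)
            for s in summ:
                z = sum(t[s, d] for d in src)  # E-step normalizer
                for d in src:
                    c = t[s, d] / z            # posterior alignment probability
                    count[s, d] += c
                    total[d] += c
        # M-step: renormalize the translation table.
        t = defaultdict(float,
                        {(s, d): c / total[d] for (s, d), c in count.items()})
    return t
```

The resulting table, and the word alignments it implies, could then seed the phrase-level model or feed heuristic phrase extraction in the style of Och and Ney (2000).
</Section>
</Paper>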