<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1038"> <Title>Query-Relevant Summarization using FAQs</Title> <Section position="6" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Summary </SectionTitle> <Paragraph position="0"> The task of summarization is difficult to define and even more difficult to automate. Historically, a rewarding line of attack for automating language-related problems has been to take a machine learning perspective: let a computer learn how to perform the task by &quot;watching&quot; a human perform it many times. This is the strategy we have pursued here.</Paragraph> <Paragraph position="1"> There has been some work on learning a probabilistic model of summarization from text; some of the earliest work on this was due to Kupiec et al. (1995), who used a collection of manually-summarized text to learn the weights for a set of features used in a generic summarization system. Hovy and Lin (1997) presented another system that learned how the position of a sentence affects its suitability for inclusion in a summary of the document. More recently, there has been work on building more complex, structured models--probabilistic syntax trees--to compress single sentences (Knight and Marcu, 2000). Mani and Bloedorn (1998) proposed a method for automatically constructing decision trees to predict whether a sentence should or should not be included in a document's summary. These previous approaches focus mainly on the generic summarization task, not query-relevant summarization.</Paragraph> <Paragraph position="2"> The language modelling approach described here does suffer from a common flaw within text processing systems: the problem of synonymy. A candidate answer containing the term Constantinople is likely to be relevant to a question about Istanbul, but recognizing this correspondence requires a step beyond word frequency histograms. 
Synonymy has received much attention within the document retrieval community recently, and researchers have applied a variety of heuristic and statistical techniques--including pseudo-relevance feedback and local context analysis (Efthimiadis and Biron, 1994; Xu and Croft, 1996). Some recent work in statistical IR has extended the basic language modelling approaches to account for word synonymy (Berger and Lafferty, 1999).</Paragraph> <Paragraph position="3"> This paper has proposed the use of two novel datasets for summarization: the frequently-asked questions (FAQs) from Usenet archives and question/answer pairs from the call centers of retail companies. Clearly this data isn't a perfect fit for the task of building a QRS system: after all, answers are not summaries. However, we believe that the FAQs represent a reasonable source of query-related document condensations. Furthermore, using FAQs allows us to assess the effectiveness of applying standard statistical learning machinery--maximum-likelihood estimation, the EM algorithm, and so on--to the QRS problem. More importantly, it allows us to evaluate our results in a rigorous, non-heuristic way. Although this work is meant as an opening salvo in the battle to conquer summarization with quantitative, statistical weapons, we expect in the future to enlist linguistic, semantic, and other non-statistical tools which have shown promise in condensing text.</Paragraph> </Section> </Paper>
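The synonymy flaw noted above can be made concrete with a small sketch. This is a toy word-frequency overlap score, not the paper's actual language model, and the example sentences are invented for illustration: a candidate answer phrased entirely in terms of Constantinople shares no surface words with a question about Istanbul, so a pure histogram match prefers an irrelevant sentence that happens to reuse the query's words.

```python
from collections import Counter

def overlap_score(query: str, candidate: str) -> int:
    """Toy relevance score: size of the multiset intersection of the
    word-frequency histograms of the query and the candidate answer.
    (Illustrative only; the paper's QRS model is a learned language model.)"""
    q = Counter(query.lower().split())
    c = Counter(candidate.lower().split())
    return sum(min(q[w], c[w]) for w in q)

# Hypothetical sentences illustrating the Constantinople/Istanbul problem.
query = "when did istanbul fall"
relevant = "constantinople fell to the ottomans in 1453"
irrelevant = "istanbul is famous for its fall fashion"

print(overlap_score(query, relevant))    # 0: no shared surface words
print(overlap_score(query, irrelevant))  # 2: shares "istanbul" and "fall"
```

The relevant answer scores zero while the irrelevant one scores higher, which is exactly the correspondence a step beyond word frequency histograms (e.g. the synonymy-aware models cited above) is meant to recover.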