<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0504">
  <Title>Summarization of Noisy Documents: A Pilot Study</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Previous work in text summarization has focused predominately on clean, well-formatted documents, i.e., documents that contain relatively few spelling and grammatical errors, such as news articles or published technical material. In this paper, we present a pilot study of noisy document summarization, motivated primarily by the impact of various kinds of physical degradation that pages may endure before they are scanned and processed using optical character recognition (OCR) software.</Paragraph>
    <Paragraph position="1"> As more and more documents are now scanned in by OCR, an understanding of the impact of OCR on summarization is crucial and timely. The Million Book Project is one of the projects that uses OCR technology for digitizing books. Pioneered by researchers at Carnegie Mellon University, it aims to digitize a million books by 2005, by scanning the books and indexing their full text with OCR technology (http://www.archive.org/texts/millionbooks.php).</Paragraph>
    <Paragraph position="2"> Understandably, summarizing documents that contain many errors is an extremely difficult task. In our study, we focus on analyzing how the quality of summaries is affected by the level of noise in the input document, and how each stage in summarization is impacted by the noise. Based on our analysis, we suggest possible ways of improving the performance of automatic summarization systems for noisy documents. We hope to use what we have learned from this initial investigation to shed light on future directions.</Paragraph>
    <Paragraph position="3"> What we ascertain from studying the problem of noisy document summarization can be useful in a number of other applications as well. Noisy documents constitute a significant percentage of documents we encounter in everyday life. The output from OCR and speech recognition (ASR) systems typically contain various degrees of errors, and even purely electronic media, such as email, are not error-free. To summarize such documents, we need to develop techniques to deal with noise, in addition to working on the core algorithms. Whether we can successfully handle noise will greatly influence the final quality of summaries of such documents.</Paragraph>
    <Paragraph position="4"> Some researchers have studied problems relating to information extraction from noisy sources. To date, this work has focused predominately on errors that arise during speech recognition, and on problems somewhat different from summarization. For example, Gotoh and Renals propose a finite state modeling approach to extract sentence boundary information from text and audio sources, using both n-gram and pause duration information (Gotoh and Renals, 2000). They found that precision and recall of over 70% could be achieved by combining both kinds of features. Palmer and Ostendorf describe an approach for improving named entity extraction by explicitly modeling speech recognition errors through the use of statistics annotated with confidence scores (Palmer and Ostendorf, 2001). Hori and Furui summarize broadcast news speech by extracting words from automatic transcripts using a word significance measure, a confidence score, linguistic likelihood, and a word concatenation probability (Hori and Furui, 2001).</Paragraph>
    <Paragraph position="5"> There has been much less work, however, in the case of noise induced by optical character recognition. Early papers by Taghva, et al. show that moderate error rates have little impact on the effectiveness of traditional information retrieval measures (Taghva et al., 1996a; Taghva et al., 1996b), but this conclusion does not seem to apply to the task of summarization. Miller, et al. study the performance of named entity extraction under a variety of scenarios involving both ASR and OCR output (Miller et al., 2000), although speech is their primary interest. They found that by training their system on both clean and noisy input material, performance degraded linearly as a function of word error rates. They also note in their paper: &amp;quot;To our knowledge, no other information extraction technology has been applied to OCR material&amp;quot; (pg. 322). An intriguing alternative to text-based summarization is Chen and Bloomberg's approach to creating summaries without the need for optical character recognition (Chen and Bloomberg, 1998). Instead, they extract indicative summary sentences using purely image-based techniques and common document layout conventions. While this is effective when the final summary is to be viewed on-screen by the user, the issue of optical character recognition must ultimately be faced in most applications of interest (e.g., keyword-driven information retrieval).</Paragraph>
    <Paragraph position="6"> For the work we present in this paper, we performed a small pilot study in which we selected a set of documents and created noisy versions of them. These were generated both by scanning real pages via OCR and by using a filter we have developed that injects various levels of noise into an original source document. The clean and noisy documents were then piped through a summarization system. We tested different modules that are often included in such systems, including sentence boundary detection, part-of-speech tagging, syntactic parsing, extraction, and editing of extracted sentences. The experimental results show that these modules suffer significant degradation as the noise level in the document increases. We discuss the errors made at each stage and how they affect the quality of final summaries.</Paragraph>
    <Paragraph position="7"> In Section 2, we describe our experiment, including the data creation process and various tests we performed.</Paragraph>
    <Paragraph position="8"> In Section 3, we analyze the results of the experiment and correlate the quality of summaries with noise levels in the input document and the errors made at different stages of the summarization process. We then discuss some of the challenges in summarizing noisy documents and suggest possible methods for improving the performance of noisy document summarization. We conclude with future work.</Paragraph>
  </Section>
class="xml-element"></Paper>