<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0504"> <Title>Summarization of Noisy Documents: A Pilot Study</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Data creation </SectionTitle> <Paragraph position="0"> We selected a small set of four documents to study in our experiment. Three of the four documents were from the TREC corpus and one was from a Telecommunications corpus we collected ourselves (Jing, 2001). All are professionally written news articles, each containing from 200 to 800 words (the shortest document was 9 sentences and the longest was 38 sentences).</Paragraph> <Paragraph position="1"> For each document, we created 10 noisy versions. The first five corresponded to real pages that had been printed, possibly subjected to degradation, scanned at 300 dpi using a UMAX Astra 1200S scanner, and then OCR'ed with Caere OmniPage Limited Edition. These included: clean: The page as printed.</Paragraph> <Paragraph position="2"> fax: A faxed version of the page.</Paragraph> <Paragraph position="3"> dark: An excessively dark (but legible) photocopy.</Paragraph> <Paragraph position="4"> light: An excessively light (but legible) photocopy.</Paragraph> <Paragraph position="5"> skew: The clean page skewed on the scanner glass.</Paragraph> <Paragraph position="6"> Note that because the faxed and photocopied documents were processed by running them through automatic page feeders, these pages can also exhibit noticeable skew.</Paragraph> <Paragraph position="7"> The remaining five sample documents in each case were electronic copies of the original into which synthetic noise (single-character deletions, insertions, and substitutions) had been randomly injected at predetermined rates: 5%, 10%, 15%, 20%, and 25%.</Paragraph> <Paragraph position="8"> In general, we want to study both real and synthetic noise. The argument in favor of the former is obvious. The argument in favor of the latter is that synthetic noise is easier to control, and it often has exactly the same impact on the overall process as real noise. Even though the errors themselves are artificial, their impact on later processing is often the same. For example, changing &quot;nuclear&quot; to &quot;nZclear&quot; does not reflect a common OCR error, but it has the same effect: a word in the dictionary becomes a word that is no longer recognized. If the impact is identical and the noise is easier to control, then it is beneficial to use synthetic noise in addition to real noise.</Paragraph>
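As an illustration, character-level noise of this kind can be injected with a few lines of code. The sketch below is a minimal version under stated assumptions: a uniform per-character corruption probability and an alphanumeric alphabet for insertions and substitutions, neither of which is specified by the experiment itself.

```python
import random
import string

def inject_noise(text, rate, alphabet=string.ascii_letters + string.digits, seed=0):
    # Corrupt `text` with single-character deletions, insertions, and
    # substitutions; each position is corrupted with probability `rate`.
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)                    # leave the character untouched
            continue
        op = rng.choice(("delete", "insert", "substitute"))
        if op == "delete":
            continue                          # drop the character
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))  # add a spurious character after it
        else:
            out.append(rng.choice(alphabet))  # replace it with a random character
    return "".join(out)

# Example: inject_noise("nuclear", 0.25) produces "nZclear"-style corruptions.
```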
<Paragraph position="9"> A summary was created for each document by human experts. For the three documents from the TREC corpus, the summaries were generated by taking a majority opinion: each document was given to five people who were asked to select 20% of the original sentences as the summary, and sentences selected by three or more of the five human subjects were included in the summary of the document. For the document from the Telecommunications corpus, an abstract was provided by a staff writer from the news service. These human-created summaries are useful in evaluating the quality of the automatic summaries.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Summarization stages </SectionTitle> <Paragraph position="0"> We are interested in testing how each stage of a summarization system is affected by noise, and how this in turn affects the quality of the summaries. Many summarization approaches exist, and it would be difficult to study the effects of noise on all of them. However, the following stages are common to many summarization systems: Step 1: Tokenization. The main task here is to break the text into sentences. Tokens in the input text are also identified.</Paragraph> <Paragraph position="1"> Step 2: Preprocessing. This typically involves part-of-speech tagging and syntactic parsing. This step is optional; some systems do not perform tagging or parsing at all. Topic segmentation is deployed by a few summarization systems, but not many.</Paragraph> <Paragraph position="2"> Step 3: Extraction. This is the main step in summarization, in which the automatic summarizer selects key sentences (sometimes paragraphs or phrases) to include in the summary.</Paragraph> <Paragraph position="3"> Step 4: Editing. Some systems post-edit the extracted sentences to make them more coherent and concise.</Paragraph> <Paragraph position="4"> For each stage, we selected one or two systems that perform the task and tested their performance on both clean and noisy documents.</Paragraph> <Paragraph position="5"> For tokenization, we tested two tokenizers: one is a rule-based system that decides sentence boundaries based on heuristic rules encoded in the program, and the other is a trainable tokenizer that uses a decision tree approach for detecting sentence boundaries and has been trained on a large amount of data.</Paragraph> <Paragraph position="6"> For part-of-speech tagging and syntactic parsing, we tested the English Slot Grammar (ESG) parser (McCord, 1990). The outputs from both tokenizers were tested on ESG. The ESG parser requires text already divided into sentences as input and returns a parse tree for each input sentence, including a part-of-speech tag for each word in the sentence. The reason we chose a full parser such as ESG, rather than a part-of-speech tagger and a phrase chunking system, is that the summary editing system in Step 4 uses the output from ESG. Although many sentence extraction systems do not use full syntactic information, summarization systems that do use parsing output commonly rely on a full parser, whether ESG or a statistical parser such as Collins's, since such systems often perform operations that require a deep understanding of the original text.</Paragraph> <Paragraph position="7"> For extraction, we used a program that relies on lexical cohesion, frequency, sentence positions, and cue phrases to identify key sentences (Jing, 2001). The length parameter of the summaries was set to 20% of the number of sentences in the original document.</Paragraph> <Paragraph position="8"> The output from the rule-based tokenizer was used in this step. This particular extraction system does not use tagging and parsing.</Paragraph> <Paragraph position="9"> In the last step, we tested a cut-and-paste system that edits extracted sentences by simulating the revision operations often performed by professional abstractors (Jing, 2001). The outputs from all three previous steps were used by the cut-and-paste system.</Paragraph>
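Schematically, the four stages chain together as in the sketch below. Every component is a deliberately naive stand-in (a regular-expression splitter, lead-sentence extraction, a no-op editor) rather than the actual tokenizers, ESG, extractor, or cut-and-paste system used in the experiment; it is meant only to show how the steps feed one another.

```python
import re

def split_sentences(text):
    # Step 1: tokenization. A rough heuristic splitter standing in for the
    # rule-based or decision-tree tokenizers used in the experiment.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [p.strip() for p in parts if p.strip()]

def extract_key_sentences(sentences, ratio=0.20):
    # Step 3: extraction. Lead sentences stand in for a scorer based on
    # lexical cohesion, frequency, sentence position, and cue phrases.
    n_keep = max(1, round(ratio * len(sentences)))
    return sentences[:n_keep]

def edit_sentences(sentences):
    # Step 4: editing. A real cut-and-paste system would reduce and combine
    # sentences; this placeholder returns them unchanged.
    return sentences

def summarize(text, ratio=0.20):
    sentences = split_sentences(text)
    # Step 2 (preprocessing: tagging/parsing) is skipped here, just as the
    # extraction system used in the experiment does not require it.
    return " ".join(edit_sentences(extract_key_sentences(sentences, ratio)))
```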
<Paragraph position="10"> All of the summaries produced in this experiment were generic, single-document summaries.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Results and Analysis </SectionTitle> <Paragraph position="0"> In this section, we present results at each stage of summarization, analyzing the errors made and their effects on the quality of the summaries.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 OCR performance </SectionTitle> <Paragraph position="0"> We begin by examining the overall performance of the OCR process. Using standard edit distance techniques (Esakov et al., 1994), we can compare the output of OCR to the ground-truth to classify and quantify the errors that have arisen. We then compute, on a per-character and per-word basis, a figure for average precision (the percentage of recognized characters or words that are correct) and recall (the percentage of characters or words in the input document that are correctly recognized). As indicated in Table 1, OCR performance varies widely depending on the type of degradation. Precision values are generally higher than recall because, in certain cases, the OCR system failed to produce output for a portion of the page in question. Since we are particularly interested in punctuation due to its importance in delimiting sentence boundaries, we tabulate a separate set of precision and recall values for such characters. Note that these are uniformly lower than the other values in the table. Recall, in particular, is a serious issue; many punctuation marks are missed in the OCR output.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Sentence boundary errors </SectionTitle> <Paragraph position="0"> Since most summarization systems rely on sentence extraction, it is important to identify sentence boundaries correctly. For clean text, the reported accuracy of sentence boundary detection is usually above 95% (Palmer and Hearst, 1997; Reynar and Ratnaparkhi, 1997; Riley, 1989). However, detecting sentence boundaries in noisy documents is a serious challenge, since punctuation and capitalization, which are important features for sentence boundary detection, are unreliable in noisy documents.</Paragraph> <Paragraph position="1"> As we have just noted, punctuation errors arise frequently in the OCR output of degraded page images.</Paragraph> <Paragraph position="2"> We tested two tokenizers: one is a rule-based system and the other is a decision tree system. The experimental results show that for clean text, the two systems perform almost equally well. Manual checking of the results indicates that both tokenizers made very few errors.</Paragraph> <Paragraph position="3"> In total, the documents contain 90 sentence boundaries. The decision tree tokenizer correctly identified 88 of the sentence boundaries and missed two (precision: 100%; recall: 98%). The rule-based tokenizer correctly identified 89 of the boundaries and missed one (precision: 100%; recall: 99%). Neither system made any false positive errors (i.e., they did not break sentences at non-sentence boundaries).</Paragraph> <Paragraph position="4"> For the noisy documents, however, both tokenizers made significant numbers of errors. The types of errors they made, moreover, were quite different: the rule-based system made many false negative errors, while the decision tree system made many false positive errors.</Paragraph>
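The boundary-level precision and recall figures quoted here can be computed by comparing predicted break positions against the gold breaks. A minimal sketch, assuming boundaries are represented as offsets into the ground-truth text:

```python
def boundary_scores(predicted, gold):
    # `predicted` and `gold` are sets of sentence-break positions (e.g.
    # character offsets).  A predicted break not in the gold set is a false
    # positive; a gold break that was not predicted is a missed boundary.
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# For the clean text above: 88 correct breaks, none spurious, out of 90 gold
# breaks gives precision 88/88 = 100% and recall 88/90, reported as 98%.
```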
<Paragraph position="5"> Therefore, the rule-based system identified far fewer sentence boundaries than the truth, while the decision tree system identified far more than the truth.</Paragraph> <Paragraph position="6"> Table 2: Number of sentences detected and average words per sentence for two tokenizers. Tokenizer 1 is decision tree based, and tokenizer 2 is rule based.</Paragraph> <Paragraph position="7"> Table 2 shows the number of sentences identified by each tokenizer for different versions of the documents.</Paragraph> <Paragraph position="8"> As we can see from the table, the noisier the documents, the more errors the tokenizers made. This relationship was demonstrated clearly by the results for the documents with synthetic noise. As the noise rate increases, the number of boundaries identified by the decision tree tokenizer gradually increases, and the number of boundaries identified by the rule-based tokenizer gradually decreases. Both numbers diverge from the truth, but they err in opposite directions.</Paragraph> <Paragraph position="9"> The two tokenizers behaved less consistently on the OCR'ed documents. For OCR.light, OCR.dark, and OCR.fax, the decision tree tokenizer produced more sentence boundaries than the rule-based tokenizer. But for OCR.clean and OCR.skew, the decision tree tokenizer produced fewer sentence boundaries. This may be related to the noise level in the document: OCR.clean and OCR.skew contain fewer errors than the other noisy versions (recall Table 1). This suggests that the decision tree tokenizer tends to identify fewer sentence boundaries than the rule-based tokenizer for clean text or documents with very low levels of noise, but more sentence boundaries when the documents have a relatively high level of noise.</Paragraph> <Paragraph position="10"> Errors made at this stage are extremely detrimental, since they propagate to all of the other modules in a summarization system. When a sentence boundary is incorrectly marked, part-of-speech tagging and syntactic parsing are likely to fail. Sentence extraction may also become problematic; for example, one of the documents in our test set contains 24 sentences, but for one of its noisy versions (OCR.dark), the rule-based tokenizer missed most sentence boundaries and divided the document into only three sentences, making extraction at the sentence level difficult at best.</Paragraph> <Paragraph position="11"> Since sentence boundary detection is important to summarization, the development of robust techniques that can handle noisy documents is worthwhile. We will return to this point in Section 4.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Parsing errors </SectionTitle> <Paragraph position="0"> Some summarization systems use a part-of-speech tagger or a syntactic parser in their preprocessing steps.</Paragraph> <Paragraph position="1"> We computed the percentage of sentences for which ESG failed to return a complete parse tree, and used that value as one way of measuring the performance of the parser on the noisy documents. If the parser cannot return a complete parse tree, then it definitely fails to analyze the sentence; but even when a complete parse tree is returned, the parse can be wrong. As we can see from Table 3, a significant percentage of noisy sentences were not parsed.</Paragraph>
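The measure reported in Table 3 is simply the proportion of sentences for which no complete parse tree comes back. A sketch of that measurement, treating the parser as an arbitrary callable (ESG is assumed here to be wrapped behind such an interface, which is not part of the original setup):

```python
def parse_failure_rate(sentences, parse):
    # `parse` is any callable wrapping the parser; it is assumed to return a
    # complete parse tree or None.  Note that a returned tree is not
    # necessarily a correct parse.
    failures = 0
    for sentence in sentences:
        try:
            tree = parse(sentence)
        except Exception:
            tree = None
        if tree is None:
            failures += 1
    return failures / len(sentences) if sentences else 0.0
```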
<Paragraph position="2"> Even for the documents with synthetic noise at a 5% rate, around 60% of the sentences cannot be handled by the parser. This indicates that a full parser such as ESG is very sensitive to noise.</Paragraph> <Paragraph position="3"> Even when ESG produces a complete parse tree for a noisy sentence, the result is incorrect most of the time. For instance, the sentence &quot;Internet sites found that almost 90 percent collected personal information from youngsters&quot; was transformed to &quot;uInternet sites fo6ndha alQmostK0 pecent coll / 9ed pe?&quot; after adding synthetic noise at a 25% rate. For this noisy sentence, the parser returned a complete parse tree that marked the word &quot;sites&quot; as the main verb of the sentence, and tagged all the other words in the sentence as nouns. (One reason might be that the tagger tends to tag unknown words as nouns, and all the noisy words are treated as unknown words.) Although a complete parse tree is returned in this case, it is incorrect. This explains why the parser returned a higher percentage of complete parse trees for documents with synthetic noise at the 25% rate than for documents with lower levels of noise.</Paragraph> <Paragraph position="4"> The above results indicate that syntactic parsers are very vulnerable to noise in a document. Even low levels of noise lead to a significant drop in performance.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Extract quality versus noise level </SectionTitle> <Paragraph position="0"> In the next step, we studied how the sentence extraction module in a summarization system is affected by noise in the input document. The sentence extractor we used (Jing, 2001) relies on lexical links between words, word frequency, cue phrases, and sentence positions to identify key sentences. The performance of the system is affected by noise along multiple dimensions: lexical links are less reliable under noisy conditions, cue phrases are likely to be missed because of noisy spelling, and word frequency counts are less accurate because occurrences of the same word can be corrupted in different ways.</Paragraph>
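A much-simplified scorer in the same spirit is sketched below. The weights, stopword list, and cue phrases are invented for illustration, and the lexical links used by the actual extractor are omitted; this is not the extraction system of Jing (2001), only a sketch of the kind of features it combines.

```python
import re
from collections import Counter

# Illustrative resources; the actual extractor's lists are not published here.
CUE_PHRASES = ("in conclusion", "in summary", "significantly", "importantly")
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "that", "for", "on"}

def score_sentences(sentences):
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words if w not in STOPWORDS)
    scores = []
    for i, (sentence, words) in enumerate(zip(sentences, tokenized)):
        content = [w for w in words if w not in STOPWORDS]
        freq_score = sum(freq[w] for w in content) / (len(content) or 1)
        position_score = 1.0 / (i + 1)          # favor early sentences
        cue_score = 1.0 if any(c in sentence.lower() for c in CUE_PHRASES) else 0.0
        scores.append(freq_score + 2.0 * position_score + cue_score)
    return scores

def extract(sentences, ratio=0.20):
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    keep = sorted(ranked[:max(1, round(ratio * len(sentences)))])
    return [sentences[i] for i in keep]         # keep original document order
```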
<Paragraph position="1"> Evaluation of noisy document summaries is an interesting problem. Both intrinsic and extrinsic evaluation need to deal with the effect of noise on the quality of the final summaries. For intrinsic evaluation, it is debatable whether clean human summaries or noisy document summaries (or both) should be used for comparison. There are two issues related to 'noisy' human summaries: one, whether such summaries are obtainable, and two, whether such summaries should be used in evaluation. We note that it is already difficult for a human to recover the information in the noisy documents when the synthetic noise rate reaches 10%; therefore, noisy human summaries will not be available for documents with relatively high levels of noise. Secondly, even though the original documents are noisy, it is desirable for the final summaries to be fluent and clean. Therefore, if our ultimate goal is to produce a fluent and clean summary, it makes more sense to compare the automatic summaries with clean human summaries rather than with noisy summaries.</Paragraph> <Paragraph position="3"> We compared the noisy automatic summaries with the clean human summaries using three measures: unigram overlap between the automatic summary and the human-created summary, bigram overlap, and the simple cosine. These results are shown in Table 4. The unigram overlap is computed as the number of unique words occurring in both the extract and the ideal summary for the document, divided by the total number of unique words in the extract. Bigram overlap is computed similarly, replacing words with bigrams. The simple cosine is computed as the cosine of two document vectors, the weight of each element in a vector being 1/√N, where N is the total number of elements in that vector.</Paragraph> <Paragraph position="4"> Not surprisingly, summaries of noisier documents generally have a lower overlap with human-created summaries. However, this can be caused either by the noise in the document or by poor performance of the sentence extraction system. To separate these effects and measure the performance of sentence extraction alone, we also computed the unigram overlap, bigram overlap, and cosine between each noisy document and its corresponding original text. These numbers are included in Table 4 in parentheses; they are an indication of the average noise level in a document. For instance, the table shows that 97% of the words that occurred in OCR.clean documents also appeared in the original text, while only 62% of the words that occurred in OCR.light appeared in the original. This indicates that OCR.clean is less noisy than OCR.light.</Paragraph>
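Under these definitions the three measures are straightforward to compute. The sketch below assumes lowercased, word-character tokenization, which the description above leaves unspecified:

```python
import math
import re

def ngrams(text, n=1):
    words = re.findall(r"[\w']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(extract, ideal, n=1):
    # Fraction of the extract's unique n-grams that also occur in the ideal summary.
    e, g = ngrams(extract, n), ngrams(ideal, n)
    return len(e & g) / len(e) if e else 0.0

def simple_cosine(extract, ideal):
    # With every element of an N-element binary vector weighted 1/sqrt(N),
    # each vector has unit length, so the cosine reduces to
    # |intersection| / sqrt(|A| * |B|).
    a, b = ngrams(extract, 1), ngrams(ideal, 1)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```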
</Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Abstract generation for noisy documents </SectionTitle> <Paragraph position="0"> To generate more concise and coherent summaries, a summarization system may edit the extracted sentences. To study how this step is affected by noise, we tested a cut-and-paste system that edits extracted sentences by simulating revision operations often used by human abstractors, including removing phrases from an extracted sentence and combining a reduced sentence with other sentences (Jing, 2001). This cut-and-paste stage relies on the results from sentence extraction in the previous step, the output from ESG, and a co-reference resolution system.</Paragraph> <Paragraph position="1"> For the clean text, the cut-and-paste system performed sentence reduction on 59% of the sentences that were extracted in the sentence extraction step, and sentence combination on 17% of the extracted sentences. For the noisy text, however, the system applied very few revision operations to the extracted (noisy) sentences. Since the cut-and-paste system relies on the output from ESG and co-reference resolution, which failed on most of the noisy text, it is not surprising that it did not perform well under these circumstances.</Paragraph> <Paragraph position="2"> Editing sentences requires a deeper understanding of the document and, as the last step in the summarization pipeline, relies on results from all of the previous steps. Hence, it is affected most severely by noise in the input document.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Challenges in Noisy Document Summarization </SectionTitle> <Paragraph position="0"> In the previous section, we presented and analyzed errors at each stage of summarization when applied to noisy documents. The results show that the methods we tested at every step are fragile, susceptible to failures and errors even with slight increases in the noise level of a document. Clearly, much work needs to be done to achieve acceptable performance in noisy document summarization. We need to develop summarization algorithms that do not suffer significant degradation when used on noisy documents. We also need to develop robust natural language processing techniques. For example, it would be useful to develop a sentence boundary detection system that can identify sentence breaks in noisy documents more reliably. One way to achieve this might be to retrain an existing system on tokenized noisy documents so that it learns features that are indicative of sentence breaks in noisy documents. However, this is only applicable if the noise level in the documents is low; for documents with high levels of noise, such an approach will not be effective.</Paragraph> <Paragraph position="1"> In the remainder of this section, we discuss several issues in noisy document summarization, identifying the problems and proposing possible solutions. We regard this as a first step towards a more comprehensive study of noisy document summarization.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Choosing an appropriate granularity </SectionTitle> <Paragraph position="0"> It is important to choose an appropriate unit level for representing the summaries. For clean text, sentence extraction is a feasible goal since we can reliably identify sentence boundaries. For documents with very low levels of noise, sentence extraction is still possible since we can probably improve our programs to handle such documents. However, for documents with relatively high noise rates, we believe it is better to forgo sentence extraction and instead favor extraction of keywords or noun phrases, or generation of headline-style summaries. In our experiment, when the synthetic noise rate reached 10% (which is representative of what can happen when real-world documents are degraded), it was already difficult for a human to recover the information the documents were intended to convey.</Paragraph> <Paragraph position="1"> Keywords, noun phrases, or headline-style summaries are informative indications of the main topic of a document. For documents with high noise rates, extracting keywords or noun phrases is a more realistic and attainable goal than sentence extraction. Still, it may be desirable to correct the noise in the extracted keywords or phrases, either before or after summarization. There has been past work on correcting spelling mistakes and errors in OCR output; these techniques could be useful for this purpose.</Paragraph>
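One simple way to make keyword extraction robust to noise, sketched here under the assumption that a list of valid English word forms is available, is to rank only in-vocabulary content words by frequency, so that corrupted tokens are dropped rather than propagated into the summary. The dictionary filter is our illustration, not a technique evaluated in the experiment.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "that", "for", "on"}

def noisy_keywords(text, dictionary, k=10):
    # `dictionary` is any set of known word forms (e.g. loaded from a word
    # list file).  Tokens corrupted by OCR or synthetic noise, such as
    # "nZclear" or "pecent", fail the dictionary check and are ignored.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    valid = [t for t in tokens
             if t in dictionary and t not in STOPWORDS and len(t) > 2]
    return [word for word, _ in Counter(valid).most_common(k)]
```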
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Using other information sources </SectionTitle> <Paragraph position="0"> In addition to text, target documents contain other types of useful information that could be employed in creating summaries. As noted previously, Chen and Bloomberg's image-based summarization technique avoids many of the problems we have been discussing by exploiting document layout features. A possible approach to summarizing noisy documents, then, might be to use their method to create an image summary and then apply OCR to the resulting page. We note, though, that it seems unlikely this would lead to an improvement in the overall OCR results, a problem which almost certainly must be faced at some point in the process.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Assessing error rates without ground-truth </SectionTitle> <Paragraph position="0"> The quality of summarization is directly tied to the level of noise in a document. In this context, it would be useful to develop methods for assessing document noise levels without having access to the ground-truth. Intuitively, OCR may create errors that cause the output text to deviate from &quot;normal&quot; text. Therefore, one way of evaluating OCR output, in the absence of the original ground-truth, is to compare its features against features obtained from a large corpus of correct text. Letter trigrams (Church and Gale, 1991) are commonly used to correct spelling and OCR errors (Angell et al., 1983; Kukich, 1992; Zamora et al., 1981), and can be applied to evaluate OCR output.</Paragraph> <Paragraph position="1"> We computed trigram tables (including symbols and punctuation marks) from 10 days of AP news articles and evaluated the documents used in our experiment. The trigrams were computed over letters, and Good-Turing estimation was used for smoothing. Table 5 reports the average trigram score for each document set. As expected, OCR errors create rare or previously unseen trigrams, which lead to higher trigram scores for noisy documents; the ground-truth (original) documents have the lowest average trigram score. These scores provide a relative ranking that reflects the controlled noise levels (Snoise.05 through Snoise.25), as well as some of the real OCR data (OCR.clean, OCR.dark, and OCR.light).</Paragraph> <Paragraph position="2"> Different texts, however, have very different baseline trigram scores, and the ranges of scores for clean and noisy text overlap. This is because some documents contain more instances of frequent words (such as &quot;the&quot;) than others, which brings down their average scores. This issue makes it impractical to use trigram scores in isolation to judge OCR output.</Paragraph>
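The scoring itself can be sketched as follows. The experiment does not spell out the score beyond an 'average trigram score', so this sketch uses the average negative log-probability per character trigram, which behaves the same way (rare or unseen trigrams push the score up); add-alpha smoothing stands in for the Good-Turing estimation used above.

```python
import math
from collections import Counter

def train_trigram_scorer(reference_text, alpha=1.0):
    # Character-trigram counts from a large corpus of correct text (e.g. the
    # AP news articles above), with add-alpha smoothing as a simple stand-in
    # for Good-Turing estimation.
    counts = Counter(reference_text[i:i + 3] for i in range(len(reference_text) - 2))
    total = sum(counts.values())
    vocabulary = len(counts) + 1                 # reserve mass for unseen trigrams
    def log_prob(trigram):
        return math.log((counts[trigram] + alpha) / (total + alpha * vocabulary))
    return log_prob

def average_trigram_score(text, log_prob):
    # Average negative log-probability per trigram: text full of rare or
    # unseen trigrams (i.e. noisy text) receives a higher score.
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    if not trigrams:
        return 0.0
    return -sum(log_prob(t) for t in trigrams) / len(trigrams)
```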
<Paragraph position="3"> It may be possible to identify some problems if we scan larger units and incorporate contextual information.</Paragraph> <Paragraph position="4"> For example, a window of three characters is too small to judge whether the symbol @ is used properly: a@b looks like a potential OCR error, but it is acceptable when it appears in an email address such as lsa@bbb.com. Increasing the unit size will create sparse data problems, however, which are already an issue for trigrams.</Paragraph> <Paragraph position="5"> In the future, we plan to experiment with improved methods for identifying problematic regions in OCR text, including using language models and incorporating grammatical patterns. Many linguistic properties can be identified when letter sequences are encoded in broad classes. For example, long consonant strings are rare in English text, while long number strings are legal. These properties can be captured when characters are mapped into carefully selected classes such as symbols, numbers, upper- and lower-case letters, consonants, and vowels.</Paragraph> <Paragraph position="6"> Such mappings effectively reduce complexity, allowing us to sample longer strings and scan for abnormal patterns without running into severe sparse data problems.</Paragraph> <Paragraph position="7"> Our intention is to establish a robust index that measures whether a given section of text is &quot;summarizable.&quot; This problem is related to the general question of assessing OCR output without ground-truth, but we shift the scope of the problem to ask whether the text is summarizable, rather than how many errors it may contain.</Paragraph> <Paragraph position="8"> We also note that documents often contain logical components that go beyond basic text. Pages may include photographs and figures, program code, lists, indices, etc. Tables, for example, can be detected, parsed, and reformulated so that it becomes possible to describe their overall structure and even to allow users to query them (Hu et al., 2000). Developing appropriate ways of summarizing such material is another topic of interest.</Paragraph> </Section> </Section> </Paper>