<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0108"> <Title>A Statistical Approach to Automatic OCR Error Correction in Context</Title> <Section position="2" start_page="0" end_page="88" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word errors present problems for various text- or speech-based applications such as optical character recognition (OCR) and voice-input computer interfaces. In particular, though current OCR technology is quite refined and robust, sources such as old books, poor-quality (nth-generation) photocopies, and faxes can still be difficult to process and may cause many OCR errors. For OCR to be truly useful in a wide range of applications, such as office automation and information retrieval systems, OCR reliability must be improved. A method for the automatic correction of OCR errors would clearly be beneficial.</Paragraph> <Paragraph position="1"> Essentially, there are two types of word errors: non-word errors and real-word errors. A non-word error occurs when a word in a source text is interpreted (under OCR) as a string that does not correspond to any valid word in a given word list or dictionary. A real-word error occurs when a source-text word is interpreted as a string that actually does occur in the dictionary, but is not identical to the source-text word. For example, if the source text &quot;John found the man&quot; is rendered as &quot;John fornd he man&quot; by an OCR device, then &quot;fornd&quot; is a non-word error and &quot;he&quot; is a real-word error. In general, non-word errors will never correspond to any dictionary entries and will include wildly incorrect strings (such as &quot;#--&amp;&amp;&quot;) as well as misrecognized alpha-numeric sequences (such as &quot;BN234&quot; for &quot;8N234&quot;). However, some non-word errors might become real-word errors if the size of the word list or dictionary increases. 
(For example, the word &quot;ruel&quot; might count as a non-word error for the source-text word &quot;rut&quot; if a small dictionary is used for reference, but count as a real-word error if an unabridged dictionary is used.) While non-word errors might be corrected without considering the context in which the error occurs, a real-word error can only be corrected by taking context into account.</Paragraph> <Paragraph position="2"> The problems of word-error detection and correction have been studied for several decades.</Paragraph> <Paragraph position="3"> A good survey of this area can be found in [Kukich 1992]. Most traditional word-correction techniques concentrate on non-word error correction and do not consider the context in which the error appears.</Paragraph> <Paragraph position="4"> Recently, statistical language models (SLMs) and feature-based methods have been used for context-sensitive spelling-error correction. For example, Atwell and Elliott [1987] have used a part-of-speech (POS) tagging method to detect real-word errors in text. Mays and colleagues [1991] have exploited word trigrams to detect and correct both non-word and real-word errors that were artificially generated from 100 sentences. Church and Gale [1991] have used a Bayesian classifier method to improve performance on non-word error correction. Golding [1995] has applied a hybrid Bayesian method for real-word error correction, and Golding and Schabes [1996] have combined a POS trigram and Bayesian methods for the same purpose.</Paragraph> <Paragraph position="5"> The goal of the work described here is to investigate the effectiveness and efficiency of SLM-based methods applied to the problem of OCR error correction. 
Since POS-based methods are not effective in distinguishing among candidates with the same POS tags and since methods based on word-trigram models involve extensive training data and require that huge word-trigram tables be available at run time, we used a word-bigram SLM as the first step in our investigation.</Paragraph> <Paragraph position="6"> In this paper, we describe a system that uses a word-bigram SLM technique to correct OCR errors. The system takes advantage of information from multiple sources, including letter n-grams, character confusion probabilities, and word bigram probabilities, to effect context-based word error correction. It can correct non-word as well as real-word errors. In addition, the system can learn the character confusion probability table for a specific OCR environment and use it to achieve better performance.</Paragraph> </Section> </Paper>