<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0108">
  <Title>A Statistical Approach to Automatic OCR Error Correction in Context</Title>
  <Section position="2" start_page="0" end_page="88" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word errors present problems for various text- or speech-based applications such as optical character recognition (OCR) and voice-input computer interfaces. In particular, though current OCR technology is quite refined and robust, sources such as old books, poor-quality (nth-generation) photocopies, and faxes can still be difficult to process and may cause many OCR errors. For OCR to be truly useful in a wide range of applications, such as office automation and information retrieval systems, OCR reliability must be improved. A method for the automatic correction of OCR errors would clearly be beneficial.</Paragraph>
    <Paragraph position="1"> Essentially, there are two types of word errors: non-word errors and real-word errors. A non-word error occurs when a word in a source text is interpreted (under OCR) as a string that does not correspond to any valid word in a given word list or dictionary. A real-word error occurs when a source-text word is interpreted as a string that actually does occur in the dictionary, but is not identical with the source-text word. For example, if the source text &quot;John found the man&quot; is rendered as &quot;John fornd he man&quot; by an OCR device, then &quot;fornd&quot; is a non-word error and &quot;he&quot; is a real-word error. In general, non-word errors will never correspond to any dictionary entries and will include wildly incorrect strings (such as &quot;#--&amp;&amp;&quot;) as well as misrecognized alphanumeric sequences (such as &quot;BN234&quot; for &quot;8N234&quot;). However, some non-word errors might become real-word errors if the size of the word list or dictionary increases. (For example, the word &quot;ruel&quot; might count as a non-word error for the source-text word &quot;rut&quot; if a small dictionary is used for reference, but count as a real-word error if an unabridged dictionary is used.) While non-word errors might be corrected without considering the context in which the error occurs, a real-word error can only be corrected by taking context into account.</Paragraph>
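    <!--
    Editorial illustration (not part of the original paper). The distinction
    drawn above between non-word and real-word errors can be sketched in a few
    lines of Python. The function name classify_word_errors, the toy
    dictionaries, and the assumption that source and OCR words are already
    aligned one-to-one are all hypothetical choices made for this sketch.

    ```python
    def classify_word_errors(source_words, ocr_words, dictionary):
        """Label each aligned (source word, OCR word) pair.

        Assumes the two sequences are aligned one-to-one and that
        `dictionary` holds lower-cased word forms.
        """
        labels = []
        for src, ocr in zip(source_words, ocr_words):
            if ocr == src:
                labels.append("correct")
            elif ocr.lower() in dictionary:
                # A valid dictionary word, but not the word in the source text.
                labels.append("real-word error")
            else:
                # Not found in the word list at all.
                labels.append("non-word error")
        return labels


    # Toy word list, made up for this example.
    small_dictionary = {"john", "found", "the", "man", "he", "rut"}

    source = "John found the man".split()
    ocr_output = "John fornd he man".split()
    labels = classify_word_errors(source, ocr_output, small_dictionary)
    print(list(zip(ocr_output, labels)))

    # The same OCR string changes category when the dictionary grows.
    with_small = classify_word_errors(["rut"], ["ruel"], small_dictionary)
    with_large = classify_word_errors(["rut"], ["ruel"], small_dictionary | {"ruel"})
    print(with_small, with_large)
    ```

    The last two calls mirror the point about dictionary size: whether a
    string counts as a non-word or a real-word error depends on the reference
    word list, not on the string alone.
    -->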
    <Paragraph position="2"> The problems of word-error detection and correction have been studied for several decades.</Paragraph>
    <Paragraph position="3"> A good survey in this area can be found in [Kukich 1992]. Most traditional word-correction techniques concentrate on non-word error correction and do not consider the context in which the error appears.</Paragraph>
    <Paragraph position="4"> Recently, statistical language models (SLMs) and feature-based methods have been used for context-sensitive spelling-error correction. For example, Atwell and Elliott [1987] have used a part-of-speech (POS) tagging method to detect real-word errors in text. Mays and colleagues [1991] have exploited word trigrams to detect and correct both non-word and real-word errors that were artificially generated from 100 sentences. Church and Gale [1991] have used a Bayesian classifier method to improve the performance of non-word error correction. Golding [1995] has applied a hybrid Bayesian method to real-word error correction, and Golding and Schabes [1996] have combined POS-trigram and Bayesian methods for the same purpose.</Paragraph>
    <Paragraph position="5"> The goal of the work described here is to investigate the effectiveness and efficiency of SLM-based methods applied to the problem of OCR error correction. Since POS-based methods are not effective in distinguishing among candidates with the same POS tags and since methods based on word-trigram models involve extensive training data and require that huge word-trigram tables be available at run time, we used a word-bigram SLM as the first step in our investigation.</Paragraph>
    <Paragraph position="6"> In this paper, we describe a system that uses a word-bigram SLM technique to correct OCR errors. The system takes advantage of information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities, to effect context-based word-error correction. It can correct non-word as well as real-word errors. In addition, the system can learn the character confusion probability table for a specific OCR environment and use it to achieve better performance.</Paragraph>
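    <!--
    Editorial illustration (not part of the original paper). This introduction
    does not say how the information sources are combined. One common way to
    combine a channel score (derived, for instance, from character confusion
    probabilities) with word-bigram probabilities is a noisy-channel search
    over candidate words, sketched below with a Viterbi pass. This is an
    assumption for illustration, not a description of the authors' system.
    The function best_correction, the candidate lists, and every probability
    value are made up.

    ```python
    import math

    FLOOR = 1e-6  # assumed floor for unseen events, so log() is always defined


    def best_correction(ocr_words, candidates, channel, bigram, start="<s>"):
        """Pick the most probable source-word sequence with a Viterbi search.

        candidates[o]     : list of candidate source words for OCR token o
        channel[(o, w)]   : P(o | w), how likely OCR renders w as o
        bigram[(prev, w)] : P(w | prev) from a word-bigram model
        """
        # best[w] = (log score of the best path ending in w, that path)
        best = {start: (0.0, [])}
        for o in ocr_words:
            new_best = {}
            for w in candidates[o]:
                emit = math.log(channel.get((o, w), FLOOR))
                options = [
                    (score + math.log(bigram.get((prev, w), FLOOR)) + emit, path + [w])
                    for prev, (score, path) in best.items()
                ]
                new_best[w] = max(options, key=lambda item: item[0])
            best = new_best
        return max(best.values(), key=lambda item: item[0])[1]


    # Toy inputs, invented for this sketch.
    ocr = ["john", "fornd", "he", "man"]
    candidates = {
        "john": ["john"],
        "fornd": ["found", "fond"],
        "he": ["he", "the"],
        "man": ["man"],
    }
    channel = {
        ("john", "john"): 0.99,
        ("fornd", "found"): 0.05, ("fornd", "fond"): 0.01,
        ("he", "he"): 0.9, ("he", "the"): 0.02,
        ("man", "man"): 0.99,
    }
    bigram = {
        ("<s>", "john"): 0.1,
        ("john", "found"): 0.05, ("john", "fond"): 0.001,
        ("found", "the"): 0.3, ("found", "he"): 0.01,
        ("fond", "the"): 0.01, ("fond", "he"): 0.01,
        ("the", "man"): 0.1, ("he", "man"): 0.0001,
    }

    corrected = best_correction(ocr, candidates, channel, bigram)
    print(corrected)
    ```

    With these toy numbers the search keeps "the" over "he" even though the
    OCR string "he" is itself a valid word, because the bigram context
    outweighs the channel score. This is the sense in which a real-word error
    needs context to be corrected.
    -->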
  </Section>
</Paper>