<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2138">
  <Title>Combining Trigram and Winnow in Thai OCR Error Correction</Title>
  <Section position="2" start_page="0" end_page="836" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Optical character recognition (OCR) is useful in a wide range of applications, such as office automation and information retrieval systems.</Paragraph>
    <Paragraph position="1"> However, OCR is still not widely used in Thailand, partly because the accuracy of existing Thai OCR systems is not yet satisfactory. Recently, several research projects have focused on spelling correction for many types of errors, including those from OCR (Kukich, 1992). Nevertheless, the strategy differs slightly from language to language, since each language has its own characteristics.</Paragraph>
    <Paragraph position="2"> Two characteristics of Thai make error correction different from that in other languages: (1) there are no explicit word boundaries, and (2) characters are written on three levels, i.e., the middle, the upper, and the lower levels. To solve the problem of OCR error correction, the first task is usually to detect error strings in the input sentence. For languages with explicit word boundaries, such as English, in which words are separated by white space, this task is comparatively simple: if a tokenized string is not found in the dictionary, it could be an error string or an unknown word.</Paragraph>
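The dictionary lookup described above for languages with explicit word boundaries can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name find_suspect_tokens, the toy dictionary, and the punctuation handling are all assumptions made for the example.

```python
def find_suspect_tokens(sentence, dictionary):
    """Return whitespace-delimited tokens that are missing from the dictionary.

    A missing token is only a suspect: it may be an OCR error string
    or simply a word the dictionary does not know.
    """
    suspects = []
    for token in sentence.split():
        # Hypothetical normalization: drop surrounding punctuation, lowercase.
        word = token.strip(".,;:!?\"'()").lower()
        if word and word not in dictionary:
            suspects.append(word)
    return suspects


# Toy dictionary (assumption). "rnat" imitates an OCR misreading of "mat".
toy_dictionary = {"the", "cat", "sat", "on", "mat"}
print(find_suspect_tokens("The cat sat on the rnat.", toy_dictionary))
# prints ['rnat']
```

The sketch depends entirely on str.split finding the token boundaries, which is exactly what is unavailable for languages written without spaces between words.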
    <Paragraph position="3"> However, for languages that have no explicit word boundaries, such as Chinese, Japanese, and Thai, this task is much more complicated.</Paragraph>
    <Paragraph position="4"> Even without OCR errors, it is difficult to determine word boundaries in these languages.</Paragraph>
    <Paragraph position="5"> The situation gets worse when noise is introduced into the text. The existing approach to correcting spelling errors in languages without word boundaries assumes that all substrings in the input sentence are error strings, and then tries to correct them (Nagata, 1996).</Paragraph>
    <Paragraph position="6"> This is computationally expensive, since a large portion of the input sentence is correct. The other characteristic of the Thai writing system is that characters are placed on several levels, and several characters can occupy more than one level. These characters are easily connected to other characters in the upper or lower level. Such connected characters cause difficulties in character segmentation, which in turn cause errors in Thai OCR.</Paragraph>
    <Paragraph position="7"> Apart from the above problems specific to Thai, real-word errors, in which the erroneous string is itself a valid word, are another source of errors that is difficult to correct. Several previous works on spelling correction have demonstrated that feature-based approaches are very effective for solving this problem.</Paragraph>
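To make the feature-based idea concrete, the sketch below trains a textbook Winnow classifier (multiplicative weight updates over binary features) to choose between two confusable English words from their context words. It is an illustration only: the class name, the threshold and update factors, the confusion pair "piece" versus "peace", and the training examples are all assumptions, and the configuration used in this paper may differ.

```python
class Winnow:
    """Textbook Winnow over sparse binary features (sets of feature names)."""

    def __init__(self, threshold=3.5, promotion=2.0, demotion=0.5):
        self.threshold = threshold
        self.promotion = promotion  # factor applied after a missed positive
        self.demotion = demotion    # factor applied after a false positive
        self.weights = {}           # feature name -> weight, 1.0 if unseen

    def score(self, active):
        return sum(self.weights.get(f, 1.0) for f in active)

    def predict(self, active):
        return self.score(active) > self.threshold

    def update(self, active, label):
        # Mistake-driven: weights change only when the prediction is wrong.
        if self.predict(active) == label:
            return
        factor = self.promotion if label else self.demotion
        for f in active:
            self.weights[f] = self.weights.get(f, 1.0) * factor


# Hypothetical training data: context words around the target position.
# Label True means the intended word is "piece", False means "peace".
examples = [
    ({"a", "of", "cake"}, True),
    ({"world", "and", "war"}, False),
    ({"a", "of", "paper"}, True),
    ({"inner", "of", "mind"}, False),
]

model = Winnow()
for _ in range(5):
    for features, label in examples:
        model.update(features, label)

print(model.predict({"a", "slice", "of"}))     # prints True
print(model.predict({"inner", "of", "calm"}))  # prints False
```

Because the word "of" occurs with both labels, it is promoted and later demoted back to a neutral weight, while the discriminating context words keep their raised or lowered weights.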
    <Paragraph position="8"> In this paper, a hybrid method for Thai OCR error correction is proposed. The method combines a part-of-speech (POS) trigram model with a feature-based model. First, the POS trigram model is employed to correct non-word as well as real-word errors. In this step, the number of non-word errors is greatly reduced, but some real-word errors remain because the POS trigram model cannot capture some features that are useful for discriminating among candidate words. A feature-based approach using the Winnow algorithm is then applied to correct the remaining errors. To overcome the high computational cost of the existing approach, we propose reducing the scope of correction by using a word segmentation algorithm to find approximate error strings in the input sentence. Although a word segmentation algorithm cannot give the exact boundaries of an error string, it can often give clues to unknown strings, which may be error strings.</Paragraph>
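One way to picture how segmentation exposes unknown strings is the sketch below, which runs a greedy longest-match segmenter over unspaced text and reports the character spans that no dictionary word covers. The segmenter, the function name suspect_spans, the max_word_len parameter, and the toy Latin-script input are assumptions chosen for illustration; the segmentation algorithm actually used in this paper is not specified in this section.

```python
def suspect_spans(text, dictionary, max_word_len=6):
    """Greedy longest-match segmentation over unspaced text.

    Returns (start, end) character spans that no dictionary word covers.
    These spans only approximate the true error strings.
    """
    spans = []
    unknown_start = None
    i = 0
    while len(text) > i:
        match_len = 0
        # Try the longest dictionary word starting at position i first.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                match_len = length
                break
        if match_len:
            if unknown_start is not None:
                spans.append((unknown_start, i))  # close the unknown run
                unknown_start = None
            i += match_len
        else:
            if unknown_start is None:
                unknown_start = i  # open a new unknown run
            i += 1
    if unknown_start is not None:
        spans.append((unknown_start, len(text)))
    return spans


# Toy unspaced input (assumption): "rnat" imitates an OCR error for "mat".
words = {"the", "cat", "sat", "on", "mat"}
text = "thecatsatonthernat"
print([(s, e, text[s:e]) for s, e in suspect_spans(text, words)])
# prints [(14, 18, 'rnat')]
```

Only the flagged spans need to be passed to the correction step. With a realistic dictionary, short words can match inside an error string, so the reported boundaries are approximate, which is consistent with the caveat in the text.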
    <Paragraph position="9"> We can use this information to narrow the scope of correction from the entire sentence to a smaller region. Next, to capture the characteristics of Thai OCR errors, we define a modified edit distance and use it to enumerate plausible candidates that lie within an edit distance of k from the word in question.</Paragraph>
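The candidate enumeration step can be sketched as below. Since this section does not define the modified edit distance, the sketch substitutes a simple assumption: a standard Levenshtein distance in which substituting a pair of visually confusable characters costs less than an ordinary substitution. The function names, the confusable pairs, the cost of 0.5, and the Latin-script toy data are all hypothetical.

```python
def weighted_edit_distance(source, target, confusable, confusable_cost=0.5):
    """Levenshtein distance with cheaper substitution for confusable pairs.

    confusable holds (ocr_char, intended_char) pairs; this weighting is an
    assumption, not the paper's definition of the modified edit distance.
    """
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [float(i)]
        for j, t in enumerate(target, start=1):
            if s == t:
                sub = 0.0
            elif (s, t) in confusable:
                sub = confusable_cost
            else:
                sub = 1.0
            current.append(min(
                previous[j] + 1.0,      # delete s from the source
                current[j - 1] + 1.0,   # insert t into the source
                previous[j - 1] + sub,  # substitute s with t (or match)
            ))
        previous = current
    return previous[-1]


def candidates_within(word, dictionary, k, confusable):
    """Return sorted (distance, candidate) pairs with distance at most k."""
    scored = [(weighted_edit_distance(word, w, confusable), w) for w in dictionary]
    return sorted((d, w) for d, w in scored if k >= d)


# Toy data (assumption): the OCR output confuses the digit 1 with the letter l.
confusable_pairs = {("1", "l"), ("0", "o")}
lexicon = ["hello", "help", "halo", "jello"]
print(candidates_within("he11o", lexicon, 1, confusable_pairs))
# prints [(1.0, 'hello')]
```

This version scans the whole dictionary for each word in question; it is meant to show the distance and the k cutoff, not an efficient search.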
  </Section>
</Paper>