<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2138">
  <Title>Combining Trigram and Winnow in Thai OCR Error Correction</Title>
  <Section position="3" start_page="836" end_page="837" type="metho">
    <SectionTitle>
2 Problems of Thai OCR
</SectionTitle>
    <Paragraph position="0"> The problem of OCR error correction can be defined as : given the string of characters S = clc2...cn produced by OCR, find the word sequence W -- wlw2.., w~ that maximizes the probability P(WIS ). Before describing the methods used to model P(WIS), below we list some main characteristics of Thai that poses difficulties for correcting Thai OCR error.</Paragraph>
    <Paragraph position="1"> * Words are written consecutively without word boundary delimiters such as white space characters. For example, the phrase &amp;quot;r~u~u~lJU&amp;quot; (Japan at present) in Figure 1, actually consists of three words: &amp;quot;~du&amp;quot; (Japan), '%&amp;quot; (at), and &amp;quot;~u&amp;quot; (present). Therefore, Thai OCR error correction has to overcome word boundary ambiguity as well as select the most probable correction candidate at the same time. This is similar to the problem of Connected Speech Recognition and is sometimes called Connected Text Recognition (Ingels, 1996).</Paragraph>
    <Paragraph position="2"> * There are 3 levels for placing Thai characters and some characters can occupy more than one level. For example, in Figure 2 &amp;quot;~&amp;quot; consists of characters in three levels, q i.e., ~, ,, ~ and ~ are in the top, the bottom, the middle and both the middle and top levels, respectively. The character that occupies more than one level like ~ usually connects to other characters (~) and causes error on the output of OCR, i.e., ~ may be recognized as ~ or \]. Therefore, to correct characters produced by OCR, not only substitution errors but also deletion and insertion errors must be considered. In addition, in such a case, the candidates ranked by OCR output are unreliable and cannot be used to reduce search space. This is because the connected characters tend to have very different features from the original separated ones.</Paragraph>
  </Section>
  <Section position="4" start_page="837" end_page="838" type="metho">
    <SectionTitle>
3 Our Methods
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="837" end_page="837" type="sub_section">
      <SectionTitle>
3.1 Trigram Model
</SectionTitle>
      <Paragraph position="0"> To find W that maximizes P(WIS), we can use the POS trigram model as follows.</Paragraph>
      <Paragraph position="2"> The probability P(W) is given by the language model and can be estimated by the tri-gram model as:</Paragraph>
      <Paragraph position="4"> OCR, and can be estimated by collecting statistical information from original text and the text produced by OCR. We assume that given the original word sequence W composed of characters vlv2... Vm, OCR produces the sequence as string S (= ctc2.., an) by repeatedly applying the following operation: substitute a character with another; insert a character; or delete a character. Let Si be the /-prefix of S that is formed by first character to the/-character of S (= clc2...ci), and similarly Wj is the jprefix of W (= vlv2.., vj). Using dynamic programming technique, we can calculate P(SIW )</Paragraph>
      <Paragraph position="6"> where P(ins(c)), P(del(v)) and P(clv ) are the probabilities that letter c is inserted, letter v is deleted and letter v is substituted with c, respectively. null One method to do OCR error correction using the above model is to hypothesize all sub-strings in the input sentence as words (Nagata, 1996). Both words in the dictionary that exactly match with the substrings and those that approximately match are retrieved. To cope with unknown words, all other substrings not matched must also be considered. The word lattice is then scanned to find the N-best word sequences as correction candidates. In general, this method is perfectly good, except in one aspect: its time complexity. Because it generates a large number of hypothesized words and has to find the best combination among them, it is very slow.</Paragraph>
    </Section>
    <Section position="2" start_page="837" end_page="838" type="sub_section">
      <SectionTitle>
3.2 Selective Trigram Model
</SectionTitle>
      <Paragraph position="0"> To alleviate the above problem, we try to reduce the number of hypothesized words by generating them only when needed. Having analyzed the OCR output, we found that a large portion of input sentence are correctly recognized and need no approximation. Therefore, instead of hypothesizing blindly through the whole sentence, if we limit our hypotheses to only dubious areas, we can save considerable amount of time.</Paragraph>
      <Paragraph position="1"> Following is our algorithm for correcting OCR</Paragraph>
      <Paragraph position="3"> Find dubious areas: Find all substrings in the input sentence that exactly match words in the dictionary. Each substring may overlap with others. The remaining parts of sentence which are not covered by any of these substrings are considered as dubious areas.</Paragraph>
      <Paragraph position="4"> Make hypotheses for nonwords and unknown words: (a) For each dubious string obtained from 1., the surrounding words are also considered to form candidates for correction by concatenating them with the dubious string. For example, in &amp;quot;inform at j off', j is an unknown string representing a dubious area, and inform at and on are words. In this  case, the unknown word and its surrounding known words are combined together, resulting in &amp;quot;in/ormatjon&amp;quot; as a new unknown string.</Paragraph>
      <Paragraph position="5"> (b) For each unknown string obtained form 2(a), apply the candidate generation routine to generate approximately matched words within k-edit distance.</Paragraph>
      <Paragraph position="6"> The value of k is varied proportionally to the length of candidate word.</Paragraph>
      <Paragraph position="7">  (c) All substrings except for ones that violate Thai spelling rules, i.e., lead by non-leading character, are hypothesized as unknown words.</Paragraph>
      <Paragraph position="8"> 3. Find good word sequences: Find the N-best word sequences according  to equation (2). For unknown words, P(wilUnknown word) is computed by using the unknown word model in (Nagata, 1996).</Paragraph>
      <Paragraph position="9"> 4. Make hypotheses for real-word error: For each word wi in N-best word sequence where the local probabilities P(wi-1, wi, wi+l, ti-1, ti, ti+l) are below a threshold, generate candidate words by applying the process similar to step 2 except that the nonword in step 2 is replaced with the word wi. Find the word sequences whose probabilities computed by equation (2) are better than original ones.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="838" end_page="840" type="metho">
    <SectionTitle>
5. Find the N-best word sequences:
</SectionTitle>
    <Paragraph position="0"> From all word sequences obtained from step 4, select the N-best ones.</Paragraph>
    <Paragraph position="1"> The candidate generation routine uses a modification of the standard edit distance and employs the error-tolerant finite-state recognition algorithm (Oflazer, 1996) to generate candidate words. The modified edit distance allows arbitrary number of insertion and/or deletion of upper level and lower level characters, but allows no insertion or deletion of the middle level characters. In the middle level, it allows only k substitution. This is to reflect the characteristic of Thai OCR which, 1. tends to merge several characters into one when the character which spans two levels are adjacent to characters in the upper and lower level, and 2. rarely causes insertion and deletion errors in the middle level. For example, applying the candidate generation routine with 1 edit distance to the string &amp;quot;~&amp;quot; gives the set of candidates {~. ~, ~. ~, ~,~, ~, From our experiments, we found that the selective trigram model can deal with nonword errors fairly well. However, the model is not enough to correct real-word errors as well as words with the same part of speech. This is because the POS trigram model considers only coarse information of POS in a fixed restricted range of context, some useful information such as specific word collocation may be lost. Using word N-gram could recover some word-level information but requires an extremely large corpus to estimate all parameters accurately and consumes vast space resources to store the huge word N-gram table. In addition, the model losses generalized information at the level of POS.</Paragraph>
    <Paragraph position="2"> For English, a number of methods have been proposed to cope with real-word errors in spelling correction (Golding, 1995; Golding and Roth, 1996; Golding and Schabes, 1993; Tong and Evans, 1996). Among them, the feature-based methods were shown to be superior to other approaches. This is because the methods can combine several kinds of features to determine the appropriate word in a given context. For our task, we adopt a feature-based algorithm called Winnow. There are two reasons why we select Winnow. First, it has been shown to be the best performer in English context-sensitive spelling correction (Golding and Roth, 1996). Second, it was shown to be able to handle difficult disambiguation tasks in Thai (Meknavin et al.~ 1997).</Paragraph>
    <Paragraph position="3"> Below we describe Winnow algorithm that is used for correcting real-word error.</Paragraph>
    <Section position="1" start_page="838" end_page="839" type="sub_section">
      <SectionTitle>
3.3 Winnow Algorithm
</SectionTitle>
      <Paragraph position="0"> A Winnow algorithm used in our experiment is the algorithm described in (Blum, 1997). Winnow is a multiplicative weight updating and incremental algorithm (Littlestone, 1988; Golding and Roth, 1996). The algorithm is originally designed for learning two-class (positive and negative class) problems, and can be extended to multiple-class problems as shown in Figure 3.</Paragraph>
      <Paragraph position="1"> Winnow can be viewed as a network of one target node connected to n nodes, called specialists, each of which examines one feature and  Let Vh..., vm be the values of the target concept to be learned, and xi be the prediction of the /-specialist.</Paragraph>
      <Paragraph position="2">  1. Initialize the weights wx,..., Wn of all the specialists to 1. 2. For Each example x = {xl,..., Xn} Do (a) Let V be the value of the target concept of the example. (b) Output ~)j = arg maxvie{vl,...,v,,,} ~'~i:xi=v i Wi (c) If the algorithm makes a mistake (~)j ~ V), then:  i. for each xi equal to V, wi is updated to wi * o~ ii. for each xi equal to C/~j, wi is updated to wi * where, c~ &gt; 1 and/3 &lt; 1 are promotion parameter and demotion parameter, and are set to 3/2 and 1/2, respectively.</Paragraph>
      <Paragraph position="3">  predicts xi as the value of the target concept. The basic idea of the algorithm is that to extract some useful unknown features, the algorithm asks for opinions from all specialists, each of whom has his own specialty on one feature, and then makes a global prediction based on a weighted majority vote over all those opinions as described in Step 2-(a) of Figure 3. In our experiment, we have each specialist examine one or two attributes of an example. For example, a specialist may predict the value of the target concept by checking for the pairs &amp;quot;(attributel ---- valuel) and (attribute2 = value2)&amp;quot;. These pairs are candidates of features we are trying to extract.</Paragraph>
      <Paragraph position="4"> A specialist only makes a prediction if its condition &amp;quot;(attributel = valuel)&amp;quot; is true in case of one attribute, or both of its conditions &amp;quot;(attributel -- value1) and (attibute2 -- value2)&amp;quot; are true in case of two attributes, and in that case it predicts the most popular outcome out of the last k times it had the chance to predict. A specialist may choose to abstain instead of giving a prediction on any given example in case that it did not see the same value of an attribute in the example. In fact, we may have each specialist examines more than two attributes, but for the sake of simplification of preliminary experiment, let us assume that two attributes for each specialist are enough to learn the target concept.</Paragraph>
      <Paragraph position="5"> The global algorithm updates the weight wi of any specialist based on the vote of that specialist. The weight of any specialist is initialized to 1. In case that the global algorithm predicts incorrectly, the weight of the specialist that predicts incorrectly is halved and the weight of the specialist that predicts correctly is multiplied by 3/2. This weight updating method is the same as the one used in (Blum, 1997). The advantage of Winnow, which made us decide to use for our task, is that it is not sensitive to extra irrelevant features (Littlestone, 1988).</Paragraph>
    </Section>
    <Section position="2" start_page="839" end_page="840" type="sub_section">
      <SectionTitle>
3.3.2 Constructing Confusion Set and
Defining Features
</SectionTitle>
      <Paragraph position="0"> To employ Winnow in correcting OCR errors, we first define k-edit distance confusion set. A k-edit distance confusion set S = {c, wl, w2,..., Wn} is composed of one centroid word c and words wl, w2,..., Wn generated by applying the candidate generation routine with maximum k modified edit distance to the centroid word. If a word c is produced by OCR output or by the previous step, then it may be corrected as wl,w2,...,Wn or c itself. For example, suppose that the centroid word is know, then all possible words in 1-edit distance confusion set are {know, knob, knop, knot, knew, enow, snow, known, now}. Furthermore, words with probability lower than a threshold are excluded from the set. For example, if a specific OCR has low probability of substituting t with w, &amp;quot;knof' should be excluded from the set.</Paragraph>
      <Paragraph position="1"> Following previous works (Golding, 1995; Meknavin et al., 1997), we have tried two types of features: context words and collocations.</Paragraph>
      <Paragraph position="2"> Context-word features is used to test for the  presence of a particular word within //- M words of the target word, and collocations test for a pattern of up to L contiguous words and/or part-of-speech tags around the target word. In our experiment M and L is set to 10 and 2, respectively. Examples of features for discriminating between snow and know include: (1) I {know, snow} (2) winter within /10 words where (1) is a collocation that tends to imply know, and (2) is a context-word that tends to imply snow. Then the algorithm should extract the features (&amp;quot;word within /10 words of the target word&amp;quot; = &amp;quot;winter&amp;quot;) as well as (&amp;quot;one word before the target word&amp;quot; -- 'T') as useful features by assigning them with high weights.</Paragraph>
    </Section>
    <Section position="3" start_page="840" end_page="840" type="sub_section">
      <SectionTitle>
3.3.3 Using the Network to Rank
Sentences
</SectionTitle>
      <Paragraph position="0"> After networks of k-edit distance confusion sets are learned by Winnow, the networks are used to correct the N-best sentences received from POS trigram model. For each sentence, every real word is evaluated by the network whose the centroid word is that real word. The network will then output the centroid word or any word in the confusion set according to the context.</Paragraph>
      <Paragraph position="1"> After the most probable word is determined, the confidence level of that word will be calculated.</Paragraph>
      <Paragraph position="2"> Since every specialist has weight voting for the target word, we can consider the weight as confidence level of that specialist for the word. We define the confidence level of any word as all weights that vote for that word divided by all weights in the network. Based on the confidence levels of all words in the sentence, the average of them is taken as the confidence level of the sentence. The N-best sentences are then re-ranked according to the confidence level of the sentences.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>