Combining Trigram and Winnow in Thai OCR Error Correction 
Surapant Meknavin 
National Electronics and Computer Technology Center 73/1 
Rama VI Road, Rajthevi, Bangkok, Thailand 
surapan@nectec.or.th 
Boonserm Kijsirikul, Ananlada Chotimongkol and Cholwich Nuttee 
Department of Computer Engineering 
Chulalongkorn University, Thailand 
fengbks@chulkn.chula.ac.th 
Abstract 
For languages that have no explicit word boundaries, such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of the additional ambiguity in locating error words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an error word and trying to correct all of them. In this paper, we propose reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. The boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate candidate correction words, we use a modified edit distance that reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.
1 Introduction 
Optical character recognition (OCR) is useful 
in a wide range of applications, such as office 
automation and information retrieval systems.
However, OCR in Thailand is still not widely 
used, partly because existing Thai OCRs are 
not quite satisfactory in terms of accuracy. Re- 
cently, several research projects have focused on 
spelling correction for many types of errors in- 
cluding those from OCR (Kukich, 1992). Nev- 
ertheless, the strategy is slightly different from 
language to language, since the characteristics of
each language differ.
Two characteristics of Thai make the
task of error correction different from that in
other languages: (1) there is no explicit
word boundary, and (2) characters are written 
in three levels; i.e., the middle, the upper and 
the lower levels. In order to solve the prob- 
lem of OCR error correction, the first task is 
usually to detect error strings in the input sen- 
tence. For languages that have explicit word 
boundary such as English in which each word 
is separated from the others by white spaces, 
this task is comparatively simple. If the tok- 
enized string is not found in the dictionary, it 
could be an error string or an unknown word. 
However, for the languages that have no ex- 
plicit word boundary such as Chinese, Japanese 
and Thai, this task is much more complicated. 
Even without errors from OCR, it is difficult to 
determine word boundary in these languages. 
The situation gets worse when noise is introduced in the text. The existing approach to correcting spelling errors in languages that have no word boundaries assumes that all substrings in the input sentence are error strings, and then tries to correct them (Nagata, 1996). This is computationally expensive, since a large portion of the input sentence is correct. The other characteristic of the Thai writing system is that characters are placed on several levels, and several characters can occupy more than one level. These characters are easily connected to other characters in the upper or lower level. Such connected characters cause difficulties in character segmentation, which in turn cause errors in Thai OCR.
Other than the above problems specific to Thai, real-word errors are another source of errors that is difficult to correct. Several previous works on spelling correction demonstrated that feature-based approaches are very effective for solving this problem.
Figure 1: No explicit word delimiter in Thai
In this paper, a hybrid method for Thai OCR error correction is proposed. The method combines the part-of-speech (POS) trigram model with a feature-based model. First, the POS trigram model is employed to correct non-word as well as real-word errors. In this step, most non-word errors are corrected, but some real-word errors remain because the POS trigram model cannot capture some features that are useful for discriminating among candidate words. A feature-based approach using the Winnow algorithm is then applied to correct the remaining errors. To overcome the expensive computation cost of the existing approach, we propose reducing the scope of correction by using a word segmentation algorithm to find the approximate error strings in the input sentence. Though the word segmentation algorithm cannot give the accurate boundary of an error string, in many cases it gives clues to unknown strings, which may be error strings. We can use this information to reduce the scope of correction from the entire sentence to a narrower one. Next, to capture the characteristics of Thai OCR errors, we define a modified edit distance and use it to enumerate plausible candidates that deviate from the word in question by at most k edits.
2 Problems of Thai OCR 
The problem of OCR error correction can be defined as follows: given the string of characters $S = c_1 c_2 \ldots c_n$ produced by OCR, find the word sequence $W = w_1 w_2 \ldots w_l$ that maximizes the probability $P(W|S)$. Before describing the methods used to model $P(W|S)$, we list below some main characteristics of Thai that pose difficulties for correcting Thai OCR errors.
• Words are written consecutively without word boundary delimiters such as white space characters. For example, the phrase "ญี่ปุ่นณปัจจุบัน" (Japan at present) in Figure 1 actually consists of three words: "ญี่ปุ่น" (Japan), "ณ" (at), and "ปัจจุบัน" (present). Therefore, Thai OCR error correction has to resolve word boundary ambiguity and select the most probable correction candidate at the same time. This is similar to the problem of Connected Speech Recognition and is sometimes called Connected Text Recognition (Ingels, 1996).
• There are 3 levels for placing Thai charac- 
ters and some characters can occupy more 
than one level. For example, in Figure 2 
"~" consists of characters in three levels, q 
i.e., ~, ,, ~ and ~ are in the top, the bot- 
tom, the middle and both the middle and 
top levels, respectively. The character that 
occupies more than one level like ~ usually 
connects to other characters (~) and causes 
error on the output of OCR, i.e., ~ may 
be recognized as ~ or \]. Therefore, to cor- 
rect characters produced by OCR, not only 
substitution errors but also deletion and in- 
sertion errors must be considered. In addi- 
tion, in such a case, the candidates ranked 
by OCR output are unreliable and cannot 
be used to reduce search space. This is 
because the connected characters tend to 
have very different features from the origi- 
nal separated ones. 
Figure 2: Three levels for placing Thai characters
3 Our Methods 
3.1 Trigram Model 
To find $W$ that maximizes $P(W|S)$, we can use the POS trigram model as follows.
\begin{align}
\arg\max_W P(W|S) &= \arg\max_W P(W)\,P(S|W)/P(S) \tag{1} \\
&= \arg\max_W P(W)\,P(S|W) \tag{2}
\end{align}
The probability $P(W)$ is given by the language model and can be estimated by the trigram model as:
\begin{equation}
P(W) = P(W, T) = \prod_i P(t_i \mid t_{i-2}, t_{i-1})\,P(w_i \mid t_i) \tag{3}
\end{equation}
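As an illustration of equation (3), the following minimal sketch scores one tagged word sequence in log space. The function name `trigram_log_prob`, the dictionary-based probability tables, and the `<s>` padding tag used for the first two positions are assumptions made for this example, not part of the original model description.

```python
import math

def trigram_log_prob(words, tags, p_tag, p_word, start="<s>"):
    """Return log P(W, T) as in equation (3).

    p_tag maps (t_{i-2}, t_{i-1}, t_i) to P(t_i | t_{i-2}, t_{i-1}).
    p_word maps (w_i, t_i) to P(w_i | t_i).
    The sequence is padded with two hypothetical start tags.
    """
    prev2, prev1 = start, start
    total = 0.0
    for word, tag in zip(words, tags):
        total += math.log(p_tag[(prev2, prev1, tag)])
        total += math.log(p_word[(word, tag)])
        prev2, prev1 = prev1, tag
    return total

# Toy probability tables (made-up values for illustration only).
p_tag = {("<s>", "<s>", "PRON"): 0.5, ("<s>", "PRON", "VERB"): 0.4}
p_word = {("I", "PRON"): 0.2, ("know", "VERB"): 0.1}
score = trigram_log_prob(["I", "know"], ["PRON", "VERB"], p_tag, p_word)
```

Working in log space avoids numerical underflow when the product in equation (3) runs over long sentences.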
$P(S|W)$ characterizes the specific OCR and can be estimated by collecting statistics from the original text and the text produced by OCR. We assume that, given the original word sequence $W$ composed of characters $v_1 v_2 \ldots v_m$, OCR produces the string $S$ ($= c_1 c_2 \ldots c_n$) by repeatedly applying the following operations: substitute a character with another, insert a character, or delete a character. Let $S_i$ be the $i$-prefix of $S$, formed by the first through the $i$-th characters of $S$ ($= c_1 c_2 \ldots c_i$), and similarly let $W_j$ be the $j$-prefix of $W$ ($= v_1 v_2 \ldots v_j$). Using dynamic programming, we can calculate $P(S|W)$ ($= P(S_n|W_m)$) by the following equation:
\begin{equation}
P(S_i|W_j) = \max\bigl( P(S_{i-1}|W_j)\,P(\mathrm{ins}(c_i)),\; P(S_i|W_{j-1})\,P(\mathrm{del}(v_j)),\; P(S_{i-1}|W_{j-1})\,P(c_i|v_j) \bigr) \tag{4}
\end{equation}
where $P(\mathrm{ins}(c))$, $P(\mathrm{del}(v))$ and $P(c|v)$ are the probabilities that letter $c$ is inserted, letter $v$ is deleted and letter $v$ is substituted with $c$, respectively.
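The recurrence in equation (4) can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions: the base case $P(S_0|W_0) = 1$, the function name `channel_prob`, and the toy probability functions are not given in the text, and `p_sub(c, v)` is assumed to also cover correct recognition (c equal to v).

```python
def channel_prob(s, w, p_ins, p_del, p_sub):
    """Compute P(S|W) with the max-product recurrence of equation (4).

    s: string produced by OCR; w: original string.
    p_ins(c), p_del(v), p_sub(c, v): per-character probabilities.
    """
    n, m = len(s), len(w)
    # table[i][j] holds P(S_i | W_j); the empty prefixes match with probability 1.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i > 0:  # c_i was inserted by OCR
                best = max(best, table[i - 1][j] * p_ins(s[i - 1]))
            if j > 0:  # v_j was deleted by OCR
                best = max(best, table[i][j - 1] * p_del(w[j - 1]))
            if i > 0 and j > 0:  # v_j was read as c_i
                best = max(best, table[i - 1][j - 1] * p_sub(s[i - 1], w[j - 1]))
            table[i][j] = best
    return table[n][m]

# Toy channel model (made-up values for illustration only).
def toy_ins(c):
    return 0.01

def toy_del(v):
    return 0.01

def toy_sub(c, v):
    return 0.9 if c == v else 0.05

exact = channel_prob("abc", "abc", toy_ins, toy_del, toy_sub)
one_sub = channel_prob("abd", "abc", toy_ins, toy_del, toy_sub)
one_del = channel_prob("ab", "abc", toy_ins, toy_del, toy_sub)
```

The table has (n+1)(m+1) cells, so the cost is proportional to the product of the two string lengths.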
One method to do OCR error correction us- 
ing the above model is to hypothesize all sub- 
strings in the input sentence as words (Nagata, 
1996). Both words in the dictionary that ex- 
actly match with the substrings and those that 
approximately match are retrieved. To cope 
with unknown words, all other substrings not 
matched must also be considered. The word 
lattice is then scanned to find the N-best word 
sequences as correction candidates. In general, 
this method is perfectly good, except in one as- 
pect: its time complexity. Because it generates 
a large number of hypothesized words and has 
to find the best combination among them, it is 
very slow. 
3.2 Selective Trigram Model 
To alleviate the above problem, we try to reduce the number of hypothesized words by generating them only when needed. Having analyzed the OCR output, we found that a large portion of the input sentence is correctly recognized and needs no approximate matching. Therefore, instead of hypothesizing blindly through the whole sentence, we limit our hypotheses to dubious areas only, which saves a considerable amount of time. Our algorithm for correcting OCR output is as follows.
1. Find dubious areas: Find all substrings in the input sentence that exactly match words in the dictionary. Each substring may overlap with others. The remaining parts of the sentence, which are not covered by any of these substrings, are considered dubious areas.
2. Make hypotheses for nonwords and unknown words:
(a) For each dubious string obtained from step 1, the surrounding words are also considered to form candidates for correction by concatenating them with the dubious string. For example, in "inform at j on", j is an unknown string representing a dubious area, and inform, at and on are words. In this case, the unknown string and its surrounding known words are combined, resulting in "informatjon" as a new unknown string.
(b) For each unknown string obtained from 2(a), apply the candidate generation routine to generate approximately matched words within k-edit distance. The value of k varies in proportion to the length of the candidate word.
(c) All substrings, except those that violate Thai spelling rules (i.e., those that begin with a non-leading character), are hypothesized as unknown words.
3. Find good word sequences: Find the N-best word sequences according to equation (2). For unknown words, $P(w_i \mid \text{unknown word})$ is computed using the unknown word model of (Nagata, 1996).
4. Make hypotheses for real-word errors: For each word $w_i$ in the N-best word sequences whose local probability $P(w_{i-1}, w_i, w_{i+1}, t_{i-1}, t_i, t_{i+1})$ is below a threshold, generate candidate words by applying a process similar to step 2, except that the nonword in step 2 is replaced with the word $w_i$. Find the word sequences whose probabilities computed by equation (2) are better than the original ones.
5. Find the N-best word sequences: From all word sequences obtained from step 4, select the N-best ones.
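Step 1 of the algorithm above (finding dubious areas) can be sketched as follows. This is only an illustrative sketch: the function name `find_dubious_areas`, the set-based dictionary, and the brute-force substring scan are assumptions, and a real system would more likely use a trie or finite-state lexicon.

```python
def find_dubious_areas(sentence, dictionary):
    """Return (start, end) spans not covered by any exact dictionary match."""
    covered = [False] * len(sentence)
    max_len = max(map(len, dictionary), default=0)
    # Mark every character covered by some dictionary word; matches may overlap.
    for start in range(len(sentence)):
        limit = min(len(sentence), start + max_len)
        for end in range(start + 1, limit + 1):
            if sentence[start:end] in dictionary:
                for k in range(start, end):
                    covered[k] = True
    # Collect maximal runs of uncovered characters.
    areas = []
    i = 0
    while i < len(sentence):
        if covered[i]:
            i += 1
            continue
        j = i
        while j < len(sentence) and not covered[j]:
            j += 1
        areas.append((i, j))
        i = j
    return areas

# The English example from step 2(a): only "j" is left uncovered.
areas = find_dubious_areas("informatjon", {"inform", "at", "on"})
```

Here the single dubious span is the character "j"; step 2(a) would then join it with the neighbouring words to form the unknown string "informatjon".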
The candidate generation routine uses a modification of the standard edit distance and employs the error-tolerant finite-state recognition algorithm (Oflazer, 1996) to generate candidate words. The modified edit distance allows an arbitrary number of insertions and/or deletions of upper-level and lower-level characters, but allows no insertion or deletion of middle-level characters. In the middle level, it allows at most k substitutions. This reflects two characteristics of Thai OCR: (1) it tends to merge several characters into one when a character that spans two levels is adjacent to characters in the upper or lower level, and (2) it rarely causes insertion or deletion errors in the middle level.
For example, applying the candidate generation 
routine with 1 edit distance to the string "~" 
gives the set of candidates {~. ~, ~. ~, ~,~, ~, 
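One possible reading of the modified edit distance is sketched below as a pairwise check. Several assumptions are made here: upper-level and lower-level characters are approximated by the Unicode category "Mn" (nonspacing marks, which covers the Thai vowel signs and tone marks written above or below the line), characters that span two levels are treated as middle-level, and the pairwise comparison stands in for the error-tolerant finite-state search of (Oflazer, 1996) that the text actually uses. The function names are hypothetical.

```python
import unicodedata

def is_middle_level(ch):
    """Approximate test: anything that is not a nonspacing mark is middle-level."""
    return unicodedata.category(ch) != "Mn"

def modified_edit_distance(a, b):
    """Number of middle-level substitutions between a and b, or None.

    Upper/lower-level characters may be inserted or deleted freely, so they
    are ignored. Middle-level characters may only be substituted, so the two
    middle-level skeletons must have equal length.
    """
    mid_a = [c for c in a if is_middle_level(c)]
    mid_b = [c for c in b if is_middle_level(c)]
    if len(mid_a) != len(mid_b):
        return None
    return sum(x != y for x, y in zip(mid_a, mid_b))

def within_k(a, b, k):
    """True if b is reachable from a with at most k middle-level substitutions."""
    d = modified_edit_distance(a, b)
    return d is not None and d <= k

# "น้ำ" (water) and "นำ" (to bring) differ only by an upper-level tone mark.
tone_only = modified_edit_distance("\u0e19\u0e49\u0e33", "\u0e19\u0e33")
# "กา" and "ขา" differ by one middle-level consonant.
one_sub = modified_edit_distance("\u0e01\u0e32", "\u0e02\u0e32")
```

Under this reading, words that differ only in tone marks or in vowels above or below the line are at distance 0 from each other, which is why such pairs always appear in each other's candidate sets.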
From our experiments, we found that the selective trigram model deals with nonword errors fairly well. However, the model is not sufficient for correcting real-word errors, nor for choosing among candidate words with the same part of speech. This is because the POS trigram model considers only coarse POS information within a fixed, restricted range of context, so some useful information, such as specific word collocations, may be lost. Using a word N-gram could recover some word-level information, but it requires an extremely large corpus to estimate all parameters accurately and consumes vast space to store the huge word N-gram table. In addition, such a model loses generalized information at the level of POS.
For English, a number of methods have 
been proposed to cope with real-word errors in 
spelling correction (Golding, 1995; Golding and 
Roth, 1996; Golding and Schabes, 1993; Tong 
and Evans, 1996). Among them, the feature- 
based methods were shown to be superior to 
other approaches. This is because the methods 
can combine several kinds of features to deter- 
mine the appropriate word in a given context. 
For our task, we adopt a feature-based algo- 
rithm called Winnow. There are two reasons 
why we select Winnow. First, it has been shown 
to be the best performer in English context- 
sensitive spelling correction (Golding and Roth, 
1996). Second, it was shown to be able to handle difficult disambiguation tasks in Thai (Meknavin et al., 1997).
Below we describe the Winnow algorithm that is used for correcting real-word errors.
3.3 Winnow Algorithm 
3.3.1 The algorithm 
The Winnow algorithm used in our experiment is the one described in (Blum, 1997). Winnow is an incremental algorithm with multiplicative weight updating (Littlestone, 1988; Golding and Roth, 1996). The algorithm was originally designed for learning two-class (positive and negative) problems, and can be extended to multiple-class problems as shown in Figure 3.
Winnow can be viewed as a network of one target node connected to n nodes, called specialists, each of which examines one feature and predicts $x_i$ as the value of the target concept. The basic idea is that, to extract useful but unknown features, the algorithm asks for opinions from all specialists, each of which has its own specialty on one feature, and then makes a global prediction based on a weighted majority vote over all those opinions, as described in Step 2(b) of Figure 3. In our experiment, each specialist examines one or two attributes of an example. For example, a specialist may predict the value of the target concept by checking for the pair "(attribute1 = value1) and (attribute2 = value2)". These pairs are candidates for the features we are trying to extract.
Let $v_1, \ldots, v_m$ be the values of the target concept to be learned, and $x_i$ be the prediction of the $i$-th specialist.
1. Initialize the weights $w_1, \ldots, w_n$ of all the specialists to 1.
2. For each example $x = \{x_1, \ldots, x_n\}$ do
(a) Let $V$ be the value of the target concept of the example.
(b) Output $\hat{v}_j = \arg\max_{v_j \in \{v_1, \ldots, v_m\}} \sum_{i: x_i = v_j} w_i$
(c) If the algorithm makes a mistake ($\hat{v}_j \neq V$), then:
i. for each $x_i$ equal to $V$, $w_i$ is updated to $w_i \cdot \alpha$
ii. for each $x_i$ equal to $\hat{v}_j$, $w_i$ is updated to $w_i \cdot \beta$
where $\alpha > 1$ and $\beta < 1$ are the promotion parameter and the demotion parameter, and are set to 3/2 and 1/2, respectively.
Figure 3: The Winnow algorithm for learning multiple-class concept.
A specialist makes a prediction only if its condition "(attribute1 = value1)" is true in the case of one attribute, or both of its conditions "(attribute1 = value1) and (attribute2 = value2)" are true in the case of two attributes. In that case it predicts the most popular outcome out of the last k times it had the chance to predict. A specialist abstains instead of giving a prediction on an example in which its attribute values do not appear. In fact, each specialist could examine more than two attributes, but to keep this preliminary experiment simple, we assume that two attributes per specialist are enough to learn the target concept.
The global algorithm updates the weight $w_i$ of each specialist based on the vote of that specialist. The weight of every specialist is initialized to 1. When the global algorithm predicts incorrectly, the weight of each specialist that predicted incorrectly is halved, and the weight of each specialist that predicted correctly is multiplied by 3/2. This weight updating method is the same as the one used in (Blum, 1997). The advantage of Winnow, which led us to use it for our task, is that it is not sensitive to extra irrelevant features (Littlestone, 1988).
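The update rule of Figure 3 can be sketched in a few lines. This is a minimal sketch under stated assumptions: the class name `Winnow`, the use of `None` to represent an abstaining specialist, and the tie-breaking behaviour (the first value to reach the maximum wins) are choices made for this example. Following Figure 3, only specialists that voted for the wrong global output are demoted.

```python
from collections import defaultdict

class Winnow:
    """Multi-class Winnow over a fixed set of specialists (Figure 3)."""

    def __init__(self, n_specialists, alpha=1.5, beta=0.5):
        self.weights = [1.0] * n_specialists  # step 1: all weights start at 1
        self.alpha = alpha  # promotion parameter (> 1)
        self.beta = beta    # demotion parameter (< 1)

    def predict(self, predictions):
        """Weighted majority vote; predictions[i] is x_i or None (abstain)."""
        votes = defaultdict(float)
        for weight, x in zip(self.weights, predictions):
            if x is not None:
                votes[x] += weight
        if not votes:
            return None
        return max(votes, key=votes.get)

    def update(self, predictions, target):
        """Process one training example and return the prediction made."""
        guess = self.predict(predictions)
        if guess != target:
            for i, x in enumerate(predictions):
                if x is None:
                    continue
                if x == target:
                    self.weights[i] *= self.alpha  # promote correct specialists
                elif x == guess:
                    self.weights[i] *= self.beta   # demote misleading specialists
        return guess

net = Winnow(3)
first_guess = net.update(["snow", "know", "snow"], target="know")
second_guess = net.predict(["snow", "know", "snow"])
```

In this toy run the two specialists voting for "snow" outvote the correct one at first; after a single mistake their weights are halved, the correct specialist is promoted, and the same votes now yield "know".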
3.3.2 Constructing Confusion Set and 
Defining Features 
To employ Winnow in correcting OCR errors, we first define the k-edit distance confusion set. A k-edit distance confusion set $S = \{c, w_1, w_2, \ldots, w_n\}$ is composed of one centroid word $c$ and the words $w_1, w_2, \ldots, w_n$ generated by applying the candidate generation routine with a maximum modified edit distance of k to the centroid word. If a word $c$ is produced by OCR or by the previous step, then it may be corrected to $w_1, w_2, \ldots, w_n$ or left as $c$ itself. For example, suppose that the centroid word is know; then all possible words in the 1-edit distance confusion set are {know, knob, knop, knot, knew, enow, snow, known, now}. Furthermore, words with probability lower than a threshold are excluded from the set. For example, if a specific OCR has a low probability of substituting t with w, "knot" should be excluded from the set.
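For the English example, building a confusion set can be sketched with the standard Levenshtein distance. Note that this swaps in the plain edit distance for the modified, level-aware distance used for Thai, and it omits the probability threshold; the function names and the toy dictionary are assumptions for illustration.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (insert, delete, substitute; unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def confusion_set(centroid, dictionary, k=1):
    """Centroid first, then all other dictionary words within k edits, sorted."""
    others = sorted(w for w in dictionary
                    if w != centroid and edit_distance(centroid, w) <= k)
    return [centroid] + others

# Toy dictionary (hypothetical); "winter" and "no" are more than one edit away.
toy_dictionary = {"know", "knob", "knot", "knew", "snow", "known", "now",
                  "winter", "no"}
know_set = confusion_set("know", toy_dictionary, k=1)
```

Scanning the whole dictionary is quadratic per word and only suitable for small examples; it is used here purely to make the definition concrete.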
Following previous works (Golding, 1995; Meknavin et al., 1997), we have tried two types of features: context words and collocations. Context-word features test for the presence of a particular word within ±M words of the target word, and collocations test for a pattern of up to L contiguous words and/or part-of-speech tags around the target word. In our experiment, M and L are set to 10 and 2, respectively. Examples of features for discriminating between snow and know include:
(1) I {know, snow}
(2) winter within ±10 words
where (1) is a collocation that tends to imply know, and (2) is a context word that tends to imply snow. The algorithm should then extract the features ("word within ±10 words of the target word" = "winter") and ("one word before the target word" = "I") as useful features by assigning them high weights.
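The two feature types can be sketched as follows. This is a simplified illustration: it builds collocations from words only (no part-of-speech tags) and only from sequences immediately to the left or right of the target. The function name `extract_features`, the tuple encoding, and the "_" placeholder for the target position are assumptions.

```python
def extract_features(tokens, target_index, M=10, L=2):
    """Return context-word and collocation features for one target position."""
    features = set()
    # Context words: any word within +/- M positions of the target.
    lo = max(0, target_index - M)
    hi = min(len(tokens), target_index + M + 1)
    for i in range(lo, hi):
        if i != target_index:
            features.add(("context", tokens[i]))
    # Collocations: up to L contiguous words adjacent to the target ("_").
    for length in range(1, L + 1):
        left = tokens[max(0, target_index - length):target_index]
        if len(left) == length:
            features.add(("colloc", tuple(left) + ("_",)))
        right = tokens[target_index + 1:target_index + 1 + length]
        if len(right) == length:
            features.add(("colloc", ("_",) + tuple(right)))
    return features

tokens = "I know it will snow this winter".split()
feats = extract_features(tokens, target_index=1)
```

For the target at position 1, the result contains the collocation ("I", "_") and the context word "winter", matching the two example features given in the text. Each such feature would correspond to one specialist in the Winnow network.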
3.3.3 Using the Network to Rank 
Sentences 
After the networks of k-edit distance confusion sets are learned by Winnow, they are used to correct the N-best sentences received from the POS trigram model. For each sentence, every real word is evaluated by the network whose centroid word is that real word. The network then outputs the centroid word or another word in the confusion set according to the context. After the most probable word is determined, the confidence level of that word is calculated. Since every specialist votes for a word with its weight, we can regard the weight as the confidence level of that specialist for the word. We define the confidence level of a word as the sum of the weights that vote for that word divided by the sum of all weights in the network. The average of the confidence levels of all words in the sentence is taken as the confidence level of the sentence. The N-best sentences are then re-ranked according to their confidence levels.
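The confidence computation and re-ranking described above can be sketched as follows. The function names and data layout are hypothetical, and "all weights in the network" is read here as the sum over every specialist, including abstaining ones; the text does not settle that detail.

```python
def word_confidence(weights, predictions, word):
    """Weights voting for `word` divided by all weights in the network."""
    total = sum(weights)
    if total == 0:
        return 0.0
    voting = sum(w for w, x in zip(weights, predictions) if x == word)
    return voting / total

def sentence_confidence(word_confidences):
    """Average of the word-level confidence levels (0.0 for an empty list)."""
    if not word_confidences:
        return 0.0
    return sum(word_confidences) / len(word_confidences)

def rerank(candidates):
    """Sort (sentence, word_confidences) pairs by sentence confidence, best first."""
    ranked = sorted(candidates,
                    key=lambda pair: sentence_confidence(pair[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked]

conf_know = word_confidence([1.5, 0.5, 1.0], ["know", "snow", "know"], "know")
order = rerank([("sentence a", [0.2, 0.4]), ("sentence b", [0.9, 0.7])])
```

In the toy call, two of three specialists vote for "know" with combined weight 2.5 out of 3.0, and "sentence b" is ranked first because its average word confidence is higher.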
4 Experiments 
We have prepared a corpus containing about 9,000 sentences (140,000 words, 1,300,000 characters) for evaluating our methods. The corpus is separated into two parts: the first part, containing about 80% of the whole corpus, is used as a training set for both the trigram model and Winnow, and the rest is used as a test set. Based on the prepared corpus, experiments were conducted to compare our methods. The results are shown in Table 1 and Table 2.
Type               Error
Non-word Error     18.37%
Real-word Error    3.60%
Total              21.97%
Table 1: The percentage of word error from OCR
Type               Trigram    Trigram + Winnow
Non-word Error     82.16%     90.27%
Real-word Error    75.71%     87.60%
Introduced Error   1.42%      1.56%
Table 2: The percentage of corrected word errors after applying Trigram and Winnow
Table 1 shows the percentage of word errors in the entire text. Table 2 shows the percentage of word errors corrected after applying Trigram and Winnow. The results reveal that the trigram model can correct both non-word and real-word errors, but it introduces some new errors. With the trigram model, real-word errors are more difficult to correct than non-word errors. When Winnow is combined with the trigram model, both types of errors are further reduced, and the improvement in real-word error correction is more pronounced (from 75.71% to 87.60%, versus 82.16% to 90.27% for non-word errors).
The reason for better performance of Tri- 
gram+Winnow over Trigram alone is that the 
former can exploit more useful features, i.e., 
context words and collocation features, in cor- 
rection. For example, the word "d~" (to bring) 
is frequently recognized as "~" (water) because 
the characters "~" is misreplaced with a sin- 
gle character " "~' by OCR. In this case, Tri- 
gram cannot effectively recover the real-word 
error "d~" to the correct word "~". The word 
"d~" is effectively corrected by Winnow as the 
algorithm found the context words that indicate 
the occurrence of "~" such as the words "=L~a"
(evaporate) and "~" (plant). Note that these 
context words cannot be used by Trigram to 
correct the real-word errors. 
5 Conclusion 
We have examined the application of the modi- 
fied edit distance, POS trigram model and Win- 
now algorithm to the task of Thai OCR error 
correction. The experimental result shows that 
our proposed method reduces both non-word errors
and real-word errors effectively. In future
work, we plan to test the method with much 
more data and to incorporate other sources of 
information to improve the quality of correc- 
tion. It is also interesting to examine how 
the method performs when applied to human- 
generated misspellings. 
Acknowledgement 
We would like to thank Paisarn Charoenpornsawat, who helped us run the experiments with Winnow. This work was partly supported by the Thai Government Research Fund.

References 
Avrim Blum. 1997. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26.
Andrew R. Golding and Dan Roth. 1996. Applying winnow to context-sensitive spelling correction. In Proceedings of the Thirteenth International Conference on Machine Learning.
Andrew R. Golding and Yves Schabes. 1993. Combining trigram-based and feature-based methods for context-sensitive spelling correction. Technical Report TR-93-03a, Mitsubishi Electric Research Laboratory.
Andrew R. Golding. 1995. A bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the Third Workshop on Very Large Corpora.
Peter Ingels. 1996. Connected text recognition using layered HMMs and token passing. In Proceedings of the Second Conference on New Methods in Language Processing.
Karen Kukich. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4).
Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2.
Surapant Meknavin, Paisarn Charoenpornsawat, and Boonserm Kijsirikul. 1997. Feature-based Thai word segmentation. In Proceedings of Natural Language Processing Pacific Rim Symposium '97.
Masaaki Nagata. 1996. Context-based spelling correction for Japanese OCR. In Proceedings of COLING '96.
Kemal Oflazer. 1996. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, 22(1).
Xiang Tong and David A. Evans. 1996. A statistical approach to automatic OCR error correction in context. In Proceedings of the Fourth Workshop on Very Large Corpora.
