Context-Based Spelling Correction for Japanese OCR 
Masaaki NAGATA 
NTT Information and Communication Systems Laboratories
1-2356 Take, Yokosuka-Shi, Kanagawa, 238-03 Japan
nagata@nttnly.isl.ntt.jp
Abstract 
We present a novel spelling correction
method for those languages that have
no delimiter between words, such as
Japanese, Chinese, and Thai. It con-
sists of an approximate word match-
ing method and an N-best word seg-
mentation algorithm using a statistical
language model. For OCR errors, the
proposed word-based correction method
outperforms the conventional character-
based correction method. When the
baseline character recognition accuracy
is 90%, it achieves 96.0% character
recognition accuracy and 96.3% word
segmentation accuracy, while the char-
acter recognition accuracy of character-
based correction is 95.0%.
1 Introduction 
Automatic spelling correction research dates back
to the 1960s. Today, there are some excellent
academic and commercial spell checkers available
for English (Kukich, 1992). However, for those
languages that have a different morphology and
writing system from English, spelling correction
remains one of the significant unsolved research
problems in computational linguistics.
The basic strategy for English spelling correc-
tion is simple: Word boundaries are defined by
white space characters. If the tokenized string is
not found in the dictionary, it is either a non-
word or an unknown word. For a non-word,
correction candidates are generated by approxi-
mately matching the string with the dictionary,
using context-independent word distance mea-
sures such as edit distance (Wagner and Fischer,
1974; Kernighan et al., 1990).
It is impossible to apply these "isolated word
error correction" techniques to Japanese for two
reasons: First, in noisy texts, word tokenization
is difficult because there are no delimiters be-
tween words. Second, context-independent word
distance measures are useless because the average
word length is very short (< 2), and the character
set is huge (> 3000). There are a large number of
one-edit-distance neighbors for a Japanese word.
In English spelling correction, the "word bound-
ary problem", such as splits (forgot → for got)
and run-ons (in form → inform), and the "short word
problem" (ot → on, or, of, at, it, to, etc.) are
also known to be very difficult. Context infor-
mation, such as word N-grams, is used to sup-
plement the underlying context-independent cor-
rection for these problematic examples (Gale and
Church, 1990; Mays et al., 1991). On the contrary,
Japanese spelling correction must be essentially
context-dependent, because a Japanese sentence is,
as it were, a run-on sequence of short words, pos-
sibly including some typos, something like (Ifor-
gototinformyou → I forgot to inform you).
In this paper, we present a novel approach for
spelling correction, which is suitable for those lan-
guages that have no delimiter between words, such
as Japanese. It consists of two stages: First,
all substrings in the input sentence are hypoth-
esized as words, and those words that approxi-
mately match the substrings are retrieved
from the dictionary as well as those that exactly
match. Based on the statistical language model,
the N-best word sequences are then selected as
correction candidates from all combinations of ex-
actly and approximately matched words. Fig-
ure 1 illustrates this approach. Out of the list
of character recognition candidates for the input
sentence "申し込み用紙に必要事項を記入する。",
which means "to fill out the necessary items in the
application form.", the system searches the combi-
nation of exactly matched words (solid boxes) and
approximately matched words (dashed boxes)1.
The major contribution of this paper is its
solution of the word boundary problem and the
short word problem in Japanese spelling correc-
tion. By introducing a statistical model of word
1 OCR output tends to be very noisy, especially for
handwriting. To compensate for this behavior, OCRs
usually output an ordered list of the best N characters.
The list of the candidates for an input string is called
a character matrix.
[Figure 1 appears here: the input sentence, its character matrix, and the forward search lattice over exactly matched words (solid boxes) and approximately matched words (dashed boxes).]
Figure 1: Possible Combinations of Exactly and Approximately Matched Words
length and spelling, the proposed system accu-
rately places word boundaries in noisy texts that
include non-words and unknown words. By using
the character-based context model, it accurately
selects correction candidates for short words from
the large number of approximately matched words
with the same edit distance.
The goal of our project is to implement an inter-
active word corrector for a handwriting FAX OCR
system. We are especially interested in texts that
include addresses, names, and messages, such as
order forms, questionnaires, and telegraphs.
2 Noisy Channel Model for
Character Recognition
First, we formulate the spelling correction of OCR
errors in the noisy channel paradigm. Let C rep-
resent the input string and X represent the OCR
output string. Finding the most probable string C^
given the OCR output X amounts to maximizing
the function P(X|C)P(C),

C^ = argmax_C P(C|X) = argmax_C P(X|C)P(C)    (1)

because Bayes' rule states that,

P(C|X) = P(X|C)P(C) / P(X)    (2)
P(C) is called the language model. It is com-
puted from the training corpus. Let us call
P(X|C) the OCR model. It can be computed
from the a priori likelihood estimates for individ-
ual characters,

P(X|C) = prod_{i=1..n} P(x_i|c_i)    (3)

where n is the string length. P(x_i|c_i) is called
the confusion matrix of characters. It is trained
using the input and output strings of the OCR.
The confusion matrix is highly dependent on
the character recognition algorithm and the qual-
ity of the input document. It is a labor-intensive
task to prepare a confusion matrix for each char-
acter recognition system, since Japanese has more
than 3,000 characters. Therefore, we used a sim-
ple OCR model where the confusion matrix is ap-
proximated by the correct character distribution
over the rank of the candidates. We assume that
the rank order distribution of the correct charac-
ters is a geometric distribution whose parameter
is the accuracy of the first candidate.
Let c_i be the i-th character in the input string,
x_ij be the j-th candidate for c_i, and p be the prob-
ability that the first candidate is correct. The con-
fusion probability P(x_ij|c_i) is approximated as,

P(x_ij|c_i) ≈ P(x_ij is correct) ≈ p(1-p)^(j-1)    (4)
Equation (4) aims to approximate the accuracy
of the first candidate, and the tendency that the
reliability of the candidates decreases abruptly as
their rank increases. For example, if the recognition
accuracy of the first candidate p is 0.75, we will as-
sign the probabilities of the first, second, and third
candidates to 0.75, 0.19, and 0.05, respectively,
regardless of the input and output characters.
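The rank-based approximation of Equation (4) can be sketched as follows; this is a minimal illustration, and the function name is ours rather than part of the original system:

```python
def confusion_prob(j, p):
    """Approximate P(x_ij | c_i) by a geometric distribution over the
    candidate rank j (1-based), where p is the first-candidate accuracy;
    the actual input and output characters are ignored (Equation (4))."""
    return p * (1.0 - p) ** (j - 1)

# With p = 0.75, ranks 1, 2, 3 receive 0.75, 0.1875 and 0.046875,
# i.e. roughly the 0.75, 0.19 and 0.05 quoted in the text.
probs = [confusion_prob(j, 0.75) for j in (1, 2, 3)]
```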
One of the benefits of using a simple OCR
model is that the spelling correction system be-
comes highly independent of the underlying OCR
characteristics. Obviously, a more sophisticated
OCR model would improve error correction per-
formance, but even this simple OCR model works
fairly well in our experiments2.
2 One of the practical reasons for using the geomet-
ric distribution is that we used the confusion matrix
for implementing the OCR simulator. We feel it is
unfair to use the same confusion matrix both for error
generation and error correction.
3 Word Segmentation Algorithm 
3.1 Statistical Language Model 
For the language model in Equation (1), we used
the part of speech trigram model (POS trigram or
2nd-order HMM). It is used as a tagging model in
English (Church, 1988; Cutting et al., 1992) and as a
morphological analysis model (word segmentation
and tagging) in Japanese (Nagata, 1994).
Let the input character sequence be C =
c_1 c_2 ... c_m. We approximate P(C) by P(W,T),
the joint probability of word sequence W =
w_1 w_2 ... w_n and part of speech sequence T =
t_1 t_2 ... t_n. P(W,T) is then approximated by the
product of part of speech trigram probabilities
P(t_i|t_{i-2}, t_{i-1}) and word output probabilities for a
given part of speech P(w_i|t_i),

P(C) ≈ P(W,T) ≈ prod_{i=1..n} P(t_i|t_{i-2}, t_{i-1}) P(w_i|t_i)    (5)
P(t_i|t_{i-2}, t_{i-1}) and P(w_i|t_i) are estimated by
computing the relative frequencies of the corre-
sponding events in the training corpus3.
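As an illustration, the product in Equation (5) can be computed as below; the dictionary layout and the 'BOS' padding for the two tag positions before the sentence start are our assumptions, not details given in the paper:

```python
def sentence_prob(words, tags, trigram, lexical):
    """P(W, T) = prod_i P(t_i | t_{i-2}, t_{i-1}) * P(w_i | t_i)  (Equation (5)).
    trigram maps (t_{i-2}, t_{i-1}, t_i) to a probability, and lexical
    maps (w_i, t_i) to P(w_i | t_i); unseen events get probability 0."""
    padded = ['BOS', 'BOS'] + list(tags)
    prob = 1.0
    for i, (w, t) in enumerate(zip(words, tags)):
        prob *= trigram.get((padded[i], padded[i + 1], t), 0.0)
        prob *= lexical.get((w, t), 0.0)
    return prob
```

In the full system these relative-frequency estimates come from the segmented and tagged training corpus.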
3.2 Forward-DP Backward-A* Algorithm 
Using the language model (5), Japanese morpho-
logical analysis can be defined as finding the set
of word segmentation and parts of speech (W^, T^)
that maximizes the joint probability of word se-
quence and tag sequence P(W,T).

(W^, T^) = argmax_{W,T} P(W,T)    (6)
This maximization search can be efficiently im-
plemented by using the forward-DP backward-A*
algorithm (Nagata, 1994). It is a natural exten-
sion of the Viterbi algorithm (Church, 1988; Cut-
ting et al., 1992) for those languages that do not
have delimiters between words, and it can gener-
ate N-best morphological analysis hypotheses, like
the tree-trellis search (Soong and Huang, 1991).
The algorithm consists of a forward dynamic 
programming search and a backward A* search. 
The forward search starts from the beginning of
the input sentence, and proceeds character by
character. At each point in the sentence, it looks
up the combination of the best partial parses end-
ing at the point and word hypotheses starting at
the point. If the connection between a partial
parse and a word hypothesis is allowed by the lan-
guage model, that is, the corresponding part of
speech trigram probability is positive, a new con-
tinuation parse is made and registered in the best
partial path table. For example, at point 4 in Fig-
ure 1, the final words of the partial parses ending at
4 are 申し込み ('application'), 見込み ('prospect'),
3 As a word segmentation model, the advantage of
the POS trigram model is that it can be trained using
a smaller corpus than the word bigram model.
and 込み ('inclusive'), while the word hypothe-
ses starting at 4 are 用紙 ('form'), 同 ('same'), 月
('moon'), and 円 ('circle').
In the backward A* search, we consider a par-
tial parse recorded in the best partial path table as
a state in the A* search. The backward search starts
at the end of the input sentence, and backtracks
to the beginning of the sentence. Since the prob-
abilities of the best possible remaining paths are
exactly known by the forward search, the back-
ward search is admissible.
We made two extensions to the original forward-
DP backward-A* algorithm to handle OCR out-
puts. First, it retrieves all words in the dictionary
that match the strings which consist of a combina-
tion of the characters in the matrix. Second, the
path probability is changed to the product of the
language model probability and the OCR model
probability, so as to get the most likely character
sequence, according to Equation (1).
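A stripped-down sketch of the forward-DP backward-A* search may help. Here we score words with a context-independent unigram probability instead of the POS trigram and OCR models of the full system; under that simplification the forward scores are exact upper bounds on any completion, so the backward A* search pops complete hypotheses in best-first order:

```python
import heapq

def nbest_segmentations(sent, word_prob, n):
    """Simplified forward-DP / backward-A* N-best word segmentation.
    word_prob maps a substring to an assumed unigram word probability."""
    L = len(sent)
    # Forward DP: best[i] = probability of the best partial parse of sent[:i].
    best = [0.0] * (L + 1)
    best[0] = 1.0
    for i in range(1, L + 1):
        for j in range(i):
            w = sent[j:i]
            if w in word_prob:
                best[i] = max(best[i], best[j] * word_prob[w])
    # Backward A*: start from the sentence end; best[] supplies an exact
    # (admissible) estimate of the best possible remaining path.
    results = []
    heap = [(-best[L], 1.0, L, [])]   # (-priority, backward prob, position, words)
    while heap and len(results) < n:
        _, bprob, i, words = heapq.heappop(heap)
        if i == 0:
            results.append((list(reversed(words)), bprob))
            continue
        for j in range(i):
            w = sent[j:i]
            if w in word_prob:
                nb = bprob * word_prob[w]
                heapq.heappush(heap, (-(nb * best[j]), nb, j, words + [w]))
    return results
```

For example, with the toy dictionary {'a': 0.5, 'b': 0.4, 'c': 0.9, 'ab': 0.2, 'bc': 0.3, 'abc': 0.05}, the four parses of "abc" come out in descending probability order.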
4 Word Model for Non-Words and 
Unknown Words 
The identification of non-words and unknown
words is a key to implementing a Japanese spelling
corrector, because word identification errors severely
affect the segmentation of neighboring words.
We take the following approach for this word
boundary problem. We first hypothesize all sub-
strings in the input sentence as words, and assign
a reasonable non-zero probability. For example,
at point 7 in Figure 1, other than the exactly and
approximately matched words starting at 7, such
as 必要 ('necessary'), 必ず ('necessarily'), and 池
('pond'), we hypothesize the substrings 必, 必要,
必要事, 必要事項, ... as words. We then locate the
most likely word boundaries using the forward-
DP backward-A* algorithm, taking into account
the entire sentence.
We use a statistical word model to assign a
probability to each substring (Nagata, 1996). It
is defined as the joint probability of the character
sequence if it is an unknown word. Without loss
of generality, we can write,

P(w_i|<UNK>) = P(c_1 ... c_k|<UNK>) = P(k) P(c_1 ... c_k|k)    (7)

where c_1 ... c_k is the character sequence of length
k that constitutes word w_i. We call P(k) the
word length model, and P(c_1 ... c_k|k) the spelling
model.
We assume that word length probability P(k)
obeys a Poisson distribution whose parameter is
the average word length λ,

P(k) = ((λ-1)^(k-1) / (k-1)!) e^(-(λ-1))    (8)

This means that we think word length is the in-
terval between hidden word boundary markers,
which are randomly placed so that the average in-
terval equals the average word length. Although
this word length model is very simple, it plays a
key role in making the word segmentation algo-
rithm robust.
We approximate the spelling probability given
word length P(c_1 ... c_k|k) by the word-based char-
acter trigram model, regardless of word length.
Since there are more than 3,000 characters in
Japanese, the amount of training data would be
too small if we divided them by word length.

P(c_1 ... c_k|k) ≈ P(c_1|#,#) P(c_2|#,c_1) prod_{i=3..k} P(c_i|c_{i-2},c_{i-1})    (9)

where "#" indicates the word boundary marker.
Note that the word-based character trigram
model is different from the sentence-based char-
acter trigram model. The former is estimated
from the corpus which is segmented into words. It
assigns large probabilities to character sequences
that appear within a word, and small probabilities
to those that appear across word boundaries.
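Equations (7)-(9) can be sketched as follows; the trigram dictionary layout and the small floor probability for unseen trigrams are our assumptions:

```python
import math

def word_length_prob(k, lam):
    """Word length model P(k) of Equation (8): a Poisson distribution
    shifted to k >= 1, whose parameter is the average word length lam."""
    return (lam - 1.0) ** (k - 1) / math.factorial(k - 1) * math.exp(-(lam - 1.0))

def unknown_word_prob(chars, lam, trigram):
    """P(w_i | <UNK>) = P(k) * spelling model (Equations (7) and (9)),
    where '#' is the word boundary marker and trigram maps
    (c_{i-2}, c_{i-1}, c_i) to a word-based character trigram probability."""
    k = len(chars)
    padded = ['#', '#'] + list(chars)
    spelling = 1.0
    for i in range(k):
        spelling *= trigram.get((padded[i], padded[i + 1], chars[i]), 1e-6)
    return word_length_prob(k, lam) * spelling
```

Because the shifted Poisson sums to one over k >= 1, every substring receives a proper, non-zero probability of being an unknown word.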
5 Approximate Match for 
Correction Candidates 
As described before, we hypothesize all substrings
in the input sentence as words, and retrieve ap-
proximately matched words from the dictionary
as correction candidates. For a word hypoth-
esis, correction candidates are generated based
on the minimum edit distance technique (Wag-
ner and Fischer, 1974). Edit distance is defined
as the minimum number of editing operations (in-
sertions, deletions, and substitutions) required to
transform one string into another. If the target
is OCR output, we can restrict the type of errors
to substitutions only. Thus, the similarity of two
words can be computed as c/n, where c is the
number of matched characters and n is the length
of the misspelled (and dictionary) word.
For longer words (>= 3 characters), it is rea-
sonable to generate correction candidates by re-
trieving all words in the dictionary with similarity
above a certain threshold (c/n >= 0.5). For exam-
ple, at point 0 in Figure 1, 申し込み ('application')
is retrieved by approximately matching the corre-
sponding string in the character matrix with the
dictionary (c/n = 3/4 = 0.75).
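Since only substitutions are allowed, the similarity reduces to a position-wise match count over equal-length strings; a minimal sketch (the function names are ours):

```python
def similarity(word, target):
    """Substitution-only similarity c/n: c matching positions out of
    n characters; strings of different length cannot match."""
    if len(word) != len(target):
        return 0.0
    c = sum(1 for a, b in zip(word, target) if a == b)
    return c / len(target)

def approx_match(substring, dictionary, threshold=0.5):
    """Retrieve dictionary words whose similarity to the hypothesized
    substring is at or above the threshold (used for 3+ character words)."""
    return [w for w in dictionary if similarity(w, substring) >= threshold]
```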
However, for short words (1 or 2 character
words), this strategy is unrealistic because there
are a large number of words within one edit dis-
tance. Since the total number of one-character
words and two-character words amounts to more
than 80% of the total word tokens in Japanese, we
cannot neglect these short words.
It is natural to resort to context-dependent
word correction methods to overcome the short
word problem. In English, (Gale and Church,
1990) achieved good spelling check performance
using word bigrams. However, in Japanese, we
cannot use word bigrams to rank correction can-
didates, because we have to rank them before we
perform word segmentation.
Therefore, we used character context instead of
word context. For a short word, correction candi-
dates with the same edit distance are ranked by
the joint probability of the previous and the fol-
lowing two characters in the context. This proba-
bility is computed using the sentence-based char-
acter trigram model. For 2-character words, for
example, we first retrieve a set of words in the
dictionary that match exactly one character with
the one in the input string. We then compute the
6-gram probability for all candidate words s_i s_{i+1},
and rank them according to the probability.

P(c_{i-2}, c_{i-1}, s_i, s_{i+1}, c_{i+2}, c_{i+3}) =
P(s_i|c_{i-2}, c_{i-1}) P(s_{i+1}|c_{i-1}, s_i) P(c_{i+2}|s_i, s_{i+1}) P(c_{i+3}|s_{i+1}, c_{i+2})    (10)
For example, at point 12 in Figure 1, there are
many two-character words whose first character
is 記, such as 記述 ('mention'), 記事 ('article'), 記
者 ('journalist'), 記入 ('entry'), 記念 ('commem-
oration'), etc. By using character contexts, the
system selects 記入 ('entry') and 記述 ('mention') as
approximately matched word hypotheses.
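Equation (10) decomposes the 6-gram probability into four sentence-based character trigrams; a small sketch (the floor probability for unseen trigrams is our assumption):

```python
def context_score(left, cand, right, trigram):
    """Rank a two-character candidate word cand = s1 s2 by the joint
    probability of Equation (10): four sentence-based character trigrams
    over the two preceding (left) and two following (right) characters."""
    c1, c2 = left        # c_{i-2}, c_{i-1}
    s1, s2 = cand
    c3, c4 = right       # c_{i+2}, c_{i+3}
    def t(a, b, c):
        return trigram.get((a, b, c), 1e-6)   # assumed floor for unseen events
    return t(c1, c2, s1) * t(c2, s1, s2) * t(s1, s2, c3) * t(s2, c3, c4)
```

Candidates with the same edit distance are then sorted by this score, so the surrounding characters decide among otherwise indistinguishable short words.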
6 Experiments 
6.1 Language Data and OCR Simulator 
We used the ATR Dialogue Database (Ehara et
al., 1990) to train and test the spelling correc-
tion method. It is a corpus of approximately
800,000 words whose word segmentation and part
of speech tagging were laboriously performed by
hand. In this experiment, we used one fourth of
the ATR Corpus, a portion of the keyboard dia-
logues in the conference registration domain. Ta-
ble 1 shows the number of sentences, words, and
characters for the training and test data. The test
data is not included in the training data. That is,
open data were tested in the experiment.
Table 1: The Amount of Training and Test Data

             Training set   Test set
Sentences    10945          100
Words        150039         1134
Characters   268830         2097
For the spelling correction experiment, we used
an OCR simulator because it is very difficult to
obtain a large amount of test data with arbitrary
recognition accuracies. The OCR simulator takes
an input string and generates a character matrix
using a confusion matrix for Japanese handwriting
OCR, developed in our laboratory. The parame-
ters of the OCR simulator are the recognition ac-
curacy of the first candidate (first candidate cor-
rect rate), and the percentage of the correct char-
acters included in the character matrix (correct
candidate included rate).
In general, the accuracy of current Japanese
handwriting OCR is around 90%. It is lower than
that of printed characters (around 98%) due to the
wide variability in handwriting. When the input
comes from FAX, it degrades another 10% to 15%,
because the resolution of most FAX machines is
200dpi, while that of scanners is 400dpi. There-
fore, we made four test sets of character matri-
ces whose first candidate correct rates and correct
candidate included rates were (70%, 90%), (80%,
95%), (90%, 98%), and (95%, 98%), respectively.
The average number of candidates for a character
was 8.9 in these character matrices4.
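A sketch of such a simulator, under our own assumptions (a per-character confusion lookup table and geometric sampling of the correct character's rank, consistent with the OCR model of Section 2):

```python
import random

def simulate_matrix(text, confusion, p, depth=5, seed=0):
    """Generate a character matrix for an input string: for each character,
    sample the rank of the correct character from a geometric distribution
    with first-candidate accuracy p, and fill the other slots with entries
    from a confusion lookup table. If the sampled rank exceeds depth, the
    correct character is absent from its column."""
    rng = random.Random(seed)
    matrix = []
    for ch in text:
        rank = 1
        while rng.random() > p and rank <= depth:
            rank += 1
        wrong = [c for c in confusion.get(ch, []) if c != ch][:depth]
        column = list(wrong)
        if rank <= depth:
            column = wrong[:rank - 1] + [ch] + wrong[rank - 1:]
        matrix.append(column[:depth])
    return matrix
```

The two simulator parameters of the text map onto p and, indirectly, onto depth together with the size of the confusion table.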
6.2 Character Recognition Accuracy 
First, we compared the proposed word-based
spelling corrector using the POS trigram model
(POS3) with the conventional character-based
spelling corrector using the character trigram
model (Char3). Table 2 shows the character
recognition accuracies after error correction for
various baseline OCR accuracies. We also changed
the condition of the approximate word match. In
Table 2, Match1, Match2, and Match3 represent
the approximate match for substrings whose
lengths are more than or equal to one, two, and
three characters, respectively.
In general, the approximate match for short
words improves character recognition accuracy by
about one percent. When the first candidate cor-
rect rate is low (70% and 80%), the word-based
corrector significantly outperforms the character-
based corrector. This is because, by approximate
word matching, the word-based corrector can cor-
rect words even if the correct characters are not
present in the matrix. When the first candidate
correct rate is high (90% and 95%), the word-
based corrector still outperforms the character-
based corrector, although the difference is small.
This is because most correct characters are al-
ready included in the matrix.
Table 2: Comparison of Character Recognition
Accuracy (Character Trigram vs. POS Trigram)

                          POS3
OCR          Char3     Match1    Match2    Match3
70% (90%)    74.4%     84.6%     83.9%     83.1%
80% (95%)    85.0%     92.5%     92.0%     90.6%
90% (98%)    95.0%     96.0%     95.9%     95.6%
95% (98%)              96.8%     96.7%     95.9%
4 The parameters are selected considering the fact
that the correct candidate included rate increases as
the first candidate correct rate increases, and that
some correct characters are never present in the ma-
trix even if the first candidate correct rate is high.
6.3 Word Segmentation and Word 
Correction Accuracy 
First, we define the performance measures of
Japanese word segmentation and word correction.
We will think of the output of the spelling cor-
rector as a set of 2-tuples of word segmentation and
orthography. We then compare the tuples con-
tained in the system's output to the tuples con-
tained in the standard analysis. For the N-best
candidates, we will make the union of the tuples
contained in each candidate; in other words, we
will make a word lattice from the N-best candidates,
and compare them to the tuples in the standard.
For comparison, we count the number of tuples in
the standard (Std), the number of tuples in the
system output (Sys), and the number of matching
tuples (M). We then calculate recall (M/Std) and
precision (M/Sys) as accuracy measures.
We define two degrees of equality among tuples
for counting the number of matching tuples. For
word segmentation accuracy, two tuples are equal
if they have the same word segmentation regard-
less of orthography. For word correction accuracy,
two tuples are equal if they have the same word
segmentation and orthography.
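These measures can be sketched as follows; the concrete tuple representation (a character span plus the orthography) is our assumption:

```python
def lattice_accuracy(nbest, standard):
    """Compute recall (M/Std) and precision (M/Sys) over 2-tuples such as
    ((start, end), orthography). The tuples of the N-best candidates are
    unioned into a word lattice before matching against the standard."""
    sys_tuples = set().union(*nbest)
    std_tuples = set(standard)
    m = len(sys_tuples & std_tuples)
    return m / len(std_tuples), m / len(sys_tuples)
```

Note that unioning over more candidates can only raise recall while typically lowering precision, which is the pattern visible in the Best-5 columns of Table 3.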
Table 3 shows the word segmentation accuracy
and word correction accuracy. The word segmen-
tation accuracy of the spelling corrector is sig-
nificantly high, even if the input is very noisy.
For example, when the accuracy of the baseline
OCR is 80%, since the average numbers of char-
acters and words in the test sentences are 20.1
and 11.3, there are 4.0 (= 20.1 x (1-0.80)) charac-
ter errors in a sentence, on average. However,
94.5% word segmentation recall means that there
are only 0.62 (= 11.3 x (1-0.945)) word segmenta-
tions that are not found in the first candidate.
Moreover, we feel the word correction accuracy
in Table 3 is satisfactory for an interactive spelling
corrector. For example, when the accuracy of the
baseline OCR is 90%, there are 2.0 (= 20.1 x (1-
0.90)) character errors in the test sentence. How-
ever, 92.8% recall for the first candidate and 95.6%
recall for the top 5 candidates mean that there
are only 0.81 (= 11.3 x (1-0.928)) words that are not
found in the first candidate, and if you examine
the top 5 candidates, this value is reduced to 0.50
(= 11.3 x (1-0.956)). That is, about half of the er-
rors in the first candidate are corrected by simply
selecting the alternatives in the word lattice.
7 Discussion 
Previous works on Japanese OCR error correction
are based on either the character trigram model or
the part of speech bigram model. Their targets are
printed characters, not handwritten characters.
That is, they assume the underlying OCR's ac-
curacy is over 90%. Moreover, their treatment of
unknown words and short words is rather ad hoc.
Table 3: Word Segmentation Accuracy and Word Correction Accuracy for Noisy Texts

             Word Segmentation                       Word Correction
OCR          Recall (Best-5)    Precision (Best-5)   Recall (Best-5)    Precision (Best-5)
70% (90%)    89.0% (92.1%)            (75.2%)        77.1% (82.4%)      71.3% (58.2%)
80% (95%)    94.5% (97.4%)      90.5% (81.7%)        87.9% (92.6%)      84.2% (67.2%)
90% (98%)    96.3% (97.9%)      93.6% (85.8%)        92.8% (95.6%)      90.1% (72.1%)
95% (98%)    97.3% (98.6%)      94.8% (86.8%)        94.3% (97.0%)      91.8% (74.0%)
(Takao and Nishino, 1989) used part of speech bi-
gram and best-first search for OCR correction.
They used heuristic templates for unknown words.
(Ito and Maruyama, 1992) used part of speech bi-
gram and beam search in order to get multiple
candidates in their interactive OCR corrector5.
The proposed Japanese spelling correction
method uses part of speech trigram and N-best
search. This combination is theoretically and
practically more accurate than previous methods.
In addition, by using the statistical word model and
context-based approximate word match, it be-
comes robust enough to handle very noisy texts,
such as the output of FAX OCR systems.
To improve the word correction accuracy, more
powerful language models, such as word bigram,
are required. (Jelinek, 1985) pointed out that
"POS (part of speech) classification is too crude
and not necessarily suited to language modeling".
However, it is too expensive to prepare a large
manually segmented corpus of each target do-
main to compute the word bigram. Therefore, we
are thinking of making a self-organized word seg-
mentation method by generalizing the Forward-
Backward algorithm for those languages that have
no delimiter between words (Nagata, 1996).
8 Conclusion 
We have presented a spelling correction method
for noisy Japanese texts. We are currently
building an interactive Japanese spelling correc-
tor jspell, where words are the basic objects ma-
nipulated by the user in operations such as re-
place, accept, and edit. It is something like the
Japanese counterpart of Unix's spelling corrector
ispell, with a user interface similar to a kana-to-
kanji converter, a popular Japanese input method
5 According to Fig. 6 in (Takao and Nishino, 1989),
they achieved about 95% character recognition accu-
racy when the baseline accuracy is 91% for the maga-
zines and introductory textbooks of science and tech-
nology domain. According to Table 1 in (Ito and
Maruyama, 1992), they achieved 94.61% character
recognition accuracy when the baseline accuracy is
87.46% for the patents in electric engineering domain. We
achieved 96.0% character recognition accuracy when
the baseline accuracy is 90% in the conference reg-
istration domain. It is very difficult to compare our
results with the previous results because the experi-
ment conditions are completely different.
for the ASCII keyboard.
References
Kenneth W. Church. 1988. A Stochastic Parts Pro-
gram and Noun Phrase Parser for Unrestricted
Text. In Proceedings of ANLP, pages 136-143.
Doug Cutting, Julian Kupiec, Jan Pedersen, and
Penelope Sibun. 1992. A Practical Part-of-Speech
Tagger. In Proceedings of ANLP-92, pages 133-140.
Terumasa Ehara, Kentaro Ogura, and Tsuyoshi Mori-
moto. 1990. ATR Dialogue Database. In Proceed-
ings of ICSLP, pages 1093-1096.
William A. Gale and Kenneth W. Church. 1990. Poor
Estimates of Context are Worse than None. In Pro-
ceedings of the DARPA Natural Language and Speech
Workshop, pages 283-287.
Mark D. Kernighan, Kenneth W. Church, and
William A. Gale. 1990. A Spelling Correction Pro-
gram Based on a Noisy Channel Model. In Proceed-
ings of COLING-90, pages 205-210.
Karen Kukich. 1992. Techniques for Automatically
Correcting Words in Text. ACM Computing Sur-
veys, Vol.24, No.4, pages 377-439.
Nobuyasu Ito and Hiroshi Maruyama. 1992. A Method
of Detecting and Correcting Errors in the Results of
Japanese OCR. Transactions of Information Pro-
cessing Society of Japan, Vol.33, No.5, pages 664-
670 (in Japanese).
Frederick Jelinek. 1985. Self-organized Language
Modeling for Speech Recognition. IBM Report.
Eric Mays, Fred J. Damerau, and Robert L. Mer-
cer. 1991. Context Based Spelling Correction. In-
formation Processing & Management, Vol.27, No.5,
pages 517-522.
Masaaki Nagata. 1994. A Stochastic Japanese
Morphological Analyzer Using a Forward-DP
Backward-A* N-Best Search Algorithm. In Proceed-
ings of COLING-94, pages 201-207.
Masaaki Nagata. 1996. Automatic Extraction of New
Words from Japanese Texts using Generalized
Forward-Backward Search. To appear in Proceed-
ings of EMNLP.
Frank K. Soong and Eng-Fong Huang. 1991. A Tree-
Trellis Based Fast Search for Finding the N Best
Sentence Hypotheses in Continuous Speech Recog-
nition. In Proceedings of ICASSP-91, pages 705-708.
Tetsuyasu Takao and Fumihito Nishino. 1989. Imple-
mentation and Evaluation of Post-processing for
Japanese Document Readers. Transactions of
Information Processing Society of Japan, Vol.30,
No.11, pages 1394-1401 (in Japanese).
Robert A. Wagner and Michael J. Fischer. 1974. The
String-to-String Correction Problem. Journal of
the ACM, Vol.21, No.1, pages 168-173.
