AN EVALUATION TO DETECT AND CORRECT ERRONEOUS 
CHARACTERS WRONGLY SUBSTITUTED, DELETED AND 
INSERTED IN JAPANESE AND ENGLISH SENTENCES 
USING MARKOV MODELS 
Tetsuo ARAKI†, Satoru IKEHARA††, Nobuyuki TSUKAHARA†, Yasunori KOMATSU†
† Faculty of Engineering, Fukui University, Fukui, 910 Japan
†† NTT Communication Science Laboratories, 1-2356 Take, Yokosuka-Shi, 238-03 Japan
ABSTRACT 
In optical character recognition and continuous speech recognition of a natural language, it has been difficult to detect erroneous characters that are wrongly deleted or inserted. In order to judge three types of errors (characters wrongly substituted, deleted, or inserted in a Japanese "bunsetsu" or an English word) and to correct these errors, this paper proposes new methods using an m-th order Markov chain model for Japanese "kanji-kana" characters and English letters, assuming that the Markov probability of a correct chain of syllables or "kanji-kana" characters is greater than that of erroneous chains.
From the results of the experiments, it is concluded that the methods are useful for detecting as well as correcting these errors in Japanese "bunsetsu" and English words.
Key words: Markov model, error detection, 
error correction, bunsetsu, substitution, dele- 
tion, insertion 
1 Introduction 
In order to improve the man-machine interface with computers, the development of input devices such as optical character readers (OCR) and speech recognition devices is expected. However, it is not easy to input Japanese sentences with these devices, because they are written with many kinds of characters, especially thousands of "kanji" characters. Sentences input through an OCR or a speech recognition device usually contain erroneous character strings.
The techniques of natural language processing are expected to find and correct these errors. However, since current technologies of natural language analysis have been developed for correct sentences, they cannot be applied directly to this problem. Up to now, statistical approaches have been applied to it.
Markov models are considered to be one of the machine learning models, similar to neural networks and fuzzy models. They have been applied to character chains of natural languages (e.g., English) [1],[2], and to phoneme chains in continuous speech recognition [3],[4]. The 2nd-order Markov model is known to be useful for correcting errors in "kanji-kana" "bunsetsu" [6], for choosing a correct syllable chain from Japanese syllable "bunsetsu" candidates [7], and for reducing the ambiguities in translation processing of non-segmented "kana" sentences into "kanji-kana" sentences [8].
The erroneous characters can be classified into three types. The first is characters wrongly recognized in place of the correct characters. The second and the third are wrongly inserted and wrongly deleted (skipped) characters, respectively. The Markov chain models mentioned above were restricted to finding and correcting the first type of error [5],[6]. No method has been proposed for correcting errors of the second and third types. The reason may be the difficulty of finding the error location and of distinguishing between deletion and insertion errors.
On the other hand, a contextual algorithm utilizing n-gram letter statistics (e.g. [9]) and a dictionary look-up algorithm [10] have been discussed for detecting and correcting erroneous characters in English sentences, which are segmented into words.
This paper proposes new methods, applicable to non-segmented chains of characters, to judge three types of errors (characters wrongly substituted, deleted, and inserted in a Japanese "bunsetsu") and to correct these errors in Japanese "kanji-kana" chains using an m-th order Markov chain model. The methods are based on the relation between the type of error and the length of the chain over which the values of the Markov joint probability remain small. Furthermore, this method is applied to detect and correct errors in segmented English words.
Experiments were conducted for the cases of 2nd-order and 3rd-order Markov models, applied to Japanese and English newspaper articles. The "Relevance Factor" P and the "Recall Factor" R for erroneous characters detected and corrected by this method were experimentally evaluated using statistical data for 70 issues of a daily Japanese newspaper and 5 issues of a daily English newspaper.
2 Basic Definitions and the 
Method of Error Detection 
and Error Correction using 
2nd-Order Markov Model 
2.1 Basic Definitions 
In this paper, two types of natural language sentences are discussed. One is a Japanese sentence, which is non-segmented, and the other is an English sentence, which is segmented into words.
A Japanese sentence can be separated into syntactic units called "bunsetsu", where a "bunsetsu" is composed of one "independent word" and a sequence of n (n ≥ 0) "dependent words".
A "bunsetsu" is a chain of Japanese "kanji-kana" characters and an English word is a chain of letters; both are represented by $\gamma = s_1 s_2 \cdots s_n$, where $s_i$ is a "kanji-kana" character or a letter. In particular, a chain $\gamma$ is called a "J-bunsetsu" when all of its elements are "kanji-kana" characters, and an "E-word" when all of its elements are English letters. The set of correct Japanese "bunsetsu" or English words is represented by $\Gamma_C$.
Three types of erroneous "J-bunsetsu" or "E-word" are defined as follows:
First, a chain $\alpha = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} \hat{s}_i \cdots \hat{s}_n$ is called an "(i,k)-Erroneous J-bunsetsu or E-word Wrongly Substituted" ((i,k)-EWS) if a subchain $\beta = t_1 t_2 \cdots t_k$ is wrongly substituted at the location i of $\alpha$, that is, $\exists \gamma \in \Gamma_C$, $\gamma = \alpha^{(i)} \| \beta$. Here $\alpha^{(i)} \| \beta$ denotes substitution of a subchain $\beta$ at the location i in a chain $\alpha$, that is, $\alpha^{(i)} \| \beta = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} t_1 t_2 \cdots t_k \hat{s}_{i+k} \cdots \hat{s}_n$, and $t_1 \neq \hat{s}_i, \ldots, t_k \neq \hat{s}_{i+k-1}$.
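Next, a chain $\alpha = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} \hat{s}_i \cdots \hat{s}_n$ is called an "(i,k)-Erroneous J-bunsetsu or E-word Wrongly Deleted" ((i,k)-EWD) if a subchain $\beta = t_1 t_2 \cdots t_k$ is wrongly deleted at the location i of $\alpha$, that is, $\exists \gamma \in \Gamma_C$, $\gamma = \alpha^{(i)} \ll \beta$. Here $\alpha^{(i)} \ll \beta$ denotes insertion of a subchain $\beta$ at the location i in a chain $\alpha$, that is, $\alpha^{(i)} \ll \beta = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} t_1 t_2 \cdots t_k \hat{s}_i \cdots \hat{s}_n$.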
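Finally, a chain $\alpha = \hat{s}_1 \cdots \hat{s}_{i-1} \hat{s}_i \cdots \hat{s}_{i+k-1} \hat{s}_{i+k} \cdots \hat{s}_n$ is called an "(i,k)-Erroneous J-bunsetsu or E-word Wrongly Inserted" ((i,k)-EWI) if a subchain $\beta = t_1 t_2 \cdots t_k$ is wrongly inserted at the location i of $\alpha$, that is, $\exists \gamma \in \Gamma_C$, $\gamma = \alpha^{(i)} \gg \beta$. Here $\alpha^{(i)} \gg \beta$ denotes deletion of a subchain $\beta$ at the location i in a chain $\alpha$, that is, $\alpha^{(i)} \gg \beta = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} \hat{s}_{i+k} \cdots \hat{s}_n$, and $t_1 = \hat{s}_i, \ldots, t_k = \hat{s}_{i+k-1}$.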
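The three chain operations defined above can be illustrated with a minimal Python sketch. The function names `substitute`, `insert` and `delete`, the use of Python strings for chains, and the example word are assumptions made for illustration only; locations are 1-based, as in the definitions.

```python
def substitute(chain, i, beta):
    """Substitution of subchain beta at 1-based location i: replaces len(beta) characters."""
    k = len(beta)
    return chain[:i - 1] + beta + chain[i - 1 + k:]


def insert(chain, i, beta):
    """Insertion of subchain beta at 1-based location i (restores deleted characters)."""
    return chain[:i - 1] + beta + chain[i - 1:]


def delete(chain, i, k):
    """Deletion of k characters at 1-based location i (removes inserted characters)."""
    return chain[:i - 1] + chain[i - 1 + k:]


# Hypothetical correct chain and three erroneous variants with i = 3, k = 2.
correct = "markov"
ews = substitute(correct, 3, "xy")   # a (3,2)-EWS: "maxyov"
ewd = delete(correct, 3, 2)          # a (3,2)-EWD: "maov"
ewi = insert(correct, 3, "zz")       # a (3,2)-EWI: "mazzrkov"

# Each erroneous chain maps back to the correct chain by the matching operation.
recovered = (substitute(ews, 3, "rk"), insert(ewd, 3, "rk"), delete(ewi, 3, 2))
```

Note that an EWD is repaired by insertion and an EWI by deletion, which is why the operator in each definition is the inverse of the error it names.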
The sets of (i,k)-EWS, (i,k)-EWD and (i,k)-EWI are represented by $\Gamma_S^{(k)}$, $\Gamma_D^{(k)}$ and $\Gamma_I^{(k)}$, respectively. In this paper, every "bunsetsu" or word input to the computer is assumed to belong to one of $\Gamma_C$, $\Gamma_S^{(k)}$, $\Gamma_D^{(k)}$ and $\Gamma_I^{(k)}$.
Next, the meanings of detecting and correcting errors are defined as follows. The "error detection problem" is the problem of detecting the location i of an error in $\alpha$, and the "error correction problem" is the problem of replacing an erroneous "J-bunsetsu" or "E-word" $\alpha$ by a correct "bunsetsu" or English word $\gamma$, where $\alpha \in \Gamma_S^{(k)}$, $\alpha \in \Gamma_D^{(k)}$ or $\alpha \in \Gamma_I^{(k)}$, and $\gamma \in \Gamma_C$.
The "Relevance Factor" $P^{(D)}$ and the "Recall Factor" $R^{(D)}$ for the "error detection problem" are defined as follows:
(1): $P^{(D)} \triangleq$ (the number of "J-bunsetsu" or "E-words" for which the location i and length k of the error in $\Gamma_S^{(k)}$, $\Gamma_D^{(k)}$ or $\Gamma_I^{(k)}$ are correctly detected) / (the total number of "J-bunsetsu" or "E-words" detected as erroneous "J-bunsetsu" or "E-words").
(2): $R^{(D)} \triangleq$ (the number of "J-bunsetsu" or "E-words" for which the location i and length k of the error in $\Gamma_S^{(k)}$, $\Gamma_D^{(k)}$ or $\Gamma_I^{(k)}$ are correctly detected) / (the number of all "J-bunsetsu" or "E-words" in the set $\Gamma_S^{(k)}$, $\Gamma_D^{(k)}$ or $\Gamma_I^{(k)}$ prepared in advance).
The "Relevance Factor" $P^{(C)}$ and the "Recall Factor" $R^{(C)}$ for the "error correction problem" are defined similarly. Here $P_S^{(D)}$ denotes the "Relevance Factor" for the "error detection problem" of $\Gamma_S^{(k)}$, and $R_D^{(C)}$ denotes the "Recall Factor" for the "error correction problem" of $\Gamma_D^{(k)}$.
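As a minimal sketch of how the two factors could be computed from counts, the following Python function follows the two ratios defined above. The function name, the argument names and the example counts are hypothetical and are not taken from the experiments.

```python
def relevance_and_recall(num_correct, num_flagged, num_prepared):
    """Return (P, R) as fractions.

    num_correct:  chains whose error location and length were correctly detected
    num_flagged:  chains that the procedure reported as erroneous
    num_prepared: erroneous chains prepared in advance for the test set
    """
    p = num_correct / num_flagged if num_flagged else 0.0
    r = num_correct / num_prepared if num_prepared else 0.0
    return p, r


# Hypothetical counts, for illustration only.
p, r = relevance_and_recall(num_correct=450, num_flagged=500, num_prepared=800)
```

Raising the critical value T tends to flag more chains, which trades the relevance factor against the recall factor; this is the trade-off traced by the curves in the experimental section.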
2.2 The Method of Error Detection using 
2nd-Order Markov Model 
We introduce the following assumption, based on experience.
Assumption: Each Markov probability for erroneous chains of "kanji-kana" characters or English letters is small compared to that of correct chains.
According to this assumption, the procedures for detecting the location i and the length k of error chains are defined as follows:
Procedure 1 (Method of detecting the location and the length of a chain wrongly substituted in $\Gamma_S^{(k)}$ or wrongly inserted in $\Gamma_I^{(k)}$)
Find the subchain of length k which satisfies the following conditions. This chain is judged to be wrongly substituted or inserted at the location i.
(1) $P(X_h \mid X_{h-m} \cdots X_{h-1}) > T$, for $h = i-1$ or $h = i+k+m$, and
(2) $P(X_j \mid X_{j-m} \cdots X_{j-1}) < T$, for $\forall j$ such that $i \le j \le i+k+m-1$,
where $P(X_j \mid X_{j-m} \cdots X_{j-1})$ is the m-th order Markov chain probability, which denotes the probability of occurrence of the successive character $X_j$ when the string $X_{j-m} \cdots X_{j-1}$ has occurred, and $X_u$ denotes a space symbol if $u \le 0$. T denotes a critical value of the m-th order Markov probability used for detecting errors.
This procedure judges that k characters are wrongly substituted or inserted at the location i if the m-th order Markov probability for the chain remains smaller than the critical value T exactly (k+m) times, from the location i to i+k+m-1.
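The detection idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is estimated by plain relative frequencies without smoothing, chains are padded with m leading spaces to stand for the space symbol, and the names `train`, `cond_probs` and `low_runs` are hypothetical. A run that reaches the end of the chain is reported as well, which simplifies condition (1).

```python
from collections import Counter


def train(corpus, m):
    """Count m-character contexts and (context, next character) pairs."""
    ctx_counts, pair_counts = Counter(), Counter()
    for chain in corpus:
        padded = " " * m + chain            # leading spaces stand for X_u, u <= 0
        for j in range(m, len(padded)):
            ctx = padded[j - m:j]
            ctx_counts[ctx] += 1
            pair_counts[(ctx, padded[j])] += 1
    return ctx_counts, pair_counts


def cond_probs(chain, model, m):
    """Return P(X_j | X_{j-m} ... X_{j-1}) for every character of the chain."""
    ctx_counts, pair_counts = model
    padded = " " * m + chain
    probs = []
    for j in range(m, len(padded)):
        ctx = padded[j - m:j]
        total = ctx_counts[ctx]
        probs.append(pair_counts[(ctx, padded[j])] / total if total else 0.0)
    return probs


def low_runs(probs, threshold):
    """Return (start, length) for each maximal run of probabilities below threshold.

    start is a 1-based location, matching the location i in the procedure.
    """
    runs, start = [], None
    for idx, p in enumerate(probs, start=1):
        if p < threshold:
            if start is None:
                start = idx
        elif start is not None:
            runs.append((start, idx - start))
            start = None
    if start is not None:
        runs.append((start, len(probs) - start + 1))
    return runs


# Hypothetical probability profile for an 8-character chain and m = 2.
profile = [0.5, 0.4, 0.6, 0.01, 0.0, 0.02, 0.0, 0.7]
runs = low_runs(profile, threshold=0.1)     # one run starting at location 4, length 4
model = train(["abab"], m=1)
```

With m = 2, a run of length 4 starting at location 4 corresponds to k = 4 - 2 = 2 characters wrongly substituted or inserted at i = 4.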
For example, the change in the value of the 2nd-order Markov probability for each character of an erroneous chain in $\Gamma_S^{(2)}$ or $\Gamma_I^{(2)}$ is shown in Fig. 1. In this example, two characters are wrongly substituted or inserted. According to the previous assumption, the 2nd-order Markov probability for the erroneous chain remains smaller than the critical value T exactly four times.
[Fig. 1: a chain S1 S2 ... S8 containing erroneous characters, with P(S3|S1S2) > T, P(S4|S2S3) < T, P(S5|S3S4) < T, P(S6|S4S5) < T, P(S7|S5S6) < T, and P(S8|S6S7) > T. Legend: marked circles are erroneous characters; X marks the location of a character whose Markov probability is smaller than T; T is the critical value of the Markov probability.]
Fig. 1. Change of the value of 2nd-order Markov probabilities
Procedure 2 (Method of detecting the location of a chain wrongly deleted in $\Gamma_D^{(k)}$)
Find the subchain of length k which satisfies the following conditions. This chain is judged to be wrongly deleted at the location i.
(1) $P(X_h \mid X_{h-m} \cdots X_{h-1}) > T$, for $h = i-1$ or $h = i+m$, and
(2) $P(X_j \mid X_{j-m} \cdots X_{j-1}) < T$, for $\forall j$ such that $i \le j \le i+m-1$,
where T denotes a critical value of the m-th order Markov probability used for detecting errors.
If the m-th order Markov probabilities for the chain remain smaller than the critical value T exactly m times, from the location i to i+m-1, it is judged that some characters are wrongly deleted at the location i. Note, however, that the length k of the characters wrongly deleted at the location i cannot be determined by this procedure; the length k is determined by Procedure 4, shown in Sec. 2.3.
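Taken together, Procedures 1 and 2 amount to classifying an error by the length of the run of low probabilities. A minimal Python sketch of that decision rule follows; the function name and the returned labels are hypothetical, and runs shorter than m are simply left unclassified here because the procedures do not cover them.

```python
def classify_run(run_length, m):
    """Map the length of a run of low Markov probabilities to an error type.

    Returns (label, k), where k is the estimated error length or None if unknown.
    """
    if run_length < m:
        return ("unclassified", None)
    if run_length == m:
        # Procedure 2: a deletion; its length k is not determined at this stage.
        return ("deletion", None)
    # Procedure 1: a run of k + m low values means k substituted or inserted characters.
    return ("substitution_or_insertion", run_length - m)


# With a 2nd-order model, a run of four low values implies k = 2.
example = classify_run(4, m=2)
```

The rule cannot separate substitution from insertion, and it cannot give k for a deletion; both ambiguities are left to the correction step.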
Table 1 shows the number of times that the Markov probabilities remain smaller than T in the cases of 1st- and 2nd-order Markov models. Erroneous chains can be classified into the following two cases: one is the case of characters wrongly substituted or inserted, and the other is the case of characters wrongly deleted.
Table 1. The number of times that the Markov probability of an erroneous chain remains smaller than T

Type of error        1st-order Markov    2nd-order Markov
$\Gamma_S^{(1)}$     two times           three times
$\Gamma_S^{(2)}$     three times         four times
$\Gamma_S^{(k)}$     (k+1) times         (k+2) times
$\Gamma_D^{(1)}$     one time            two times
$\Gamma_D^{(2)}$     one time            two times
$\Gamma_D^{(k)}$     one time            two times
$\Gamma_I^{(1)}$     two times           three times
$\Gamma_I^{(2)}$     three times         four times
$\Gamma_I^{(k)}$     (k+1) times         (k+2) times

(Fig. 1 plots the values for each character of the erroneous chain, including the wrongly substituted or inserted characters.)
When errors in $\Gamma_I^{(2)}$ are detected using the 2nd-order Markov model, it can be presumed that a subchain $\beta$ of length 2 is wrongly inserted at the location i of the erroneous chain $\alpha$ if the 2nd-order Markov probability for $\alpha$ remains smaller than T exactly four times from the location i.
However, this method cannot distinguish characters wrongly substituted from characters wrongly inserted in the former case, and cannot determine the length k for the type $\Gamma_D^{(k)}$, because the Markov probability of any erroneous chain in $\Gamma_D^{(k)}$ remains small the same number of times regardless of the length k. These problems can be solved by Procedures 3 and 4, shown in Sec. 2.3.
In this paper, the effectiveness of error detection is evaluated for the cases of length k = 1, 2.
2.3 The Method of Error Correction using 
2nd-Order Markov Model 
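The procedures for replacing erroneous chains by correct chains using the Markov model are presented as follows:
Procedure 3 (Method of correcting the erroneous chains in $\Gamma_S^{(k)}$ or $\Gamma_I^{(k)}$)
Suppose that a "bunsetsu" or word $\alpha = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} \hat{s}_i \cdots \hat{s}_{i+k-1} \hat{s}_{i+k} \cdots \hat{s}_n$ is an "(i,k)-EWS" or an "(i,k)-EWI", and that a subchain $\beta = t_1 t_2 \cdots t_k$ is assumed to be wrongly substituted or inserted at the location i of $\alpha$, respectively. Then the erroneous chain $\alpha$ can be replaced by the following correct chain $\gamma$ in $\Gamma_C$ if condition (1) is satisfied.
$\gamma = \alpha^{(i)} \| \beta = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} t_1 t_2 \cdots t_k \hat{s}_{i+k} \cdots \hat{s}_n$, with $t_1 \neq \hat{s}_i, \ldots, t_k \neq \hat{s}_{i+k-1}$, or
$\gamma = \alpha^{(i)} \gg \beta = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} \hat{s}_{i+k} \cdots \hat{s}_n$, with $t_1 = \hat{s}_i, \ldots, t_k = \hat{s}_{i+k-1}$.
(1) $P(X_j \mid X_{j-m} \cdots X_{j-1}) > T$, for $\forall j$ such that $i+k \le j \le i+k+m-1$.
By comparing the Markov probabilities of the correct chains in the two cases above, choose the correct chain which has the greater Markov probability.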
Procedure 4 (Method of correcting the erroneous chains in $\Gamma_D^{(k)}$)
Suppose that a chain $\alpha = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} \hat{s}_i \cdots \hat{s}_n$ is an "(i,k)-EWD", and that a subchain $\beta = t_1 t_2 \cdots t_k$ is assumed to be wrongly deleted at the location i of $\alpha$. Then the erroneous chain $\alpha$ can be replaced by the following correct chain $\gamma$ in $\Gamma_C$ if condition (1) is satisfied.
$\gamma = \alpha^{(i)} \ll \beta = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{i-1} t_1 t_2 \cdots t_k \hat{s}_i \cdots \hat{s}_n$.
(1) $P(X_j \mid X_{j-m} \cdots X_{j-1}) > T$, for $\forall j$ such that $i+k \le j \le i+k+m-1$.
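A rough Python sketch of Procedures 3 and 4 is given below. It is an illustration under several assumptions rather than the authors' implementation: candidate subchains are enumerated exhaustively over a small hypothetical alphabet, the threshold test is applied to every character of the candidate rather than only to the window in condition (1), the model is an unsmoothed relative-frequency estimate from a toy corpus, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod


def make_model(corpus, m):
    """Return a function cond_prob(context, ch) estimated by relative frequency."""
    ctx, pair = Counter(), Counter()
    for word in corpus:
        padded = " " * m + word
        for j in range(m, len(padded)):
            ctx[padded[j - m:j]] += 1
            pair[padded[j - m:j + 1]] += 1

    def cond_prob(context, ch):
        return pair[context + ch] / ctx[context] if ctx[context] else 0.0

    return cond_prob


def chain_probs(chain, cond_prob, m):
    """Conditional probability of each character given its m predecessors."""
    padded = " " * m + chain
    return [cond_prob(padded[j - m:j], padded[j]) for j in range(m, len(padded))]


def candidates(alpha, i, k, alphabet, error_type):
    """Yield candidate corrections for an error of length k at 1-based location i."""
    head = alpha[:i - 1]
    if error_type == "inserted":             # undo an insertion by deleting k characters
        yield head + alpha[i - 1 + k:]
        return
    # "substituted": replace k characters; "deleted": put k characters back.
    tail = alpha[i - 1 + k:] if error_type == "substituted" else alpha[i - 1:]
    for beta in product(alphabet, repeat=k):
        yield head + "".join(beta) + tail


def correct(alpha, i, k, alphabet, error_types, cond_prob, m, threshold):
    """Return the passing candidate with the greatest joint probability, or None."""
    best, best_score = None, 0.0
    for error_type in error_types:
        for gamma in candidates(alpha, i, k, alphabet, error_type):
            probs = chain_probs(gamma, cond_prob, m)
            if all(p > threshold for p in probs):
                score = prod(probs)
                if score > best_score:
                    best, best_score = gamma, score
    return best


# Toy corpus and alphabet, for illustration only.
cond_prob = make_model(["markov", "market", "mark"], m=2)
letters = "aekmortv"
fixed_sub = correct("maxkov", 3, 1, letters, ["substituted", "inserted"], cond_prob, 2, 0.1)
fixed_del = correct("makov", 3, 1, letters, ["deleted"], cond_prob, 2, 0.1)
fixed_ins = correct("marxkov", 4, 1, letters, ["substituted", "inserted"], cond_prob, 2, 0.1)
```

Trying both the substitution and the insertion hypothesis and keeping the more probable result mirrors the final comparison step of Procedure 3. Exhaustive enumeration grows as the alphabet size to the power k, so a practical system for "kanji-kana" characters would need to restrict the candidates, for instance to characters that the model has seen after the same context.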
An example of correcting an erroneous chain in which two characters are wrongly substituted ($\Gamma_S^{(2)}$) is shown in Fig. 2. If the Markov probabilities no longer remain smaller than the critical value T, it is judged that the erroneous chain has been corrected.
[Fig. 2: an erroneous chain S1 S2 ... S8. (1) Correction of insertion errors: candidate chain S1 S2 S3 S6 S7 S8, checked with P(X6|X2X3) > T and P(X7|X3X6) > T. (2) Correction of substitution errors: candidate chain S1 S2 S3 Sc1 Sc2 S6 S7 S8, checked with P(Xc1|X2X3) > T, P(Xc2|X3Xc1) > T, P(X6|Xc1Xc2) > T and P(X7|Xc2X6) > T. T is the critical value of the Markov probability. Choose the candidate "bunsetsu" which has the greater Markov probability in the two cases.]
Fig. 2. Procedure for correcting an erroneous string using error detection
3 Experimental Results 
3.1 Experimental Conditions 
1. The number of "bunsetsu" for 70 issues of a daily Japanese newspaper: 283,96[?] "bunsetsu"
2. The number of words for 5 issues of a daily English newspaper: 155,459 words
3. Types of errors and the number of "bunsetsu": 800 "bunsetsu" are prepared for each of $\Gamma_S^{(k)}$, $\Gamma_D^{(k)}$ and $\Gamma_I^{(k)}$ (k = 1, 2)
(a) The average length of "bunsetsu" composed of "kanji-kana" character chains: 6 characters
(b) The average length of correct English words composed of letter chains: 7 characters
4. Markov model of Japanese "kanji-kana" characters: 2nd-order Markov model
5. Markov models of English letters: 2nd- and 3rd-order Markov models
3.2 Experimental Results and Discussion 
The accuracy of error detection and error correction depends on the critical value T of the Markov probabilities. The "Relevance Factor" P and the "Recall Factor" R for each method were obtained by changing the value of T.
[1] The Relation between P and R of Detecting Erroneous Chains Using the Detection Procedure
The relation between P and R for the location of erroneous "kanji-kana" chains detected in $\Gamma_S^{(1)}$, $\Gamma_S^{(2)}$, $\Gamma_D^{(1)}$, $\Gamma_D^{(2)}$, $\Gamma_I^{(1)}$ and $\Gamma_I^{(2)}$ using Procedures 1 and 2 is shown in Fig. 3, and that for erroneous letter chains is shown in Fig. 4.
From these figures, the following results are 
obtained : 
1. The maximum values of P and R for detecting erroneous characters wrongly inserted or substituted are greater than those for erroneous characters wrongly deleted.
(a) In the case of "J-bunsetsu":
$P_S^{(D)}$ = [?]7-[?]9%, $R_S^{(D)}$ = [?]7-[?]%
$P_D^{(D)}$ = 100%, $R_D^{(D)}$ = 57-58%
$P_I^{(D)}$ = 88-94%, $R_I^{(D)}$ = 88-[?]%
(b) In the case of "E-word":
$P_S^{(D)}$ = 38-49%, $R_S^{(D)}$ = 38-[?]%
$P_D^{(D)}$ = 94-[?]%, $R_D^{(D)}$ = [?]-[?]%
$P_I^{(D)}$ = 42-[?]8%, $R_I^{(D)}$ = [?]9-42%
2. Comparing these maximal values, it is shown that the maximum value of the product of P and R for "kanji-kana" "bunsetsu" is 35%-60% greater than that for English words.
[2] The Relation between P and R of Chains Corrected Using the Correction Procedure
The relation between P and R of "J-bunsetsu" corrected using Procedures 3 and 4 for $\Gamma_S^{(1)}$, $\Gamma_S^{(2)}$, $\Gamma_D^{(1)}$, $\Gamma_D^{(2)}$, $\Gamma_I^{(1)}$ and $\Gamma_I^{(2)}$ is shown in Fig. 5.
From this figure, the following result is obtained:
The maximum values of P and R for correcting erroneous characters wrongly inserted or substituted, using the 2nd-order Markov model, are greater than those for erroneous characters wrongly deleted.
$P_S^{(C)}$ = 92-98%, $R_S^{(C)}$ = [?]-9[?]%
$P_D^{(C)}$ = [?]-8[?]%, $R_D^{(C)}$ = 46-[?]%
$P_I^{(C)}$ = 69-94%, $R_I^{(C)}$ = [?]2-88%
[Fig. 3: curves for each error set; horizontal axis: Recall factor [%], range 0-100]
Fig. 3. Experimental results for detecting a location of an erroneous "kanji-kana" string using the error detection procedure
[Fig. 4: curves for each error set; horizontal axis: Recall factor [%]]
Fig. 4. Experimental results for detecting a location of an erroneous English word using the error detection procedure
[Fig. 5: curves for each error set; horizontal axis: Recall factor [%], range 0-100]
Fig. 5. Experimental results for correcting an erroneous "kanji-kana" string using the error correction procedure
[3] The Combinatorial Effect to Correct Erroneous English Words Using the Spell Checker and the Correction Procedure by Markov Model
The experimental results of detecting errors in English words using Ispell (Interactive Spell checker) are shown in Table 2. From the results, it is seen that Ispell can almost perfectly detect erroneous words in $\Gamma_D$, $\Gamma_I$ and $\Gamma_S$ using a dictionary, but it cannot perfectly correct them, because it can output the correct candidates for erroneous words in $\Gamma_D^{(1)}$, $\Gamma_I^{(1)}$ and $\Gamma_S^{(1)}$, but cannot output the correct candidates for erroneous words in $\Gamma_D^{(2)}$, $\Gamma_I^{(2)}$ and $\Gamma_S^{(2)}$. It is necessary to detect the location of the erroneous letters in words in order to detect all these errors. However, it should be noted that Ispell cannot detect the location of erroneous letters in words.
In order to detect and correct erroneous "E-words" more effectively, a method combining Ispell with the procedures (in Sec. 2.3) using the Markov model is expected to be useful. The combined method proceeds as follows: (1) First, erroneous "E-words" are detected by Ispell, although it cannot detect the locations of the erroneous letters in the words. (2) Next, the correct candidate words are decided by Procedures 3 and 4. (3) Finally, Ispell checks again whether these candidates are correct words. The experimental results using this method are shown in Fig. 6 (2nd-order) and in Fig. 7 (3rd-order). From the results, it is seen that this combined method of Ispell and the procedure using the 3rd-order Markov model is very useful for detecting and correcting all errors in English words.
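The three-step combination can be sketched in Python as follows. This is only an outline under stated assumptions: a plain Python set stands in for the spell checker's dictionary (no Ispell interface is called), candidates are generated at every location rather than only at a detected one, only errors of length k = 1 are considered, and the scoring function is supplied by the caller, for example a Markov joint probability. All names are hypothetical.

```python
import string


def single_edit_candidates(word, alphabet=string.ascii_lowercase):
    """All chains obtained by undoing one inserted, substituted, or deleted letter."""
    out = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            out.add(head + tail[1:])                            # undo an insertion
            out.update(head + c + tail[1:] for c in alphabet)   # undo a substitution
        out.update(head + c + tail for c in alphabet)           # undo a deletion
    out.discard(word)
    return out


def combined_correction(word, dictionary, score):
    """Return dictionary-approved candidates for an erroneous word, best score first."""
    if word in dictionary:                    # step (1): the checker accepts the word
        return []
    ranked = sorted(single_edit_candidates(word), key=score, reverse=True)  # step (2)
    return [c for c in ranked if c in dictionary]                           # step (3)


# Toy dictionary and a constant scorer, for illustration only.
toy_dictionary = {"markov", "model", "word"}
constant_score = lambda w: 0.0
after_deletion = combined_correction("wrd", toy_dictionary, constant_score)
after_insertion = combined_correction("modxel", toy_dictionary, constant_score)
```

Filtering the ranked candidates through the dictionary in the last step removes chains that are probable under the character model but are not words, which is the part a character-level Markov model cannot judge by itself.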
It takes about 10 milliseconds and 6 seconds on average to detect and to correct an erroneous "bunsetsu", respectively. Examples of "bunsetsu" and the output results of error detection and error correction using the Markov model are shown in Fig. 8.
Table 2. The capability of error detection using Ispell

Type of error        Able to detect,           Able to detect,              Unable to detect
                     with correct candidate    without correct candidate
$\Gamma_D^{(1)}$     76.0%                     17.5%                        6.5%
$\Gamma_D^{(2)}$     0%                        79.5%                        20.5%
$\Gamma_I^{(1)}$     82.0%                     18.0%                        0%
$\Gamma_I^{(2)}$     0%                        100.0%                       0%
$\Gamma_S^{(1)}$     80.5%                     18.5%                        1.0%
$\Gamma_S^{(2)}$     4.0%                      96.0%                        0%
[Fig. 6: curves for each error set; horizontal axis: Rank]
Fig. 6. Experimental results for correcting erroneous English words using Ispell and the error correction procedure in the case of the 2nd-order Markov model
[Fig. 7: curves for each error set; horizontal axis: Rank]
Fig. 7. Experimental results for correcting erroneous English words using Ispell and the error correction procedure in the case of the 3rd-order Markov model
[Fig. 8: (a) Case of an erroneous syllable "bunsetsu": the erroneous input "bunsetsu" and the output result (correct "bunsetsu") of error correction. (b) Case of an erroneous "kanji-kana" "bunsetsu" for $\Gamma_D$: the erroneous input "bunsetsu", the output result (error position) of error detection, which is the first character, and the output result (correct "bunsetsu") of error correction. Japanese example strings omitted.]
Fig. 8. Examples of erroneous "bunsetsu" and the results of error detection and error correction
4 Conclusion 
This paper proposed methods to judge three types of errors, namely characters wrongly substituted, inserted, and deleted in Japanese "kanji-kana" chains and English words, and to correct these errors using an m-th order Markov model.
The effects of the methods were experimentally evaluated for the cases of 2nd- and 3rd-order Markov chains. From the experimental results, the following conclusions have been obtained:
1. The maximum values of P and R for detecting erroneous characters wrongly inserted or substituted are greater than those for erroneous characters wrongly deleted.
2. This method is especially useful for detecting and correcting erroneous characters wrongly inserted and substituted in "kanji-kana" "bunsetsu", but is not so useful for detecting and correcting errors in English words.
3. The combined method of Ispell and the procedure using the 3rd-order Markov model is useful for detecting and correcting all errors in English words.
However, these methods are not so useful for detecting and correcting characters wrongly deleted in "kanji-kana" "bunsetsu". Therefore, more efficient methods are needed for this type of error.
References 
[1] T. Araki, J. Murakami and S. Ikehara, "Effect of Reducing Ambiguity of Recognition Candidates in Japanese Bunsetsu Unit by 2nd-Order Markov Model of Syllables", Information Processing Society of Japan, Vol.30, No.4, pp.467-477 (1989)
[2] S. Ikehara and S. Shirai, "Japanese Character Error Detection by Word Analysis and Correction Candidate Extraction by 2nd-Order Markov Model", Information Processing Society of Japan, Vol.25, No.2, pp.298-305 (1984)
[3] F. Jelinek, "Continuous Speech Recognition by Statistical Methods", Proc. of the IEEE, Vol.64, No.4, pp.532-556 (1976)
[4] T. Kurita and T. Aizawa, "A Method for Correcting Errors on Japanese Word Input and Its Application to Spoken Word Recognition with Large Vocabulary", Information Processing Society of Japan, Vol.25, No.5, pp.831-841 (1984)
[5] J. Murakami, T. Araki and S. Ikehara, "The Effect of Trigram Model in Japanese Speech Recognition", The Institute of Electronics, Information and Communication Engineers, Vol.J75-D-II, No.1, pp.11-20 (1992)
[6] Y. Ooyama and Y. Miyazaki, "Natural Language Processing in a Japanese-text-to-speech System", Information Processing Society of Japan, Vol.27, No.11, pp.1053-1061 (1986)
[7] J.L. Peterson, "Computer Programs for Detecting and Correcting Spelling Errors", Comm. ACM, Vol.23, No.12, pp.676-687 (1980)
[8] L.R. Rabiner, S.E. Levinson and M.M. Sondhi, "On the Application of Vector Quantization and Hidden Markov Models to Speaker-independent, Isolated Word Recognition", Bell System Technical Journal, Vol.62, No.4, pp.1075-1105
[9] E.M. Riseman and A.R. Hanson, "A Contextual Postprocessing System for Error Correction Using Binary n-Gram", IEEE Trans. Comput., Vol.C-23, No.5, pp.480-493 (1974)
[10] C.E. Shannon, "Mathematical Theory of Communication", Bell System Technical Journal, Vol.27, pp.379-423, 623-656, October (1948)
[11] C.E. Shannon, "Prediction and Entropy of Printed English", Bell System Technical Journal, Vol.30, pp.50-64, January (1951)
