File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/c96-2136_intro.xml
Size: 4,663 bytes
Last Modified: 2025-10-06 14:06:04
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2136"> <Title>Context-Based Spelling Correction for Japanese OCR</Title> <Section position="3" start_page="0" end_page="806" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Automatic spelling correction research dates back to the 1960s. Today, there are some excellent academic and commercial spell checkers available for English (Kukich, 1992). However, for those languages that have a different morphology and writing system from English, spelling correction remains one of the significant unsolved research problems in computational linguistics.</Paragraph> <Paragraph position="1"> The basic strategy for English spelling correction is simple: Word boundaries are defined by white space characters. If the tokenized string is not found in the dictionary, it is either a non-word or an unknown word. For a non-word, correction candidates are generated by approximately matching the string with the dictionary, using context-independent word distance measures such as edit distance (Wagner and Fischer, 1974; Kernighan et al., 1990).</Paragraph> <Paragraph position="2"> It is impossible to apply these &quot;isolated word error correction&quot; techniques to Japanese for two reasons: First, in noisy texts, word tokenization is difficult because there are no delimiters between words. Second, context-independent word distance measures are useless because the average word length is very short (&lt; 2), and the character set is huge (&gt; 3000). A Japanese word therefore has a large number of neighbors at edit distance one.</Paragraph> <Paragraph position="3"> In English spelling correction, the &quot;word boundary problem&quot;, such as splits (forgot → for got) and run-ons (in form → inform), and the &quot;short word problem&quot; (ot → on, or, of, at, it, to, etc.) are also known to be very difficult. 
Context information, such as word N-grams, is used to supplement the underlying context-independent correction for these problematic examples (Gale and Church, 1990; Mays et al., 1991). In contrast, Japanese spelling correction must be essentially context-dependent, because a Japanese sentence is, as it were, a run-on sequence of short words, possibly including some typos, something like (lfor.qololinfo'mnyou → I forgot to inform you).</Paragraph> <Paragraph position="4"> In this paper, we present a novel approach to spelling correction, which is suitable for those languages that have no delimiter between words, such as Japanese. It consists of two stages: First, all substrings in the input sentence are hypothesized as words, and those words that approximately match the substrings are retrieved from the dictionary as well as those that exactly match. Based on the statistical language model, the N-best word sequences are then selected as correction candidates from all combinations of exactly and approximately matched words. Figure 1 illustrates this approach. Out of the list of character recognition candidates for the input sentence &quot;~ b~R~7-~,~Y2~g)k~ 79o &quot;, which means &quot;to fill out the necessary items in the application form.&quot;, the system searches for the combination of exactly matched words (solid boxes) and approximately matched words (dashed boxes).¹ The major contribution of this paper is its solution to the word boundary problem and the short word problem in Japanese spelling correction.</Paragraph> <Paragraph position="5"> ¹ OCR output tends to be very noisy, especially for handwriting. To compensate for this behavior, OCRs usually output an ordered list of the best N characters. The list of the candidates for an input string is called the character matrix.</Paragraph>
<Paragraph position="6"> By introducing a statistical model of word length and spelling, the proposed system accurately places word boundaries in noisy texts that include non-words and unknown words. By using the character-based context model, it accurately selects correction candidates for short words from the large number of approximately matched words with the same edit distance.</Paragraph> <Paragraph position="7"> The goal of our project is to implement an interactive word corrector for a handwriting FAX OCR system. We are especially interested in texts that include addresses, names, and messages, such as order forms, questionnaires, and telegraphs.</Paragraph> </Section> </Paper>