<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2136"> <Title>Context-Based Spelling Correction for Japanese OCR</Title>
<Section position="4" start_page="806" end_page="806" type="metho"> <SectionTitle> 2 Noisy Channel Model for </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="806" end_page="806" type="sub_section"> <SectionTitle> Character Recognition </SectionTitle>
<Paragraph position="0"> First, we formulate the spelling correction of OCR errors in the noisy channel paradigm. Let C represent the input string and X represent the OCR output string. Finding the most probable string \hat{C} given the OCR output X amounts to maximizing the function P(X|C)P(C),</Paragraph>
<Paragraph position="1"> \hat{C} = \arg\max_{C} P(C|X) = \arg\max_{C} P(X|C)P(C) \quad (1) </Paragraph>
<Paragraph position="2"> P(C) is called the language model. It is computed from the training corpus. Let us call P(X|C) the OCR model. It can be computed from the a priori likelihood estimates for individual characters,</Paragraph>
<Paragraph position="3"> P(X|C) = \prod_{i=1}^{n} P(x_i|c_i) </Paragraph>
<Paragraph position="4"> where n is the string length. P(x_i|c_i) is called the confusion matrix of characters. It is trained using the input and output strings of the OCR. The confusion matrix is highly dependent on the character recognition algorithm and the quality of the input document. It is a labor-intensive task to prepare a confusion matrix for each character recognition system, since Japanese has more than 3,000 characters. Therefore, we used a simple OCR model where the confusion matrix is approximated by the correct character distribution over the rank of the candidates. We assume that the rank order distribution of the correct characters is a geometric distribution whose parameter is the accuracy of the first candidate.</Paragraph>
<Paragraph position="5"> Let c_i be the i-th character in the input string, x_{ij} be the j-th candidate for c_i, and p be the probability that the first candidate is correct. The confusion probability P(x_{ij}|c_i) is approximated as,</Paragraph>
<Paragraph position="6"> P(x_{ij}|c_i) = p(1-p)^{j-1} </Paragraph>
<Paragraph position="7"> This model reflects the accuracy of the first candidate, and the tendency that the reliability of the candidates decreases abruptly as their rank increases. For example, if the recognition accuracy of the first candidate p is 0.75, we will assign the probabilities of the first, second, and third candidates as 0.75, 0.19, and 0.05, respectively, regardless of the input and output characters.</Paragraph>
<Paragraph position="8"> One of the benefits of using a simple OCR model is that the spelling correction system becomes highly independent of the underlying OCR characteristics. Obviously, a more sophisticated OCR model would improve error correction performance, but even this simple OCR model works fairly well in our experiments.[2]
[2] One of the practical reasons for using the geometric distribution is that we used the confusion matrix for implementing the OCR simulator. We feel it is unfair to use the same confusion matrix both for error generation and error correction.</Paragraph>
</Section>
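As a rough illustration of the rank-based OCR model above, the following sketch computes the geometric confusion probability and the per-string OCR model probability. It is a minimal sketch, not the paper's implementation; the function names, and the assumption that the rank of the correct character is known for each position, are ours.

```python
# A minimal sketch of the rank-based OCR model (names are ours).  The
# confusion matrix is replaced by a geometric distribution over the rank
# of the correct candidate, independent of the actual characters.

def confusion_prob(rank: int, p: float = 0.75) -> float:
    """P(x_ij | c_i) ~= p * (1 - p)^(j - 1) for the j-th ranked candidate,
    where p is the first-candidate accuracy."""
    return p * (1.0 - p) ** (rank - 1)

def ocr_model_prob(correct_ranks: list[int], p: float = 0.75) -> float:
    """P(X | C): product of per-character confusion probabilities, where
    correct_ranks[i] is the rank of the correct character at position i."""
    prob = 1.0
    for rank in correct_ranks:
        prob *= confusion_prob(rank, p)
    return prob

# With p = 0.75 the first three ranks get 0.75, 0.19, and 0.05 (rounded),
# exactly the example in the text.
print([round(confusion_prob(j), 2) for j in (1, 2, 3)])  # [0.75, 0.19, 0.05]
```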
<Section position="2" start_page="806" end_page="806" type="sub_section"> <SectionTitle> 3.1 Statistical Language Model </SectionTitle>
<Paragraph position="0"> For the language model in Equation (1), we used the part of speech trigram model (POS trigram or 2nd-order HMM). It is used as a tagging model in English (Church, 1988; Cutting et al., 1992) and as a morphological analysis model (word segmentation and tagging) in Japanese (Nagata, 1994).</Paragraph>
<Paragraph position="1"> Let the input character sequence be C = c_1 c_2 ... c_m. We approximate P(C) by P(W,T), the joint probability of word sequence W = w_1 w_2 ... w_n and part of speech sequence T = t_1 t_2 ... t_n. P(W,T) is then approximated by the product of part of speech trigram probabilities P(t_i|t_{i-2}, t_{i-1}) and word output probabilities for a given part of speech P(w_i|t_i),</Paragraph>
<Paragraph position="2"> P(C) \approx P(W,T) \approx \prod_{i=1}^{n} P(t_i|t_{i-2}, t_{i-1}) P(w_i|t_i) \quad (5) </Paragraph>
<Paragraph position="3"> These probabilities are estimated by computing the relative frequencies of the corresponding events in the training corpus.[3]
[3] As a word segmentation model, the advantage of the POS trigram model is that it can be trained using a smaller corpus than the word bigram model.</Paragraph>
</Section>
<Section position="3" start_page="806" end_page="806" type="sub_section"> <SectionTitle> 3.2 Forward-DP Backward-A* Algorithm </SectionTitle>
<Paragraph position="0"> Using the language model (5), Japanese morphological analysis can be defined as finding the set of word segmentation and parts of speech (\hat{W}, \hat{T}) that maximizes the joint probability of word sequence and tag sequence P(W,T).</Paragraph>
<Paragraph position="1"> (\hat{W}, \hat{T}) = \arg\max_{W,T} P(W,T) </Paragraph>
<Paragraph position="2"> This maximization search can be efficiently implemented by using the forward-DP backward-A* algorithm (Nagata, 1994). It is a natural extension of the Viterbi algorithm (Church, 1988; Cutting et al., 1992) for languages that do not have delimiters between words, and it can generate N-best morphological analysis hypotheses, like the tree trellis search (Soong and Huang, 1991).</Paragraph>
<Paragraph position="3"> The algorithm consists of a forward dynamic programming search and a backward A* search.</Paragraph>
<Paragraph position="4"> The forward search starts from the beginning of the input sentence, and proceeds character by character. At each point in the sentence, it looks up the combination of the best partial parses ending at the point and the word hypotheses starting at the point. If the connection between a partial parse and a word hypothesis is allowed by the language model, that is, the corresponding part of speech trigram probability is positive, a new continuation parse is made and registered in the best partial path table. For example, at point 4 in Figure 1, the final words of the partial parses ending at 4 are the words glossed 'application', 'prospect', and 'inclusive', while the word hypotheses starting at 4 are the words glossed 'form', 'same', 'moon', and 'circle'.</Paragraph>
<Paragraph position="5"> In the backward A* search, we consider a partial parse recorded in the best partial path table as a state in the A* search. The backward search starts at the end of the input sentence, and backtracks to the beginning of the sentence. Since the probabilities of the best possible remaining paths are exactly known from the forward search, the backward search is admissible.</Paragraph>
<Paragraph position="6"> We made two extensions to the original forward-DP backward-A* algorithm to handle OCR outputs. First, it retrieves all words in the dictionary that match the strings which consist of a combination of the characters in the candidate matrix. Second, the path probability is changed to the product of the language model probability and the OCR model probability, so as to get the most likely character sequence, according to Equation (1).</Paragraph>
</Section> </Section>
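To make the forward pass concrete, the sketch below scores segmentations with the POS trigram model of Equation (5), keeping the best partial parse per (second-to-last tag, last tag) state at each character position. It is a simplified illustration under our own assumptions (hypothetical model tables, a plain dictionary lookup, no OCR candidate expansion or unknown-word hypotheses), not the paper's implementation; the backward A* pass that reads N-best parses off this table is omitted.

```python
# A minimal sketch of the forward-DP pass of the forward-DP backward-A*
# search.  TRIGRAM, WORD_OUT, and LEXICON are hypothetical model tables
# (relative frequencies estimated from a tagged corpus).
import math
from collections import defaultdict

TRIGRAM = {}   # (t_prev2, t_prev1, t) -> P(t | t_prev2, t_prev1)
WORD_OUT = {}  # (t, w) -> P(w | t)
LEXICON = {}   # surface string -> list of possible POS tags

def forward_dp(sentence: str):
    """For each character position, record the best-scoring partial parse
    ending there for every (second-to-last tag, last tag) state.  The table
    gives the exact best-path probabilities that make the subsequent
    backward A* search admissible."""
    # best[pos][(t2, t1)] = (log probability, backpointer)
    best = [defaultdict(lambda: (-math.inf, None)) for _ in range(len(sentence) + 1)]
    best[0][("<bos>", "<bos>")] = (0.0, None)
    for start in range(len(sentence)):
        for (t2, t1), (score, _) in list(best[start].items()):
            if score == -math.inf:
                continue
            # Combine partial parses ending here with word hypotheses starting here.
            for end in range(start + 1, len(sentence) + 1):
                word = sentence[start:end]
                for tag in LEXICON.get(word, []):
                    p = TRIGRAM.get((t2, t1, tag), 0.0) * WORD_OUT.get((tag, word), 0.0)
                    if p <= 0.0:
                        continue  # connection not allowed by the language model
                    cand = score + math.log(p)
                    if cand > best[end][(t1, tag)][0]:
                        best[end][(t1, tag)] = (cand, (start, (t2, t1), word))
    return best
```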
<Section position="5" start_page="806" end_page="808" type="metho"> <SectionTitle> 4 Word Model for Non-Words and </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="806" end_page="808" type="sub_section"> <SectionTitle> Unknown Words </SectionTitle>
<Paragraph position="0"> The identification of non-words and unknown words is a key to implementing a Japanese spelling corrector, because word identification errors severely affect the segmentation of neighboring words.</Paragraph>
<Paragraph position="1"> We take the following approach to this word boundary problem. We first hypothesize all substrings in the input sentence as words, and assign each a reasonable non-zero probability. For example, at point 7 in Figure 1, other than the exactly and approximately matched words starting at 7, such as the words glossed 'necessary', 'necessarily', and 'pond', we hypothesize the substrings of increasing length starting at 7 as words. We then locate the most likely word boundaries using the forward-DP backward-A* algorithm, taking into account the entire sentence.</Paragraph>
<Paragraph position="2"> We use a statistical word model to assign a probability to each substring (Nagata, 1996). It is defined as the joint probability of the character sequence if it is an unknown word. Without loss of generality, we can write,</Paragraph>
<Paragraph position="3"> P(w_i) = P(c_1 ... c_k) = P(k) P(c_1 ... c_k | k) </Paragraph>
<Paragraph position="4"> where k is the length of the character sequence c_1 ... c_k that constitutes word w_i. We call P(k) the word length model, and P(c_1 ... c_k | k) the spelling model.</Paragraph>
<Paragraph position="5"> We assume that the word length probability P(k) obeys a Poisson distribution whose parameter is the average word length \lambda,
P(k) = \frac{(\lambda-1)^{k-1}}{(k-1)!} e^{-(\lambda-1)}
This means that we regard word length as the interval between hidden word boundary markers, which are randomly placed such that the average interval equals the average word length. Although this word length model is very simple, it plays a key role in making the word segmentation algorithm robust.</Paragraph>
<Paragraph position="6"> We approximate the spelling probability given word length, P(c_1 ... c_k | k), by the word-based character trigram model, regardless of word length. Since there are more than 3,000 characters in Japanese, the amount of training data would be too small if we divided it by word length.</Paragraph>
<Paragraph position="7"> P(c_1 ... c_k | k) \approx \prod_{i=1}^{k+1} P(c_i | c_{i-2} c_{i-1}), where c_{-1} = c_0 = c_{k+1} = "#", and "#" indicates the word boundary marker.</Paragraph>
<Paragraph position="8"> Note that the word-based character trigram model is different from the sentence-based character trigram model. The former is estimated from a corpus which is segmented into words. It assigns large probabilities to character sequences that appear within a word, and small probabilities to those that appear across word boundaries.</Paragraph>
</Section> </Section>
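The word model above is compact enough to sketch directly. The snippet below assumes the Poisson length form P(k) = ((λ−1)^{k−1}/(k−1)!) e^{−(λ−1)}, our reconstruction of the garbled equation (chosen so the mean interval equals the average word length λ), and takes the word-based character trigram model as a caller-supplied function; all names are ours.

```python
# A sketch of the statistical word model for unknown words:
# P(w) = P(k) * P(c_1..c_k | k).  The Poisson form and all names are
# our assumptions, not the paper's code.
import math

def word_length_prob(k: int, avg_len: float = 2.0) -> float:
    """P(k): word length as the interval between hidden boundary markers,
    assumed Poisson with mean equal to the average word length."""
    lam = avg_len - 1.0
    return (lam ** (k - 1) / math.factorial(k - 1)) * math.exp(-lam)

def spelling_prob(word: str, trigram) -> float:
    """P(c_1..c_k | k): word-based character trigram model, with '#' as
    the word boundary marker on both sides of the word."""
    chars = ["#", "#"] + list(word) + ["#"]
    prob = 1.0
    for i in range(2, len(chars)):
        prob *= trigram(chars[i - 2], chars[i - 1], chars[i])
    return prob

def unknown_word_prob(word: str, trigram) -> float:
    """Joint probability of a hypothesized unknown word."""
    return word_length_prob(len(word)) * spelling_prob(word, trigram)
```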
<Section position="6" start_page="808" end_page="808" type="metho"> <SectionTitle> 5 Approximate Match for </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="808" end_page="808" type="sub_section"> <SectionTitle> Correction Candidates </SectionTitle>
<Paragraph position="0"> As described before, we hypothesize all substrings in the input sentence as words, and retrieve approximately matched words from the dictionary as correction candidates. For a word hypothesis, correction candidates are generated based on the minimum edit distance technique (Wagner and Fischer, 1974). Edit distance is defined as the minimum number of editing operations (insertions, deletions, and substitutions) required to transform one string into another. If the target is OCR output, we can restrict the type of errors to substitutions only. Thus, the similarity of two words can be computed as c/n, where c is the number of matched characters and n is the length of the misspelled (and dictionary) word.</Paragraph>
<Paragraph position="2"> For longer words (≥ 3 characters), it is reasonable to generate correction candidates by retrieving all words in the dictionary with similarity above a certain threshold (c/n ≥ 0.5). For example, at point 0 in Figure 1, the word glossed 'application' is retrieved by approximately matching the four-character string starting there with the dictionary (c/n = 3/4 = 0.75).</Paragraph>
<Paragraph position="3"> However, for short words (1 or 2 characters), this strategy is unrealistic because there are a large number of words within one edit distance. Since the total number of one-character and two-character words amounts to more than 80% of the word tokens in Japanese, we cannot neglect these short words.</Paragraph>
<Paragraph position="4"> It is natural to resort to context-dependent word correction methods to overcome the short word problem. In English, (Gale and Church, 1990) achieved good spelling check performance using word bigrams. However, in Japanese, we cannot use word bigrams to rank correction candidates, because we have to rank them before we perform word segmentation.</Paragraph>
<Paragraph position="5"> Therefore, we used character context instead of word context. For a short word, correction candidates with the same edit distance are ranked by the joint probability of the previous and the following two characters in the context. This probability is computed using the sentence-based character trigram model. For 2-character words, for example, we first retrieve the set of words in the dictionary that match exactly one character with the one in the input string. We then compute the 6-gram probability for all candidate words s_i s_{i+1}, and rank them according to the probability,</Paragraph>
<Paragraph position="6"> P(c_{i-2} c_{i-1} s_i s_{i+1} c_{i+2} c_{i+3}) \approx P(s_i | c_{i-2} c_{i-1}) P(s_{i+1} | c_{i-1} s_i) P(c_{i+2} | s_i s_{i+1}) P(c_{i+3} | s_{i+1} c_{i+2}) </Paragraph>
<Paragraph position="7"> For example, at point 12 in Figure 1, there are many two-character words sharing the same first character, such as the words glossed 'mention', 'article', 'journalist', 'entry', and 'commemoration'. By using character contexts, the system selects the words glossed 'entry' and 'article' as approximately matched word hypotheses.</Paragraph>
</Section> </Section> </Paper>
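As a closing illustration of Section 5, the sketch below shows the substitution-only c/n similarity used for candidate retrieval, and the character-context ranking of short-word candidates by the 6-character-window probability. The trigram function, all names, and the exact windowing are our assumptions, not the paper's code.

```python
# A sketch of Section 5: substitution-only c/n similarity, and
# sentence-based trigram ranking of 2-character candidates in context.

def similarity(hypothesis: str, dict_word: str) -> float:
    """c/n: matched characters over word length; since only substitutions
    are allowed for OCR errors, only equal-length strings are comparable."""
    if len(hypothesis) != len(dict_word):
        return 0.0
    c = sum(1 for a, b in zip(hypothesis, dict_word) if a == b)
    return c / len(dict_word)

def rank_short_candidates(left: str, right: str, candidates, trigram):
    """Rank 2-character candidates s_i s_{i+1} by the probability of the
    6-character window formed with the two preceding and two following
    context characters, computed with sentence-based character trigrams."""
    def window_prob(cand: str) -> float:
        seq = list(left[-2:]) + list(cand) + list(right[:2])
        prob = 1.0
        for i in range(2, len(seq)):
            prob *= trigram(seq[i - 2], seq[i - 1], seq[i])
        return prob
    return sorted(candidates, key=window_prob, reverse=True)

# Example: similarity("abcd", "abxd") == 0.75, mirroring c/n = 3/4.
```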