<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1041"> <Title>Multimedia Re.search Ibaboratories</Title> <Section position="4" start_page="236" end_page="238" type="metho"> <SectionTitle> 3 Probabilistic Formulation (IIMM) </SectionTitle> <Paragraph position="0"> I:% us assunm t,hal, we want i.o I,:ttov,, the n~,::)st; likc'ly tag seqtt<mcc ~b(PV), given a parlicttlar wot:d Se(lUtmcc I'V. Tim l.agging prol)h-nt is dolim'd as liuding t.lw tttosl, likely t.?tg s(&quot;tltlClt('(! &quot;1'</Paragraph> <Paragraph position="2"> il,y of word sc,,tu(m(-c. H/, given I.It,:~ .'-;cqucuco o1&quot; tags 7', at,d /'(W) is l,ho tmcotMitiotled i)roba.I)ilit,y ot: word scqmmcc W. 'l'hc t>rol>abilit.y I'(W) in (2) is rc'tnovc'd I)ccause it lt~ts no ell'cot, on 0(W). (:onscquent.ly, it. is suilicicnt i.o find the tag SC(ltmnce 7' which sal.isiies (3).</Paragraph> <Paragraph position="3"> Wc can rcwvit,(> the prolmldlily of0ach scqttcnc(' as a. prodltct, of 1,he cotMil.iona.l prot>altilit.i<~s of each word or fag given all of the lm'viotts t, ags.</Paragraph> <Paragraph position="5"> Typically, otto nmkcs t.wo silllplifying assmttp (.ions I.o (:llt. dowu o, l, lt<: nttnd>er of F, rc, l>al:,ilil.ies 1.,.) Ive <+st.inml.ed. I,'irst., rat.her I.h;tn ;tssuttling lt'i dclwnds on all IH'cviotm words and all i)rcviotts ta.gs, ono ;-tsstttnes w+ d,:'F, cn,:Is (:,nly ,:mli. Sc,::oml, rath(,r i:hau aSS(lining the tag ti deltends on the t'ull s(>quc:ncc of l)rcvious l, ags, w(' can assume l.hal.</Paragraph> <Paragraph position="6"> local COltl;oxl, is sullicicnt. This locality assumed is rcfercd t,o as ;t Mm'kov imlcl)endence rlSStlllll)liioIl.</Paragraph> <Paragraph position="7"> Using 1.lwse ass(trutH,ion, w(> al)lwoxhtml,c l,}to uqua.l.ion l,o l, ttc R)llowing</Paragraph> <Paragraph position="9"> We can gel each l)robabilit, y va.hte front the t.aggcd corItus which is i,rq+arcd for l.raining by</Paragraph> <Paragraph position="11"> where (7(t:),C(ti, Ig) is tit(&quot; \['reqttency obt.aincd fronl l rainhlg dal.a.</Paragraph> <Paragraph position="12"> \:it;orl>i algorilhnt (l:orncy73) is the one gonerally used to liw.l t.he t,a.g SO<luencc which safislies (6) aim I.ttis algoril.hnt gttaranl.ccs the opl, itttal sohit ion to I,he I)r(+bhmt.</Paragraph> <Paragraph position="13"> This model has several prot>l(+tns. First,, so(no wot'ds or \[,ag~ s<Xltl(~ll(W','-; 1/13.y itot, O(HHIt' ill l.ra.ilihtg dal.a or Hlay occur with very low \[reqttetlcy; ii('vt'rlh('lcs,% t,llc words or lag soqtt(~ltC(~s c;/tt ;+\])l)ear ill t.agging l>roccss, lit this case, it, usually causes V(!l'y bad result, t.o COllll)ttt.c (6), because the lwol)al)ility has zero wdue or very low value.</Paragraph> <Paragraph position="14"> 'l'his problont is c.alh'd data slmr,sc.css i>rol)h+tn.</Paragraph> <Paragraph position="15"> To avoid thi+q l~roldetn, sm,)ot, hing of itd'ortttat.i,:m tlJtts/, I:,c ttscd. Sntool, hing proc('ss is ahnost ('sscntia,\] in tlMM tmcattse IIMM has sevet'c d:at,;t sparseness prol>hmt.</Paragraph> <Paragraph position="16"> 4 combining information sources</Paragraph> <Section position="1" start_page="236" end_page="237" type="sub_section"> <SectionTitle> 4.1 linear inl;(',\]rlmlatiou </SectionTitle> <Paragraph position="0"> Various Mttds o\[' inlol'ntal, ion sf:.ttr(;(~s and differout knowledge sources must, he Colnl)incd l.O S()IV(&quot; the l,a.gging prol>l(+m. 
<Paragraph position="7"> 4 combining information sources</Paragraph> <Section position="1" start_page="236" end_page="237" type="sub_section"> <SectionTitle> 4.1 linear interpolation </SectionTitle>

<Paragraph position="0"> Various kinds of information sources and different knowledge sources must be combined to solve the tagging problem. The general method used in HMM is linear interpolation, which is the weighted summation of all probability information:

P = \sum_i \lambda_i P_i

where 0 \le \lambda_i \le 1 and \sum_i \lambda_i = 1. This method can be used both as a way of combining knowledge sources and as a way of smoothing information sources.</Paragraph>

<Paragraph position="1"> The HMM-based tagging model uses unigram, bigram and trigram information. These information sources are linearly combined by weighted summation:

P(t_i \mid t_{i-2}, t_{i-1}) = \lambda_1 P(t_i) + \lambda_2 P(t_i \mid t_{i-1}) + \lambda_3 P(t_i \mid t_{i-2}, t_{i-1})

The weights \lambda can be estimated by the forward-backward algorithm (Deroua86) (Charniak93) (Huang90).</Paragraph>

<Paragraph position="2"> Linear interpolation is advantageous because it reconciles the different information sources in a straightforward and simple-minded way. But such simplicity is also the source of its weaknesses: * Linearly interpolated information is generally inconsistent with its information sources, because the information sources are in general heterogeneous with respect to each other. * Linear interpolation does not make an optimal combination of information sources. * Linear interpolation has an over-estimation problem, because it adjusts the model on the training data only and has no policy for untrained data. This problem becomes serious when the size of the training data is not large enough.</Paragraph>

<Paragraph position="3"> 4.2 ME (maximum entropy) principle There is a very powerful estimation method which combines information sources objectively. The ME (maximum entropy) principle (Jaynes57) provides a method to combine information sources consistently, and the ability to overcome the over-estimation problem by maximizing the entropy over the part of the domain about which the training data provide no information.</Paragraph>

<Paragraph position="4"> Let us describe the ME principle briefly. For given x, the quantity x is capable of assuming the discrete values x_i, (i = 1, 2, ..., n). We are not given the corresponding probabilities p_i; all we know are the expectation values of the functions f_r(x), (r = 1, 2, ..., m):

\sum_i p_i f_r(x_i) = F_r

On the basis of this information, how can we determine the probability value of the function p_i(x)? At first glance, the problem seems insoluble because the given information is insufficient to determine the probabilities p_i(x).</Paragraph>

<Paragraph position="5"> We call the function f_r(x_i) a constraint function or feature. Given consistent constraints, a unique ME solution is guaranteed to exist and to be of the form

p_i(x_i) = \exp\bigl(-\sum_r \lambda_r f_r(x_i)\bigr) \quad (12)

where the \lambda_r's are some unknown constants to be found. This formula is derived by maximizing the entropy of the probability distribution p_i subject to all the given constraints. To search for the \lambda_r's that make p_i(x) satisfy all the constraints, an iterative algorithm, "Generalized Iterative Scaling" (GIS), exists, which is guaranteed to converge to the solution (Darroch72).</Paragraph>

<Paragraph position="6"> (12) is similar to the Gibbs distribution, which is the primary probability distribution of the MRF model. The MRF model uses the ME principle in combining information sources and in parameter estimation. We will describe the MRF model and its parameter estimation method later.</Paragraph> </Section>
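As a concrete, self-contained illustration of the ME solution (12) — not taken from the paper — consider Jaynes' die example: the only constraint is an expected face value, taken here to be 4.5. Because there is a single constraint, the \lambda of the exponential-form solution can be found by one-dimensional bisection; all numbers are illustrative.

```python
import math

xs = [1, 2, 3, 4, 5, 6]          # possible values x_i
target_mean = 4.5                # the single constraint: E[f(x)] = 4.5 with f(x) = x

def distribution(lam):
    # ME solution has the exponential form p_i ∝ exp(-λ f(x_i)), as in (12).
    weights = [math.exp(-lam * x) for x in xs]
    z = sum(weights)
    return [w / z for w in weights]

def expectation(lam):
    return sum(p * x for p, x in zip(distribution(lam), xs))

# E[x] is monotone decreasing in λ, so bisection on λ suffices in the one-constraint case.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if expectation(mid) > target_mean:
        lo = mid            # expectation too large -> move toward larger λ (more decay)
    else:
        hi = mid
print(distribution(lo))      # skewed toward large faces; entropy-maximal under the constraint
```

With several interacting constraints this one-dimensional search no longer suffices, which is where GIS-style iterative scaling (mentioned above and used in Section 8) comes in.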
<Paragraph position="8"> 5 MRF-based tagging model</Paragraph> <Section position="2" start_page="237" end_page="238" type="sub_section"> <SectionTitle> 5.1 MRF in tagging </SectionTitle>

<Paragraph position="0"> The neighborhood of a given random variable is defined as the set of random variables that directly affect it. Let N(i) denote the set of random variables which are neighbors of the i-th random variable. Let us define the neighborhood system with distance L in tagging, for words W = w_1, ..., w_n and tags T = t_1, ..., t_n:

N(i) = \{\, t_j \mid 0 < |i - j| \le L \,\}</Paragraph>

<Paragraph position="1"> This neighborhood system has a one-dimensional relation and describes the one-dimensional structure of a sentence. Fig. 1 shows the MRF T which is defined for the neighborhood system with distance 2. The arrows represent that the random variable t_i is affected by the neighbors t_{i-2}, t_{i-1}, t_{i+1}, t_{i+2}. It also shows that t_i, t_{i-1} and t_i, t_{i+1} have the neighborhood relation connected by the bigram, and that t_i, t_{i-1}, t_{i-2} and t_i, t_{i+1}, t_{i+2} have the neighborhood relation connected by the trigram.</Paragraph>

<Paragraph position="2"> A clique is defined as a set of random variables in which every pair of random variables are neighbors. Let us define the clique as the tag sequence with size L in the tagging problem:

c_i = \{\, t_{i-L+1}, \ldots, t_{i-1}, t_i \,\}

The clique concept is used to define the clique function, which evaluates the current state of the random variables in a clique.</Paragraph>

<Paragraph position="3"> The definition of MRF is presented as follows. Definition of MRF: The random variable T is a Markov random field if T satisfies the following two properties:

(positivity) \quad P(T) > 0
(locality) \quad P(t_i \mid t_j,\; j \ne i) = P(t_i \mid t_j,\; t_j \in N(i))</Paragraph>

<Paragraph position="4"> We assume that every probability value of a tag sequence is larger than zero, because ungrammatical sentences can appear in human language usage, including meaningless sequences of characters. So the positivity of MRF is satisfied. This assumption results in the robustness and adaptability of the model, even when untrained events occur.</Paragraph>

<Paragraph position="5"> The locality of MRF is consistent with the assumption of the tagging problem that the tag of a given word can be determined by the local context. Consequently, the random variable T is an MRF for the neighborhood system N(i), as T satisfies the positivity and the locality.</Paragraph> </Section>
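A small sketch of the neighborhood system and cliques of Section 5.1, under one reading of the definitions: with distance L = 2 the neighbors of t_i are t_{i-2}, t_{i-1}, t_{i+1}, t_{i+2}, and the cliques used are the contiguous bigram and trigram windows. Function and variable names are illustrative, not the authors'.

```python
def neighborhood(i, n, L=2):
    """Indices of the neighbors of position i in a sentence of length n,
    for the one-dimensional neighborhood system with distance L."""
    return [j for j in range(max(0, i - L), min(n, i + L + 1)) if j != i]

def cliques(tags, L=2):
    """Contiguous tag windows of length 2..L+1 -- one reading of the clique
    definition: for L = 2 these are the bigram and trigram cliques."""
    out = []
    for size in range(2, L + 2):
        for i in range(len(tags) - size + 1):
            out.append(tuple(tags[i:i + size]))
    return out

tags = ["DT", "NN", "VBZ", "RB"]
print(neighborhood(2, len(tags)))   # [0, 1, 3] -> t_{i-2}, t_{i-1}, t_{i+1}
print(cliques(tags))                # all bigram and trigram windows over the sequence
```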
<Section position="3" start_page="238" end_page="238" type="sub_section"> <SectionTitle> 5.2 A Posteriori Probability </SectionTitle>

<Paragraph position="0"> The a posteriori probability is needed to search for the most likely tag sequence. MRF provides the theoretical background for the probability of the system (Besag74) (Geman84).</Paragraph>

<Paragraph position="1"> Hammersley-Clifford theorem: The probability distribution P(T) is a Gibbs distribution if and only if the random variable T is a Markov random field for the given neighborhood system N(i):

P(T) = \frac{1}{Z} \exp\bigl(-U(T)/T_m\bigr) \quad (17)

where T_m is the temperature, Z is the normalizing constant called the partition function, and U(T) is the energy function. The a priori probability P(T) of the tag sequence T is a Gibbs distribution because the random variable T of tagging is an MRF.</Paragraph>

<Paragraph position="2"> It can be proved that the a posteriori probability P(T|W) for a given word sequence W is also a Gibbs distribution (Chun93). Consequently, the a posteriori probability of T for given W is

P(T|W) = \frac{1}{Z} \exp\bigl(-U(T|W)/T_m\bigr) \quad (19)

We use (19) to carry out MAP estimation in the tagging model. The energy function U(T|W) is of the form

U(T|W) = \sum_{c} V_c(T|W) \quad (20)

where V_c is a clique function with the property that V_c depends only on those random variables in clique c. This means that the energy function can be obtained from the clique functions, each of which restricts attention to a subset of the random variables.</Paragraph> </Section> </Section>

<Section position="5" start_page="238" end_page="239" type="metho"> <SectionTitle> 6 Clique function design </SectionTitle>

<Paragraph position="0"> The nearer the state of the random variables is to the solution, the more stable the system becomes, and the lower the value of the energy function. The energy function represents the degree of instability of the current state of the random variables in the MRF. It is similar to the behaviour of molecular particles in the real world.</Paragraph>

<Paragraph position="1"> The clique function is proportional to the energy function, and it represents the instability of the current state of the random variables in the clique: it has a high value when the state of the MRF is bad and a low value when the state of the MRF is near to the solution. The clique function also reduces the computation of the evaluation function of the entire MRF through the clique concept, which separates the random variables into subsets.</Paragraph>

<Paragraph position="2"> The clique function V_c(T|W) is described by the features that represent the constraints or information sources of the given problem domain:

V_c(T|W) = \sum_r \lambda_r f_r(T|W) \quad (21)</Paragraph>

<Paragraph position="3"> The basic information sources used in a statistical tagging model are the unigram, bigram and trigram. MRF model 1 uses the unigram, bigram and trigram. We write the feature function of the unigram as

f_{unigram} = 1 - P(t_i \mid w_i) \quad (22)

and the feature function of the n-gram, including the bigram and trigram, as

f_{n\text{-}gram} = 1 - P(t_i \mid t_{i-n+1}, \ldots, t_{i-1}) \quad (23)

The clique function of model 1 is then made as follows:

V_c(T|W) = \lambda_1 f_{unigram} + \lambda_2 f_{bigram} + \lambda_3 f_{trigram}</Paragraph>

<Paragraph position="4"> Morphological-level information helps the tagger to determine the tag of a word, especially of an unknown word. The suffix of a word gives very useful information about the tag of the word in English. The clique function of model 2 is defined as

V_c(T|W) = \lambda_1 f_{unigram} + \lambda_2 f_{bigram} + \lambda_3 f_{trigram} + \lambda_4 f_{suffix}

where the suffix feature f_{suffix} is built, analogously to (22), from the probability of the tag given the word's suffix. We used the statistical distribution of the sixty suffixes that are most frequently used in English. We can thus expand the clique function of model 1 easily, by just adding the suffix information, to obtain the clique function of model 2.</Paragraph>
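A minimal sketch of how the model 1 clique function and the energy (20) could be evaluated: each position contributes a weighted sum of the unigram, bigram and trigram features (22)-(23), and the sentence energy is the sum of these clique values. The probability tables, the λ weights and the handling of positions near the sentence start are placeholders, not the authors' implementation.

```python
def feature(prob):
    # Feature functions (22)-(23): low when the local probability is high.
    return 1.0 - prob

def clique_energy(i, tags, words, p_uni, p_bi, p_tri, lam=(1.0, 1.0, 1.0)):
    """Model-1-style clique function: a weighted sum of the unigram, bigram
    and trigram features at position i.  p_uni, p_bi, p_tri are lookup tables
    for P(t|w), P(t|t-1), P(t|t-2,t-1), estimated elsewhere (placeholders here)."""
    l1, l2, l3 = lam
    v = l1 * feature(p_uni.get((tags[i], words[i]), 0.0))
    if i >= 1:
        v += l2 * feature(p_bi.get((tags[i], tags[i-1]), 0.0))
    if i >= 2:
        v += l3 * feature(p_tri.get((tags[i], tags[i-2], tags[i-1]), 0.0))
    return v

def energy(tags, words, p_uni, p_bi, p_tri):
    # Energy function (20): the sum of the clique functions over the sentence.
    return sum(clique_energy(i, tags, words, p_uni, p_bi, p_tri)
               for i in range(len(tags)))

# Tiny placeholder tables; a lower energy means a better tag sequence.
p_uni = {("DT", "the"): 0.9, ("NN", "dog"): 0.8, ("VBZ", "barks"): 0.7}
p_bi  = {("NN", "DT"): 0.6, ("VBZ", "NN"): 0.5}
p_tri = {("VBZ", "DT", "NN"): 0.4}
print(energy(["DT", "NN", "VBZ"], ["the", "dog", "barks"], p_uni, p_bi, p_tri))
```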
<Section position="1" start_page="239" end_page="239" type="sub_section"> <SectionTitle> 6.3 Model 3 (error correction) </SectionTitle>

<Paragraph position="0"> There exist error-prone words in every tagging system. We adjust for error-prone words by collecting the error results and adding more information about those words. The feature function of model 3 is for adjusting errors at the word level. We used the probability distribution of five hundred error-prone words in model 2 in order to reduce the number of parameters.</Paragraph> </Section> </Section>

<Section position="6" start_page="239" end_page="239" type="metho"> <SectionTitle> 7 Optimization </SectionTitle>

<Paragraph position="0"> The process of selecting the best tag sequence is called an optimization process. We use the MAP (Maximum A Posteriori) estimation method: the tag sequence T is selected so as to maximize the a posteriori probability of tagging (19).</Paragraph>

<Paragraph position="1"> Simulated annealing is used to search for the optimal tag sequence, as the Gibbs distribution provides the simulated annealing facility through its temperature and energy concepts. We change the tag candidate of one selected word so as to minimize the energy function in the k-th step, going from T^(k) to T^(k+1), and repeat this process until there is no change. The temperature T_m is started at a high value and lowered toward zero as the process runs. Then the final tag sequence is the solution. Simulated annealing is useful in problems which have a very huge search space, and it is an approximation of MAP estimation (Geman84).</Paragraph>

<Paragraph position="2"> There is another algorithm, the Viterbi algorithm, for finding the optimal solution. The Viterbi algorithm guarantees the optimal solution, but it cannot be used in a problem which has a very huge search space. So it is used in problems which have a small search space, and it is used in HMM. The MRF model can use both the Viterbi algorithm and simulated annealing, but it is not known how to use simulated annealing in HMM.</Paragraph> </Section>
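A sketch of one common simulated-annealing variant for the MAP search of Section 7; the paper does not spell out its proposal and cooling schedule, so the linear schedule and Metropolis-style acceptance below are assumptions. The energy argument can be any U(T|W), for instance the clique-function sum sketched in Section 6; the toy energy in the demo is invented.

```python
import math, random

def simulated_annealing(words, tagset, energy, steps=2000, t0=2.0):
    """Search for the MAP tag sequence by simulated annealing (one common
    variant; not necessarily the authors' exact schedule).  `energy` is any
    function U(T|W), lower being better."""
    random.seed(0)
    tags = [random.choice(tagset) for _ in words]      # random initial state
    cur = energy(tags, words)
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-4           # cool toward zero
        i = random.randrange(len(words))               # pick one position
        cand = tags[:]
        cand[i] = random.choice(tagset)                # propose a new tag there
        new = energy(cand, words)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if new <= cur or random.random() < math.exp((cur - new) / temp):
            tags, cur = cand, new
    return tags, cur

# Demo with a toy energy that simply counts mismatches against a fixed answer.
gold = {"the": "DT", "dog": "NN", "barks": "VBZ"}
toy_energy = lambda T, W: sum(t != gold[w] for t, w in zip(T, W))
print(simulated_annealing(["the", "dog", "barks"], ["DT", "NN", "VBZ"], toy_energy))
# almost always (['DT', 'NN', 'VBZ'], 0) on this toy problem
```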
<Section position="7" start_page="239" end_page="240" type="metho"> <SectionTitle> 8 parameter estimation </SectionTitle>

<Paragraph position="0"> The weighting parameters \lambda in the clique function (21) can be estimated from training data by the ME principle (Jaynes57).</Paragraph>

<Paragraph position="1"> Let us describe the ME principle and the IIS algorithm briefly. For given x = (x_1, ..., x_n), the corresponding probabilities p_i(x_i) are not known. All we know are the expectation values of the functions f_r:

\sum_i p_i f_r(x_i) = F_r

Given consistent constraints, we can find the probability distribution p_i that makes the entropy -\sum_i p_i \ln p_i maximal by using Lagrangian multipliers in the usual way, and obtain the result

p_i(x_i) = \exp\bigl(-\sum_r \lambda_r f_r(x_i)\bigr) \quad (30)

This formula is almost the same as the Gibbs distribution (17), and f_r corresponds to the features of the clique function in MRF (20) (21). Using this fact, we can use ME for parameter estimation in MRF. We can derive (31), to be used in parameter estimation from training data:

\sum_T p(T|W)\, f_r(T|W) = \sum_T \tilde{p}(T|W)\, f_r(T|W) \quad (31)</Paragraph>

<Paragraph position="2"> To solve it, a numerical analysis method, GIS (Generalized Iterative Scaling), was suggested (Darroch72). Pietra used his own algorithm, IIS (Improved Iterative Scaling), based on GIS, to induce the features and parameters of a random field automatically (Pietra95). The following is the algorithm, given an initial model q_0 and features f_0, f_1, ..., f_n; its output is q* and \lambda by ME estimation.

(0) Set q^(0) = q_0.
(1) For each i, find \lambda_i, the unique solution of

\sum_T q^{(k)}(T)\, f_i(T)\, e^{\lambda_i f_\#(T)} = \sum_T \tilde{p}(T)\, f_i(T) \quad (33)

where f_\#(T) denotes the total feature count of T.
(2) k <- k+1; set q^(k+1) with the new \lambda_i.
(3) If q^(k) has converged, set q* = q^(k) and terminate. Otherwise go to step (1).</Paragraph>

<Paragraph position="3"> Here q^(k) is the distribution of the model in the k-th step, and it corresponds to the a posteriori probability of the tagging model (19). \lambda, the solution of (33), can be obtained by the Newton method (Curtis89), one of the numerical analysis methods. The reference distribution \tilde{p} is the probability distribution which is obtained directly from the training data; \tilde{p} corresponds to the posterior distribution P(T|W) in the tagging model. We used, for the MRF models, the posterior probability of the word sequence of window size n (namely 3 in this model), obtained by counting entries in the training data. Training data here means the tagged corpus.</Paragraph> </Section>
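The paper estimates the λ's with IIS (Pietra95) over clique features; as a self-contained illustration of the iterative-scaling idea behind (30)-(33), here is plain GIS on a toy three-outcome domain with two binary features. The domain, features and sample are invented; a slack feature is added so the GIS update is valid, and the sign of λ follows the minus convention of (30).

```python
import math
from collections import Counter

# Toy domain and binary features (illustrative only, not the paper's clique features).
domain = ["DT", "NN", "VBZ"]
features = [lambda y: 1.0 if y == "NN" else 0.0,
            lambda y: 1.0 if y in ("NN", "VBZ") else 0.0]

# Reference (empirical) distribution p~, counted from a toy sample.
sample = ["NN", "NN", "VBZ", "DT"]
counts = Counter(sample)
p_ref = {y: counts[y] / len(sample) for y in domain}

# GIS wants the feature total to be constant over the domain: add a slack feature.
C = max(sum(f(y) for f in features) for y in domain)
features.append(lambda y: C - sum(f(y) for f in features[:-1]))

def model(lams):
    # Exponential form p(y) ∝ exp(-Σ_r λ_r f_r(y)), matching the sign in (30).
    w = {y: math.exp(-sum(l * f(y) for l, f in zip(lams, features))) for y in domain}
    z = sum(w.values())
    return {y: w[y] / z for y in domain}

def expect(dist, f):
    return sum(dist[y] * f(y) for y in domain)

lams = [0.0] * len(features)
for _ in range(200):
    p = model(lams)
    # GIS update; the leading minus mirrors the minus sign in the exponent above.
    lams = [l - (1.0 / C) * math.log(expect(p_ref, f) / expect(p, f))
            for l, f in zip(lams, features)]

print(model(lams))   # approaches p_ref: the ME solution matching both feature constraints
```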
<Section position="9" start_page="240" end_page="240" type="metho"> <SectionTitle> 9 Experiments </SectionTitle>

<Paragraph position="0"> The main objective of these experiments is to compare the MRF tagging model with the HMM tagging model. We constructed an MRF tagger and an HMM tagger using the same information in the same environment.</Paragraph>

<Paragraph position="1"> It is necessary to apply a smoothing process for the data sparseness problem, which is severe in HMM, while MRF has the facility of smoothing in itself, like a neural net. We used the linear interpolation method (Deroua86) (Jelin89) and assigned frequency 1 to unknown words (Weisch93) for smoothing in HMM.</Paragraph>

<Paragraph position="2"> We used the Brown corpus in the Penn Treebank, described in (Marcus93), with 48 different tags. A set of 800,000 words was collected from each part of the Brown corpus and used as training data to build the models, and a set of 30,000 words is used as test data to measure the quality of the models.</Paragraph>

<Paragraph position="3"> Table 1 shows the accuracy of each tagging model. The average accuracy of the HMM-based tagger is similar to that of the MRF(1) tagger because they use the same information.</Paragraph>

<Paragraph position="4"> Fig. 7 shows the error rate as the size of the training data is increased. MRF(1) has a lower error rate than HMM when the size of the training data is small. The error rate of MRF(2) is reduced especially when the size of the training data is small, because the morphological information helps the processing of unknown words. Finally, MRF(3) shows improvement as the size of the training data grows, but converges to a limit at some point.</Paragraph>

<Paragraph position="5"> These experiments show that MRF has better adaptability with small training data than HMM does, and that the MRF tagger suffers less from data sparseness.</Paragraph> </Section>

<Section position="10" start_page="240" end_page="240" type="metho"> <SectionTitle> 10 Comparison of MRF with HMM </SectionTitle>

<Paragraph position="0"> We can derive the simplified equation of HMM with bigrams only:

P(T|W) = \prod_{i} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \quad (35)

(35) is the product of the probabilities of the local events. The nearer the probability value of a local event is to zero, the more it affects the probability of the entire event. This property strictly penalizes the events which do not occur in the training data, but it thereby prohibits even an event that does not occur in the training data although the event is legal.</Paragraph>

<Paragraph position="1"> MRF can be simplified to the summation of clique functions as in (36):

P(T|W) = \frac{1}{Z} \exp\Bigl(-\frac{1}{T_m}\,\{V_1 + V_2 + \cdots + V_n\}\Bigr) \quad (36)

MRF evaluates its function by summation, while HMM does so by multiplication. Even if one clique function value is very bad, the other clique functions can compensate adequately, because the clique functions are connected by summation.</Paragraph>

<Paragraph position="2"> There is no critical point of the a posteriori probability in MRF, while HMM has a critical point at the zero value. This property results in the robustness and adaptability of the model and makes the MRF model stronger against the data sparseness problem.</Paragraph> </Section> </Paper>