File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-1070_metho.xml
Size: 10,044 bytes
Last Modified: 2025-10-06 14:07:11
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1070"> <Title>Lexicalized Hidden Markov Models for Part-of-Speech Tagging</Title> <Section position="3" start_page="0" end_page="481" type="metho"> <SectionTitle> 2 The &quot;standard&quot; HMM </SectionTitle> <Paragraph position="0"> We basically follow the notation of (Charniak et al., 1993) to describe Bayesian models. In this paper, we assume that {w^1, w^2, ...} is a set of words, {t^1, t^2, ...} is a set of POS tags, a sequence of random variables W_{1,n} = W_1 W_2 ... W_n is a sentence of n words, and a sequence of random variables T_{1,n} = T_1 T_2 ... T_n is a sequence of n POS tags. Because each of the random variables W_i can take as its value any of the words in the vocabulary, we denote the value of W_i by w_i and a particular sequence of values for W_{i,j} (i <= j) by w_{i,j}. In a similar way, we denote the value of T_i by t_i and a particular sequence of values for T_{i,j} (i <= j) by t_{i,j}. For generality, the terms w_{i,j} and t_{i,j} (i > j) are defined as being empty.</Paragraph> <Paragraph position="1"> The purpose of Bayesian models for POS tagging is to find the most likely sequence of POS tags for a given sequence of words, as follows: \tau(w_{1,n}) = \arg\max_{t_{1,n}} \Pr(T_{1,n} = t_{1,n} \mid W_{1,n} = w_{1,n}) \qquad (1)</Paragraph> <Paragraph position="2"> Because reference to the random variables themselves can be omitted, the above equation becomes \arg\max_{t_{1,n}} \Pr(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{\Pr(t_{1,n}, w_{1,n})}{\Pr(w_{1,n})} = \arg\max_{t_{1,n}} \Pr(t_{1,n}, w_{1,n}) \qquad (2) Then, the probability \Pr(t_{1,n}, w_{1,n}) is broken down into Eqn. 3 by using the chain rule: \Pr(t_{1,n}, w_{1,n}) = \prod_{i=1}^{n} \Pr(t_i \mid t_{1,i-1}, w_{1,i-1}) \times \Pr(w_i \mid t_{1,i}, w_{1,i-1}) \qquad (3) Because it is difficult to compute Eqn. 3, the standard HMM simplifies it by making a strict Markov assumption to get a more tractable form: \Pr(t_{1,n}, w_{1,n}) \approx \prod_{i=1}^{n} \Pr(t_i \mid t_{i-K,i-1}) \times \Pr(w_i \mid t_i) \qquad (4)</Paragraph> <Paragraph position="9"> In the standard HMM, the probability of the current tag t_i depends only on the previous K tags t_{i-K,i-1}, and the probability of the current word w_i depends only on the current tag 1.</Paragraph> <Paragraph position="10"> Therefore, this model cannot consider lexical information in contexts.</Paragraph> </Section> <Section position="4" start_page="481" end_page="483" type="metho"> <SectionTitle> 3 Lexicalized HMMs </SectionTitle> <Paragraph position="0"> In English POS tagging, the tagging unit is a word. On the contrary, Korean POS tagging prefers a morpheme 2.</Paragraph> <Paragraph position="1"> 1 Usually, K is determined as 1 (bigram, as in (Charniak et al., 1993)) or 2 (trigram, as in (Merialdo, 1991)). 2 The main reason is that the number of word-unit tags is not finite, because Korean words can be freely and newly formed by agglutinating morphemes (Lee et al., 1999).</Paragraph> <Paragraph position="2"> Figure 1 shows a word-unit lattice of an English sentence, &quot;Flies like a flower.&quot;, where each node has a word and its word-unit tag. Figure 2 shows a morpheme-unit lattice of a Korean sentence, &quot;NeoNeun Hal Su issDa.&quot; (= You can do it.), where each node has a morpheme and its morpheme-unit tag. In the case of Korean, transitions across a word boundary, which are depicted by a solid line, are distinguished from transitions within a word, which are depicted by a dotted line. In both cases, sequences connected by bold lines indicate the most likely sequences.</Paragraph>
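As a small illustration of the lattice view just described, the sketch below (our own Python illustration, not code from the paper; the candidate tag sets per word are assumed for the example) represents the word-unit lattice of &quot;Flies like a flower.&quot;: each node pairs a word with one of its candidate word-unit tags, and tagging amounts to choosing one node per position so that the resulting path is the most likely tag sequence.

# A hypothetical word-unit lattice in the spirit of Figure 1.
# Each position holds the (word, candidate tag) nodes; the tag alternatives
# listed here are illustrative assumptions, not taken from the paper's figure.
word_lattice = [
    [("Flies", "NNS"), ("Flies", "VBZ")],   # "Flies": noun/verb ambiguous
    [("like", "VB"), ("like", "IN")],       # "like": verb/preposition ambiguous
    [("a", "AT")],
    [("flower", "NN"), ("flower", "VB")],
]

def all_paths(lattice):
    """Enumerate every path through the lattice, i.e. every candidate tag sequence."""
    if not lattice:
        yield []
        return
    for node in lattice[0]:
        for rest in all_paths(lattice[1:]):
            yield [node] + rest

print(sum(1 for _ in all_paths(word_lattice)))  # 2 * 2 * 1 * 2 = 8 candidate paths

A tagger does not enumerate these paths explicitly; it scores them with the model probabilities and searches for the best one, as discussed in Section 3.4.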
<Section position="1" start_page="481" end_page="482" type="sub_section"> <SectionTitle> 3.1 Word-unit models </SectionTitle> <Paragraph position="0"> Lexicalized HMMs for word-unit tagging are defined by making a less strict Markov assumption, as follows: \Pr(t_{1,n}, w_{1,n}) \approx \prod_{i=1}^{n} \Pr(t_i \mid t_{i-K,i-1}, w_{i-J,i-1}) \times \Pr(w_i \mid t_{i-L,i}, w_{i-I,i-1}) \qquad (5)</Paragraph> <Paragraph position="2"> In models \Lambda(T(K,J), W(L,I)), the probability of the current tag t_i depends on both the previous K tags t_{i-K,i-1} and the previous J words w_{i-J,i-1}, and the probability of the current word w_i depends on the current tag, the previous L tags, and the previous I words w_{i-I,i-1}.</Paragraph> <Paragraph position="3"> So, they can consider lexical information. In experiments, we set K as 1 or 2, J as 0 or K, L as 1 or 2, and I as 0 or L. If J and I are zero, the above models are non-lexicalized models. Otherwise, they are lexicalized models.</Paragraph> <Paragraph position="4"> In a lexicalized model \Lambda(T(2,2), W(2,2)), for example, the probability of the node &quot;a/AT&quot; of the most likely sequence in Figure 1 is calculated as follows: \Pr(AT \mid NNS, VB, Flies, like) \times \Pr(a \mid AT, NNS, VB, Flies, like)</Paragraph> </Section> <Section position="2" start_page="482" end_page="482" type="sub_section"> <SectionTitle> 3.2 Morpheme-unit models </SectionTitle> <Paragraph position="0"> Bayesian models for morpheme-unit tagging find the most likely sequence of morphemes and corresponding tags for a given sequence of words, as follows: \tau(w_{1,n}) = \arg\max_{c_{1,u}, m_{1,u}} \Pr(c_{1,u}, m_{1,u} \mid w_{1,n}) \qquad (6) \approx \arg\max_{c_{1,u}, p_{2,u}, m_{1,u}} \Pr(c_{1,u}, p_{2,u}, m_{1,u}) \qquad (7)</Paragraph> <Paragraph position="2"> In the above equations, u (>= n) denotes the number of morphemes in a sequence corresponding to the given word sequence, c denotes a morpheme-unit tag, m denotes a morpheme, and p denotes the type of transition from the previous tag to the current tag. p can have one of two values: &quot;#&quot; denoting a transition across a word boundary and &quot;+&quot; denoting a transition within a word. Because it is difficult to calculate Eqn. 6, the word sequence term w_{1,n} is usually ignored as in Eqn. 7. Instead, we introduce p in Eqn. 7 to consider word-spacing 3.</Paragraph> <Paragraph position="3"> The probability \Pr(c_{1,u}, p_{2,u}, m_{1,u}) is also broken down into Eqn. 8 by using the chain rule: \Pr(c_{1,u}, p_{2,u}, m_{1,u}) = \prod_{i=1}^{u} \Pr(c_i, p_i \mid c_{1,i-1}, p_{2,i-1}, m_{1,i-1}) \times \Pr(m_i \mid c_{1,i}, p_{2,i}, m_{1,i-1}) \qquad (8)</Paragraph> <Paragraph position="5"> Because Eqn. 8 is not easy to compute, it is simplified by making a Markov assumption to get a more tractable form.</Paragraph> <Paragraph position="6"> In a similar way to the case of word-unit tagging, lexicalized HMMs for morpheme-unit tagging are defined by making a less strict Markov assumption, as follows: \Pr(c_{1,u}, p_{2,u}, m_{1,u}) \approx \prod_{i=1}^{u} \Pr(c_i[, p_i] \mid c_{i-K,i-1}[, p_{i-K+1,i-1}], m_{i-J,i-1}) \times \Pr(m_i \mid c_{i-L,i}[, p_{i-L+1,i}], m_{i-I,i-1}) \qquad (9) In models \Lambda(C_{[p]}(K,J), M_{[p]}(L,I)), the probability of the current morpheme tag c_i depends on both the previous K tags c_{i-K,i-1} (optionally, the types of their transitions p_{i-K+1,i-1}) and the previous J morphemes m_{i-J,i-1}, and the probability of the current morpheme m_i depends on the current tag, the previous L tags (optionally, the types of their transitions p_{i-L+1,i}), and the previous I morphemes m_{i-I,i-1}. So, they can also consider lexical information.</Paragraph>
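To make the conditioning contexts of these models concrete, the following sketch (our assumption-level Python illustration, not the authors' code) enumerates, for each position i of a tagged sentence, the context used by the tag probability Pr(t_i | t_{i-K,i-1}, w_{i-J,i-1}) and by the word probability Pr(w_i | t_{i-L,i}, w_{i-I,i-1}) of Eqn. 5. Setting J = I = 0 reduces it to the non-lexicalized case; the morpheme-unit models of Eqn. 9 would additionally carry the transition types p.

def lexicalized_events(tags, words, K=2, J=2, L=2, I=2):
    """Yield (tag event, word event) pairs with their conditioning contexts."""
    events = []
    for i in range(len(words)):
        # Context of the tag probability: previous K tags and previous J words.
        tag_ctx = (tuple(tags[max(0, i - K):i]), tuple(words[max(0, i - J):i]))
        # Context of the word probability: current tag, previous L tags, previous I words.
        word_ctx = (tuple(tags[max(0, i - L):i + 1]), tuple(words[max(0, i - I):i]))
        events.append(((tags[i], tag_ctx), (words[i], word_ctx)))
    return events

# With K = J = L = I = 2 this reproduces the contexts of the paper's example node
# "a/AT": Pr(AT | NNS, VB, Flies, like) * Pr(a | AT, NNS, VB, Flies, like).
for tag_ev, word_ev in lexicalized_events(["NNS", "VB", "AT", "NN"],
                                          ["Flies", "like", "a", "flower"]):
    print(tag_ev, word_ev)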
<Paragraph position="7"> In a lexicalized model \Lambda(C_{p}(2,2), M(2,2)), where word-spacing is considered only in the tag probabilities, for example, the probability of the node &quot;Su/NNBG&quot; of the most likely sequence in Figure 2 is calculated as follows:</Paragraph> <Paragraph position="10"/> </Section> <Section position="3" start_page="482" end_page="483" type="sub_section"> <SectionTitle> 3.3 Parameter estimation </SectionTitle> <Paragraph position="0"> In supervised learning, the simplest parameter estimation is the maximum likelihood (ML) estimation (Duda et al., 1973), which maximizes the probability of a training set. The ML estimate of the tag (K+1)-gram probability, \Pr_{ML}(t_i \mid t_{i-K,i-1}), is calculated as follows: \Pr_{ML}(t_i \mid t_{i-K,i-1}) = \frac{Fq(t_{i-K,i})}{Fq(t_{i-K,i-1})} \qquad (10) where the function Fq(x) returns the frequency of x in the training set. When using the maximum likelihood estimation, data sparseness is more serious in lexicalized models than in non-lexicalized models because the former have even more parameters than the latter.</Paragraph> <Paragraph position="1"> In (Chen, 1996), where various smoothing techniques were tested for a language model by using the perplexity measure, back-off smoothing (Katz, 1987) is said to perform better on a small training set than other methods. In back-off smoothing, the smoothed probability of the tag (K+1)-gram, \Pr_{SBO}(t_i \mid t_{i-K,i-1}), is calculated as follows: \Pr_{SBO}(t_i \mid t_{i-K,i-1}) = \begin{cases} d_r \times \Pr_{ML}(t_i \mid t_{i-K,i-1}) & \text{if } r > 0 \\ \alpha(t_{i-K,i-1}) \times \Pr_{SBO}(t_i \mid t_{i-K+1,i-1}) & \text{if } r = 0 \end{cases} \qquad (11) where r = Fq(t_{i-K,i}).</Paragraph> <Paragraph position="3"> n_r denotes the number of (K+1)-grams whose frequency is r, and the coefficient d_r is called the discount ratio, which reflects the Good-Turing estimate (Good, 1953) 4. Eqn. 11 means that \Pr_{SBO}(t_i \mid t_{i-K,i-1}) is under-estimated by d_r relative to its maximum likelihood estimate if r > 0, or is backed off by its smoothing term \Pr_{SBO}(t_i \mid t_{i-K+1,i-1}) in proportion to the value of the function \alpha(t_{i-K,i-1}) of its conditional term t_{i-K,i-1} if r = 0.</Paragraph> <Paragraph position="4"> However, because Eqn. 11 requires complicated computation for \alpha(t_{i-K,i-1}), we simplify it to get a function of the frequency of a conditional term, as follows:</Paragraph> <Paragraph position="6"> In Eqn. 12, the range of f is bucketed into 7 regions. Using the formalism of our simplified back-off smoothing, each probability whose ML estimate is zero is backed off by its corresponding smoothing term. In experiments, the smoothing terms of \Pr_{SBO}(t_i \mid t_{i-K,i-1}, w_{i-J,i-1}) are determined as follows:</Paragraph> <Paragraph position="8"> Also, the smoothing terms of \Pr_{SBO}(w_i \mid t_{i-L,i}, w_{i-I,i-1}) are determined as follows:</Paragraph> <Paragraph position="10"> In Eqns. 13 and 14, the smoothing term of a unigram probability is calculated by using an additive smoothing with \delta = 10^{-2}, which is chosen through experiments. The equation for the additive smoothing (Chen, 1996) is as follows:</Paragraph> <Paragraph position="12"> In a similar way, the smoothing terms of the parameters in Eqn. 9 are determined.</Paragraph> </Section> <Section position="4" start_page="483" end_page="483" type="sub_section"> <SectionTitle> 3.4 Model decoding </SectionTitle> <Paragraph position="0"> From the viewpoint of the lattice structure, the problem of POS tagging can be regarded as the problem of finding the most likely path from the start node ($/$) to the end node ($/$). The Viterbi search algorithm (Forney, 1973), which has been used for HMM decoding, can be effectively applied to this task with only slight modification 5.</Paragraph> </Section> </Section> </Paper>
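To make the decoding step of Section 3.4 concrete, here is a minimal Viterbi sketch in Python (our assumption-level illustration, not the authors' implementation). The transition and emission tables are hypothetical bigram values, and the `floor` constant is only a crude stand-in for the back-off smoothing of Section 3.3; the paper's lexicalized models would condition the two probability lookups on richer contexts, but the dynamic program itself is unchanged.

from math import log

def viterbi(words, tagset, trans_p, emit_p, start="<s>", floor=1e-12):
    """Most likely tag sequence for `words` under a bigram (K = 1) HMM."""
    # delta[t] = best log-probability of any tag path ending in tag t
    delta = {t: log(trans_p.get((start, t), floor)) + log(emit_p.get((t, words[0]), floor))
             for t in tagset}
    back = []  # back[i][t] = best predecessor tag of t at position i + 1
    for w in words[1:]:
        ptr, new_delta = {}, {}
        for t in tagset:
            prev, score = max(((p, delta[p] + log(trans_p.get((p, t), floor))) for p in tagset),
                              key=lambda x: x[1])
            new_delta[t] = score + log(emit_p.get((t, w), floor))
            ptr[t] = prev
        delta, back = new_delta, back + [ptr]
    best = max(delta, key=delta.get)   # best final tag
    path = [best]
    for ptr in reversed(back):         # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy usage with hypothetical probability tables; a real tagger would plug in the
# smoothed (lexicalized) estimates of Sections 3.1-3.3 instead of these bigrams.
tagset = {"NNS", "VB", "AT", "NN"}
trans_p = {("<s>", "NNS"): 0.4, ("NNS", "VB"): 0.5, ("VB", "AT"): 0.5, ("AT", "NN"): 0.6}
emit_p = {("NNS", "Flies"): 0.1, ("VB", "like"): 0.2, ("AT", "a"): 0.7, ("NN", "flower"): 0.1}
print(viterbi(["Flies", "like", "a", "flower"], tagset, trans_p, emit_p))  # ['NNS', 'VB', 'AT', 'NN']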