Lexicalized Hidden Markov Models for Part-of-Speech Tagging 
Sang-Zoo Lee and Jun-ichi Tsujii 
Del)artInent of Inforlnation Science 
Graduate School of Scien(:e 
University of Tokyo, Hongo 7-3-1 
Bunkyo-ku, Tokyo 113, Ja, l)3,iI 
{lee,tsujii} ((~is.s.u-tokyo.ac.jp 
Hae-Chang Rim 
Det)artment of Computer Science 
Korea Ulfiversity 
i 5-Ca Anam-Dong, Seongbuk-C~u 
Seoul 136-701, Korea 
rim~)nll).korea.ac.kr 
Abstract 
Since most previous works tbr HMM-1)ased tag- 
ging consider only part-ofsl)eech intbrmation in 
contexts, their models (:minor utilize lexical in- 
forlnatiol~ which is crucial tbr resolving some 
morphological tmfl)iguity. In this paper we in- 
troduce mliformly lexicalized HMMs fin: i)art - 
ofst)eech tagging in 1)oth English and \](ore, an. 
The lexicalized models use a simplified back-off 
smoothing technique to overcome data Sl)arse- 
hess. In experiment;s, lexi(:alized models a(:hieve 
higher accuracy than non-lexicifliz(~d models 
and the l)ack-off smoothing metho(l mitigates 
data sparseness 1)etter (;ban simple smoothing 
methods. 
1 Introduction 
1)arl;-Ofsl)e(:('h(POS) tagging is a l)ro(:ess ill 
which a l)rOl)('.r \])()S l;ag is assigned to ea(:h wor(l 
in raw tex(;s. Ev('n though morl)h()logi(:ally am- 
l)iguous words have more thnn one P()S tag, 
they l)elong to just one tag in a colll;ex(;. 'J~o 
resolve such ambiguity, taggers lmve to consult 
various som'ces of inibrmation such as lexica\] 
i)retbrences (e.g. without consulting context, 
table is more probably a n(mn than a. ver}) or 
an adje(:t;ive), tag n-gram context;s (e.g. after a 
non-1)ossessiv(: pronoun, table is more l)robal)ly 
a verb than a. nmm or an adjective., as in th, ey ta- 
ble an amendment), word n-grain conl;e.xl;s (e.g. 
betbre lamp, table is more probal)ly an adjective 
than ~ noun or ~ verb, as in I need a table lamp), 
and so on(Lee et al., 1.999). 
However, most previous HMM-1)ased tag- 
gers consider only POS intbrmation in con- 
texts, and so they C~I~IlII()t capture lexical infi)r- 
nmtion which is necessary for resolving some 
mort)hological alnbiguity. Some recent, works 
lmve rel)orted thai; tagging a('curacy could 
l)e iml)roved 1)y using lexicM intbrnml;ion in 
their models such as the transtbrmation-based 
patch rules(Brill, 1994), the ln~txinnun entropy 
model(lIatn~q)arkhi, 1996), the statistical lex- 
ical ruh:s(Lee et al., 1999), the IIMM consid- 
ering multi-words(Kim, 1996), the selectively 
lexicalized HMM(Kim et al., 1999), and so on. 
In the l)revious works(Kim, 1996)(Kim et al., 
1999), however, their ItMMs were lexicalized se- 
h:ctively and resl;rictively. 
\]n this l>al)er w('. prol)ose a method of uni- 
formly lcxicalizing the standard IIMM for part- 
of speech tagging in both English and Korean. 
Because the slmrse-da.ta problem is more seri- 
ous in lexicMized models ttl~ll ill the standard 
model, a simplified version of the well-known 
back-oil' smoothing nml;hod is used to overcome 
the. 1)rol)lem. For experiments, the Brown cor- 
pus(Francis, 1982) is used lbr English tagging 
and the KUNLP (:orlms(Lee ('t al., 1999) is 
used for Kore, an tagging. Tim eXl)criln(;nl;~t\] re- 
sults show that lexicalized models l)erform bet- 
ter than non-lexicalized models and the simpli- 
fied back-off smoothing technique can mitigate 
data sparseness betl;er than silnple smoothing 
techniques. 
2 Tile "standard" HMM 
We basically follow the not~ti(m of (Charniak 
et al., 1993) to describe Bayesian models. In 
this paper, we assume that {w I , 'w~,..., w ~0 } is 
a set of words, {tt,t'2,...,t;} is a set of POS 
tags, a sequence of random variables l'lq,,~ = 
l~q lazy... I'E~ is a sentence of n words, and a 
sequence of random w~riables T1,,, = 7~T,2... TT~ 
is a sequence of n POS tags. Because each of 
random wtrbflfles W can take as its value any 
of the words in the vocabulary, we denote the 
value of l'l(i by wi mM a lmrticular sequence of 
wflues tbr H~,j (i < j) by wi, j. In a similar wl.ty, 
we denote the value of Ti by l,i and a particular 
481 
sequence of values for T/,j (i _< j) t)y ti,j. For 
generality, terms wi,j and ti,j (i > j) are defined 
as being empty. 
Tile purpose of Bayesian models for POS tag- 
ging is to find the most likely sequence of POS 
tags for a given sequence of' words, as follows: 
= arglnaxPr(T,,n =- I W,,,, = w,,,d 
tl,n 
Because l'efhrence to the random variables 
thelnselves can 1)e oulitted, the above equation 
b ecolnes: 
T('wl,n) = argmax Pr(tl,n \[ wl,,z) (1) 
~'l,~t 
Now, Eqn. 1 is transtbrnled into Eqn. 2 since 
Pr(wl,n) is constant for all tq,~, 
Pr (l.j ,n, wl,n) T(*/q,n) -- argmax 
t, .... Pr('wl,n) 
= arDnaxP,'(tj,,~,w,,,,) (2) 
tl ,n 
Then, tile prolmbility Pr(tL,z, wl,n ) is broken 
down into Eqn. 3 by using tile chain rule. fl(Pr(ti,t\],i-l,Wl,i-1) ) 
Pr(tl,n,~q,r,,) = x Pr(/~i \[tl,i,~Vl,i-l) (3) i=l 
Because it is difficult to compute Eqn. 3, the 
standard ItMM simplified it t)3; making a strict 
Markov assumption to get a more tract~d)le 
tbrm. 
Pr(tl,,,, Wl,n) ~ x Pr(wi I td (4) 
i=l 
I51 the standard HMM, the probability of the 
current tag ti depends oi5 only the previous K 
tags ti-K,i-1 and the t)robability of' the cur- 
rent word wi depends on only the current tag 1. 
Thereibre, this model cannot consider lexical in- 
formation in contexts. 
3 Lexicalized HMMs 
In English POS tagging, the tagging unit is a 
word. On the contrary, Korean POS tagging 
prefers a morpheme 2. 
1Usually, K is determined as 1 (bigram as in (Char- 
niak et al., 1993)) or 2 (trigram as in (Merialdo, 1991)). 
2The main reason is that the mtmber of word-unit 
tags is not finite because I(orean words can be ti'eely 
and newly formed l)y agglutinating morphemes(Lee et 
al., 1999). 
,/, 
Flies/NNS Flies/VBZ 
like/CS like/IN like/JJ like/VB 
a/A~ a/IN a/NN 
ttower/NN flower/VB 
./. 
$/$ 
Figure 1: A word-unit lattice ot' "Flies like a 
\[lower ." 
Figure 1 shows a word-unit lattice of an Eil- 
glish sentence, "Flies like a flowc'r.", where each 
node has a word and its word-unit tag. Fig- 
ure 2 shows a morpheme-unit lattice of a Ko- 
rean sentence, "NcoNeun tIal Su issDa.", where 
each node has a morphenm and its morI)heme- 
unit tag. In case of Korean, transitions across 
a word boundary, which are depicted by a solid 
line, are distinguished fl'om transitions within a 
word, which are depicted by a dotted line. ill 
both cases, sequences connected by bold lines 
indicate the most likely sequences. 
3.1 Word-unit models 
Lexicalized HMMs fbr word-unit tagging are de- 
fined 1)y making a less strict Markov assmnp- 
tion, as tbllows: 
A(T(K,j), W(I;j))~ Pr(tl,,~,wl,n) 
i=\] x Pr(wi I ti-L,i, wi-I,i-1) 
Ill models A(T(K,j), 14/(L j)), the probability of 
the current tag ti depends on both tile previ- 
ous If tags ti-K,i-i and the previous d words 
wi-j,i-i and the probability of the current word 
'wi depends on the current tag and the previous 
L tags ti_L, i and the previous I words wi-l,i-~. 
So, they can consider lexieal inforination. In ex- 
periments, we set If as 1 or 2, J as 0 or K, L as 
1 or 2, and 1 as 0 or L. If J and I are zero, the 
above models are non-lexicalized models. Oth- 
erwise, they are lexicalized models. 
482 
$/, 
Neo/N NI" Ncol/VV 
¢. 4 No'an~ P X Ncun/EFD 
H~d/NNCC Hd/NNBU H~(VV \]Ia/VX 
S'a/NNCG Su/NNBG 
iss/\zJ iss/VX 
Da/EFF Da/EFC °"'OOoo,,,j~g_._.--"- 
./ss. 
$/$ 
Figure 2: A morl)heme-unit latti(:(; of "N,oN,'un 
llal S'u i.ssl)a." (= You (:an do it.) 
rl f in a lexicalized model A(~/(9,2), lI ('J,2)), fin" ex- 
mnl)lc , the t)robal)ility of a node "a/AT" of tlm 
most likely sequen(:e in Figure 1 is calculate(t as 
tbllows: 
l'r(AT' I NM& vIL Fli(:,~, lit,:c) 
• tq • x Pr(a t :'1~, NNS, VH, 1 l'~,c.s, lil,:c) 
3.2 Morphelne-unit models 
l);~yesian models for lnOrl)heme-unit tagging 
tin(t the most likely se(lueame of mor\])h(mms 
and corresponding tags fi)r ;~ given sequence of 
words, as follows: 
~'(11) 1,1,,) = al'glll;XX Pr(c l,v,, ?/~,,,u I '1,,,,~) (6) 
Cl~u flltl,,t 
, ra-ax Pr(c,,,,, m,,. ',,,,, ,,,) (7) 
Cl,~tllt~l,u 
In the above equations, u(_> 'n) denotes the 
llllIlll)cr of morph(mms in a Se(ltlell(;e ('orre- 
spending the given word sequ('ncc, c denotes 
a morl)heme-mfit tag, 'm. denotes a morl)heme , 
aim p denotes a type of transition froln the pre- 
vious tag to the current tag. p can have one of 
two values, "#" denoting a transition across a 
word bomldary and "+" denoting a transition 
within a word. Be(-ause it is difficult to calculate 
Eqn. 6, the word sequence term 'w~,,, is usually 
ignored as ill Eqn. 7. Instead, we introduce p in 
Eqn. 7 to consider word-spacing 3. 
Tile probability Pr(cj ,~L, P2,u, 'm,~ ,u) is also bro- 
ken down into Eqn. 8 t)3r using the chain rule. 
Pr(c~ ,,,, P2,,, , 'm, , ,,,,) 
fl ( \])r(ci,Pi \[ cl,i-l,P2,i-l,'lnl,i-l) ) 
~- X P1"(1~'1,i \[('d,i,I,2,i,17tl,i_\]) (8) i=1 
\]3('caus(' Eqn. 8 is not easy to (;omlmte ~ it is 
sinll)lified by making a Marker assmnt)tion to 
get; a more tractal)le forlll. 
In a similar way to the case of word-unit; tag- 
ging, lexicalize(t HMMs for morl)heme-mfit tag- 
ging are defined by making a less strict Markov 
assunq)tion, as tblh)ws: 
A(C\[,q(K,.\]), AJ\[sI(L,1)) 1= Pr(c\],,,,p2,,,, 'mq,~,) 
I'r(c \[,pd I ,,I,i-,Uc/--lC/-' (!)) 
~=~, xl'r(milci l,,i\[,>-L+l,,i\],'mi-l,i--I) 
In models A(C\[.q(tc,,I),M\[q(L,Q), the 1)robal)il- 
ity of the (:urrent mori)heme tag ci depends 
on l)oth the 1)revious K |:ags Ci_K,i_ 1 (oi)tion- 
ally, th(' tyl)eS of their transition Pi-K~ 1,i-~) 
a.n(l the 1)revious ,\] morl)hemes H~,i_.l,i_ 1 all(1 
the probability of the current mort)heine 'm,i (t(> 
1)en(ls on the current, tag and I:he previous L 
tags % l,,i (optional\]y, the typ('~s of their tran- 
sition Pi -L-t-I,i) and the 1)revious I morl)hemes ?lti--l,i-1. 
~()~ t\]l(ly ('&ll &lSO (-onsid(,r h;xi(-al in- 
formation. 
In a lexicalized model A(C,.(~#), M(~,2)) whea:e 
word-spa(:ing is considered only in the tag prob- 
al)ilities, for example, the 1)rol)al)ility of a nod(; 
"S'u/NNBG" of the most likely sequence in Fig- 
urc 2 is calculated as follows: 
Pr(NNBG, # \[ Vl4 EFD, +, Ha, l) 
x Pr(gu \[ VV, EFD, NNBG, Ha, l) 
3.3 Parameter estimation 
In supervised lcarning~ the simpliest parameter 
estimation is the maximum likelihood(ML) cs- 
timation(Duda et al., 1973) which lnaximizes 
the i)robal)ility ot! a training set. The ML esti- 
mate of tag (K+l)-gram i)robal)ility, PrML (f;i \[ 
t,i-K,i-i), is calculated as follows: 
P Pr(ti l ti_ir,i_j) 
__ \]: q(ti-i(,i) (10) ML Fq(ti-lGi-l) 
aMost 1)rcvious HMM-bascd Korean taggcrs except 
(Kim et al., 1998) did not consider word-spacing. 
483 
where the flmction Fq(x) returns the fl:equency 
of x in the training set. When using the max- 
imum likelihood estimation, data sparseness is 
more serious in lexicalized models than in non- 
lexicalized models because the former has even 
more parameters than the latter. 
In (Chen, 1996), where various smoothing 
techniques was tested for a  model 
by using the perplexity measure, a back-off 
smoothing(Katz, 1987) is said to perform bet- 
ter on a small traning set than other methods. 
In the back-off smoothing, the smoothed prob- 
ability of tag (K+l)-gram PrsBo(ti \[ ti-l~,i-l) 
is calculated as tbllows: 
Pr (ti \[ ti-I(,i-~) = ,5'1~20 
drPrML(ti \[ti-I(,i-1) " if r>0 (11) c~(ti-K,i-1) Prsso(ti \[ ti-K+l,i-l)if r = 0 
where r = Fq(ti_t(,i), r* = (r+ 1)'nr+l 
7~, r 
r* (r+l.) x~%.+l 
dr ~ F ltl 
1- (r+l)xm.+l 
nl 
n,. denotes the nmnber of (K+l)-gram whose 
frequency is r, and the coefficient dr is called 
the discount ratio, which reflects the Good- 
~lhtring estimate(Good, 1953) 4. Eqn. 11 means 
that Prxgo(ti \[ ti-K,i-l) is under-etimated by 
dr than its maximum likelihood estimate, if 
r > 0, or is backed off by its smoothing term 
Prsuo(ti \[ ti-K+j,i-l) in proportion to the 
value of the flmction (~(ti-K,i-t) of its condi- 
tional term ti-K,i-1, if r = 0. 
However, because Eqn. 11 requires compli- 
cated computation in ~(ti-l(,i-1), we simI)lify 
it to get a flmction of the frequency of a condi- 
tional term, as tbllows: 
ct(Fq(ti-K,i-1) = f) = 
E\[Fq(ti-I(,i-1) = f\] Ax 
E7-o E\[Fq(ti-K,i-1) -= f\] 
where A = 1 - ~ Pr (tglti-/c,i-,), 
SBO ti--K,i~r>O 
E\[Fq(ti-g,i-1) = f\] = 
SP\]to ( ti \[ti-K + l,i-1) 
ti- K + L i,r=O,F q( ti- K,i-1)= f ' 
(12) 
In Eqn. 12, the range of .f is bucketed into 7 
4Katz said that d,. = i if r > 5. 
regions such as f = 0, 1, 2, 3, 4, 5 and f > 6 since 
it is also difficult to compute this equation tbr 
all possible values of f. 
Using the formalism of our simplified back-off 
smoothing, each of probabilities whose ML es- 
timate is zero is backed off by its corresponding 
smoothing term. In experiments, the smooth- 
ing terms of Prsl~o(ti \[ ti-K,i-l,~t)i-,l,i-l) are 
determined as follows: 
PI'sBo(ti\[ ti-Ii+l,i-h )if K> 1,d> 1 wi_j+~,i_~ 
Prsuo(ti ifK >_ 1, d = 1 
Prs13o(ti \[ ti-K+Li-l) if K > 1, J = 0 
PrAD(ti) if K = 0, J = 0 
Also, the snloothing terms of' Pl's\]~o(wi 
ti_L,i, Wi_l,i_ 1 ) are determined as follows: 
\[ Prst~o(wi 
Prsuo (wi 
Prs,o (wi 
PrsBO(Wi) 
PrA.O i) 
ti-L+~,i, ) if L _> 1, I> 1 
il)i-I+l ,i-I ti-L,i) 
if L _> 1, I = 1 
ti-L+Li) if L >_ 1, I = 0 
ifL = 0, I --= 0 
ilL = -1, I = 0 
In Eqn. 13 and 14, the smoothing term of a 
unigram probability is calculated by using an 
additive smoothing with 5 = 10 .2 which is cho- 
sen through experiments. The equation for the 
additive smoothing(Chen, 1996) is as tbllows: 
Fq(ti-t(,i) + 5 
AD ~tl (Fq(ti-lf,i) + 5) 
In a similar way, the smoothing terms of param- 
eters in Eqn. 9 ~re determined. 
3.4 Model decoding 
h'om the viewpoint of the lattice structure, the 
t)roblem of POS tagging can be regarded as the 
problem of finding the most likely path ti'om the 
start node ($/$) to the end node ($/$). The 
Viterbi search algorithm(Forney, 1973), which 
has been used for HMM decoding, can be effec- 
tively applied to this task just with slight mod- 
ification 5. 
4 Experiments 
4.1 Environment 
In experiments, the Brown corpus is used tbr 
English POS tagging and the KUNLP corpus 
'%uch modification is explained in detail in (Lee, 
1999). 
(13) 
(14) 
484 
NW 1,113,189 
NS 
NT 
DA 
RUA 
Brown KUNLP 
167,115 
53,885 15,211 
82 65 
1.64 3.4:1 
61.54% 26.72% 
NW Number of words. NS Number of sen- 
tcnccs. NT Numl){'.r of tags (nlorpheme-unit 
tag for KUNLP). DA Degree of mnbiguity 
(i.e. the number of tags per word). RUA 
1\].atio of mlanlbiguous words. 
Table 1: Intbrmation al)out the Brown eortms 
and the KUNLP tort}us 
Inside-test ()utside-|;(;st 
ML 95.57 94.97 
= 1) 
-AD(a - \](}- \]) 
AD(~ = 1{}2T) - 
ADO;-- =a) 
A\])((; = 
AD(5 = 
AD(5 = 
AD(5 = \]\]}-~7)- 
AD(5 = 
93.92 93.02 
95.02 94.79 
95.42 95.08 
95.55 95.05 
95.57 94.98 
95.57 94.94: 
95.57 94.91 
95.57 94.89 
95.57 94.87 
SBO 95.55 95.25 
ML Maximum likelihood estimate (with sim- 
ple smoothing). A\]) Additiv(~ smoothing. 
SBO Sinll}liticd 1)ack-off smootlfing. 
lal)l(, 2: lagging accura(:y (}f A(C(\]:o), M0}:0 )) 
for Kore~m POS tagging. Table 1 shows some 
intbrmation M)out 1}oth (:ori)ora {~. Each of them 
was segmented into two parts, the training set 
of 90% and the test; set of 10%, ill. the way that 
each sentence in the test set was extra{'tc, d \]i'()ln 
every 1(} senl;ellce. A(:cording to Tabl(! 1, Ko- 
reml is said to 1)e lllOre (litli(:ult to disambiguat(; 
tl\]ml English. 
We assmne "closed" wmabulary for English 
and "open" vocabulary for Korean since we do 
not h~ve any English morphological mmlyzer 
consistent with the Brown corlms. Therefore, 
for morphological mmlysis of English, we just 
aNote that some sentcnc('.s, which have coml}osite 
tags(such as "HV+TO" in "hafta"), "ILLEGAL" tag, 
or "NIL" tag~ were remov(M fronl the Brown corl)us and 
tags with "*" (not) such as "BEZ*" were r(',l)la(:(~(t 1)y (:of 
r{~st}o\]ttling tags without "*" such as "BEZ". 
2M 
1.5M 
IM 
(}.5M 
I I I I I 
- ML 
AD .x. - 
SBO 
1,02,01,02,01,02,0 
{},(} 0,01 ,(} 1,02,(} 2,0 
\]' --I I \[ I I 
.99 
.98 
.97 _I~L~ ~_± 
1,02,01,023} 13} 2,{1 
(},0 {},{} 1,01,1} 2,02,0 
.98 
.97 
2)6 
.(,):, ?vii, -rJ-- 
AD '×- - 
.:)4 SBO -~--- 
1,02,01,02,(11,02,0 
o,00,01,01,02,02,0 
1. 
.99 
.98 
\[ 
( .97 
.96 
I I I I I i t~ 
I I I I I t t } I I I I I 
1,11,11,01,12,01,12,22,22,22,21,02,01,12,2 
0,01,01,12,0 1,1 1,1 0,01,0 1,12,0 2,2 2,2 2,2 2,2 
(a) # of paraln{;ters 
M\], -D- 
AD -×- - - 
SB() 
I I I I I I I I I I I I I 
1 , 1 1, l 1,0 1,1 2,01,1 2,22,22,22,2 1 ,(} 2,01,12,2 
0,01 0 1,12,01,11,10,01,01 ,l 2,02,22,22,22,2 (1)} Inside-test 
1,11,I 1,01,12,01,I 2,22,22,22,21,02,01,12,2 
0 01,01,12,01,11,10,01,01,12,02,22,22,22,2 (c) Ouiside-test 
1,02,01,02,01,02,0 1,11,11,01,12,01,12,22,22,22,21,02,01,12,2 
0,00,01,0 1,02,02,0 0,01,01,l 2,01,11,10,(11,{11,I 2,02,22,22,22,2 (d) inside vs. outside-test in SBO 
Figure 3: Results of English tagging 
485 
looked up the dictionary tailored to the Brown 
corpus. In case of Korean, we have used a Ko- 
rean morphological analyzer(Lee, 1999) which 
is consistent with the KUNLP corpus. 
4.2 Results and evaluation 
Table 2 shows the tagging accuracy of the sim- 
plest HMM, A(C(l:0),M(0:0)), for Korean tag- 
ging, according to various smoothing meth- 
ods 7. Note that ML denotes a simple smooth- 
ing method where ML estimates with prob- 
ability less than 10 -9 are smoothed and re- 
placed by 10-9• Because, in the outside-test, 
AD(d = 10 -2) performs better than ML and 
kD(a ¢ 10-2), we use 5 = 10 -2 in our ad- 
ditive smoothing. According to Table 2, SBO 
I)ertbrms well even in the simplest HMM. 
Figure 3 illustrates 4 graphs'about the results 
of English tagging: (a) the number of param- 
eters in each model, (b) the accuracy of each 
model tbr the training set, (c) the accuracy of 
each model for the test set, and (d) the accuracy 
of each model with SBO tbr both training and 
test set. Here, labels in x-axis sI)ecify models 
K, ,1 in the way that ~ denotes A(T(\];,j) , W(Lj)). 
Therefore, the first 6 models are non-lexicalized 
models and tile others are lexicalized models. 
Actually, SBO uses more parameters than 
others. The three smoothing methods, ML, 
AD, SBO, perform well for the training set; 
since the inside-tests usually have little data 
sparseness. On the other hand, tbr the un- 
seen test set, the simple methods, ML and 
AD, cannot mitigate the data sparseness prob- 
lem, especially in sophisticated models. How- 
ever, our method SBO can overcome the prob- 
lem, as shown in Figure 3(c). Also, we can 
see in Figure 3(d) that some lexicalized mod- 
els achieve higher accuracy than non-lexicalized 
models. We can say that the best lexicalized 
model, A(T(1,~),W(1,1)) using SBO, improved 
the simple bigram model, A(T(L0),W(0,0)) us- • 
~ 0 mg SBO, from 97.19>/o to 97.87~ (the error re- 
duction ratio of 24.20%). Interestingly, some 
lexicalized models (such as A(T(1,1), W-(0,0)) and 
A(T(1,1), W(1,o))), which have a relatively small 
number of paranmters, perform better than 
non-lexicalized models in the case of outside- 
tests using SBO. Untbrtunately, we cannot ex- 
rInside-test means an experiment on the training set 
itself and outside-test an experiment on the test set. 
.96 
.94 ~• ..~ uu . X • "" " ~' ".~1%~ ~ 
.92 
.90 
ML ~ k .88 
AD .x. - 
SBO 
.86 I I I I I I I I I I f I I I I I I I 
1,02,01,02,01,02,0 1,11,11,01,12,01,12,22,22,22,21,02,01,12,2 
0,00,01,01,02,02~0 0,01,01,12,01,11,10,01,01,12,02,22,22,22,2 (a) Outside-test 
• 97 I I I I I I ~ I I I d~ I I I t I I-~ 
C,M + ¢ 
.966 C~,/l~/ + 
.9(;2 ~.~, -~I~ X \[\] 
+ 
1,02,01,02,01102,0 1,11,11,01,12,01,12,22,22,22,21,02,01,12,2 
0,00,01,01,02,02,0 0,01,01,12,01,11,10,01,01,12,02,22,22,22,2 (b) Considering word-spacing 
+ 
x 
÷ 
x 
\[\] × ÷ 
Illlllll 
Figure 4: Results of Korean tagging 
pect the result of outside-tests from that of 
inside-tests because there is no direct relation 
t)etween them 
Figm:e 4 includes 2 graphs about the re- 
sults of Korean tagging: (a) the outside ac- 
curacy of each model A(C(K,j),MiL,I)) and 
(b) the outside accnracy of each model 
A(C\[s\](~-g),M\[s\](L,0) with/without considering 
word-spacing when using SBO. Here, labels in 
K,J de- x-axis specify models in the way that ,7,, 
notes A(C\[s\](K,j),i~/I\[.~\](Lj)) and, tbr example, 
C,,M in (b) denotes k(C~(,r,j), M(L,r)). 
As shown in Figure 4, the simple meth- 
ods, ML and AD, cannot mitigate that sparse- 
data problem, t)ut our method SBO can over- 
come it. Also, some lexicalized models per- 
tbrm better than non-lexicalized models. On 
the other hand, considering word-spacing gives 
good clues to the models sometimes, but yet 
we cannot sw what is the best ww. From 
the experimental results, we can say that the 
best model, A(C(9,2),M(2,2)) using SBO, im- 
proved the previous models, A(C(1,0), M(o,o)) us- 
486 
ing ML(Lee, 1995), and A(G(,,0), M(0,0))using 
ML(Kim et al., 1998), t'ronl 94.97% and 95.05% 
to 96.98% (the error reduction ratio of 39.95% 
mid 38.99%) respectively. 
5 Conclusion 
We have 1)resented unitbrmly lexicalized HMMs 
for POS tagging of English and Korean. In 
the models, data sparseness was etlix:tively mit- 
igated by using our simplified ba(-k-ofl" smooth- 
ing. From the ext)eriments, we have ol)served 
that lexical intbrmation is usefifl fi)r POS tag- 
ging in HMMs, as is in other models, and 
ore" lexicalized models improved non-lexicalized 
models by the error reduction ratio of 24.20% 
(in English tagging) and 39.95% (in Korean tag- 
ging). 
G('.nerally, the mfiform extension of models 
requires ral)id increase of parameters, and hence 
suffers fl'om large storage a.nd sparse data. l~.e- 
cently in many areas where HMMs are used, 
many eflbrts to extend models non-mfitbrmly 
have been made, sometimes resulting in notice- 
able improvement. For this reason~ we are try- 
ing to transfbnn our uniform models into non- 
mliform models, which may 1)e more effective 
in terms of both st)ace (:omt)h'~xity and relial)le 
estimation of I)areme|;ers, without loss of accu- 
racy. 

References 

12. Brill. 1994. Some Advances in 
~l¥ansformation-B ased Part of St)eech 
~Dtgging. In P~ve. of the 12th, Nat'l Cm¢. on 
Art'tficial hdelligencc(AAAI-.9~), 722-727. 

E. Charniak, C. Hendrickson, N. Jacobson, and 
M. Perkowitz. 1993. l~3quations for Part- 
of Speech %~gging. In Proc, of the 11th, 
Nat'l CoT~:f. on Artificial Intclligence(AAAL 
93), 784-789. 

S. F. Chen. 1996. Building Probabilistic Models 
for Natural Language. Doctoral Dissert~tion, 
Harvard University, USA. 

R. O. Duda and R. E. Hart. 1973. Pattern CIas- 
s'~fication and Scene Analysis. John Wiley. 

G. D. Forney. 1973. The Viterbi Algorithm. Ill 
Proc. of the IEEE, 61:268-278. 

W. N. Francis and H. Ku~era. 1982. Fre- 
quency Analysis of English Usage: Lczicon 
and GTnmmar. Houghton Mitltin Coral)any , 
Boston, Massachusetts. 

I. J. Good. 1953. "The Population Frequen- 
cies of Species and the Estimation of Pop- 
ulation Parameters," Ill Biometrika, 40(3- 
4):237-264. 

S. M. Katz. 1987. Estimation of Probabilities 
fronl Sparse Data for the Language Model 
Component of a Speech Recognizer. In IEEE 
Transactions on Acoustics, Speech, and Signal 
i'rocessing(ASSl'), 35(3):400-401. 

J.-\]). Kim, S.-Z. Lee, and H.-C. Rim. 1998. 
A Morpheme-Unit POS Tagging Model Con- 
sidering Word-Spacing. Ill Pwc. of th.e I0 th 
National CoT~:fercnce on Korean h~:formation 
PTveessing, 3-8. 

J.-D. Kim, S.-Z. Lee, and H.-C. Rim. 1999. 
HMM Specialization with Selective Lexi- 
calization. In Pwe. of the joint SIGDAT 
Co~l:h':rence on Empirical Methods in Nat- 
'aral Language Processing and Very La'qtc 
Co'rpora(EMNLP- VL C-99), ld4-148. 

J.-H. Kim. 1996. Lcxieal Disambig'aation with 
Error-Driven Learning. Doctoral Disserta- 
tion, Korea Advanced Institute of Science and 
Te.clmology(KAIST), Korea. 

S.-H. Lee. 1995. Korean POS Tagging System 
Considering Unknown Words. Master The- 
sis, Korea Advanced Institute of Science and 
Teclmology(KAIST), Korea. 

S.-Z. Lee, .I.-D. Kim, W.-H. Ryu, and H.- 
C. Rim. 1999. A Part-of Speech Tagging 
Model Using Lexical l/.ules Based on Corlms 
Statistics. In Pwc. of the International Con- 
ference on Computer \])'lvcessin 9 of Oriental 
Languages(lCCPOL-99), 385-390. 

S.-Z. Lee. 1999. New Statistical Models for Au- 
tomatic POS Tagging. Doctoral Dissertation, 
l(orea University, Korea. 

B. Merialdo. 1991. Tagging Text with a Prol)- 
abilisl;ic Model. In P~vc. of the International 
Conference on Acoustic, Spccch and Signal 
Processing(ICASSP-91), 809-812. 

A. Ratnap~rkhi. 1996. A Maximum Entrol)y 
Model tbr Part-of-Speech Tagging. In Proe. 
of the Empirical Methods in Natural Lan- 
guage P~vcessi'ng Co'a:fercnce(EMNLP-9b'), 
133-142. 
