Marker random field based 
English Part-Of-Speech tagging system 
Sung-Young Jung , Young C. Park, Key-Sun Choi and Youngwhan Kim* 
Comput;er Science \])el)a.rtment 
Korea Advanced \[nstitul, e of Science and q%chnology 
Taejon, Korea 
*Multimedia Re.search Ibaboratories 
Korea ~\[>lecom 
{ chopin,ycpark,kschoi}(ci!csone.kaist.ac.kr 
Abstract 
Probabilistic models have been widely 
used for natural language processing. 
Part-of-speech tagging, which assigns 
the most likely tag to each word in a 
given sentence, is one. of tire problems 
which can be solved by statisticM ap- 
proach. Many researchers haw~ tried 
to solve the problem by hidden Marker 
model (HMM), which is well known as 
one of the statistical models. But it 
has many difficulties: integrating hetero- 
geneous information, coping with data 
sparseness prohlem, and adapting to new 
environments. In this paper, we pro- 
pose a Markov radom field (MRF) model 
based approach to the tagging problem. 
The MRF provides the base frame to 
combine various statistical information 
with maximum entropy (ME) method. 
As Gibbs distribution can be used to 
describe a posteriori probability of tag- 
ging, we use it in ma.ximum a posteri- 
ori (MAP) estimation of optimizing pro- 
cess. Besides, several tagging models are 
developed to show the effect of adding 
information. Experimental results show 
that the performance of the tagger gets 
improved as we add more statistical in- 
formation, and that Mt{F-based tagging 
model is better than ttMM based tagging 
model in data sparseness problem. 
1 Introduction 
Part-of-speech tagging is to assign the correct tag 
to each word in the context of the sentence. '\['here 
are three main approaches in tagging problem: 
rule-based approach (Klein and Simmons 1%3; 
Brodda 1982; Paulussen and Martin 1992; Brill 
et al. 1990), statistical approach (Church :1988; 
Merialdo 1994; Foster 1991; Weischedel et al. 
1993; Kupiec 1992) and connectionist approach 
(Benello et al. 1989; Nakanmra et al. 1989). In 
these approaches, statistical approach has the fol- 
lowing advantages : 
• a theoretical framework is provided 
• automatic learning facility is provided 
• the probabilities provide a straightforward 
way to disambiguate 
Many information sources must be combined to 
solve tagging problem with statistical approach. 
It is a significant assumption that tire correct tag 
can generally be chosen from I.he local context. 
Not only local sequences of words and tags are 
needed to solve tagging problem, but syntax, se- 
mantic, and morphological level information is also 
required in general. Usually information sources 
such as t)igram, trigram and migra.m are used in 
the tagging systems which are based on statistical 
method. Traditionally, linear interpolation an(t its 
variants have been used to combine the informa- 
tion sources, })tit these are shown to be seriously 
deficient. 
ME (Maximum Entropy) estimation method 
provides the facility to combine several informa- 
tion sources. Each inR)rmation source gives rise 
to a set of constraints, to be imposed on the con> 
bined estimate. The function with the highest en- 
tropy within the constraints is the ME solution. 
Given consistent statistical evidence, a unique M E 
solution is guaranteed to exist and an iteratiw~" al- 
gorithm is provided. 
MRF (Marker random field) model is based on 
ME method and it; has the facility to combine 
many inlbrmation sources through feature flmc- 
tions. MRF model has the following adwmtages: 
robustness, adaptability, parallelism and the facil- 
ity of combining informatiort sources. M RF-based 
tagging model inherits these advantages. 
In this paper, we will present one of the statis- 
tical models, namely MRF-based tagging systern. 
We will show that several information sources in- 
cluding unigram, bigram and trigram, can be com- 
bined in MRF-based tagging model. Experimen- 
tal results show that the MRF-based tagger has 
very good performance especially when training 
data size is small. 
Section 2 describes the tagging problem , Sec- 
tion "l describes statistical model already known 
236 
a.nd ,,;ect.iou 4 tip rcso{~rch for contl)it~ing st.at.ist, i- 
cal in\['orutal.ion. Sect.ion 5 provides Ml{F-I)ased 
tagging model attd secl.ion 9 sho,.w's the expcri- 
t:twnl;al rcsult.s. Sc.ct.iou 10 (:Opal)arcs M\[{I: wilh 
I\]MM. Fit(ally we conclude in scct, iott l. 
2 The Problem of Taggil+g 
When scnt.ct~c(! i,'V = wt, w2, .,., u,,,~ is given, t.lwrc 
exist ('orresl>onding (.ags 7' = /i,*2,...,t, of the 
same hmgl.h. W{> call l he pair (W,T) an align- 
lltOtll;. ~'Vo ::-;a.y that. wor(l iv: has l)('('tt assigned l.}w 
l.ag t i ill t.his aliguutcnl.. V'v'o suppose l.hat, a sel. 
,:)f I,ags is giv~'n. 'l'aggittg is assigniltg ('()rr('('1. lag: 
s(!quetlCO "1'-- It, t2, ..., tn I'or given word sc:qtlcnce 
H / ~- lt; t , 11:2, ..., IU++. 
3 Probabilistic 
Formulation (IIMM) 
I:% us assunm t,hal, we want i.o I,:ttov,, the n~,::)st; 
likc'ly tag seqtt<mcc ~b(PV), given a parlicttlar wot:d 
Se(lUtmcc I'V. Tim l.agging prol)h-nt is dolim'd as 
liuding t.lw tttosl, likely t.?tg s("tltlClt('(! "1' 
q~(w) :- .,':z ,t~,tx/'('/'llI,:) (t) 
z'(Wl',")/'('/') = ..,'q max (2) 
",' P(w) 
= +,rff nF3xI'(VVI'l')l'(73 (:It / 
wh.cre P(7') is 1,he a priori iwohal)ilit.y of a tag 
scquom:o~ 'r, t,(WI'I ') is t,h.c cot~dil.ioual I,'olmhil 
il,y of word sc,,tu(m(-c. H/, given I.It,:~ .'-;cqucuco o1" tags 
7', at,d /'(W) is l,ho tmcotMitiotled i)roba.I)ilit,y ot: 
word scqmmcc W. 'l'hc t>rol>abilit.y I'(W) in (2) is 
rc'tnovc'd I)ccause it lt~ts no ell'cot, on 0(W). (:on- 
scquent.ly, it. is suilicicnt i.o find the tag SC(ltmnce 
7' which sal.isiies (3). 
Wc can rcwvit,(> the prolmldlily of0ach scqttcnc(' 
as a. prodltct, of 1,he cotMil.iona.l prot>altilit.i<~s of 
each word or fag given all of the lm'viotts t, ags. 
t,(wv/') v(~r) 
,, { :'(.,,,:It+, .... ~,, ,,,_ ~, ..., ,,t) } 
= lli=l ×l'(li\[li_ I, ...,/1) 
Typically, otto nmkcs t.wo silllplifying assmttp 
(.ions I.o (:llt. dowu o, l, lt<: nttnd>er of F, rc, l>al:,ilil.ies 
1.,.) Ive <+st.inml.ed. I,'irst., rat.her I.h;tn ;tssuttling lt'i 
dclwnds on all IH'cviotm words and all i)rcviotts 
ta.gs, ono ;-tsstttnes w+ d,:'F, cn,:Is (:,nly ,:mli. Sc,::oml, 
rath(,r i:hau aSS(lining the tag ti deltends on the 
t'ull s(>quc:ncc of l)rcvious l, ags, w(' can assume l.hal. 
local COltl;oxl, is sullicicnt. This locality assumed is 
rcfercd t,o as ;t Mm'kov imlcl)endence rlSStlllll)liioIl. 
Using 1.lwse ass(trutH,ion, w(> al)lwoxhtml,c l,}to 
uqua.l.ion l,o l, ttc R)llowing 
z'(WlV') ~ II'A~z'(.,~le,) (4) 
/,(~/') ~ tl}%,z,(,~lz+_t) (:,) 
Accordingly, O(i'V) is (h'rived I) 3, applying (,I) 
,.td (,~) to (:~). 
,/,(w) : ... ,,+p+.: tI'L,;'(,.+lz;)z'(z,:lt,_.) ((i) 
We can gel each l)robabilit, y va.hte front the 
t.aggcd corItus which is i,rq+arcd for l.raining by 
usiu:4 (7) aml (8). 
z'(.,zlt,) - (:(+<,t,) (T) cT(t~) 
/'(/,:1t~_, ) c'(z~) (st 
where (7(t:),C(ti, Ig) is tit(" \['reqttency obt.aincd 
fronl l rainhlg dal.a. 
\:it;orl>i algorilhnt (l:orncy73) is the one goner- 
ally used to liw.l t.he t,a.g SO<luencc which safislies 
(6) aim I.ttis algoril.hnt gttaranl.ccs the opl, itttal so- 
hit ion to I,he I)r(+bhmt. 
This model has several prot>l(+tns. First,, so(no 
wot'ds or \[,ag~ s<Xltl(~ll(W','-; 1/13.y itot, O(HHIt' ill l.ra.ili- 
htg dal.a or Hlay occur with very low \[reqttetlcy; 
ii('vt'rlh('lcs,% t,llc words or lag soqtt(~ltC(~s c;/tt ;+\])- 
l)ear ill t.agging l>roccss, lit this case, it, usually 
causes V(!l'y bad result, t.o COllll)ttt.c (6), because 
the lwol)al)ility has zero wdue or very low value. 
'l'his problont is c.alh'd data slmr,sc.css i>rol)h+tn. 
To avoid thi+q l~roldetn, sm,)ot, hing of itd'ortttat.i,:m 
tlJtts/, I:,c ttscd. Sntool, hing proc('ss is ahnost ('s- 
scntia,\] in tlMM tmcattse IIMM has sevet'c d:at,;t 
sparseness prol>hmt. 
4 combining information sources 
4.1 linear inl;(',\]rlmlatiou 
Various Mttds o\[' inlol'ntal, ion sf:.ttr(;(~s and differ- 
out knowledge sources must, he Colnl)incd l.O S()IV(" 
the l,a.gging prol>l(+m. The gemwal method used 
it( IlMM is linear iul,erl~olatiot L which is l,he 
w(gght, ed sutnmal,ion of all prol)alfilil:y infornlat, ion 
,%o11 r c(:s. 
k 
P ....... ,,.,,,,(,,'7,) -: ~ A~/'~I,,,I/,) (!)) 
/=1 
wlt~t'c 0 < Ai N I mid }3: Ai = 1. 
This hie(hod cnn I)e used I)ol, h as a way of con> 
Itining Mtowh:dg(" sources and snloot, hing infornm- 
t, iotI sou t'.::cs. 
I1MM based l.agging modd times unigram, t>i- 
gl'atll a, lld t.rigt'altt in\[ortn;d.iott. These in for(mr(ion 
sources are linearly cotnl>ined by weighl.cd Slllliliift- 
l.ion. 
237 
P(zg It~-~, tg_2) = A1 P(ti Iti_l, li-2) + A2P(ti Iti- 1) (lo) 
where A1 + A2 = 1. Tire parameter lland A2 
can be estimated by forward-backward algorithm 
(Deroua86+) (Charniak93+)(tIUANG90+). 
Linear interpolation is so advantageous because 
it reconciles the different information sources in a 
straightforward and simple-minded way. But such 
simpliticy is also the source of its weaknesses: 
• Linearly interpolated information is generally 
inconsistent with their information sources 
because information sources are heteroge- 
neous for each other in general. 
• Linear interpolation does not make optimal 
combination of information sources. 
• Linear interpolation has over-estimation 
problem because it adjusts the model on the 
training data only and has no policy for un- 
trained data. This problem occur seriously 
when the size of the training data is not large 
enough. 
4.2 ME(maxlmum entropy) principle 
There is very powerful estimation method 
which combines information sonrces objectively. 
ME(maximum entropy) principle (,laynes57) pro- 
vides the method to combine information sources 
consistently and the ability to overcome over- 
estimation problem by maximizing entropy of the 
domain with which the training data do not pro- 
vide information. 
Let us describe ME principle briefly. For given 
x, the quantity x is capable of assuming the dis- 
crete wdues xi, (i = 1, 2, ..., n). We are not given 
the corresponding probabilities pi; all we know is 
tire expectation value of the function f,.(x), (r = 
1, 2, ..., m): 
;q 
E\[fr(x)\] = Epi(xi)f,.(xi) (1l) 
i=1 
On the basis of this information, how can we 
determine the probability value of the function 
pi(x)? At first glance, the problem seems insol- 
uble because the given information is insufficient 
to determine the probabilities pi(x). 
We call the function f,. (xi) a constraint function 
or fealure. Given consistent constraints, a unique 
ME soluton is guaranteed to exist and to be of the 
form: 
where the Ar's are some nnknown constants to 
be found. This formula is derived by maximizing 
the entropy of the probability distribution Pi as 
satisfying all the constraint given, qb search the 
l,.'s that make pi(x) satisfy all tile constraints, an 
external 
observation: 
OOO W\[_2 Wi 1 Wi Wi+l Wi+2 ooo 
MRF: .co {~~ eee <L.V <?L~/ ',,IL/ <dJ <L.v 
Figure 1: MRF T is defined for the neighborhood 
system with distance 2 
iterative algorithm, "Generalized Iterative Scal- 
ing" (GIS), exists, which is guaranteed to converge 
to the solution (l)arroch72+). 
(12) is similar to Gibbs distribution, which 
is the primary probability distribution of M\[{F 
model. MRF model uses ME principle in combin- 
ing information sources and parameter estimation. 
We will describe MRFF model and its parameter 
estimation method later. 
5 MRF-based tagging model 
5.1 MRF in tagging 
Neighborhood of given random variable is defined 
by the set of random variables that directly atfect 
the given random variable. Let N(i) denote a set 
of random variables which are neighbors of ith 
random variable. Let's define the neighborhood 
system with distance L in tagging fbr words W = 
wl, ..., w,~, and tags T = h, ..., t,~. 
N(i) = {i- L,...,i- 1, i+ l,...,i+ L} (13) 
This neighborhood system has on(; dimensional 
relation and describes the one dimenstional struc- 
ture of sentence. Fig. 1 showes MP~F T which is 
defined for the neighborhood system with distance 
2. The arrows represent that the random variable 
ti is affected by the neighbors ti- 2, ti- 1, ti+ t, ~j+'~. 
It also showes that ti,ti-t and ti,ti+l have the 
neighborhood relation connected by bigram, and 
that ti,ti-l,ti-2 and ti,ti+l,ti+2 have ttm neigh- 
borhood relation connected by trigram. 
A clique is defined as the set of random vari- 
ables that all of the pairs of random variables are 
neighborhood in it. Let's define the clique as the 
tag sequence with size L in tagging problem. 
G = {ti-L,ti-(t,-1), ...,ti} (14) 
A clique concept is used to define clique fimction 
that evaluates current state of random variables in 
clique. 
The definition of MRF is presented as following. 
Definil~ion of MI{F: Random 'variable T is 
Markov random field if T' salisfies the follow- 
ing two properties. 
238 
Positivity : 
t)('F) > O, W' (15) 
Locality : 
S'(t~I%,Vj, j ¢ i) = P(t~I%,Vj, j ~ iV(j)) 
We assume tha.t every prob;d>lity value el tag se- 
(luenee is larger l, hml zero bee;rose ungraluiuat, ic;d 
Sellt, ellCeS (;fill ,tl)pem" in htlllHill l~tligll&ge liS~ge, 
including meaningless sequence of characters. St) 
the positivity of' MRt!' is satislied. This :+tSStllnp- 
tion results in the robustness mid ada.ptability of 
the inodel, eveli though unti:a~ined events ocolir. 
The locality of MRF is consistent with the as- 
Slliliptioii O\[ I;a.ggiilg t)roblein in that the tag of 
given word ca, it be deterinined by the local con- 
text. (Tonsequenl, ly, the random variable 7' is 
MRF for neighborhood systenl N(i) its 7' satis- 
ties the positivity and the locality. 
5.2 A Posteriori Protiatiillty 
A posteriori probat)ility is needed to sea.rcb for 
the Jrlost, likely tag sequence. M II, F provides the 
i;heoretical bi~cliground about the probal)ility of 
the system (Bes~tg74) ((leiJfla, ii84+). 
H~mniersley-(\]liflbrd thcorein: 7'he probability 
dish'ib'ulio'n I'(7') is (Tibbs dish'ibulion if and 
only if 'random wzriable 7' is ,,llarkov random 
field for givcn ncigborhood syslc'/n N(i), 
e. ",'~,, uCr) 
P(7') - Z (17) 
Z = Ee-~"~'~l(m) (:18) 
where "\['HI is l;elillJel'i~tllre~ ~ is norlnalizing COIl- 
Sl;~tllt, called partition ftlllCLioil aAld U ('\[') iS etlergy 
fimct;ion. The a priori probal)ilit;y P(7') of tag se- 
quence 7 ~ is Gibbs distribution because the ran- 
dora variable 7' of tagging is MRF. 
It can be proved that a posteriori probability 
i)(TiW) for given word sequen(;e W is also Gil)bs 
distribution (Chun93). (7onsequent/y, a I)osteriori 
probability of 7' for giwm W is 
t , u(.rlw) 1'('VlW) = --/~- ~ (i,q 
Z 
We llSe (|9) i,O ('.m'ry Ollt MAP estiui;dion in the 
tagging model . The energy function U(TJW) is 
of this form. 
u@'lW) = ~ w~(;t'lW) (20) 
c 
where V,, is clique function wii;h the property 
that Vc depends only oil those randoui variable, in 
clique e. This lllelLllS t;hat ellergy funcl, ioli (Urill be 
obliained \[rOlll each clique funtion which splits l,\[ie 
set of ralldOlll viu'iables to slibscLs. 
6 Clique function design 
The more state of random variables are near to 
Llie solution, the niore the system becomes stable, 
and energy function has lower vahie. Energy flmc- 
i, ion repre, sents the degree of unstability of current 
stntc of raiidoni vl.triables in M RF. It is similar to 
the I)ehaviour o\[' molecular particles in the rcM 
world. 
('~lique function is proportional to energy fun(:- 
tion, and it represents the unstability of current 
state of randoni varia.bles in clique or it has high 
value when the state of MRF is bad, low value 
when the st;~te of MI{F is nero: to solution. Clique 
fimction contributes to reduce the comi)utation of 
evahmtion function of entire MRF by clique con- 
cept that separates random v~triables to the sub- 
sets. 
(llique function V/(TJ W) is described by the. few. 
tures that represent the constraint or information 
sources of givcu prol)h;m domain. 
~5(:z'lvv) := ~ a,.j;~.('clm) ('2J) 
F 
6.1 MRF Model 1 (Basic model) 
The basic information sources which arc used m 
statistical tagging model are unigram, l)igrani nnd 
trigrain. M I{I" nlodel I lises unigrmn, higranri and 
t, rigraln. We write the \[ea£ure furiction o\[ unigraln 
;iS 
j\~,,.:,,. ..... = (t - ~'(t~I,<)) (~) 
and the feature f'illtCtioll O\[ II-gralll, inchiding 
bigram, trigram ~s 
f li, - :, • ..... = 
where 
(t - Z'(t~ I J)) (m3) 
ioN(i) 
t'(tilti_j,t~_j+x,...,t~_t), iI'i > j P(ti 
IJ) = t"(6: Iti+l, h+~, ..., zi+j), it" i < j 
The clique filnction of the model 1 is ttt~de as 
follows. 
/01'lw) -- A, •/;,,,~,.,,,~ + x~. f,,-:,,.,m (~4) 
6.2 Model 2 (Morphological inforntation 
indnded) 
Morphok)gical level information helps tagger to 
determine the tag of the word, more. especially 
of the unknown word. The suffix of a word gives 
very useful information about the tag of the word 
in F, nglish. The (:li(ltte function of model 2 is de- 
lined as 
f.~,\]:i~, = (\[- t'(gi\].suffix(wi))) (25) 
We used the statistical distribution of the sixty 
sll\[lixes thztt are IlK)st frequently used ill English. 
239 
We can expand the clique flnlction of the model 
1 easily by just adding Stlficix inforui~-ttion to the 
clique function of the ntodel 2. 
'~,~.(7' IW) = A~ J;,,,o,' ..... +Ae.f,+_<,.a,,,+Aa.f.~ff~. 
(2 5) 
6.3 Model 3 (error correction) 
There exist error prone words in every ta.gging sys- 
tern. We adjust error prone words 1)y collecting 
the. error results and adding more inforniation of 
the words. The feature function of Model 3 is for 
adjusting errors in word level. 
= (1 - (2r) 
f#vo,.2 = (;1 - P(lil'wi_2,ti_l)) (28) 
YVe used the probat)ility d istribu tion of five hun- 
tired error proiie words ill Model 2 in oMer to re- 
duce the tltllllber 0t' paF31ileters. 
7 Optimization 
The process of selecting the best tag sequence 
is called ms optimization process. We use MAP 
(Maximum A l)osteriori) estiniation method. The 
tag sequence 7' is selected to niaximize the a pos- 
teriori probM)ility of tagging (19) by MAP. 
Simulated annealing is used to searcti the op- 
timal tag sequence as Gibbs distribution provides 
simulated anneMing facility with teliiperatur(+ arid 
eileFgy ('OllCept. }go change the tag candidate of 
one word selected to tninilnize the energy t"iinction 
in k-th step froni T (k) to j,(k+i) , a.n(l l'(+'t)e;/t this 
process until there is tlO change. The t(?llll)(?l?a- 
ture 7'm is started in high vahle and lower to zero 
as tile above process is doing. Then the final tag 
seqtlellce is the solution. Sininlat,ed annealing is 
US0flil in the prol)leni which has very hugo search 
sl)ace, and it is the approxiniation of MAP est.i- 
fllatioll ((\]elll&iq 84 -t--). 
There is another algorithm called Viterbi algo- 
rithtn to lind ol)timal solution. Viterbi algoritllm 
guarantees optinial sohltion \]tilt, it canilot bc used 
in the probleln which has very huge search space. 
SO it iS /iscd in the l)rol)leni which has Slliall sea, rch 
space 3,11(1 Ilsed ill I1M M. M RF model Call ilSe both 
Viterbi algoril, hni and siinulated anealing, but it 
is not \](nowtl IO IlSe sinitllated allne, aling ill fIMM. 
8 parameter estimation 
The weighting parameter A in tile clique \['unction 
(19) Call be estiinated frOlil training data by MIg 
principle (.\] ayiles57). 
Let tlS descrit)e ME princil)le and IIS algorithni 
briefly. For given x = (Xl,...,;Frt), the corr(?- 
Sl)onding probal)ilities t)i(xi) is nod klloWll. All 
we know is the expectation value of the flmction 
J;+(x), (r = 1,2, ...,m): 
?t 
\];;\[J;.(,)\] = pg(.<)J;.(.+:) (2.<)) 
i=1 
(riven consist(mr constraints, we can find the 
prot)ability distribution p~ that niakes the entrol)y 
-- ~ Pi In t)i wlhle lllaxillllllil \])y llSillg Largl:angiall 
niultipliers in the nsual way, and obtai u the result: 
pi(a?i ) = cXt)(-- ~ J,.J;.(xi)) (30) 
7" 
This forniula is alniost siinilar 1.o Gibbs distri- 
bution (17), also J\]. correspoilds to the feature of 
clique function in M I{,F (20) (2l). Using this fact, 
we can use M 1!; in paranieter estimation in M Ii.F+ 
We can derive (31) to be used in pgLralneter es- 
tiniation fi'om training data. 
0 
- o-A-t,+z -- (31) 
7 
z _- (32) 
i r 
'l?o solve the solution of it, a numerical analysis 
mt;thod (-~IS ((\]enerlaized Iterative Scaling) was 
suggested (l)arroch72+). Pietra used his owu al- 
gorithm IIS (hnt)roved lterative Scaling) based on 
G IS to induce the features and parameters of ran- 
dom field automatically (l)ietra95) . Following is 
IIS algorithnl 
llS(In)provcd lterative Scaling) 
t Initial data 
A rof'ol'ellce distribution 1), all initial 
model q0 and fl), fl .... , fn. 
• Output 
q. alld A by MI'\] estiiiiation 
A lgori t h m 
(0) Set qC0) = qs) 
(1) Per each i (ihd Xi, t,il(; llnique sohlth)n 
el" 
q(X.)fi(7,)c~,lk) ~,. f,.(T) =- ~ IS(T));(7/, ) 
T 7' (33) 
(2) k +-- k+l, set q~+l with new Ai 
(3) l\[' qt~O tins converged, set q, = q(~') 
and tertilhiate. Otherwise go to step(i) 
where q(k) is the distribution of the iriodel in k- 
th step, alld it, corresponds to the posteriori pro}> 
ability of the tagging model (\[.(J). A, tile sohltion 
()f' (:/:t) (:all be ol)i,ained 1)y Newton niei.hod ((,'tlr- 
tis89+), Olie Of lilllll(~rica\] analysis nietilod. 
The I'ef¢TellC(" distribution \]~ is the l)rol)ability 
distril)ution which is obtaiued directly frOlll ill'aiD- 
illg data. \]) corresponds to tile posterior (listri- 
button t'(TIW ) ill the IA/g~illg iItod(ti. ~¢Vo tlS( t the 
240 
Model Tagging accuracy 
11MM 96.11 1 
M H,F(l) !)6.2 
M I{,F(2) 9(i.5 
M I{F(3) .97. I 
Table 1: Measuring l, hc. a('cur;~cy of IIMM aud 
M R,F nioch;Is. 
posterior t)rol)alfility of l:,hc words sequence o\[ win- 
{low size n (CSl>ecially 3 in this lliodel) I)y colillting 
the entry Oil training data. Trailiiilg data lllewis 
tagged (:orl)us tmrc. 
9 Experiments 
The Ili;,I, ill ol>jcctivc of' this experinicnts is to ('Olii- 
l)a, re ttio M I{i!' l,a,gghig nlodcl with l lic I IM M tag- 
ging nl()(hJ. \¥C consl, lulci(xl a, ~,'11{1" /.aggcr mid a 
II MM tagger usiilg sa.lne inl'orlll;tliion on t.hc sailic 
(?llVil:onlllelll;. 
li, is lle(:esSa, l:y t,o do smoothing tiroccss for data 
sl)arseiicss l>roblelit which is scw;l:c i\[l \]1MM, while 
MRF has tll(' \[acility of sniooi,hing in it,self like 
neura, l-nel, . IvVc ilSCd line;tr inl,erl>olat, ion ine/,hod 
(l),~rot,a.S6+) (jclin&Sg) and assigning &C,lUel,cy 
1 for uilknowll word (Weisctig:/+) for sliioothilig 
in II M M. 
!t"V'(? llS(?(I l,\[lC \]lrowII (;orlillS ill licnliTl'ce Bank, 
dcscril)cd in (Ma.rc,s93+) with ,l~ dilli'rent, tags. 
A set of t~00,000 words is colhx'tod for ('ach parl, 
Of lirowlt ('Orl)llS ali(\[ llSe(I ;t,"; t.l'aiilillg (h~l;a, which 
is used to 1)uihl 1,tw niodels. And a set of 30,000 
words (',()l'l)llS is used as i,c'st da.C~, which is used 
to t,esl, the qua.lil;y of Ltic models. 
'l'alJe 1 shows the 3~CClll;&(;y o\[" each L~ggillg 
niodcl. '1't,~ average a(:(:ura.(:y of tll(' I IMM-hascd 
l,a,gg(;r iS Sii n ilar i,o t Ila, t of M I{F( 1 ) l,a, gger I)c(:aiiS(~ 
l,hey iis( 1 \[,he SalilO hil'ornial, h)n. 
ld<% 7 sliows dial, l,he error tale as \[,iic size 
o\[ \],r;IJnhig dnla, is illcreased. MI{F(I) has lower 
error rate Lha, n that of ilMM when l, lie size of 
training data is slnMl. '\]'hc crrol: l:~tl, e of M t{,F(2) 
is decreased CSlmcially wllell tlw size of the tr{dn- 
ing dab~t is StlllJl, l)c:(;a.llSC luorphologicvd ia\['ornia.- 
don helps I,t,~ process of llllkiiowii words. Filially, 
MI{l"(3) show itnliroveinent as the size of l;rain- 
ing (hfl,;~ grows I)ul; COllVel'g(is l;o l, ile \]inlit Oll sOl\]l(? 
poinl,s. 
These e×pcrilnen~s show thai, M tt, F has I)el, l,cr 
a,d(lal)l;abilil, y with snlMI I, raillilig data than II MM 
does, aud l,h;fl: MI{F tagger has bss data sl)arse 
IleSS probhmi than IIMM l:aggor. 
<;0 
4O 
Ht4i,l 
{e 
20 MRF ( L I 
~I}{!." \[ 2 ) 
i0 
do0 41 .... L000 /<: ...... 13 ......... ; ....... % .... t ..... 
Ji <~t tLaIIll\[I I WL~\[ I 
l"igurc :7: Error ral,c of each model \[or giwm size 
of t, raining word 
10 Comparison of MRF with 
HMM 
We (:~\[ii derive l, llc siinldified cquatioil of IIMM 
only wilh bigra.m : 
l>(v'tw)= 
(35) is consid(m~d as l, he Inull, ilflied probal>iltics 
o\[ a the h)cal cwml,s. The iioarer the probabilily 
vahm of local cv(mt is to zero , t, hc ,~or(' it, a\[Ii;cl,s 
I,h(' l)robahilil:y of the (ml, ire evenl,. This prol)erty 
strictly reflccl;s on the cwmt.'-; which does not occur 
in l,rainhig (lat,~L \[:Jui, lid prohibits even the cvcul. 
that does llot OCCllr in l;rainiug datl h althougtl the 
CVellt is legal. 
M I{I" can he sinil)lificd t)y lhe sunnlml:ion of 
clique \['un(;tion as (3{\]). 
I - ~%:{ v,+v,+.. +<,,} (3(J) /'('/'lW)-= 2 
!'vl I{,1" uses rvalual, io, funclion I)y suunuali<)., 
while IIMM do('s I)y umltil)lic~tion. F, ven if a 
cliqm~ flmcl,ion wd.e is very bad, o(,hcr cliqnc 
function ca, n conll)ensate dequa~ely lmca,use the 
clique functions are coime('l,ed by summatiou. 
'l'here is no critical point of postcriori l)robal)lil,y 
in MRI i', while IIMM has cril, ical poi,1, in zero 
value. This property results in the rolmstness mid 
the ada, ptability o\[ l;tl(~ lliodol aud niakcs M t{F 
uiodcl stronger in data Sl)arscncss probhml. 
11 Conclusion 
We prol)oscd a Ml:\[l!'-based tagging lnodel. In- 
formation sourccs for tagging aro combiimd hy 
M F, 1)rincil)le which is ,sod i, M I{F as theoretical 
background. A I1 I)ara~liiclxu's used in the iiio(\]e\] are 
eslJmated from tra, ining data a.ul,omatica,lly. As 
;t result,, our MRl!'-l)ascd tagging nlodcl has bet,- 
ter tmrfornla, ncc Lha, n t\[MM tagging nio(h'l, CSl)C 
cially when the size of" the training dal,a is Sliiall. 
Vv'e haw~ sooii l, hat tim i)or\['oriilali(:o Of l,he M HI"- 
I)ascd tagging nio(hJ can be' iinprow'd by adding 
inl'orina, i ion I,o tim niodct. 
241 
References 
Besag, J. "Spatial interaction and the statistical 
anMysis of lattice systems( with discusstion )," 
J. Royal Statist. ,%c., series B, vol. 36, pp. 192- 
326, 1974. 
Besag, J. "On the Statistical Analysis of Dirty 
Pictures", J. Royal Statist. Soc., vol. B48, 1986. 
Brill, E. "A Simple Rule-Based Part of Speech 
Tagger", In Proceedings of the 3rd Co@ on Ap- 
plied Natural Language Processing, pages 153- 
155, April, 1992. 
Charniak, E., C. ltendrickson, N. Jacobson and 
M. Perkowitz, "Equations for Part-of Speech 
'Fagging," Proc. of Nat'l Conf. on Artificial 
\[ntelligence(AAAI-86), pp. 784-789, 11993. 
In, G. Chun, "Range hnage Segmentation Using 
Multiple Markov Random Fiehls", Ph.D. thesis, 
KAIST, KOt{EA, 1993. 
Church, K. W., "A Stochastic PAI~;rS Pro- 
gram and Noun Phrase Parser for Unrestricted 
Text,", Proceedings of Applied Natural Lan- 
guage Processing, Austin, Texas, pp. 136-143, 
1988. 
Darroch, J. N. and D. Ratcliff, "Generalized It- 
erative Scaling for Long-Linear Models..", The 
Annals of Mathematical Statistics, Volume 43, 
pages 1470-1480, 1972. 
Derouault, A. M. and B. Merialdo, "Natural 
Language Modeling for Phoneme-to-Text Tran- 
scription,", \[EEE Tr. on Pattern Anaysis and 
Machine Intelligence, vol. PAMI-8, no.6, Nov. 
1986. 
Curtis, F. G. and Patric O. Wheatley, "Applied 
Numerical Analysis", forth edition, ADDISON 
WESLEY, 1989. 
Forney, G. D., "The Viterbi Algorithm", Proc. of 
the IEEE, vo\[. 61, pp. 268-278, Mar. 1973. 
Gamble, E. B., Geiger D. and Possio T., "in- 
tegration of Vision Modules and labeling of 
Surface Discontinuities", IEEE Transactions on 
systems, man and cybernetics, vol. 19, no. 6, 
November/deeemver 1989. 
Geman, S. and Geman D., "Stochastic Relax- 
ation, Gibbs Distributions, and the Bayesian 
Restoration of Images", IEEE transactions 
on pattern analysis and machine intelligence, 
VoI.PAMI-6, NO. 6, NOVEMBER 1984. 
Geiger, D. and Girosi F., "ParalM and Determin- 
istic Algorithms from MRF's: Surface Recon- 
struction", \[EEE Transactions on pattern anal- 
ysis and machine intelligence, VOL 13, NO. 5, 
MAY 1991. 
IIUANF, X.D., Y. AtHKI and M.A. JACK, "Hid- 
den Markov Models for Speech Recognition", 
1990. 
Jaynes, E. T., Information Theory and StatisticM 
Mechanics, Physics Reviewsl06, pages 620-630, 
1957. 
Jelinek, F. , "Self-Organized Language Model- 
ing for Speech Recognition." , in Readings in 
Speech Recognition, Alex Waibel and Kai-l,'u 
Lee(Editors). Morgan Kaufinann, 1989. 
Kupiec, ,l., Robust Part-of-Speeh Tagging Using 
a llidden Markov Model, Computer Speech and 
Language, 1992. 
Marcus, M. P., Beatrice Santorini and Mary Ann 
Marcinkiewiez, "Building a large annotated cor- 
pus of English: the Penn Treebank", Computa- 
tional Linguistics, Vol. 19, No. 2, pp 313-330, 
June, 1993. 
Merialdo, B., "Tagging English Text with a Prob- 
abilistic Model", Computational Linguistics, 
Volume 20, no 2, June 1994. 
Nakamura, M., K. Maruyama, T. Kawanata and 
K. Shikano, "Neural Net- 
work Approach of Word Category Prediction 
for English rl~XtS," Int'l Conf on Computational 
Linguislics(Coling-90), pp. 213-218, 1990. 
Pietra, S. D., V. D. Pietra and J. Lafferty, "Induc- 
ing features of random fields", Carnegie Mellon 
University, Technical report CMU-CS-95-144, 
MAY, 1995. 
Rosenfeld, R., "Adaptive Statistical language 
Modeling: A Maximum Entropy Approach" 
, Carnegie Mellon Uniw~rsity, technical report 
CMU-CS-94-138, April 19, 1994. 
Weischedel, 1{., R. Scewartz, a. Ralmucci, M. 
Meteer, and L. P~awshaw. "Coping with Am- 
biguity and Unknown Words through Prob- 
abilistic Models", Computational Linguistics, 
19(2):359-382, 1993. 
Zhang, J. and J.W. Modestino, "A Markov Ran- 
dora Field model-based approach to image in- 
terpretation", Visual Communications and im- 
age Processing IV, Vol 1199, 1989. 
242 
