HMM Specialization with Selective Lexicalization* 
Jin-Dong Kim and Sang-Zoo Lee and Hae-Chang Rim 
Dept. of Computer Science and Engineering, Korea University, 
Anam-dong, Seongbuk-ku, Seoul 136-701, Korea 
E-mail: {jin|zoo|rim}@nlp.korea.ac.kr
Abstract 
We present a technique which complements 
Hidden Markov Models by incorporating some 
lexicalized states representing syntactically un- 
common words. Our approach examines the distribution of transitions, selects the uncommon words, and makes lexicalized states for the words. We performed a part-of-speech tagging
experiment on the Brown corpus to evaluate the 
resultant language model and discovered that 
this technique improved the tagging accuracy 
by 0.21% at the 95% level of confidence. 
1 Introduction 
Hidden Markov Models are widely used for
statistical language modelling in various fields, 
e.g., part-of-speech tagging or speech recogni- 
tion (Rabiner and Juang, 1986). The models 
are based on Markov assumptions, which make 
it possible to view the language prediction as a 
Markov process. In general, we make the first-order Markov assumptions that the current tag is only dependent on the previous tag and that the current word is only dependent on the current tag. These are very 'strong' assumptions, which give the first-order Hidden Markov Models the advantage of drastically reducing the number of parameters. On the other hand, the assumptions prevent the model from utilizing enough constraints provided by the local context, and the resultant model consults only a single category as the context.
A lot of effort has been devoted in the past 
to make up for the insufficient contextual in- 
formation of the first-order probabilistic model. 
The second-order Hidden Markov Models with appropriate smoothing techniques show better performance than the first-order models and are considered a state-of-the-art technique (Merialdo, 1994; Brants, 1996). However, the complexity of the model is relatively high considering the small improvement in performance.

* The research underlying this paper was supported by research grants from the Korea Science and Engineering Foundation.
Garside describes IDIOMTAG (Garside et 
al., 1987) which is a component of a part-of- 
speech tagging system named CLAWS. ID- 
IOMTAG serves as a front-end to the tagger 
and modifies some initially assigned tags in or- 
der to reduce the amount of ambiguity to be 
dealt with by the tagger. IDIOMTAG can 
look at any combination of words and tags, 
with or without intervening words. By using IDIOMTAG, the CLAWS system improved tagging accuracy from 94% to 96-97%. However, although IDIOMTAG proved fruitful, the labor-intensive process of producing idiom tags is very expensive.
Kupiec (Kupiec, 1992) describes a technique 
of augmenting the Hidden Markov Models for 
part-of-speech tagging by the use of networks. 
Besides the original states representing each 
part-of-speech, the network contains additional 
states to reduce the noun/adjective confusion, 
and to extend the context for predicting past 
participles from preceding auxiliary verbs when 
they are separated by adverbs. By using these 
additional states, the tagging system improved 
the accuracy from 95.7% to 96.0%. However, 
the additional context is chosen by analyzing 
the tagging errors manually. 
An automatic refining technique for Hidden 
Markov Models has been proposed by Brants 
(Brants, 1996). It starts with an initial first-order Markov model. Some states of the model are selected to be split or merged to take into account their predecessors. As a result, each of the new states represents an extended context. With this technique, Brants reported a performance equivalent to that of second-order Hidden Markov Models.
In this paper, we present an automatic re- 
fining technique for statistical language models. 
First, we examine the distribution of transitions 
of lexicalized categories. Next, we break out the 
uncommon ones from their categories and make 
new states for them. All processes are auto- 
mated and the user has only to determine the 
extent of the breaking-out. 
2 "Standard" Part-of-Speech 
Tagging Model based on HMM 
From the statistical point of view, the tagging 
problem can be defined as the problem of find- 
ing the proper sequence of categories c_{1,n} = c_1, c_2, ..., c_n (n ≥ 1) given the sequence of words w_{1,n} = w_1, w_2, ..., w_n (we denote the i-th word by w_i and the category assigned to w_i by c_i), which is formally defined by the following equation:

    T(w_{1,n}) = argmax_{c_{1,n}} P(c_{1,n} | w_{1,n})    (1)
Charniak (Charniak et al., 1993) describes 
the "standard" HMM-based tagging model as 
Equation 2, which is the simplified version of 
Equation 1. 
    T(w_{1,n}) = argmax_{c_{1,n}} \prod_{i=1}^{n} P(c_i | c_{i-1}) P(w_i | c_i)    (2)
With this model, we select the proper category for each word by making use of the contextual probabilities, P(c_i | c_{i-1}), and the lexical probabilities, P(w_i | c_i). This model has the advantages of a well-founded theoretical framework, an automatic learning facility, and relatively high performance. It thereby forms the basis of most tagging programs created over the last few years.
For this model, the first-order Markov assumptions are made as follows:

    P(c_i | c_{1,i-1}, w_{1,i-1}) ≈ P(c_i | c_{i-1})    (3)

    P(w_i | c_{1,i}, w_{1,i-1}) ≈ P(w_i | c_i)    (4)
With Equation 3, we assume that the current 
category is independent of the previous words 
and only dependent on the previous category. 
With Equation 4, we also assume that the current word is independent of everything except the knowledge of its category. Through these assumptions, the Hidden Markov Models have
the advantage of drastically reducing the num- 
ber of parameters, thereby alleviating the sparse 
data problem. However, as mentioned above, 
this model consults only a single category as 
context and does not utilize enough constraints 
provided by the local context. 
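To make the decoding step behind Equation 2 concrete, the following is a minimal Python sketch of Viterbi search for the "standard" model. The probability-table layout, the boundary tag '<s>', and the 1e-12 floor for unseen events are illustrative assumptions of ours, not details taken from the paper.

import math

def viterbi_tag(words, tags, trans_p, lex_p):
    """Find the tag sequence maximizing prod_i P(c_i|c_{i-1}) P(w_i|c_i)."""
    def tp(prev, cur):   # log contextual probability, floored for unseen pairs
        return math.log(trans_p.get((prev, cur), 1e-12))
    def lp(word, cur):   # log lexical probability, floored for unseen pairs
        return math.log(lex_p.get((word, cur), 1e-12))
    # delta[c] = best log-score of any tag sequence ending in tag c
    delta = {c: tp('<s>', c) + lp(words[0], c) for c in tags}
    backpointers = []
    for w in words[1:]:
        new_delta, back = {}, {}
        for c in tags:
            prev = max(tags, key=lambda p: delta[p] + tp(p, c))
            new_delta[c] = delta[prev] + tp(prev, c) + lp(w, c)
            back[c] = prev
        delta, backpointers = new_delta, backpointers + [back]
    best = max(tags, key=delta.get)
    seq = [best]
    for back in reversed(backpointers):
        seq.append(back[seq[-1]])
    return list(reversed(seq))

In practice the two probability tables would be estimated from the training portion of the corpus and smoothed, as described later in the paper.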
3 Some Refining Techniques for 
HMM 
The first-order Hidden Markov Models described in the previous section provide only
a single category as context. Sometimes, this 
first-order context is sufficient to predict the 
following parts-of-speech, but at other times 
(probably much more often) it is insufficient. 
The goal of the work reported here is to develop 
a method that can automatically refine the Hid- 
den Markov Models to produce a more accu- 
rate language model. We start with a careful observation of the assumptions made for the "standard" Hidden Markov Models. With Equation 3, we assume that the current category is only dependent on the preceding category. As we know, this is not always true, and this first-order Markov assumption restricts the disambiguation information to the first-order context.
The immediate ways of enriching the context 
are as follows: 
• to lexicalize the context. 
• to extend the context to higher-order. 
To lexicalize the context, we include the preceding word in the context. Contextual probabilities are then defined by P(c_i | c_{i-1}, w_{i-1}). Figure 1 illustrates the change of dependency when each method is applied. Figure 1(a) shows that each first-order contextual probability and lexical probability are independent of each other in the "standard" Hidden Markov Models, whereas Figure 1(b) shows that the lexical probability of the preceding word and the contextual probability of the current category are tied into a lexicalized contextual probability.
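As a rough illustration (not the paper's estimation procedure, which additionally applies MacKay and Peto's smoothing), the lexicalized contextual probabilities P(c_i | c_{i-1}, w_{i-1}) could be estimated from a tagged corpus by relative frequencies; the function name and data layout below are assumptions of ours.

from collections import defaultdict

def lexicalized_context_probs(tagged_sents):
    """Relative-frequency estimates of P(c_i | c_{i-1}, w_{i-1}).

    tagged_sents is assumed to be a list of sentences, each a list of
    (word, tag) pairs.  Smoothing is deliberately omitted here.
    """
    context_count = defaultdict(int)     # counts of (w_{i-1}, c_{i-1})
    transition_count = defaultdict(int)  # counts of (w_{i-1}, c_{i-1}, c_i)
    for sent in tagged_sents:
        for (w_prev, c_prev), (_, c_cur) in zip(sent, sent[1:]):
            context_count[(w_prev, c_prev)] += 1
            transition_count[(w_prev, c_prev, c_cur)] += 1
    return {(w, cp, c): n / context_count[(w, cp)]
            for (w, cp, c), n in transition_count.items()}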
To extend the context to higher order, we extend the contextual probability to the second order.

Figure 1: Two Types of Weakening the Markov Assumption

Contextual probabilities are then defined by P(c_i | c_{i-1}, c_{i-2}). Figure 1(c) shows that the two adjacent contextual probabilities are tied into the second-order contextual probability.
The simple way of enriching the context is to extend or lexicalize it uniformly. The uniform extension of context to the second order is feasible with an appropriate smoothing technique and is considered a state-of-the-art technique, though its complexity is very high: in the case of the Brown corpus, we need up to 0.6 million trigrams. An alternative to the uniform extension of context is the selective extension of context. Brants (1996) takes this approach and reports a performance equivalent to that of the uniform extension with much lower model complexity.
The uniform lexicalization of context is computationally prohibitive: in the case of the Brown corpus, we would need almost 3 billion lexicalized bigrams. Moreover, many of these bigrams neither contribute to the performance of the model nor occur frequently enough to be estimated properly.
An alternative to the uniform lexicalization is 
the selective lexicalization of context, which is 
the main topic of this paper. 
4 Selective Lexicalization of HMM 
This section describes a new technique for re- 
fining the Hidden Markov Model, which we call 
selective lexicalization. Our approach automatically finds syntactically uncommon words and makes a new state (we call it a lexicalized state) for each of these words.
Given a fixed set of categories, {c^1, c^2, ..., c^C}, e.g., {adjective, ..., verb}, we assume the discrete random variable X_{c^j} with the set of categories as its domain and a set of conditional probabilities as its range. The random variable X_{c^j} then represents a process of assigning a conditional probability P(c^i | c^j) to every category c^i (c^i ranges over c^1 ... c^C):

    X_{c^j}(c^1) = P(c^1 | c^j)
    X_{c^j}(c^2) = P(c^2 | c^j)
    ...
    X_{c^j}(c^C) = P(c^C | c^j)
We convert the process of X_{c^j} into the state transition vector V_{c^j}, which consists of the corresponding conditional probabilities, e.g.,

    V_{prep} = ( P(adjective | prep), ..., P(verb | prep) )^T.

The (squared) distance between two arbitrary vectors is then computed as follows:

    D(V_1, V_2) = (V_1 - V_2)^T (V_1 - V_2)    (5)
Similarly, we define the lexicalized state transition vector¹ V_{c^j,w^k}, e.g.,

    V_{prep,in} = ( P(adjective | prep, in), ..., P(verb | prep, in) )^T.
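A small sketch of how these vectors might be collected from a tagged corpus is given below. The function and variable names are ours, and the estimates are unsmoothed relative frequencies, whereas the paper smooths the lexicalized probabilities as noted in the footnote.

import numpy as np

def transition_vectors(tagged_sents, categories):
    """Build centroid vectors V_{c^j} and lexicalized vectors V_{c^j,w^k}."""
    index = {c: i for i, c in enumerate(categories)}
    cat_counts = {}   # c^j -> counts of following categories
    lex_counts = {}   # (c^j, w^k) -> counts of following categories
    for sent in tagged_sents:
        for (w, c_prev), (_, c_next) in zip(sent, sent[1:]):
            cat_counts.setdefault(c_prev, np.zeros(len(categories)))
            cat_counts[c_prev][index[c_next]] += 1
            lex_counts.setdefault((c_prev, w), np.zeros(len(categories)))
            lex_counts[(c_prev, w)][index[c_next]] += 1
    centroids = {c: v / v.sum() for c, v in cat_counts.items()}
    lexicalized = {cw: v / v.sum() for cw, v in lex_counts.items()}
    return centroids, lexicalized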
In this situation, it is possible to regard the lexicalized state transition vectors V_{c^j,w^k} of the same category c^j as members of a cluster whose centroid is the state transition vector V_{c^j}. We can then compute the deviation of each lexicalized state transition vector V_{c^j,w^k} from its corresponding centroid:

    D(V_{c^j,w^k}) = (V_{c^j,w^k} - V_{c^j})^T (V_{c^j,w^k} - V_{c^j})    (6)
Figure 2 represents the distribution of lexical- 
ized state transition vectors according to their 
deviations. As you can see in the figure, the 
majority of the vectors are near their centroids 
and only a small number of vectors are very far 
from their centroids. In the first-order context model (without considering lexicalized context), the centroids represent all the members belonging to them.

¹ To alleviate the sparse data problem, we smoothed the lexicalized state transition probabilities with MacKay and Peto's (1995) smoothing technique.

Figure 2: Distribution of Lexicalized Vectors according to Deviation (frequency vs. deviation)

In fact, the deviation of a vector is a
kind of (squared) error for the vector. The error 
for a cluster is 
    e(V_{c^j}) = \sum_{w^k} D(V_{c^j,w^k})    (7)
and the error for the overall model is simply the 
sum of the individual cluster errors: 
    E = \sum_{c^j} e(V_{c^j})    (8)
Now, we could break out a few lexicalized state vectors which have large deviations and make them individual clusters to reduce the error of the given model.
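The following sketch, which assumes the centroid and lexicalized vectors built as in the earlier sketch (numpy arrays keyed by category and by category-word pair), computes the deviations of Equation 6, the error terms of Equations 7 and 8, and selects the vectors to break out; the cut-off parameter n_states stands in for the user-chosen extent of the breaking-out.

def deviation(v_lex, v_centroid):
    """Squared deviation of a lexicalized vector from its centroid (Eq. 6)."""
    d = v_lex - v_centroid
    return float(d @ d)

def cluster_error(centroid, lex_vectors):
    """Error of one cluster, e(V_{c^j}) = sum_k D(V_{c^j,w^k}) (Eq. 7)."""
    return sum(deviation(v, centroid) for v in lex_vectors)

def overall_error(centroids, lexicalized):
    """Overall model error E = sum_j e(V_{c^j}) (Eq. 8)."""
    return sum(deviation(v, centroids[c]) for (c, w), v in lexicalized.items())

def select_lexicalized_states(centroids, lexicalized, n_states):
    """Pick the n_states (category, word) pairs with the largest deviation."""
    scored = [((c, w), deviation(v, centroids[c]))
              for (c, w), v in lexicalized.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in scored[:n_states]]

Breaking the largest-deviation vectors out of their clusters removes the largest terms from Equation 8, so this greedy selection gives the biggest reduction in the overall error for a fixed number of new states.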
As an example, let us consider the preposition cluster. The value of each component of the centroid V_{prep} is illustrated in Figure 3(a), and those of the lexicalized vectors V_{prep,in}, V_{prep,with} and V_{prep,out} are in Figure 3(b), (c) and (d) respectively. As you can see in these figures, most of the prepositions, including in and with, are immediately followed by an article (AT), noun (NN) or pronoun (NP), but the word out as a preposition shows a completely different distribution. Therefore, it would be a good choice to break out the lexicalized vector V_{prep,out} from its centroid V_{prep}.
Figure 4: Splitting the preposition State
From the viewpoint of a network, the state representing preposition is split into two states: one representing ordinary prepositions other than out, and the other representing the special preposition out, which we call a lexicalized state. This process of splitting is illustrated in Figure 4. Splitting a state results in some changes of the parameters. The changes of the parameters resulting from lexicalizing a word w^k in a category c^j are indicated in Table 1 (c^i ranges over c^1 ... c^C). This full splitting will increase the complexity of the model rapidly, so that estimating the parameters may suffer from the sparseness of the data.
To alleviate this, we use pseudo splitting, which leads to a relatively small increase in the number of parameters. The changes of the parameters in pseudo splitting are indicated in Table 2.

Figure 3: Transition Vectors in preposition Cluster (panels: (a) preposition, (b) in as preposition, (c) with as preposition, (d) out as preposition)

Table 1: Changes of Parameters in Full Splitting

    before splitting    after splitting
    P(w^i | c^j)        P(w^i | c^j, w^k),  P(w^i | c^j, ~w^k)
    P(c^i | c^j)        P(c^i | c^j, w^k),  P(c^i | c^j, ~w^k)
    P(c^j | c^i)        P(c^j, w^k | c^i),  P(c^j, ~w^k | c^i)

Table 2: Changes of Parameters in Pseudo Splitting

    before splitting    after splitting
    P(w^i | c^j)        P(w^i | c^j)
    P(c^i | c^j)        P(c^i | c^j, w^k),  P(c^i | c^j, ~w^k)
    P(c^j | c^i)        P(c^j | c^i)
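As a sketch of the bookkeeping behind Tables 1 and 2, the fragment below re-estimates the outgoing contextual probabilities when the word w^k is broken out of category c^j. The data structure (a dictionary of lexicalized bigram counts) and the function name are assumptions of ours; full splitting would additionally split the lexical and incoming probabilities in the same way.

from collections import defaultdict

def split_contextual_probs(bigram_counts, category, word):
    """Re-estimate P(c_i | c^j, w^k) and P(c_i | c^j, ~w^k) after splitting.

    bigram_counts maps (w_prev, c_prev, c_next) to corpus counts.
    """
    lex_counts, rest_counts = defaultdict(int), defaultdict(int)
    for (w_prev, c_prev, c_next), n in bigram_counts.items():
        if c_prev != category:
            continue
        target = lex_counts if w_prev == word else rest_counts
        target[c_next] += n
    def normalize(counts):
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()} if total else {}
    # first result: lexicalized state (c^j, w^k); second: remaining state (c^j, ~w^k)
    return normalize(lex_counts), normalize(rest_counts)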
5 Experimental Result 
We have tested our technique through part-of- 
speech tagging experiments with variously lexicalized Hidden Markov Models. In order to conduct the tagging experiments, we divided the whole Brown (tagged) corpus containing 53,887 sentences (1,113,191 words) into two parts. For the training set, 90% of the
sentences were chosen at random, from which 
we collected all of the statistical data. We re- 
served the other 10% for testing. Table 3 lists 
the basic statistics of our corpus. 
Table 3: Overview of Our Corpora

                    # of sentences    # of words
    training set    48,499            1,001,712
    test set        5,388             111,479
We used a tag set containing 85 categories. 
The amount of ambiguity of the test set is sum- 
marized in Table 4. The second column shows that 52% of the words (57,808 words) are not ambiguous. The tagger attempts to resolve the ambiguity of the remaining words.
Table 4: Amount of Ambiguity of Test Set

    ambiguity(#)    1      2      3      4      total
    ratio(%)        52                          100
Figure 5 and Figure 6 show the results of 
our part-of-speech tagging experiments with the 
"standard" Hidden Markov Model and vari- 
ously lexicalized Hidden Markov Models using the full splitting method and the pseudo splitting method, respectively.
We got 95.7858% of the tags correct when 
we applied the standard Hidden Markov Model 
without any lexicalized states. As the num- 
ber of lexicalized states increases, the tagging 
accuracy increases until the number of lexical- 
ized states becomes 160 (using full splitting) 
and 210 (using pseudo splitting). As you can see in these figures, full splitting improves the performance of the model more rapidly but suffers more severely from the sparseness of the training data. In this experiment, we employed MacKay and Peto's smoothing technique for estimating the parameters required for the models. The best precision, 95.9966%, was obtained with the model with 210 lexicalized states using the pseudo splitting method.
6 Conclusion 
In this paper, we present a method for comple- 
menting the Hidden Markov Models. With this 
method, we lexicalize the Hidden Markov Model selectively and automatically by examining the transition distribution of each state with respect to particular words.
Experimental results showed that the selec- 
tive lexicalization improved the tagging accuracy from about 95.79% to about 96.00%. Using normal tests for statistical significance, we found that the improvement is significant at the 95% level of confidence.
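For reference, a check of this kind could look like the following normal-approximation test on two accuracies measured over the same number of test tokens; the function, its independence assumption, and the 1.96 threshold for the 95% level are our illustrative choices, not a description of the exact test used in the paper.

import math

def accuracy_gain_significant(acc_base, acc_new, n_tokens, z_threshold=1.96):
    """Two-proportion z-test (normal approximation) for an accuracy gain."""
    p_pooled = (acc_base + acc_new) / 2.0
    std_err = math.sqrt(2.0 * p_pooled * (1.0 - p_pooled) / n_tokens)
    z = (acc_new - acc_base) / std_err
    return z, z > z_threshold

# e.g. accuracy_gain_significant(0.9579, 0.9600, 111479) gives z of about 2.5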
The cost of this improvement is minimal.
The resulting network contains 210 additional 
lexicalized states which are found automati- 
cally. Moreover, the lexicalization will not de- 
crease the tagging speed², because the lexi-
calized states and their corresponding original 
states are exclusive in our lexicalized network, 
and thus the rate of ambiguity is not increased 
even if the lexicalized states are included. 
Our approach leaves much room for improve- 
ment. We have so far considered only the outgo- 
ing transitions from the target states. As a re- 
sult, we have discriminated only the words with 
right-associativity. We could also discriminate 
the words with left-associativity by examining 
the incoming transitions to the state. Further- 
more, we could extend the context by using the 
second-order context as represented in Figure 
1(c). We believe that the same technique pre-
sented in this paper could be applied to the pro- 
posed extensions. 
² The Viterbi algorithm for finding the best tags runs in O(n²), where n is the number of states.
Figure 5: POS tagging results with lexicalized HMM using the full splitting method (tagging accuracy vs. number of lexicalized states)
Figure 6: POS tagging results with lexicalized HMM using the pseudo splitting method (tagging accuracy vs. number of lexicalized states)
Appendix: Top 100 words with high deviation
according(IN) rather(IN) out(IN) able(JJ) Aj(NP) no(QL) because(RB) however(RB) as(IN) per(IN) trying(VBG) however(WRB) likely(JJ) United(NP) Mrs.(NP) New(NP) Rhode(NP) National(NP) Miss(NP) tried(VBD) Dr.(NP) lack(NN) much(QL) Mr.(NP) North(NP) June(NP) A.(NP) J.(NP) right(QL) May(NP) ready(JJ) St.(NP) even(QL) various(JJ) don't(DO) instead(RB) far(QL) B(NP) didn't(DOD) try(VB) available(JJ) William(NP) !(.) ?(.) ;(.) number(NN) so(CS) due(JJ) World(NP) Christian(NP) difficult(JJ) tell(VB) going(VBG) kind(NN) let(VB) continue(VB) series(NN) part(NN) radio(NN) sure(JJ) want(VB) front(NN) seem(VB) total(NN) decided(VBD) expected(VBN) right(NN) based(VBN) White(NP) except(IN) told(VBD) James(NP) fact(NN) March(NP) sort(NN) example(NN) designed(VBN) respect(NN) talk(VB) Department(NP) single(AP) Negro(NP) wanted(VBD) Western(NP) yes(RB) become(VBN) necessary(JJ) speak(VB) about(RB) amount(NN) down(IN) like(VB) S.(NP) same(AP) too(RB) General(NP) began(VBD) use(NN) tax(NN) got(VBN)

References 

T. Brants. 1996. Estimating Markov model
structures. In Proceedings of the Fourth In- 
ternational Conference on Spoken Language 
Processing, pages 893-896. 

E. Charniak, C. Hendrickson, N. Jacobson, 
and M. Perkowitz. 1993. Equations for 
part-of-speech tagging. In Proceedings of the 
Eleventh National Conference on Artificial 
Intelligence, pages 784-789. 

K. Church. 1988. A stochastic parts program 
and noun phrase parser for unrestricted text. 
In Proceedings of the Second Conference on 
Applied Natural Language Processing, pages 
136-143. 

S. DeRose. 1988. Grammatical category disam-
biguation by statistical optimization. Com- 
putational Linguistics, 14(1):31-39. 

R. Garside, G. Leech, and G. Sampson. 
1987. The Computational Analysis of En- 
glish. Longman Group. 

J. Kupiec. 1992. Robust part-of-speech tag- 
ging using a hidden Markov model. Computer
Speech and Language, 6:225-242. 

D. MacKay and L. Peto. 1995. A hierarchical 
Dirichlet language model. Natural Language
Engineering, 1(3):289-307. 

B. Merialdo. 1994. Tagging English text with a
probabilistic model. Computational Linguis- 
tics, 20(4):155-171. 

L. Rabiner and B. Juang. 1986. An introduc- 
tion to hidden Markov models. IEEE ASSP
Magazine, pages 4-16, January. 
