WSD system based on Specialized Hidden Markov Model (upv-shmm-eaw)
Antonio Molina, Ferran Pla and Encarna Segarra
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
Camí de Vera s/n València (Spain)
{amolina,fpla,esegarra}@dsic.upv.es
Abstract
We present a supervised approach to Word Sense Disambiguation (WSD) based on Specialized Hidden Markov Models. As training data we used the SemCor corpus and the test data set provided by the Senseval-2 competition, and as dictionary we used WordNet 1.6. We evaluated our system on the English all-words task of the Senseval-3 competition.
1 Description of the WSD System
We consider WSD to be a tagging problem (Molina
et al., 2002a). The tagging process can be formu-
lated as a maximization problem using the Hidden
Markov Model (HMM) formalism. Let $\mathcal{O}$ be the set of output tags considered, and $\mathcal{I}$ the input vocabulary of the application. Given an input sentence $I = i_1, \ldots, i_T$, where $i_j \in \mathcal{I}$, the tagging process consists of finding the sequence of tags ($O = o_1, \ldots, o_T$, where $o_j \in \mathcal{O}$) of maximum probability on the model, that is:
$$\hat{O} = \arg\max_{O} P(O \mid I) = \arg\max_{O} \left( \frac{P(O) \cdot P(I \mid O)}{P(I)} \right); \quad O \in \mathcal{O}^T \qquad (1)$$
Since the probability $P(I)$ is a constant that can be ignored in the maximization process, the problem reduces to maximizing the numerator of equation 1. To solve this equation, the Markov assumptions are made in order to simplify the problem. For a first-order HMM, the problem reduces to solving the following equation:
$$\arg\max_{O} \left( \prod_{j:1 \ldots T} P(o_j \mid o_{j-1}) \cdot P(i_j \mid o_j) \right) \qquad (2)$$
The parameters of equation 2 can be represented as a first-order HMM where each state corresponds to an output tag $o_j$, $P(o_j \mid o_{j-1})$ represents the transition probabilities between states, and $P(i_j \mid o_j)$ represents the probability of emission of input symbols $i_j$ in every state $o_j$. The parameters of this model are estimated by maximum likelihood from semantically annotated corpora using an appropriate smoothing method (linear interpolation in our work).
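As a rough illustration of this estimation step, the following Python sketch computes maximum-likelihood transition and emission estimates from a tagged corpus and smooths the transitions by linear interpolation with the unigram tag distribution. The function name `estimate_hmm`, the interpolation weight `lam`, the `<s>` start symbol and the toy corpus are all assumptions made for the example; the paper does not state how its interpolation weights were set.

```python
from collections import Counter

def estimate_hmm(tagged_sentences, lam=0.9):
    """Estimate first-order HMM parameters from [(symbol, tag), ...] sentences.

    lam is a hypothetical interpolation weight, not a value from the paper.
    """
    bigrams, context = Counter(), Counter()
    unigrams, emissions = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed sentence-start state
        for symbol, tag in sentence:
            bigrams[(prev, tag)] += 1
            context[prev] += 1
            unigrams[tag] += 1
            emissions[(tag, symbol)] += 1
            prev = tag
    total = sum(unigrams.values())

    def transition(prev, tag):
        # Linear interpolation of the bigram ML estimate with the unigram one.
        p_bigram = bigrams[(prev, tag)] / context[prev] if context[prev] else 0.0
        p_unigram = unigrams[tag] / total
        return lam * p_bigram + (1.0 - lam) * p_unigram

    def emission(tag, symbol):
        # Plain maximum-likelihood estimate of P(i_j | o_j).
        return emissions[(tag, symbol)] / unigrams[tag] if unigrams[tag] else 0.0

    return transition, emission

# Toy corpus, invented for illustration only.
corpus = [
    [("the", "notag"), ("interest·1", "1:09:00::")],
    [("the", "notag"), ("interest·1", "1:07:01::")],
]
transition, emission = estimate_hmm(corpus)
```

With this toy corpus, `transition("notag", "1:09:00::")` mixes the bigram estimate (1/2) with the unigram estimate (1/4), so a transition that was never observed still receives a small non-zero probability from the unigram term.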
Different kinds of available linguistic information
can be useful to solve WSD. The training corpus we
used provides as input features: words (W), lemmas
(L) and the corresponding POS tags (P); and it also
provides as output tags the WordNet senses.
WordNet senses can be represented by a sense key which has the form lemma%lex_sense. The high number of different sense keys and the scarcity of annotated training data make the models difficult to estimate. In order to alleviate this sparseness problem, we considered the lex_sense field ($\mathcal{S}$) of the sense key associated with each lemma as the semantic tag. This assumption reduces the size of the output tag set and does not lead to any loss of information, because we can obtain the sense key by concatenating the lemma to the output tag.
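The lossless round trip between sense keys and lex_sense tags can be sketched in a few lines of Python. The helper names `split_sense_key` and `join_sense_key` are hypothetical and introduced only for this illustration.

```python
def split_sense_key(sense_key):
    """Split a 'lemma%lex_sense' sense key into (lemma, lex_sense)."""
    lemma, lex_sense = sense_key.split("%", 1)
    return lemma, lex_sense

def join_sense_key(lemma, lex_sense):
    """Rebuild the full sense key from a lemma and its lex_sense output tag."""
    return f"{lemma}%{lex_sense}"

lemma, tag = split_sense_key("interest%1:09:00::")
restored = join_sense_key(lemma, tag)
```

Here `tag` is the reduced output tag the model has to predict, and `restored` is the original sense key, which shows that nothing is lost as long as the lemma is known at decoding time.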
Therefore, in our system the input vocabulary is $\mathcal{I} = \mathcal{W} \times \mathcal{L} \times \mathcal{P}$, and the output vocabulary is $\mathcal{O} = \mathcal{S}$. In order to incorporate this kind of information into the model we used Specialized HMMs (SHMM) (Molina et al., 2002b). This technique has been successfully applied to other disambiguation tasks such as part-of-speech tagging (Pla and Molina, 2004) and shallow parsing (Molina and Pla, 2002).
Other HMM-based approaches have also been applied to WSD. Segond et al. (1997) estimated a bigram model of ambiguity classes from the SemCor corpus for the task of disambiguating the semantic categories corresponding to the lexicographer level. These semantic categories are codified into the lex_sense field. A second-order HMM was used by Loupy et al. (1998) in a two-step strategy. First, they determined the semantic category associated with a word. Then, they assigned the most probable sense according to the word and the semantic category.
A SHMM consists of changing the topology of the HMM in order to get a more accurate model which includes more information. This is done by means of an initial step prior to the learning process, which consists of redefining the input vocabulary and the output tags. This redefinition is done by means of two processes which transform the training set: the selection process, which is applied to the input vocabulary, and the specialization process, which redefines the output tags.
SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics.
1.1 Selection process
The aim of the selection process is to choose which input features are relevant to the task. This process applies a given selection criterion to $\mathcal{I}$, producing a new input vocabulary ($\tilde{\mathcal{I}}$). This new vocabulary consists of the concatenation of the selected relevant input features.
Taking into account the input vocabulary $\mathcal{I} = \mathcal{W} \times \mathcal{L} \times \mathcal{P}$, some selection criteria could be as follows: to consider only the word ($w_i$), to consider only the lemma ($l_i$), to consider the concatenation of the word and its POS (see footnote 1) ($w_i \cdot p_i$), and to consider the concatenation of the lemma and its POS ($l_i \cdot p_i$). Moreover, different criteria can be applied depending on the kind of word (e.g. distinguishing content and non-content words).
For example, for the input word interest, which has an entry in WordNet and whose lemma and POS are interest and NN (common noun) respectively, the input considered could be interest·1. For a non-content word, such as the article a, we could consider only its lemma a as input.
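A selection criterion of this kind can be sketched as a small Python function that maps each (word, lemma, POS) triple to a single input symbol. The function name `select_input`, the `in_wordnet` flag, the `·` separator and the use of Penn Treebank-style tag prefixes are assumptions made for the example; only the numeric POS classes come from footnote 1.

```python
# POS classes from footnote 1: 1 nouns, 2 verbs, 3 adjectives, 4 adverbs.
# Keying them on Penn Treebank-style two-letter prefixes is an assumption.
POS_CLASS = {"NN": "1", "VB": "2", "JJ": "3", "RB": "4"}

def select_input(word, lemma, pos, in_wordnet):
    """Return the input symbol for one token under a lemma·POS criterion.

    in_wordnet is a hypothetical flag saying whether the word has an entry
    in the dictionary; the word form itself is unused by this criterion.
    """
    pos_class = POS_CLASS.get(pos[:2])
    if in_wordnet and pos_class is not None:
        return f"{lemma}·{pos_class}"   # content word: lemma plus POS class
    return lemma                        # non-content word: lemma only

content_symbol = select_input("interest", "interest", "NN", True)
function_symbol = select_input("a", "a", "DT", False)
```

Other criteria from the list above (word only, lemma only, word plus POS) would differ only in which fields the function returns.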
1.2 Specialization process
The specialization process allows for the codifica-
tion of certain information into the context (that is,
into the states of the model). It consists of redefin-
ing the output tag set by adding information from
the input. This redefinition produces some changes
in the model topology, in order to allow the model
to better capture some contextual restrictions and to
get a more accurate model.
The application of a specialization criterion to $\mathcal{O}$ produces a new output tag set ($\tilde{\mathcal{O}}$), whose elements are the result of the concatenation of some relevant input features to the original output tags.
Taking into account that the POS input feature is already codified in the lex_sense field, only words or lemmas can be considered in the specialization process ($w_i \cdot lex\_sense_i$ or $l_i \cdot lex\_sense_i$).
This specialization can be total or partial depend-
ing on whether we specialize the model with all the
elements of a feature or only with a subset of them.
Footnote 1: We mapped the POS tags to the following tags: 1 for nouns, 2 for verbs, 3 for adjectives and 4 for adverbs.
For instance, the input token interest·1 is tagged with the semantic tag 1:09:00:: in the training data set. If we decide that the lemma interest should specialize the model, then the semantic tag is redefined as interest·1:09:00::. Non-content words, which share the same output tag (the symbol notag in our system), could also be considered to specialize the model. For example, for the word a, the associated specialized output tag could be a·notag.
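Partial specialization by lemma can be sketched as a rewrite of the output tags in the training data. The names `specialize_tag` and `specialize_corpus`, the `·` separator and the triple layout of the corpus are assumptions for this illustration, not the authors' code.

```python
def specialize_tag(lemma, tag, specializing_lemmas):
    """Prefix the tag with the lemma if that lemma specializes the model."""
    if lemma in specializing_lemmas:
        return f"{lemma}·{tag}"
    return tag

def specialize_corpus(sentences, specializing_lemmas):
    """Rewrite sentences of (input_symbol, lemma, tag) triples.

    Returns sentences of (input_symbol, specialized_tag) pairs, ready for
    ordinary HMM training.
    """
    return [
        [(symbol, specialize_tag(lemma, tag, specializing_lemmas))
         for symbol, lemma, tag in sentence]
        for sentence in sentences
    ]

sentences = [[("a", "a", "notag"), ("interest·1", "interest", "1:09:00::")]]
partial = specialize_corpus(sentences, {"interest"})
total = specialize_corpus(sentences, {"a", "interest"})
```

Passing every lemma in the set gives a total specialization, while passing only a subset gives a partial one; an empty set leaves the original lex_sense tags untouched.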
1.3 System scheme
The disambiguation process is presented in Figure 1. First, the original input sentence ($I$) is processed in order to select its relevant features, providing the input sentence ($\tilde{I}$). Then, the semantic tagging is carried out through the Viterbi algorithm using the estimated SHMM. WordNet is used to obtain all the possible semantic tags associated with an input word. If the input word is unknown to the model (i.e., the word has not been seen in the training data set), the system takes the first sense provided by WordNet.
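The decoding step can be illustrated with a minimal Viterbi sketch for a first-order HMM in which a dictionary restricts the candidate tags of each input symbol. Everything in the toy model below (the sense inventory, the probabilities, the `1e-6` floor standing in for smoothing, and the `<s>` start state) is invented for illustration; it is not the authors' implementation, and a real system would query WordNet and use the estimated SHMM parameters.

```python
import math

def viterbi(symbols, candidates, transition, emission, start="<s>"):
    """Most probable tag sequence for `symbols` under a first-order HMM.

    candidates(symbol) lists the tags allowed for that symbol;
    transition(prev, tag) and emission(tag, symbol) return probabilities.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for symbol in symbols:
        new_best = {}
        for tag in candidates(symbol):
            e = log(emission(tag, symbol))
            score, path = max(
                ((s + log(transition(prev, tag)) + e, p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy model: every entry below is made up for the example.
SENSES = {"the": ["notag"],
          "interest·1": ["1:09:00::", "1:07:01::"],
          "bank·1": ["1:14:00::", "1:17:01::"]}
SEEN = {"the", "interest·1"}          # symbols observed in training
TRANS = {("<s>", "notag"): 1.0,
         ("notag", "1:09:00::"): 0.7, ("notag", "1:07:01::"): 0.3}
EMIT = {("notag", "the"): 1.0,
        ("1:09:00::", "interest·1"): 1.0, ("1:07:01::", "interest·1"): 1.0}

def candidates(symbol):
    senses = SENSES[symbol]
    # Unknown symbols fall back to the first sense listed in the dictionary.
    return senses if symbol in SEEN else senses[:1]

def transition(prev, tag):
    return TRANS.get((prev, tag), 1e-6)

def emission(tag, symbol):
    return EMIT.get((tag, symbol), 1e-6)

known = viterbi(["the", "interest·1"], candidates, transition, emission)
unknown = viterbi(["the", "bank·1"], candidates, transition, emission)
```

Working in log space avoids numerical underflow on long sentences. Restricting the candidates of an unseen symbol to its first dictionary sense is one simple way to obtain the first-sense fallback described above.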
The learning process of a SHMM is similar to the learning of a basic HMM. The only difference is that SHMMs are based on an appropriate definition of the input information to the learning process. This information consists of the input features (words, lemmas and POS tags) and the output tag set (senses) provided by the training corpus. A SHMM is built according to the following steps (see Figure 2):
1. To define which available input information is
relevant to the task (selection criterion).
2. To define which input features are relevant to
redefine or specialize the output tag set (spe-
cialization criterion).
3. To apply the chosen criteria to the original
training data set to produce a new one.
4. To learn a model from the new training data
set.
5. To disambiguate a development data set using
that model.
6. To evaluate the output of the WSD system in
order to compare the behavior of the selected
criteria on the development set.
These steps are done using different combina-
tions of input features in order to determine the best
selection criterion and the best total specialization
criterion. Once these criteria are determined, some
partial specializations are tested in order to improve
the performance of the model.
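Figure 1: System Description. [Diagram: the original input sentence ($I$) passes through the selection of relevant features, driven by the selection criterion, to give the input sentence ($\tilde{I}$); the WSD module, using the HMM and WordNet, produces the disambiguated sentence.]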
Figure 2: Learning Phase Description. [Diagram labels: training set; development set; development reference set; selection criterion (1); specialization criterion (2); selection of relevant features and specialization of output tags (3), giving the new training set; learning the model (4); WSD of the input sentence with the HMM and WordNet, giving the disambiguated sentence (5); evaluation (6).]
2 Experimental Work
As training data we used the part of the SemCor corpus which is semantically annotated and supervised for nouns, verbs, adjectives and adverbs (that is, the files contained in the Brown1 and Brown2 folders of the SemCor corpus), together with the test data set provided by Senseval-2. We used 10% of the training corpus as a development data set in order to determine the best selection and specialization criteria.
In the experiments, we used WordNet 1.6 as a dictionary which supplies all the possible senses for a given word. Our system disambiguated all the polysemous lemmas, that is, the coverage of our system was 100% (therefore, precision and recall were the same). For unknown words (words that did not appear in the training data set), we assigned the first sense in WordNet.
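The equality of precision and recall under full coverage follows directly from the usual definitions. As a reminder, with $c$ the number of correctly disambiguated instances, $a$ the number of instances the system attempted, and $n$ the total number of instances (symbols introduced only for this note):

$$P = \frac{c}{a}, \qquad R = \frac{c}{n}, \qquad \text{coverage} = \frac{a}{n}$$

When coverage is 100%, $a = n$, and therefore $P = R$.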
The best selection criterion determined from the experimental work on the development set is as follows: if a word $w_i$ has a sense in WordNet, we take the concatenation of its lemma ($l_i$) and POS ($p_i$) as the input symbol. For non-content words, we consider only their lemma ($l_i$) as input.
The best specialization criterion consisted of selecting the lemmas whose frequency in the training data set was higher than a certain threshold (other specialization criteria could have been chosen, but the frequency criterion has usually worked well in other tasks, as we reported in (Molina and Pla, 2002)). In order to determine which threshold maximized the performance of the model, we conducted a tuning experiment on the development set. The best performance was obtained using the lemmas whose frequency was higher than 20 (about 1,600 lemmas).
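The frequency criterion and its tuning can be sketched as follows. The function names and the `evaluate` callback are hypothetical: `evaluate(threshold)` stands for training a SHMM specialized with the lemmas above that threshold and scoring it on the development set, which is not implemented here.

```python
from collections import Counter

def frequent_lemmas(training_lemmas, threshold=20):
    """Lemmas that occur more than `threshold` times in the training data."""
    counts = Counter(training_lemmas)
    return {lemma for lemma, n in counts.items() if n > threshold}

def best_threshold(thresholds, evaluate):
    """Pick the threshold with the highest development-set score.

    evaluate is an assumed callback: threshold -> score on the development set.
    """
    return max(thresholds, key=evaluate)

# Toy data, invented for illustration: "interest" occurs 21 times, "bank" 20.
lemmas = ["interest"] * 21 + ["bank"] * 20
selected = frequent_lemmas(lemmas, threshold=20)
```

Because the comparison is strict, a lemma occurring exactly 20 times is not selected at a threshold of 20, matching the "higher than" wording above.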
The performance of our system on the Senseval-3 test data set was 60.9% precision and recall.
3 Concluding remarks
In our WSD system, the choice of the best specialization criterion is based on the results of the system on the development set. The tuning experiments included totally specialized models, which is equivalent to considering the sense keys as the output vocabulary; non-specialized models, which is equivalent to considering the lex_sense fields as the output vocabulary; and partially specialized models using different sets of lemmas.
For the best specialization criterion, we have not studied the linguistic characteristics of the different groups of synsets associated with the same lex_sense for non-specialized output tags. We think that we could improve our WSD system through a more adequate definition of the selection and specialization criteria. This definition could be made using semantic knowledge about the domain of the task.
4 Acknowledgments
This work has been supported by the Spanish
research projects CICYT TIC2003-07158-C04-03
and TIC2003-08681-C02-02.

References
C. Loupy, M. El-Beze, and P. F. Marteau. 1998.
Word Sense Disambiguation using HMM Tag-
ger. In Proceedings of the 1st International Con-
ference on Language Resources and Evaluation,
LREC, pages 1255–1258, Granada, Spain, May.
Antonio Molina and Ferran Pla. 2002. Shallow
Parsing using Specialized HMMs. Journal of
Machine Learning Research, 2:595–613.
Antonio Molina, Ferran Pla, and Encarna Segarra.
2002a. A Hidden Markov Model Approach to
Word Sense Disambiguation. In Proceedings
of the VIII Conferencia Iberoamericana de In-
teligencia Artificial, IBERAMIA2002, Sevilla,
Spain.
Antonio Molina, Ferran Pla, and Encarna Segarra.
2002b. Una formulación unificada para resolver
distintos problemas de ambigüedad en PLN. Revista
para el Procesamiento del Lenguaje Natural,
(SEPLN'02), Septiembre.
Ferran Pla and Antonio Molina. 2004. Improv-
ing Part-of-Speech Tagging using Lexicalized
HMMs. Natural Language Engineering, 10. In
press.
F. Segond, A. Schiller, G. Grefenstette, and J-P.
Chanod. 1997. An Experiment in Semantic Tag-
ging using Hidden Markov Model Tagging. In
Proceedings of the Joint ACL/EACL Workshop
on Automatic Information Extraction and Build-
ing of Lexical Semantic Resources, pages 78–81,
Madrid, Spain.
