UBB system at Senseval3
Gabriela Serban
Department of Computer Science
University "Babes-Bolyai"
Romania
gabis@cs.ubbcluj.ro
Doina Tatar
Department of Computer Science
University "Babes-Bolyai"
Romania
dtatar@cs.ubbcluj.ro
Abstract
It is known that whenever a system's actions depend on the meaning of the text being processed, disambiguation is beneficial or even necessary. The Senseval contest is an international framework in which research in this important field is validated in a hierarchical manner. In this paper we present our system, which participated for the first time in the Senseval-3 contest on WSD, held in March-April 2004. We also present our plans for improving the system, which arose from studying the results.
1 Introduction
Word Sense Disambiguation (WSD) is the process of identifying the correct meanings of words in particular contexts (Manning and Schutze, 1999). It is only an intermediate task in NLP, like POS tagging or parsing. Examples of final tasks are Machine Translation, Information Extraction and Dialogue systems. WSD has been a research area in NLP almost since the beginning of the field, owing to the phenomenon of polysemy, in which a single word has multiple related meanings (Widdows, 2003). The most important robust methods in WSD are machine learning methods and dictionary-based methods. While some machine-readable dictionaries exist for English, the best known being WordNet (Fellbaum, 1998), none exists so far for Romanian. Therefore we used the machine learning approach for our application.
2 Machine learning approach in WSD
Our system falls into the supervised learning category. It was trained to learn a classifier that can be used to assign a yet unseen example to one or two of a fixed number of senses. We had a training corpus (a number of annotated contexts), from which the system learned the classifier, and a test corpus, which the system then annotates.
In our system we used the Vector Space Model: a context c was represented as a vector $\vec{c}$ of features, which we present below. By a context we mean what Senseval denotes as a context: the content between <context> and </context>.
The notations used to explain our method are
(Manning and Schutze, 1999):
• $w$ - the word to be disambiguated;
• $s_1, \cdots, s_{N_s}$ - the senses of $w$;
• $c_1, \cdots, c_{N_c}$ - the contexts of $w$;
• $v_1, \cdots, v_{N_f}$ - the selected features.
As we treated each word w to be disambiguated separately, let us explain the method for a single word. The selected features were the set of ALL words used in the training corpus (nouns, verbs, prepositions, etc.), so we used the cooccurrence paradigm (Dagan, Lee and Pereira, 1994).
The vector of a context c of the target word w is defined as:
• $\vec{c} = (w_1, \cdots, w_{|W|})$, where $w_i$ is the number of occurrences of the word $v_i$ in the context c, and $v_i$ is a word from the entire training corpus of $|W|$ words.
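As an illustration of this definition, here is a minimal Python sketch that builds count vectors over the vocabulary of a training corpus. The function names and the toy tokenized contexts are assumptions made for the example; they are not taken from the system itself.

```python
from collections import Counter

def build_vocabulary(contexts):
    """Collect every distinct word of the training contexts, in sorted order."""
    return sorted({word for context in contexts for word in context})

def context_vector(context, vocabulary):
    """Return (w_1, ..., w_|W|): the count of each vocabulary word in the context."""
    counts = Counter(context)
    return [counts[word] for word in vocabulary]

# Toy tokenized contexts, invented for illustration only.
training_contexts = [
    ["the", "nucleus", "of", "the", "cell"],
    ["the", "nucleus", "of", "the", "atom"],
]
vocabulary = build_vocabulary(training_contexts)
vectors = [context_vector(c, vocabulary) for c in training_contexts]
print(vocabulary)
print(vectors)
```

Words of a new context that never occur in the training corpus simply have no coordinate in this representation.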
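The similarity between two contexts $c_a$, $c_b$ is the normalised cosine between the vectors $\vec{c}_a$ and $\vec{c}_b$ (Jurafsky and Martin, 2000):
$$\cos(\vec{c}_a, \vec{c}_b) = \frac{\sum_{j=1}^{m} w_{a,j} \times w_{b,j}}{\sqrt{\sum_{j=1}^{m} w_{a,j}^2 \times \sum_{j=1}^{m} w_{b,j}^2}}$$
and $sim(\vec{c}_a, \vec{c}_b) = \cos(\vec{c}_a, \vec{c}_b)$.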
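The cosine formula can be written directly as a short Python function. This is only a sketch: returning 0 when one of the vectors is all zeros is an assumption, since the text does not cover that case.

```python
import math

def cosine(a, b):
    """Normalised cosine between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # Assumption: a context with an all-zero vector is similar to nothing.
    return dot / norm if norm > 0 else 0.0

# Two toy count vectors sharing three of their four non-zero coordinates.
c_a = [0, 1, 1, 1, 2]
c_b = [1, 0, 1, 1, 2]
print(cosine(c_a, c_b))
```

Here the dot product is 6 and both squared norms are 7, so the similarity is 6/7.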
SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics.
The number $w_i$ is the weight of the feature $v_i$. This can be the frequency of the feature $v_i$ (term frequency, or tf) or the "inverse document frequency", denoted by idf. In our system we considered all the words from the entire corpus, so both these aspects are satisfied.
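For readers unfamiliar with the two weighting schemes, the following Python sketch computes both on toy data. The text does not say which idf variant is meant, so the textbook form log(N / df) used here is an assumption, as are the function names and the example contexts.

```python
import math
from collections import Counter

def term_frequencies(context):
    """tf weights: how often each word occurs in one tokenized context."""
    return Counter(context)

def inverse_document_frequencies(contexts):
    """idf weights as log(N / df), df being the number of contexts containing the word."""
    n_contexts = len(contexts)
    document_frequency = Counter(word for context in contexts for word in set(context))
    return {word: math.log(n_contexts / df) for word, df in document_frequency.items()}

# Toy tokenized contexts, invented for illustration only.
contexts = [
    ["the", "nucleus", "of", "the", "cell"],
    ["the", "nucleus", "of", "the", "atom"],
]
tf = term_frequencies(contexts[0])
idf = inverse_document_frequencies(contexts)
print(tf["the"], idf["the"], idf["cell"])
```

A word that appears in every context, such as "the" here, receives an idf of 0, while a word confined to one context receives a positive weight.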
3 k-NN or memory based learning
At training time, our k-NN model memorizes all the contexts in the training set by their associated features. Later, when a new context $c_{new}$ is processed, the classifier first selects the k contexts in the training set that are closest to $c_{new}$, then picks the best sense (or senses) for $c_{new}$ (Jackson and Moulinier, 2002).
• TRAINING: Calculate $\vec{c}$ for each context c.
• TEST: Calculate
Step 1.
$$A = \{\vec{c} \mid sim(\vec{c}_{new}, \vec{c}) \text{ is maximum}, |A| = k\}$$
that is, A is the set of the k nearest neighbour contexts of $\vec{c}_{new}$.
Step 2.
$$Score(c_{new}, s_j) = \sum_{\vec{c}_i \in A} (sim(\vec{c}_{new}, \vec{c}_i) \times a_{ij})$$
where $a_{ij}$ is 1 if $\vec{c}_i$ has the sense $s_j$ and $a_{ij}$ is 0 otherwise.
Step 3. Finally,
$$s' = \arg\max_{s_j} Score(c_{new}, s_j).$$
We set k to 3 after some experimental verification.
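The three test steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the system's actual code: the training data is a toy list of (vector, sense) pairs, ties between equally similar neighbours are broken by training order, and only the single best sense is returned.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Normalised cosine; 0.0 for an all-zero vector (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def knn_scores(new_vector, training, k=3):
    """Steps 1 and 2: pick the k most similar contexts, then sum similarity per sense."""
    ranked = sorted(training, key=lambda item: cosine(new_vector, item[0]), reverse=True)
    scores = defaultdict(float)
    for vector, sense in ranked[:k]:
        scores[sense] += cosine(new_vector, vector)
    return dict(scores)

def best_sense(new_vector, training, k=3):
    """Step 3: the sense with the highest score."""
    scores = knn_scores(new_vector, training, k)
    return max(scores, key=scores.get)

# Toy (vector, sense) pairs, invented for illustration only.
training = [([1, 0, 1], "s1"), ([1, 1, 0], "s1"), ([0, 1, 1], "s2"), ([0, 0, 1], "s2")]
scores = knn_scores([1, 0, 0], training)
print(scores, best_sense([1, 0, 0], training))
```

Senses that have no representative among the k neighbours do not appear in the returned dictionary; in the notation of Step 2 their score is 0.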
A major problem with supervised approaches is the need for a large sense-tagged training set. Bootstrapping methods start from a small number of contexts labeled with senses that have a high degree of confidence.
These labeled contexts are used as seeds to train an initial classifier. This classifier is then used to extract a larger training set from the remaining untagged contexts. As this process is repeated, the number of training contexts grows and the number of untagged contexts shrinks. The process stops when the remaining unannotated corpus is empty or when no new context can be annotated.
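The bootstrapping loop just described can be sketched in Python as below. Everything concrete here is an assumption made for the example: the confidence threshold of 0.8, the `train` callback interface, and the toy word-overlap classifier, which merely stands in for whatever classifier is trained at each iteration.

```python
def bootstrap(seeds, unlabeled, train, threshold=0.8):
    """Grow the labeled set until nothing is left or nothing new can be annotated."""
    labeled, remaining = list(seeds), list(unlabeled)
    while remaining:
        classify = train(labeled)            # retrain on everything labeled so far
        confident, undecided = [], []
        for context in remaining:
            sense, confidence = classify(context)
            if confidence >= threshold:
                confident.append((context, sense))
            else:
                undecided.append(context)
        if not confident:                    # no new context can be annotated
            break
        labeled.extend(confident)
        remaining = undecided
    return labeled, remaining

def train_overlap(labeled):
    """Hypothetical toy classifier: senses vote by word overlap with labeled contexts."""
    def classify(context):
        votes = {}
        for other, sense in labeled:
            votes[sense] = votes.get(sense, 0) + len(set(context) & set(other))
        total = sum(votes.values())
        if total == 0:
            return None, 0.0
        sense = max(votes, key=votes.get)
        return sense, votes[sense] / total
    return classify

# Toy seeds and untagged contexts, invented for illustration only.
seeds = [(("river", "water"), "shore"), (("money", "loan"), "finance")]
unlabeled = [("water", "boat"), ("boat", "sail"), ("loan", "credit"), ("piano",)]
labeled, remaining = bootstrap(seeds, unlabeled, train_overlap)
print(labeled, remaining)
```

In this toy run the context ("boat", "sail") can only be labeled in the second iteration, after ("water", "boat") has joined the training set, and ("piano",) is never annotated.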
In (Tatar and Serban, 2001) and (Serban and Tatar, 2003) we presented an algorithm which falls into this category. The algorithm is based on the two principles of Yarowsky (Resnik and Yarowsky, 1999):
• One sense per discourse: the sense of a target word is highly consistent within a given discourse (document);
• One sense per collocation: the contextual features (nearby words) provide strong clues to the sense of a target word.
Also, at each iteration, the algorithm uses a Naive Bayes classifier (NBC). We intend to present a second system based on this algorithm at the next Senseval contest.
4 Implementation details
Our disambiguation system is written in Java (JDK 1.4).
In order to improve the performance of the disambiguation algorithm, we made the following refinements to the k-NN algorithm above. The first compensates for the lack of an efficient stemming tool for Romanian.
1. We defined a relation between words, $\delta \subseteq W \times W$, where W is the set of words. If $w_1 \in W$ and $w_2 \in W$ are two words, we say that $(w_1, w_2) \in \delta$ if $w_1$ and $w_2$ have the same grammatical root. Therefore, if w is a word and C is a context, we say that w occurs in C iff there exists a word $w_2 \in C$ such that $(w, w_2) \in \delta$. In other words, we replaced the stemming step with collecting all the words with the same root in a single class. This collection is made according to the rules of Romanian morphology;
2. Step 3 of the algorithm, which chooses the appropriate sense (or senses) of a polysemic word w in a given context C (in fact the sense that attains the maximum of the set $S = \{Score(C, s_j) \mid j = 1, \cdots, N_s\}$ of scores for C), is divided into three sub-steps:
• If there is a single sense s that maximizes S, then s is reported as the appropriate sense for C;
• If there are two senses $s_1$ and $s_2$ that maximize S, then $s_1$ and $s_2$ are reported as the appropriate senses for C;
• Let $Max_1$ and $Max_2$ be the two largest values in S, with $Max_1 > Max_2$. If $Max_1$ is obtained for a sense $s_1$, $Max_2$ is obtained for a sense $s_2$, and
$$Max_1 - Max_2 \le P$$
where $P = \frac{Max_1 - Min}{N_s - 1}$ and Min is the minimum score in S, then $s_1$ and $s_2$ are reported as the appropriate senses for C.
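The three sub-steps of refinement 2 can be sketched in Python as follows, assuming the threshold reads P = (Max1 - Min) / (Ns - 1). The function name and the example scores are invented for illustration. The case of more than two senses tied for the maximum is not covered by the description; reporting only the first two is an assumption of this sketch.

```python
def select_senses(scores):
    """scores maps each of the Ns senses to Score(C, sense); returns one or two senses."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return [sense for sense, _ in ranked]
    (s1, max1), (s2, max2) = ranked[0], ranked[1]
    if max1 == max2:
        # Two senses share the maximum (assumption: further ties are ignored).
        return [s1, s2]
    minimum = ranked[-1][1]
    p = (max1 - minimum) / (len(ranked) - 1)   # threshold P
    if max1 - max2 <= p:
        return [s1, s2]                        # runner-up is close enough to the best
    return [s1]

print(select_senses({"s1": 1.4, "s2": 1.2, "s3": 0.0}))  # gap 0.2, P = 0.7
print(select_senses({"s1": 1.4, "s2": 0.3, "s3": 0.0}))  # gap 1.1, P = 0.7
```

In the first call the gap between the two best scores is below P, so both senses are reported; in the second it exceeds P, so only the best sense is kept.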
Experimentally, we found that the above refinements increase the precision of the disambiguation process.
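Returning to refinement 1, the same-root relation and the resulting notion of "w occurs in C" can be sketched in Python as below. The lookup table is purely hypothetical and stands in for the Romanian morphology rules, which are not listed in this paper; only the shape of the test is meant to be illustrative.

```python
# Hypothetical root table standing in for the morphological rules (not from the paper).
ROOT_TABLE = {"desenez": "desena", "desenat": "desena", "nuclee": "nucleu"}

def root_of(word):
    """Map a word form to the representative of its same-root class."""
    return ROOT_TABLE.get(word, word)

def same_root(w1, w2):
    """True when the two words have the same grammatical root."""
    return root_of(w1) == root_of(w2)

def occurs_in(word, context):
    """w occurs in C iff some word of C has the same root as w."""
    return any(same_root(word, other) for other in context)

context = ["copilul", "a", "desenat", "o", "casa"]
print(occurs_in("desena", context), occurs_in("nucleu", context))
```

With such a test in place of exact string matching, all inflected forms of a word contribute to the same coordinate of the context vector.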
5 Conclusions after the evaluation
The coarse-grained score for our system UBB, using the key "EVAL/RomanianLS.test.key", was:
precision: 0.722 (2555.00 correct of 3541.00 attempted)
recall: 0.722 (2555.00 correct of 3541.00 in total)
attempted: 100.00
The fine-grained score was:
precision: 0.671 (2376.50 correct of 3541.00 attempted)
recall: 0.671 (2376.50 correct of 3541.00 in total)
attempted: 100.00
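The reported figures can be checked with a few lines of Python: precision is the number of correct answers divided by the number attempted, and recall divides by the total number of test contexts. Because every context was attempted, the two coincide. The function name is an assumption made for the example.

```python
def precision_recall(correct, attempted, total):
    """Precision = correct / attempted; recall = correct / total."""
    return correct / attempted, correct / total

coarse = precision_recall(2555.0, 3541, 3541)
fine = precision_recall(2376.5, 3541, 3541)
print(f"coarse: precision={coarse[0]:.3f} recall={coarse[1]:.3f}")
print(f"fine:   precision={fine[0]:.3f} recall={fine[1]:.3f}")
```

The fractional count of 2376.50 correct answers presumably reflects partial credit for contexts in which the system reported two senses; that is our reading of the figure rather than something the scorer output states.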
Taking the majority sense as the baseline procedure (all contexts are resolved with the most frequent sense in the training corpus), the baseline obtains a precision of 0.78 for the word nucleu (noun), while our procedure obtained 0.81. Likewise, for the word desena (verb) the majority-sense baseline obtains a precision of 0.81, while our procedure obtained 0.85.
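The majority-sense baseline is simple enough to state in a few lines of Python. The sense labels below are toy data invented for the example and are unrelated to the figures reported for nucleu and desena.

```python
from collections import Counter

def majority_sense(training_senses):
    """The most frequent sense among the training annotations of one word."""
    return Counter(training_senses).most_common(1)[0][0]

def baseline_precision(training_senses, test_senses):
    """Label every test context with the majority sense and measure precision."""
    guess = majority_sense(training_senses)
    hits = sum(1 for sense in test_senses if sense == guess)
    return hits / len(test_senses)

# Toy sense labels for a single word, invented for illustration only.
train = ["a", "a", "a", "b"]
test = ["a", "b", "a", "a", "b"]
print(majority_sense(train), baseline_precision(train, test))
```

Since the baseline answers every context, its precision and recall are equal, which makes it directly comparable with the scores of a system that also attempts every context.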
At this stage our system does not aim to label a context with U (unknown); it always chooses one or two of the best-scored senses. Annotating with the label U is one of our planned improvements. This can be done simply by adding the sense U as a new sense for each word. A simple experiment along these lines produced a number of correctly annotated contexts.
Another direction for improving our system is to make better use of the senses as they are given in the training corpus: our system simply considers the first sense.

References
I. Dagan, L. Lee and F. C. N. Pereira. 1994. Similarity-based Estimation of Word Cooccurrence Probabilities. Meeting of the Association for Computational Linguistics, 272-278.
Christiane Fellbaum. 1998. WordNet: An electronic lexical database. The MIT Press.
P. Jackson and I. Moulinier. 2002. Natural Language Processing for Online Applications. John Benjamins Publ. Company.
D. Jurafsky and J. Martin. 2000. Speech and language processing. Prentice-Hall, NJ.
C. Manning and H. Schutze. 1999. Foundations of statistical natural language processing. The MIT Press.
Ruslan Mitkov, editor. 2002. The Oxford Handbook of Computational Linguistics. Oxford University Press.
P. Resnik and D. Yarowsky. 1999. Distinguishing Systems and Distinguishing Senses: new evaluation methods for WSD. Natural Language Engineering, 5(2):113-134.
G. Serban and D. Tatar. 2003. Word Sense Disambiguation for Untagged Corpus: Application to Romanian Language. CICLing-2003, LNCS 2588, 270-275.
D. Tatar and G. Serban. 2001. A new algorithm for WSD. Studia Univ. "Babes-Bolyai", Informatica, 2, 99-108.
D. Widdows. 2003. A mathematical model for context and word meaning. Fourth International Conference on Modeling and Using Context, Stanford, California, June 23-25.
