UBBNBC WSD System Description 
Csomai András
Department of Computer Science 
Babes-Bolyai University 
Cluj-Napoca, Romania 
csomaia@personal.ro 
 
 
Abstract 
The Naïve Bayes classifier has proved to be a well-performing tool in word sense disambiguation, although it had not yet been applied to the Romanian language. The aim of this paper is to present our WSD system, based on the Naïve Bayes classifier (NBC) algorithm, which performed quite well in Senseval 3.
1 Introduction 
According to the literature, the NBC algorithm is very efficient; in many cases it outperforms more sophisticated methods (Pedersen, 1998). Therefore, this is the approach we used in our research. The word sense disambiguation process has three major steps, so the application has three main components:
Stemming – removal of suffixes and filtering of irrelevant information out of the corpora, using a simple dictionary-based approach.
Learning – training of the classifier on the sense-tagged corpora; a database containing the co-occurrence counts is built.
Disambiguating – on the basis of the database, the correct sense of a word in a given context is estimated.
In the following sections these three steps are described in detail.
 
 
2 Stemming 
The preprocessing of the corpora is one of the steps that most influences the results. It consists of the removal of suffixes and the elimination of irrelevant data. The removal of suffixes is performed through a simple dictionary-based method.
For every word w_i, the possible candidates w_j are selected from the dictionary containing the word stems. Then a similarity score is calculated between the word to be stemmed and each candidate as follows, where l_i and l_j denote the lengths of words i and j, respectively:

score_i = 2 * l_i / (l_i + l_j)  if l_i ≤ l_j,  and score_i = 0 otherwise.
 
The result is the candidate with the highest score if its score is above a certain threshold; otherwise the word is left untouched.
In the preprocessing phase we also remove the pronouns and prepositions from the examined context. This exclusion is based on a list of stop words.
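The stemming step above can be sketched in Python. The candidate selection strategy (here: dictionary stems that are prefixes of the word) and the threshold value are assumptions, since the paper does not specify them:

```python
def stem(word, stem_dictionary, threshold=0.8):
    """Dictionary-based stemming sketch.

    stem_dictionary is a hypothetical collection of known word stems;
    prefix matching and the default threshold are assumptions.
    """
    best_stem, best_score = word, 0.0
    for candidate in stem_dictionary:
        # Only candidates no longer than the word itself can match.
        if len(candidate) <= len(word) and word.startswith(candidate):
            # score = 2 * l_i / (l_i + l_j), with l_i the shorter length
            score = 2 * len(candidate) / (len(candidate) + len(word))
            if score > best_score:
                best_stem, best_score = candidate, score
    # Leave the word untouched if no candidate scores above the threshold.
    return best_stem if best_score >= threshold else word
```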
3 Learning 
The training is conducted according to the NBC algorithm. First a database is built, with the following tables:
words – contains all the words found in the corpora. Its role is to assign a word id to every word.
wordsenses – contains all the tagged words in the corpora linked with their possible senses; one entry for a given sense and word.
nosenses – number of tagged contexts with a given sense.
nocontexts – number of tagged contexts of a given word.
occurrences – number of co-occurrences of a given word with a given sense.

(SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics.)
 
 
Figure 1: The tables of the database
 
Training the system amounts to filling up the tables of the database:
fill NoSenses
fill NoContexts
fill WordSenses
scan corpora
   c_akt = actual entry in corpora (a context)
   w = actual word in entry (the ambiguous word)
   s_k = actual sense of entry
   scan c_akt
      v_j = actual word in entry
      if v_j <> w then
         if v_j in words then
            v_i = wordid from words where word = v_j
         else
            add v_j to words; v_i = its new wordid
         endif
         if (exists entry in occurrences where wordid = v_i and senseid = s_k) then
            increment C(wordid, senseid) in occurrences
               where wordid = v_i and senseid = s_k
         else
            add occurrences(wordid, senseid, 1)
         endif
      endif
      step to next word
   endscan
   step to next entry
endscan
The database is thus filled (and the system trained) using only the training corpus provided for the Senseval 3 Romanian Lexical Sample task.
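Under the simplifying assumption of in-memory tables instead of a database, the training loop above might be sketched as follows; the corpus format, an iterable of (context words, target word, sense) triples, is hypothetical:

```python
from collections import defaultdict

def train(corpus):
    """Fill the count tables from a sense-tagged corpus.

    corpus: iterable of (context_words, target_word, sense) triples,
    a hypothetical in-memory stand-in for the tagged training data.
    """
    nosenses = defaultdict(int)    # C(s_k): contexts tagged with sense s_k
    nocontexts = defaultdict(int)  # C(w): tagged contexts of word w
    wordsenses = defaultdict(set)  # possible senses of each ambiguous word
    occurrences = defaultdict(int) # C(v_j, s_k): co-occurrence counts

    for context, w, sense in corpus:
        nosenses[sense] += 1
        nocontexts[w] += 1
        wordsenses[w].add(sense)
        for v in context:
            if v != w:  # skip the ambiguous word itself
                occurrences[(v, sense)] += 1
    return nosenses, nocontexts, wordsenses, occurrences
```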
4 Disambiguation 
The basic assumption of the Naïve Bayes method is that the contextual features are independent of each other. In this particular case, we assume that the probability of co-occurrence of a word v_i with the ambiguous word w of sense s does not depend on the other co-occurrences.
The goal is to find the correct sense s′ of the word w for a given context. This sense s′ maximizes the following expression:
s′ = argmax_k P(s_k | c)
   = argmax_k P(c | s_k) P(s_k) / P(c)
   = argmax_k P(c | s_k) P(s_k)
At this point we make the simplifying “naïve” as-
sumption: 
 
P(c | s_k) = ∏_{v_j ∈ c} P(v_j | s_k)
The algorithm (Tătar, 2003) for estimating the correct sense of word w according to its context c is the following:
for every sense s_k of w do
   score(s_k) = P(s_k)
   for every v_j from context c do
      score(s_k) = score(s_k) * P(v_j | s_k)
s′ = argmax_k score(s_k)
where s′ is the estimated sense, v_j is the j-th word of the context, and s_k is the k-th possible sense of word w.
P(s_k) and P(v_j | s_k) are calculated as follows:
 
P(s_k) = C(s_k) / C(w)
P(v_j | s_k) = C(v_j, s_k) / C(s_k)

where C(w) is the number of tagged contexts for word w, C(v_j, s_k) is the number of occurrences of word v_j in contexts tagged with sense s_k, and C(s_k) is the number of contexts tagged with sense s_k.
 
The values are obtained from the database as follows: C(w) from nocontexts, C(v_j, s_k) from occurrences, and C(s_k) from nosenses. The wordsenses table is used to determine the possible senses of a given word.
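A minimal sketch of the disambiguation step, assuming the count tables nosenses, nocontexts, wordsenses and occurrences are plain Python dictionaries filled during training. No smoothing is applied, so an unseen co-occurrence contributes a zero factor; the paper does not describe smoothing either:

```python
def disambiguate(context, w, nosenses, nocontexts, wordsenses, occurrences):
    """Return the sense s maximizing P(s) * product of P(v | s) over context."""
    best_sense, best_score = None, -1.0
    for s in wordsenses[w]:
        score = nosenses[s] / nocontexts[w]  # P(s_k) = C(s_k) / C(w)
        for v in context:
            if v != w:
                # P(v_j | s_k) = C(v_j, s_k) / C(s_k); unseen pairs count as 0
                score *= occurrences.get((v, s), 0) / nosenses[s]
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense
```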
 
5 Evaluation 
The described system was evaluated at Senseval 3. The output was not weighted; therefore, for every ambiguous word at most one solution (estimated sense) was provided. The results achieved are the following:
 
            score   correct/attempted
precision   0.710   2415 correct of 3403 attempted
recall      0.682   2415 correct of 3541 in total
attempted   96.10%  3403 attempted of 3541 in total

Figure 2: Fine-grained score
 
            score   correct/attempted
precision   0.750   2551 correct of 3403 attempted
recall      0.720   2551 correct of 3541 in total
attempted   96.10%  3403 attempted of 3541 in total

Figure 3: Coarse-grained score
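The scores in the two tables follow the usual definitions and can be checked directly from the reported counts:

```python
# Fine-grained figures reported above.
correct, attempted, total = 2415, 3403, 3541

precision = correct / attempted   # correct answers among answers given
recall = correct / total          # correct answers among all test instances
coverage = attempted / total      # fraction of test instances answered

print(round(precision, 3), round(recall, 3), round(coverage, 3))
```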
 
A simple test was made before the Senseval 3 evaluation. The system was trained on 90% of the Romanian Lexical Sample training corpus and tested on the remaining 10%. The selection was random, with a uniform distribution. A coarse-grained score was computed and compared to a baseline score. The baseline method consists of determining the most frequent sense for every word (based on the training corpus); in the evaluation phase this sense is always assigned.
 
 
           UBBNBC   Baseline
recall      0.66     0.56
precision   0.69     0.56

Figure 4: Comparison of UBBNBC with the baseline
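The most-frequent-sense baseline described above can be sketched as follows; the (word, sense) pair format for the corpora is a hypothetical simplification, since only the target word and its tag matter here:

```python
from collections import Counter, defaultdict

def most_frequent_sense_baseline(train_pairs, test_pairs):
    """Always predict, for each word, the sense seen most often in training."""
    counts = defaultdict(Counter)
    for word, sense in train_pairs:
        counts[word][sense] += 1
    # Most frequent sense per word, taken from the training corpus.
    mfs = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    # Accuracy on the held-out pairs (unseen words are counted as wrong).
    correct = sum(1 for word, sense in test_pairs if mfs.get(word) == sense)
    return correct / len(test_pairs)
```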

References 
Ted Pedersen. 1998. Naïve Bayes as a Satisficing Model. In Working Notes of the AAAI Spring Symposium on Satisficing Models, Palo Alto, CA.
Doina Tătar. 2003. Inteligență artificială - Aplicații în prelucrarea limbajului natural (Artificial Intelligence - Applications in Natural Language Processing). Editura Albastră, Cluj-Napoca, Romania.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts.
