Pattern Abstraction and Term Similarity for Word Sense Disambiguation:
IRST at Senseval-3
Carlo Strapparava and Al o Gliozzo and Claudio Giuliano
ITC-irst, Istituto per la Ricerca Scienti ca e Tecnologica, I-38050 Trento, ITALY
fstrappa, gliozzo, giulianog@itc.it
Abstract
This paper summarizes IRST’s participation in
Senseval-3. We participated both in the English all-
words task and in some lexical sample tasks (En-
glish, Basque, Catalan, Italian, Spanish). We fol-
lowed two perspectives. On one hand, for the all-
words task, we tried to re ne the Domain Driven
Disambiguation that we presented at Senseval-2.
The re nements consist of both exploiting a new
technique (Domain Relevance Estimation) for do-
main detection in texts, and experimenting with the
use of Latent Semantic Analysis to avoid reliance on
manually annotated domain resources (e.g. WORD-
NET DOMAINS). On the other hand, for the lexical
sample tasks, we explored the direction of pattern
abstraction and we demonstrated the feasibility of
leveraging external knowledge using kernel meth-
ods.
1 Introduction
The starting point for our research in the Word
Sense Disambiguation (WSD) area was to explore
the use of semantic domains in order to solve lex-
ical ambiguity. At the Senseval-2 competition we
proposed a new approach to WSD, namely Domain
Driven Disambiguation (DDD). This approach con-
sists of comparing the estimated domain of the con-
text of the word to be disambiguated with the do-
mains of its senses, exploiting the property of do-
mains to be features of both texts and words. The
domains of the word senses can be either inferred
from the learning data or derived from the informa-
tion in WORDNET DOMAINS.
For Senseval-3, we re ned the DDD methodol-
ogy with a fully unsupervised technique - Domain
Relevance Estimation (DRE) - for domain detection
in texts. DRE is performed by an expectation maxi-
mization algorithm for the gaussian mixture model,
which is exploited to differentiate relevant domain
information in texts from noise. This re ned DDD
system was presented in the English all-words task.
Originally DDD was developed to assess the use-
fulness of domain information for WSD. Thus it
did not exploit other knowledge sources commonly
used for disambiguation (e.g. syntactic patterns or
collocations). As a consequence the performance of
the DDD system is quite good for precision (it dis-
ambiguates well the  domain words), but as far as
recall is concerned it is not competitive compared
with other state of the art techniques. On the other
hand DDD outperforms the state of the art for unsu-
pervised systems, demonstrating the usefulness of
domain information for WSD.
In addition, the DDD approach requires domain
annotations for word senses (for the experiments we
used WORDNET DOMAINS, a lexical resource de-
veloped at IRST). Like all manual annotations, such
an operation is costly (more than two man years
have been spent for labeling the whole WORDNET
DOMAINS structure) and affected by subjectivity.
Thus, one drawback of the DDD methodology was
a lack of portability among languages and among
different sense repositories (unless we have synset-
aligned WordNets).
Besides the improved DDD, our other proposals
for Senseval-3 constitute an attempt to overcome
these previous issues.
To deal with the problem of having a domain-
annotated WORDNET, we experimented with a
novel methodology to automatically acquire domain
information from corpora. For this aim we esti-
mated term similarity from a large scale corpus, ex-
ploiting the assumption that semantic domains are
sets of very closely related terms. In particular we
implemented a variation of Latent Semantic Analy-
sis (LSA) in order to obtain a vector representation
for words, texts and synsets. LSA performs a di-
mensionality reduction in the feature space describ-
ing both texts and words, capturing implicitly the
notion of semantic domains required by DDD. In
order to perform disambiguation, LSA vectors have
been estimated for the synsets in WORDNET. We
participated in the English all-words task also with
a  rst prototype (DDD-LSA) that exploits LSA in-
stead of WORDNET DOMAINS.
                                             Association for Computational Linguistics
                        for the Semantic Analysis of Text, Barcelona, Spain, July 2004
                 SENSEVAL-3: Third International Workshop on the Evaluation of Systems
Task Systems
English All-Words DDD DDD-LSA
English Lex-sample Kernels-WSD Ties
Italian Lex-Sample Kernels-WSD Ties
Basque Lex-Sample Kernels-WSD
Catalan Lex-Sample Kernels-WSD
Spanish Lex-Sample Kernels-WSD
Table 1: IRST participation at Senseval-3
As far as lexical sample tasks are concerned, we
participated in the English, Italian, Spanish, Cata-
lan, and Basque tasks. For these tasks, we ex-
plored the direction of pattern abstraction for WSD.
Pattern abstraction is an effective methodology for
WSD (Mihalcea, 2002). Our preliminary experi-
ments have been performed using TIES, a general-
ized Information Extraction environment developed
at IRST that implements the boosted wrapper induc-
tion algorithm (Freitag and Kushmerick, 2000). The
main limitation of such an approach is, once more,
the integration of different knowledge sources. In
particular, paradigmatic information seems hard to
be represented in the TIES framework, motivating
our decision to exploit kernel methods for WSD.
Kernel methods is an area of recent interest in
Machine Learning. Kernels are similarity functions
between instances that allows to integrate different
knowledge sources and to model explicitly linguis-
tic insights inside the powerful framework of sup-
port vector machine classi cation. For Senseval-3
we implemented the Kernels-WSD system, which
exploits kernel methods to perform the following
operations: (i) pattern abstraction; (ii) combination
of different knowledge sources, in particular domain
information and syntagmatic information; (iii) inte-
gration of unsupervised term proximity estimation
in the supervised framework.
The paper is structured as follows. Section 2 in-
troduces LSA and its relations with semantic do-
mains. Section 3 presents the systems for the En-
glish all-words task (i.e. DDD and DDD-LSA). In
section 4 our supervised approaches are reported.
In particular the TIES system is described in section
4.1, while the approach based on kernel methods is
discussed in section 4.2.
2 Semantic Domains and LSA
Domains are common areas of human discussion,
such as economics, politics, law, science etc., which
are at the basis of lexical coherence. A substantial
part of the lexicon is composed by  domain words ,
that refer to concepts belonging to speci c domains.
In (Magnini et al., 2002) it has been claimed that
domain information provides generalized features at
the paradigmatic level that are useful to discriminate
among word senses.
The WORDNET DOMAINS1 lexical resource
is an extension of WORDNET which provides
such domain labels for all synsets (Magnini and
Cavagli a, 2000). About 200 domain labels were se-
lected from a number of dictionaries and then struc-
tured in a taxonomy according to the Dewey Deci-
mal Classi cation (DDC). The annotation method-
ology was mainly manual and took about 2 person
years.
WORDNET DOMAINS has been proven a useful
resource for WSD. However some aspects induced
us to explore further developments. These issues
are: (i) it is dif cult to  nd an objective a-priori
model for domains; (ii) the annotation procedure
followed to develop WORDNET DOMAINS is very
expensive, making hard the replicability of the lexi-
cal resource for other languages or domain speci c
sub-languages; (iii) the domain distinctions are rigid
in WORDNET DOMAINS, while a more  fuzzy as-
sociation between domains and concepts is often
more appropriate to describe term similarity.
In order to generalize the domain approach and to
overcome these issues, we explored the direction of
unsupervised learning on a large-scale corpus (we
used the BNC corpus for all the experiments de-
scribed in this paper).
In particular, we followed the LSA approach
(Deerwester et al., 1990). In LSA, term co-
occurrences in the documents of the corpus are cap-
tured by means of a dimensionality reduction oper-
ated on the term-by-document matrix. The result-
ing LSA vectors can be exploited to estimate both
term and document similarity. Regarding document
similarity, Latent Semantic Indexing (LSI) is a tech-
nique that allows one to represent a document by
a LSA vector. In particular, we used a variation
of the pseudo-document methodology described in
(Berry, 1992). Each document can be represented in
the LSA space by summing up the normalized LSA
vectors of all the terms contained in it.
By exploiting LSA vectors for terms, it is pos-
sible to estimate domain vectors for the synsets of
WORDNET, in order to obtain similarity values be-
tween concepts that can be used for synset cluster-
ing and WSD. Thus, term and document vectors can
be used instead of WORDNET DOMAINS for WSD
and other applications in which term similarity and
domain relevance estimation is required.
1WORDNET DOMAINS is freely available for research pur-
poses at wndomains.itc.it
3 All-Words systems: DDD and DDD-LSA
DDD with DRE. DDD assignes the right sense of
a word in its context comparing the domain of the
context to the domain of each sense of the word.
This methodology exploits WORDNET DOMAINS
information to estimate both the domain of the tex-
tual context and the domain of the senses of the
word to disambiguate.
The basic idea to estimate domain relevance for
texts is to exploit lexical coherence inside texts. A
simple heuristic approach to this problem, used in
Senseval-2, is counting the occurrences of domain
words for every domain inside the text: the higher
the percentage of domain words for a certain do-
main, the more relevant the domain will be for the
text.
Unfortunately, the simple local frequency count
is not a good domain relevance measure for sev-
eral reasons. Indeed irrelevant senses of ambigu-
ous words contribute to augment the  nal score of
irrelevant domains, introducing noise. Moreover,
the level of noise is different for different domains
because of their different sizes and possible dif-
ferences in the ambiguity level of their vocabular-
ies. We re ned the original Senseval-2 DDD system
with the Domain Relevance Estimation (DRE) tech-
nique. Given a certain domain, DRE distinguishes
between relevant and non-relevant texts by means
of a Gaussian Mixture model that describes the fre-
quency distribution of domain words inside a large-
scale corpus (in particular we used the BNC corpus
also in this case). Then, an Expectation Maximiza-
tion algorithm computes the parameters that maxi-
mize the likelihood of the model on the empirical
data (Gliozzo et al., 2004).
In order to represent domain information we in-
troduced the notion of Domain Vectors (DV), which
are data structures that collect domain information.
These vectors are de ned in a multidimensional
space, in which each domain represents a dimen-
sion of the space. We distinguish between two
kinds of DVs: (i) synset vectors, which represent
the relevance of a synset with respect to each con-
sidered domain and (ii) text vectors, which repre-
sent the relevance of a portion of text with respect
to each domain in the considered set. The core of
the DDD algorithm is based on scoring the compar-
ison of these kinds of vectors. The synset vectors
are built considering WORDNET DOMAINS, while
in the calculation of scoring the system takes into
account synset probabilities on SemCor. The sys-
tem makes use of a threshold th-cut, ranging in the
interval [0,1], that allows us to tune the tradeoff be-
tween precision and recall.
th-cut Prec Recall Attempted
0.0 0.583 0.583 99.76
0.9 0.729 0.441 60.51
Table 2: DDD on the English all-words task.
Latent Semantic Domains for DDD. As seen in
Section 2, it is possible to implement a DDD ver-
sion that does not use WORDNET DOMAINS and
instead it exploits LSA term and document vectors
for estimating synset vectors and text vectors, leav-
ing the core of DDD algorithm unchanged. As for
text vectors, we used the psedo-document technique
also for building synset vectors: in this case we con-
sider the synonymous terms contained in the synset
itself.
The system presented at Senseval-3 does not
make use of any statistics on SemCor, and conse-
quently it can be considered fully unsupervised. Re-
sults are reported in table 3 and do not differ much
from the results obtained by DDD in the same task.
th-cut Prec Recall Attempted
0.5 0.661 0.496 75.01
Table 3: DDD-LSA on the English all-words task.
4 Lexical Sample Systems: Pattern
abstraction and Kernel Methods
One of the most discriminative features for lexi-
cal disambiguation is the lexical/syntactic pattern in
which the word appears. A well known issue in the
WSD area is the one sense per collocation claim
(Yarowsky, 1993) stating that the word meanings
are strongly associated with the particular colloca-
tion in which the word is located. Collocations are
sequences of words in the context of the word to
disambiguate, and can be associated to word senses
performing supervised learning.
Another important knowledge source for WSD is
the shallow-syntactic pattern in which a word ap-
pears. Syntactic patterns, like lexical patterns, can
be obtained by exploiting pattern abstraction tech-
niques on POS sequences. In the WSD literature
both lexical and syntactic patterns have been used
as features in a supervised learning schema by rep-
resenting each instance using bigrams and trigrams
in the surrounding context of the word to be ana-
lyzed2.
2More recently deep-syntactic features have been also con-
sidered by several systems, as for example modi ers of nouns
and verbs, object and subject of the sentence, etc. In order to
Representing each instance by a  bag of features 
presents several disadvantages from the point of
view of both machine learning and computational
linguistics: (1) Sparseness in the learning data: most
of the collocations found in the learning data occur
just once, reducing the generalization power of the
learning algorithm. In addition most of the collo-
cations found in the test data are often unseen in
the training data. (2) Low  exibility for pattern ab-
straction purposes: bigram and trigram extraction
schemata are  xed in advance. (3) Knowledge ac-
quisition bottleneck: the size of the training data is
not large enough to cover each possible collocation
in the language.
To overcome problems 1 and 2 we investigated
some pattern abstraction techniques from the area
of Information Extraction (IE) and we adapted them
to WSD. To overcome problem 3 we developed La-
tent Semantic Kernels, which allow us to integrate
external knowledge provided by unsupervised term
similarity estimation.
4.1 TIES
Our  rst experiments have been performed exploit-
ing TIES, an environment developed at IRST for IE
that induces patterns from the marked entities in the
training phase, and then applies those patterns in the
test phase in order to assign a category if the pat-
tern is satis ed. For our experiments, we used the
Boosted Wrapper Induction (BWI) algorithm (Fre-
itag and Kushmerick, 2000) that is implemented in
TIES.
For Senseval-3 we used very few features (lemma
and POS). We proposed the system in this con gu-
ration as a  baseline system for pattern abstraction.
Task Prec Recall Attempted
English LS 0.706 0.505 71.50
English LS (coarse) 0.767 0.548 71.50
Italian LS 0.552 0.309 55.92
Table 4: Performance of the TIES system
Our preliminary experiments with BWI have
shown that pattern abstraction is very attractive for
WSD, allowing us to achieve a very high precision
for a restricted number of words, in which the syn-
tagmatic information is suf cient for disambigua-
tion. However, we still had some restrictions. In
particular, the integration with different knowledge
sources for classi cation is not trivial.
obtain such features parsing of the data is required. However,
we decided to do not use such information, while we plan to
introduce it in the next future.
4.2 Kernel-WSD
Our choice of exploiting kernel methods for WSD
has been motivated by the observation that pattern-
based approaches for disambiguation are comple-
mentary to the domain based ones: they require dif-
ferent knowledge sources and different techniques
for classi cation and feature description. Both ap-
proaches have to be simultaneously taken into ac-
count in order to perform accurate disambiguation.
Our aim was to combine them into a common
framework.
Kernel methods, e.g. Support Vector Machines
(SVMs), are state-of-the-art learning algorithms,
and they are successfully adopted in many NLP
tasks.
The idea of SVM (Cristianini and Shawe-Taylor,
2000) is to map the set of training data into a higher-
dimensional feature space F via a mapping func-
tion  : @ ! F, and construct a separating hy-
perplane with maximum margin (distance between
planes and closest points) in the new space. Gen-
erally, this yields a nonlinear decision boundary in
the input space. Since the feature space is high di-
mensional, performing the transformation has of-
ten a high computational cost. Rather than use the
explicit mapping  , we can use a kernel function
K : @ @!< , that corresponds to the inner prod-
uct in a feature space which is, in general, different
from the input space.
Therefore, a kernel function provides a way
to compute (ef ciently) the separating hyperplane
without explicitly carrying out the map  into the
feature space - this is called the kernel trick. In this
way the kernel acts as an interface between the data
and the learning algorithm by de ning an implicit
mapping into the feature space. Intuitively, we can
see the kernel as a function that measures the sim-
ilarity between pairs of objects. The learning algo-
rithm, which compares all pairs of data items, ex-
ploits the information encoded in the kernel. An
important characteristic of kernels is that they are
not limited to vector objects but are applicable to
virtually any kind of object representation.
In this work we use kernel methods to combine
heterogeneous sources of information that we found
relevant for WSD. For each of these aspects it is
possible to de ne kernels independently. Then they
are combined by exploiting the property that the
sum of two kernels is still a kernel (i.e. k(x;y) =
k1(x;y) + k2(x;y)), taking advantage of each sin-
gle contribution in an intuitive way3.
3In order to keep the kernel values comparable for dif-
ferent values and to be independent from the length of the
examples, we considered the normalized version ^K(x; y) =
lsa Task Prec Recall Attempted MF-Baseline
? English LS 0.726 0.726 100 0.552
? English LS (coarse) 0.795 0.795 100 0.645
- English LS (no-lsa) 0.704 0.704 100 0.552
- Basque LS 0.655 0.655 100 0.558
- Italian LS 0.531 0.531 100 0.183
- Catalan LS 0.858 0.846 98.62 0.663
- Spanish LS 0.842 0.842 100 0.677
Table 5: Performance of the Kernels-WSD system
The Word Sense Disambiguation Kernel is de-
 ned in this way:
KWSD(x; y) = KS(x; y) + KP (x; y) (1)
where KS is the Syntagmatic Kernel and KP is
the Paradigmatic Kernel.
The Syntagmatic Kernel. The syntagmatic ker-
nel generalizes the word-sequence kernels de ned
by (Cancedda et al., 2003) to sequences of lem-
mata and POSs. Word sequence kernels are based
on the following idea: two sequences are similar
if they have in common many sequences of words
in a given order. The similarity between two ex-
amples is assessed by the number (possibly non-
contiguous) of the word sequences matching. Non-
contiguous occurrences are penalized according to
the number of gaps they contain. For example the
sequence of words  I go very quickly to school is
less similar to  I go to school than  I go quickly to
school . Different than the bag-of-word approach,
word sequence kernels capture the word order and
allow gaps between words. The word sequence ker-
nels are parametric with respect to the length of the
(sparse) sequences they want to capture.
We have de ned the syntagmatic kernel as the
sum of n distinct word-sequence kernels for lem-
mata (i.e. Collocation Kernel - KC) and sequences
of POSs (i.e. POS Kernel - KPOS), according to the
formula (for our experiments we set n to 2):
KS(x; y) =
nX
i=1
KCi(x; y) +
nX
i=1
KPOSi(x; y) (2)
In the above de nition of syntagmatic kernel,
only exact lemma/POS matches contribute to the
similarity. One shortcoming of this approach is
that (near-)synonyms will never be considered sim-
ilar. We address this problem by considering soft-
matching of words employing a term similarity
K(x; y)=sqrt(K(x; x)K(y; y))
measure based on LSA4. In particular we consid-
ered equivalent two words having the same POS and
a similarity value higher than an empirical thresh-
old. For example, if we consider as equivalent
the terms Ronaldo and football player the sequence
The football player scored the  rst goal can be con-
sidered equivalent to the sentence Ronaldo scored
the  rst goal. The properties of the kernel methods
offer a  exible way to plug additional information,
in this case unsupervised (we could also take this in-
formation from a semantic network such as WORD-
NET).
The Paradigmatic Kernel. The paradigmatic
kernel takes into account the paradigmatic aspect of
sense distinction (i.e. domain aspects) (Gliozzo et
al., 2004). For example the word virus can be dis-
ambiguated by recognizing the domain of the con-
text in which it is placed (e.g. computer science
vs. biology). Usually such an aspect is captured
by  bag-of-words , in analogy to the Vector Space
Model, widely used in Text Categorization and In-
formation Retrieval. The main limitation of this
model for WSD is the knowledge acquisition bot-
tleneck (i.e. the lack of sense tagged data). Bag of
words are very sparse data that require a large scale
corpus to be learned. To overcome such a limita-
tion, Latent Semantic Indexing (LSI) can provide a
solution.
Thus we de ned a paradigmatic kernel composed
by the sum of a  traditional bag of words kernel
and an LSI kernel (Cristianini et al., 2002) as de-
 ned by formula 3:
KP (x; y) = KBoW (x; y) + KLSI(x; y) (3)
where KBoW computes the inner product be-
tween the vector space model representations and
KLSI computes the cosine between the LSI vectors
representing the texts.
4For languages other than English, we did not exploit this
soft-matching and the KLSI kernel described below. See the
 rst column in the table 5.
Table 5 displays the performance of Kernel-
WSD. As a comparison, we also report the  gures
on the English task without using LSA. The last col-
umn reports the recall of the most-frequent baseline.
Acknowledgments
Claudio Giuliano is supported by the IST-Dot.Kom
project sponsored by the European Commission
(Framework V grant IST-2001-34038). TIES and
the kernel package have been developed in the con-
text of the Dot.Kom project.

References
M. Berry. 1992. Large-scale sparse singular value
computations. International Journal of Super-
computer Applications, 6(1):13 49.
N. Cancedda, E. Gaussier, C. Goutte, and J.M. Ren-
ders. 2003. Word-sequence kernels. Journal of
Machine Learning Research, 3(6):1059 1082.
N. Cristianini and J. Shawe-Taylor. 2000. Support
Vector Machines. Cambridge University Press.
N. Cristianini, J. Shawe-Taylor, and H. Lodhi.
2002. Latent semantic kernels. Journal of Intel-
ligent Information Systems, 18(2):127 152.
S. Deerwester, S. T. Dumais, G. W. Furnas, T.K.
Landauer, and R. Harshman. 1990. Indexing by
latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391 407.
D. Freitag and N. Kushmerick. 2000. Boosted
wrapper induction. In Proc. of AAAI-00, pages
577 583, Austin, Texas.
A. Gliozzo, C. Strapparava, and I. Dagan. 2004.
Unsupervised and supervised exploitation of se-
mantic domains in lexical disambiguation. Com-
puter Speech and Language, Forthcoming.
B. Magnini and G. Cavagli a. 2000. Integrating sub-
ject  eld codes into WordNet. In Proceedings of
LREC-2000, Athens, Greece, June.
B. Magnini, C. Strapparava, G. Pezzulo, and
A. Gliozzo. 2002. The role of domain informa-
tion in word sense disambiguation. Natural Lan-
guage Engineering, 8(4):359 373.
R. F. Mihalcea. 2002. Word sense disambiguation
with pattern learning and automatic feature selec-
tion. Natural Language Engineering, 8(4):343 
358.
D. Yarowsky. 1993. One sense per collocation. In
Proceedings of ARPA Human Language Technol-
ogy Workshop, pages 266 271, Princeton.
