Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 9–16,
Ann Arbor, June 2005. c©Association for Computational Linguistics, 2005
Cross language Text Categorization by acquiring
Multilingual Domain Models from Comparable Corpora
Al o Gliozzo and Carlo Strapparava
ITC-Irst
via Sommarive, I-38050, Trento, ITALY
{gliozzo,strappa}@itc.it
Abstract
In a multilingual scenario, the classical
monolingual text categorization problem
can be reformulated as a cross language
TC task, in which we have to cope with
two or more languages (e.g. English and
Italian). In this setting, the system is
trained using labeled examples in a source
language (e.g. English), and it classi es
documents in a different target language
(e.g. Italian).
In this paper we propose a novel ap-
proach to solve the cross language text
categorization problem based on acquir-
ing Multilingual Domain Models from
comparable corpora in a totally unsuper-
vised way and without using any external
knowledge source (e.g. bilingual dictio-
naries). These Multilingual Domain Mod-
els are exploited to de ne a generalized
similarity function (i.e. a kernel function)
among documents in different languages,
which is used inside a Support Vector Ma-
chines classi cation framework. The re-
sults show that our approach is a feasi-
ble and cheap solution that largely outper-
forms a baseline.
1 Introduction
Text categorization (TC) is the task of assigning cat-
egory labels to documents. Categories are usually
de ned according to a variety of topics (e.g. SPORT,
POLITICS, etc.) and, even if a large amount of
hand tagged texts is required, the state-of-the-art su-
pervised learning techniques represent a viable and
well-performing solution for monolingual catego-
rization problems.
On the other hand in the worldwide scenario of
the web age, multilinguality is a crucial issue to deal
with and to investigate, leading us to reformulate
most of the classical NLP problems. In particular,
monolingual Text Categorization can be reformu-
lated as a cross language TC task, in which we have
to cope with two or more languages (e.g. English
and Italian). In this setting, the system is trained
using labeled examples in a source language (e.g.
English), and it classi es documents in a different
target language (e.g. Italian).
In this paper we propose a novel approach to solve
the cross language text categorization problem based
on acquiring Multilingual Domain Models (MDM)
from comparable corpora in an unsupervised way.
A MDM is a set of clusters formed by terms in dif-
ferent languages. While in the monolingual settings
semantic domains are clusters of related terms that
co-occur in texts regarding similar topics (Gliozzo et
al., 2004), in the multilingual settings such clusters
are composed by terms in different languages ex-
pressing concepts in the same semantic  eld. Thus,
the basic relation modeled by a MDM is the domain
similarity among terms in different languages. Our
claim is that such a relation is suf cient to capture
relevant aspects of topic similarity that can be prof-
itably used for TC purposes.
The paper is organized as follows. After a brief
discussion about comparable corpora, we introduce
9
a multilingual Vector Space Model, in which docu-
ments in different languages can be represented and
then compared. In Section 4 we de ne the MDMs
and we present a totally unsupervised technique
to acquire them from comparable corpora. This
methodology does not require any external knowl-
edge source (e.g. bilingual dictionaries) and it is
based on Latent Semantic Analysis (LSA) (Deer-
wester et al., 1990). MDMs are then exploited to
de ne a Multilingual Domain Kernel, a generalized
similarity function among documents in different
languages that exploits a MDM (see Section 5). The
Multilingual Domain Kernel is used inside a Sup-
port Vector Machines (SVM) classi cation frame-
work for TC (Joachims, 2002). In Section 6 we will
evaluate our technique in a Cross Language catego-
rization task. The results show that our approach is
a feasible and cheap solution, largely outperforming
a baseline. Conclusions and future works are  nally
reported in Section 7.
2 Comparable Corpora
Comparable corpora are collections of texts in dif-
ferent languages regarding similar topics (e.g. a col-
lection of news published by agencies in the same
period). More restrictive requirements are expected
for parallel corpora (i.e. corpora composed by texts
which are mutual translations), while the class of
the multilingual corpora (i.e. collection of texts ex-
pressed in different languages without any addi-
tional requirement) is the more general. Obviously
parallel corpora are also comparable, while compa-
rable corpora are also multilingual.
In a more precise way, let L = fL1,L2,...,Llg
be a set of languages, let Ti = fti1,ti2,...,ting be a
collection of texts expressed in the language Li 2L,
and let ψ(tjh,tiz) be a function that returns 1 if tiz is
the translation of tjh and 0 otherwise. A multilingual
corpus is the collection of texts de ned by T∗ =uniontext
i Ti. If the function ψ exists for every text tiz 2T∗
and for every language Lj, and is known, then the
corpus is parallel and aligned at document level.
For the purpose of this paper it is enough to as-
sume that two corpora are comparable, i.e. they are
composed by documents about the same topics and
produced in the same period (e.g. possibly from dif-
ferent news agencies), and it is not known if a func-
tion ψ exists, even if in principle it could exist and
return 1 for a strict subset of document pairs.
There exist many interesting works about us-
ing parallel corpora for multilingual applications
(Melamed, 2001), such as Machine Translation,
Cross language Information Retrieval (Littman et
al., 1998), lexical acquisition, and so on.
However it is not always easy to  nd or build par-
allel corpora. This is the main reason because the
weaker notion of comparable corpora is a matter re-
cent interest in the  eld of Computational Linguis-
tics (Gaussier et al., 2004).
The texts inside comparable corpora, being about
the same topics (i.e. about the same semantic do-
mains), should refer to the same concepts by using
various expressions in different languages. On the
other hand, most of the proper nouns, relevant enti-
ties and words that are not yet lexicalized in the lan-
guage, are expressed by using their original terms.
As a consequence the same entities will be denoted
with the same words in different languages, allow-
ing to automatically detect couples of translation
pairs just by looking at the word shape (Koehn and
Knight, 2002). Our hypothesis is that comparable
corpora contain a large amount of such words, just
because texts, referring to the same topics in differ-
ent languages, will often adopt the same terms to
denote the same entities1.
However, the simple presence of these shared
words is not enough to get signi cant results in TC
tasks. As we will see, we need to exploit these com-
mon words to induce a second-order similarity for
the other words in the lexicons.
3 The Multilingual Vector Space Model
Let T = ft1,t2,...,tng be a corpus, and V =
fw1,w2,...,wkg be its vocabulary. In the mono-
lingual settings, the Vector Space Model (VSM) is a
k-dimensional space Rk, in which the text tj 2 T
is represented by means of the vector vectortj such that
the zth component of vectortj is the frequency of wz in tj.
The similarity among two texts in the VSM is then
estimated by computing the cosine of their vectors
in the VSM.
1According to our assumption, a possible additional crite-
rion to decide whether two corpora are comparable is to esti-
mate the percentage of terms in the intersection of their vocab-
ularies.
10
Unfortunately, such a model cannot be adopted in
the multilingual settings, because the VSMs of dif-
ferent languages are mainly disjoint, and the similar-
ity between two texts in different languages would
always turn out zero. This situation is represented
in Figure 1, in which both the left-bottom and the
rigth-upper regions of the matrix are totally  lled by
zeros.
A  rst attempt to solve this problem is to ex-
ploit the information provided by external knowl-
edge sources, such as bilingual dictionaries, to col-
lapse all the rows representing translation pairs. In
this setting, the similarity among texts in different
languages could be estimated by exploiting the clas-
sical VSM just described. However, the main dis-
advantage of this approach to estimate inter-lingual
text similarity is that it strongly relies on the avail-
ability of a multilingual lexical resource containing
a list of translation pairs. For languages with scarce
resources a bilingual dictionary could be not eas-
ily available. Secondly, an important requirement
of such a resource is its coverage (i.e. the amount
of possible translation pairs that are actually con-
tained in it). Finally, another problem is that am-
biguos terms could be translated in different ways,
leading to collapse together rows describing terms
with very different meanings.
On the other hand, the assumption of corpora
comparability seen in Section 2, implies the pres-
ence of a number of common words, represented by
the central rows of the matrix in Figure 1.
As we will show in Section 6, this model is rather
poor because of its sparseness. In the next section,
we will show how to use such words as seeds to in-
duce a Multilingual Domain VSM, in which second
order relations among terms and documents in dif-
ferent languages are considered to improve the sim-
ilarity estimation.
4 Multilingual Domain Models
A MDM is a multilingual extension of the concept
of Domain Model. In the literature, Domain Mod-
els have been introduced to represent ambiguity and
variability (Gliozzo et al., 2004) and successfully
exploited in many NLP applications, such us Word
Sense Disambiguation (Strapparava et al., 2004),
Text Categorization and Term Categorization.
A Domain Model is composed by soft clusters of
terms. Each cluster represents a semantic domain,
i.e. a set of terms that often co-occur in texts hav-
ing similar topics. Such clusters identi es groups of
words belonging to the same semantic  eld, and thus
highly paradigmatically related. MDMs are Domain
Models containing terms in more than one language.
A MDM is represented by a matrix D, contain-
ing the degree of association among terms in all the
languages and domains, as illustrated in Table 1.
MEDICINE COMPUTER SCIENCE
HIV e/i 1 0
AIDSe/i 1 0
viruse/i 0.5 0.5
hospitale 1 0
laptope 0 1
Microsofte/i 0 1
clinicai 1 0
Table 1: Example of Domain Matrix. we denotes
English terms, wi Italian terms and we/i the com-
mon terms to both languages.
MDMs can be used to describe lexical ambiguity,
variability and inter-lingual domain relations. Lexi-
cal ambiguity is represented by associating one term
to more than one domain, while variability is rep-
resented by associating different terms to the same
domain. For example the term virus is associated
to both the domain COMPUTER SCIENCE and the
domain MEDICINE while the domain MEDICINE is
associated to both the terms AIDS and HIV. Inter-
lingual domain relations are captured by placing dif-
ferent terms of different languages in the same se-
mantic  eld (as for example HIV e/i, AIDSe/i,
hospitale, and clinicai). Most of the named enti-
ties, such as Microsoft and HIV are expressed using
the same string in both languages.
When similarity among texts in different lan-
guages has to be estimated, the information con-
tained in the MDM is crucial. For example the two
sentences  I went to the hospital to make an HIV
check and  Ieri ho fatto il test dell’AIDS in clin-
ica (lit. yesterday I did the AIDS test in a clinic)
are very highly related, even if they share no to-
kens. Having an  a priori knowledge about the
inter-lingual domain similarity among AIDS, HIV,
hospital and clinica is then a useful information to
11

















English documents Italian documents
de1 de2 ··· den−1 den di1 di2 ··· dim−1 dim
we1 0 1 ··· 0 1 0 0 ···
English
Lexicon
we2 1 1 ··· 1 0 0 ...
... . . . . . . . . . . . . . . . . . . . . . . . . ... 0 ...
wep−1 0 1 ··· 0 0 ... 0
wep 0 1 ··· 0 0 ··· 0 0
common wi we/i1 0 1 ··· 0 0 0 0 ··· 1 0
... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
wi1 0 0 ··· 0 1 ··· 1 1
Italian
Lexicon
wi2 0 ... 1 1 ··· 0 1
... ... 0 ... . . . . . . . . . . . . . . . . . . . . . . . . .
wiq−1 ... 0 0 1 ··· 0 1
wiq ··· 0 0 0 1 ··· 1 0

















Figure 1: Multilingual term-by-document matrix
recognize inter-lingual topic similarity. Obviously
this relation is less restrictive than a stronger associ-
ation among translation pair. In this paper we will
show that such a representation is suf cient for TC
puposes, and easier to acquire.
In the rest of this section we will provide a formal
de nition of the concept of MDM, and we de ne
some similarity metrics that exploit it.
Formally, let V i = fwi1,wi2,...,wikig be the vo-
cabulary of the corpus Ti composed by document
expressed in the language Li, let V ∗ = uniontextiV i be
the set of all the terms in all the languages, and
let k∗ = jV ∗j be the cardinality of this set. Let
D = fD1,D2,...,Ddg be a set of domains. A DM
is fully de ned by a k∗  d domain matrix D rep-
resenting in each cell di,z the domain relevance of
the ith term of V ∗ with respect to the domain Dz.
The domain matrix D is used to de ne a function
D : Rk∗ ! Rd, that maps the document vectors vectortj
expressed into the multilingual classical VSM, into
the vectors vectortprimej in the multilingual domain VSM. The
function D is de ned by2
2In (Wong et al., 1985) the formula 1 is used to de ne a
Generalized Vector Space Model, of which the Domain VSM is
a particular instance.
D(vectortj) = vectortj(IIDFD) = vectortprimej (1)
where IIDF is a diagonal matrix such thatiIDFi,i =
IDF(wli), vectortj is represented as a row vector, and
IDF(wli) is the Inverse Document Frequency of wli
evaluated in the corpus Tl.
The matrix D can be determined for example us-
ing hand-made lexical resources, such as WORD-
NET DOMAINS (Magnini and Cavagli a, 2000). In
the present work we followed the way to acquire
D automatically from corpora, exploiting the tech-
nique described below.
4.1 Automatic Acquisition of Multilingual
Domain Models
In this work we propose the use of Latent Seman-
tic Analysis (LSA) (Deerwester et al., 1990) to in-
duce a MDM from comparable corpora. LSA is an
unsupervised technique for estimating the similar-
ity among texts and terms in a large corpus. In the
monolingual settings LSA is performed by means
of a Singular Value Decomposition (SVD) of the
term-by-document matrix T describing the corpus.
SVD decomposes the term-by-document matrix T
into three matrixes T ’ VΣkprimeUT where Σkprime is the
diagonal k k matrix containing the highest kprime  k
12
eigenvalues of T, and all the remaining elements are
set to 0. The parameter kprime is the dimensionality of
the Domain VSM and can be  xed in advance (i.e.
kprime = d).
In the literature (Littman et al., 1998) LSA has
been used in multilingual settings to de ne a mul-
tilingual space in which texts in different languages
can be represented and compared. In that work LSA
strongly relied on the availability of aligned parallel
corpora: documents in all the languages are repre-
sented in a term-by-document matrix (see Figure 1)
and then the columns corresponding to sets of trans-
lated documents are collapsed (i.e. they are substi-
tuted by their sum) before starting the LSA process.
The effect of this step is to merge the subspaces (i.e.
the right and the left sectors of the matrix in Figure
1) in which the documents have been originally rep-
resented.
In this paper we propose a variation of this strat-
egy, performing a multilingual LSA in the case in
which an aligned parallel corpus is not available.
It exploits the presence of common words among
different languages in the term-by-document matrix.
The SVD process has the effect of creating a LSA
space in which documents in both languages are rep-
resented. Of course, the higher the number of com-
mon words, the more information will be provided
to the SVD algorithm to  nd common LSA dimen-
sion for the two languages. The resulting LSA di-
mensions can be perceived as multilingual clusters
of terms and document. LSA can then be used to
de ne a Multilingual Domain Matrix DLSA.
DLSA = INVradicalbigΣkprime (2)
where IN is a diagonal matrix such that iNi,i =
1radicalBig
〈 vectorwprimei, vectorwprimei〉
, vectorwprimei is the ith row of the matrix VpΣkprime.
Thus DLSA3 can be exploited to estimate simi-
larity among texts expressed in different languages
(see Section 5).
3When DLSA is substituted in Equation 1 the Domain VSM
is equivalent to a Latent Semantic Space (Deerwester et al.,
1990). The only difference in our formulation is that the vectors
representing the terms in the Domain VSM are normalized by
the matrix IN, and then rescaled, according to their IDF value,
by matrix IIDF. Note the analogy with the tf idf term weighting
schema, widely adopted in Information Retrieval.
4.2 Similarity in the multilingual domain space
As an example of the second-order similarity pro-
vided by this approach, we can see in Table 2 the  ve
most similar terms to the lemma bank. The similar-
ity among terms is calculated by cosine among the
rows in the matrix DLSA, acquired from the data
set used in our experiments (see Section 6.2). It is
worth noting that the Italian lemma banca (i.e. bank
in English) has a high similarity score to the English
lemma bank. While this is not enough to have a pre-
cise term translation, it is suf cient to capture rele-
vant aspects of topic similarity in a cross-language
text categorization task.
Lemma#Pos Similarity Score Language
banking#n 0.96 Eng
credit#n 0.90 Eng
amro#n 0.89 Eng
unicredito#n 0.85 Ita
banca#n 0.83 Ita
Table 2: Terms with high similarity to the English
lemma bank#n, in the Multilingual Domain Model
5 The Multilingual Domain Kernel
Kernel Methods are the state-of-the-art supervised
framework for learning, and they have been success-
fully adopted to approach the TC task (Joachims,
2002).
The basic idea behind kernel methods is to em-
bed the data into a suitable feature space F via a
mapping function φ : X ! F, and then to use a
linear algorithm for discovering nonlinear patterns.
Kernel methods allow us to build a modular system,
as the kernel function acts as an interface between
the data and the learning algorithm. Thus the ker-
nel function becomes the only domain speci c mod-
ule of the system, while the learning algorithm is a
general purpose component. Potentially any kernel
function can work with any kernel-based algorithm,
as for example Support Vector Machines (SVMs).
During the learning phase SVMs assign a weight
λi  0 to any example xi 2 X. All the labeled
instances xi such that λi > 0 are called Support
Vectors. Support Vectors lie close to the best sepa-
rating hyper-plane between positive and negative ex-
amples. New examples are then assigned to the class
13
of the closest support vectors, according to equation
3.
f(x) =
nsummationdisplay
i=1
λiK(xi,x) +λ0 (3)
The kernel function K(xi,x) returns the simi-
larity between two instances in the input space X,
and can be designed just by taking care that some
formal requirements are satis ed, as described in
(Schcurrency1olkopf and Smola, 2001).
In this section we de ne the Multilingual Domain
Kernel, and we apply it to a cross language TC task.
This kernel can be exploited to estimate the topic
similarity among two texts expressed in different
languages by taking into account the external knowl-
edge provided by a MDM. It de nes an explicit map-
ping D : Rk ! Rkprime from the Multilingual VSM
into the Multilingual Domain VSM. The Multilin-
gual Domain Kernel is speci ed by
KD(ti,tj) = hD(ti),D(tj)iradicalBig
hD(tj),D(tj)ihD(ti),D(ti)i
(4)
where D is the Domain Mapping de ned in equa-
tion 1. Thus the Multilingual Domain Kernel re-
quires Multilingual Domain Matrix D, in particular
DLSA that can be acquired from comparable cor-
pora, as explained in Section 4.1.
To evaluate the Multilingual Domain Kernel we
compared it to a baseline kernel function, namely the
bag of words kernel, that simply estimates the topic
similarity in the Multilingual VSM, as described in
Section 3. The BoW kernel is a particular case of
the Domain Kernel, in which D = I, and I is the
identity matrix.
6 Evaluation
In this section we present the data set (two compara-
ble English and Italian corpora) used in the evalua-
tion, and we show the results of the Cross Language
TC tasks. In particular we tried both to train the
system on the English data set and classify Italian
documents and to train using Italian and classify the
English test set. We compare the learning curves of
the Multilingual Domain Kernel with the standard
BoW kernel, which is considered as a baseline for
this task.
6.1 Implementation details
As a supervised learning device, we used the SVM
implementation described in (Joachims, 1999). The
Multilingual Domain Kernel is implemented by
de ning an explicit feature mapping as explained
above, and by normalizing each vector. All the ex-
periments have been performed with the standard
SVM parameter settings.
We acquired a Multilingual Domain Model by
performing the Singular Value Decomposition pro-
cess on the term-by-document matrices representing
the merged training partitions (i.e. English and Ital-
ian), and we considered only the  rst 400 dimen-
sions4.
6.2 Data set description
We used a news corpus kindly put at our dis-
posal by ADNKRONOS, an important Italian news
provider. The corpus consists of 32,354 Ital-
ian and 27,821 English news partitioned by
ADNKRONOS in a number of four  xed cate-
gories: Quality of Life, Made in Italy,
Tourism, Culture and School. The corpus
is comparable, in the sense stated in Section 2, i.e.
they covered the same topics and the same period of
time. Some news are translated in the other language
(but no alignment indication is given), some others
are present only in the English set, and some others
only in the Italian. The average length of the news
is about 300 words. We randomly split both the En-
glish and Italian part into 75% training and 25% test
(see Table 3). In both the data sets we postagged the
texts and we considered only the noun, verb, adjec-
tive, and adverb parts of speech, representing them
by vectors containing the frequencies of each lemma
with its part of speech.
6.3 Monolingual Results
Before going to a cross-language TC task, we con-
ducted two tests of classical monolingual TC by
training and testing the system on Italian and En-
glish documents separately. For these tests we used
the SVM with the BoW kernel. Figures 2 and 3 re-
port the results.
4To perform the SVD operation we used LIBSVDC
http://tedlab.mit.edu/∼dr/SVDLIBC/.
14
English Italian
Categories Training Test Total Training Test Total
Quality of Life 5759 1989 7748 5781 1901 7682
Made in Italy 5711 1864 7575 6111 2068 8179
Tourism 5731 1857 7588 6090 2015 8105
Culture and School 3665 1245 4910 6284 2104 8388
Total 20866 6955 27821 24266 8088 32354
Table 3: Number of documents in the data set partitions
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
F1 measure
Fraction of training data
BoW Kernel
Figure 2: Learning curves for the English part of the
corpus
0.5
0.55
0.6
0.65
0.7
0.75
0.8
0.85
0.9
0.95
1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
F1 measure
Fraction of training data
BoW Kernel
Figure 3: Learning curves for the Italian part of the
corpus
6.4 A Cross Language Text Categorization task
As far as the cross language TC task is concerned,
we tried the two possible options: we trained on the
English part and we classi ed the Italian part, and
we trained on the Italian and classi ed on the En-
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
F1 measure
Fraction of training data
Multilingual Domain Kernel
Bow Kernel
Figure 4: Cross-language (training on Italian, test on
English) learning curves
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0.65
0.7
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
F1 measure
Fraction of training data
Multilingual Domain Kernel
Bow Kernel
Figure 5: Cross-language (training on English, test
on Italian) learning curves
glish part. The Multilingual Domain Model was ac-
quired running the SVD only on the joint (English
and Italian) training parts.
Table 4 reports the vocabulary dimensions of the
English and Italian training partitions, the vocabu-
15
# lemmata
English training 22,704
Italian training 26,404
English + Italian 43,384
common lemmata 5,724
Table 4: Number of lemmata in the training parts of
the corpus
lary of the merged training, and how many com-
mon lemmata are present (about 14% of the total).
Among the common lemmata, 97% are nouns and
most of them are proper nouns. Thus the initial term-
by-document matrix is a 43,384  45,132 matrix,
while the DLSA matrix is 43,384  400. For this
task we consider as a baseline the BoW kernel.
The results are reported in Figures 4 and 5. An-
alyzing the learning curves, it is worth noting that
when the quantity of training increases, the per-
formance becomes better and better for the Multi-
lingual Domain Kernel, suggesting that with more
available training it could be possible to go closer to
typical monolingual TC results.
7 Conclusion
In this paper we proposed a solution to cross lan-
guage Text Categorization based on acquiring Mul-
tilingual Domain Models from comparable corpora
in a totally unsupervised way and without using any
external knowledge source (e.g. bilingual dictionar-
ies). These Multilingual Domain Models are ex-
ploited to de ne a generalized similarity function
(i.e. a kernel function) among documents in differ-
ent languages, which is used inside a Support Vec-
tor Machines classi cation framework. The basis of
the similarity function exploits the presence of com-
mon words to induce a second-order similarity for
the other words in the lexicons. The results have
shown that this technique is suf cient to capture rel-
evant aspects of topic similarity in cross-language
TC tasks, obtaining substantial improvements over
a simple baseline. As future work we will investi-
gate the performance of this approach to more than
two languages TC task, and a possible generaliza-
tion of the assumption about equality of the common
words.
Acknowledgments
This work has been partially supported by the
ONTOTEXT project, funded by the Autonomous
Province of Trento under the FUP-2004 program.

References
S. Deerwester, S. T. Dumais, G. W. Furnas, T.K. Lan-
dauer, and R. Harshman. 1990. Indexing by latent se-
mantic analysis. Journal of the American Society for
Information Science, 41(6):391 407.
E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, and
H. Dejean. 2004. A geometric view on bilingual lexi-
con extraction from comparable corpora. In Proceed-
ings of ACL-04, Barcelona, Spain, July.
A. Gliozzo, C. Strapparava, and I. Dagan. 2004. Unsu-
pervised and supervised exploitation of semantic do-
mains in lexical disambiguation. Computer Speech
and Language, 18:275 299.
T. Joachims. 1999. Making large-scale SVM learning
practical. In B. Schcurrency1olkopf, C. Burges, and A. Smola,
editors, Advances in kernel methods: support vector
learning, chapter 11, pages 169  184. The MIT Press.
T. Joachims. 2002. Learning to Classify Text using Sup-
port Vector Machines. Kluwer Academic Publishers.
P. Koehn and K. Knight. 2002. Learning a translation
lexicon from monolingual corpora. In Proceedings of
ACL Workshop on Unsupervised Lexical Acquisition,
Philadelphia, July.
M. Littman, S. Dumais, and T. Landauer. 1998. Auto-
matic cross-language information retrieval using latent
semantic indexing. In G. Grefenstette, editor, Cross
Language Information Retrieval, pages 51 62. Kluwer
Academic Publishers.
B. Magnini and G. Cavagli a. 2000. Integrating subject
 eld codes into WordNet. In Proceedings of LREC-
2000, Athens, Greece, June.
D. Melamed. 2001. Empirical Methods for Exploiting
Parallel Texts. The MIT Press.
B. Schcurrency1olkopf and A. J. Smola. 2001. Learning with Ker-
nels. Support Vector Machines, Regularization, Opti-
mization, and Beyond. The MIT Press.
C. Strapparava, A. Gliozzo, and C. Giuliano. 2004. Pat-
tern abstraction and term similarity for word sense
disambiguation. In Proceedings of SENSEVAL-3,
Barcelona, Spain, July.
S.K.M. Wong, W. Ziarko, and P.C.N. Wong. 1985. Gen-
eralized vector space model in information retrieval.
In Proceedings of the 8th ACM SIGIR Conference.
