Computing Term Translation Probabilities with Generalized Latent
Semantic Analysis
Irina Matveeva
Department of Computer Science
University of Chicago
Chicago, IL 60637
matveeva@cs.uchicago.edu
Gina-Anne Levow
Department of Computer Science
University of Chicago
Chicago, IL 60637
levow@cs.uchicago.edu
Abstract
Term translation probabilities proved an
effective method of semantic smoothing in
the language modelling approach to infor-
mation retrieval tasks. In this paper, we
use Generalized Latent Semantic Analysis
to compute semantically motivated term
and document vectors. The normalized
cosine similarity between the term vec-
tors is used as term translation probabil-
ity in the language modelling framework.
Our experiments demonstrate that GLSA-
based term translation probabilities cap-
ture semantic relations between terms and
improve performance on document classi-
fication.
1 Introduction
Many recent applications such as document sum-
marization, passage retrieval and question answer-
ing require a detailed analysis of semantic rela-
tions between terms since often there is no large
context that could disambiguate words’s meaning.
Many approaches model the semantic similarity
between documents using the relations between
semantic classes of words, such as representing
dimensions of the document vectors with distri-
butional term clusters (Bekkerman et al., 2003)
and expanding the document and query vectors
with synonyms and related terms as discussed
in (Levow et al., 2005). They improve the per-
formance on average, but also introduce some in-
stability and thus increased variance (Levow et al.,
2005).
The language modelling approach (Ponte and
Croft, 1998; Berger and Lafferty, 1999) proved
very effective for the information retrieval task.
Berger et. al (Berger and Lafferty, 1999) used
translation probabilities between terms to account
for synonymy and polysemy. However, their
model of such probabilities was computationally
demanding.
Latent Semantic Analysis (LSA)(Deerwester et
al., 1990) is one of the best known dimensionality
reduction algorithms. Using a bag-of-words docu-
ment vectors (Salton and McGill, 1983), it com-
putes a dual representation for terms and docu-
ments in a lower dimensional space. The resulting
document vectors reside in the space of latent se-
mantic concepts which can be expressed using dif-
ferent words. Thestatistical analysis oftheseman-
tic relatedness between terms is performed implic-
itly, in the course of a matrix decomposition.
In this project, we propose to use a combi-
nation of dimensionality reduction and language
modelling to compute the similarity between doc-
uments. We compute term vectors using the Gen-
eralized Latent Semantic Analysis (Matveeva et
al., 2005). This method uses co-occurrence based
measures of semantic similarity between terms
to compute low dimensional term vectors in the
space of latent semantic concepts. The normalized
cosine similarity between the term vectors is used
as term translation probability.
2 Term Translation Probabilities in
Language Modelling
The language modelling approach (Ponte and
Croft, 1998) proved very effective for the infor-
mation retrieval task. This method assumes that
every document defines a multinomial probabil-
ity distribution p(w|d) over the vocabulary space.
Thus, given a query q = (q1,...,qm), the like-
lihood of the query is estimated using the docu-
ment’s distribution: p(q|d) = producttextm1 p(qi|d), where
151
qi are query terms. Relevant documents maximize
p(d|q) ∝ p(q|d)p(d).
Many relevant documents may not contain the
same terms as the query. However, they may
contain terms that are semantically related to the
query terms and thus have high probability of
being “translations”, i.e. re-formulations for the
query words.
Berger et. al (Berger and Lafferty, 1999) in-
troduced translation probabilities between words
into the document-to-query model as a way of se-
mantic smoothing of the conditional word proba-
bilities. Thus, they query-document similarity is
computed as
p(q|d) =
mproductdisplay
i
summationdisplay
w∈d
t(qi|w)p(w|d). (1)
Each document word w is a translation of a query
term qi with probability t(qi|w). This approach
showed improvements over the baseline language
modelling approach (Berger and Lafferty, 1999).
The estimation of the translation probabilities is,
however, a difficult task. Lafferty and Zhai used
a Markov chain on words and documents to es-
timate the translation probabilities (Lafferty and
Zhai, 2001). We use the Generalized Latent Se-
mantic Analysis to compute the translation proba-
bilities.
2.1 Document Similarity
We propose to use low dimensional term vectors
for inducing the translation probabilities between
terms. Wepostpone thediscussion ofhowtheterm
vectors are computed to section 2.2. To evaluate
the validity of this approach, we applied it to doc-
ument classification.
We used two methods of computing the sim-
ilarity between documents. First, we computed
the language modelling score using term transla-
tion probabilities. Once the term vectors are com-
puted, the document vectors are generated as lin-
ear combinations of term vectors. Therefore, we
also used the cosine similarity between the docu-
ments to perform classificaiton.
We computed the language modelling score of
a test document d relative to a training document
di as
p(d|di) = productdisplay
v∈d
summationdisplay
w∈di
t(v|w)p(w|di). (2)
Appropriately normalized values of the cosine
similarity measure between pairs of term vectors
cos(vectorv, vectorw) are used as the translation probability
between the corresponding terms t(v|w).
In addition, we used the cosine similarity be-
tween the document vectors
〈vectordi, vectordj〉 = summationdisplay
w∈di
summationdisplay
v∈dj
αdiwβdjv 〈vectorw,vectorv〉, (3)
where αdiw and βdjv represent the weight of the
terms w and v with respect to the documents di
and dj, respectively.
Inthis case, theinner products between the term
vectors are also used to compute the similarity be-
tween the document vectors. Therefore, the cosine
similarity between the document vectors also de-
pends on the relatedness between pairs of terms.
We compare these two document similarity
scores to the cosine similarity between bag-of-
word document vectors. Our experiments show
that these two methods offer an advantage for doc-
ument classification.
2.2 Generalized Latent Semantic Analysis
We use the Generalized Latent Semantic Analy-
sis (GLSA)(Matveeva et al., 2005) to compute se-
mantically motivated term vectors.
TheGLSAalgorithm computes thetermvectors
for the vocabulary of the document collection C
with vocabulary V using a large corpus W. It has
the following outline:
1. Construct the weighted term document ma-
trix D based on C
2. For the vocabulary words in V, obtain a ma-
trix of pair-wise similarities, S, using the
large corpus W
3. Obtain the matrix UT of low dimensional
vector space representation of terms that pre-
serves the similarities in S, UT ∈ Rk×|V|
4. Compute document vectors by taking linear
combinations of term vectors ˆD = UTD
The columns of ˆD are documents in the k-
dimensional space.
In step 2 we used point-wise mutual informa-
tion (PMI) as the co-occurrence based measure of
semantic associations between pairs of the vocab-
ulary terms. PMI has been successfully applied to
semantic proximity tests for words (Turney, 2001;
Terra and Clarke, 2003) and was also success-
fully used as a measure of term similarity to com-
pute document clusters (Pantel and Lin, 2002). In
152
our preliminary experiments, the GLSA with PMI
showed a better performance than with other co-
occurrence based measures such as the likelihood
ratio, and χ2 test.
PMI between random variables representing
two words, w1 and w2, is computed as
PMI(w1,w2) = log P(W1 = 1,W2 = 1)P(W
1 = 1)P(W2 = 1)
.
(4)
We used the singular value decomposition
(SVD) in step 3 to compute GLSA term vectors.
LSA (Deerwester et al., 1990) and some other
related dimensionality reduction techniques, e.g.
Locality Preserving Projections (He and Niyogi,
2003) compute a dual document-term representa-
tion. The main advantage of GLSA is that it fo-
cuses on term vectors which allows for a greater
flexibility in the choice of the similarity matrix.
3 Experiments
The goal of the experiments was to understand
whether the GLSA term vectors can be used to
model the term translation probabilities. We used
a simple k-NN classifier and a basic baseline to
evalute the performance. We used the GLSA-
based term translation probabilities within the lan-
guage modelling framework and GLSA document
vectors.
We used the 20 news groups data set because
previous studies showed that the classification per-
formance on this document collection can notice-
ably benefit from additional semantic informa-
tion (Bekkerman et al., 2003). For the GLSA
computations we used the terms that occurred in
at least 15 documents, and had a vocabulary of
9732 terms. We removed documents with fewer
than 5 words. Here we used 2 sets of 6 news
groups. Groupd contained documents from dis-
similar news groups1, with a total of 5300 docu-
ments. Groups contained documents from more
similar news groups2 and had 4578 documents.
3.1 GLSA Computation
To collect the co-occurrence statistics for the sim-
ilarities matrix S we used the English Gigaword
collection (LDC). We used 1,119,364 New York
Times articles labeled “story” with 771,451 terms.
1os.ms, sports.baseball, rec.autos, sci.space, misc.forsale,
religion-christian
2politics.misc, politics.mideast, politics.guns, reli-
gion.misc, religion.christian, atheism
Groupd Groups
#L tf Glsa LM tf Glsa LM
100 0.58 0.75 0.69 0.42 0.48 0.48
200 0.65 0.78 0.74 0.47 0.52 0.51
400 0.69 0.79 0.76 0.51 0.56 0.55
1000 0.75 0.81 0.80 0.58 0.60 0.59
2000 0.78 0.83 0.83 0.63 0.64 0.63
Table 1: k-NN classification accuracy for 20NG.
Figure 1: k-NN with 400 training documents.
We used the Lemur toolkit3 to tokenize and in-
dex the document; we used stemming and a list of
stopwords. Unlessstated otherwise, fortheGLSA
methods we report the best performance over dif-
ferent numbers of embedding dimensions.
Theco-occurrence counts can be obtained using
either term co-occurrence within the same docu-
ment or within a sliding window of certain fixed
size. In our experiments we used the window-
based approach which was shown to give better
results (Terra and Clarke, 2003). We used the win-
dow of size 4.
3.2 Classification Experiments
We ran the k-NN classifier with k=5 on ten ran-
dom splits of training and test sets, with different
numbers of training documents. The baseline was
to use the cosine similarity between the bag-of-
words document vectors weighted with term fre-
quency. Other weighting schemes such as max-
imum likelihood and Laplace smoothing did not
improve results.
Table 1 shows the results. We computed the
score between the training and test documents us-
ing two approaches: cosine similarity between the
GLSA document vectors according to Equation 3
(denoted as GLSA), and the language modelling
score which included the translation probabilities
between the terms as in Equation 2 (denoted as
3http://www.lemurproject.org/
153
LM). We used the term frequency as an estimate
for p(w|d). To compute the matrix of translation
probabilities P, where P[i][j] = t(tj|ti) for the
LMCLSA approach, we first obtained the matrix
ˆP[i][j] = cos(vectorti,vectortj). We set the negative and zero
entries in ˆP to a small positive value. Finally, we
normalized the rows of ˆP to sum up to one.
Table 1 shows that for both settings GLSA and
LM outperform the tf document vectors. As ex-
pected, the classification task was more difficult
for the similar news groups. However, in this
case both GLSA-based approaches outperform the
baseline. In both cases, the advantage is more
significant with smaller sizes of the training set.
GLSA and LM performance usually peaked at
around 300-500 dimensions which is in line with
results for other SVD-based approaches (Deer-
wester et al., 1990). When the highest accuracy
was achieved at higher dimensions, the increase
after 500 dimensions was rather small, as illus-
trated in Figure 1.
These results illustrate that the pair-wise simi-
larities between the GLSA term vectors add im-
portant semantic information which helps to go
beyond term matching and deal with synonymy
and polysemy.
4 Conclusion and Future Work
We used the GLSA to compute term translation
probabilities as a measure of semantic similarity
between documents. We showed that the GLSA
term-based document representation and GLSA-
based term translation probabilities improve per-
formance on document classification.
The GLSA term vectors were computed for all
vocabulary terms. However, different measures of
similarity may be required for different groups of
terms such as content bearing general vocabulary
words and proper names as well as other named
entities. Furthermore, different measures of sim-
ilarity work best for nouns and verbs. To extend
this approach, we will use a combination of sim-
ilarity measures between terms to model the doc-
ument similarity. We will divide the vocabulary
into general vocabulary terms and named entities
and compute a separate similarity score for each
of the group of terms. The overall similarity score
is a function of these two scores. In addition, we
will use the GLSA-based score together with syn-
tactic similarity to compute the similarity between
the general vocabulary terms.
References
Ron Bekkerman, Ran El-Yaniv, and Naftali Tishby.
2003. Distributionalword clusters vs. words for text
categorization.
Adam Berger and John Lafferty. 1999. Informationre-
trieval as statistical translation. In Proc. of the 22rd
ACM SIGIR.
ScottC.Deerwester,SusanT.Dumais,ThomasK.Lan-
dauer,GeorgeW. Furnas,andRichardA. Harshman.
1990. Indexing by latent semantic analysis. Jour-
nal of the American Society of Information Science,
41(6):391–407.
Xiaofei He and Partha Niyogi. 2003. Locality preserv-
ing projections. In Proc. of NIPS.
John Lafferty and Chengxiang Zhai. 2001. Document
language models, query models, and risk minimiza-
tion for information retrieval. In Proc. of the 24th
ACM SIGIR, pages 111–119, New York, NY, USA.
ACM Press.
Gina-Anne Levow, Douglas W. Oard, and Philip
Resnik. 2005. Dictionary-based techniques for
cross-language information retrieval. Information
Processing and Management: Special Issue on
Cross-language Information Retrieval.
Irina Matveeva, Gina-Anne Levow, Ayman Farahat,
and Christian Royer. 2005. Generalized latent se-
mantic analysis for term representation. In Proc. of
RANLP.
Patrick Pantel and Dekang Lin. 2002. Document clus-
tering with committees. In Proc. of the 25th ACM
SIGIR, pages 199–206. ACM Press.
Jay M. Ponte and W. Bruce Croft. 1998. A language
modelingapproachtoinformationretrieval. InProc.
of the 21st ACM SIGIR, pages 275–281, New York,
NY, USA. ACM Press.
Gerard Salton and Michael J. McGill. 1983. Intro-
duction to Modern Information Retrieval. McGraw-
Hill.
Egidio L. Terra and Charles L. A. Clarke. 2003. Fre-
quencyestimates for statistical word similarity mea-
sures. In Proc.of HLT-NAACL.
Peter D. Turney. 2001. Mining the web for synonyms:
PMI–IR versus LSA on TOEFL. Lecture Notes in
Computer Science, 2167:491–502.
154
