Proceedings of EACL '99 
Finding content-bearing terms using term similarities 
Justin Picard 
Institut Interfacultaire d'Informatique 
University of Neuchâtel 
SWITZERLAND 
justin.picard@seco.unine.ch 
Abstract 
This paper explores the issue of using dif- 
ferent co-occurrence similarities between 
terms for separating query terms that are 
useful for retrieval from those that are 
harmful. The hypothesis under examina- 
tion is that useful terms tend to be more 
similar to each other than to other query 
terms. Preliminary experiments with 
similarities computed using first-order 
and second-order co-occurrence seem to 
confirm the hypothesis. Term similari- 
ties could then be used for determining 
which query terms are useful and best 
reflect the user's information need. A 
possible application would be to use this 
source of evidence for tuning the weights 
of the query terms. 
1 Introduction 
Co-occurrence information, whether it is used for 
automatically expanding the original query (Qiu 
and Frei, 1993), for providing a list of candidate 
terms to the user in interactive query expansion, 
or for relaxing the independence assumption 
between query terms (van Rijsbergen, 
1977), has been widely used in information retrieval. 
Nevertheless, the use of this information 
has often resulted in reduction of retrieval effec- 
tiveness (Smeaton and van Rijsbergen, 1983), a 
fact sometimes explained by the poor discriminat- 
ing power of the relationships (Peat and Willet, 
1991). Only recently has a more elaborate 
use of this information resulted in consistent 
improvement of retrieval effectiveness. Improvements 
came from a different computation of the 
relationships named "second-order co-occurrence" 
(Schutze and Pedersen, 1997), from an adequate 
combination with other sources of evidence such 
as relevance feedback (Xu and Croft, 1996), or 
from a more careful use of the similarities for ex- 
panding the query (Qiu and Frei, 1993). 
Indeed, interesting patterns lying in co-occurrence 
information may be discovered and, 
if used carefully, may enhance retrieval effective- 
ness. This paper explores the use of co-occurrence 
similarities between query terms for determining 
the subset of query terms which are good descrip- 
tors of the user's information need. Query terms 
can be divided into those that are useful for re- 
trieval and those that are harmful, which will be 
named respectively "content" terms and "noisy" 
terms. The hypothesis under examination is that 
two content terms tend to be more similar to 
each other than would be two noisy terms, or a 
noisy and a content term. Intuitively, the query 
terms which reflect the user's information need are 
more likely to be found in relevant documents and 
should concern similar topic areas. Consequently, 
they should be found in similar contexts in the 
corpus. A similarity measures the degree to which 
two terms can be found in the same context, and 
should be higher for two content terms. 
We name this hypothesis the "Cluster Hypoth- 
esis for query terms", due to its correspondence 
with the Cluster Hypothesis of information retrieval, 
which assumes that relevant documents 
"are more like one another than they are like 
non-relevant documents" (van Rijsbergen and Sparck-Jones, 
1973, p.252). Our medium-term objective 
is to verify the hypothesis experimentally for different 
types of co-occurrence, different measures 
of similarity and different collections. If a higher 
similarity between content terms is indeed ob- 
served, this pattern could be used for tuning the 
weights of query terms in the absence of relevance 
feedback information, by increasing the weights of 
the terms which appear to be content terms, and 
decreasing those of the noisy terms. The next section 
verifies the hypothesis on the CACM 
collection (3204 documents, 50 queries). 
2 Verifying the Cluster Hypothesis for query terms 
2.1 The Cluster Hypothesis for query terms 
The hypothesis that similarities between query 
terms are an indicator of the relevance of each term 
to the user's information need is based on an intuition. 
This intuition can be illustrated by the 
following request: 
Document will provide totals or 
specific data on changes to the proven 
reserve figures for any oil or natural 
gas producer. 
It turns out that the only terms which appear in 
one or more relevant documents are oil, reserve 
and gas, which obviously concern similar topic areas 
and are good descriptors of the information 
need (see footnote 1). All the other terms retrieve only 
non-relevant documents, and consequently reduce retrieval 
effectiveness. Taken individually, they do 
not seem to specifically concern the user's information 
need. Our hypothesis can be formulated 
this way: 
• Content terms which are representative of the 
information need (like oil, reserve, and gas) 
concern similar topics and are more likely to 
be found in relevant documents; 
• Terms which concern similar topics should be 
found in similar contexts of the corpus (doc- 
uments, sentences, neighboring words...); 
• Terms found in similar contexts have a high 
similarity value. Consequently, content terms 
tend to be similar to each other. 
2.2 Determining content terms and noisy terms 
Until now, we have talked of "content" or "noisy" 
terms, as terms which are useful or harmful for re- 
trieval. How can we determine this? First, terms 
which do not occur in any relevant document can 
only be harmful (at best, they have no impact on 
retrieval) and can directly be classified as "noisy". 
For terms which occur in one or more relevant 
documents, the usefulness depends on the total 
number of relevant documents and on the num- 
ber of occurrences of the term in the collection. 
We use the $\chi^2$ test of independence between the 
occurrence of the term and the relevance of a document 
to determine if the term is a content or a 
noisy term. If the test rejects the hypothesis of 
independence at the 95% confidence level, the term 
is considered a content term. Otherwise, it is 
considered a noisy term. 
Footnote 1: Note that we do not consider here phrases such 
as 'natural gas', but the argument can be extended to 
phrases. 
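The classification rule described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the layout of the 2x2 contingency table, and the use of the uncorrected Pearson statistic with the critical value 3.841 (one degree of freedom, 5% significance level) are all assumptions made for the example.

```python
# Hypothetical sketch of the content/noisy classification of a query term.
# Assumption: uncorrected Pearson chi-square on a 2x2 table, 1 degree of freedom.

CRITICAL_95_DF1 = 3.841  # chi-square critical value for alpha = 0.05, df = 1


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table

                   relevant   non-relevant
        term          a            b
        no term       c            d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no evidence against independence
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def classify_term(docs_with_term, relevant_docs, n_docs):
    """Return 'content' or 'noisy' for one term of one query."""
    docs_with_term = set(docs_with_term)
    relevant_docs = set(relevant_docs)
    a = len(docs_with_term & relevant_docs)
    if a == 0:
        # A term occurring in no relevant document is noisy by definition.
        return "noisy"
    b = len(docs_with_term) - a
    c = len(relevant_docs) - a
    d = n_docs - a - b - c
    stat = chi_square_2x2(a, b, c, d)
    # Independence rejected at the 95% level: the term is a content term.
    return "content" if stat > CRITICAL_95_DF1 else "noisy"


# Made-up example: a rare term concentrated in the relevant documents.
print(classify_term(range(10), range(2, 22), n_docs=3204))
```

In the made-up example the term occurs in 10 documents, 8 of which are among the 20 relevant ones, so independence is clearly rejected. A frequent term whose share of relevant documents matches its share of the whole collection would be labelled noisy.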
Another way of verifying if a term is useful for 
retrieval would be to compare the retrieval effectiveness 
of the query with and without the term. 
This method is appealing since our final objective 
is better retrieval effectiveness. But it has some drawbacks: 
(1) there are several measures of retrieval 
effectiveness, and (2) the classification of a term 
will depend in part on the retrieval system itself. 
One point deserves discussion: terms which do not 
appear in any relevant document and which are 
classified noisy may sometimes be indicative of 
the content of the query. This may happen for 
example if the number of relevant documents is 
small and if the vocabularies used in the request 
and in the relevant documents are different. Even so, 
this does not change the fact that the term 
is harmful to retrieval. It could still be used for 
finding expansion terms, but this is another problem. 
In any case, a rough classification of terms 
into "content" and "noisy" is always open to debate, 
in the same way that a binary classification 
of documents into relevant and non-relevant 
is a major controversy in the field of information 
retrieval. 
2.3 Preliminary experiments 
Once terms are classified as either content or 
noisy, three types of term pairs are considered: 
content-content, content-noisy, and noisy-noisy. 
For each pair of query terms, different measures 
of similarity can be computed, depending on the 
type of co-occurrence, the association measure, 
and so on. Each of the three classes of term pairs 
has an a priori probability of appearing. We are interested 
in verifying whether the similarity has an influence 
on this probability. 
One problem with first-order co-occurrence is 
that the majority of terms never co-occur, because 
they occur too infrequently. We therefore selected 
only terms which occur more than ten times in the 
corpus. The same term pairs were used for 
first- and second-order co-occurrence. Term pairs come 
from selected terms of the same query. For example, 
take a query with 10 terms of which 5 are 
classified content. Then for this query, there are 
$\frac{10 \cdot (10-1)}{2} = 45$ term pairs, of which $\frac{5 \cdot (5-1)}{2} = 10$ 
are content-content, 10 are noisy-noisy, and the 
other 25 are noisy-content. 
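The pair counts in this example can be checked with a short sketch. The function name and the term labels below are hypothetical and serve only to reproduce the arithmetic.

```python
from collections import Counter
from itertools import combinations


def pair_class_counts(labels):
    """Count term pairs of each class for one query.

    labels maps each selected query term to 'content' or 'noisy'.
    """
    counts = Counter()
    for (_, class_a), (_, class_b) in combinations(sorted(labels.items()), 2):
        if class_a == class_b == "content":
            counts["content-content"] += 1
        elif class_a == class_b == "noisy":
            counts["noisy-noisy"] += 1
        else:
            counts["content-noisy"] += 1
    return counts


# A query with 10 terms, 5 of which are classified content (made-up names).
labels = {f"t{i}": ("content" if i < 5 else "noisy") for i in range(10)}
counts = pair_class_counts(labels)
print(counts["content-content"], counts["content-noisy"], counts["noisy-noisy"])
# prints: 10 25 10
```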
On the 50 queries used for experiments, there 
are 7544 term pairs, of which 1340 (17.76%) are 
of class content-content, 3426 (45.41%) of class 
content-noisy, and 2778 (36.82%) of class noisy- 
noisy. 40.47% of the terms are content terms. 
Obviously, a term can be classified content in a 
query and noisy in another. In the following sub- 
sections, we present our preliminary experiments 
on the CACM collection. 
2.3.1 First-order co-occurrence 
First-order co-occurrence measures the degree 
to which two terms appear together in the 
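same context. If the vectors of weights of $t_i$ 
and $t_j$ in documents $d_1$ to $d_n$ are respectively 
$(w_{i1}, w_{i2}, \ldots, w_{in})^T$ and $(w_{j1}, w_{j2}, \ldots, w_{jn})^T$, the 
cosine similarity is: 
$$\mathrm{sim}(t_i, t_j) = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2} \, \sqrt{\sum_{k=1}^{n} w_{jk}^2}} \qquad (1)$$ 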
The weight $w_{ij}$ was set to 1 if $t_i$ occurred in 
$d_j$, and to 0 otherwise; within-document frequency 
and document size were not exploited. 
Figure 1 shows the probability of finding each of the 
classes vs similarity. The probabilities are computed 
from the raw data binned in intervals of 
similarity of 0.05, and for the 0 similarity value. 
The values plotted on the graph are 0 for the 
0 similarity value, 0.025 for interval ]0, 0.05], 0.075 
for ]0.05, 0.1], etc. The similarities above 0.325 are 
not plotted because there are very few of them. 
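With binary weights, the numerator of the cosine is the number of documents containing both terms, and each sum under a square root is the document frequency of one term. The sketch below illustrates this special case together with the mapping from a similarity to the value plotted for its bin. The function names and the representation of a term as a set of document identifiers are assumptions made for the example.

```python
import math


def first_order_similarity(docs_i, docs_j):
    """Cosine similarity of two terms under binary document weights.

    docs_i and docs_j are the sets of documents in which each term occurs.
    """
    docs_i, docs_j = set(docs_i), set(docs_j)
    if not docs_i or not docs_j:
        return 0.0
    return len(docs_i & docs_j) / math.sqrt(len(docs_i) * len(docs_j))


def bin_value(sim, width=0.05):
    """Value plotted for a similarity: 0 for sim == 0, else the bin midpoint.

    Bins are half-open on the left: ]0, 0.05], ]0.05, 0.1], and so on.
    """
    if sim == 0:
        return 0.0
    # Rounding guards against floating-point noise at bin boundaries.
    k = math.ceil(round(sim / width, 9))
    return round((k - 0.5) * width, 3)


# Made-up document sets: two shared documents out of three each.
sim = first_order_similarity({1, 2, 3}, {2, 3, 4})
print(sim, bin_value(sim))
```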
There is a clear increase in the probability of the 
class 'content-content' with increasing similarity. 
It is interesting to note that if high values of 
similarity are evidence that the terms are content 
terms, small values can be taken as negative 
evidence for the same conclusion. By using 
smaller and more reliable contexts such as sentences, 
paragraphs or windows, it is expected that 
the measures of similarity should be more reliable, 
and the observed pattern should be stronger. 
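The per-bin class probabilities discussed here could be estimated along the following lines. This is only a sketch: the function name is hypothetical, the input is assumed to be a list of (bin value, pair class) tuples, and the toy data are invented for illustration rather than taken from the CACM experiments.

```python
from collections import Counter, defaultdict


def class_probabilities_by_bin(pairs):
    """Estimate P(class | similarity bin) from (bin_value, pair_class) tuples."""
    per_bin = defaultdict(Counter)
    for bin_value, pair_class in pairs:
        per_bin[bin_value][pair_class] += 1
    return {
        bin_value: {cls: n / sum(counts.values()) for cls, n in counts.items()}
        for bin_value, counts in sorted(per_bin.items())
    }


# Invented toy data, not results from the paper.
toy_pairs = [
    (0.025, "content-content"),
    (0.025, "content-noisy"),
    (0.025, "noisy-noisy"),
    (0.025, "noisy-noisy"),
    (0.075, "content-content"),
]
probabilities = class_probabilities_by_bin(toy_pairs)
print(probabilities[0.025]["noisy-noisy"])  # 0.5
```

Comparing each estimated probability with the a priori proportion of its class shows whether a given similarity range counts as positive or negative evidence.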
2.3.2 Second-order co-occurrence 
Second-order co-occurrence measures the degree 
to which two terms occur with similar 
terms. Terms are represented by vectors of co-occurrences 
where the dimensions correspond to 
each of the $m$ terms in the collection. The value 
attributed to dimension $k$ of term $t_i$ is the number 
of times that $t_i$ occurs with $t_k$. More elaborate 
measures take into account a weight for each dimension, 
which represents the discriminating value 
of the corresponding term. Term $t_i$ is represented 
here by $(w_{i1}, w_{i2}, \ldots, w_{im})^T$, where $w_{ij}$ is the number 
of times that $t_i$ and $t_j$ occur in the same context. 
[Plot omitted: probability (y-axis) vs similarity (x-axis, 0 to 0.3); legend entries include content-content and content-noisy] 
Figure 1: Probability of term pairs classes vs 
First-order similarity 
We again used Equation 1 for computing similarities 
between query terms. The similarity values 
were in general higher than for first-order co-occurrence. 
Note that the same data (term 
pairs) were taken for first- and second-order co-occurrence. 
For the computation of probabilities, 
data were binned in intervals of 0.1, on 
the range [0, 0.925] (there are not enough similarities higher 
than 0.925). Figure 2 shows the probabilities 
of the classes vs similarity. 
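A second-order similarity of this kind can be sketched as below. The sketch makes several assumptions that the text does not settle: contexts are whole documents, a term is not counted as co-occurring with itself, no dimension weights are applied, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cooccurrence_vectors(contexts):
    """Map each term to a sparse vector of co-occurrence counts.

    contexts is an iterable of term collections (e.g. one per document).
    Assumption: a term is not counted as co-occurring with itself.
    """
    vectors = {}
    for context in contexts:
        terms = set(context)
        for term in terms:
            vectors.setdefault(term, Counter())
        for a, b in permutations(terms, 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up contexts: 'oil' and 'gas' never co-occur but share a neighbor.
vectors = cooccurrence_vectors([["oil", "well"], ["gas", "well"]])
print(cosine(vectors["oil"], vectors["gas"]))  # 1.0
```

The toy example shows why second-order similarities are available for more term pairs: the two terms never appear in the same context, so their first-order similarity is 0, yet their co-occurrence vectors are identical.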
Again, the probability of having the class 
content-content increases with similarity, but to a 
lesser degree than with first-order similarity. More 
experiments are needed to see if first-order co- 
occurrence is in general stronger evidence of the 
quality of a term than second-order co-occurrence. 
However, a second-order similarity can be com- 
puted for nearly all query terms, while first-order 
similarities can only be computed for frequent 
enough terms. 
[Plot omitted: probability (y-axis) vs similarity (x-axis, 0 to 0.9); legend entries include noisy-noisy] 
Figure 2: Probability of term pairs classes vs 
Second-order similarity 
3 Discussion 
In this paper, we have formulated the hypothe- 
sis that query terms which are good descriptors 
of the information need tend to be more simi- 
lar to each other. We have proposed a method 
to verify if the hypothesis holds in practice, and 
presented some preliminary investigations on the 
CACM collection which seem to confirm the hy- 
pothesis. But many other investigations have to 
be done on bigger collections, involving more elab- 
orate measures of similarity using weights, differ- 
ent contexts (paragraphs, sentences), and not only 
single words but also phrases. Experiments are 
ongoing on a subset of the TREC collection (200 
MB), and preliminary results seem to confirm the 
hypothesis. Our hope is that investigations on 
this larger test collection will yield better results, 
since the computed similarities are statistically 
more reliable when they are computed on 
larger data sets. 
In a way, this work can be related to word sense 
disambiguation. This problem has already been 
addressed in the field of information retrieval, 
but it has been shown that word 
sense disambiguation is of limited utility (Krovetz 
and Croft, 1992). Here the problem is not the determination 
of the correct sense of a word, but 
rather the determination of the usefulness of a 
query term for retrieval. However, it would be 
interesting to see if techniques developed for word 
sense disambiguation such as that of Yarowsky (1992) 
could be adapted to determine the usefulness of 
a query term for retrieval. 
From our preliminary investigations, it seems 
that similarities can be used as positive and as 
negative evidence that a term should be useful for 
retrieval. The other part of our work is to determine 
a technique for using this pattern in order 
to improve term weighting, and ultimately 
retrieval effectiveness. While simple techniques 
(e.g. clustering) might work and will be tried, 
we seriously doubt it because every 
relationship between query terms should be taken 
into account, and this leads to very complex interactions. 
We are presently developing a model 
where the probability of the state (content/noisy) 
of a term is determined by uncertain inference, 
using a technique for representing and handling 
uncertainty named Probabilistic Argumentation 
Systems (Kohlas and Haenni, 1996). In the near 
future, this model will be implemented and tested 
against simpler models. If the model predicts 
the state of each query term reasonably well, 
this information can be used to refine the weighting 
of query terms and lead to better information 
retrieval. 
Acknowledgements 
The author wishes to thank Warren Greiff for 
comments on an earlier draft of this paper. This 
research was supported by the SNSF (Swiss National 
Science Foundation) under grant 21-49427.95. 
References 
J. Kohlas and R. Haenni. 1996. Assumption- 
based reasoning and probabilistic argumenta- 
tion systems. In J. Kohlas and S. Moral, 
editors, Defensible Reasoning and Uncertainty 
Management Systems: Algorithms. Oxford Uni- 
versity Press. 
R. Krovetz and W.B. Croft. 1992. Lexical ambi- 
guity and information retrieval. ACM Transac- 
tions on Information Systems, 10(2):115-141. 
H.J. Peat and P. Willet. 1991. The limitations 
of term co-occurrence data for query expansion 
in document retrieval systems. Journal 
of the American Society for Information Sci- 
ence, pages 378-383, June. 
Y. Qiu and H.P. Frei. 1993. Concept based query 
expansion. In Proc. of the Int. ACM-SIGIR 
Conf., pages 160-169. 
H. Schutze and J.O. Pedersen. 1997. A 
cooccurrence-based thesaurus and two applications 
to information retrieval. Information Processing 
& Management, 33(3):307-318. 
A.F. Smeaton and C.J. van Rijsbergen. 1983. The 
retrieval effects of query expansion on a feed- 
back document retrieval system. The Computer 
Journal, 26(3):239-246. 
C.J. van Rijsbergen and K. Sparck-Jones. 1973. 
A test for the separation of relevant and 
non-relevant documents in experimental re- 
trieval collections. Journal of Documentation, 
29(3):251-257, September. 
C.J. van Rijsbergen. 1977. A theoretical basis for 
the use of co-occurrence data in information re- 
trieval. Journal of Documentation, 33(2):106- 
119. 
J. Xu and W.B. Croft. 1996. Query expansion 
using local and global document analysis. In 
Proc. of the Int. ACM-SIGIR Conf., pages 4- 
11. 
D. Yarowsky. 1992. Word-sense disambiguation 
using statistical models of Roget's categories 
trained on large corpora. In COLING-92, pages 
454-460. 
