A CONCEPTUAL FRAMEWORK FOR AUTOMATIC AND DYNAMIC THESAURUS UPDATING 
IN INFORMATION RETRIEVAL SYSTEMS 
M.F. BRUANDET 
Laboratoire IMAG 
B.P. 53X, 38041 GRENOBLE Cedex (France) 
ABSTRACT 
This paper presents a methodology for automatic thesaurus
construction intended to help document searching. We want to
develop classes for specific topics (for a given corpus) without
a priori semantic information. The information contained in the
thesaurus leads to new search formulations via automatic and/or
user feedback. This presentation, although theoretical, is
oriented toward a database implementation.
Preliminary remarks 
Different strategies must be developed in Information Retrieval
Systems to increase "recall" and "precision" [8, 9]. The classic
one is the construction of a thesaurus. A thesaurus is usually
defined as a set of terms (called descriptors) and a set of
relations between these terms.
This study is made for an information retrieval system using an
inverted file (a bitmap in which each keyword points to the set
of documents containing this keyword). To formulate a request,
the user defines a set of keywords and boolean operators on this
set (as in, for example, the MISTRAL, GOLEM-PASSAT and STAIRS
systems). When a document is entered into the database, a module
(e.g. PIAF [4, 5]) generates stems from the data (several
grammatical variants of the same word are reduced to a canonical
form). We call this form an item.
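The inverted file and boolean requests described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the sample documents are hypothetical, Python sets stand in for the bitmap, and lowercasing stands in for the stemming module (no real stemmer such as PIAF is used).

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each item (here simply a lowercased word) to the set of
    document numbers that contain it."""
    index = defaultdict(set)
    for doc_number, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_number)
    return index

def query_and(index, *items):
    """Documents containing every listed item (boolean AND)."""
    sets = [index.get(item, set()) for item in items]
    return set.intersection(*sets) if sets else set()

def query_or(index, *items):
    """Documents containing at least one listed item (boolean OR)."""
    return set().union(*(index.get(item, set()) for item in items))

# Hypothetical three-document collection.
documents = {
    1: "thesaurus construction for retrieval",
    2: "retrieval of documents by keyword",
    3: "automatic thesaurus updating",
}
index = build_inverted_file(documents)
print(sorted(query_and(index, "thesaurus", "retrieval")))  # [1]
print(sorted(query_or(index, "keyword", "updating")))      # [2, 3]
```

The set of documents returned by such a request is what the following sections call the local set.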
Thesaurus construction in the context 
of local documents 
Our objective is to find a method for the construction of
non-hierarchical relations and the definition of item clusters
from these relations.
A point to be underlined is that this methodology can be used
efficiently only on homogeneous collections of texts. For this
purpose, we consider only a subset of the database: the local set
of all documents returned from a given query. The local
clustering method makes use of the common occurrences of items
within a certain "neighborhood". This method has been studied by
R. ATTAR and A.S. FRAENKEL (in "Local feedback in full-text
retrieval") [1].
Let $D_\ell$ be the local set of documents retrieved from a given
query and $T_\ell$ the set of items contained in $D_\ell$. We
define a metrical function which is inversely proportional to the
distance between items in the same sentence. Each item is defined
by its coordinates (DN, SN, IN), where DN is the document number,
SN the sentence number and IN the item number within a sentence.
For any item $t \in T_\ell$, let $w_t(i)$ be the coordinate of
the ith occurrence of t.
For any couple $(s,t) \in T_\ell \times T_\ell$, we define
$d = |w_t(i) - w_s(j)|$, the distance between the ith occurrence
of t and the jth occurrence of s.
In fact
(1) $d = |IN_t(i) - IN_s(j)|$ with $DN_t(i) = DN_s(j)$ and $SN_t(i) = SN_s(j)$
Let F be a function of the distance d:
(2) $F(w_t(i), w_s(j)) = \begin{cases} 1/d & \text{if } w_t(i), w_s(j) \text{ are in the same sentence, with } d \le 20 \\ 0 & \text{otherwise} \end{cases}$
For s and t $\in T_\ell$ we define:
(3) $b(s,t) = \sum_i \sum_j F(w_t(i), w_s(j))$
where the summation is over all occurrences i and j of s and t.
Remark: $b(s,t) = b(t,s)$.
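As an illustration of formulas (1) to (3), the sketch below computes b(s,t) from lists of occurrence coordinates. The sample occurrences are hypothetical, and the handling of d = 0 (two occurrences at the same position, which the text does not define) is an assumption.

```python
MAX_DISTANCE = 20  # upper bound on d taken from formula (2)

def F(w_t, w_s):
    """Formula (2): 1/d for two occurrences in the same sentence
    with d <= 20, and 0 otherwise."""
    dn_t, sn_t, in_t = w_t
    dn_s, sn_s, in_s = w_s
    if (dn_t, sn_t) != (dn_s, sn_s):
        return 0.0                 # not in the same sentence
    d = abs(in_t - in_s)           # formula (1)
    if d == 0 or d > MAX_DISTANCE:
        return 0.0                 # d == 0 is treated as 0 (assumption)
    return 1.0 / d

def b(occurrences_s, occurrences_t):
    """Formula (3): sum of F over all pairs of occurrences of s and t."""
    return sum(F(w_t, w_s) for w_t in occurrences_t for w_s in occurrences_s)

# Hypothetical occurrences, each given as a (DN, SN, IN) triple.
occurrences = {
    "thesaurus": [(1, 1, 2), (1, 2, 5)],
    "updating":  [(1, 1, 4), (2, 1, 1)],
}
print(b(occurrences["thesaurus"], occurrences["updating"]))  # 0.5
```

Only one pair of occurrences shares a sentence here (positions 2 and 4 of sentence 1 in document 1), so b reduces to 1/2; swapping the arguments gives the same value, as stated in the remark.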
In order to normalize the function, we take
$\mu_R(s,t) = \frac{b(s,t)}{f(s)\, f(t)}$
where f(t) is the number of occurrences of t in all the local
documents $D_\ell$, so that $0 \le \mu_R(s,t) \le 1$.
Through this function, we obtain for an item s a reference vector
$R_s$, which is the list of items t related to s such that
$\mu_R(s,t)$ is greater than or equal to a threshold
$\varepsilon$. These values form an eigen vector $E_{R_s}$.
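A minimal sketch of the normalisation and of the thresholding that yields a reference vector is given below. It assumes that b(s,t) is divided by f(s) f(t), which keeps the value within [0,1] because each term of b is at most 1; the function names, the binding values and the frequencies are hypothetical.

```python
def association(b_st, f_s, f_t):
    """Normalised association value; the denominator f(s) * f(t)
    is an assumption made for this sketch."""
    return b_st / (f_s * f_t)

def reference_vector(s, bindings, frequency, epsilon):
    """Return R_s as {t: mu_R(s, t)}, keeping only the items t whose
    association value reaches the threshold epsilon."""
    vector = {}
    for (a, t), b_st in bindings.items():
        if a != s:
            continue
        mu = association(b_st, frequency[s], frequency[t])
        if mu >= epsilon:
            vector[t] = mu
    return vector

# Hypothetical occurrence counts f and binding values b(s, t).
frequency = {"thesaurus": 2, "updating": 2, "query": 4}
bindings = {("thesaurus", "updating"): 0.5, ("thesaurus", "query"): 0.4}
print(reference_vector("thesaurus", bindings, frequency, epsilon=0.1))
# {'updating': 0.125}
```

The dictionary returned plays the role of both the reference vector (its keys) and the associated vector of values.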
Taking into account new local information 
in thesaurus updating 
Without excluding the search for hierarchical relations (specific
or generic) in the thesaurus, we try to build a set or a group of
items having a notion of "similarity" or "liaison" between
themselves. This thesaurus is built as the answers of the
Information Retrieval System in use are analysed. It must be
structured so that the updating can be dynamic and automatic; the
implementation has not yet been studied. The main problem of
updating is to take into account the "liaisons", "proximities" or
"similarities" between the items already registered in the
thesaurus and the new liaisons found after a new query.
For any query, we obtain a set of items related to s. Let $R_s$
be the previous reference vector ($\mu_{R_s}$ its associated
function) and $R'_s$ the newly calculated vector ($\mu_{R'_s}$
its associated function).
A new reference vector may be calculated from $R_s$ and $R'_s$
using two functions m(s,t) and M(s,t):
(4) $m(s,t) = \frac{\min(\mu_{R_s}(s,t),\, \mu_{R'_s}(s,t))}{1 - |\mu_{R_s}(s,t) - \mu_{R'_s}(s,t)|}$
(5) $M(s,t) = \frac{\max(\mu_{R_s}(s,t),\, \mu_{R'_s}(s,t))}{1 + |\mu_{R_s}(s,t) - \mu_{R'_s}(s,t)|}$
The function M involves all the items t which are related, or
not, to s in $R_s$ and $R'_s$ (see Table I). The function m
allows us to consider only the items which are both in $R_s$ and
in $R'_s$ (see Table I).
One might consider m and M to play respectively the roles of the
intersection and the union of the sets of items t related to s.
Table I. Values of the functions m and M (formulas (4) and (5))

  $\mu_{R_s}(s,t)$   $\mu_{R'_s}(s,t)$   m               M
  0                  1                   indeterminate   0.5
  0                  0.2                 0               0.166
  0                  0.8                 0               0.44
  0.1                0.9                 0.5             0.5
  0.1                0.8                 0.33            0.47
  0.1                0.7                 0.25            0.43
  0.1                0.6                 0.2             0.40
  0.1                0.5                 0.16            0.36
  0.1                0.4                 0.142           0.30
  0.1                0.3                 0.125           0.25
  0.1                0.2                 0.11            0.18
  0.1                0.1                 0.1             0.1
  0.5                0.5                 0.5             0.5
  0.9                0.2                 0.66            0.52
  0.9                0.4                 0.8             0.6
  0.9                0.5                 0.83            0.64
  0.9                0.6                 0.85            0.69
  0.9                0.8                 0.88            0.81

  with m = Min / (1 - |difference|) and M = Max / (1 + |difference|)
Functions m and M consider the weakest and the strongest bindings
between items. Any association between s and t is meaningful only
with regard to the "binding strength", that is to say the value
of the association function.
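Formulas (4) and (5) translate directly into code. The sketch below is one possible transcription; returning None for the indeterminate case (a zero denominator in formula (4)) is a choice made for this illustration, not something the text prescribes.

```python
def m(mu_old, mu_new):
    """Formula (4): Min / (1 - |difference|).
    Returns None when the denominator is zero (indeterminate case)."""
    denominator = 1 - abs(mu_old - mu_new)
    if denominator == 0:
        return None
    return min(mu_old, mu_new) / denominator

def M(mu_old, mu_new):
    """Formula (5): Max / (1 + |difference|)."""
    return max(mu_old, mu_new) / (1 + abs(mu_old - mu_new))

# A few pairs of values taken from Table I.
for mu_old, mu_new in [(0, 1), (0.1, 0.9), (0.5, 0.5), (0.9, 0.2)]:
    print(mu_old, mu_new, m(mu_old, mu_new), M(mu_old, mu_new))
```

The printed values agree with Table I up to the rounding or truncation used in the table. Two identical values are left unchanged by both functions, while a strong value paired with a weak one is pulled down.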
Use of the functions m and M for thesaurus
construction and updating
For an item x, only the items related to x in several local
contexts must be considered in the thesaurus. Thus, it is
necessary to keep records of the initial queries in a
pseudo-thesaurus. For any item x, this pseudo-thesaurus registers
the set of items related to x in one or more local contexts.
Let
$PS_x = \{ t \mid \mu_{PS}(x,t) \ge \varepsilon \}$
for x belonging to the set of items T ($T = \bigcup T_\ell$).
For an item x of $T_\ell$, three reference vectors (and their
associated functions) can be obtained: $R_x$, $PS_x$ and $T_x$,
which are the sets of items t related to x considered
respectively in the local context being treated, in one or more
local contexts kept in the pseudo-thesaurus, and in the global
context kept in the thesaurus.
These sets can be void, so several cases can be encountered:
1) $PS_x$ and $T_x$ are not void
The updating process is performed in three 
steps : 
Step 1 :
In order to know whether the newly calculated liaisons in $R_x$
already exist in other local contexts, we compare $R_x$ and
$PS_x$.
Only the common items of these two reference vectors are
considered, and we form a temporary reference vector $P_x$ using
the function m (formula (4)).
In $P_x$, only the items from $R_x$ which were previously related
to x in at least one context are retained. The stronger
connections are decreased (see Table I) because we can suppose
they are only local.
Step 2 : Thesaurus updating
The thesaurus is updated in one of two ways:
(i) if $P_x$ and $T_x$ contain the same items t, only the eigen
vector $E_{T_x}$ (of $T_x$) is modified, using the function m
(formula (4));
(ii) if the items t in $T_x$ are different from those occurring
in $P_x$, then a new reference vector $T_x$ is constructed by
combining the values of the functions $\mu_{T_x}$ and
$\mu_{PS_x}$ using M (formula (5)).
Remarks:
- We do not calculate the new association function between two
items for $T_x$ with m (formula (4)), because m would not
introduce into the thesaurus the new items related to x that
appear in several local contexts.
- The function M uses both the common and the non-common items
and introduces into the thesaurus the new items which are related
to x in at least two local contexts.
Step 3 : Pseudo-thesaurus updating
The pseudo-thesaurus updating must take into account the new
items occurring in $R_x$. The new association function for $PS_x$
is calculated from the association function $\mu_{R_x}$ and the
old association function $\mu_{PS_x}$ using M (formula (5)).
2) $PS_x$ and $T_x$ are void
This case corresponds to the situation where x has never appeared
in any local context. We create the reference vectors $PS_x$ in
the pseudo-thesaurus and $R_x$, with the association function
$\mu_{R_x}$ ($PS_x = R_x$). No information about x is kept in the
thesaurus ($T_x = \emptyset$).
3) $PS_x$ is not void and $T_x$ is void
This corresponds to the case where x has already appeared in only
one local context. If $R_x \ne \emptyset$, then we can build the
initial reference vector $T_x$ in the thesaurus. We use the
association function m (formula (4)) calculated from the values
of the association functions $\mu_{R_x}$ and $\mu_{PS_x}$
(respectively contained in the eigen vectors $E_{R_x}$ and
$E_{PS_x}$).
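The three cases above can be sketched as a single update routine. This is a reading of the procedure under several assumptions that are not stated in the text: reference vectors are stored as dictionaries from items to association values, a missing item counts as 0, the indeterminate case of formula (4) is mapped to 0, step 2 (ii) combines $T_x$ with the temporary vector $P_x$ (so that only items seen in at least two local contexts enter the thesaurus), and the pseudo-thesaurus is also updated with M in case 3. All names are hypothetical.

```python
def m(a, b):
    """Formula (4); the indeterminate case is mapped to 0 (assumption)."""
    denominator = 1 - abs(a - b)
    return min(a, b) / denominator if denominator else 0.0

def M(a, b):
    """Formula (5)."""
    return max(a, b) / (1 + abs(a - b))

def combine(f, u, v, common_only):
    """Apply f item by item; an item missing from a vector has value 0."""
    items = set(u) & set(v) if common_only else set(u) | set(v)
    return {t: f(u.get(t, 0.0), v.get(t, 0.0)) for t in items}

def update(x, R_x, pseudo, thesaurus):
    """Update the entries of x in the pseudo-thesaurus and in the
    thesaurus (both modified in place) from the local vector R_x."""
    PS_x = pseudo.get(x, {})
    T_x = thesaurus.get(x, {})
    if not PS_x and not T_x:              # case 2: x never seen before
        pseudo[x] = dict(R_x)
        return
    if not R_x:
        return
    P_x = combine(m, R_x, PS_x, common_only=True)                # step 1
    if not T_x:                           # case 3: initial T_x
        if P_x:
            thesaurus[x] = P_x
    elif set(P_x) == set(T_x):            # case 1, step 2 (i)
        thesaurus[x] = combine(m, T_x, P_x, common_only=True)
    else:                                 # case 1, step 2 (ii)
        thesaurus[x] = combine(M, T_x, P_x, common_only=False)
    pseudo[x] = combine(M, R_x, PS_x, common_only=False)         # step 3

# Three successive hypothetical local contexts for the item "thesaurus".
pseudo, thesaurus = {}, {}
update("thesaurus", {"dictionary": 0.5, "index": 0.2}, pseudo, thesaurus)
update("thesaurus", {"dictionary": 0.5, "query": 0.4}, pseudo, thesaurus)
update("thesaurus", {"dictionary": 0.9, "query": 0.4}, pseudo, thesaurus)
print(sorted(thesaurus["thesaurus"]))  # ['dictionary', 'query']
```

In this run "index" appears in a single local context and therefore stays in the pseudo-thesaurus only, while "query" enters the thesaurus once it has been seen in two local contexts.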
The present experimentation shows that the items related to x in
$T_x$ (initial step) include local synonyms, some global synonyms
and many parasitic items. After a few thesaurus updatings, the
values of the association function for parasitic items decrease
rapidly, and the values for local and global synonyms increase.
It is clear that such a thesaurus can become reliable only after
a large number of queries. In such a situation, new updating
procedures might be considered so that new parasitic items are
not introduced in $T_x$ (which would break the stability of
$T_x$).
Global treatment of thesaurus 
Let T be the large set of items registered in the thesaurus. In
order to classify T (i.e. to split T into classes of similar
items), we consider the couple of reference vectors $T_x$ and
$T_y$ (and so $E_{T_x}$ and $E_{T_y}$) for any items x and y.
Let r(x,y) be a similarity measure:
(6) $r(x,y) = \frac{\sum_T \min(\mu_{T_x},\, \mu_{T_y})}{\sum_T \max(\mu_{T_x},\, \mu_{T_y})}$
(7) $d(x,y) = 1 - r(x,y)$ is a pseudo-distance whose range is
[0,1].
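Formulas (6) and (7) can be illustrated as follows. The sketch assumes that reference vectors are stored as dictionaries from items to association values and that an item absent from a vector has value 0, so that summing over the union of the two vectors is equivalent to summing over T. The value returned for two empty vectors is also an assumption.

```python
def similarity(T_x, T_y):
    """Formula (6): sum of minima divided by sum of maxima."""
    items = set(T_x) | set(T_y)
    numerator = sum(min(T_x.get(t, 0.0), T_y.get(t, 0.0)) for t in items)
    denominator = sum(max(T_x.get(t, 0.0), T_y.get(t, 0.0)) for t in items)
    return numerator / denominator if denominator else 0.0  # empty case: assumption

def pseudo_distance(T_x, T_y):
    """Formula (7): d(x, y) = 1 - r(x, y)."""
    return 1 - similarity(T_x, T_y)

# Hypothetical reference vectors of two items x and y.
T_x = {"a": 0.5, "b": 0.25}
T_y = {"a": 0.25, "c": 0.25}
print(similarity(T_x, T_y))       # 0.25
print(pseudo_distance(T_x, T_y))  # 0.75
```

Here the minima sum to 0.25 (only "a" is shared) and the maxima sum to 1.0, which gives r = 0.25 and d = 0.75.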
We can use an association matrix (i.e. a term-term matrix)
between items and find a partition of T into equivalence classes.
However, this method hardly applies to a great many items and
does not seem realistic for the large-scale dictionaries (6000 or
10000 items, for example) which are common in the information
retrieval field. To overcome this drawback, we may try to build
up the global association matrix from the local ones. Some ideas
have been suggested [2] using fuzzy set theory [6, 13], but these
are still theoretical approaches.
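The text does not say how the partition into equivalence classes is derived from the association matrix. One common possibility, shown here purely as an assumption, is to link two items whenever their pseudo-distance d(x,y) does not exceed a threshold and to take the connected components as classes. The threshold `delta`, the function name and the distances are hypothetical.

```python
def partition(items, distance, delta):
    """Group items into classes: x and y fall in the same class when a
    chain of pairs with distance <= delta connects them (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the class representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (x, y), d in distance.items():
        if d <= delta:
            parent[find(x)] = find(y)

    classes = {}
    for x in items:
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical pseudo-distances d(x, y) between four items.
items = ["index", "thesaurus", "dictionary", "query"]
distance = {
    ("thesaurus", "dictionary"): 0.2,
    ("dictionary", "index"): 0.3,
    ("index", "query"): 0.9,
    ("thesaurus", "query"): 0.8,
}
print(partition(items, distance, delta=0.4))
# [['dictionary', 'index', 'thesaurus'], ['query']]
```

Because every pair of items may need a distance, the cost grows quadratically with the number of items, which is the scaling difficulty raised above.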
Feedback query processing 
A number of papers deal with feedback query processing
[1, ?, 12], and our approach is similar. We intend to adopt the
following strategy, though we lack practical results showing that
it gives a better "score" on queries.
After a query we therefore have a set $R_s$ of items related to s
(for each $s \in T_\ell$) and a partition of $T_\ell$ into
equivalence classes $F_j$. In the thesaurus we might have both a
set $T_s$ (items related to s) and a partition of the global set
T into equivalence classes $C_i$.
Several strategies can be used; they are detailed in another
paper [4]. We can use only the local context, only the global
context, or both. We summarize some of the solutions below:
1) use of only global context
A query is automatically generated with t instead of x when t
belongs to the reference vector $T_x$ and $\mu_{T_x}(x,t)$ is
greater than or equal to a threshold $\varepsilon$.
If the user agrees, a new query is generated with t when x and t
are equivalent in the thesaurus.
2) use of both local and global context
When an item t is considered as "similar" to x both in the local
context ($R_x$) and in the global context ($T_x$), and
$\mu_{R_x}(x,t) \approx \mu_{T_x}(x,t)$, t automatically replaces
x in the query. When $R_x$ and $T_x$ have common items, we can
propose to the user new queries with an item t appearing in $T_x$
but not in $R_x$ ($\mu_{T_x}(x,t) \ge \varepsilon$).
As previously mentioned, we can apply the same strategy to the
local equivalence classes $F_j$ and the global equivalence
classes $C_i$ (automatic feedback query processing with
$x \in C_i \cap F_j$, and under user control with $x \in C_i$ but
$x \notin C_i \cap F_j$ and $C_i \cap F_j \ne \emptyset$).
In this last case, we can expect global synonymies to allow the
retrieval of new documents originally left out.
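The two strategies based on reference vectors can be sketched as below. The sketch makes several assumptions: vectors are dictionaries from items to association values, the comparison between the local and global values in strategy 2 is read as approximate equality and implemented with a hypothetical `tolerance` parameter, and the function names and sample values are illustrative only.

```python
def automatic_substitutes(x, T, epsilon):
    """Strategy 1: items t of T_x whose global association value with x
    reaches epsilon; each can replace x in a generated query."""
    return sorted(t for t, mu in T.get(x, {}).items() if mu >= epsilon)

def local_global_feedback(x, R, T, epsilon, tolerance=0.1):
    """Strategy 2: returns (automatic, proposed).
    automatic: items related to x both locally and globally with close values.
    proposed: items of T_x absent from R_x, to be submitted to the user."""
    R_x, T_x = R.get(x, {}), T.get(x, {})
    common = set(R_x) & set(T_x)
    automatic = sorted(t for t in common if abs(R_x[t] - T_x[t]) <= tolerance)
    proposed = []
    if common:
        proposed = sorted(t for t in T_x if t not in R_x and T_x[t] >= epsilon)
    return automatic, proposed

# Hypothetical local (R) and global (T) reference vectors.
R = {"thesaurus": {"dictionary": 0.5, "index": 0.2}}
T = {"thesaurus": {"dictionary": 0.55, "index": 0.6, "lexicon": 0.4, "query": 0.05}}
print(automatic_substitutes("thesaurus", T, epsilon=0.3))
# ['dictionary', 'index', 'lexicon']
print(local_global_feedback("thesaurus", R, T, epsilon=0.3))
# (['dictionary'], ['lexicon'])
```

In this example "dictionary" is substituted automatically because its local and global values are close, whereas "lexicon", known only globally, is merely proposed to the user.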
From the previous analysis, it seems that the 
best strategies should be those using both local 
and global contexts, but this needs to be veri- 
fied. 
Conclusions 
We conclude from the present experimentation on a small number of
French texts that the thesaurus updating method should yield
horizontal thesaurus relations.
Moreover, unexpected relations between items should appear in the
thesaurus, that is, associations which strongly reflect the
content of the corpus and which could not be established and
enhanced a priori.
The methodology presented above does not exclude any further
intervention on the thesaurus to refine semantic information in
some particular cases, such as modifying values of the
association function for some items or enriching the definition
of synonyms.
Our next goal for such a design of the thesaurus is twofold:
1) we wish to make non-boolean queries possible through the use
of fuzzy keywords, with a subsequent improvement of the dialogue;
2) we wish to cluster documents with a dynamic indexing
mechanism.

REFERENCES 

1. R. ATTAR & A.S. FRAENKEL
Local feedback in full text retrieval systems.
Journal of ACM, vol. 24, n°3, pp. 397-417, July 1977.

2. M.F. BRUANDET
A propos de la construction automatique d'un thesaurus flou
dans un système de recherche d'information (système
documentaire).
Internal research report, IMAG Grenoble, June 1980.

3. M.F. BRUANDET
A conceptual framework for automatic and dynamic thesaurus
updating and for feedback query processing.
Proceedings of the Second International Conference on Data
Bases in the Humanities and Social Sciences, Madrid, June 1980.

4. J. COURTIN
Algorithmes pour le traitement interactif des langues
naturelles.
Thèse d'Etat soutenue à l'Université Scientifique et Médicale
de Grenoble, INPG, October 1977.

5. E. GRANDJEAN
Projet PIAF - Application à la documentation automatique :
définition et utilisation du produit prototype PIAFDOC.
Internal research report, IMAG Grenoble, 1979.

6. T. RADECKI
Mathematical model of information retrieval system based on
the concept of fuzzy thesaurus.
Information Processing and Management, vol. 12, pp. 313-318,
Pergamon Press, 1976.

7. L. REISINGER
On fuzzy thesaurus.
COMPSTAT 74, Proc. Symp. Computational Statistics, Bruckmann,
Ferschl, Schmetterer (eds.), Vienna, Physica Verlag.

8. G. SALTON
The SMART retrieval system: experiments in automatic document
processing (ch. 21: The use of statistical significance in
relevance feedback, J.S. Brown, P.D. Reilly).
Prentice Hall, 1971.

9. G. SALTON
Dynamic information processing.
Prentice Hall, 1975.

10. G. SALTON and D. BERGMARK
Clustered file generation and its application to computer
science taxonomies.
IFIP Information Processing 77, pp. 441-447, North Holland
Publishing Company.

11. W. SILVERT
Symmetric summation: a class of operations on fuzzy sets.
IEEE Trans. SMC, 1979.

12. C.T. YU, M.K. SIU
Effective automatic indexing using term addition and deletion.
Journal of ACM, vol. 25, n°2, pp. 210-225, April 1978.

13. L.A. ZADEH
Fuzzy sets.
Information and Control, pp. 338-353, 1965.
