Corpus-dependent Association Thesauri for Information Retrieval 
Hiroyuki Kaji "~, Yasutsugu Morimoto "~, Toshiko Aizono "l, and Noriyuki Yamasaki "2 
"~ Central Research Laboratory, Hitachi, Ltd. ,2 Software Division, Hitachi, Ltd. 
1-280 Higashi-Koigakubo, Kokubunji-shi 549-6 Shinano-cho, Totsuka-kt,, Yokohanla-shi 
Tokyo 185-8601, Japan Kanagawa 244-0801, Japan 
{kaji, morimoto, aizono}@crl.hitachi.co.jp, yamasa n@soft.hitachi.co.jp 
Abstract 
This paper presents a method for automati- 
cally generating an association thesaurus 
from a text corpus, and demonstrates its ap- 
plication to information retrieval. The the- 
saurus generation method .consists of ex- 
tracting tenns and co-occurrence data from a 
corpus and analyzing the correlation between 
terms statistically. A new method for dis- 
ambiguating the structure of compound 
nouns, which is a key component for term 
extraction, is also proposed. The automati- 
cally generated thesaurus is effectively used 
as a tool for exploring infonnation. A the- 
saurus navigator having novel functions such 
as term clustering, thesaurus overview, and 
zooming-in is proposed. 
1 Introduction 
A thesaurus plays essential roles in information 
retrieval systems. In particular, a domain- 
specific thesaurus greatly improves the effective- 
ness of information retrieval. However, we are 
confronted with the difficult problem of how to 
construct and maintain a domain-specific thesau- 
rus. The goal of our present research is to es- 
tablish a method for autolnatically generating a 
thesaurus from a text corpus of a domain and 
demonstrate its application to information re- 
trieval. 
Thesauri are classified 
into taxonomy-type thesauri 
and association thesauri. 
There has been various 
research on the extraction of 
taxonomic information |'io111 
a corpus, including extrac- 
tion of hyponyms by using 
linguistic patterns (Hearst 
1992) and extraction of synonyms based on the 
similarity of sets of co-occurring words (Ruge 
1991; Grefenstette 1992). However, the perfor- 
mance of these methods is limited, and they 
should be considered as aids to augment hand- 
made thesauri. In contrast, an association the- 
saurus, that is a collection of pairs of semanti- 
cally associated terms, can be possibly generated 
from a corpus entirely automatically. Word 
association norms based on co-occurrence infor- 
mation have been proposed by (Church and 
Hanks 1990). Here we focus on the automatic 
generation of an association thesanrus. 
Association thesauri are as useful as taxon° 
omy-type thesauri in information retrieval. The 
improvement of retrieval effectiveness by using 
an association thesaurus has been reported by a 
number of papers (Jing and Croft 1994; Schutze 
and Pedersen 1994). We propose to use a coro 
pus-dependent association thesaurus interac- 
tively. 
2 Automatic Generation of an Associa- 
tion Thesaurus from a Corpus 
2.1 Outline of the thesaurus generation 
method 
The proposed thcsaurus generation method con- 
sists of term extraction, co-occurrence data ex- 
traction, and correlation analysis, as shown in Fig. 
1. 
~¢ Term F 
U Co-occ.rrence \[ data extraction rl pmrs and their 
223___ 
Correlat!m~ 
Fig. 1. Automatic generation of a thesaurus from a corpus. 
404 
2.1.1 Ternt extraction 
A thesaurus should consist of terms, each repre- 
senting a domain-specific concept. Most of tile 
terms representing important concepts are nol, lllS, 
simple or compound, that frequently occur in the 
corpus. ThereR)re, we extract both simple 
nonns and compound nouns whose occurrence 
frequencies exceed a predetermined threshold. 
We also use a list of stop words since frequently 
occurring nouns are not always terms. 
Compound nouns are identified by a pattern 
matching method using a part-of-speech sequen- 
ce pattern. Naturally, the pattern is  
specific. The following is a pattern for Japanese 
compound nouns: 
COMP NOUN := { PREFIX } NOUN + 
SUFFIX } { PREFIX } NOUN + 
A problem in extracting compound nouns is 
that a word sequence matched to the above pat- 
tern, which actually defines just the type of noun 
phrase, is not always a term. We filter out some 
kind of non-term noun phrases by using a list of 
stop words for the first and last elements of com- 
pound nouns. Stop words for the first element 
of compound nouns include referential nouns (e.g. 
jouki (above-mentioned)) and determiner nouns 
(e.g. kaku (each)). Stop words for the last ele- 
ment of compound nouns include time/place 
nouns (e.g. nai (inside)) and relational nouns (e.g. 
koyuu (peculiar)). 
Another importmat problem we are confronted 
with in term extraction is the structural ambiguity 
of compound nouns. For our purpose, we need 
to extract non-maximal compound nouns as well 
sts lnaxinlal comp()und nouns. Here a non- 
maximal compound noun means one that occurs 
as a part of a larger conlpound noun, and a 
nlaximal compound nonn means one thstt occurs 
not as a part of a larger compound noun. We 
must disambiguate tile structure of compound 
nouns to correctly extract non-maximal com- 
pound nouns. We have developed a statistical 
disambiguation method, the detail and evaluation 
of which are described in 2.2. 
2.1.2 Co-occurrence data evtraction 
Our purpose is to collect pairs of semantically or 
contextually associated terms, no matter what 
kind of association. So we extract co- 
occurrence in a window. That is, every pair of 
terms occurring together within a window is 
extracted as the window is moved through a text. 
The window size can be specified rather arbitrar- 
ily. Considering our purpose, the window 
should accommodate a few sentences. At tile 
same time, the window size should not be too 
large from the viewpoint of computational load. 
Therefore, 20 to 50 words, excluding function 
words, seems to be an appropriate value. 
Note that we filter out any pair of words co- 
occurring within a compound noun. If such 
pairs were included in co-occurrence data, they 
would show high correlation. However, they 
would be redundant because compound nouns are 
treated as entities in our thesaurus. 
2.1.3 Correlation analysis 
As a correlation measure between terms, we use 
mutual information (Church and Hanks 1990). 
The mutual inlbrmation between terms t~ and t i is 
defined by the following formula: 
g(ti, tj)/~' g(t,, t,) 
Ml(ti, tj) = log_, /i.i , 
{t'(t0/ i ~ f(t0}' { f(tj)//j~ f(ti, 
where f(t~) is the occurrence frequency of term t~, 
and g(ti,ti ) is the co-occurrence frequency of 
terms t~ and tj. A rnaxinmm nunrber of associat-. 
ed terms for each term is predetermined as well 
as a threshold for tile mutual information, and 
associated terms are selected based on the de-o 
scending order of mutual information. 
Mutual infornaation involves a problem in 
that it is overestimated for low-frequency terms 
(I)unning 1993). Therefore, we determine 
whether two terms are related to each other by a 
log-likelihood ratio test, and we filter out pairs of 
terms that do not pass the test. 
2.2 Disambiguation of compound noun 
structure 
2.2.1 Disantbiguation based on coitus statistics 
Our disanabiguation method is described below 
for tile case of a compound noun consisting of 
three elelnents. A compound noun W~W2W 3 
has two possible structures: WI(W~W3) and 
(W~W,)W 3. We deternfine its structure based 
on the occurrence t}equencies of maxilnal com- 
pound nouns as follows: If tile maximal com- 
pound noun W~W3 occurs more frequently than 
tile inaxinml compound noun W~W,, then tile 
405 
Table 1 
(a) Examples 
Global-statistics-based disambiguation vs. local-statistics-based disambiguation. 
Maximal compound noun 
Deta shori shisutemu 
(data processing system) 
Deta tsuskin seiL:vo souchi 
(Data communication 
controller) 
Kaisen seigyo purosessa 
(Line control processor) 
Freq. Global-statistics-based 
Structure Freq. 
478 (Deta shori) shisutemu 478 
94 Deta (tsuskin seigvo 94 
souchi) 
54 Kaisen (seign'o 54 
purosessa) 
Local-statistics-based 
Structure Freq. 
(Deta shori) shisutemu 368 
Deta (skori shisutemu) 110 
(Deta tsushin) seigg'o souchi 40 
Deta (tsuskin seig:vo souchi) 54 
(Kaisen seign.,o) purosessa 54 
(b) Summary of disambiguation results 
Global-statistics-based 
Correct structure 12,565 words (62.0%) 
Incorrect structure 7,688 words (38.0%) 
Total 20,253 words (100%) 
(Note: Numbers of words are occurrence-based.) 
structure Wl(W2W3) is preferred. On the con° 
trary, if the maximal COlnpound noun W~W2 
occurs more frequently than the maximal com- 
pound noun W2W3, then the structure (W~W2) W3 
is preferred. 
The generalized disambiguation rnle is as 
follows: If a compound noun CN includes two 
compound noun candidates CN~ and CN2, which 
are incompatible with each other, and the maxi- 
mal compound noun CN~ occurs more frequently 
than the maximal compound noun CN> then a 
structure of CN including CN~ is preferred to a 
structure of CN including CN,. 
We have two alternatives regarding the range 
where we count occurrence frequencies of maxi- 
mal compound nouns. One is global-statistics 
which means that frequencies are counted in the 
whole corpus and they are used to disambiguate 
all compound nouns in the corpus. The other is 
local-statistics which means that frequencies are 
counted in each document in the corpus and they 
are used to disambiguate compound nouns in the 
corresponding document. 
2.2.2 Evaluation" Global-statistics vs. local- 
statistics 
We evaluated both the global-statistics-based 
disambiguation and the local-statistics-based 
disambiguation by using a 23.7-M Byte corpus 
consisting of 800 patent documents. Table l(a) 
shows comparative examples of these methods. 
Evah, ation results for the 200 highest-frequency 
maximal COlnpound nouns consisting of three or 
Local-statistics-based 
more words are summa ~ 
rized in Table l(b). 
14,921 words (73.7%) They prove that the 
local-statistics-based 5,332 words (26.3%) 
disambiguation method 
20,253 words (100%) is superior to the global- 
statistics-based disam- 
biguation method. 
Note that in the local-statistics-based disam- 
biguation method, we resorted to the global- 
statistics when local-statistics were not available. 
The percentage of cases the local-statistics were 
not available was 25.1 percent. 
(Kobayasi et al. 1994) proposed a disam- 
biguation method using collocation information 
and semantic categories, and reported that the 
structure of compound nouns was disambiguated 
at 83% accuracy. Note that their accuracy was 
calculated for compound nouns including unam- 
biguous compound nouns, i.e. those consisting of 
only two words. If it were calculated for com-. 
pound nouns consisting three or more words, it 
would be less than that of our method. Thus, we 
can conclude that our local-statistics-based 
method compares quite well with rather sophisti- 
cated previous methods. 
2.3 Prototype and an experiment 
We implemented a prototype thesaurus generator 
in which the local-statistics-based method was 
used to disambiguate the structure of compound 
nouns. Using this thesaurus generator, we got a 
thesaurus consisting of 38,995 terms froln a 61- 
M Byte corpus consisting of almost 48,000 arti- 
cles in the financial pages of a Japanese newspa- 
per. In this experiment, the threshold for occur- 
fence frequencies of terms in the term extraction 
step was set to 10, and the window size in the co- 
occurrence data extraction step was set to 25. 
406 
The abow+" rtm took 5.4 hours on a HP9000 
C200 workstation. The tlarouglaput is tolerable 
from a practical point of view. We should also 
note that a thesaurus can be updated as efficiently 
as it can be initially generated. Because \ve can 
run the first two steps (extraction of terms and 
extraction of co-occurrence data) in accumulative 
fashion, and we only need to run the third step 
over again when a considerable amount of terms 
and co-occurrence data are accunmlated. 
3 Navigation in an Association Thesau- 
FUS 
3.1 Purpose and outline of the proposed 
thesaurus navigator 
A big problem with toclafs information re- 
trieval systems based on search techniques is that 
they require users, who may not know exactly 
what thcy arc looking for, to explicitly describe 
their information needs. Another problem is 
that mismatched vocabularies between users and 
the corpus would bring poor retrieval results. 
To solve these problems, we propose a corptts-~ 
dependent association-thesaurus navigator en- 
abling users to efficiently explore information 
through a corpus. 
Users' requirements are summarized as fob 
lows: 
They want to grasp the overall information 
structure of a domain. 
They want to know what topics or sub°. 
domains arc contained in the corpus. 
- They want to know terms that appropriately 
describe their w~gue information needs. 
To meet the above requirements, our pro,- 
posed thesaurus navigator has novel functions 
such as clustering of related terms, generation of 
a thesaurus overview, and zoom-in on a sub-- 
domain of interest. A conceptual image of 
thesaurus navigation using these ftmctions is 
shown in Fig. 2. A typical informatio,> 
exploration session proceeds as follows. 
At the beginning, the system displays an 
overview of a corpus-dependent thesaurus so that 
users can easily enter the information space of 
the corpus. The overview is a kind of summary 
of the corpus, it consists of clusters of generic 
terms of the domain, and makes it easy to under- 
stand what topics or sub-domains are contained 
in the corpus. Looking at the thesaurus over- 
Thesaurtts 
overview 
'.. .... 
O Generic term 
Zoom-in Zoom-in 
.... 
// 
® Specific term 
Fig. 2. Conceptual image of thesaurus 
navigation. 
view, the users can select one or a few term 
clusters they have interest in, and the screen will 
zoom in on the cluster(s). The zoomed view 
consists of a nulnber of clusters, each including 
more specific terms than those in the overview. 
Users can repeat this zoom-in operation until they 
reach term clusters representing sufficiently 
specific topics. 
3.2 Functions of the thesaurus navigator 
3.2.1 Clustering of related terms 
We made a preliminary experiment to evaluate 
standard agglomerative clustering algorithms 
including the single-linkage method, the con> 
pletedinkage method, and the group-average- 
linkage method (Eldqamdouchi and Willett 1989)~ 
Among them, the group--average-linkage method 
resulted in the best results, ttowever, several 
potential clusters tended to merge into a large one 
when we repeated the merge operation until a 
predetermined number of clusters were obtained. 
Accordingly, we use the group-average-linkage 
method with an upper limit on the size of a clus- 
ter. 
3.2.2 Generation era thesaurus overview 
Our method tot generating a thesaurus overview 
consists of major-term extraction and term clus- 
tering. The m~0or-term extracting algorithm, 
which is carried out beforehand in batch mode, is 
described below. See 3.2.1 for the term clus- 
tering algorithnl. 
An overview of the thesaurus should consist 
of generic terms included in the corpus, flow- 
ever, we do not have a definite criterion for get> 
eric terms. So we collect m~oor terms from the 
corpus as follows. The number of m~!jor terms, 
407 
denoted by M below, was set to 300 in the pro- 
totype. 
i) Determine a characteristic term set for each 
doculnent. 
Calculate the weight w~j of term tj for 
document 4 according to the tf-idf (term fre- 
quency - inverse document frequency) for- 
mula. Then select the first re(i) terms in the 
descending order of u, u for each document d,, 
where re(i), the number of characteristic terms 
for document 4, is set to 20% of the total 
number of distinct terms in 4. It is also lim- 
ited to between 5 and 50. 
ii) Select major terms in the corpus. 
Select the first M terms in the descending 
order of the frequency of being contained in 
the characteristic term sets° 
3.2.3 Zoom-in on a term cluster of interest 
Our method for zooming in on a term cluster 
consists of term-set expansion and term cluster~ 
ing. The term-set expanding algorithm is de~ 
scribed below. See 3.2.1 for the term clustering 
algorithm. 
A user-specified term set To = {t~, 6 ..... t,,,} is 
expanded into a term set T,. consisting of M terms 
as follows. M was set to 300 in the prototype. 
i) Set the initial value of 7",. to 7",,. 
ii) While IT,.I< M for i = 1, 2 .... do; 
While IE, I < Mforj = 1,2, ..., m do; 
Add the tenn having the i-th highest 
correlation with tj to T,,; 
end; 
end; 
The reason why the above-described proce- 
dure implements the zoom-in is that generic 
terms tend to have higher correlation with semi- 
generic terms than with specific terms. As- 
suming that high-frequency terms are generic and 
low-frequency terms are specific, we examined 
the distribution of terms by the distance from the 
major terms and the average occurrence frequen- 
cy of terms for each distance. Here the distance 
is the length of the shortest path in a graph that is 
obtained by connecting every pair of associated 
terms with an edge. Table 2 shows the results 
for the example thesaurus mentioned in 2.3. 
According to it, the average occurrence frequen- 
cy decreases with the distance from the major 
terms. Therefore, starting from an overview, 
our method is likely to produce more and more 
specific views. 
3.3 Prototype and an experiment 
We developed a prototype as a client/server sys- 
tem. The thesaurus navigator is available on 
WWW browsers. It also has an interface to 
text-retrieval engines, through which a term 
cluster is transferred as a query. 
Test use was made with the example thesau- 
rus mentioned in 2.3. The response time for the 
zoom-in operation during the navigation sessions 
was about 8 seconds. This is acceptable given 
the rich information provided by the clustered 
view. Note that the response time is almost 
independent of the size of the thesaurus or corpus, 
because the number of temls to be clustered is 
always constant, as described in 3.2.2 and 3.2.3~ 
An example from navigation sessions is 
shown in Fig. 3. It demonstrates the useflflness 
of the corpus-dependent thesaurus navigation as a 
front-end for text retrieval. The effectiveness of 
our thesaurus navigator is summarized as folo 
lows. 
- Improved accessibility to text retrieval systems: 
Users are not necessarily required to input 
terms to describe their information need. 
They need only select from among terms pre~ 
sented on the screen. This makes text re-. 
trieval systems accessible even for those hav- 
ing vague information needs, or those unfa- 
miliar with the domain. 
- Improved navigation efficiency: 
The unit of users' cognition is a topic rather 
than a term. That is, they can recognize a 
topic from a cluster of terms at a glance. 
Therefi)re, they can efficiently navigate 
through an information space. 
Table 2 Distribution of terms by distance from major terms. 
Distance flom major terms 0 1 2 
Number of terms 245 4278 23832 
Average occurrence frequency 2642.1 318.6 61.9 
3 4 5 
10408 149 l 82 
10.6 7.7 9.0 - 
408 
I:~#-k, 99 Y,# -" 
'" " '=' /I\['C~£ 
:::Z ::::::::::::::::::::::::::: :/ ' :":' 
(a) Thesaurus overview 
1"~ ~9~_/.x 9D 27,9 -" 
2'~I~\]J!c¢x 72-'_ fAY \]"Z:-E.~ :iLtF'kM±Z 
J£,ff ,'ib'.. 9,ft.) v'~t/t!. 7,5_I 
r a~,:?: :,5 ,&5;: i!17.ii8 :~!~.I~fi ~< ~qT. >g .'JJ/g i~;? E'~ 
3:i~ /'l'.,'itR ;C,'i::i" "_Z tA: 
ki~':\[': ~LL.rJ.=\] 2>EU i::,.;~:'\]:::~: ;£~-:,':::,: i~Li~. .+ <?.. ~ /: .:. r- ~;" +. 
't :f" Z ?0 ~t ? 
¢~le'~ ' ' ~ I'*~a ,-m/I':" ~ "~I o ............... : ' " "~7: ............ ............ 
(b) Zoom-in 
F ~-)Ea?':~A rivaL752 £~ oEo~ ~ aLW.eE 
tlffa ~ Y_22_rDjtm~~ 
f~ 7ai~.NL~ ;r_u'2-?.,- in 97 
,_J 
(c) Further zoom-in 
An overview of the thesaurus was di.sTJko,ed. 
Then the user selected the.fi/ih and seventh 
cluste#w which he was interested in: {China, 
col!\[L'rence, Asia, cooperation, meeting, 
Vietnam, region, development, technology, 
environment}, and/economy export, aid, 
toward, summit, Soviet Union, debt, Russia, 
reconstruction}. This means that the user 
was interested in "development assistance 
to developing countries or areas ". 
The Jil?h aml seventh cluste~w J)'om (a) were 
shown close up, and clustelw indicating 
more .Vwc~/ic domains were presented. 
The user couM undelwtand which topics the 
respective clustelw suggested: "Economic 
assis'tance for the development of the Asia- 
Pat(lie region ", "Global environmental 
problems ", "bTternational debt problems ", 
"'Mattetw on China ", "Energy resource 
development", and so on. Since he wax 
e.vwcially interested in "International debt 
problems ", he selected the third cluster 
{debt, Egypt, Paris Club, creditor nation, 
q\[licial debt, de/'erred, Poland, .fin'eign debt, 
reimbursement, Paris, club, pro,merit, 
.fi, reign}. 
The third cluster fi'om (I 0 was shown close 
up. The resulting screen gave the user a 
choice el'many sT~ecific terms relevant to 
"htternational debt problems ", although 
not all oJ'the clustel:v indicated spec!/ic 
topics. The user was able to retrieve 
documents by simply selectin£ terms o/ 
interest./i'om those displayed on the screen. 
Fig. 3. Example of thesaurus navigation. 
409 
4 Comparison with related work 
Let us make a briefcolnparison with related work. 
Both scatter/gather document clustering (Cutting 
et al. 1992; Hearst and Pedersen 1996) and Ko- 
honen's self-organizing map (Lin et al. 1991; 
Lagus et al. 1996; Kohonen 1998) enable explo- 
ration through a corpus. While they treat a 
corpus as a collection of documents, we treat it as 
a collection of terms. Therefore our method can 
elicit finer information structure than these previ- 
ous methods, and moreover, it can be applied to a 
corpus that includes multi-topic doculnents. 
Our method compares quite well with the previ- 
otis methods for throughput and response time. 
5 Conclusion 
We demonstrated the feasibility of automatic 
generation of an association thesaurus from a 
corpus. The proposed thesaurus generation 
method consists of extracting terms and co- 
occurrence data from a corpus and analyzing the 
correlation between terms statistically. As a 
component technology for thesaurus generation, 
a method for disambiguating the structure of 
compound nouns based on corpus statistics was 
developed and evaluated. 
We also demonstrated the information re- 
trieval application of an automatically generated 
association thesaurus. A thesaurus navigator 
having novel functions such as term clustering, 
thesaurus overview, and zooming-in was devel- 
oped. An experiment with an association the- 
saurus generated from a newspaper article corpus 
proved that the thesaurus navigator allows us to 
efficiently explore information through a text 
corpus even when our information needs are 
vague. 
Acknowledgements: This research was st, p- 
ported in part by the Next-generation Digital 
Library System R&D Project of MITI (Ministry 
of International Trade and Industry), IPA (Inlbr- 
mation-technology Promotion Agency), and 
JIPDEC (Japan Information Processing Devel- 
opment Center). We thank Mainichi Newspa- 
pers, Ltd. for permitting us to use the CD-ROMs 
of the '91, '92, '93, '94 and '95 Mainichi Shim- 
bun tbr the experiment. 

References 

Church, K. W., and P. ttanks. 1990. Word association 
norms, mutual information, and lexicography. Con> 
putational Linguistics, 16( 1 ): 22-29. 

Cutting, D. R., D. R. Karger, J. O. Pedersen, and J. W. 
Tukey. 1992. Scatter/gather: A cluster-based ap- 
proach to browsing large doctnnent collections. Proc. 
ACM SIG1R '92, pp. 318-329. 

Dunning, T. 1993. Accurate methods for the statistics 
of surprise and coincidence. Computational Lin- 
guistics, 19( 1 ): 61-74. 

El-Hamdouchi, A., and P. Willett. 1989. Comparison 
of hierarchical agglomerative clustering methods for 
document retrieval. The Computer Journal, 32(3): 
220-227. 

Grefenstette, G. 1992. Use of syntactic context to 
produce term association lists for text retrieval. Proc. 
ACM SIGIR '92, pp. 89-97. 

Hearst, M. A. 1992. Automatic acquisition ofhypo- 
nyms from large text corpora. Proc. COLING '92, 
pp. 539-545. 

Hearst, M. A., and J. O. Pedersen. 1996. Reexamining 
the cluster hypothesis: Scatter/gather on retrieval 
results. Proc. ACM S1GIR '96, pp. 76-84. 

Jing, Y., and W. B. Croft. 1994.. An association the-. 
saurus for information retrieval. Proc. R1AO '94, 
Cont. on Intelligent Text and linage Handling, pp. 
146-160. 

Kobayasi, Y., T. Tokunaga, and tf. Tanaka. 1994. 
Analysis of Japanese compound nouns using colloo 
cational information. Proc. COLING '94, pp. 865- 
869. 

Kohonen, T. 1998. Self-organization of very large 
document collections: State of the art. Proc. 8th lnt'l 
Cone on Artificial Neural Networks, vol. 1, pp. 65- 
74. 

Lagus, K., T. Honkela, S. Kaski, and T. Kohonen. 
1996. Self organizing maps of document collections: 
a new approach to interactive exploration. Proc. 2nd 
lnt'l Cone on Knowledge Discovery and Data Min- 
ing, pp. 238-243. 

Lin, X., D. Soergel, and G. Marchionini. 1991. A self- 
organizing semantic map for information retrieval. 
Proc. ACM SIGIR '91, pp. 262-269. 

Ruge, G. 1991. Experiments on linguistically based 
term associations. Proc. RIAO '91, Conf. on Intelli- 
gent Text and hnage Handling, pp. 528-545. 

Schutze, tt., and J. O. Pedersen. 1994. A cooccur- 
rence-based thesaurus and two applications to in- 
formation retriewd. Proc. RIAO '94, Conf. on Intel- 
ligent Text and hnage t tandling, pp. 266-274. 
