On the Use of Term Associations in Automatic Information Retrieval 
Gerard Salton* 
Abstract 
It has been recognized that single words extracted from natural language texts are not 
always useful for the representation of information content. Associated or related terms, 
and complex content identifiers derived from thesauruses and knowledge bases, or constructed 
by automatic word grouping techniques, have therefore been proposed for text identification 
purposes. 
The area of associative content analysis and information retrieval is reviewed in this 
study. The available experimental evidence shows that none of the existing or proposed 
methodologies are guaranteed to improve retrieval performance in a replicable manner for 
document collections in different subject areas. The associative techniques are most valu- 
able for restricted environments covering narrow subject areas, or in iterative search situa- 
tions where user inputs are available to refine previously available query formulations and 
search output. 
1. Introduction
Computers were first used for the process- 
ing of natural language texts over 30 years ago. 
From the beginning it has been recognized that 
the individual words contained in the texts of 
written documents could be used in part to pro- 
vide a representation of document content. At 
the same time it was generally accepted that 
certain words, or word sets, would not produce 
meaningful content identifiers. In particular, 
some quite broad words, such as the term "com- 
puter" used to identify computer science litera- 
ture, would be useless for distinguishing one 
document from another. Other very specific 
terms would be so rare that no single item in a 
collection might reasonably be described by such 
a very rare term. 
To improve the operations of text process- 
ing systems, it has been suggested that the ori- 
ginal document vocabulary be expanded by adding 
related or associated terms not originally 
present in the available text samples. Two main
types of vocabulary relationships can be recognized in this
connection, known respectively as paradigmatic and syntagmatic
relations. [1] The paradigmatic relations cover term associations,
such as synonyms and hierarchical inclusion, that always exist
between particular terms regardless of the context in which these
terms are used. For example, a paradigmatic relation exists between
the name of a country (say, France) and the capital city (Paris).
Syntagmatic relations, on the other hand, are relations which are
not valid outside some specified context. For example, a
cause-effect relation may be detected in certain circumstances
between "poison" and "death".
* Department of Computer Science, Cornell University, Ithaca, NY 14853.
This study was supported in part by the National Science Foundation
under grants IST 83-16166 and IST 85-44189.
The paradigmatic relations may be identified
by using preconstructed dictionaries, or
thesauruses, containing schedules or groupings 
of related terms or concepts. The syntagmatic 
relations, on the other hand, must be derived by 
analyzing particular text samples and extracting 
the term relationships specified in these texts. 
Various methods are outlined in the next 
section for utilizing paradigmatic and syntag- 
matic term associations in text processing sys- 
tems, and the effectiveness of the methods is 
assessed using available experimental output. 
2. Associative Text Processing Methods 
A) Thesaurus Operations 
A thesaurus is a word grouping device which 
provides a hierarchical and/or a clustered 
arrangement of the vocabulary for certain subject areas.
Thesauruses are used in text processing for three main
purposes [2]:
a) as authority lists where the thesaurus nor- 
malizes the indexing vocabulary by distin- 
guishing terms that are allowed as content 
identifiers from the remainder of the voca- 
bulary; 
b) as grouping devices where the vocabulary is 
broken down into classes of related, or 
synonymous terms, as in the traditional 
Roget's thesaurus;
c) as term hierarchies where more general 
terms are broken down into groups of nar- 
rower terms, that may themselves be broken 
down further into still narrower groups. 
When a thesaurus is available for a partic- 
ular subject area, each term found in a document 
can be used as an entry point into the 
thesaurus, and additional (synonymous or 
hierarchically related) terms included in the 
same thesaurus class as the original can be sup- 
plied. Such a thesaurus operation normalizes 
the vocabulary and provides additional opportun- 
ities for matches between query and document 
vocabularies. The vocabulary expansion tends to 
enhance the search recall (the proportion of 
relevant materials actually retrieved as a 
result of a search process). 
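The class lookup described above can be sketched in a few lines. The thesaurus classes, the term sets, and the function name expand_terms are hypothetical and serve only to illustrate how the expansion creates additional matches between query and document vocabularies.

```python
# Hypothetical thesaurus classes; the entries are illustrative only.
THESAURUS_CLASSES = [
    {"computer", "calculator", "processor"},
    {"program", "routine", "code"},
]


def expand_terms(terms, classes=THESAURUS_CLASSES):
    """Add every term that shares a thesaurus class with an original term."""
    expanded = set(terms)
    for term in terms:
        for cls in classes:
            if term in cls:
                expanded |= cls
    return expanded


document_terms = {"calculator", "design"}
query_terms = {"computer", "design"}

# Without expansion only "design" matches; with expansion the shared
# thesaurus class supplies further matching terms.
print(sorted(document_terms & query_terms))
print(sorted(expand_terms(document_terms) & expand_terms(query_terms)))
```

The larger overlap after expansion is the mechanism by which a thesaurus tends to raise recall; nothing in the sketch guards against the loss of precision that overly broad classes may cause.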
When the subject area is narrowly circumscribed and knowledgeable
subject experts are available, these experts can manually construct
useful thesaurus arrangements that may provide substantial
enhancements in retrieval effectiveness. Table 1 shows the average
search precision (the proportion of retrieved materials actually
relevant) obtained at certain fixed recall points for a collection
of 400 documents in engineering used with 17 search requests. In
that case, the performance of a manually constructed thesaurus (the
Harris Three thesaurus) is compared with a content analysis system
based on weighted word stems extracted from document and query
texts. The output of Table 1 shows that at the high recall end of
the performance range, the thesaurus provides much better retrieval
output than the word stem process. [3]
                 Average Search Precision

Recall    Weighted Word    Harris Three
          Stems            Thesaurus

  .1        .9563          .9735    + 2%
  .3        .7986          .8245    + 3%
  .5        .6371          .7146    +11%
  .7        .4877          .6012    +19%
  .9        .3426          .4973    +31%

Average                             +13%

Sample Thesaurus Performance
(IRE Collection, 400 documents, 17 queries)
Table 1
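The two evaluation measures used throughout, recall and precision, follow directly from the definitions given in the text. The sketch below is a minimal illustration; the document identifiers are invented, and the averaging of precision at fixed recall points used for the tables is not reproduced here.

```python
def recall(retrieved, relevant):
    """Proportion of relevant items that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Proportion of retrieved items that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Invented example: four items retrieved, three items relevant overall,
# two of the relevant items among those retrieved.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
print(recall(retrieved, relevant))
print(precision(retrieved, relevant))
```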
While the use of thesauruses is widely advocated as a means for
normalizing the vocabulary of document texts, no consensus exists
about the best way of constructing a useful thesaurus. It was hoped
early on that thesauruses could be built automatically by studying
the occurrence characteristics of the terms in the documents, and
grouping into common thesaurus classes those terms that co-occur
sufficiently often in the text of the documents: [4]
"the statistical material that may be required in the manual
compilation of dictionaries and thesauruses may be derived from
the original texts in any desired form and degree of detail."
Later it was recognized that thesauruses constructed by using the
occurrence characteristics of the vocabulary in the documents of a
collection do not in fact provide generally valid paradigmatic term
relations, but identify instead locally valid syntagmatic relations
derivable from the particular document environment. [5] To utilize
the conventional paradigmatic term relations existing in particular
subject areas, the vocabulary arrangements must effectively be
constructed by subject experts using largely ad-hoc procedures made
up for each particular occasion. The thesaurus method is therefore
not generally usable in operational environments.
B) Automatic Term Associations 
While generally valid thesauruses are difficult to build, locally
valid term associations can be generated automatically by making
use of similarity measurements between pairs of terms based, for
example, on the number of documents of a collection in which the
terms co-occur. The number of common sentences in which a pair of
words can be found may also be taken into account, as well as some
measure of proximity between the words in the various texts. Using
similarity measurements between word pairs, term association maps
can be constructed, and these may be displayed and used by the
search personnel to formulate useful query statements, and to
obtain expanded document representations. [6,7]
(a) Term-Document Matrix Showing Frequency of Terms Assigned to
    Documents

                     Terms assigned to documents
                  T1   T2   T3   T4   T5   T6   T7
    Document  D1   3    0    0    2    0    6    1
    vectors   D2   0    0    1    3    2    0    2
              D3   0    2    3    0    4    0    0
              D4   1    2    1    0    3    1    0

(b) Term-Document Graph for Matrix of Fig. 1(a)
    [graph not reproduced]

(c) Term Association Map
    [map not reproduced; node labels: Acceleration, Nozzle,
    Propulsion, Chamber, Ejection, Fluid]

Term-Document Matrix and Graph and Corresponding
Association Map (from [8] p.51)
Fig. 1
A sample assignment of terms to documents is shown in the matrix of
Fig. 1(a). Fig. 1(b) shows the corresponding document-term graph,
where a line between term Ti and document Dj represents the
corresponding term assignment. Given the assignment of Figs. 1(a)
and (b), term associations may be derived by grouping sets of terms
appearing in similar contexts. For example, T4 and T7 may be grouped
because these terms appear jointly in documents D1 and D2;
similarly, terms T1 and T6 appear in documents D1 and D4. The
grouping operations may be used to obtain a term association map of
the kind shown in Fig. 1(c), where associated terms are joined by
lines on the map.
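The grouping step can be illustrated with a small sketch that counts, for every pair of terms, the number of documents in which both occur. The matrix values are those of Fig. 1(a) as far as they can be read from the figure; the function names and the threshold of two common documents are assumptions made for the illustration, not part of the original method.

```python
from itertools import combinations

# Term frequencies read from Fig. 1(a); rows are documents D1..D4,
# columns are terms T1..T7 (values as best recoverable from the figure).
MATRIX = [
    [3, 0, 0, 2, 0, 6, 1],  # D1
    [0, 0, 1, 3, 2, 0, 2],  # D2
    [0, 2, 3, 0, 4, 0, 0],  # D3
    [1, 2, 1, 0, 3, 1, 0],  # D4
]


def documents_per_term(matrix):
    """Map each term label to the set of documents that contain it."""
    n_terms = len(matrix[0])
    return {
        f"T{t + 1}": {f"D{d + 1}" for d, row in enumerate(matrix) if row[t] > 0}
        for t in range(n_terms)
    }


def cooccurrence(matrix):
    """Number of documents shared by each pair of terms."""
    docs = documents_per_term(matrix)
    return {
        (a, b): len(docs[a] & docs[b])
        for a, b in combinations(sorted(docs), 2)
    }


def associated_pairs(matrix, min_common_docs=2):
    """Term pairs that co-occur in at least min_common_docs documents."""
    counts = cooccurrence(matrix)
    return sorted(pair for pair, n in counts.items() if n >= min_common_docs)


print(documents_per_term(MATRIX)["T4"])
print(associated_pairs(MATRIX))
```

Each pair returned corresponds to a candidate line on an association map of the kind shown in Fig. 1(c). A plain co-occurrence count ignores the term frequencies themselves; a weighted similarity coefficient could be substituted without changing the structure of the sketch.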
The operations of a typical associative system are illustrated in
Fig. 2. [7] The original query words are listed on the left-hand
side of Fig. 2, and the derived associated terms are shown on the
right. The value of a given associated term--for example,
"International Organizations"--is computed as the sum of the term
association values between the given term and all original query
terms (0.5 for "United Nations" plus 0.4 for "Pan American Union" in
the example of Fig. 2). Finally the retrieval value of a document is
computed as the sum of the common term association values for all
matching terms that are present in both queries and documents. Many
variations are possible of the basic scheme illustrated in Fig. 2;
in each case, the hope is that valid term associations would make it
possible to achieve a greater degree of congruence between document
representations and query formulations.
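The two computations just described can be sketched as follows. The association strengths for "Int. Organizations" are the ones quoted in the text, and the expanded query values and document contents are those listed in Fig. 2; the function and variable names are assumptions introduced for the illustration.

```python
# Association strengths between the query terms and one associated term,
# as quoted in the text for Fig. 2.
ASSOCIATIONS = {
    ("UNITED NATIONS", "INT. ORGANIZATIONS"): 0.5,
    ("PAN AM. UNION", "INT. ORGANIZATIONS"): 0.4,
}


def associated_term_value(term, query_terms, associations):
    """Sum of the association values between term and all query terms."""
    return sum(associations.get((q, term), 0.0) for q in query_terms)


# Term values of the expanded query as listed in Fig. 2.
EXPANDED_QUERY = {
    "UNITED NATIONS": 1.5,
    "PAN AM. UNION": 1.5,
    "INT. ORGANIZATIONS": 0.9,
    "INT. COOPERATION": 0.8,
    "INT. RELATIONS": 0.9,
    "LEAGUE OF NATIONS": 0.5,
    "INT. LAW": 0.3,
    "PEACE": 0.7,
}


def retrieval_value(document_terms, expanded_query):
    """Sum of the query term values for all terms present in the document."""
    return sum(v for t, v in expanded_query.items() if t in document_terms)


DOC1 = {"UNITED NATIONS", "LEAGUE OF NATIONS", "PAN AM. UNION", "INT. COOPERATION"}
DOC2 = {"UNITED NATIONS", "PAN AM. UNION", "INT. LAW"}

query = ["UNITED NATIONS", "PAN AM. UNION"]
print(round(associated_term_value("INT. ORGANIZATIONS", query, ASSOCIATIONS), 1))
print(round(retrieval_value(DOC1, EXPANDED_QUERY), 1))
print(round(retrieval_value(DOC2, EXPANDED_QUERY), 1))
```

Under these values the first document receives the higher retrieval value because it also matches two of the derived associated terms.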
Query terms:  UNITED NATIONS, PAN AM. UNION

Expanded query (term values):
    UNITED NATIONS       (1.5)
    INT. ORGANIZATIONS   (0.9)
    INT. COOPERATION     (0.8)
    INT. RELATIONS       (0.9)
    LEAGUE OF NATIONS    (0.5)
    INT. LAW             (0.3)
    PEACE                (0.7)
    PAN AM. UNION        (1.5)

[association links and individual link weights not reproduced]

Document 1:  UNITED NATIONS 1.5    LEAGUE OF NATIONS 0.5
             PAN AM. UNION 1.5     INT. COOPERATION 0.8
             Total 4.3

Document 2:  UNITED NATIONS 1.5    PAN AM. UNION 1.5
             INT. LAW 0.3
             Total 3.3

Associative Retrieval Example
(from Giuliano [7])
Fig. 2
In practice, it is found that the use of term associations can
improve the search recall by providing new matches between the terms
assigned to queries and documents that were not available in the
original query and document. In addition, the search precision can
also be enhanced by reinforcing the strength of already existing
term matches. [5] Unfortunately, the experimental evidence indicates
that only about 20 percent of automatically derived associations
between pairs of terms are semantically significant; the associative
indexing process does not therefore provide guaranteed advantages in
retrieval effectiveness.
Table 2 shows a typical evaluation output 
for a collection of 400 documents in engineering 
used with 17 search requests. The output of 
Table 2 shows that the automatic term associa- 
tions provide an increase in average search pre- 
cision only at the high recall end of the per- 
formance range. Overall, the average search 
precision decreases by 13 percent for the collection
used in Table 2. [8, p. 130]
                 Average Search Precision

Recall    Weighted Word    Automatic Term
          Stems            Associations

  .1        .9563          .7385    -23%
  .3        .7986          .5844    -27%
  .5        .6371          .5187    -19%
  .7        .4877          .4452    - 9%
  .9        .3426          .3794    +11%

Average                             -13%

Sample Associative Indexing Performance
(IRE Collection, 400 documents, 17 queries)
Table 2
More recently other vocabulary expansion 
experiments have been conducted using associated 
terms derived by statistical term co-occurrence 
criteria. [9-10] Once again, the evaluation
results were disappointing: [10]
"Our results on query expansion using the NPL data are
disappointing. We have not been able to achieve any significant
improvements over nonexpansion. We have repeated previous
experiments in which the query was expanded, and the resulting set
of search terms then weighted... Once again the results have been
conflicting..."
The conclusions derived from the available 
evidence indicate that the vocabulary expansion 
techniques which add to the existing content 
identifiers related terms specified in a 
thesaurus, or derived by term co-occurrence 
measurements, do not provide methods for improv- 
ing retrieval effectiveness. Generally valid 
thesauruses for large subject areas are diffi- 
cult to generate and the automatic term co- 
occurrence procedures do not offer adequate 
quality control. Efforts to enhance the recall 
performance of search systems must therefore be 
based on different techniques designed to gen- 
erate indexing vocabularies of broader scope, 
including especially word stem generation and 
suffix truncation methods. 
C) Term Phrase Generation 
The vocabulary expansion methods described up to now are designed
principally to improve search recall. Search precision may be
enhanced by using narrow indexing vocabularies consisting largely of
term phrases replacing the normally used single terms. Thus
"computer science" or "computer programming" could replace a broader
term such as "calculator" or "computer". The recognition and
assignment of term phrases poses much the same problems as the
previously described generation of term associations and the
expansion of indexing vocabularies. In particular, an accurate
determination of useful term phrases, and the rejection of
extraneous phrases, must be based on syntactic analyses of query and
document texts supplemented by semantic components valid for the
subject areas under consideration. Unfortunately, complete
linguistic analyses of topic areas of reasonable scope are
unavailable for reasons of efficiency as well as effectiveness. In
practice, it is then necessary to fall back on simpler phrase
generation methods in which phrases are identified as sequences of
co-occurring terms with appropriate statistical and/or syntactic
properties. In such simple phrase generation environments quality
control is, however, difficult to achieve.
The following phrase generation methods are 
of main interest: 
a) statistical methods where each phrase head (the main phrase
component) has a stated minimal occurrence frequency in the texts
under consideration, and each phrase component exhibits another
stated minimal occurrence frequency, and the distance in number of
intervening words between phrase heads and phrase components is
limited to a stated number of words;
b) a simple syntactic pattern matching method where a dictionary
search method is used to assign syntactic indicators to the text
elements, and phrases are then recognized as sequences of words
exhibiting certain previously established patterns of syntactic
indications (e.g. adjective-noun-noun, or
preposition-adjective-noun); [11-12]
c) a more complete syntactic analysis method supplemented if
possible by appropriate semantic restrictions to control the variety
of permitted syntactic phrase constructions for the available texts.
[13,14]
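Method (b) can be illustrated by a short sketch in which a dictionary lookup assigns a syntactic indicator to each word and fixed indicator patterns are then matched against the tagged sequence. The miniature dictionary, the sample word sequence, and the function names are hypothetical; only the two patterns are taken from the text.

```python
# Hypothetical miniature dictionary of syntactic indicators.
LEXICON = {
    "the": "DET", "of": "PREP", "for": "PREP",
    "automatic": "ADJ", "large": "ADJ",
    "text": "NOUN", "analysis": "NOUN",
    "document": "NOUN", "collections": "NOUN",
}

# The two example patterns mentioned in the text.
PATTERNS = [("ADJ", "NOUN", "NOUN"), ("PREP", "ADJ", "NOUN")]


def tag(words, lexicon=LEXICON):
    """Assign a syntactic indicator to each word by dictionary lookup."""
    return [lexicon.get(w.lower(), "UNKNOWN") for w in words]


def match_phrases(words, patterns=PATTERNS, lexicon=LEXICON):
    """Return every word sequence whose indicators equal one of the patterns."""
    tags = tag(words, lexicon)
    found = []
    for start in range(len(words)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                found.append(" ".join(words[start:end]))
    return found


words = "the automatic text analysis of large document collections".split()
print(match_phrases(words))
```

On this input the patterns accept "automatic text analysis" and "large document collections" but also the fragment "of large document", which shows on a small scale why pattern matching alone offers little quality control.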
When statistical phrase generation methods 
are used, a large number of useful phrases can 
in fact be identified, together unfortunately 
with a large number of improper phrases that are 
difficult to reject on formal grounds. For 
example, given a query text such as
"hemophilia and christmas disease, especially in regard to the
specific complication of pseudotumor formation (occurrence,
pathogenesis, treatment, prognosis)"
it is easy to produce correct phrase combinations such as "christmas
disease" and "pseudotumor formation". At the same time the
statistical phrase formation process produces inappropriate patterns
such as "formation occurrence" and "complication formation". [15]
Overall a statistical phrase formation process 
will be of questionable usefulness. 
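A rough sketch of a statistical phrase generator applied to the sample query shows how both kinds of phrases arise. Several simplifications are assumptions of the sketch and not part of the cited method: the stop list, the counting of the word distance after stop word removal, the treatment of the later word of each pair as the phrase head, and the frequency thresholds of 1 (a single query text gives no useful frequency information).

```python
import re
from collections import Counter

STOPWORDS = {"and", "in", "to", "the", "of"}  # assumed stop list

QUERY = ("hemophilia and christmas disease, especially in regard to the "
         "specific complication of pseudotumor formation (occurrence, "
         "pathogenesis, treatment, prognosis)")


def content_words(text):
    """Lower-case the text and drop punctuation and stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def statistical_phrases(text, max_gap=1, min_head_freq=1, min_component_freq=1):
    """Pair each word with the words at most max_gap words further on."""
    words = content_words(text)
    freq = Counter(words)
    phrases = set()
    for i, component in enumerate(words):
        if freq[component] < min_component_freq:
            continue
        last = min(i + max_gap + 1, len(words) - 1)
        for j in range(i + 1, last + 1):
            head = words[j]  # assumption: the later word acts as phrase head
            if freq[head] >= min_head_freq:
                phrases.add(f"{component} {head}")
    return phrases


phrases = statistical_phrases(QUERY)
for wanted in ("christmas disease", "pseudotumor formation",
               "formation occurrence", "complication formation"):
    print(wanted, wanted in phrases)
```

All four phrases quoted in the text are produced, and nothing in the purely statistical criteria separates the two useful ones from the two improper ones.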
Table 3 shows a comparison of the average search precision results
for certain fixed recall values between a standard single term
indexing system, and a system where the single terms are
supplemented by statistically determined phrase combinations. The
output of Table 3 for four different document collections in
computer science (CACM), documentation (CISI), medicine (MED) and
aeronautics (CRAN) shows that the phrase process affords modest
average improvements for three collections out of four. [15]
However, the improvement is not guaranteed, and is in any case
limited to a few percentage points in the average precision.
           CACM 3204        CISI 1460        MED 1033         CRAN
Recall   Single           Single           Single           Single
         Terms  Phrases   Terms  Phrases   Terms  Phrases   Terms  Phrases

  .1     .5086  .5427     .4919  .4590     .8038  .7970     .7526  .7540
  .3     .3672  .3971     .3118  .2999     .6742  .7064     .5184  .5385
  .5     .2398  .2527     .2320  .2222     .5447  .5529     .3714  .3989
  .7     .1462  .1462     .1504  .1283     .4082  .4166     .2301  .2431
  .9     .0711  .0759     .0739  .0630     .2057  .2056     .1313  .1328

Average
Improvement     +6.8%            -8.6%            +1.6%            +4.1%

Comparison of Single Term Indexing with Statistical
Phrase Indexing for Four Sample Document Collections
Table 3

The evaluation results available for the syntax-based methods are
not much more encouraging. [16] The basic syntactic analysis
approach must be able to cope with ordinary word ambiguities (lamp
base, army base, baseball base), the recognition of distinct
syntactic constructs with identical meanings, discourse problems
exceeding sentence boundaries such as pronoun referents from one
sentence to another, and the difficulties of interpreting many
complex meaning units in ordinary texts. An illustration of the
latter kind is furnished by the phrase "high frequency transistor
oscillator", where it is important to avoid the interpretation "high
frequency transistor" while admitting "transistor oscillator" and
"high frequency oscillator". A sophisticated syntactic analysis
system with substantial semantic components was unable in that case
to reject the extraneous interpretations "frequency transistor
oscillators which are high (tall)" and "frequency oscillators using
high (tall) transistors". [17]
In addition to the problems inherent in the language analysis
component of a phrase indexing system, a useful text processing
component must also deal with phrase classification, that is, the
recognition of syntactically distinct patterns that are semantically
identical ("computer programs", "instruction sets for computers",
"programs for calculating machines"). Phrase classification itself
raises complex problems that are not close to solution. [18]
In summary, the use of complex identifying 
units and term associations in automatic text 
processing environments is currently hampered by 
difficulties of a fundamental nature. The basic 
theories needed to construct useful term group- 
ing schedules and thesauruses valid for particu- 
lar subject areas are not sufficiently 
developed. As a result, the effectiveness of 
associative retrieval techniques based on term 
grouping and vocabulary expansion leaves some- 
thing to be desired. The same is true of the 
syntactic and semantic language analysis 
theories used to generate a large proportion of 
the applicable complex content descriptions and 
phrases, and to reject the majority of extrane- 
ous term combinations. 
The question arises whether any retrieval 
situations exist in which it is useful to go
beyond the basic single term text analysis 
methodology, consisting of the extraction of 
single terms from natural language query and 
document texts. This question is examined in 
the remaining section of this note. 
3. The Usefulness of Complex Text Processing 
Three particular text processing situations can be identified where
term association techniques have proved to be useful. The first one
is the well-known relevance feedback process where initial search
operations are conducted with preliminary query formulations
obtained from the user population. Following the retrieval of
certain stored text items, the user is asked to respond by
furnishing relevance assessments for some of the previously
retrieved items; these relevance assessments are then used by the
system to construct new, improved query formulations which may
furnish additional, hopefully improved, retrieval output. In
particular, the query statements are altered by adding terms
extracted from previously retrieved items that were identified as
relevant to the user's purposes, while at the same time removing
query terms included in previously retrieved items designated as
nonrelevant.
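The query alteration step can be sketched with plain term sets. This is a deliberately simplified, unweighted illustration and not the procedure of any particular feedback system; the function name, the sample terms, and the rule that a term found in a relevant item is kept even when it also occurs in a nonrelevant item are assumptions.

```python
def feedback_query(query, relevant_docs, nonrelevant_docs):
    """Build a new query from relevance assessments of retrieved items.

    query            -- iterable of query terms
    relevant_docs    -- list of term sets for items judged relevant
    nonrelevant_docs -- list of term sets for items judged nonrelevant
    """
    relevant_terms = set().union(*relevant_docs)
    nonrelevant_terms = set().union(*nonrelevant_docs)
    # Assumption: a term seen in a relevant item is kept even if it
    # also occurs in a nonrelevant item.
    to_remove = nonrelevant_terms - relevant_terms
    return (set(query) | relevant_terms) - to_remove


query = {"information", "retrieval", "hardware"}
relevant_docs = [{"retrieval", "indexing", "thesaurus"}]
nonrelevant_docs = [{"hardware", "circuit", "retrieval"}]

print(sorted(feedback_query(query, relevant_docs, nonrelevant_docs)))
```

In the example, "indexing" and "thesaurus" are added from the relevant item and "hardware" is dropped because it occurs only in the nonrelevant item. Weighted variants adjust term weights instead of adding and deleting terms outright.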
The relevance feedback methodology 
represents an associative retrieval technique, 
since new query terms are obtained from certain 
designated documents that hopefully are related 
to the originally available formulations. 
Relevance feedback techniques have been used 
with vector queries formulated as sets of search 
terms [9, 19-20], and more recently with Boolean
queries. [21] The effectiveness of the feedback
procedure has never been questioned. 
Table 4 shows typical evaluation output for four different document
collections in terms of average search precision at ten recall
points (from a recall of 0.1 to a recall of 1.0 in steps of 0.1)
averaged over the stated number of user queries. The output of Table
4 applies to Boolean queries with binary weighted terms. [21] The
improvements in retrieval precision due to the user feedback process
range from 22% to 110% for a single search iteration. When the
feedback process is repeated three times, the improvement in search
precision increases to between 63% and 207%. Evidently, the user
relevance information which applies to particular queries at
particular times makes it possible to find a sufficient number of
interesting term associations to substantially improve the retrieval
output.
                      Medlars 1033   CISI 1460    CACM 3204    Inspec 12684
                      30 queries     35 queries   52 queries   77 queries

Original Boolean
Queries               0.2065         0.1118       0.1798       0.1159

First Iteration       0.4322         0.1367       0.2550       0.1522
Relevance Feedback    (+110%)        (+22%)       (+42%)       (+31%)

Third Iteration       0.6334         0.1827       0.3217       0.1933
Relevance Feedback    (+207%)        (+63%)       (+79%)       (+67%)

Average Search Precision at 10 Recall Points for One Iteration
and Three Iterations of Relevance Feedback (4 document collections)
Table 4

A second possibility for generating improved retrieval output
consists in limiting the analysis effort to the query statements
instead of the document texts. In a recent study, term phrases were
first extracted from natural language query texts using a simple,
manually controlled, syntactic analysis process. These query phrases
were then recognized in document texts by a rough pattern matching
procedure distinguishing pairs and triples of terms occurring in the
same phrases of documents, and pairs and triples of terms occurring
in the same sentences of documents. [22] Whenever a phrase match is
obtained between a query phrase and a document text, the retrieval
weight of the document is appropriately increased.
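The sentence-level part of this matching procedure can be sketched as follows. The sketch assumes that a query phrase matches when all of its terms occur in one document sentence, and that each match adds a fixed bonus to the single-term retrieval weight; the bonus value, the sentence splitting rule, and all names are assumptions, and the finer distinction between matches within phrases and within sentences is not modelled.

```python
import re


def sentences(text):
    """Split on sentence-ending punctuation; return one word set per sentence."""
    parts = re.split(r"[.!?]+", text.lower())
    return [set(re.findall(r"[a-z]+", p)) for p in parts if p.strip()]


def phrase_bonus(query_phrases, document_text, bonus=1.0):
    """Add a fixed bonus for each query phrase found within one sentence."""
    sents = sentences(document_text)
    total = 0.0
    for phrase in query_phrases:
        terms = set(phrase)
        if any(terms <= s for s in sents):
            total += bonus
    return total


def retrieval_weight(single_term_weight, query_phrases, document_text, bonus=1.0):
    """Single-term weight increased by the phrase match bonus."""
    return single_term_weight + phrase_bonus(query_phrases, document_text, bonus)


# Hypothetical query phrases (a pair and a triple) and document text.
query_phrases = [("operating", "system"), ("file", "access", "method")]
document = ("The operating kernel of the system schedules jobs. "
            "Each file has an access list. The method is simple.")

print(retrieval_weight(2.5, query_phrases, document))
```

Only the pair is matched here, since the three terms of the triple are spread over two sentences.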
An evaluation of such a manually controlled 
syntactic phrase recognition system based on 
query statement analysis reveals that substan- 
tial improvements in retrieval effectiveness are 
obtainable for the phrase assignments, compared 
with the single term alternatives. Table 5 
shows average search precision values at five 
recall levels for 25 user queries used with the 
CACM collection in computer science. [22] On
average the query analysis system raises the 
search precision by 32 percent. 
                 Average Search Precision

Recall    Weighted        Weighted Single
          Single Terms    Terms and Phrases

 0.1        0.555           0.625    +13%
 0.3        0.271           0.355    +31%
 0.5        0.211           0.265    +26%
 0.7        0.064           0.085    +33%
 0.9        0.038           0.060    +58%

Average                              +32%

Average Search Precision for
Query Statement Phrase Analysis
(CACM Collection, 25 Queries)
Table 5
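The percentage figures of Table 5 appear to be the relative change of each phrase-based precision value over the single-term value at the same recall level, with the last figure being the mean of the five changes. A short sketch under that assumed reading recomputes them from the tabulated precision values; the variable and function names are illustrative.

```python
RECALL_LEVELS = [0.1, 0.3, 0.5, 0.7, 0.9]
SINGLE_TERMS = [0.555, 0.271, 0.211, 0.064, 0.038]
TERMS_AND_PHRASES = [0.625, 0.355, 0.265, 0.085, 0.060]


def percent_change(baseline, value):
    """Relative change of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


changes = [percent_change(b, v) for b, v in zip(SINGLE_TERMS, TERMS_AND_PHRASES)]
average = sum(changes) / len(changes)

for level, change in zip(RECALL_LEVELS, changes):
    print(f"recall {level:.1f}: {change:+.0f}%")
print(f"average: {average:+.0f}%")
```

The rounded results agree with the +13%, +31%, +26%, +33%, +58% and +32% entries of the table.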
The special processing described up to now 
is user related in the sense that user query 
formulations and user relevance assessments are 
utilized to improve the retrieval procedures. 
The last possibility for the use of complex information
descriptions consists in incorporating stored knowledge
representations covering particular subject areas to enhance the
descriptions of document and query content. [23-25] Various theories
of knowledge representation are current, including for example,
models based on the use of frames representing events and
descriptions of interest in a given subject.
Frames designating particular entities may be represented by
tabular structures, with open "slots" filled with attributes of the
entities, or values of attributes. Relationships between frames are
expressed by using attributes that are themselves represented by
other frames, and by adding links between frames. Frame operations
can also be introduced to manipulate the knowledge structure when
new facts or entities become known, or when changes occur in item
relationships. There is some evidence that when the knowledge base
needed to analyze the available texts is narrowly circumscribed and
limited in scope, useful frame structures can in fact be
intellectually prepared to enhance the retrieval operations. [26]
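A frame of the kind described can be pictured as a small record with named slots and links. The class below is a minimal sketch and does not follow any particular frame system; the class layout, the method names, and the sample entities are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Frame:
    """A tabular entity description with open slots and links to other frames."""
    name: str
    slots: dict = field(default_factory=dict)   # attribute -> value or Frame
    links: list = field(default_factory=list)   # (relation, Frame) pairs

    def fill(self, attribute, value):
        """Fill an open slot; the value may itself be another frame."""
        self.slots[attribute] = value

    def link(self, relation, other):
        """Add a labelled link to another frame."""
        self.links.append((relation, other))

    def related(self, relation):
        """Names of the frames reached through links with the given label."""
        return [frame.name for rel, frame in self.links if rel == relation]


# Hypothetical frames for a narrow subject area.
engine = Frame("rocket engine")
nozzle = Frame("nozzle")

engine.fill("propellant", "liquid")   # slot holding an attribute value
engine.fill("component", nozzle)      # slot represented by another frame
nozzle.link("part-of", engine)        # explicit link between frames

print(engine.slots["component"].name)
print(nozzle.related("part-of"))
```

The fill and link methods stand in for the frame operations mentioned in the text, which update the knowledge structure as new facts or entities become known.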
However, when the needed topic area is not of strictly limited
scope, the construction of useful knowledge bases is much less
straightforward and the knowledge-based processing techniques become
of limited effectiveness. It has been suggested that in these
circumstances, the system user himself might help in building the
knowledge structures. [27] While this remains a possibility, it is
hard to imagine that untrained users can lay out the subject
knowledge of interest in particular areas and specify concept
relationships such as synonyms, generalizations, instantiations, and
cross-references with sufficient accuracy. In any case, no examples
exist at the present time where user constructed knowledge bases
have proved generally valid for different collections in particular
subject areas. In fact, the situation appears much the same as it
was thirty years ago: it seems quite easy to build locally valid
term association systems by ad-hoc means; these tools fail however
in somewhat different environments, and do not furnish reliable
means for improving text processing systems in general.
For the foreseeable future, text processing 
systems using complex information identifica- 
tions and term associations must therefore be 
limited to narrowly restricted topic areas, or 
must alternatively be based on simple user 
inputs, such as document relevance data, that
can be furnished by untrained users without 
undue hardship. 
References

[1] J.C. Gardin, Syntol, in Systems for the Intellectual Organization
    of Information, S. Artandi, editor, Vol. 2, Rutgers University,
    New Brunswick, NJ, 1965.

[2] M.E. Stevens, Automatic Indexing: A State of the Art Report, NBS
    Monograph 91, National Bureau of Standards, Washington, DC, 1965.

[3] G. Salton and M.E. Lesk, Computer Evaluation of Indexing and Text
    Processing, Journal of the ACM, 15:1, January 1968, 8-36.

[4] H.P. Luhn, Auto-Encoding of Documents for Information Retrieval
    Systems, in Modern Trends in Documentation, M. Boaz, editor,
    1959, 45-58.

[5] M.E. Lesk, Word-Word Associations in Document Retrieval Systems,
    American Documentation, 20:1, January 1969, 27-38.

[6] L.B. Doyle, Semantic Road Maps for Literature Searchers, Journal
    of the ACM, 8, 1961, 553-578.

[7] V.E. Giuliano, Automatic Message Retrieval by Associative
    Techniques, in Joint Man-Computer Languages, Mitre Corporation
    Report SS-10, Bedford, MA, 1962, 1-44.

[8] G. Salton, Automatic Information Organization and Retrieval,
    McGraw Hill Book Company, New York, 1968.

[9] D.J. Harper and C.J. van Rijsbergen, An Evaluation of Feedback in
    Document Retrieval using Cooccurrence Data, Journal of
    Documentation, 34:3, September 1978, 189-216.

[10] S.E. Robertson, C.J. van Rijsbergen, and M.F. Porter,
    Probabilistic Models of Indexing and Searching, in Information
    Retrieval Research, R.N. Oddy, S.E. Robertson, C.J. van
    Rijsbergen, and P.W. Williams, editors, Butterworths, London,
    1981, 35-56.

[11] M. Dillon and A.S. Gray, FASIT: A Fully Automatic Syntactically
    Based Indexing System, Journal of the ASIS, 34, 1983, 99-108.

[12] G. Salton, Automatic Phrase Matching, in Readings in Automatic
    Language Processing, D.G. Hays, editor, Am. Elsevier Publishing
    Co., New York, 1966.

[13] R. Grishman, Natural Language Processing, Journal of the ASIS,
    35, 1984, 291-296.

[14] P.J. Hayes and J.G. Carbonell, A Tutorial on Techniques and
    Applications for Natural Language Processing, Technical Report
    CMU-CS-83-158, Carnegie-Mellon University, Pittsburgh, PA, 1983.

[15] J.L. Fagan, Automatic Phrase Indexing for Text Passage Retrieval
    and Printed Subject Indexes, Technical Report, Department of
    Computer Science, Cornell University, Ithaca, NY, May 1985.

[16] G. Salton, The Smart Retrieval System - Experiments in Automatic
    Document Processing, G. Salton, editor, Prentice Hall Inc.,
    Englewood Cliffs, NJ, 1971, 207-208.

[17] K. Sparck Jones and J.I. Tait, Automatic Search Term Variant
    Generation, Journal of Documentation, 40:1, March 1984, 50-66.

[18] C.D. Paice and V. Aragon-Ramirez, The Calculation of
    Similarities between Multi-word Strings using a Thesaurus, Proc.
    RIAO-85 Conference, Grenoble, France, 1985, 293-319.

[19] J.J. Rocchio, Jr., Relevance Feedback in Information Retrieval,
    in The Smart System - Experiments in Automatic Document
    Processing, Prentice Hall, Inc., Englewood Cliffs, NJ, 1971,
    Chapter 14.

[20] E. Ide, New Experiments in Relevance Feedback, in The Smart
    System - Experiments in Automatic Document Processing, G. Salton,
    editor, Prentice Hall Inc., Englewood Cliffs, NJ, 1971,
    Chapter 16.

[21] G. Salton, E.A. Fox and E. Voorhees, Advanced Feedback Methods
    in Information Retrieval, Journal of the ASIS, 36:3, 1985,
    200-210.

[22] A.F. Smeaton, Incorporating Syntactic Information into a
    Document Retrieval Strategy: An Investigation, Technical Report,
    Department of Computer Science, University College, Dublin,
    Ireland, 1986.

[23] M. Minsky, A Framework for Representing Knowledge, in The
    Psychology of Computer Vision, P.H. Winston, editor, McGraw Hill
    Book Co., NY, 1975, 211-277.

[24] R.C. Schank and R.P. Abelson, Scripts, Plans, Goals and
    Understanding, Lawrence Erlbaum Associates, Hillsdale, NJ, 1977.

[25] R.J. Brachman and B.C. Smith, Special Issue on Knowledge
    Representation, SIGART Newsletter, No. 70, February 1980.

[26] M.K. diBenigno, G.R. Cross and C.G. deBessonet, COREL - A
    Conceptual Retrieval System, Technical Report, Louisiana State
    University, Baton Rouge, LA, 1986.

[27] W.B. Croft, User Specified Domain Knowledge for Document
    Retrieval, Technical Report, Computer Science Department,
    University of Massachusetts, Amherst, MA, 1986.