INFORMATION RETRIEVAL USING 
ROBUST NATURAL LANGUAGE PROCESSING 
Tomek Strzalkowski 
Courant Institute of Mathematical Sciences 
New York University 
715 Broadway, rm. 704 
New York, NY 10003 
tomek@cs.nyu.edu 
ABSTRACT 
We developed a fully automated Information Retrieval System which uses 
advanced natural language processing techniques to enhance the effective- 
ness of traditional key-word based document retrieval. In early experiments 
with the standard CACM-3204 collection of abstracts, the augmented sys- 
tem has displayed capabilities that made it clearly superior to the purely 
statistical base system. 
1. OVERALL DESIGN 
Our information retrieval system consists of a traditional 
statistical backbone (Harman and Candela, 1989) aug- 
mented with various natural language processing com- 
ponents that assist the system in database processing (stem- 
ming, indexing, word and phrase clustering, selectional res- 
trictions), and translate a user's information request into an 
effective query. This design is a careful compromise 
between purely statistical non-linguistic approaches and 
those requiring rather accomplished (and expensive) 
semantic analysis of data, often referred to as 'conceptual 
retrieval'. The conceptual retrieval systems, though quite 
effective, are not yet mature enough to be considered in 
serious information retrieval applications, the major prob- 
lems being their extreme inefficiency and the need for 
manual encoding of domain knowledge (Mauldin, 1991). 
In our system the database text is first processed with a fast 
syntactic parser. Subsequently certain types of phrases are 
extracted from the parse trees and used as compound indexing
terms in addition to single-word terms. The extracted
phrases are statistically analyzed as syntactic contexts in 
order to discover a variety of similarity links between 
smaller subphrases and words occurring in them. A further 
filtering process maps these similarity links onto semantic 
relations (generalization, specialization, synonymy, etc.) 
after which they are used to transform user's request into a 
search query. 
The user's natural language request is also parsed, and all 
indexing terms occurring in it are identified. Next, certain
highly ambiguous (usually single-word) terms are
dropped, provided that they also occur as elements in some 
compound terms. For example, "natural" is deleted from a 
query already containing "natural language" because 
"natural" occurs in many unrelated contexts: "natural 
number", "natural logarithm", "natural approach", etc. At 
the same time, other terms may be added, namely those 
which are linked to some query term through admissible 
similarity relations. For example, "fortran" is added to a 
query containing the compound term "program language" 
via a specification link. After the final query is constructed, 
the database search follows, and a ranked list of documents 
is returned. 
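The two query-side steps just described (dropping ambiguous single-word terms that are covered by a compound term, then adding terms reached through admissible similarity links) can be sketched in Python roughly as follows. The function name, the representation of compound terms as space-separated strings, and the `ambiguous` and `links` tables are assumptions made for illustration; the paper does not specify these data structures.

```python
def prune_and_expand(terms, ambiguous, links):
    """Drop ambiguous single-word terms covered by a compound term,
    then append terms linked to the surviving query terms.

    terms     -- indexing terms found in the request (compounds contain a space)
    ambiguous -- set of single-word terms judged highly ambiguous (assumed given)
    links     -- dict: term -> list of admissible expansion terms (assumed given)
    """
    # Words that occur as elements of some compound term in the query.
    compound_parts = {w for t in terms if " " in t for w in t.split()}

    # Step 1: delete an ambiguous single-word term only if a compound covers it.
    kept = [t for t in terms if not (t in ambiguous and t in compound_parts)]

    # Step 2: add linked terms, preserving order and avoiding duplicates.
    query = list(kept)
    for t in kept:
        for new_term in links.get(t, []):
            if new_term not in query:
                query.append(new_term)
    return query


example = prune_and_expand(
    ["natural", "natural language", "program language"],
    ambiguous={"natural"},
    links={"program language": ["fortran"]},
)
```

With the examples from the text, "natural" is removed because "natural language" is present, and "fortran" is added through its link to "program language".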
It should be noted that all the processing steps, those performed
by the backbone system, and those performed by
the natural language processing components, are fully 
automated, and no human intervention or manual encoding 
is required. 
2. FAST PARSING WITH TTP 
TTP (Tagged Text Parser) is based on the Linguistic String
Grammar developed by Sager (1981). Written in Quintus 
Prolog, the parser currently encompasses more than 400 
grammar productions. It produces regularized parse tree 
representations for each sentence that reflect the sentence's 
logical structure. The parser is equipped with a powerful 
skip-and-fit recovery mechanism that allows it to operate 
effectively in the face of ill-formed input or under severe
time pressure. In recent experiments with approximately
6 million words of English text, 1 the parser's speed averaged
between 0.45 and 0.5 seconds per sentence, or up to
2600 words per minute, on a 21 MIPS SparcStation ELC.
Some details of the parser are discussed below. 2 
TTP is a full grammar parser, and initially, it attempts to
generate a complete analysis for each sentence. However, 
unlike an ordinary parser, it has a built-in timer which regu- 
lates the amount of time allowed for parsing any one sen- 
tence. If a parse is not returned before the allotted time 
1 These include CACM-3204, MUC-3, and a selection of nearly
6,000 technical articles extracted from the Computer Library database (a Ziff
Communications Inc. CD-ROM).
2 A complete description can be found in (Strzalkowski, 1991). 
elapses, the parser enters the skip-and-fit mode in which it 
will try to "fit" the parse. While in the skip-and-fit mode, 
the parser will attempt to forcibly reduce incomplete consti- 
tuents, possibly skipping portions of input in order to restart 
processing at a next unattempted constituent. In other 
words, the parser will favor reduction to backtracking while 
in the skip-and-fit mode. The result of this strategy is an 
approximate parse, partially fitted using top-down predictions.
The fragments skipped in the first pass are not thrown
out; instead, they are analyzed by a simple phrasal parser
that looks for noun phrases and relative clauses and then 
attaches the recovered material to the main parse structure. 
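The time-regulated switch between full parsing and skip-and-fit can be pictured with the following Python sketch. It models only the control flow: `full_parse_steps` and `skip_and_fit` are hypothetical callables standing in for the two parsing modes, and the fallback here restarts from the tokens rather than reusing the partially built parse as TTP does.

```python
import time


def parse_with_budget(tokens, full_parse_steps, skip_and_fit, budget_s,
                      clock=time.monotonic):
    """Return (parse, mode), where mode is "full" or "skip-and-fit".

    full_parse_steps(tokens) -- generator yielding None while still working
                                and the finished parse once complete
    skip_and_fit(tokens)     -- fallback producing an approximate parse
    budget_s                 -- time allowed for the full parse, in seconds
    """
    deadline = clock() + budget_s
    for result in full_parse_steps(tokens):
        if result is not None:
            return result, "full"
        if clock() >= deadline:      # allotted time has elapsed
            break
    return skip_and_fit(tokens), "skip-and-fit"


# Toy stand-ins (hypothetical): a "full parser" that needs three steps,
# and a fallback that simply brackets the tokens.
def toy_full_parse(tokens):
    yield None
    yield None
    yield ("S", tuple(tokens))


def toy_skip_and_fit(tokens):
    return ("S-approx", tuple(tokens))
```

Calling `parse_with_budget(["a", "b"], toy_full_parse, toy_skip_and_fit, 60.0)` lets the toy full parse finish, while a budget of `0.0` forces the fallback after the first step.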
As an illustration, consider the following sentence taken 
from the CACM-3204 corpus: 
The method is illustrated by the automatic construction 
of both recursive and iterative programs operating on 
natural numbers, lists, and trees, in order to construct a 
program satisfying certain specifications a theorem in- 
duced by those specifications is proved, and the desired 
program is extracted from the proof. 
The italicized fragment is likely to cause additional compli- 
cations in parsing this lengthy string, and the parser may be 
better off ignoring this fragment altogether. To do so suc- 
cessfully, the parser must close the currently open consti- 
tuent (i.e., reduce a program satisfying certain 
specifications to NP), and possibly a few of its parent con- 
stituents, removing corresponding productions from further 
consideration, until an appropriate production is reactivated.
In this case, TTP may force the following reductions:
SI --> to V NP; SA --> SI; S --> NP V NP SA, until the
production S --> S and S is reached. Next, the parser skips
input to find and, and resumes normal processing.
As may be expected, the skip-and-fit strategy will only be 
effective if the input skipping can be performed with a 
degree of determinism. This means that most of the lexical 
level ambiguity must be removed from the input text, prior 
to parsing. We achieve this using a stochastic
part-of-speech tagger 3 to preprocess the text.
3. WORD SUFFIX TRIMMER 
Word stemming has been an effective way of improving 
document recall since it reduces words to their common 
morphological root, thus allowing more successful matches. 
On the other hand, stemming tends to decrease retrieval 
precision, if care is not taken to prevent situations where 
otherwise unrelated words are reduced to the same stem. In 
our system we replaced a traditional morphological stem- 
mer with a conservative dictionary-assisted suffix trimmer. 4 
The suffix trimmer performs essentially two tasks: (1) it 
reduces inflected word forms to their root forms as specified 
in the dictionary, and (2) it converts nominalized verb 
3 Courtesy of Bolt Beranek and Newman. 
4 We use the Oxford Advanced Learner's Dictionary (OALD) MRD.
forms (e.g., "implementation", "storage") to the root forms of
corresponding verbs (i.e., "implement", "store"). This is
accomplished by removing a standard suffix, e.g.,
"stor+age", replacing it with a standard root ending ("+e"),
and checking the newly created word against the dictionary,
i.e., we check whether the original word ("storage") is
defined using the new root ("store"). This allows reducing
"diversion" to "diverse" while preventing "version" from being
replaced by "verse". Experiments with the CACM-3204 collection
show an improvement in retrieval precision of 6% to
8% over the base system equipped with a standard morpho- 
logical stemmer (the SMART stemmer). 
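The dictionary check for nominalized forms can be illustrated with a small Python sketch. The suffix rules, the toy definitions, and the headword list below are invented for illustration and are not taken from OALD; only the overall procedure (strip a suffix, add a root ending, accept the candidate only if the original word is defined using it) follows the description above.

```python
# (suffix to remove, root ending to add); an assumed, incomplete rule list.
SUFFIX_RULES = [("ation", ""), ("age", "e"), ("ion", "e")]

# Toy dictionary (made-up entries): headwords and definition texts.
HEADWORDS = {"implement", "store", "verse"}
DEFINITIONS = {
    "implementation": "the act to implement a plan",
    "storage": "space used to store goods",
    "version": "a particular form of something",
}


def trim_suffix(word, definitions=DEFINITIONS, headwords=HEADWORDS,
                rules=SUFFIX_RULES):
    """Return the verb root of a nominalization, or the word unchanged."""
    definition_words = set(definitions.get(word, "").lower().split())
    for suffix, ending in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)] + ending
            # Conservative test: the candidate must be a dictionary headword
            # AND the original word must be defined using it.
            if candidate in headwords and candidate in definition_words:
                return candidate
    return word
```

Here "storage" becomes "store" and "implementation" becomes "implement", while "version" is left alone: "verse" is a headword, but the toy definition of "version" does not use it.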
4. HEAD-MODIFIER STRUCTURES 
Syntactic phrases extracted from TTP parse trees are head- 
modifier pairs: from simple word pairs to complex nested 
structures. The head in such a pair is a central element of a 
phrase (verb, main noun, etc.) while the modifier is one of 
the adjunct arguments of the head. 5 For example, the phrase 
fast algorithm for parsing context-free languages yields the 
following pairs: algorithm+fast, algorithm+parse, 
parse+language, language+context_free. The following 
types of pairs were considered: (1) a head noun and its left 
adjective or noun adjunct, (2) a head noun and the head of 
its right adjunct, (3) the main verb of a clause and the head 
of its object phrase, and (4) the head of the subject phrase 
and the main verb. These types of pairs account for most of
the syntactic variants for relating two words (or simple 
phrases) into pairs carrying compatible semantic content. 
For example, the pair [retrieve, information] is extracted
from any of the following fragments: information retrieval
system; retrieval of information from databases; and information
that can be retrieved by a user-controlled interactive
search process. 6 An example is shown in the appendix. 7
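As a rough illustration of how pairs fall out of a regularized parse, the Python sketch below walks a hand-built tree for the phrase "fast algorithm for parsing context-free languages". The dictionary-based tree format (keys `head`, `left`, `right`, `object`) is an assumption made for this example and is not TTP's actual output representation; subject-verb pairs (type 4) are omitted for brevity.

```python
def head_modifier_pairs(node):
    """Collect (head, modifier) pairs from a toy regularized parse node."""
    head = node["head"]
    # Type (1): head noun with its left adjective or noun adjuncts.
    pairs = [(head, adjunct) for adjunct in node.get("left", [])]
    # Types (2) and (3): head with the head of its right adjunct or object.
    for role in ("right", "object"):
        child = node.get(role)
        if child is not None:
            pairs.append((head, child["head"]))
            pairs.extend(head_modifier_pairs(child))
    return pairs


# Hand-built parse of "fast algorithm for parsing context-free languages";
# words are already reduced to root forms.
phrase = {
    "head": "algorithm",
    "left": ["fast"],
    "right": {
        "head": "parse",
        "object": {"head": "language", "left": ["context_free"]},
    },
}

extracted = ["+".join(pair) for pair in head_modifier_pairs(phrase)]
```

For this tree the sketch yields the same four pairs listed in the text: algorithm+fast, algorithm+parse, parse+language, language+context_free.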
5. TERM CORRELATIONS FROM TEXT 
Head-modifier pairs form compound terms used in database 
indexing. They also serve as occurrence contexts for 
smaller terms, including single-word terms. In order to 
determine whether such pairs signify any important associa- 
tion between terms, we calculate the value of the 
5 In the experiments reported here we extracted head-modifier word 
pairs only. The CACM collection is too small to warrant generation of larger
compounds, because of their low frequencies.
6 To deal with nominal compounds we use frequency information
about the pairs generated from the entire corpus to form preferences in am- 
biguous situations, such as natural language processing vs. dynamic infor- 
mation processing. 
7 Note that working with the parsed text ensures a high degree of 
precision in capturing the meaningful phrases, which is especially evident 
when compared with the results usually obtained from either unprocessed 
or only partially processed text (Lewis and Croft, 1990). 
Informational Contribution (IC) function for each element 
in a pair. Higher values indicate stronger association, and 
the element having the largest value is considered semanti- 
cally dominant. IC function is a derivative of Fano's mutual 
information formula recently used by Church and Hanks 
(1990) to compute word co-occurrence patterns in a 44 mil- 
lion word corpus of Associated Press news stories. They 
noted that while generally satisfactory, the mutual informa- 
tion formula often produces counterintuitive results for 
low-frequency data. This is particularly worrisome for rela- 
tively smaller IR collections since many important indexing 
terms would be eliminated from consideration. Therefore, 
following suggestions in Wilks et al. (1990), we adopted a 
revised formula that displays a more stable behavior even 
on very low counts. This new formula, $IC(x,[x,y])$, is based
on (an estimate of) the conditional probability of seeing a
word y to the right of the word x, modified with a dispersion
parameter for x:

$$IC(x,[x,y]) = \frac{f_{x,y}}{n_x + d_x - 1}$$

where $f_{x,y}$ is the frequency of [x,y] in the corpus, $n_x$ is the
number of pairs in which x occurs at the same position as in
[x,y], and $d_x$ is the dispersion parameter understood as
the number of distinct words with which x is paired. When
$IC(x,[x,y]) = 0$, x and y never occur together (i.e.,
$f_{x,y} = 0$); when $IC(x,[x,y]) = 1$, x occurs only with y (i.e.,
$f_{x,y} = n_x$ and $d_x = 1$). Selected examples generated from
CACM-3204 corpus are given in Table 2 at the end of the 
paper. IC values for terms become the basis for calculating 
term-to-term similarity coefficients. If two terms tend to be 
modified with a number of common modifiers and otherwise
appear in few distinct contexts, we assign them a similarity
coefficient, a real number between 0 and 1. The similarity
is determined by comparing distribution characteristics
for both terms within the corpus: how much information
content do they carry, does their information contribution
over contexts vary greatly, are the common contexts in
which these terms occur specific enough? In general we
will credit high-content terms appearing in identical contexts,
especially if these contexts are not too commonplace. 8
The relative similarity between two words $x_1$ and $x_2$
is obtained using the following formula ($\alpha$ is a large constant):

$$SIM(x_1,x_2) = \log\Big(\alpha \sum_y sim_y(x_1,x_2)\Big)$$

where

$$sim_y(x_1,x_2) = MIN(IC(x_1,[x_1,y]),\, IC(x_2,[x_2,y])) \cdot MIN(IC(y,[x_1,y]),\, IC(y,[x_2,y]))$$
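A minimal Python sketch of the IC and SIM computations is given below, assuming pair frequencies are held in a dictionary keyed by (x, y) tuples. The function names, the toy counts, and the value of the constant `alpha` are illustrative assumptions (the paper does not state the value of the large constant), only contexts y in the second position are considered, and the final normalization step is left out.

```python
import math


def ic(pair, pos, counts):
    """IC of the element at position pos (0 or 1) of pair.

    Computes f / (n + d - 1): f is the pair frequency, n the total frequency
    of pairs with the same element at the same position, and d the number of
    distinct words it is paired with there.
    """
    x = pair[pos]
    n_x = 0
    partners = set()
    for other, freq in counts.items():
        if other[pos] == x:
            n_x += freq
            partners.add(other[1 - pos])
    if n_x == 0:
        return 0.0
    return counts.get(pair, 0) / (n_x + len(partners) - 1)


def sim(x1, x2, counts, alpha):
    """Unnormalized SIM(x1, x2) = log(alpha * sum over y of sim_y)."""
    contexts = {y for (_, y) in counts}
    total = 0.0
    for y in contexts:
        p1, p2 = (x1, y), (x2, y)
        total += (min(ic(p1, 0, counts), ic(p2, 0, counts))
                  * min(ic(p1, 1, counts), ic(p2, 1, counts)))
    # No shared context: the logarithm is undefined, so report no similarity.
    return math.log(alpha * total) if total > 0 else None


# Toy pair frequencies (invented for illustration).
toy_counts = {("reduce", "error"): 2, ("minimize", "error"): 2,
              ("minimize", "cost"): 2}
```

With these toy counts, "reduce" occurs only with "error", so its IC in that pair is 1, whereas "minimize" is spread over two contexts and receives a lower value; the only shared context contributing to the sum is "error".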
The similarity function is further normalized with respect to 
8 It would not be appropriate to predict similarity between language 
and logarithm on the basis of their co-occurrence with natural. 
SIM(x1,x1). It may be worth pointing out that the similarities
are calculated using term co-occurrences in syntactic
rather than in document-size contexts, the latter being the
usual practice in non-linguistic clustering (e.g., Sparck Jones
and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990).
Although the two methods of term clustering may be considered
mutually complementary in certain situations, we
believe that more and stronger associations can be obtained
through syntactic-context clustering, given a sufficient
amount of data and a reasonably accurate syntactic parser. 9
6. QUERY EXPANSION 
Similarity relations are used to expand user queries with 
new terms, in an attempt to make the final search query 
more comprehensive (adding synonyms) and/or more 
pointed (adding specializations). 10 It follows that not all
similarity relations will be equally useful in query expansion.
For instance, complementary relations like the one
between algol and fortran may actually harm the system's
performance, since we may end up retrieving many irrelevant
documents. Similarly, the effectiveness of a query contain- 
ing fortran is likely to diminish if we add a similar but far 
more general term such as language. On the other hand, 
database search is likely to miss relevant documents if we 
overlook the fact that fortran is a programming language, 
or that interpolate is a specification of approximate. We 
noted that an average set of similarities generated from a 
text corpus contains about as many "good" relations 
(synonymy, specialization) as "bad" relations (antonymy,
complementation, generalization), as seen from the query 
expansion viewpoint. Therefore any attempt to separate 
these two classes and to increase the proportion of "good" 
relations should result in improved retrieval. This has 
indeed been confirmed in our experiments where a rela- 
tively crude filter has visibly increased retrieval precision. 
In order to create an appropriate filter, we expanded the IC 
function into a global specificity measure called the cumu- 
lative informational contribution function (ICW). ICW is 
calculated for each term across all contexts in which it 
occurs. The general philosophy here is that a more specific 
word/phrase would have a more limited use, i.e., would 
appear in fewer distinct contexts. ICW is similar to the standard
inverse document frequency (idf) measure except that
term frequency is measured over syntactic units rather than
9 Non-syntactic contexts cross sentence boundaries with no fuss, 
which is helpful with short, succinct documents (such as CACM abstracts), 
but less so with longer texts. 
10 Query expansion (in the sense considered here, though not quite in
the same way) has been used in information retrieval research before (e.g.,
Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results.
An alternative is to use term clusters to create new terms, "metaterms", and
use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft,
1990). We found that the query expansion approach gives the system more
flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.
document size units. 11 Terms with higher ICW values are 
generally considered more specific, but the specificity com- 
parison is only meaningful for terms which are already 
known to be similar. The new function is calculated accord- 
ing to the following formula: 12 
$$ICW(w) = IC_L(w) \cdot IC_R(w)$$

where (with $n_w, d_w > 0$):

$$IC_L(w) = IC([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}$$

and analogously for $IC_R(w)$.
For any two terms $w_1$ and $w_2$, and a constant $\delta > 1$, if
$ICW(w_2) \geq \delta \cdot ICW(w_1)$ then $w_2$ is considered more
specific than $w_1$. In addition, if $SIM_{norm}(w_1,w_2) = \sigma > \theta$,
where $\theta$ is an empirically established threshold, then $w_2$ can
be added to the query containing term $w_1$ with weight $\sigma$. 13
In the CACM-3204 collection: 
ICW (algol) = 0.0020923 
ICW (language) = 0.0000145 
ICW (approximate) = 0.0000218 
ICW (interpolate) = 0.0042410 
Therefore interpolate can be used to specialize approximate,
while language cannot be used to expand algol. Note
that if $\delta$ is well chosen (we used $\delta = 10$), then the above filter
will also help to reject antonymous and complementary
relations, such as $SIM_{norm}(pl\_i, cobol) = 0.685$ with
$ICW(pl\_i) = 0.0175$ and $ICW(cobol) = 0.0289$. We continue
working to develop more effective filters. Examples of 
filtered similarity relations obtained from CACM-3204 
corpus are given in Table 3. 
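The specificity filter can be restated as a short Python sketch, using the ICW values quoted above for CACM-3204. The function names are illustrative, and the threshold passed as `theta` in the usage note is a placeholder assumption rather than a value taken from the experiments; the default `delta=10` is the value reported in the text.

```python
def ic_side(n_w, d_w):
    """One-sided contribution n_w / (d_w * (n_w + d_w - 1)); needs n_w, d_w > 0."""
    if n_w <= 0 or d_w <= 0:
        raise ValueError("n_w and d_w must be positive")
    return n_w / (d_w * (n_w + d_w - 1))


def icw(n_left, d_left, n_right, d_right):
    """ICW(w) as the product of the left and right contributions."""
    return ic_side(n_left, d_left) * ic_side(n_right, d_right)


def more_specific(icw_1, icw_2, delta=10.0):
    """True if the second term is more specific than the first."""
    return icw_2 >= delta * icw_1


def can_expand(w1, w2, icw_values, sim_norm, theta, delta=10.0):
    """True if w2 may be added to a query containing w1."""
    return sim_norm > theta and more_specific(icw_values[w1],
                                              icw_values[w2], delta)


# ICW values reported in the text for CACM-3204.
ICW = {
    "algol": 0.0020923, "language": 0.0000145,
    "approximate": 0.0000218, "interpolate": 0.0042410,
    "pl_i": 0.0175, "cobol": 0.0289,
}
```

With these values, `more_specific(ICW["approximate"], ICW["interpolate"])` holds, `more_specific(ICW["algol"], ICW["language"])` does not, and `can_expand("pl_i", "cobol", ICW, 0.685, theta=0.5)` is rejected in both directions because the two ICW values differ by less than a factor of ten.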
7. SUMMARY OF RESULTS 
The preliminary series of experiments with the CACM- 
3204 collection of computer science abstracts showed a 
consistent improvement in performance: the average preci- 
sion increased from 32.8% to 37.1% (a 13% increase), 
while the normalized recall went from 74.3% to 84.5% (a 
14% increase), in comparison with the statistics of the base 
system. This improvement is a combined effect of the new 
stemmer, compound terms, term selection in queries, and 
query expansion using filtered similarity relations. The 
choice of similarity relation filter has been found critical in
improving retrieval precision through query expansion. It 
should also be pointed out that only about 1.5% of all
11 We believe that measuring term specificity over document-size
contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In
particular, syntax-based contexts allow for processing texts without any
internal document structure.
12 Slightly simplified here.
13 The filter was most effective at $\sigma = 0.57$.
similarity relations originally generated from CACM-3204 
were found admissible after filtering, contributing only 1.2
expansions on average per query. It is quite evident that
significantly larger corpora are required to produce more
dramatic results. 14 15 A detailed summary is given in Table 
1 below. 
These results, while modest by IR standards, are significant 
for another reason as well. They were obtained without any 
manual intervention into the database or queries, and 
without using any other information about the database 
except for the text of the documents (i.e., not even the hand 
generated keyword fields enclosed with most documents 
were used). Lewis and Croft (1990), and Croft et al. (1991) 
report results similar to ours but they take advantage of 
Computer Reviews categories manually assigned to some 
documents. The purpose of this research is to explore the 
potential of automated NLP in dealing with large scale IR 
problems, and not necessarily to obtain the best possible 
results on any particular data collection. One of our goals is 
to point a feasible direction for integrating NLP into the 
traditional IR (Strzalkowski and Vauthey, 1991; Grishman 
Tests         org.system   suf.trimmer   query exp.
Recall        Precision
0.00          0.764        0.775         0.793
0.10          0.674        0.688         0.700
0.20          0.547        0.547         0.573
0.30          0.449        0.479         0.486
0.40          0.387        0.421         0.421
0.50          0.329        0.356         0.372
0.60          0.273        0.280         0.304
0.70          0.198        0.222         0.226
0.80          0.146        0.170         0.174
0.90          0.093        0.112         0.114
1.00          0.079        0.087         0.090
Avg. Prec.    0.328        0.356         0.371
% change                   8.3           13.1
Norm Rec.     0.743        0.841         0.842
Queries       50           50            50
Table 1. Recall/precision statistics for CACM-3204
14 KL Kwok (private communication) has suggested that the low
percentage of admissible relations might be similar to the phenomenon of 'tight clusters' which while meaningful are so few that their impact is
small.
15 A sufficiently large text corpus is 20 million words or more. This
has been partially confirmed by experiments performed at the University of Massachusetts (B. Croft, private communication).
and Strzalkowski, 1991). 
ACKNOWLEDGEMENTS 
We would like to thank Donna Harman of NIST for making 
her IR system available to us. We would also like to thank 
Ralph Weischedel and Marie Meteer of BBN for providing 
and assisting in the use of the part of speech tagger. KL 
Kwok has offered many helpful comments on an earlier 
draft of this paper. In addition, ACM has generously pro- 
vided us with text data from the Computer Library database 
distributed by Ziff Communications Inc. This paper is 
based upon work supported by the Defense Advanced 
Research Projects Agency under Contract N00014-90-J-1851
from the Office of Naval Research, and the National
Science Foundation under Grant IRI-89-02304. 
REFERENCES 
1. Harman, Donna and Gerald Candela. 1989. "Retrieving 
Records from a Gigabyte of text on a Minicomputer Using
Statistical Ranking." Journal of the American Society for 
Information Science, 41 (8), pp. 581-589. 
2. Mauldin, Michael. 1991. "Retrieval Performance in Ferret: A 
Conceptual Information Retrieval System." Proceedings of 
ACM SIGIR-91, pp. 347-355. 
3. Sager, Naomi. 1981. Natural Language Information Pro- 
cessing. Addison-Wesley. 
4. Strzalkowski, Tomek. 1991. "TTP: A Fast and Robust Parser
for Natural Language." Proteus Project Memo #43, Courant 
Institute of Mathematical Science, New York University. 
5. Lewis, David D. and W. Bruce Croft. 1990. "Term Cluster- 
ing of Syntactic Phrases". Proceedings of ACM SIGIR-90, 
pp. 385-405. 
6. Church, Kenneth Ward and Hanks, Patrick. 1990. "Word 
association norms, mutual information, and lexicography." 
Computational Linguistics, 16(1), MIT Press, pp. 22-29.
7. Wilks, Yorick A., Dan Fass, Cheng-Ming Guo, James E. 
McDonald, Tony Plate, and Brian M. Slator. 1990. "Provid- 
ing machine tractable dictionary tools." Machine Transla- 
tion, 5, pp. 99-154. 
8. Sparck Jones, K. and E. O. Barber. 1971. "What makes 
automatic keyword classification effective?" Journal of the 
American Society for Information Science, May-June, pp. 
166-175. 
9. Crouch, Carolyn J. 1988. "A cluster-based approach to 
thesaurus construction." Proceedings of ACM SIGIR-88, pp. 
309-320. 
10. Sparck Jones, K. and J. I. Tait. 1984. "Automatic search 
term variant generation." Journal of Documentation, 40(1), 
pp. 50-66. 
11. Harman, Donna. 1988. "Towards interactive query expansion."
Proceedings of ACM SIGIR-88, pp. 321-331.
12. Sparck Jones, Karen. 1972. "Statistical interpretation of 
term specificity and its application in retrieval." Journal of 
Documentation, 28(1), pp. 11-20.
13. Croft, W. Bruce, Howard R. Turtle, and David D. Lewis. 
1991. "The Use of Phrases and Structured Queries in Infor- 
mation Retrieval." Proceedings of ACM SIGIR-91, pp. 32- 
45. 
14. Strzalkowski, Tomek and Barbara Vauthey. 1991. "Fast Text 
Processing for Information Retrieval." Proceedings of the 4th
DARPA Speech and Natural Language Workshop, Morgan
Kaufmann, pp. 346-351.
15. Strzalkowski, Tomek and Barbara Vauthey. 1991. "Natural 
Language Processing in Automated Information Retrieval." 
Proteus Project Memo #42, Courant Institute of Mathematical 
Science, New York University. 
16. Grishman, Ralph and Tomek Strzalkowski. 1991. "Informa- 
tion Retrieval and Natural Language Processing." Position 
paper at the workshop on Future Directions in Natural
Language Processing in Information Retrieval, Chicago. 
APPENDIX: SAMPLE DATA 
DOCUMENT TEXT: 
*RECORD* 
*F* NO 
2366 
*F* TITLE 
Complex gamma function with error control
*F* TEXT 
An algorithm to compute the gamma function and 
log gamma function of a complex variable is presented. 
The standard algorithm is modified in several respects 
to insure the continuity of the function value 
and to reduce accumulation of round-off errors. In 
addition to computation of function values, this 
algorithm includes an object-time estimation of round-off 
errors. Experimental data with regard to the 
effectiveness of this error control are presented. 
a fortran program for the algorithm appears in the 
algorithms section of this issue. 
HEAD+MODIFIER PAIRS EXTRACTED: 
function+gamma 
present+algorithm 
compute+function 
function+log 
gamma+log 
variable+complex 
algorithm+standard 
reduce+accumulate 
error+round_off
include+estimate 
estimate+error 
present+data 
effective+control 
programme+fortran 
algorithm+issue 
control+error 
algorithm+compute 
function+function 
gamma+function 
gamma+variable 
modify+algorithm
insure+continue
accumulate+error
algorithm+include
estimate+object_time
error+round_off
data+experimental 
control+error 
section+algorithm 
word          head+modifier pair     IC coeff.
theory        theory+mathematical
mathematical  theory+mathematical
distribute    distribute+normal
normal        distribute+normal
minimum       minimum+relative
relative      minimum+relative
retrieve      retrieve+inform
inform        retrieve+inform
size          size+medium
medium        size+medium
editor        editor+text
text          editor+text
system        system+parallel
parallel      system+parallel
read          read+character
character     read+character
discuss       discuss+panel
panel         discuss+panel
implicate     implicate+legal
legal         implicate+legal
system        system+distribute
distribute    system+distribute
make          make+recommend
recommend     make+recommend
infer         infer+deductive
deductive     infer+deductive
make          make+arrange
arrange       make+arrange
share         share+resource
resource      share+resource
comprehend    comprehend+language
language      comprehend+language
syntax        syntax+language
language      syntax+language
science       science+compute
compute       science+compute
maintain      concept+maintain
cost          cost+maintain
Table 2. IC coefficients obtained from CACM-3204
word1           word2
*aim            purpose
algorithm       technique
algorithm       method
acquire         train
*adjacency      pair
*algebraic      symbol
*american       standard
assert          infer
back-up         mini-max
*buddy          time-share
committee       *symposium
correct         theorem
babylonian      old
critical        final
best-fit        first-fit
bound-context   lr
*duplex         reliable
deletion        insert
earlier         previous
encase          minimum-area
give            present
imaginary       real
incomplete      miss
input           output
lead            *trail
*marriage       stable
mean            *standard
method          technique
memory          storage
match           recognize
lower           upper
minor           *woman
progress        *trend
purdue          stanford
range           variety
round-off       truncate
remote          teletype
pulse           wave
Table 3. Filtered word similarities (* indicates the
more specific term).
