ROBUST TEXT PROCESSING IN AUTOMATED INFORMATION RETRIEVAL 
Tomek Strzalkowski 
Courant Institute of Mathematical Sciences 
New York University 
715 Broadway, rm. 704 
New York, NY 10003 
tomek@cs.nyu.edu 
ABSTRACT 
This paper outlines a prototype text retrieval system 
which uses relatively advanced natural language pro- 
cessing techniques in order to enhance the effective- 
ness of statistical document retrieval. The backbone 
of our system is a traditional retrieval engine which 
builds inverted index files from pre-processed docu- 
ments, and then searches and ranks the documents in 
response to user queries. Natural language process- 
ing is used to (1) preprocess the documents in order 
to extract contents-carrying terms, (2) discover inter- 
term dependencies and build a conceptual hierarchy 
specific to the database domain, and (3) process 
user's natural language requests into effective search 
queries. The basic assumption of this design is that 
term-based representation of contents is in principle 
sufficient to build an effective if not optimal search 
query out of any user's request. This has been 
confirmed by an experiment that compared effective- 
ness of expert-user prepared queries with those 
derived automatically from an initial narrative infor- 
mation request. In this paper we show that large- 
scale natural language processing (hundreds of mil- 
lions of words and more) is not only required for a 
better retrieval, but it is also doable, given appropri- 
ate resources. We report on selected preliminary 
results of experiments with a 500 MByte database of
Wall Street Journal articles, as well as some earlier 
results with a smaller document collection. 
INTRODUCTION 
A typical information retrieval (IR) task is to
select documents from a database in response to a
user's query, and rank these documents according to 
relevance. This has been usually accomplished using 
statistical methods (often coupled with manual 
encoding) that (a) select terms (words, phrases, and 
other units) from documents that are deemed to best 
represent their contents, and (b) create an inverted 
index file (or files) that provide easy access to
documents containing these terms. An important 
issue here is that of finding an appropriate 
combination of term weights which would reflect 
each term's relative contribution to the information 
contents of the document. Among many possible 
weighting schemes the inverted document frequency
(idf) has come to be recognized as universally appli-
cable across a variety of different text collections.
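The weighting idea can be illustrated with a small sketch. It assumes the common formulation idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t; the paper does not state which variant the system uses, and the function name and toy documents below are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Toy collection (assumed data): "joint" and "venture" occur in most documents.
docs = [
    ["joint", "venture", "announced"],
    ["joint", "statement", "issued"],
    ["venture", "capital", "raised"],
    ["joint", "venture", "dissolved"],
]
weights = idf_weights(docs)
```

In this toy collection the frequent words joint and venture receive lower weights than a rare word such as capital, which is the effect that makes frequent single words poor discriminators on their own.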
Once the index is created, the search process 
will attempt to match a preprocessed user query (or 
queries) against representations of documents, in each
case determining a degree of relevance between the 
two which depends upon the number and types of 
matching terms. Although many sophisticated search 
and matching methods are available, the crucial prob- 
lem remains to be that of an adequate representation 
of contents for both the documents and the queries. 
The simplest word-based representations of 
contents are usually inadequate since single words 
are rarely specific enough for accurate discrimina- 
tion, and their grouping is often accidental. A better 
method is to identify groups of words that create 
meaningful phrases, especially if these phrases 
denote important concepts in database domain. For 
example, joint venture is an important term in Wall 
Street Journal (WSJ henceforth) database, while nei- 
ther joint nor venture are important by themselves. In 
the retrieval experiments with the WSJ database, we 
noticed that both joint and venture were dropped 
from the list of terms by the system because their idf 
weights were too low. In large databases, such as 
TIPSTER/TREC, the use of phrasal terms is not just
desirable, it becomes necessary. 
The question thus becomes, how to identify the 
correct phrases in the text? Both statistical and syn- 
tactic methods were used before with only limited 
success. Statistical methods based on word co- 
occurrences and mutual information are prone to high 
error rates, turning out many unwanted associations. 
Syntactic methods suffered from low quality of gen- 
erated parse structures that could be attributed to lim- 
ited coverage grammars and the lack of adequate lex-
icons. In fact, the difficulties encountered in applying
computational linguistics technologies to text pro- 
cessing have contributed to a wide-spread belief that 
automated natural language processing may not be 
suitable in IR. These difficulties included 
inefficiency, lack of robustness, and prohibitive cost 
of manual effort required to build lexicons and 
knowledge bases for each new text domain. On the 
other hand, while numerous experiments did not 
establish the usefulness of linguistic methods in IR, 
they cannot be considered conclusive because of their 
limited scale. 1
The rapid progress in Computational Linguis- 
tics over the last few years has changed this equation 
in various ways. First of all, large-scale resources 
became available: on-line lexicons, including Oxford 
Advanced Learner's Dictionary (OALD), Longman 
Dictionary of Contemporary English (LDOCE), 
Webster's Dictionary, Oxford English Dictionary, 
Collins Dictionary, and others, as well as large text 
corpora, many of which can now be obtained for 
research purposes. Robust text-oriented software 
tools have been built, including part of speech 
taggers (stochastic and otherwise), and fast parsers 
capable of processing text at speeds of 4200 words 
per minute or more (e.g., the TTP parser developed by
the author). While many of the fast parsers are not 
very accurate (they are usually partial analyzers by 
design), 2 some, like TTP, perform in fact no worse
than standard full-analysis parsers which are many 
times slower and far less robust. 3 
An accurate syntactic analysis is an essential 
prerequisite for term selection, but it is by no means 
sufficient. Syntactic parsing of the database contents 
is usually attempted in order to extract linguistically 
motivated phrases, which presumably are better indi- 
cators of contents than "statistical phrases" where 
words are grouped solely on the basis of physical 
proximity (e.g., "college junior" is not the same as
"junior college"). However, creation of such com-
pound terms makes the term matching process more
complex since in addition to the usual problems of 
synonymy and subsumption, one must deal with their 
structure (e.g., "college junior" is the same as "junior 
in college"). In order to deal with structure, parser's 
1 Standard IR benchmark collections are statistically too
small and the experiments can easily produce counterintuitive
results. For example, the Cranfield collection is only approx. 180,000
English words, while the CACM-3204 collection is approx. 200,000
words.
2 Partial parsing is usually fast enough, but it also generates
noisy data: as many as 50% of all generated phrases could be in-
correct (Lewis and Croft, 1990).
3 TTP has been shown to produce parse structures which are
no worse in recall, precision and crossing rate than those generated
by full-scale linguistic parsers when compared to hand-coded
Treebank parse trees.
output needs to be "normalized" or "regularized" so 
that complex terms with the same or closely related 
meanings would indeed receive matching representa- 
tions. This goal has been achieved to a certain extent 
in the present work. As will be discussed in more
detail below, indexing terms were selected from 
among head-modifier pairs extracted from predicate- 
argument representations of sentences. 
The next important task is to achieve normali-
zation across different terms with close or related
meaning. This can be accomplished by discovering 
various semantic relationships among words and 
phrases, such as synonymy and subsumption. For 
example, the term natural language can be con-
sidered, in certain domains at least, to subsume any
term denoting a specific human language, such as 
English. Therefore, a query containing the former 
may be expected to retrieve documents containing 
the latter. The system presented here computes term 
associations from text on word and fixed phrase level 
and then uses these associations in query expansion. 
A fairly primitive filter is employed to separate 
synonymy and subsumption relationships from others 
including antonymy and complementation, some of 
which are strongly domain-dependent. This process 
has led to an increased retrieval precision in experi- 
ments with smaller and more cohesive collections 
(CACM-3204). 
In the following sections we present an over- 
view of our system, with the emphasis on its text- 
processing components. We would like to point out 
here that the system is completely automated, i.e., all 
the processing steps, those performed by the statisti-
cal core, and those performed by the natural language
processing components, are done automatically, and 
no human intervention or manual encoding is 
required. 
OVERALL DESIGN 
Our information retrieval system consists of a 
traditional statistical backbone (NIST's PRISE sys- 
tem; Harman and Candela, 1989) augmented with 
various natural language processing components that 
assist the system in database processing (stemming, 
indexing, word and phrase clustering, selectional res- 
trictions), and translate a user's information request 
into an effective query. This design is a careful 
compromise between purely statistical non-linguistic 
approaches and those requiring rather accomplished 
(and expensive) semantic analysis of data, often
referred to as 'conceptual retrieval'. 
In our system the database text is first pro- 
cessed with a fast syntactic parser. Subsequently cer- 
tain types of phrases are extracted from the parse 
trees and used as compound indexing terms in addi- 
tion to single-word terms. The extracted phrases are 
statistically analyzed as syntactic contexts in order to 
discover a variety of similarity links between smaller 
subphrases and words occurring in them. A further 
filtering process maps these similarity links onto 
semantic relations (generalization, specialization, 
synonymy, etc.) after which they are used to 
transform a user's request into a search query.
The user's natural language request is also
parsed, and all indexing terms occurring in it are
identified. Certain highly ambiguous, usually single- 
word terms may be dropped, provided that they also 
occur as elements in some compound terms. At the 
same time, other terms may be added, namely those 
which are linked to some query term through admis- 
sible similarity relations. For example, "unlawful 
activity" is added to a query containing the com- 
pound term "illegal activity" via a synonymy link 
between "illegal" and "unlawful". After the final 
query is constructed, the database search follows, and 
a ranked list of documents is returned. 
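The query-side steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: the function name `build_query`, the `head+modifier` string encoding of compound terms, and the `links` and `ambiguous` inputs are all hypothetical.

```python
def build_query(terms, links, ambiguous=frozenset()):
    """Drop covered ambiguous words, then add compounds via similarity links."""
    compounds = [t for t in terms if "+" in t]
    covered = {word for c in compounds for word in c.split("+")}
    # Drop an ambiguous single-word term only if some compound contains it.
    query = [t for t in terms if not (t in ambiguous and t in covered)]
    for compound in compounds:
        head, modifier = compound.split("+", 1)
        # Substitute each linked word for the modifier, then for the head.
        query += [f"{head}+{alt}" for alt in links.get(modifier, [])]
        query += [f"{alt}+{modifier}" for alt in links.get(head, [])]
    return list(dict.fromkeys(query))  # remove duplicates, keep order

# "illegal activity" with an assumed synonymy link illegal -> unlawful.
query = build_query(
    ["activity", "illegal", "activity+illegal"],
    links={"illegal": ["unlawful"]},
    ambiguous={"activity"},
)
```

Here the ambiguous word activity is dropped because it already occurs inside a compound, and the compound for "unlawful activity" is added through the link.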
The purpose of this elaborate linguistic pro- 
cessing is to create a better representation of docu- 
ments and to generate best possible queries out of 
user's initial requests. Despite limitations of term- 
and-weight type representation (or boolean versions 
thereof), very good queries can be produced by 
human experts. In order to imitate an expert, the sys- 
tem must be able to learn about its database, in par- 
ticular about various correlations among index terms. 
FAST PARSING WITH TTP PARSER
TTP (Tagged Text Parser) is based on the
Linguistic String Grammar developed by Sager 
(1981). The parser currently encompasses some 400 
grammar productions, but it is by no means complete. 
The parser's output is a regularized parse tree 
representation of each sentence, that is, a representa- 
tion that reflects the sentence's logical predicate- 
argument structure. For example, logical subject and 
logical object are identified in both passive and active 
sentences, and noun phrases are organized around 
their head elements. The significance of this 
representation will be discussed below. The parser is 
equipped with a powerful skip-and-fit recovery 
mechanism that allows it to operate effectively in the 
face of ill-formed input or under severe time pres-
sure. In the runs with approximately 83 million words
of TREC's Wall Street Journal texts, 4 the parser's
4 Approximately 0.5 GBytes of text, over 4 million sentences.
speed averaged between 0.3 and 0.5 seconds per sen-
tence, or up to 4200 words per minute, on a Sun
SparcStation-2.
TTP is a full grammar parser, and initially, it
attempts to generate a complete analysis for each 
sentence. However, unlike an ordinary parser, it has a 
built-in timer which regulates the amount of time 
allowed for parsing any one sentence. If a parse is not 
returned before the allotted time elapses, the parser 
enters the skip-and-fit mode in which it will try to
"fit" the parse. While in the skip-and-fit mode, the
parser will attempt to forcibly reduce incomplete 
constituents, possibly skipping portions of input in 
order to restart processing at a next unattempted con- 
stituent. In other words, the parser will favor reduc- 
tion to backtracking while in the skip-and-fit mode. 
The result of this strategy is an approximate parse, 
partially fitted using top-down predictions. The frag- 
ments skipped in the first pass are not thrown out, 
instead they are analyzed by a simple phrasal parser 
that looks for noun phrases and relative clauses and 
then attaches the recovered material to the main parse 
structure. As an illustration, consider the following 
sentence taken from the CACM-3204 corpus: 
The method is illustrated by the automatic con-
struction of both recursive and iterative pro-
grams operating on natural numbers, lists, and
trees, in order to construct a program satisfying
certain specifications a theorem induced by
those specifications is proved, and the desired
program is extracted from the proof.
The italicized fragment is likely to cause additional 
complications in parsing this lengthy string, and the 
parser may be better off ignoring this fragment alto- 
gether. To do so successfully, the parser must close 
the currently open constituent (i.e., reduce a program 
satisfying certain specifications to NP), and possibly 
a few of its parent constituents, removing 
corresponding productions from further considera- 
tion, until an appropriate production is reactivated. 
In this case, TTP may force the following reductions:
SI -> to V NP; SA -> SI; S -> NP V NP SA, until the
production S -> S and S is reached. Next, the parser
skips input to find and, and resumes normal process- 
ing. 
As may be expected, the skip-and-fit strategy 
will only be effective if the input skipping can be per- 
formed with a degree of determinism. This means 
that most of the lexical level ambiguity must be
removed from the input text prior to parsing. We
achieve this using a stochastic part of speech tagger
to preprocess the text. Full details of the parser can 
be found in (Strzalkowski, 1992). 
PART OF SPEECH TAGGER 
One way of dealing with lexical ambiguity is to 
use a tagger to preprocess the input marking each 
word with a tag that indicates its syntactic categoriza- 
tion: a part of speech with selected morphological 
features such as number, tense, mode, case and 
degree. The following are tagged sentences from the 
CACM-3204 collection: 5
The/dt paper/nn presents/vbz a/dt proposal/nn
for/in structured/vbn representation/nn of/in
multiprogramming/vbg in/in a/dt high/jj level/nn
language/nn ./per
The/dt notation/nn used/vbn explicitly/rb
associates/vbz a/dt data/nns structure/nn
shared/vbn by/in concurrent/jj processes/nns
with/in operations/nns defined/vbn on/in it/pp
./per
The tags are understood as follows: dt - determiner, 
nn - singular noun, nns - plural noun, in - preposition, 
jj - adjective, vbz - verb in present tense third person 
singular, to - particle "to", vbg - present participle, 
vbn - past participle, vbd - past tense verb, vb - 
infinitive verb, cc - coordinate conjunction. 
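Text in this word/tag format is easy to consume downstream. The following sketch, with a hypothetical helper name `read_tagged`, splits each token on its last slash and collects the nouns; it illustrates the format only and is not part of the tagger.

```python
def read_tagged(sentence):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")  # split on the last slash
        pairs.append((word, tag))
    return pairs

tagged = "The/dt paper/nn presents/vbz a/dt proposal/nn ./per"
pairs = read_tagged(tagged)
nouns = [word for word, tag in pairs if tag in ("nn", "nns")]
```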
Tagging of the input text substantially reduces 
the search space of a top-down parser since it 
resolves most of the lexical level ambiguities. In the 
examples above, tagging of presents as "vbz" in the 
first sentence cuts off a potentially long and costly 
"garden path" with presents as a plural noun followed 
by a headless relative clause starting with (that) a 
proposal .... In the second sentence, tagging resolves 
ambiguity of used (vbn vs. vbd), and associates (vbz 
vs. nns). Perhaps more importantly, elimination of 
word-level lexical ambiguity allows the parser to 
make projections about the input which is yet to be
parsed, using a simple lookahead; in particular, 
phrase boundaries can be determined with a degree 
of confidence (Church, 1988). This latter property is 
critical for implementing skip-and-fit recovery tech- 
nique outlined in the previous section. 
Tagging of input also helps to reduce the 
number of parse structures that can be assigned to a 
sentence, decreases the demand for consulting of the 
dictionary, and simplifies dealing with unknown 
words. Since every item in the sentence is assigned a 
tag, so are the words for which we have no entry in 
the lexicon. Many of these words will be tagged as
"np" (proper noun); however, the surrounding tags
may force other selections. In the following exam- 
ple, chinese, which does not appear in the dictionary, 
5 Tagged using the 35-tag Penn Treebank Tagset created at
the University of Pennsylvania.
is tagged as "jj": 6
this/dt paper/nn dates/vbz back/rb the/dt
genesis/nn of/in binary/jj conception/nn circa/in
5000/cd years/nns ago/rb ,/com as/rb
derived/vbn by/in the/dt chinese/jj ancients/nns
./per
WORD SUFFIX TRIMMER 
Word stemming has been an effective way of 
improving document recall since it reduces words to 
their common morphological root, thus allowing 
more successful matches. On the other hand, stem- 
ming tends to decrease retrieval precision, if care is 
not taken to prevent situations where otherwise unre- 
lated words are reduced to the same stem. In our sys- 
tem we replaced a traditional morphological stemmer 
with a conservative dictionary-assisted suffix trim- 
mer. 7 The suffix trimmer performs essentially two 
tasks: (1) it reduces inflected word forms to their root 
forms as specified in the dictionary, and (2) it con- 
verts nominalized verb forms (e.g., "implementa- 
tion", "storage") to the root forms of corresponding 
verbs (i.e., "implement", "store"). This is accom-
plished by removing a standard suffix, e.g.,
"stor+age", replacing it with a standard root ending
("+e"), and checking the newly created word against
the dictionary, i.e., we check whether the new root 
("store") is indeed a legal word, and whether the ori- 
ginal root ("storage") is defined using the new root 
("store") or one of its standard inflectional forms 
(e.g., "storing"). For example, the following 
definitions are excerpted from the Oxford Advanced 
Learner's Dictionary (OALD): 
storage n [U] (space used for, money paid for)
the storing of goods ...
diversion n [U] diverting ...
procession n [C] number of persons, vehicles,
etc moving forward and following each other in 
an orderly way. 
Therefore, we can reduce "diversion" to "divert" by 
removing the suffix "+sion" and adding root form 
suffix "+t". On the other hand, "process+ion" is not 
reduced to "process". 
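The dictionary check can be sketched as below. This is only an illustration under stated assumptions: the tiny `DEFINITIONS` and `VERBS` tables stand in for the machine-readable dictionary, `RULES` is an illustrative subset of suffix rules, and the `inflections` helper is a deliberately rough approximation; none of these names come from the system itself.

```python
# Toy stand-ins for the machine-readable dictionary (assumed data).
DEFINITIONS = {
    "storage": "the storing of goods",
    "diversion": "diverting",
    "procession": "number of persons, vehicles, etc moving forward",
}
VERBS = {"store", "divert", "process"}
# (suffix to remove, root ending to add); an illustrative subset of rules.
RULES = [("age", "e"), ("sion", "t"), ("ion", "")]

def inflections(verb):
    """Very rough inflected forms of a verb, enough for the toy examples."""
    stem = verb[:-1] if verb.endswith("e") else verb
    return {verb, stem + "ing", stem + "ed", verb + "s"}

def trim(word):
    """Reduce a nominalization to its verb root if the dictionary agrees."""
    definition = set(DEFINITIONS.get(word, "").replace(",", " ").split())
    for suffix, ending in RULES:
        if not word.endswith(suffix):
            continue
        root = word[: -len(suffix)] + ending
        # Accept only if the root is a legal word and the definition uses it.
        if root in VERBS and inflections(root) & definition:
            return root
    return word
```

With these toy tables, `trim("storage")` and `trim("diversion")` are reduced to their verb roots, while `trim("procession")` is left unchanged because its definition does not mention "process" or any of its inflected forms.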
Earlier experiments with CACM-3204 collec- 
tion showed an improvement in retrieval precision by 
6% to 8% over the base system equipped with a stan- 
dard morphological stemmer (the SMART stemmer). 
6 We use the machine-readable version of the Oxford Ad-
vanced Learner's Dictionary (OALD).
7 Dealing with prefixes is a more complicated matter, since
they may have quite strong effect upon the meaning of the result-
ing term, e.g., un- usually introduces explicit negation.
HEAD-MODIFIER STRUCTURES 
Syntactic phrases extracted from TTP parse
trees are head-modifier pairs. The head in such a pair 
is a central element of a phrase (main verb, main 
noun, etc.), while the modifier is one of the adjunct 
arguments of the head. In the TREC experiments 
reported here we extracted head-modifier word and 
fixed-phrase pairs only. While TREC WSJ database 
is large enough to warrant generation of larger com- 
pounds, we were in no position to verify their effec- 
tiveness in indexing. This was largely because of the 
tight schedule, but also because of rapidly escalating 
complexity of the indexing process: even with 2-
word phrases, compound terms accounted for nearly
96% of all index entries; in other words, including 2-
word phrases increased the index size 25 times!
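The two figures agree with each other. Writing p for the fraction of index entries that are compound terms (p = 0.96 here), single-word terms make up the remaining 1 - p, so the growth factor relative to a single-word index is:

```latex
\frac{\text{all index entries}}{\text{single-word entries}}
  = \frac{1}{1 - p}
  = \frac{1}{1 - 0.96}
  = 25
```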
Let us consider a specific example from WSJ 
database: 
The former Soviet president has been a local 
hero ever since a Russian tank invaded Wiscon-
sin.
The tagged sentence is given below, followed by the
regularized parse structure generated by TTP, given
in Figure 1.
The/dt former/jj Soviet/jj president/nn has/vbz
been/vbn a/dt local/jj hero/nn ever/rb since/in
a/dt Russian/jj tank/nn invaded/vbd
Wisconsin/np ./per
It should be noted that the parser's output is a 
predicate-argument structure centered around main 
elements of various phrases. In Figure 1, BE is the 
main predicate (modified by HAVE) with 2 argu-
ments (subject, object) and 2 adjuncts (adv, sub_ord).
INVADE is the predicate in the subordinate clause
with 2 arguments (subject, object). The subject of
BE is a noun phrase with PRESIDENT as the head 
element, two modifiers (FORMER, SOVIET) and a 
determiner (THE). From this structure, we extract 
head-modifier pairs that become candidates for com- 
pound terms. The following types of pairs are con- 
sidered: (1) a head noun and its left adjective or noun 
adjunct, (2) a head noun and the head of its right 
adjunct, (3) the main verb of a clause and the head of 
its object phrase, and (4) the head of the subject 
phrase and the main verb. These types of pairs 
account for most of the syntactic variants for relating 
two words (or simple phrases) into pairs carrying 
compatible semantic content. For example, the pair 
retrieve+information will be extracted from any of 
the following fragments: information retrieval sys-
tem; retrieval of information from databases; and
information that can be retrieved by a user-
controlled interactive search process.
[assert
  [[perf [HAVE]]
   [[verb [BE]]
    [subject
      [np
        [n PRESIDENT]
        [t_pos THE]
        [adj [FORMER]]
        [adj [SOVIET]]]]
    [object
      [np
        [n HERO]
        [t_pos A]
        [adj [LOCAL]]]]
    [adv EVER]
    [sub_ord
      [SINCE
        [[verb [INVADE]]
         [subject
           [np
             [n TANK]
             [t_pos A]
             [adj [RUSSIAN]]]]
         [object
           [np
             [name [WISCONSIN]]]]]]]]]]
Figure 1. Predicate-argument parse structure.
In the example at hand, the following head-modifier pairs are
extracted (pairs containing low-contents elements,
such as BE and FORMER, or names, such as
WISCONSIN, will be later discarded):
[PRESIDENT,BE]
[PRESIDENT,FORMER]
[PRESIDENT,SOVIET]
[BE,HERO]
[HERO,LOCAL]
[TANK,INVADE]
[TANK,RUSSIAN]
[INVADE,WISCONSIN]
We may note that the three-word phrase former 
Soviet president has been broken into two pairs 
former president and Soviet president, both of which 
denote things that are potentially quite different from 
what the original phrase refers to, and this fact may 
have potentially negative effect on retrieval preci- 
sion. This is one place where a longer phrase appears 
more appropriate. The representation of this sentence 
may therefore contain the following terms: 
PRESIDENT, SOVIET, PRESIDENT+SOVIET,
PRESIDENT+FORMER, HERO, HERO+LOCAL,
INVADE, TANK, TANK+INVADE, TANK+RUSSIAN,
RUSSIAN, INVADE+WISCONSIN, WISCONSIN.
The particular way of interpreting syntactic 
contexts was dictated, to some degree at least, by sta- 
tistical considerations. Our original experiments 
were performed on a relatively small collection 
(CACM-3204), and therefore we combined pairs 
obtained from different syntactic relations (e.g., 
verb-object, subject-verb, noun-adjunct, etc.) in order 
to increase frequencies of some associations. This 
became largely unnecessary in a large collection such 
as TIPSTER, but we had no means to test alternative 
options, and thus decided to stay with the original. It 
should not be difficult to see that this was a 
compromise solution, since many important distinc- 
tions were potentially lost, and strong associations 
could be produced where there weren't any. A way to 
improve things is to consider different syntactic rela- 
tions independently, perhaps as independent sources 
of evidence that could lend support (or not) to certain 
term similarity predictions. We have already started 
testing this option. 
One difficulty in obtaining head-modifier pairs 
of highest accuracy is the notorious ambiguity of 
nominal compounds. For example, the phrase natural 
language processing should generate 
language+natural and processing+language, while 
dynamic information processing is expected to yield 
processing+dynamic and processing+information.
Still another case is executive vice president where
the association president+executive may be stretch- 
ing things a bit too far. Since our parser has no 
knowledge about the text domain, and uses no 
semantic preferences, it does not attempt to guess any 
internal associations within such phrases. Instead, 
this task is passed to the pair extractor module which 
processes ambiguous parse structures in two phases.
In phase one, all and only unambiguous head- 
modifier pairs are extracted, and the frequencies of 
their occurrences are recorded. In phase two, fre- 
quency information about pairs generated in the first 
pass is used to form associations from ambiguous 
structures. For example, if language+natural has 
occurred unambiguously a number of times in contexts
such as parser for natural language, while 
processing+natural has occurred significantly fewer 
times or perhaps none at all, then we will prefer the 
former association as valid. 
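The two-phase procedure can be sketched as follows. The sketch simplifies the decision rule to "pick the candidate with the highest unambiguous count", whereas the text requires the difference to be significant; the function name `resolve` and the toy counts are assumptions made for illustration.

```python
from collections import Counter

def resolve(candidates, counts):
    """Pick the candidate pair seen most often in unambiguous contexts."""
    best = max(candidates, key=lambda pair: counts[pair])
    return best if counts[best] > 0 else None  # no evidence, no decision

# Phase one: count pairs taken from unambiguous structures (toy data).
unambiguous = [("language", "natural")] * 3 + [("processing", "language")] * 2
counts = Counter(unambiguous)

# Phase two: in "natural language processing" the modifier "natural"
# could attach to either "language" or "processing".
choice = resolve([("language", "natural"), ("processing", "natural")], counts)
```

A fuller version would compare the counts against a significance threshold before committing to an association.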
TERM CORRELATIONS FROM TEXT 
Head-modifier pairs form compound terms 
used in database indexing. They also serve as 
occurrence contexts for smaller terms, including 
single-word terms. If two terms tend to be modified 
with a number of common modifiers and otherwise 
appear in few distinct contexts, we assign them a 
similarity coefficient, a real number between 0 and 1. 
The similarity is determined by comparing distribu- 
tion characteristics for both terms within the corpus: 
how much information contents do they carry, do 
their information contribution over contexts vary
greatly, are the common contexts in which these 
terms occur specific enough? In general we will 
credit high-contents terms appearing in identical con-
texts, especially if these contexts are not too com-
monplace. 8 The relative similarity between two
words x_1 and x_2 can be obtained using the following
formula (α is a large constant): 9
$$SIM(x_1, x_2) = \log\Big(\alpha \sum_{y} sim_y(x_1, x_2)\Big)$$
where
$$sim_y(x_1, x_2) = MIN\big(IC(x_1, [x_1, y]),\ IC(x_2, [x_2, y])\big) \cdot MIN\big(IC(y, [x_1, y]),\ IC(y, [x_2, y])\big)$$
and IC is the Information Contribution measure indi-
cating the strength of word pairings, and defined as
$$IC(x, [x, y]) = \frac{f_{x,y}}{n_x + d_x - 1}$$
where f_{x,y} is the absolute frequency of pair [x,y] in
the corpus, n_x is the frequency of term x at the head
position, and d_x is a dispersion parameter understood
as the number of distinct syntactic contexts in which
term x is found. The similarity function is further
normalized with respect to SIM(x_1, x_1). Example
similarities are listed in Table 1.
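A minimal sketch of this computation is given below. It makes several assumptions that the text does not settle: pairs are stored as (head, modifier) tuples, the counts n and d for a modifier are taken over its modifier-position occurrences, α defaults to 1000, and SIM is set to 0 when two terms share no context. The name `make_sim` and the toy frequencies are hypothetical.

```python
import math
from collections import Counter

def make_sim(pair_counts, alpha=1000.0):
    """Build a normalized SIM function from {(head, modifier): frequency}."""
    n_head, d_head = Counter(), Counter()
    n_mod, d_mod = Counter(), Counter()
    for (head, mod), freq in pair_counts.items():
        n_head[head] += freq   # frequency of the term in head position
        d_head[head] += 1      # number of distinct contexts of the head
        n_mod[mod] += freq     # assumed analogue for the modifier position
        d_mod[mod] += 1

    def ic_head(x, y):
        return pair_counts.get((x, y), 0) / (n_head[x] + d_head[x] - 1)

    def ic_mod(x, y):
        return pair_counts.get((x, y), 0) / (n_mod[y] + d_mod[y] - 1)

    def raw(x1, x2):
        shared = ({m for h, m in pair_counts if h == x1}
                  & {m for h, m in pair_counts if h == x2})
        total = sum(min(ic_head(x1, y), ic_head(x2, y))
                    * min(ic_mod(x1, y), ic_mod(x2, y)) for y in shared)
        return math.log(alpha * total) if total > 0 else 0.0

    def sim(x1, x2):
        return raw(x1, x2) / raw(x1, x1)  # normalize by SIM(x1, x1)

    return sim

# Toy pair frequencies (assumed data).
sim = make_sim({
    ("man", "young"): 4, ("boy", "young"): 3,
    ("man", "old"): 2, ("boy", "old"): 1, ("bank", "old"): 1,
})
```

On these toy counts, man comes out closer to boy, with which it shares two modifiers, than to bank, with which it shares only one.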
We also considered a term clustering option 
which, unlike the similarity formula above, produces
clusters of related words and phrases, but will not
generate uniform term similarity ranking across clus-
ters. We used a variant of weighted Tanimoto's
measure described in (Grefenstette, 1992):
$$SIM(x_1, x_2) = \frac{\sum_{att} MIN\big(W([x_1, att]),\ W([x_2, att])\big)}{\sum_{att} MAX\big(W([x_1, att]),\ W([x_2, att])\big)}$$
with
$$W([x, y]) = GW(x) \cdot \log(f_{x,y})$$
$$GW(x) = 1 - \frac{n_x}{\log(N)}$$
8 It would not be appropriate to predict similarity between
language and logarithm on the basis of their co-occurrence with
natural.
9 This was inspired by a formula used by Hindle (1990).
Sample clusters obtained from approx. 100 MByte 
(17 million words) sample of WSJ are given in Table 
2. 
In order to generate better similarities and clus-
ters, we require that words x_1 and x_2 appear in at
least M distinct common contexts, where a common
context is a couple of pairs [x_1,y] and [x_2,y], or
[y,x_1] and [y,x_2] such that they each occurred at least
twice. Thus, banana and Baltic will not be con- 
sidered for similarity relation on the basis of their 
occurrences in the common context of republic, no 
matter how frequent, unless there is another such 
common context comparably frequent (there wasn't 
any in TREC WSJ database). For smaller or narrow 
domain databases M=2 is usually sufficient. For large 
databases covering rather diverse subject matter, like 
TIPSTER or even WSJ, we used M ≥ 3. 10
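The common-context requirement can be sketched as a simple filter. The function names and the toy frequencies below are hypothetical; pairs are assumed to be stored as (head, modifier) tuples with their corpus frequencies.

```python
def common_contexts(x1, x2, pair_counts, min_freq=2):
    """Contexts shared by x1 and x2, as ('mod', y) or ('head', y) markers."""
    def contexts(x):
        found = set()
        for (head, mod), freq in pair_counts.items():
            if freq < min_freq:
                continue  # each pair must occur at least twice
            if head == x:
                found.add(("mod", mod))    # pair [x, y]
            if mod == x:
                found.add(("head", head))  # pair [y, x]
        return found
    return contexts(x1) & contexts(x2)

def eligible(x1, x2, pair_counts, m=3):
    """True if x1 and x2 share at least m distinct common contexts."""
    return len(common_contexts(x1, x2, pair_counts)) >= m

# Toy frequencies (assumed data) echoing the banana / Baltic example.
pairs = {
    ("republic", "banana"): 5, ("republic", "baltic"): 9,
    ("plant", "banana"): 2, ("plant", "dominican"): 2,
    ("republic", "dominican"): 7,
}
```

With these counts, banana and baltic share only the context republic and are rejected even at M=2, while banana and dominican pass at M=2 but not at M=3.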
It may be worth pointing out that the similari- 
ties are calculated using term co-occurrences in syn- 
tactic rather than in document-size contexts, the latter 
being the usual practice in non-linguistic clustering 
(e.g., Sparck Jones and Barber, 1971; Crouch, 1988; 
Lewis and Croft, 1990). Although the two methods of 
term clustering may be considered mutually comple- 
mentary in certain situations, we believe that more 
and stronger associations can be obtained through 
syntactic-context clustering, given sufficient amount 
of data and a reasonably accurate syntactic parser. 11
QUERY EXPANSION 
Similarity relations are used to expand user 
queries with new terms, in an attempt to make the 
final search query more comprehensive (adding 
synonyms) and/or more pointed (adding specializa-
tions). 12 It follows that not all similarity relations will
be equally useful in query expansion; for instance,
complementary and antonymous relations like the 
10 For example, banana and Dominican were found to have
two common contexts: republic and plant, although this second oc-
curs in apparently different senses in Dominican plant and bana-
na plant.
11 Non-syntactic contexts cross sentence boundaries with no
fuss, which is helpful with short, succinct documents (such as
CACM abstracts), but less so with longer texts; see also (Grishman
et al., 1986).
12 Query expansion (in the sense considered here, though not
quite in the same way) has been used in information retrieval
research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988),
usually with mixed results. An alternative is to use term clusters to
create new terms, "metaterms", and use them to index the database
instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that
the query expansion approach gives the system more flexibility, for
instance, by making room for hypertext-style topic exploration via
user feedback.
one between Australian and Canadian, or accept and 
reject may actually harm the system's performance,
since we may end up retrieving many irrelevant 
documents. Similarly, the effectiveness of a query 
containing vitamin is likely to diminish if we add a 
similar but far more general term such as acid. On 
the other hand, database search is likely to miss 
relevant documents if we overlook the fact that for- 
tran is a programming language, or that infant is a 
baby and baby is a child. We noted that an average 
set of similarities generated from a text corpus con- 
tains about as many "good" relations (synonymy, 
specialization) as "bad" relations (antonymy, comple-
mentation, generalization), as seen from the query
expansion viewpoint. Therefore any attempt to 
separate these two classes and to increase the propor- 
tion of "good" relations should result in improved 
retrieval. This has indeed been confirmed in our 
experiments where a relatively crude filter has visibly 
increased retrieval precision. 
In order to create an appropriate filter, we dev- 
ised a global term specificity measure (GTS) which is 
calculated for each term across all contexts in which 
it occurs. The general philosophy here is that a more 
specific word/phrase would have a more limited use, 
i.e., a more specific term would appear in fewer distinct
contexts. In this respect, GTS is similar to the
standard inverse document frequency (idf) measure
except that term frequency is measured over syntactic
units rather than document size units. 13 Terms with
higher GTS values are generally considered more 
specific, but the specificity comparison is only mean- 
ingful for terms which are already known to be simi- 
lar. The new function is calculated according to the 
following formula: 
$$
GTS(w) =
\begin{cases}
IC_L(w) \cdot IC_R(w) & \text{if both exist} \\
IC_R(w) & \text{if only } IC_R(w) \text{ exists} \\
IC_L(w) & \text{otherwise}
\end{cases}
$$

where (with $n_w, d_w > 0$):

$$
IC_L(w) = IC([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}
$$

$$
IC_R(w) = IC([\_,w]) = \frac{n_w}{d_w (n_w + d_w - 1)}
$$

For any two terms $w_1$ and $w_2$, and a constant $\delta > 1$,
if $GTS(w_2) > \delta \cdot GTS(w_1)$ then $w_2$ is considered
more specific than $w_1$. In addition, if

$$
SIM_{norm}(w_1, w_2) = \sigma > \theta,
$$

where $\theta$ is an empirically established threshold, then
$w_2$ can be added to the query containing term $w_1$
with weight $\sigma$. 14 For example, the following were
obtained from TREC WSJ training database:
GTS (child) = 0.000001
GTS (baby) = 0.000013
GTS (infant) = 0.000055
with
SIM (child,infant) = 0.131381
SIM (baby,child) = 0.183064
SIM (baby,infant) = 0.323121
Therefore both baby and infant can be used to specialize
child. With this filter, the relationship between
baby and infant had to be discarded, as we are unable
to tell synonymous or near synonymous relationships
from those which are primarily complementary, e.g.,
man and woman.

13 We believe that measuring term specificity over
document-size contexts (e.g., Sparck Jones, 1972) may not be
appropriate in this case. In particular, syntax-based contexts
allow for processing texts without any internal document
structure.
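The specificity filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual code: the function names `ic`, `gts`, and `expansions` are hypothetical, GTS is assumed to be the product of the left and right IC values when both exist, and the settings `delta = 10` and `theta = 0.1` are illustrative choices only. The GTS and SIM values are the ones quoted above for child, baby, and infant.

```python
from typing import Dict, List, Optional, Tuple


def ic(n_w: int, d_w: int) -> float:
    """IC for one context side: n_w / (d_w * (n_w + d_w - 1)), with n_w, d_w > 0."""
    if n_w <= 0 or d_w <= 0:
        raise ValueError("n_w and d_w must be positive")
    return n_w / (d_w * (n_w + d_w - 1))


def gts(ic_left: Optional[float], ic_right: Optional[float]) -> float:
    """Global term specificity (assumed: product of both sides when both exist)."""
    if ic_left is not None and ic_right is not None:
        return ic_left * ic_right
    if ic_right is not None:
        return ic_right
    if ic_left is not None:
        return ic_left
    raise ValueError("term has no left or right contexts")


def expansions(
    term: str,
    gts_values: Dict[str, float],
    sims: Dict[Tuple[str, str], float],
    delta: float = 10.0,   # illustrative; must be > 1
    theta: float = 0.1,    # illustrative similarity threshold
) -> List[Tuple[str, float]]:
    """Return (candidate, weight) pairs that may be added to a query containing `term`.

    A candidate is kept only if it is similar enough (sigma > theta) and
    more specific than `term` (GTS(candidate) > delta * GTS(term)).
    """
    kept = []
    for (a, b), sigma in sims.items():
        if term not in (a, b):
            continue
        other = b if a == term else a
        if sigma > theta and gts_values[other] > delta * gts_values[term]:
            kept.append((other, sigma))
    # Highest-weight candidates first.
    return sorted(kept, key=lambda pair: -pair[1])


# Values quoted in the text for the TREC WSJ training database.
GTS_VALUES = {"child": 0.000001, "baby": 0.000013, "infant": 0.000055}
SIMS = {
    ("child", "infant"): 0.131381,
    ("baby", "child"): 0.183064,
    ("baby", "infant"): 0.323121,
}

if __name__ == "__main__":
    for t in ("child", "baby", "infant"):
        print(t, expansions(t, GTS_VALUES, SIMS))
```

With these settings child is expanded with baby and infant, while the baby/infant pair is dropped in both directions because neither term is at least `delta` times more specific than the other.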
SUMMARY OF RESULTS 
We have processed a total of 500 MBytes of
articles from the Wall Street Journal section of the TREC
database. Retrieval experiments involved 50 user
information requests (topics) (TREC topics 51-100) 
consisting of several fields that included both text and 
user supplied keywords. A typical topic is shown 
below: 
<top>
<head> Tipster Topic Description
<num> Number: 059
<dom> Domain: Environment
<title> Topic: Weather Related Fatalities
<desc> Description:
Document will report a type of weather event which has
directly caused at least one fatality in some location.
<narr> Narrative:
A relevant document will include the number of people
killed and injured by the weather event, as well as
reporting the type of weather event and the location
of the event.
<con> Concept(s):
1. lightning, avalanche, tornado, typhoon, hurricane,
heat, heat wave, flood, snow, rain, downpour,
blizzard, storm, freezing temperatures
2. dead, killed, fatal, death, fatality, victim
3. NOT man-made disasters, NOT war-induced famine
4. NOT earthquakes, NOT volcanic eruptions
</top>

14 For the CACM-3204 collection the filter was most effective at
$\sigma$ = 0.57. For TREC-1 we changed the similarity formula
slightly in order to obtain correct normalizations in all cases.
This however lowered similarity coefficients in general and a new
threshold had to be selected. We used $\sigma$ = 0.1 in TREC-1
runs, although it turned out to be a poor choice. In all cases
$\delta$ varied between 10 and 100.

Note that this topic actually consists of two different
statements of the same query: the natural language
specification consisting of <desc> and <narr> fields,
and an expert-selected list of key terms which are
often far more informative than the narrative part.
Results obtained for queries using text fields only and 
those involving both text and keyword fields are 
reported separately. Further experiments have sug- 
gested that natural language processing impact is 
significant but may be severely limited by the expres- 
siveness of the term-based representation. Since the 
<con> field is considered the expert-user's rendering 
of the 'optimal' search query, our system is able to
discover much of it from a less complete 
specification in the text section of the request via 
query expansion. In fact, we noted that the 
recall/precision gap between automatically generated 
queries and those supplied by the user was largely 
closed when NLP was used. Moreover, even with the 
keyword field included in the query along with other 
fields, NLP's impact on the system's performance is 
still noticeable. 
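To make the distinction between text-only queries and text-plus-keyword queries concrete, the following Python sketch splits a topic in the format shown above into its fields and assembles the raw query text for either setting. It is only an illustration: `parse_topic` and `build_query_text` are hypothetical helper names, the sample topic is abridged, and the real system applies further processing (tagging, stemming, parsing, expansion) that is not modeled here.

```python
import re
from typing import Dict

# Matches an opening field tag such as <desc> at the start of a line.
FIELD_TAG = re.compile(r"<(\w+)>")

# Abridged version of the sample topic shown in the text.
SAMPLE_TOPIC = """<top>
<head> Tipster Topic Description
<num> Number: 059
<dom> Domain: Environment
<title> Topic: Weather Related Fatalities
<desc> Description:
Document will report a type of weather event which has
directly caused at least one fatality in some location.
<narr> Narrative:
A relevant document will include the number of people
killed and injured by the weather event.
<con> Concept(s):
1. lightning, avalanche, tornado, typhoon, hurricane
2. dead, killed, fatal, death, fatality, victim
</top>"""


def parse_topic(text: str) -> Dict[str, str]:
    """Split a topic into {field name: field content}, dropping the field labels."""
    fields: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line in ("<top>", "</top>"):
            continue
        match = FIELD_TAG.match(line)
        if match:
            current = match.group(1)
            rest = line[match.end():].strip()
            # Drop a leading label such as "Description:" or "Number:".
            if ":" in rest:
                rest = rest.split(":", 1)[1].strip()
            fields[current] = rest
        elif current is not None:
            # Continuation line of the current field.
            fields[current] = (fields[current] + " " + line).strip()
    return fields


def build_query_text(fields: Dict[str, str], use_keywords: bool) -> str:
    """Concatenate <desc> and <narr>, optionally followed by the <con> keywords."""
    names = ["desc", "narr"] + (["con"] if use_keywords else [])
    return " ".join(fields[name] for name in names if name in fields)


if __name__ == "__main__":
    topic = parse_topic(SAMPLE_TOPIC)
    print(build_query_text(topic, use_keywords=False))
    print(build_query_text(topic, use_keywords=True))
```

In the text-only setting the query never sees expert terms such as tornado or hurricane, which is the gap that query expansion is meant to close.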
Other results on the impact of different fields in 
TREC topics on the final recall/precision results were 
reported by Broglio and Croft (1993) at the ARPA 
HLT workshop, although text-only runs were not 
included. One of the most striking observations they 
have made is that the narrative field is entirely 
disposable, and moreover that its inclusion in the 
query actually hurts the system's performance. It has 
to be pointed out, however, that they do little 
language processing. 15 
Summary statistics for these runs are shown in 
Table 4. These results are fairly tentative and should 
be regarded with some caution. For one, the column 
named txt reports performance of <desc> and <narr>
fields which have been processed with our suffix
trimmer. This means some NLP has been done
already (tagging + lexicon), and therefore what we
see there is not the performance of a 'pure' statistical
system. The same applies to the con column. (For
15 Bruce Croft (personal communication, 1992) has suggested
that excluding all expert-made fields (i.e., <con> and <fac>)
would make the queries quite ineffective. Broglio (personal
communication, 1993) confirms this, showing that text-only
retrieval (i.e., with <desc> and <narr>) shows an average
precision at more than 30% below that of <con>-based retrieval.
word1           word2               SIMnorm
abm             *anti+ballistic     0.534894
absence         *maternity          0.233082
accept          acquire             0.179078
accord          pact                0.492332
acquire         purchase            0.449362
speech          address             0.263789
adjustable      one+year            0.824053
maxsaver        *advance+purchase   0.734008
affair          scandal             0.684877
affordable      low+income          0.181795
disease         *ailment            0.247382
medium+range    *air+to+air         0.874508
aircraft        *jetliner           0.166777
aircraft        plane               0.423831
airline         carrier             0.345490
alien           immigrate           0.270412
anniversary     *bicentennial       0.588210
anti+age        anti+wrinkle        0.153918
anti+clot       cholesterol+lower   0.856712
contra          *anti+sandinista    0.294677
candidate       *aspirant           0.116025
contend         *aspirant           0.143459
property        asset               0.285299
attempt         bid                 0.641592
await           pend                0.572960
stealth         *b+1                0.877582
child           *baby               0.183064
baggage         luggage             0.607333
ban             restrict            0.321943
bearish         bullish             0.847103
bee             *honeybee           0.461023
roller+coast    *bumpy              0.898278
two+income      two+earner          0.293104
television      tv                  0.806018
soldier         troop               0.374410
treasury        *short+term         0.661133
research        study               0.209257
withdrawal      *pullout            0.622558

Table 1. Selected filtered word similarities (* indicates
the more specific term).

word          cluster
takeover      merge, buy-out, acquisition, bid
stock         share, issue, bond, price
staff         personnel, employee, force
share         stock, issue, fund
sensitive     crucial, difficult, critical
rumor         speculate
president     director, executive, chairman, manage
outlook       forecast, prospect, trend, picture
law           rule, legislate, bill, regulate
earnings      revenue, income
portfolio     asset, invest, loan, property, hold
inflate       growth, earnings, rise
industry      business, company, market
help          additional, support, involve
growth        increase, rise, gain, decline, earnings, profit
firm          bank, concern, group, unit
environ       climate, condition, situation, trend
debt          loan, secure, bond
custom(er)    client, investor, buyer, consume(r)
counsel       attorney
compute       machine, software
competitor    rival, partner, buyer
company       business, firm, bank, market, industry, concern
big           large, major, huge
base          facile, source, reserve, support
asset         property, loan, fund, invest, share, stock, money

Table 2. Selected clusters obtained from approx. 10^7
words of text with weighted Tanimoto formula.

comparison, see Table 3, where runs with the CACM-3204
collection included a 'pure' statistics run (base),
and note the impact our suffix trimmer is having.)
Nonetheless, one may notice that automated NLP can 
be very effective at discovering the right query from 
an imprecise narrative specification: as much as 82% 
of the effectiveness of the expert-generated query can 
be attained. 
CONCLUSIONS 
We presented in some detail a natural language 
information retrieval system consisting of an 
advanced NLP module and a 'pure' statistical core 
engine. While many problems remain to be resolved, 
including the question of adequacy of term-based 
representation of document contents, we attempted to 
demonstrate that the architecture described here is 
nonetheless viable. In particular, we demonstrated 
that natural language processing can now be done on 
a fairly large scale and that its speed and robustness 
can match those of traditional statistical programs 
such as key-word indexing or statistical phrase 
extraction. We suggest, with some caution until more 
experiments are run, that natural language processing 
can be very effective in creating appropriate search 
queries out of users' initial specifications, which are
frequently imprecise or vague.
On the other hand, we must be aware of the 
limits of NLP technologies at our disposal. While 
part-of-speech tagging, lexicon-based stemming, and 
parsing can be done on large amounts of text (hun- 
dreds of millions of words and more), other, more 
advanced processing involving conceptual structuring,
logical forms, etc., is still beyond reach, computationally.
It may be assumed that these super-advanced techniques
will prove even more effective, since they address the
problem of representation-level limits; however, the
experimental evidence is sparse and necessarily limited
to rather small scale tests (e.g., Mauldin, 1991).
ACKNOWLEDGEMENTS 
We would like to thank Donna Harman of 
NIST for making her PRISE system available to us. 
We would also like to thank Ralph Weischedel and 
Heidi Fox of BBN for providing and assisting in the 
use of the part of speech tagger. Jose Perez Carballo 
has contributed a number of valuable observations 
during the course of this work, and his assistance in 
processing the TREC data was critical. This paper is 
based upon work supported by the Defense
Advanced Research Projects Agency under Contract
N00014-90-J-1851 from the Office of Naval
Research, under Contract N00600-88-D-3717 from
PRC Inc., and the National Science Foundation under
Grant IRI-89-02304. We also acknowledge support
from the Canadian Institute for Robotics and Intelligent
Systems (IRIS).

REFERENCES 
Broglio, John and W. Bruce Croft. 1993. "Query
Processing for Retrieval from Large Text Bases." 
Proceedings of ARPA HLT Workshop. March 
21-24, Plainsboro, NJ. 
Church, Kenneth Ward and Hanks, Patrick. 1990. 
"Word association norms, mutual information, 
and lexicography." Computational Linguistics,
16(1), MIT Press. pp. 22-29. 
Crouch, Carolyn J. 1988. "A cluster-based approach 
to thesaurus construction." Proceedings of ACM 
SIGIR-88, pp. 309-320. 
Grefenstette, Gregory. 1992. "Use of Syntactic Con- 
text To Produce Term Association Lists for Text 
Retrieval." Proceedings of SIGIR-92. 
Copenhagen, Denmark. pp. 89-97. 
Grishman, Ralph, Lynette Hirschman, and Ngo T.
Nhan. 1986. "Discovery procedures for sublanguage
selectional patterns: initial experiments."
Computational Linguistics, 12(3), pp. 205-215.
Grishman, Ralph and Tomek Strzalkowski. 1991. 
"Information Retrieval and Natural Language 
Processing." Position paper at the workshop on 
Future Directions in Natural Language Processing 
in Information Retrieval, Chicago. 
Harman, Donna. 1988. "Towards interactive query 
expansion." Proceedings of ACM SIGIR-88, pp. 
321-331. 
Harman, Donna and Gerald Candela. 1989.
"Retrieving Records from a Gigabyte of text on a
Minicomputer Using Statistical Ranking." Journal
of the American Society for Information Science,
41(8), pp. 581-589.
Hindle, Donald. 1990. "Noun classification from
predicate-argument structures." Proc. 28th Meeting
of the ACL, Pittsburgh, PA, pp. 268-275.
Lewis, David D. and W. Bruce Croft. 1990. "Term
Clustering of Syntactic Phrases." Proceedings of
ACM SIGIR-90, pp. 385-405.
Mauldin, Michael. 1991. "Retrieval Performance in
Ferret: A Conceptual Information Retrieval Sys- 
tem." Proceedings of ACM SIGIR-91, pp. 347- 
355. 
Meteer, Marie, Richard Schwartz, and Ralph 
Weischedel. 1991. "Studies in Part of Speech 
Labeling." Proceedings of the 4th DARPA 
Speech and Natural Language Workshop. 
Morgan-Kaufman, San Mateo, CA. pp. 331-336. 
Sager, Naomi. 1981. Natural Language Information 
Processing. Addison-Wesley. 
Sparck Jones, Karen. 1972. "Statistical interpretation
of term specificity and its application in
retrieval." Journal of Documentation, 28(1), pp. 
11-20. 
Sparck Jones, K. and E. O. Barber. 1971. "What 
makes automatic keyword classification effec- 
tive?" Journal of the American Society for Infor- 
mation Science, May-June, pp. 166-175. 
Sparck Jones, K. and J. I. Tait. 1984. "Automatic 
search term variant generation." Journal of 
Documentation, 40(1), pp. 50-66. 
Strzalkowski, Tomek and Barbara Vauthey. 1991. 
"Fast Text Processing for Information 
Retrieval." Proceedings of the 4th DARPA 
Speech and Natural Language Workshop, 
Morgan-Kaufman, pp. 346-351. 
Strzalkowski, Tomek and Barbara Vauthey. 1992. 
"Information Retrieval Using Robust Natural 
Language Processing." Proc. of the 30th ACL 
Meeting, Newark, DE, June-July. pp. 104-111. 
Strzalkowski, Tomek. 1992. "TrP: A Fast and 
Robust Parser for Natural Language." Proceed- 
ings of the 14th International Conference on 
Computational Linguistics (COLING), Nantes, 
France, July 1992. pp. 198-204. 
