BUILDING A LEXICAL DOMAIN MAP FROM TEXT CORPORA 
Tomek Strzalkowski 
Courant Institute of Mathematical Sciences, New York University 
715 Broadway, rm. 704, New York, NY 10003, tomek@cs.nyu.edu 
SUMMARY 
In information retrieval the task is to extract from the database all and only the documents which are relevant to a user query, even when the query and the documents use little common vocabulary. In this paper we discuss the problem of automatic generation of lexical relations between words and phrases from large text corpora and their application to automatic query expansion in information retrieval. Reported here are some preliminary results and observations from the experiments with an 85 million word Wall Street Journal database and a 45 million word San Jose Mercury News database (parts of a 0.5 billion word TIPSTER/TREC database).
INTRODUCTION 
The task of information retrieval is to extract relevant documents from a large collection of documents in response to a user's query. When the documents contain primarily unrestricted text (e.g., newspaper articles, legal documents, etc.) the relevance of a document is established through 'full-text' retrieval. This has usually been accomplished by identifying key terms in the documents (the process known as 'indexing') which could then be matched against terms in queries (Salton, 1989). The effectiveness of any such term-based approach is directly related to the accuracy with which a set of terms represents the content of a document, as well as how well it contrasts a given document with respect to other documents. In other words, we are looking for a representation R such that for any text items D1 and D2, R(D1) = R(D2) iff meaning(D1) = meaning(D2), at an appropriate level of abstraction (which may depend on the types and character of anticipated queries).

For all kinds of terms that can be assigned to the representation of a document, e.g., words, operator-argument pairs, fixed phrases, and proper names, various levels of "regularization" are needed to assure that syntactic or lexical variations of input do not obscure underlying semantic uniformity. Without actually doing semantic analysis, this kind of normalization can be achieved through the following processes:¹

(1) morphological stemming: e.g., retrieving is reduced to retriev;

¹ An alternative, but less efficient method is to generate all variants (lexical, syntactic, etc.) of words/phrases in the queries (Sparck Jones & Tait, 1984).
(2) lexicon-based word normalization: e.g., retrieval is reduced to retrieve;

(3) operator-argument representation of phrases: e.g., information retrieval, retrieving of information, and retrieve relevant information are all assigned the same representation, retrieve+information;

(4) context-based term clustering into synonymy classes and subsumption hierarchies: e.g., takeover is a kind of acquisition (in business), and Fortran is a programming language.
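Steps (1)-(3) can be pictured with a minimal sketch. The suffix rules and lexicon entries below are invented stand-ins (the system's actual morphological stemmer and OALD-derived lexicon are far richer), and step (4) is the corpus-driven clustering developed in the remainder of the paper:

```python
# Illustrative sketch of normalization steps (1)-(3); the suffix list and
# lexicon here are toy stand-ins, not the system's actual resources.

# (1) morphological stemming: crude suffix stripping
def stem(word):
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# (2) lexicon-based normalization: map stems to canonical dictionary forms
LEXICON = {"retriev": "retrieve"}  # hypothetical entry

def normalize(word):
    s = stem(word.lower())
    return LEXICON.get(s, s)

# (3) operator-argument representation: a normalized head+modifier pair
def pair(head, modifier):
    return normalize(head) + "+" + normalize(modifier)

print(pair("retrieving", "information"))  # from "retrieving of information"
print(pair("retrieval", "information"))   # from "information retrieval system"
```

Both calls yield the same representation, retrieve+information, which is the point of the regularization.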
We have established the general architecture of an NLP-IR system that accommodates these considerations. In a general view of this design, depicted schematically below, an advanced NLP module is inserted between the textual input (new documents, user queries) and the database search engine (in our case, NIST's PRISE system).

[schematic diagram: text input → NLP module (TTP parser, term extraction) → terms → database search engine]
This design has already shown some promise in producing significantly better performance than the base statistical system (Strzalkowski, 1993). Its practical significance stems in no small part from the use of a fast and robust parser, TTP,² which can process unrestricted text at speeds below 0.2 sec per sentence. TTP's output is a regularized representation of each sentence which reflects logical predicate-argument structure, e.g., logical subjects and logical objects are identified depending upon the main verb subcategorization frame. For example, the verb abide has, among others, a subcategorization frame in which the object is a prepositional phrase with by, i.e.,

ABIDE: subject NP object PREP by NP

Subcategorization information is read from the on-line Oxford Advanced Learner's Dictionary (OALD) which TTP uses.

² TTP stands for Tagged Text Parser, and it has been described in detail in (Strzalkowski, 1992) and evaluated in (Strzalkowski & Scheyen, 1993).
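The frame lookup can be thought of as a table keyed by verb. The encoding and entries below are hypothetical stand-ins for the OALD data, not its actual format:

```python
# Hypothetical encoding of verb subcategorization frames (the OALD's real
# representation differs); each frame lists the expected complements.
SUBCAT = {
    "abide":    [("subject", "NP"), ("object", ("PREP", "by", "NP"))],
    "retrieve": [("subject", "NP"), ("object", "NP")],
}

def object_frame(verb):
    """Return the object slot of the verb's frame, if any."""
    for role, shape in SUBCAT.get(verb, []):
        if role == "object":
            return shape
    return None

print(object_frame("abide"))  # the by-PP object from the ABIDE frame above
```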
HEAD-MODIFIER STRUCTURES 
TTP parse structures are passed to the phrase extraction module where head+modifier (including predicate+argument) pairs are extracted and collected into occurrence patterns. The following types of head+modifier pairs are extracted:

(1) a head noun and its left adjective or noun adjunct,

(2) a head noun and the head of its right adjunct,

(3) the main verb of a clause and the head of its object phrase.

These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process.³
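The extraction of the three pair types can be sketched as follows, assuming an invented, simplified clause structure in place of TTP's actual output format:

```python
# Sketch: extracting head+modifier pairs of types (1)-(3) from a simplified
# clause representation invented for illustration -- TTP's output is richer.
def extract_pairs(clause):
    """clause = {'verb': v, 'object': np}; np = {'head': n,
    'left_adjuncts': [...], 'right_adjunct_head': n or None}"""
    pairs = []
    np = clause.get("object")
    if np:
        # type (3): main verb + head of its object phrase
        pairs.append((clause["verb"], np["head"]))
        # type (1): head noun + left adjective/noun adjuncts
        for adj in np.get("left_adjuncts", []):
            pairs.append((np["head"], adj))
        # type (2): head noun + head of its right adjunct
        if np.get("right_adjunct_head"):
            pairs.append((np["head"], np["right_adjunct_head"]))
    return pairs

# the "build ... network" clause from the Figure 1 sentence
clause = {"verb": "build",
          "object": {"head": "network",
                     "left_adjuncts": ["seamless", "national", "cellular"],
                     "right_adjunct_head": None}}
print(extract_pairs(clause))
```

The result reproduces four of the six pairs listed in Figure 1 (the remaining two come from the hurt clause).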
Figure 1 shows a TTP parse and the head+modifier pairs extracted. Whenever multiple-noun strings (two nouns plus another noun or adjective) are present, they need to be structurally disambiguated before any pairs can be extracted. This is accomplished using statistically-based preferences, e.g., world+third is preferred to either country+world or country+third when extracted from third world country. If such preferences cannot be computed, all alternatives are discarded to avoid noisy input to clustering programs.
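The preference step might be sketched as below, with invented corpus counts; the system's actual preference computation is not reproduced here:

```python
# Sketch of the statistical preference step: choose the bracketing of a
# three-word nominal whose component pair is better attested in the corpus.
# Counts are invented for illustration.
PAIR_FREQ = {("world", "third"): 120, ("country", "world"): 2,
             ("country", "third"): 3}

def disambiguate(w1, w2, w3, min_ratio=2.0):
    """For 'w1 w2 w3' (e.g. 'third world country'): prefer the left
    bracketing [[w1 w2] w3] when pair w2+w1 clearly outscores the
    alternatives, the right bracketing otherwise; with no clear
    preference, discard all pairs to avoid noisy clustering input."""
    left = PAIR_FREQ.get((w2, w1), 0)            # e.g. world+third
    right = max(PAIR_FREQ.get((w3, w1), 0),      # e.g. country+third
                PAIR_FREQ.get((w3, w2), 0))      # e.g. country+world
    if left >= min_ratio * max(right, 1):
        return [(w2, w1), (w3, w2)]
    if right >= min_ratio * max(left, 1):
        return [(w3, w1), (w3, w2)]
    return []  # no clear preference: discard all alternatives

print(disambiguate("third", "world", "country"))
```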
[San Jose Mercury News 08/30/91 Business Section]
For McCaw, it would have hurt the company's strategy
of building a seamless national cellular network.

[assert,
 [[will_aux], [perf, [have]],
  [verb, [hurt]],
  [subject, [np, [n, it]]],
  [object, [np, [n, strategy], [t_pos, the],
    [poss, [np, [n, company]]],
    [of,
      [[verb, [build]],
       [subject, anyone],
       [object, [np, [n, network], [t_pos, a],
         [adj, [seamless]],
         [adj, [national]],
         [adj, [cellular]]]]]]]],
  [for, [np, [name, [mccaw]]]]]].

EXTRACTED PAIRS:
hurt+strategy       strategy+company
build+network       network+cellular
network+national    network+seamless

Figure 1. Extracting Head+Modifier pairs from parsed sentences.
TERM CORRELATIONS FROM TEXT 
Head-modifier pairs serve as occurrence contexts 
for terms included in them: both single words (as shown 
in Fignre 1) and other pairs (in case of nested pairs, e.g., 
cottntry+\[world+third\]). If two terms tend to be modilicd 
with a number of common modifiers but otherwise appear 
in few distinct contexts, we assign them a simih'uity 
coefficient, a real number between 0 and 1. The similarity 
is determined by comparing distribution characlerislics 
for both terms within the corpus: in general we will credit 
high-content terms appem'ing in multiple identical elm- 
texts, provided that these contexts are not too common- 
place. 4 Figure 2 shows exmnples of terms sharing a 
number of common contexts along with frequencies of 
occurrence in a 250 MByte subset of Wall Street Journal 
database. A head context is when two distinct modifiers 
,are attached to the same head element; a rood context is 
when the s,'une term modilles two distinct heads. 
To compute term similarities we used a variant of the weighted Jaccard's measure described in, e.g., (Grefenstette, 1992):⁵

TERM1   TERM2   COMM CNTXT          FRQ1    FRQ2
vice    deputy  president  (head)   9295    29
                chairman   (head)   1007    146
                director   (head)   6       158
                minister   (head)   37      17
                premier    (head)   7       8
man     boy     story      (head)   9       3
                club       (head)   6       4
                age        (head)   18      3
                mother     (head)   4       5
                bad        (mod)    4       4
                young      (mod)    258     12
                older      (mod)    18      4

Figure 2. Example pairs of related terms.

³ subject+verb pairs are also extracted but these are not used in the lexical clustering procedure described here.

⁴ It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.

⁵ In another series of experiments (Strzalkowski & Vauthey, 1992) we used a Mutual Information based classification formula (e.g., Church and Hanks, 1990; Hindle, 1990), but we found it less effective for diverse databases, such as WSJ.
SIM(x1,x2) = Σ_att MIN(W([x1,att]), W([x2,att])) / Σ_att MAX(W([x1,att]), W([x2,att]))

with

W([x,y]) = GEW(x) * log(f_{x,y})

GEW(x) = 1 + Σ_y [ (f_{x,y} / n_y) * log(f_{x,y} / n_y) / log(N) ]

In the above, f_{x,y} stands for the absolute frequency of pair [x,y] in the corpus, n_y is the frequency of term y, and N is the number of single-word terms.
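A direct transcription of these formulas into code, under the assumption that pair frequencies are given as a (term, context) → count table; the data below are toy values (the vice/deputy counts from Figure 2 with invented context frequencies), so the resulting SIM score is only illustrative:

```python
import math

# Sketch of the weighted Jaccard similarity with GEW entropy weights,
# following the formulas above; `pairs` maps (term, context) -> frequency.
def gew(x, pairs, n, N):
    """Global entropy weight of term x; n[y] is the frequency of
    context y, N the number of single-word terms."""
    total = 0.0
    for (t, y), f in pairs.items():
        if t == x:
            p = f / n[y]
            total += p * math.log(p) / math.log(N)
    return 1 + total

def weight(x, att, pairs, n, N):
    f = pairs.get((x, att), 0)
    # log(f) is 0 at f=1 and undefined at f=0, so skip those cases
    return gew(x, pairs, n, N) * math.log(f) if f > 1 else 0.0

def sim(x1, x2, pairs, n, N):
    atts = {a for (t, a) in pairs if t in (x1, x2)}
    num = sum(min(weight(x1, a, pairs, n, N), weight(x2, a, pairs, n, N)) for a in atts)
    den = sum(max(weight(x1, a, pairs, n, N), weight(x2, a, pairs, n, N)) for a in atts)
    return num / den if den else 0.0

pairs = {("vice", "president"): 9295, ("deputy", "president"): 29,
         ("vice", "chairman"): 1007, ("deputy", "chairman"): 146}
n = {"president": 10000, "chairman": 2000}  # invented context frequencies
print(round(sim("vice", "deputy", pairs, n, 10000), 3))
```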
In order to generate better similarities, we require that words x1 and x2 appear in at least M distinct common contexts, where a common context is a couple of pairs [x1,y] and [x2,y], or [y,x1] and [y,x2], such that they each occurred at least K times. Thus, banana and Baltic will not be considered for a similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there are M-1 other such common contexts comparably frequent (there wasn't any in TREC's WSJ database). For smaller or narrow domain databases M=2 is usually sufficient, e.g., the CACM database of computer science abstracts. For large databases covering a diverse subject matter, like WSJ or SJMN (San Jose Mercury News), we used M ≥ 5.⁶ This, however, turned out not to be sufficient. We would still generate fairly strong similarity links between terms such as aerospace and pharmaceutical where 6 or more common contexts were found, even after a number of common contexts, such as company or market, had already been rejected because they were paired with too many different words, and thus had a dispersion ratio too high. The remaining common contexts are listed in Figure 3, along with their GEW scores, all occurring at the head (left) position of a pair.
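The M/K admission test can be sketched as below (only head-position contexts shown, the mod position is symmetric; counts are invented):

```python
# Sketch of the common-context requirement: two terms are compared only if
# they share at least M contexts, with each pair occurring at least K times.
def common_contexts(x1, x2, pairs, K):
    """Contexts y such that [x1,y] and [x2,y] each occur >= K times
    (head position; the mod position [y,x] would be handled symmetrically)."""
    return [y for (t, y), f in pairs.items()
            if t == x1 and f >= K and pairs.get((x2, y), 0) >= K]

def eligible(x1, x2, pairs, M=5, K=3):
    return len(common_contexts(x1, x2, pairs, K)) >= M

# banana and Baltic share only the single context 'republic' (invented counts)
pairs = {("banana", "republic"): 12, ("Baltic", "republic"): 25}
print(eligible("banana", "Baltic", pairs, M=2, K=3))
```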
CONTEXT    GEW     frequency with
                   aerospace    pharmaceutical
firm       0.58        9            22
industry   0.51       84            56
sector     0.61        5             9
concern    0.50      130           115
analyst    0.62       23             8
division   0.53       36            28
giant      0.62       15            12

Figure 3. Common (head) contexts for aerospace and pharmaceutical.
⁶ For example banana and Dominican were found to have two common contexts: republic and plant, although this second occurred in apparently different senses in Dominican plant and banana plant.
When analyzing Figure 3, we should note that while some of the GEW weights are quite low (GEW takes values between 0 and 1), thus indicating a low importance context, the frequencies with which these contexts occurred with both terms were high and balanced on both sides (e.g., concern), thus adding to the strength of the association. To filter out such cases we established thresholds for admissible values of the GEW factor, and disregarded contexts with entropy weights falling below the threshold. In the most recent experiments with WSJ texts, we found that 0.6 is a good threshold. We also observed that clustering head terms using their modifiers as contexts converges faster and gives generally more reliable links than when mod terms are clustered using heads as context (e.g., in the above example). In our experiment with the WSJ database, we found that an occurrence of a common head context needs to be considered as contributing less to the total context count than an occurrence of a common mod context: we used 0.6 and 1, respectively. Using this formula, the terms man and boy in Figure 2 share 5.4 contexts (4 head contexts and 3 mod contexts).
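The threshold-and-count scheme just described, as a sketch (the per-context GEW values are illustrative):

```python
# Sketch of context filtering and weighted counting: contexts with GEW below
# a threshold are dropped; surviving head contexts count 0.6 toward the
# total while mod contexts count 1.0, as in the text.
def weighted_context_count(contexts, gew_threshold=0.6,
                           head_weight=0.6, mod_weight=1.0):
    """contexts: list of (kind, gew) with kind in {'head', 'mod'}."""
    total = 0.0
    for kind, g in contexts:
        if g < gew_threshold:
            continue  # low entropy weight: context is too commonplace
        total += head_weight if kind == "head" else mod_weight
    return total

# man/boy in Figure 2: 4 head and 3 mod contexts (illustrative GEW values,
# all passing the filter here): 4 * 0.6 + 3 * 1.0 = 5.4
man_boy = [("head", 0.7)] * 4 + [("mod", 0.8)] * 3
print(round(weighted_context_count(man_boy), 1))  # 5.4
```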
Initially, term similarities are organized into clusters around a centroid term. Figure 4 shows the top 10 elements (sorted by similarity value) of the cluster for president. Note that in this case the SIM value drops suddenly after the second element of the cluster. Changes in SIM value are used to determine cut-off points for clusters. The role of the GTS factor will be explained later. Sample clusters obtained from an approx. 250 MByte (42 million words) subset of WSJ (years 1990-1992) are given in Table 1.
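One plausible reading of the cut-off heuristic, sketched with the president cluster values from Figure 4; the drop-ratio parameter is an assumption, since the text does not give the exact rule:

```python
# Sketch of cluster cut-off detection: scan elements sorted by descending
# similarity and cut where SIM drops sharply. The drop ratio is an assumed
# parameter, not the paper's exact criterion.
def cut_cluster(elements, max_drop=0.7):
    """elements: list of (term, sim) sorted by descending sim."""
    kept = [elements[0]]
    for prev, cur in zip(elements, elements[1:]):
        if cur[1] < max_drop * prev[1]:
            break  # sudden drop in SIM: cut the cluster here
        kept.append(cur)
    return kept

# top of the president cluster (SIM values from Figure 4)
president = [("director", 0.2481), ("chairman", 0.2449), ("office", 0.1689),
             ("manage", 0.1656), ("executive", 0.1626)]
print(cut_cluster(president))
```

With this reading, the sudden drop after the second element leaves director and chairman, matching the cluster reported in Table 1.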
It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990). Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser.⁷

CENTROID     TERM        SIM      GTS
president                         0.0011
             director    0.2481   0.0017
             chairman    0.2449   0.0028
             office      0.1689   0.0010
             manage      0.1656   0.0007
             executive   0.1626   0.0012
             official    0.1612   0.0008
             head        0.1564   0.0018
             member      0.1506   0.0014
             lead        0.1311   0.0009

Figure 4. A cluster for president.

word          cluster
takeover      merge, buy-out, acquire, bid
benefit       compensate, aid, expense
capital       cash, fund, money
staff         personnel, employee, force
attract       hire, draw, woo
sensitive     crucial, difficult, critical
speculate     rumor, uncertainty, tension
president     director, chairman
vice          deputy
outlook       forecast, prospect, trend
law           rule, policy, legislate, bill
earnings      profit, revenue, income
portfolio     asset, invest, loan
inflate       growth, demand, earnings
industry      business, company, market
growth        increase, rise, gain
firm          bank, concern, group, unit
environ       climate, condition, situation
debt          loan, secure, bond
lawyer        attorney
counsel       attorney, administrator, secretary
compute       machine, software, equipment
competitor    rival, competition, buyer
alliance      partnership, venture, consortium
big           large, major, huge, significant
fight         battle, attack, war, challenge
base          facile, source, reserve, support
shareholder   creditor, customer, client, investor, stockholder

Table 1. Selected clusters obtained from syntactic contexts, derived from approx. 40 million words of WSJ text, with the weighted Jaccard formula.

⁷ Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts; see also (Grishman et al., 1986).
QUERY EXPANSION 

Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations). It follows that not all similarity relations will be equally useful in query expansion. For instance, complementary and antonymous relations like the one between Australian and Canadian, accept and reject, or even generalizations like from aerospace to industry may actually harm the system's performance, since we may end up retrieving many irrelevant documents. On the other hand, database search is likely to miss relevant documents if we overlook the fact that vice director can also be deputy director, or that takeover can also be merge, buy-out, or acquisition. We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our experiments, where a relatively crude filter has visibly increased retrieval precision.
In order to create an appropriate filter, we devised a global term specificity measure (GTS) which is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts. In this respect, GTS is similar to the standard inverted document frequency (idf) measure except that term frequency is measured over syntactic units rather than document size units. Terms with higher GTS values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.
The new function is calculated according to the following formula:

GTS(w) = IC_L(w) * IC_R(w)   if both exist
       = IC_R(w)             if only IC_R(w) exists
       = IC_L(w)             otherwise

where (with n_w, d_w > 0):

IC_L(w) = IC([w,_]) = n_w / (d_w * (n_w + d_w - 1))

IC_R(w) = IC([_,w]) = n_w / (d_w * (n_w + d_w - 1))

In the above, d_w is the dispersion of term w, understood as the number of distinct contexts in which w is found (with n_w and d_w computed separately for w in the left and right pair positions). For any two terms w1 and w2, and a constant δ1 > 1, if GTS(w2) ≥ δ1 * GTS(w1) then w2 is considered more specific than w1. In addition, if SIM_norm(w1,w2) = σ > θ1, where θ1 is an empirically established threshold, then w2 can be added to the query containing term w1 with weight σ*ω,⁸ where ω is the weight w2 would have if it were present in the query. Similarly, if GTS(w2) < δ2 * GTS(w1) and SIM_norm(w1,w2) = σ > θ2 (with δ2 < δ1 and θ1 < θ2) then we may consider w2 as synonymous to w1. All other relations are discarded. For example, the following were obtained from the WSJ training database:
GTS(takeover) = 0.00145576
GTS(merge)    = 0.00094518
GTS(buy-out)  = 0.00272580
GTS(acquire)  = 0.00057906

with

SIM(takeover,merge)   = 0.190444
SIM(takeover,buy-out) = 0.157410
SIM(takeover,acquire) = 0.139497
SIM(merge,buy-out)    = 0.133800
SIM(merge,acquire)    = 0.263772
SIM(buy-out,acquire)  = 0.109106
Therefore both takeover and buy-out can be used to specialize merge or acquire. With this filter, the relationships between takeover and buy-out and between merge and acquire are either both discarded or accepted as synonymous. At this time we are unable to tell synonymous or near-synonymous relationships from those which are primarily complementary, e.g., man and woman.
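The filter logic can be sketched as follows. The δ and θ values here are illustrative placeholders (footnote 8 gives the TREC-2 settings), chosen only so that the WSJ values above classify as the text describes:

```python
# Sketch of the GTS-based filter: classify a similarity link as
# specialization, synonymy, or discard. The delta/theta values below are
# illustrative, not the system's actual settings.
def classify(gts1, gts2, sim_norm, delta1=2.0, delta2=1.5,
             theta1=0.1, theta2=0.15):
    """Relation of w2 to w1, given GTS values and normalized similarity."""
    if gts2 >= delta1 * gts1 and sim_norm > theta1:
        return "w2 specializes w1"
    if gts2 < delta2 * gts1 and sim_norm > theta2:
        return "synonymous"
    return "discard"

# WSJ training values quoted in the text:
print(classify(0.00094518, 0.00272580, 0.133800))  # buy-out vs merge
print(classify(0.00057906, 0.00145576, 0.139497))  # takeover vs acquire
```

With these placeholder thresholds, buy-out comes out as a specialization of merge and takeover as a specialization of acquire, in line with the discussion above.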
Filtered similarity relations create a domain map of terms. At present it may contain only two types of links: equivalence (synonymy and near-synonymy) and subsumption (specification). Figure 5 shows a small fragment of such a map derived from lexical relations computed from the WSJ database. The domain map is used to expand user queries with related terms, either automatically or in a feedback mode by showing the user appropriate parts of the map.
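Expansion over the domain map might look like the sketch below, with a hypothetical one-entry map; the σ values are the similarity scores computed earlier, and the σ*ω weighting follows the formula in the filter section:

```python
# Sketch of query expansion over the domain map: each query term pulls in
# its equivalent and more-specific neighbors, weighted by sigma * omega
# (omega = the weight the new term would have had in the query).
DOMAIN_MAP = {  # hypothetical filtered relations: term -> [(related, sigma, link)]
    "takeover": [("buy-out", 0.157, "subsume"), ("merge", 0.190, "equiv")],
}

def expand_query(query_weights, domain_map):
    expanded = dict(query_weights)
    for term, omega in query_weights.items():
        for related, sigma, _link in domain_map.get(term, []):
            if related not in expanded:
                expanded[related] = sigma * omega
    return expanded

print(expand_query({"takeover": 1.0}, DOMAIN_MAP))
```

Original query terms keep weight 1.0, while expansion terms enter with a reduced "confidence level" weight, as in the appendix query.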
[diagram omitted: a small network of terms, including cost, number, case, expense, investigate, charge, allege and warrant, connected by subsumption and equivalence links]

Figure 5. A fragment of the domain map network. Note the emerging senses of 'charge' as 'expense' and 'allege'.
⁸ For TREC-2 we used θ = 0.2; δ varied between 10 and 100.
We should add that query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. The main difference between the current approach and those previous attempts is that we use lexico-semantic evidence for selecting extra terms, while they relied on term co-occurrence within the same documents. In fact we consider these two methods complementary, with the latter being more appropriate for automatic relevance feedback. An alternative to query expansion is to use term clusters to create new terms, "metaterms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.
CONCLUSIONS 
We discussed selected aspects of our information retrieval system consisting of an advanced NLP module and a 'standard' statistical core engine. In this paper we concentrated on the problem of automatic generation of lexical correlations among terms which (along with an appropriate weighting scheme) represent the content of both the database documents and the user queries. Since a successful retrieval relies on actual term matches between the queries and the documents, it is essential that any lexical alternatives of describing a given topic are taken into account. In our system this is achieved through the expansion of the user's queries with related terms: we add equivalent and more specific terms. Lexical relations between terms are calculated directly from the database and stored in the form of a domain map, which thus acts as a domain-specific thesaurus. Query expansion can be done in the user-feedback mode (with the user's assistance) or automatically. In the latter case, local context is explored to assure meaningful expansions, i.e., to prevent, e.g., expanding 'charge' with 'expense' when 'allege' or 'blame' is meant, as in the following example query:
Documents will report on corruption, incompetence, or inefficiency in the management of the United Nations staff. Allegations of management failings, as well as rebuttals to such charges are relevant.
Many problems remain; however, we attempted to demonstrate that the architecture described here is nonetheless viable and has practical significance. More advanced NLP techniques (including semantic analysis) may prove to be still more effective in the future; however, their enormous cost limits any experimental evidence to small scale tests (e.g., Mauldin, 1991).
ACKNOWLEDGEMENTS 
We would like to thank Donna Harman of NIST for making her PRISE system available to us. We would also like to thank Ralph Weischedel and Heidi Fox of BBN for providing and assisting in the use of the part of speech tagger. This paper is based upon work supported by the Advanced Research Projects Agency under Contract N00014-90-J-1851 from the Office of Naval Research, under Contract N00600-88-D-3717 from PRC Inc., and the National Science Foundation under Grant IRI-93-02615. We also acknowledge support from the Canadian Institute for Robotics and Intelligent Systems (IRIS).
REFERENCES 
Church, Kenneth Ward and Hanks, Patrick. 1990. "Word association norms, mutual information, and lexicography." Computational Linguistics, 16(1), MIT Press, pp. 22-29.

Crouch, Carolyn J. 1988. "A cluster-based approach to thesaurus construction." Proceedings of ACM SIGIR-88, pp. 309-320.

Grefenstette, Gregory. 1992. "Use of Syntactic Context To Produce Term Association Lists for Text Retrieval." Proceedings of SIGIR-92, Copenhagen, Denmark, pp. 89-97.

Grishman, Ralph, Lynette Hirschman, and Ngo T. Nhan. 1986. "Discovery procedures for sublanguage selectional patterns: initial experiments." Computational Linguistics, 12(3), pp. 205-215.

Harman, Donna. 1988. "Towards interactive query expansion." Proceedings of ACM SIGIR-88, pp. 321-331.

Hindle, Donald. 1990. "Noun classification from predicate-argument structures." Proc. 28th Meeting of the ACL, Pittsburgh, PA, pp. 268-275.

Lewis, David D. and W. Bruce Croft. 1990. "Term Clustering of Syntactic Phrases." Proceedings of ACM SIGIR-90, pp. 385-405.

Mauldin, Michael. 1991. "Retrieval Performance in Ferret: A Conceptual Information Retrieval System." Proceedings of ACM SIGIR-91, pp. 347-355.

Salton, Gerard. 1989. Automatic Text Processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley, Reading, MA.

Sparck Jones, Karen. 1972. "Statistical interpretation of term specificity and its application in retrieval." Journal of Documentation, 28(1), pp. 11-20.

Sparck Jones, K. and E. O. Barber. 1971. "What makes automatic keyword classification effective?" Journal of the American Society for Information Science, May-June, pp. 166-175.

Sparck Jones, K. and J. I. Tait. 1984. "Automatic search term variant generation." Journal of Documentation, 40(1), pp. 50-66.

Strzalkowski, Tomek and Barbara Vauthey. 1992. "Information Retrieval Using Robust Natural Language Processing." Proc. of the 30th ACL Meeting, Newark, DE, June-July, pp. 104-111.

Strzalkowski, Tomek. 1992. "TTP: A Fast and Robust Parser for Natural Language." Proceedings of the 14th International Conference on Computational Linguistics (COLING), Nantes, France, July 1992, pp. 198-204.

Strzalkowski, Tomek. 1993. "Robust Text Processing in Automated Information Retrieval." Proc. of ACL-sponsored workshop on Very Large Corpora, Ohio State Univ., Columbus, June 22.

Strzalkowski, Tomek. 1994. "Document Representation in Natural Language Text Retrieval." To appear in proceedings of ARPA Human Language Technology Workshop, Princeton, NJ, March 8-11.

Strzalkowski, Tomek and Jose Perez-Carballo. 1994. "Recent Developments in Natural Language Text Retrieval." To appear in proceedings of Second Text Retrieval Conference (TREC-2), Gaithersburg, MD, August 30 - September 1, 1993.

Strzalkowski, Tomek, and Peter Scheyen. 1993. "Evaluation of TTP Parser: a preliminary report." Proceedings of International Workshop on Parsing Technologies (IWPT-93), Tilburg, Netherlands and Durbuy, Belgium, August 10-13.
APPENDIX: An example query 

The following is an example information request (based on TREC's topic 113) and the resulting query. Except for its inverted document frequency score, each term has a "confidence level" weight which is set to 1.0 if the term is found in the user's query, and is less than 1.0 if the term is added through an expansion from the domain map. Only non-negated terms with idf of 6.0 or greater are included.
<title> New Space Satellite Applications

<desc> Document will report on non-traditional applications of space satellite technology.

<narr> A relevant document will discuss more recent or emerging applications of space satellite technology. NOT relevant are such "traditional" or early satellite age usages as INTELSAT transmission of voice and data communications for telephone companies or program feeds for established television networks. Also NOT relevant are such established uses of satellites as military communications, earth mineral resource mapping, and support of weather forecasting. A few examples of newer applications are the building of private satellite networks for transfer of business data, facsimile transmission of newspapers to be printed in multiple locations, and direct broadcasting of TV signals. The underlying purpose of this topic is to collect information on recent or emerging trends in the application of space satellite technology.
TERM                    IDF        WEIGHT
apply+equip             18.402237  0.458666
satellite+latest        18.402237  0.254058
television+signal       18.402237  0.359777
television+direct       18.402237  0.359777
broadcast+direct        16.402237  1.000000
location+multiple       16.402237  1.000000
broadcast+signal        16.080309  1.000000
support+forecast        15.817275  1.000000
data+business           15.817275  1.000000
forecast+internal       15.402238  0.283029
transfer+inform         15.232312  0.511940
transfer+data           14.817275  1.000000
figure+business         14.594883  0.453631
technology+satellite    14.495347  1.000000
transmit+facsimile      14.402238  1.000000
equip+satellite         14.232312  0.458666
signal+broadcast        13.701797  0.441993
signal+tv               13.701797  1.000000
signal+television       13.594883  0.813987
news+business           13.495347  0.352291
network+satellite       13.154310  1.000000
develop+network         12.942806  0.409144
non+traditional         12.758382  1.000000
inform+business         12.729813  0.511940
apply+technology        12.471500  1.000000
build+network           11.212413  1.000000
facsimile               10.217362  1.000000
usage                    9.902391  1.000000
newer                    9.306841  1.000000
elderly                  8.202565  0.361246
feed                     7.802325  1.000000
satellite                7.567767  1.000000
underly                  7.370192  1.000000
transmit                 7.299606  1.000000
multiple                 7.241736  1.000000
broadcast                7.019614  1.000000
location                 6.992316  1.000000
print                    6.351709  1.000000
space                    6.226376  1.000000
transfer                 6.155497  1.000000
collect                  6.126113  1.000000
signal                   6.080873  1.000000
phone                    6.072441  0.663414
tv                       6.003761  1.000000
