A Self-Learning Universal Concept Spotter 
Tomek Strzalkowski and Jin Wang 
GE Corporate Research and Development
P.O. Box 8
Schenectady, NY 12301
USA
{strzalkowski, wangj}@crd.ge.com
Abstract 
We describe the Universal Spotter, a system for identifying in-text
references to entities of an arbitrary, user-specified type, such as
people, organizations, equipment, products, materials, etc. Starting
with some initial seed examples, and a training text corpus, the system
generates rules that will find further concepts of the same type. The
initial seed information is provided by the user in the form of a
typical lexical context in which the entities to be spotted occur,
e.g., "the name ends with Co.", or "to the right of produced or made",
and so forth, or by simply supplying examples of the concept itself,
e.g., Ford Taurus, gas turbine, Big Mac. In addition, negative examples
can be supplied, if known. Given a sufficiently large training corpus,
an unsupervised learning process is initiated in which the system will:
(1) find instances of the sought-after concept using the seed-context
information while maximizing recall and precision; (2) find additional
contexts in which these entities occur; and (3) expand the initial
seed-context with selected new contexts to find even more entities.
Preliminary results of creating spotters for organizations and products
are discussed.
1 Introduction 
Identifying concepts in natural language text is an important
information extraction task. Depending upon the current information
needs one may be interested in finding all references to people,
locations, dates, organizations, companies, products, equipment, and so
on. These concepts, along with their classification, can be used to
index any given text for search or categorization purposes, to generate
summaries, or to populate database records. However, automating the
process of concept identification in unformatted text has not been an
easy task. Various single-purpose spotters have been developed for
specific types of concepts, including people names, company names,
location names, dates, etc., but those were usually either hand crafted
for particular applications or domains, or relied heavily on a priori
lexical clues, such as keywords (e.g., 'Co.'), case (e.g., 'John K.
Big'), predictable format (e.g., 123 Maple Street), or a combination
thereof. This makes creation and extension of such spotters an arduous
manual job. Other, less salient entities, such as products, equipment,
foodstuff, or generic references of any kind (e.g., 'a Japanese
automaker') could only be identified if a sufficiently detailed domain
model was available. Domain-model driven extraction was used in
ARPA-sponsored Message Understanding Conferences (MUC); a detailed
overview of current research can be found in the proceedings of MUC-5
(muc5, 1993) and the recently concluded MUC-6, as well as Tipster
Project meetings, or ARPA's Human Language Technology workshops
(tipster1, 1993), (hltw, 1994).
We take a somewhat different approach to identify various types of text
entities, both generic and specific, without a detailed understanding
of the text domain, and relying instead on a combination of shallow
linguistic processing (to identify candidate lexical entities),
statistical knowledge acquisition, unsupervised learning techniques,
and possibly broad (universal but often shallow) knowledge sources,
such as on-line dictionaries (e.g., WordNet, Comlex, OALD, etc.). Our
method moves beyond the traditional name spotters and towards a
universal spotter where the requirements on what to spot can be
specified as input parameters, and a specific-purpose spotter could be
generated automatically. In this paper, we describe a method of
creating spotters for entities of a specified category given only
initial seed examples, and using an unsupervised learning process to
discover rules for finding more instances of the concept. At this time
we place no limit on what kind of things one may want to build a
spotter for, although our experiments thus far concentrated on entities
customarily referred to with noun phrases, e.g., equipment (e.g., "gas
turbine assembly"), tools (e.g., "adjustable wrench"), products (e.g.,
"canned soup", "Arm & Hammer baking soda"), organizations (e.g.,
American Medical Association), locations (e.g., Albany County Airport),
people (e.g., Bill Clinton), and so on. We view the semantic
categorization problem as a case of disambiguation, where for each
lexical entity considered (words, phrases, N-grams), a binary decision
has to be made whether or not it is an instance of the semantic type we
are interested in. The problem of semantic tagging is thus reduced to
the problem of partitioning the space of lexical entities into those
that are used in the desired sense, and those that are not. We should
note here that it is acceptable for homonym entities to have different
classification depending upon the context in which they are used. Just
as the word "bank" can be assigned different senses in different
contexts, so can "Boeing 777 jet" be once a product, and another time
an equipment and not a product, depending upon the context. Other
entities may be less context dependent (e.g., company names) if their
definitions are based on internal context (e.g., "ends with Co.") as
opposed to external context (e.g., "followed by manufactures"), or if
they lack negative contexts.
The user provides the initial information (seed) about what kind of
things he wishes to identify in text. This information should be in a
form of a typical lexical context in which the entities to be spotted
occur, e.g., "the name ends with Co.", or "to the right of produced or
made", or "to the right of maker of", and so forth, or simply by
listing or highlighting a number of examples in text. In addition,
negative examples can be given, if known, to eliminate certain
'obvious' exceptions, e.g., "not to the right of made for", "not
toothbrushes". Given a sufficiently large training corpus, an
unsupervised learning process is initiated in which the system will:
(1) generate initial context rules from the seed examples; (2) find
further instances of the sought-after concept using the initial context
while maximizing recall and precision; (3) find additional contexts in
which these entities occur; and (4) expand the current context rules
based on selected new contexts to find even more entities.
In the rest of the paper we discuss the specifics
of our system. We present and evaluate prelimi- 
nary results of creating spotters for organizations 
and products. 
2 What do you want to find: seed 
selection 
If we want to identify some things in a stream 
of text, we first need to learn how to distinguish 
them from other items. For example, company 
names are usually capitalized and often end with 
'Co.', 'Corp.', 'Inc.' and so forth. Place names, 
such as cities, are normally capitalized, sometimes are followed by a
state abbreviation (as in Albany, NY), and may be preceded by locative
prepositions (e.g., in, at, from, to). Products may have no distinctive
lexical appearance, but they tend to be associated with verbs such as
'produce', 'manufacture', 'make', 'sell', etc., which in turn may
involve a company name. Other concepts, such as equipment or materials,
have few if any obvious associations with the surrounding text, and one
may prefer just to point them out directly to the learning program.
There are texts, e.g., technical manuals, where such specialized
entities occur more often than elsewhere, and it may be advantageous to
use these texts to derive spotters.
The seed can be obtained either by hand tag- 
ging some text or using a naive spotter that has 
high precision but presumably low recall. A naive 
spotter may contain simple contextual rules such 
as those mentioned above, e.g., for organizations: 
a noun phrase ending with "Co." or "Inc."; for products: a noun phrase
following "manufacturer of", "producer of", or "retailer of". When such
a naive spotter is difficult to come by, one may resort to hand
tagging.
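A naive spotter of this kind can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: candidate noun
phrases are assumed to be already tokenized, and the names ORG_ENDINGS,
PRODUCT_CUES, is_org_seed and is_product_seed are hypothetical, not part
of the system described here.

```python
# Hypothetical seed rules modeled on the examples in the text.
ORG_ENDINGS = {"Co.", "Inc."}
PRODUCT_CUES = {("manufacturer", "of"), ("producer", "of"), ("retailer", "of")}


def is_org_seed(phrase):
    """phrase: list of tokens of a candidate noun phrase (internal context)."""
    return bool(phrase) and phrase[-1] in ORG_ENDINGS


def is_product_seed(left_context):
    """left_context: tokens immediately preceding the phrase (external context)."""
    return tuple(t.lower() for t in left_context[-2:]) in PRODUCT_CUES


print(is_org_seed(["Gabelli", "Funds", "Inc."]))
print(is_product_seed(["a", "leading", "manufacturer", "of"]))
```

Rules like these favor precision over recall: they miss names such as
"Thomson S.A.", which is exactly the gap the learning process is meant
to close.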
3 From seeds to spotters 
The seed should identify the sought-after entities with a high
precision (though not necessarily 100%); however, its recall is assumed
to be low, or else we would already have a good spotter. Our task is
now to increase the recall while maintaining (or even increasing, if
possible) the precision. We proceed by examining the lexical context in
which the seed entities occur. In the simplest instance of this process
we consider a context to consist of N words to the left of the seed and
N words to the right of the seed, as well as the words in the seed
itself. Each piece of significant contextual evidence is then weighted
against its distribution in the balance of the training corpus. This in
turn leads to selection of some contexts to serve as indicators of
relevant entities; in other words, they become the initial rules of the
emerging spotter.
As an example, let's consider building a spotter for company names,
starting with seeds as illustrated in the following fragments (with
seed contexts highlighted):
... HENRY KAUFMAN is president
of Henry Kaufman & Co., a ... Gabelli,
chairman of Gabelli Funds Inc.; Claude
N. Rosenberg ... is named president of
Skandinaviska Enskilda Banken ... become
vice chairman of the state-owned
electronics giant Thomson S.A. ... banking
group, said the formal merger of
Skanska Banken into ... water maker
Source Perrier S.A., according to French
stock ...
Having "Co." and "Inc." to pick out "Henry Kaufman & Co." and "Gabelli
Funds Inc." as seeds, we proceed to find new evidence in the training
corpus, using an unsupervised learning process, and discover that
"chairman of" and "president of" are very likely to precede company
names. We expand our initial set of rules, which allows us to spot more
companies:
... HENRY KAUFMAN is president
of Henry Kaufman & Co., a ... Gabelli,
chairman of Gabelli Funds Inc.; Claude
N. Rosenberg ... is named president of
Skandinaviska Enskilda Banken ... become
vice chairman of the state-owned
electronics giant Thomson S.A. ... banking
group, said the formal merger of
Skanska Banken into ... water maker
Source Perrier S.A., according to French
stock ...
This evidence discovery can be repeated in a bootstrapping process by
replacing the initial set of seeds with the new set of entities
obtained from the last iteration. In the above example, we now have
"Skandinaviska Enskilda Banken" and "the state-owned electronics giant
Thomson S.A." in addition to the initial two names. A further iteration
may add "S.A." and "Banken" to the set of contextual rules, and so
forth. In general, entities can be both added and deleted from the
evolving set of examples, depending on how exactly the evidence is
weighted and combined. The details are explained in the following
sections.
4 Text preparation 
In most cases the text needs to be preprocessed to isolate basic
lexical tokens (words, abbreviations, symbols, annotations, etc.), and
structural units (sections, paragraphs, sentences) whenever applicable.
In addition, part-of-speech tagging is usually desirable, in which case
the tagger may need to be re-trained on a text sample to optimize its
performance (Brill, 1992), (Meteer, Schwartz & Weischedel, 1991).
Finally, a limited amount of lexical normalization, or stemming, may be
performed.
The entities we are looking for may be expressed by certain types of
phrases. For example, people names are usually sequences of proper
nouns, while equipment names are contained within noun phrases, e.g.,
'forward looking infrared radar'. We use part of speech information to
delineate those sequences of lexical tokens that are likely to contain
our entities. From then on we restrict any further processing to these
sequences and their contexts.
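As a rough illustration of this delineation step, the sketch below
extracts maximal runs of determiner, adjective and noun tags from
tagged text. It assumes Penn Treebank style tags, and both the tag set
GROUP_TAGS and the function candidate_spans are hypothetical choices
made for the example; the paper does not specify the patterns used.

```python
# Assumed tag inventory: determiners, adjectives and nouns form a noun group.
GROUP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def candidate_spans(tagged):
    """tagged: list of (word, tag) pairs; returns (start, end) token spans."""
    spans, start = [], None
    # A sentinel tag closes a run that reaches the end of the input.
    for i, (_, tag) in enumerate(list(tagged) + [("", "END")]):
        if tag in GROUP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            # Keep the run only if it contains at least one noun.
            if any(t.startswith("NN") for _, t in tagged[start:i]):
                spans.append((start, i))
            start = None
    return spans


sentence = [("boys", "NNS"), ("kicked", "VBD"), ("the", "DT"),
            ("door", "NN"), ("with", "IN"), ("rage", "NN")]
print(candidate_spans(sentence))
```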
These preparatory steps are desirable since they reduce the amount of
noise through which the learning process needs to plow, but they are
not, strictly speaking, necessary. Further experiments are required to
determine the level of preprocessing required to optimize the
performance of the Universal Spotter.
5 Evidence items 
The semantic categorization problem described here displays some
parallels to the word sense disambiguation problem where homonym words
need to be assigned to one of several possible senses (Yarowsky, 1995),
(Gale, Church & Yarowsky, 1992), (Brown, Pietra, Pietra & Mercer,
1991). There are two important differences, however. First, in the
semantic categorization problem, there is at least one open-ended
category serving as a grab bag for all things non-relevant. This
category may be hard, if not impossible, to describe by any finite set
of rules. Second, unlike the word sense disambiguation where the items
to be classified are known a priori, we attempt to accomplish two
things at the same time:
1. discover the items to be considered for categorization;
2. actually decide if an item belongs to a given category, or falls
outside of it.
The categorization of a lexical token as belonging to a given semantic
class is based upon the information provided by the words occurring in
the token itself, as well as the words that precede and follow it in
text. In addition, positional relationships among these words may be of
importance. To capture this information, we define the notion of an
evidence set for a lexical unit $W_1 W_2 \ldots W_m$ (a phrase, or an
N-gram) as follows. Let
$\ldots W_{-n} \ldots W_{-1}\; W_1 \ldots W_m\; W_{+1} W_{+2} \ldots W_{+n} \ldots$
be a string of subsequent tokens (e.g., words) in text, such that
$W_1 W_2 \ldots W_m$ is a unit of interest (e.g., a noun phrase) and
$n$ is the maximum size of the context window on either side of the
unit. The actual window size may be limited by boundaries of structural
units such as sentences or paragraphs. For each unit
$W_1 W_2 \ldots W_m$, a set of evidence items is collected as a set
union of the following four sets:
1. Pairs of (word, position), where position $\in \{p, s, f\}$
indicates whether word is found in the context preceding (p) the
central unit, following (f) it, or whether it comes from the central
unit itself (s).
$$
E_1 = \left\{ \begin{array}{l}
(W_{-n}, p), \ldots, (W_{-2}, p), (W_{-1}, p), \\
(W_1, s), (W_2, s), \ldots, (W_m, s), \\
(W_{+1}, f), (W_{+2}, f), \ldots, (W_{+n}, f)
\end{array} \right\}
$$
2. Pairs of (bi-gram, position) to capture word sequence information.
$$
E_2 = \left\{ \begin{array}{l}
((W_{-n}, W_{-(n-1)}), p), \ldots, ((W_{-2}, W_{-1}), p), \\
((W_1, W_2), s), \ldots, ((W_{m-1}, W_m), s), \\
((W_{+1}, W_{+2}), f), \ldots, ((W_{+(n-1)}, W_{+n}), f)
\end{array} \right\}
$$
3. 3-tuples (word, position, distance), where distance indicates how
far word is located relative to $W_1$ or $W_m$.
$$
E_3 = \left\{ \begin{array}{l}
(W_{-n}, p, n), \ldots, (W_{-1}, p, 1), \\
(W_1, s, m), \ldots, (W_m, s, 1), \\
(W_{+1}, f, 1), \ldots, (W_{+n}, f, n)
\end{array} \right\}
$$
4. 3-tuples (bi-gram, position, distance).
$$
E_4 = \left\{ \begin{array}{l}
((W_{-n}, W_{-(n-1)}), p, n-1), \ldots, ((W_{-2}, W_{-1}), p, 1), \\
((W_1, W_2), s, m-1), \ldots, ((W_{m-1}, W_m), s, 1), \\
((W_{+1}, W_{+2}), f, 1), \ldots, ((W_{+(n-1)}, W_{+n}), f, n-1)
\end{array} \right\}
$$
For example, in the fragment below, the central phrase "the door" has
the context window of size 2:
... boys kicked the door with rage ...
The set of evidence items generated for this fragment, i.e.,
$E_1 \cup E_2 \cup E_3 \cup E_4$, contains the following elements:
(boys, p), (kicked, p), (the, s),
(door, s), (with, f), (rage, f),
((boys, kicked), p), ((the, door), s),
((with, rage), f), (boys, p, 2),
(kicked, p, 1), (the, s, 2), (door, s, 1),
(with, f, 1), (rage, f, 2),
((boys, kicked), p, 1), ((the, door), s, 1),
((with, rage), f, 1)
Items in evidence sets are assigned significance 
weights (SW) to indicate how strongly they point 
towards or against the hypothesis that the central
unit belongs to the semantic category of in-
terest to the spotter. The significance weights are 
acquired through corpus-based training. 
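The construction of the four evidence sets can be sketched as follows.
This is a minimal illustration, not the system's code: the function
name evidence_set and its argument layout are assumptions, and the
distances follow the worked example for "the door" (words in the
central unit are numbered from its right edge).

```python
def evidence_set(tokens, start, end, n):
    """Return E1 | E2 | E3 | E4 for the unit tokens[start:end], window size n."""
    left = tokens[max(0, start - n):start]   # preceding context (p)
    unit = tokens[start:end]                 # the central unit (s)
    right = tokens[end:end + n]              # following context (f)
    items = set()
    for pos, words in (("p", left), ("s", unit), ("f", right)):
        k = len(words)
        for i, w in enumerate(words):
            # "f" distances grow to the right; "p" and "s" grow to the left.
            dist = i + 1 if pos == "f" else k - i
            items.add((w, pos))              # E1
            items.add((w, pos, dist))        # E3
        for i in range(k - 1):
            bigram = (words[i], words[i + 1])
            dist = i + 1 if pos == "f" else k - 1 - i
            items.add((bigram, pos))         # E2
            items.add((bigram, pos, dist))   # E4
    return items


tokens = ["boys", "kicked", "the", "door", "with", "rage"]
for item in sorted(evidence_set(tokens, 2, 4, 2), key=str):
    print(item)
```

Slicing naturally truncates the window at the start or end of the token
list, which loosely mirrors the remark that the actual window may be
limited by sentence or paragraph boundaries.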
6 Training 
Evidence items for all candidate phrases in the training corpus, for
those selected by the initial user-supplied seed, as well as for those
added by a training iteration, are divided into two groups. Group A
items are collected from the candidate phrases that are accepted by the
spotter; group R items come from the candidate phrases that are
rejected. Note that A and R may contain repeated elements.
For each evidence item t, its significance weight 
is computed as: 
$$
SW(t) = \begin{cases}
\dfrac{f(t,A) - f(t,R)}{f(t,A) + f(t,R)} & \text{if } f(t,A) + f(t,R) > s \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$
where f(t, X) is the frequency of t in group X,
and s is a constant used to filter the noise of very 
low frequency items. 
As defined, SW(t) takes values in the interval from -1 to 1. SW(t)
close to 1.0 means that t appears nearly exclusively with the
candidates that have been accepted by the spotter, and thus provides
the strongest positive evidence. Conversely, SW(t) close to -1.0 means
that t is a strong negative indicator since it occurs nearly always
with the rejected candidates. SW(t) close to 0 indicates neutral
evidence, which is of little or no consequence to the spotter. In
general, we take $SW(t) > \epsilon > 0$ as a piece of positive
evidence, and $SW(t) < -\epsilon$ as a piece of negative evidence, as
provided by item t. Weights of evidence items within an evidence set
are then combined to arrive at the compound context weight which is
used to accept or reject a candidate phrase.
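Formula (1) translates almost directly into code. The sketch below
assumes that groups A and R are given as flat sequences of evidence
items with repeats; the function name significance_weights is
hypothetical.

```python
from collections import Counter


def significance_weights(accepted_items, rejected_items, s=0):
    """Compute SW(t) per formula (1); s filters very low frequency items."""
    fa, fr = Counter(accepted_items), Counter(rejected_items)
    weights = {}
    for t in fa.keys() | fr.keys():
        total = fa[t] + fr[t]
        weights[t] = (fa[t] - fr[t]) / total if total > s else 0.0
    return weights


group_a = [("Co.", "s")] * 3 + [("of", "p")] * 2
group_r = [("of", "p")] * 2 + [("the", "p")]
print(significance_weights(group_a, group_r, s=1))
```

In this toy input ("Co.", s) occurs only with accepted candidates and
gets weight 1.0, ("of", p) is evenly split and gets 0.0, and ("the", p)
falls at or below the frequency cutoff s and is also set to 0.0.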
At this time, we make no claim as to whether (1) is an optimal formula
for calculating evidence weights. An alternative method we considered
was to estimate certain conditional probabilities, similarly to the
formula used in (Yarowsky, 1995):
$$
SW(t) = \log \frac{P(p \in A \mid t)}{P(p \in R \mid t)}
\approx \log \frac{f(t,A)\, f(A)}{f(t,R)\, f(R)}
\qquad (2)
$$
Here f(A) is (an estimate of) the probability that any given candidate
phrase will be accepted by the spotter, and f(R) is the probability
that this phrase is rejected, i.e., f(R) = 1 - f(A). Thus far our
experiments show that (1) produces better results than (2). We continue
investigating other weighting schemes as well.
7 Combining evidence weights to 
classify phrases 
In order to classify a candidate phrase, all evidence items need to be
collected from its context and their SW weights are combined. When the
combined weight exceeds a threshold value, the candidate is accepted
and the phrase becomes available for tagging by the spotter. Otherwise,
the candidate is rejected, although it may be reevaluated in a future
iteration.
There are many ways to combine evidence 
weights. In our experiments we tried the following 
two options: 
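$$
x \oplus y = \begin{cases}
x + y - xy & \text{if } x > 0 \text{ and } y > 0 \\
x + y + xy & \text{if } x < 0 \text{ and } y < 0 \\
x + y & \text{otherwise}
\end{cases}
\qquad (3)
$$
and
$$
x \oplus y = \begin{cases}
x & \text{if } \mathrm{abs}(x) > \mathrm{abs}(y) \\
y & \text{otherwise}
\end{cases}
\qquad (4)
$$
In (3), $x \oplus y$ is greater than either x or y when both x and y
are positive, and it is less than both x and y for negative x and y. In
all cases, $x \oplus y$ remains within the $[-1, +1]$ interval.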
In (4) only the dominating evidence is consid- 
ered. This formula is more noise resistant than 
(3), but produces generally less recall. 
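Both combination operators are easy to state in code. The sketch below
is illustrative only: the names combine_sum, combine_max and
context_weight are hypothetical, and folding the weights left to right
starting from 0 is an assumption, since the order of combination is not
specified in the text.

```python
from functools import reduce


def combine_sum(x, y):
    """Formula (3): same-sign evidence reinforces, mixed evidence cancels."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y


def combine_max(x, y):
    """Formula (4): only the dominating evidence is kept."""
    return x if abs(x) > abs(y) else y


def context_weight(weights, op):
    """Fold a sequence of SW values into one compound context weight."""
    return reduce(op, weights, 0.0)


sw_values = [0.5, 0.5, -0.25]
print(context_weight(sw_values, combine_sum))
print(context_weight(sw_values, combine_max))
```

With (3), two moderately positive items add up to a stronger signal
than either alone, whereas (4) ignores everything except the single
strongest item, which is one way to see why it is less sensitive to
noise.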
8 Bootstrapping 
The evidence training and candidate selection cycle forms a
bootstrapping process, as follows:
Procedure Bootstrapping
Collect seeds
loop
Training phase
Tagging phase
until Satisfied.
The bootstrapping process allows for collecting more and new contextual
evidence and increases recall of the spotter. This is possible thanks
to overall redundancy and repetitiveness of information, particularly
local context information, in large bodies of text. For example, in our
three-sectional context representation (preceding, self, following), if
one section contains strong evidence that the candidate phrase is
selectable, evidence found in other sections will be considered in the
next training cycle, in order to select additional candidates.
An important consideration here is to maintain an overall precision
level throughout the entire process. Although it may be possible to
recover from some misclassification errors (e.g., (Yarowsky, 1995)),
care should be taken when adjusting the process parameters so that
precision does not deteriorate too rapidly. For instance, acceptance
thresholds of evidence weights, initially set high, can be gradually
decreased to allow more recall while keeping precision at a reasonable
level. In addition, (Yarowsky, 1995), (Gale, Church & Yarowsky, 1992)
point out that there is a strong tendency for words to occur in one
sense within any given discourse ("one sense per discourse"). The same
seems to apply to concept selection, that is, multiple occurrences of a
candidate phrase within a discourse should all be either accepted or
rejected by the spotter. This in turn allows the bootstrapping process
to gather more contextual evidence more quickly, and thus to converge
faster, producing better results.
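The training and tagging cycle, together with a gradually lowered
acceptance threshold, can be sketched as below. This is a toy
illustration under several assumptions: each candidate is reduced to a
short list of evidence items, weights follow formula (1), evidence is
combined with formula (4), the stopping criterion "until Satisfied" is
replaced by a fixed number of loops, and the "one sense per discourse"
constraint is not modeled. All names, parameter values and the toy data
are hypothetical.

```python
from collections import Counter
from functools import reduce


def train(candidates, accepted, s):
    """Training phase: significance weights per formula (1)."""
    fa, fr = Counter(), Counter()
    for name, items in candidates.items():
        (fa if name in accepted else fr).update(items)
    sw = {}
    for t in fa.keys() | fr.keys():
        total = fa[t] + fr[t]
        sw[t] = (fa[t] - fr[t]) / total if total > s else 0.0
    return sw


def tag(candidates, sw, threshold):
    """Tagging phase: accept candidates whose dominating evidence, formula (4),
    exceeds the threshold."""
    def dominant(x, y):
        return x if abs(x) > abs(y) else y
    return {name for name, items in candidates.items()
            if reduce(dominant, (sw[t] for t in items), 0.0) > threshold}


def bootstrap(candidates, seeds, loops=4, threshold=0.9, step=0.3, s=1):
    accepted = set(seeds)
    for _ in range(loops):
        sw = train(candidates, accepted, s)
        accepted = tag(candidates, sw, threshold)
        threshold = max(threshold - step, 0.0)  # gradually allow more recall
    return accepted


toy = {
    "Kaufman & Co.": [("Co.", "s"), ("president of", "p")],
    "Acme Co.": [("Co.", "s"), ("president of", "p")],
    "Gabelli Funds Inc.": [("Inc.", "s"), ("chairman of", "p")],
    "Widget Inc.": [("Inc.", "s"), ("chairman of", "p")],
    "Enskilda Banken": [("Banken", "s"), ("president of", "p")],
    "Thomson S.A.": [("S.A.", "s"), ("chairman of", "p")],
    "French stock": [("stock", "s"), ("according to", "p")],
    "last year": [("year", "s"), ("according to", "p")],
}
seeds = {name for name in toy if name.endswith(("Co.", "Inc."))}
print(sorted(bootstrap(toy, seeds)))
```

In this toy run the two names without "Co." or "Inc." are picked up
once the threshold has dropped far enough for the "president of" and
"chairman of" contexts to count as positive evidence.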
9 Experiments and Results 
We used the Universal Spotter to find organizations and products in a
7 MByte corpus consisting of articles from the Wall Street Journal.
First, we pre-processed the text with a part-of-speech tagger and
identified all simple noun groups to be used as candidate phrases. 10
articles were set aside and hand tagged as key for evaluation.
Subsequently, seeds were constructed manually in form of contextual
rules. For organizations, these initial rules had a 98% precision and
40% recall; for products, the corresponding numbers were 97% and 42%.
Formula (4) is used to combine evidence. No lexicon verification (see
later) has been used in order to show more clearly the behavior of the
learning method itself (the performance can be enhanced by lexicon
verification). Also note that the quality of the seeds affects the
performance of the final spotter since they define what type of concept
the system is supposed to look for. The seeds that we used in our
experiments are quite simple, perhaps too simple. Better seeds may be
needed (possibly developed through an interaction with the user) to
obtain strong results for some categories of concepts.
Figure 1: Organization spotter results (precision vs. recall, in
percent, plotted for the seeds, the 1st loop, and the 4th loop).
For organization tagging, the recall and precision results obtained
after the first and the fourth bootstrapping cycle are given in
Figure 1. The point with the maximum precision*recall in the fourth run
is 95% precision and 90% recall. Examples of extracted organizations
include: "the State Statistical Institute Istat", "Wertheim Schroder &
Co", "Skandinaviska Enskilda Banken", "Statistics Canada".
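The evaluation against a hand-tagged key, and the choice of the
operating point that maximizes precision*recall, can be sketched as
follows. The function names and the toy scores are hypothetical and are
not the data behind the reported figures.

```python
def precision_recall(predicted, key):
    """predicted, key: sets of phrases; returns (precision, recall)."""
    tp = len(predicted & key)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(key) if key else 0.0
    return precision, recall


def best_operating_point(scored, key, thresholds):
    """scored: dict phrase -> combined context weight.
    Returns (threshold, precision, recall) maximizing precision * recall."""
    best = None
    for th in thresholds:
        predicted = {phrase for phrase, w in scored.items() if w > th}
        p, r = precision_recall(predicted, key)
        if best is None or p * r > best[1] * best[2]:
            best = (th, p, r)
    return best


toy_scores = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2}
toy_key = {"a", "b", "d"}
print(best_operating_point(toy_scores, toy_key, [0.8, 0.5, 0.3, 0.1]))
```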
The results for products tagging are given in Figure 2. Examples of
extracted products include: "the Mercury Grand Marquis and Ford Crown
Victoria cars", "Chevrolet Prizm", "Pump shoe", "MS/doe".
The effect of bootstrapping is clearly visible in both charts: it
improves the recall while maintaining or even improving the precision.
We may also notice that some misclassifications due to an imperfect
seed (e.g., see the first dip in precision on the products chart) can
in fact be corrected in further bootstrapping loops. The generally
lower performance levels for the product spotter are probably due to
the fact that the concept of product is harder to circumscribe.
Figure 2: Product spotter results (precision vs. recall, in percent,
plotted for the 1st loop and the 4th loop).
10 Further options
10.1 Lexicon verification
The items identified in the second step can be further validated for
their broad semantic classification using on-line lexical databases
such as Comlex or Longman Dictionary, or Princeton's WordNet (Miller,
1990). For example, "gas turbine" is an acceptable equipment/machinery
name since 'turbine' is listed as "machine" or "device" in WordNet
hierarchy. More complex validation may involve other words in the
phrase (e.g., "circuit breaker") or words in the immediate context.
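A minimal sketch of such a check is given below. It does not query
WordNet: TOY_HYPERNYMS is a small hypothetical table standing in for
the lexical database, and treating the last word of the phrase as its
head is an assumption that holds only for simple English noun groups.

```python
# Hypothetical stand-in for a lexical database: word -> broader class.
TOY_HYPERNYMS = {
    "turbine": "machine",
    "machine": "device",
    "breaker": "device",
    "soup": "food",
}


def broad_classes(word):
    """Follow the toy hypernym chain upward, guarding against cycles."""
    seen = []
    while word in TOY_HYPERNYMS and TOY_HYPERNYMS[word] not in seen:
        word = TOY_HYPERNYMS[word]
        seen.append(word)
    return seen


def verify(phrase, allowed_classes):
    """Accept the phrase if its head word falls under an allowed class."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(c in allowed_classes for c in broad_classes(words[-1]))


print(verify("gas turbine", {"machine", "device"}))
print(verify("canned soup", {"machine", "device"}))
```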
10.2 Conjunctions 
The current program cannot deal with conjunction. The difficulty with
conjunction is not with classification of the conjoined noun phrases
(it is easier, as a matter of fact, because they carry more evidence)
but with identification of the phrase itself, because of the structural
ambiguities it typically involves, which cannot be dealt with easily on
a lexical or even syntactic level.
11 Conclusions 
In this paper we presented the Universal Spotter, 
a system that learns to spot in-text references to 
instances of a given semantic class: people, organizations, products,
equipment, tools, to name just a few. A specific class spotter is
created through an unsupervised learning process on a text corpus given
only an initial user-supplied seed: either a number of examples of the
concept, or a typical context in which they can be found. The
experiment shows that this method indeed can produce useful spotters
based on easy-to-construct seeds. The results shown here are promising
and can be further improved by using lexicon verification. Different
methods of computing SWs, combining SWs, and adjusting parameters for
the bootstrapping process need to be explored as we believe there is
still room for improvement. The
method is being continuously refined as we gain 
more feedback from empirical tests across several 
different applications. 
We believe that the Universal Spotter can replace much of the need to
create hand-crafted concept spotters commonly used in text extraction
operations. It can also be applied to building other than the most
common spotters such as those for people names, place names, or company
names. In fact, it can be used to create more-or-less on-demand
spotters, depending upon the application and its subject domain. In
particular, we believe such spotters will be required to make further
advances in intelligent text indexing and retrieval applications, text
summarization, and database applications, e.g., (Harman, 1995),
(Strzalkowski, 1995).
References 
hltw. 1994. Proceedings of the Human Language Technology Workshop,
Princeton. San Francisco, CA: Morgan Kaufmann Publishers.
muc5. 1993. Proceedings of 5th Message Understanding Conference,
Baltimore. San Francisco, CA: Morgan Kaufmann Publishers.
tipster1. 1993. Tipster Text Phase 1: 24 month Conference,
Fredericksburg, Virginia.
Brill, E. 1992. A Simple Rule-based Part of Speech Tagger. Proceedings
of 3rd Applied Natural Language Processing, San Francisco, CA: Morgan
Kaufmann Publishers.
Brown, P., S. Pietra, V. Pietra and R. Mercer. 1991. Word Sense
Disambiguation Using Statistical Methods. Proceedings of the 29th
Annual Meeting of the Association for Computational Linguistics, pp.
264-270.
Gale, W., K. Church and D. Yarowsky. 1992. A Method for Disambiguating
Word Senses in a Large Corpus. Computers and the Humanities, 26, pp.
415-439.
Harman, D. 1995. Overview of the Third Text REtrieval Conference.
Overview of the Third Text REtrieval Conference (TREC-3), pp. 1-20.
Meteer, M., R. Schwartz, and R. Weischedel. 1991. Studies in Part of
Speech Labeling. Proceedings of the 4th DARPA Speech and Natural
Language Workshop, Morgan Kaufmann, San Mateo, CA. pp. 331-336.
Miller, G. 1990. WordNet: An On-line Lexical Database. International
Journal of Lexicography, 3, 4.
Strzalkowski, T. 1995. Natural Language Information Retrieval.
Information Processing and Management, vol. 31, no. 3, pp. 397-417.
Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling
Supervised Methods. Proceedings of the 33rd Annual Meeting of the
Association for Computational Linguistics, pp. 189-196.
