Evaluation of an Algorithm for the Recognition and 
Classification of Proper Names 
Takahiro Wakao Robert Gaizauskas Yorick Wilks 
Department of Computer Science, 
University of Sheffield 
{T.Wakao, R.Gaizauskas, Y.Wilks}@dcs.shef.ac.uk
Abstract 
We describe an information extraction system
in which four classes of naming expressions
(organisation, person, location and time names)
are recognised and classified
with nearly 92% combined precision and re- 
call. The system applies a mixture of tech- 
niques to perform this task and these are 
described in detail. We have quantitatively 
evaluated the system against a blind test 
set of Wall Street Journal business articles 
and report results not only for the system as 
a whole, but for each component technique 
and for each class of name. These results 
show that in order to have high recall, the 
system needs to make use not only of in- 
formation internal to the naming expression 
but also information from outside the name.
They also show that the contribution of each
system component varies from one class of
name expression to another. 
1 Introduction 
The appropriate treatment of proper names is es- 
sential in a natural language understanding sys- 
tem which processes unedited newswire text, since 
up to 10 % of this type of text may consist of 
proper names (Coates-Stephens, 1992). Nor is it 
only the sheer volume of names that makes them 
important; for some applications, such as inform- 
ation extraction (IE), robust handling of proper 
names is a prerequisite for successfully performing
other tasks such as template filling where correctly 
identifying the entities which play semantic roles 
in relational frames is crucial. Recent research in 
the fifth and sixth Message Understanding Con- 
ferences (MUC5, 1993) (MUC6, 1995) has shown 
that the recognition and classification of proper 
names in business newswire text can now be done 
on a large scale and with high accuracy: the suc- 
cess rates of the best systems now approach 96%. 
We have developed an IE system, LaSIE
(Large Scale Information Extraction) (Gaizauskas
et al., 1995), which extracts important facts from
business newswire texts. As a key part of the
extraction task, the system recognises and clas- 
sifies certain types of naming expressions, namely 
those specified in the MUC-6 named entity (NE) 
task definition (MUC6, 1995). These include or- 
ganisation, person, and location names, time ex- 
pressions, percentage expressions, and monetary 
amount expressions. As defined for MUC-6, the 
first three of these are proper names, the fourth 
contains some expressions that would be classified
as proper names by linguists and some that would
not, while the last two would generally not be
thought of as proper names. In this paper we
concentrate only on the behaviour of the LaSIE
system with regard to recognising and classifying
expressions in the first four classes, i.e. those
which consist entirely or in part of proper names
(though nothing hangs on omitting the others). The
version of the system reported here achieves almost
92% combined precision and recall scores on this
task against blind test data.
Of course the four name classes mentioned are 
not the only classes of proper names. Brand 
names, book and movie names, and ship names
are just a few further classes one might choose to
identify. One might also want to introduce sub- 
classes within the selected classes. We have not 
done so here for two reasons. First, and foremost, 
in order to generate quantitative evaluation results
we have used the MUC-6 data and scoring resources
and these restrict us to the above proper
name classes. Secondly, these four name classes 
account for the bulk of proper name occurrences 
in business newswire text. Our approach could 
straightforwardly be extended to account for
additional classes of proper names, and the points
we wish to make about the approach can be
adequately presented using only this restricted set.
Our approach to proper name recognition is 
heterogeneous. We take advantage of grapholo- 
gical, syntactic, semantic, world knowledge, and 
discourse level information to perform the task. In 
the paper we present details of the approach,
describing those data and processing components of
the overall IE system which contribute to proper 
name recognition and classification. Since name 
recognition and classification is achieved through 
the activity of four successive components in the 
system, we quantitatively evaluate the successive
contribution of each component in our overall
approach. We perform this analysis not only for
all classes of names, but for each class separately.
The resulting analysis
1. supports McDonald's observation (McDonald,
1993) that external evidence as well as
internal evidence is essential for achieving high
precision and recall in the recognition and
classification task; i.e. not just the name
string itself must be examined, but other
information in the text must be used as well;
2. shows that all components in our heterogeneous
approach contribute significantly;
3. shows that not all classes of proper names
benefit equally from the contributions of the
different components in our system: in particular,
organisation names benefit most from
the use of external evidence.
In the second section an overview of the LaSIE
system is presented. The third section explains
in detail how proper names are recognised and
classified in the system. The results of evaluating
the system on a blind test set of 30 articles are
presented and discussed in section 4. Section 5 
concludes the paper. 
2 LaSIE system overview 
Figure 1: LaSIE System Architecture
LaSIE has been designed as a general purpose IE
research system, initially geared towards, but not
solely restricted to, carrying out the tasks
specified by the sixth Message Understanding
Conference: named entity recognition, coreference
resolution, template element filling, and scenario
template filling tasks (see (MUC6, 1995) for further
details of the task descriptions). In addition, the
system can generate a brief natural language
summary of the scenario it has detected in the text.
All of these tasks are carried out by building a
single rich model of the text, the discourse model,
from which the various results are read off.
The high level structure of LaSIE is illustrated
in Figure 1. The system is a pipelined architecture
which processes a text sentence-at-a-time
and consists of three principal processing stages:
lexical preprocessing, parsing plus semantic
interpretation, and discourse interpretation. The
overall contributions of these stages may be briefly
described as follows:
• lexical preprocessing reads and tokenises
the raw input text, tags the tokens with
parts-of-speech, performs morphological analysis,
performs phrasal matching against lists
of proper names, and builds lexical and
phrasal chart edges in a feature-based
formalism for hand-over to the parser;
• parsing does two pass parsing, pass one with
a special proper name grammar, pass two
with a general grammar and, after selecting a
'best parse', passes on a semantic representation
of the current sentence which includes
name class information;
• discourse interpretation adds the information
in its input semantic representation to a
hierarchically structured semantic net which
encodes the system's world model, adds
additional information presupposed by the
input to the world model, performs coreference
resolution between new instances added and
others already in the world model, and adds
information consequent upon the addition of
the input to the world model.
For further details of the system see (Gaizauskas
et al., 1995).
3 How proper names are 
recognised and classified 
As indicated in section 1, our approach is a
heterogeneous one in which the system makes use
of graphological, syntactic, semantic, world
knowledge, and discourse level information for the
recognition and classification of proper names. The
system utilises both the information which comes
from the name itself (internal evidence in
McDonald's sense (McDonald, 1993)) as well as the
information which comes from outside the name,
from its context in the text (external evidence).
In what follows we describe how proper names
are recognised and classified in LaSIE by
considering the contribution of each system component.
3.1 Lexical preprocessing
The input text is first tokenised and then each
token is tagged with a part-of-speech tag from
the Penn Treebank tagset (Marcus et al., 1993)
using a slightly customised¹ version of Brill's
tagger (Brill, 1994). The tagset contains two tags
for proper nouns: NNP for singular proper nouns
and NNPS for plurals. The tagger tags a word as a
proper noun as follows: if the word is found in the
tagger's lexicon and listed as a proper noun then
tag it as such; otherwise, if the word is not found
in the lexicon and is uppercase initial then tag
it as a proper noun. Thus, capitalised unknown
tokens are tagged as proper nouns by default.
¹ The tagger has been customised by adding some
entries to its lexicon and by adding several special tags
for word classes such as days of the week and months.
Before parsing an attempt is made to identify
proper name phrases (sequences of proper names)
and to classify them. This is done by matching
the input against pre-stored lists of proper names.
These lists are compiled via a flex program into a
finite state recogniser. Each sentence is fed to the
recogniser and all single and multi-word matches
are tagged with special tags which indicate the
name class.
Lists of names used include:
• organisation : about 2600 company and
governmental institution names based on
an organisation name list which was
semi-automatically collected from the MUC-5
answer keys and training corpus (Wall Street
Journal articles);
• location : about 2200 major country,
province, state, and city names derived from
a gazetteer list of about 150,000 place names;
• person : about 500 given names taken from
a list of given names in the Oxford Advanced
Learner's Dictionary (Hornby, 1980);
• company designator : 94 designators (e.g.
'Co.', 'PLC'), based on the company designator
list provided in the MUC6 reference resources;
• human titles : about 160 titles (e.g.
'President', 'Mr.'), manually collected.
As well as name phrase matching, another
technique is applied at this point. Inside multi-word
proper names, certain words may function as
trigger words. A trigger word indicates that the
tokens surrounding it are probably a proper name
and may reliably permit the class or even
subclass² of the proper name to be determined. For
example, 'Wing and Prayer Airlines' is almost
certainly a company, given the presence of the word
'Airlines'. Trigger words are detected by matching
against lists of such words and are then specially
tagged. Subsequently these tags are used by the
proper name parser to build complex proper name
constituents.
² company and governmental institution are
subclasses of the class organisation; airline is a
subclass of company.
The lists of trigger words are:
• Airline company: 3 trigger words for finding
airline company names, e.g. 'Airlines';
• Governmental institutions: 7 trigger words
for governmental institutions, e.g. 'Ministry';
• Location: 8 trigger words for location names,
e.g. 'Gulf';
• Organisation: 135 trigger words for
organisation names, e.g. 'Association'.
These lists of trigger words were produced by
hand, though the organisation trigger word lists
were generated semi-automatically by looking at
organisation names in the MUC-6 training texts
and applying certain heuristics. So, for example,
words were collected which come immediately
before 'of' in those organisation names which
contain 'of', e.g. 'Association' in 'Association of Air
Flight Attendants'; the last words of organisation
names which do not contain 'of' were examined to
find trigger words like 'International'.
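The lexical preprocessing steps described in this section can be illustrated with a minimal Python sketch. It is not the LaSIE implementation: the system uses a customised Brill tagger and a flex-compiled finite state recogniser, whereas the sketch substitutes a toy lexicon and a dictionary-based longest-match lookup. All names (LEXICON, NAME_LISTS, tag_token, match_name_lists, candidate_triggers) and the list entries are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical miniature resources; the lists described above are far larger.
LEXICON: Dict[str, str] = {"the": "DT", "rose": "VBD", "Ford": "NNP"}
NAME_LISTS: Dict[Tuple[str, ...], str] = {
    ("Ford", "Motor"): "LIST_ORGAN_NP",
    ("New", "York"): "LIST_LOC_NP",
    ("Co.",): "CDG_NP",
}


def tag_token(word: str) -> str:
    """Use the lexicon tag if present; unknown capitalised words default to NNP."""
    if word in LEXICON:
        return LEXICON[word]
    if word[:1].isupper():
        return "NNP"
    return "UNKNOWN"  # a real tagger would guess a tag here


def match_name_lists(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, tag) for the longest list entry found at each position."""
    longest = max(len(entry) for entry in NAME_LISTS)
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            tag = NAME_LISTS.get(tuple(tokens[i:i + n]))
            if tag is not None:
                spans.append((i, i + n, tag))
                i += n
                break
        else:
            i += 1  # no list entry starts at this token
    return spans


def candidate_triggers(org_names: List[str]) -> Counter:
    """Count the word before 'of', or else the last word, of each organisation name."""
    counts: Counter = Counter()
    for name in org_names:
        words = name.split()
        if "of" in words:
            position = words.index("of")
            if position > 0:
                counts[words[position - 1]] += 1
        elif words:
            counts[words[-1]] += 1
    return counts
```

Under these assumptions, candidate_triggers applied to 'Association of Air Flight Attendants' proposes 'Association', mirroring the semi-automatic harvesting heuristic; in the system the candidates were then reviewed rather than used directly.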
3.2 Grammar rules for proper names
The LaSIE parser is a simple bottom-up chart
parser implemented in Prolog. The grammars it
processes are unification-style feature-based
context free grammars. During parsing, semantic
representations of constituents are constructed using
Prolog term unification. When parsing ceases, i.e.
when the parser can generate no further edges, a
'best parse selection' algorithm is run on the final
chart to choose a single analysis. The semantics
are then extracted from this analysis and passed
on to the discourse interpreter.
Parsing takes place in two passes, each using 
a separate grammar. In the first pass a special
grammar is used to identify proper names.
These constituents are then treated as unanalys- 
able units during the second pass which employs 
a more general 'sentence' grammar. 
Proper Name Grammar The grammar rules
for proper names constitute a subset of the
system's noun phrase (NP) rules. All the rules were
produced by hand. There are 177 such rules in
total of which 94 are for organisation, 54 for
person, 11 for location, and 18 for time expressions.
Here are some examples of the proper name
grammar rules:
NP --> ORGAN_NP
ORGAN_NP --> LIST_LOC_NP NAMES_NP CDG_NP
ORGAN_NP --> LIST_ORGAN_NP NAMES_NP CDG_NP
ORGAN_NP --> NAMES_NP '&' NAMES_NP
NAMES_NP --> NNP NAMES_NP
NAMES_NP --> NNP PUNC(_) NNP
NAMES_NP --> NNP
The non-terminals LIST_LOC_NP, LIST_ORGAN_NP
and CDG_NP are tags assigned to one or more
input tokens in the name phrase tagging stage of
lexical preprocessing. The non-terminal NNP is the
tag for proper name assigned to a single token by
the Brill tagger.
The rule ORGAN_NP --> NAMES_NP '&' NAMES_NP
means that if an as yet unclassified or ambiguous
proper name (NAMES_NP) is followed by '&' and
another ambiguous proper name, then it is an
organisation name. So, for example, 'Marks &
Spencer' and 'American Telephone & Telegraph' will
be classified as organisation names by this rule.
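To make the effect of the '&' rule concrete, here is a minimal Python sketch. It is a greedy recogniser standing in for the Prolog bottom-up chart parser, not a port of it, and it covers only the rules NAMES_NP --> NNP NAMES_NP, NAMES_NP --> NNP and ORGAN_NP --> NAMES_NP '&' NAMES_NP. The function names and the (word, tag) token representation are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)


def names_np(tokens: List[Tagged], start: int) -> Optional[int]:
    """Return the end index of the run of NNP tokens beginning at start, or None.

    Covers NAMES_NP --> NNP NAMES_NP and NAMES_NP --> NNP.
    """
    end = start
    while end < len(tokens) and tokens[end][1] == "NNP":
        end += 1
    return end if end > start else None


def organ_np_ampersand(tokens: List[Tagged], start: int) -> Optional[int]:
    """Return the end index of an ORGAN_NP --> NAMES_NP '&' NAMES_NP match, or None."""
    left_end = names_np(tokens, start)
    if left_end is None or left_end >= len(tokens) or tokens[left_end][0] != "&":
        return None
    return names_np(tokens, left_end + 1)
```

For the tagged sequence Marks/NNP &/CC Spencer/NNP said/VBD, organ_np_ampersand returns 3, so the first three tokens form an organisation name. A chart parser would instead keep every competing edge and leave the choice to best parse selection.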
Nearly half of the proper name rules are for
organisation names because they may contain further
proper names (e.g. person or location names)
as well as normal nouns, and their combinations.
There are also a good number of rules for person
names since care must be taken with given names,
family names, titles (e.g. 'Mr.', 'President'), and
special lexical items such as 'de' (as in 'J. Ignacio
Lopez de Arriortua') and 'Jr.', 'II', etc.
There are fewer rules for location names, as they
are identified mainly in the previous preprocessing
stage by look-up in the mini-gazetteer.
Sentence Grammar Rules The grammar
used for parsing at the sentence level contains
approximately 110 rules and was derived automatically
from the Penn TreeBank-II (PTB-II) (Marcus
et al., 1993), (Marcus et al., 1995). When parsing
for a sentence is complete the resultant chart is
analysed to identify the 'best parse'. From the
best parse the associated semantics are extracted
to be passed on to the discourse interpreter.
Rules for compositionally constructing semantic
representations were assigned by hand to the
grammar rules. For simple verbs and nouns the
morphological root is used as a predicate name
in the semantics, and tense and number features
are translated directly into the semantic
representation where appropriate. For named entities
a token of the most specific type possible
(e.g. company or perhaps only object) is created
and a name attribute is associated with the
entity, the attribute's value being the surface
string form of the name. So, for example,
assuming 'Ford Motor Co.' has already been
classified as a company name, its semantic
representation will be something like company(e23) &
name(e23,'Ford Motor Co.').
3.3 Discourse interpretation
The discourse interpreter module performs two
activities that contribute to proper name
classification (no further recognition of proper names
goes on at this point, only a refining of their
classification). The first activity is coreference
resolution: an unclassified name may be
coreferred with a previously classified one, by virtue of
which the class of the unclassified name becomes
known. The second activity, which is arguably
not properly 'discourse interpretation' but
nevertheless takes place in this module, is to perform
inferences about the semantic types of arguments in
certain relations; for example, in compound
nominals such as 'Erikson stocks' our semantic
interpreter will tell us that there is a qualifier relation
between 'Erikson' and 'stocks' and since the
system stores the fact that named entities qualifying
things of type stock are of type company it can
classify the proper name 'Erikson' as a company.
Note that both of these techniques make use of
external evidence, i.e. rely on information
supplied by the context beyond the words in the
instance of the proper name being classified.
3.3.1 Proper name coreference
Coreference resolution for proper names is
carried out in order to recognise alternative forms,
especially of organisation names. For example,
'Ford Motor Co.' might be used in a text when
the company is first mentioned, but subsequent
references are likely to be to 'Ford'. Similarly,
'Creative Artists Agency' might be abbreviated to
'CAA' later on in the same text. Such shortened
forms must be resolved as names of the same
organisation.
In order to determine whether two given proper
names match, various heuristics are used. For
example, given two names, Name1 and Name2:
• if Name2 consists of an initial subsequence
of the words in Name1 then Name2 matches
Name1, e.g. 'American Airlines Co.' and
'American Airlines';
• if Name1 is a person name and Name2 is
either the first, the family, or both names of
Name1, then Name2 matches Name1, e.g.
'John J. Major Jr.' and 'John Major'.
There are 31 such heuristic rules for matching
organisation names, 11 heuristics for person
names, and 3 rules for location names.
When an unclassified proper noun is matched
with a previously classified proper name in the
text, it is marked as a proper name of the class
of the known proper name. Thus, when we know
'Ford Motor Co.' is an organisation name but
have not classified 'Ford' in the same text,
coreference resolution determines 'Ford' to be an
organisation name.
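The two example heuristics and the class propagation step can be sketched in Python as follows. This is only an illustration of the two rules quoted above, not the system's full set of 31, 11 and 3 heuristics; it does not handle acronyms such as 'CAA', and the suffix list, the treatment of middle initials, and all function names are assumptions.

```python
from typing import Dict, List, Optional

NAME_SUFFIXES = {"Jr.", "Sr.", "II", "III"}  # hypothetical, not the system's list


def is_initial_subsequence(name2: str, name1: str) -> bool:
    """True if the words of name2 are the leading words of name1."""
    short, full = name2.split(), name1.split()
    return 0 < len(short) <= len(full) and full[:len(short)] == short


def _core_person_words(name: str) -> List[str]:
    """Drop suffixes such as 'Jr.' and middle initials such as 'J.'."""
    words = [w for w in name.split() if w not in NAME_SUFFIXES]
    return [
        w for i, w in enumerate(words)
        if not (0 < i < len(words) - 1 and len(w) == 2 and w.endswith("."))
    ]


def person_match(name2: str, name1: str) -> bool:
    """True if name2 is the first name, the family name, or both, of name1."""
    core = _core_person_words(name1)
    if not core:
        return False
    first, family = core[0], core[-1]
    return name2 in {first, family, f"{first} {family}"}


def propagate_class(unclassified: str, classified: Dict[str, str]) -> Optional[str]:
    """Return the class of the first previously classified name that matches."""
    for known_name, name_class in classified.items():
        if name_class == "person":
            matched = person_match(unclassified, known_name)
        else:
            matched = is_initial_subsequence(unclassified, known_name)
        if matched:
            return name_class
    return None
```

With {'Ford Motor Co.': 'organisation'} as the already classified names, propagate_class('Ford', ...) yields 'organisation', which is the behaviour described for the 'Ford' example.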
3.3.2 Semantic Type Inference
In the following contexts, semantic type
information about the types of arguments in certain
relations is used to drive inferences permitting the
classification of proper names. The system uses
these techniques in a fairly limited and
experimental way at present, and there is much room
for their extension.
• noun-noun qualification: when an unclassified
proper name qualifies an organisation-related
thing then the name is classified as
an organisation; e.g. in 'Erickson stocks'
since 'stock' is semantically typed as an
organisation-related thing, 'Erickson' gets
classified as an organisation name.
• possessives: when an unclassified proper
name stands in a possessive relation to an
organisation post, then the name is classified as
an organisation; e.g. 'vice president of ABC',
'ABC's vice president'.
• apposition: when an unclassified proper
name is apposed with a known location name,
the former name is also classified as a
location; e.g. given 'Fort Lauderdale, Fla.' if we
know 'Fla.' is a location name, 'Fort
Lauderdale' is also classified as a location name.
• verbal arguments: when an unclassified
proper name names an entity playing a role
in a verbal frame where the semantic type
of the argument position is known, then the
name is classified accordingly; e.g. in 'Smith
retired from his position as ...' we can infer
that 'Smith' is a person name since the
semantic type of the logical subject of 'retire'
(in this sense) is person.
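The four inference contexts above can be read as a small rule table, sketched below in Python. In LaSIE this knowledge lives in the semantic net that encodes the world model; the flat dictionaries, the relation labels and the function name infer_class are hypothetical simplifications introduced only for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical semantic types for the words used in the examples above.
SEMANTIC_TYPE: Dict[str, str] = {
    "stocks": "organisation_related_thing",
    "vice president": "organisation_post",
    "Fla.": "location",
}

# Hypothetical type of the logical subject for each verb frame.
VERB_SUBJECT_TYPE: Dict[str, str] = {"retire": "person"}

# (relation, semantic type of the other argument) -> class for the proper name.
RELATION_RULES: Dict[Tuple[str, str], str] = {
    ("qualifies", "organisation_related_thing"): "organisation",
    ("possesses", "organisation_post"): "organisation",
    ("apposed_with", "location"): "location",
}


def infer_class(relation: str, other: str) -> Optional[str]:
    """Infer the class of an unclassified name from its relation to `other`."""
    if relation == "logical_subject_of":
        return VERB_SUBJECT_TYPE.get(other)
    other_type = SEMANTIC_TYPE.get(other)
    if other_type is None:
        return None
    return RELATION_RULES.get((relation, other_type))
```

For instance, infer_class('qualifies', 'stocks') returns 'organisation', corresponding to the 'Erickson stocks' example, while an argument of unknown type leaves the name unclassified.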
4 Results and Evaluation 
After these processing stages, the results gener- 
ator produces a version of the original text in 
which all the proper names which have been detec- 
ted are marked up with pre-defined SGML tags, 
specifying their classes. These marked up texts 
are then automatically scored against manually 
marked up texts. 
A series of evaluations has been done on the
system using a blind test set consisting of 30 Wall
Street Journal texts. In these texts there are 449
organisation names, 373 person names, 110
location names and 111 time expressions in total.
The overall precision and recall scores for the four
classes of proper names are shown in Table 1.
Proper Name Class   Recall   Precision
Organisation        91 %     91 %
Person              90 %     95 %
Location            88 %     89 %
Time                94 %     97 %
Overall             91 %     93 %
Table 1: Overall Precision and Recall Scores
4.1 System module contribution 
We have analysed the results in terms of how
much each module of the system contributes to
the proper name task.
Table 2 illustrates the contribution of each
system module to the task for all classes of proper
names. In addition to recall and precision scores,
we have added Van Rijsbergen's F-measure which
combines these scores into a single measure
(Van Rijsbergen, 1979). The F-measure (also called P&R)
allows the differential weighting of precision and
recall. With precision and recall weighted equally
it is computed by the formula:
$$F = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}$$
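As a worked illustration, the equally weighted formula is the beta = 1 case of the general weighted F-measure, F = (beta^2 + 1) x Precision x Recall / (beta^2 x Precision + Recall). The Python sketch below computes it; the function name f_measure and the beta parameter are illustrative choices rather than part of the scoring software used in the evaluation.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted F-measure; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was found
    beta_sq = beta * beta
    return (beta_sq + 1) * precision * recall / (beta_sq * precision + recall)
```

Applied to the rounded overall figures in Table 1 (precision 93, recall 91), f_measure gives about 91.99, in line with the combined score of nearly 92% quoted earlier. Scores computed from rounded percentages can differ slightly from those computed from raw counts.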
There are four different settings of the system.
• setting 1 : Only the lexical preprocessing
techniques are used: part-of-speech tagging
and name phrase matching.
• setting 2 : Two-stage parsing is added to 1.
• setting 3 : Coreference resolution for proper
names is added to 2.
• setting 4 : Full discourse interpretation is
added to 3. This is the full-fledged system.
Setting   Recall        Precision     P&R
1         49            89            [illegible]
2         [illegible]   94            [illegible]
3         [illegible]   94            [illegible]
4         91            93            [illegible]
Table 2: Module Contribution Scores
Table 2 shows that we can attain reasonable
results using tagging, exact phrase matching,
trigger word detection, and parsing (setting 2). Note
that this amounts to making use of only internal 
evidence. However, to achieve higher recall, we 
need coreference resolution for proper names (set- 
ting 3) and other context information (setting 4). 
4.2 Different classes of proper names 
We have also examined how the contribution of 
each component varies from one class of proper 
name to another.
For organisation names, using the same settings 
as above, scores are shown in Table 3. 
Setting Recall Precision P&R 
1 46 87 59.91 
2 65 92 76.15 
3 87 93 89.84 
4 91 91 91.13 
Table 3: Module Contributions for Org Names 
For person names, location names and time ex- 
pressions the results are shown in Tables 4-6. 
Setting Recall Precision P&R
1 47 88 61.64
2 89 95 92.34
3 90 95 92.14
4 90 95 92.14
Table 4: Module Contributions for Person Names
Figure 2 shows graphically how the system 
components contribute for each of the four dif- 
ferent classes of proper names as well as for all 
classes combined. 
5 Conclusion 
We have described an IE system in which four 
classes of naming expressions (organisation, per- 
son, and location names and time expressions) are 
recognised and classified. The system was tested 
on 30 unseen Wall Street Journal texts and the
results were analysed in terms of major system
components and different classes of referring
expression.
Setting   Recall        Precision     P&R
1         [illegible]   [illegible]   [illegible]
2         [illegible]   [illegible]   [illegible]
3         [illegible]   [illegible]   [illegible]
4         88            89            88.58
Table 5: Module Contributions for Location Names
Setting   Recall   Precision   P&R
1         32       100         48.97
2         94       97          [illegible]
3         94       97          [illegible]
4         94       97          95.41
Table 6: Module Contributions for Time Expressions
Tables 3-6 and Figure 2 enable us to make
the following observations:
1. Techniques relying on internal evidence only
(exact word and phrase matching, graphological
conventions, and parsing) are not
sufficient to recognise and classify organisation
names. It is clear that in order to have high
recall for organisation names, we need to be
able to make good use of external evidence
as well, i.e. proper name coreference
resolution and information from the surrounding
context.
2. On the other hand, for person and location
names and time expressions, techniques
relying solely on internal evidence do permit us
to attain high recall whilst maintaining high
precision. Thus, the contribution of different
system components varies from one class of
proper name to another.
3. However, given that in a reasonable sample
of business newswire text, 43 % of the proper
names are organisation names, it is evident
that for a system to achieve high overall
precision and recall in the name recognition and
classification task on this text type, it must
utilise not only internal evidence but also
external evidence.
More generally, two conclusions can be drawn.
First, the results presented above suggest that
when a system for proper name recognition and/or
classification is evaluated, much benefit can be
gained by analysing it not only in terms of
overall recall and precision figures, but also in terms of
system components and classes of names. Second,
a heterogeneous approach to the recognition and
classification of proper names in newswire text
such as described here is appropriate since it
provides mechanisms that can utilise the variety of
internal and external evidence which is available
and which needs to be taken into account.
6 Acknowledgements 
This research has been made possible by the
grants from the U.K. Department of Trade and
Industry (Grant Ref. YAE/8/5/1002) and the
Engineering and Physical Science Research Council
(Grant # GR/K25267).
Figure 2: Contribution of System Components.
F-measures (P&R) plotted against settings 1 to 4
(list lookup; list + grammar; list + grammar +
name coreference; full system) for Overall,
Organisation, Person, Location and Time.
References 
Brill, E. (1994). "Some Advances in Transformation-
Based Part of Speech Tagging." In Proc. AAAI.
Coates-Stephens, S. (1992). The Analysis and
Acquisition of Proper Names for Robust Text
Understanding. PhD thesis, Department of Computer
Science, City University, London.
Gaizauskas, R.; Wakao, T.; Humphreys, K.;
Cunningham, H. and Wilks, Y. (1995). "University of
Sheffield: Description of LaSIE system as used for
MUC-6." In Proceedings of the Sixth Message
Understanding Conference (MUC-6), Morgan Kaufmann.
Hornby, A.S. (ed.). (1980). Oxford Advanced
Learner's Dictionary of Current English. London:
Oxford University Press.
Marcus, M.; Santorini, B. and Marcinkiewicz, M.A.
(1993). "Building a Large Annotated Corpus of
English: The Penn Treebank." Computational
Linguistics, 19 (2): 313-330.
Marcus, M.; Kim, G.; Marcinkiewicz, M.A.;
MacIntyre, R.; Bies, A.; Ferguson, M.; Katz, K.
and Schasberger, B. (1995). "The Penn Treebank:
Annotating Predicate Argument Structure."
Distributed on The Penn Treebank Release 2 CD-ROM
by the Linguistic Data Consortium.
McDonald, D.D. (1993). "Internal and External
Evidence in the Identification and Semantic
Categorisation of Proper Names." In Proceedings
of SIGLEX workshop on "Acquisition of Lexical
Knowledge from Text", pp. 32-43.
MUC-5. (1993). Proceedings of the Fifth Message
Understanding Conference (MUC-5). Morgan
Kaufmann.
MUC-6. (1995). Proceedings of the Sixth Message
Understanding Conference (MUC-6). Morgan
Kaufmann.
Van Rijsbergen, C.J. (1979). Information Retrieval.
London: Butterworths.
