Learning to Recognize Names Across Languages 
Anthony F. Gallippi 
University of Southern California 
University Park, EEB 234 
Los Angeles, CA 90089 
USA 
gallippi @ aludra.usc.edu 
Abstract 
The development of natural language pro- 
ccssing (NLP) systems that perform ma- 
chine translation (MT) and information re- 
trieval (IR) has highlighted the need for the 
automatic recognition of proper names. 
While various name recognizers have been 
developed, they suffer from being too lim- 
ited; some only recognize one name class, 
and all are language specific. This work 
develops an approach to multilingual name 
recognition that allows a system optimized 
for one language to be ported to another 
with little additional effort and resources. 
An initial core set of linguistic features, 
useful for name recognition in most lan- 
guages, is identified. When porting to a 
new language, these features need to be 
converted (partly by hand, partly by on-line 
lists), after which point machine learning 
(ML) techniques build decision trees that 
map features to name classes. A system 
initially optimized for English has been 
successfully ported to Spanish and Japa- 
nese. Only a few days of human effort for 
each new language results in performance 
levels comparable to that of the best cur- 
rent English systems. 
1 Introduction 
Proper names represent a unique challenge for MT 
and IR systems. They are not found in dictionaries, 
are very large in number, come and go every day, and 
appear in many alias forms. For these reasons, list 
based matching schemes do not achieve desired 
performance levels. Hand coded heuristics can be 
developed to achieve high accuracy, however this 
approach lacks portability. Much human effort is 
needed to port the system to a new domain. 
A desirable approach is one that maximizes reuse 
and minimizes human effort. This paper presents an 
approach to proper name recognition that uses ma- 
chine learning and a language independent fi'ame- 
work. Knowledge incorporated into the framework is 
based on a set of measurable linguistic characteris- 
tics, or ,features. Some of this knowledge is constant 
across languages. The rest can be generated auto- 
matically through machine learning techniques. 
The problem being considered is that of segment- 
ing natural language text into lexical units, and of 
tagging those units with various syntactic and se- 
mantic features. A lexical unit may be a word (e.g., 
"started") or a phrase (e.g., "The Washington Post"). 
The particular lexical units of interest here are proper 
names. Segmenting and tagging proper names is 
very important for natural language processing, par- 
ticularly IR and MT. 
Whether a phrase is a proper name, and what type 
of proper name it is (company name, location name, 
person name, date, other) depends on (1) the internal 
structure of the phrase, and (2) the surrounding con- 
text. 
Internal: "Mr. Brandon" 
Context: "The new company, Safetek, will make 
air bags." 
The person title "Mr." reliably shows "Mr. Brandon" 
to be a person name. "Safetek" can be recognized as 
a company name by utilizing the preceding contex- 
tual phrase and appositive "The new company,". 
The recognition task can be broken down into de- 
limitation and classification. Delimitation is the de- 
termination of the boundaries of the proper name, 
while classification serves to provide a more specific 
category. 
Original: John Smith , chairman of Safetek , an- 
nounced his resignation yesterday. 
Delimit: <PN> John Smith </PN> , chairman of 
<PN> Safetek </PN>, announced his resig- 
nation yesterday. 
Classify: <person> John Smith </person> , chairman 
of <company> Safetek </company> , an- 
nounced his resignation yesterday. 
424 
During the delimit step, the boundarics of all proper 
names are identified. Next, the delimited proper 
names are classified into more specific categodcs. 
How can a system developed in one language be 
ported to another language with minimal additional 
effort and comparable performance results? How 
much additional elTort will be required, and what 
degradation in performance, if any, is to be expected? 
These questions are addressed in the following sec- 
tions. 
2 Method 
The approach taken here is to utilize a data-drivcn 
knowledge acquisition strategy based on decision 
trees which uses contextual information. This differs 
from other approaches which attempt to achieve this 
task by: (1) hand-coded heuristics, (2) list-based 
matching schemes, (3) human-generated knowledge 
bases, and (4) combinations thereof. Delimitation 
occurs through the application of phrasal templates. 
These temphttes, built by hand, use logical ol~eralors 
(AND, OR, etc.) to combine features strongly asst)ci- 
ated with proper names, including: proper ,mun, 
ampersand, hyphen, and comma. In addition, ambi- 
guities with delimitation are handled by including 
other predictive features within the templates. 
To acquire the knowledge required for classilica- 
tion, each word is tagged with all of its associated 
features. These features are obtained through auto- 
mated and manual techniques. A decision trec is 
built (lk)r each name class) from the initial feature set 
using a rccursive partitioning algorithm (Quinhtn, 
1986; l)treiman et al., 1984) that uses the following 
function as its selection (splitting) criterion: 
-p*log2Q)) - (1-p)*log2(I-p) (i) 
where p represents the proportion of names behmg- 
ing to the class for which the tree is built. The fea-- 
ture which minimizes the weighted sum o1' tiffs 
function across both child nodes resulting from the 
split is chosen. A nmltitrce approach was chosen 
over learning a single tree for all name classes be- 
cause it allows for the straightforward association of 
features within the tree with specific natllC classes, 
and facilitates troubleshooting. 
The result is a hierarchical collection of' co-occur- 
ring fcatures which predict inclusion to or exclusion 
from a particuhtr proper name class. Since a tree is 
built for each nmne class el' interest, the trees arc all 
applied individually, and then the results are mergcd. 
2.1 Features 
Various types o1' features indicate the type el' name: 
parts of speech, designators, morphology, syntax, se- 
mantics, and more. 1)esignators are features which 
altmc provide strong evidence lbr or against a partic~ 
ular nantc type. l';xamplcs include "Co." (company), 
"l)r." (person), anti "County" (location). For exam- 
pie, of all the company nmnes in the English training 
text, 28% are associated with a corporate designator. 
Other features are predetermined, obtained via on- 
line lists, or are selected automatically based on 
statistical measures. Parts of speech features are 
predetermined based on the part of speech tagger 
employed. On-line lists provide lists of cities, person 
names, nationalities, regitms, etc. The initial set of 
lexical features is selected by choosing those that 
appear most frequently (above somc threshold) 
throughout the training data, and those that appear 
most \['requcntly near the positive instances in the 
training data. 
Some features, such as morphological, keyword, 
and key phrase features, are determined by hantl 
analysis tff the text. Capitalization is one obvious 
Table 1. Features summary. 
Type 
Part of Speech 
1/csignato'~" Company 
PersOll 
Localkm 
I)ate 
Morpholo~yg ' = ....... 
last 
Temphtte 
_ = Special Purpose 
l"eature ....... I'ixamplc 
Proper Nmm "Aristotle" 
Common Noun . "philoso\[!h~"' 
"(2orp.", "1 ,td." 
"Mr.", "PresidelW' 
Cotlnh'y, Stale, City 
Month, l)a~¢ of week . 
"'A-", "IL" 
"-Gorp', "-tee" 
WI.>8, WL<3 
"IBM", "AT&T" 
"Smith". "Michael" 
"(hdf of Mexico" 
"JtlptlllCSC" 
"based in", "sa!d lie" 
Capitalizalion 
Company Suffix 
W~wd Lengfll 
CompalfiCs 
Pcrsons 
Locations 
Nationalities 
K.eyword(s) 
Company 
PCI'SO\[I 
Location 
l)ate 
Proper Name 
Lngst (?Ill Sbslr 
Duplicated PNs 
< NNP CN .dcsig > 
< P_.Desig NNP > 
< NNI' L dcsig > 
< MM Num, Num> 
< NNP NNP > ... 
"VW" <- Volkswagen 
DUP 2+, I)UP 5+ 
llow inane. , 
NA 
NA 
100E, IIOS, 60J 
70 E, 70 S, 43 J 
520 E, 900 S, 570 J 
56E, 19S, !9J 
|E, 1S, 0J 
5 E, 0 S, 30 J 
4E, 4S, 2J 
0E, 100S,7KJ 
21KE, 2IKS, 185KJ 
20 E, 20 S, 2K J 
220 E, 0 S, 0 J 
44 E, 49 S, 54 J 
210E, 210 S, 210 J 
90 E, 95 S, 90 J 
190E, 190S, 190J 
17E, 18S, 70J 
140E, 140S, 140J 
IE,|S, IJ 
5E, 5S, 2J 
425 
morphological feature of importance. Determining 
keyword and key phrase features amounts to select- 
ing prudent subject categories. These categories are 
associated with lists of lexical items or already exist- 
ing features. For example, many of the statistically 
derived lexical features may fall under common sub- 
ject categories. The words "build", "make", 
"manufacture", and "produce" can be associated with 
the subject category "make-type verbs". Analysis of 
the immediate context surrounding company names 
may lead to the discovery of key phrases like "said 
it", "entered a venture", and "is located in". Table 1 
shows a summary of various types of features used in 
system development. The longest common substring 
(LCS) feature (Jacobs et al., 1993) is useful for 
finding proper name aliases. 
2.2 Feature Trees 
The ID3 algorithm (Quinlan, 1986) selects and orga- 
nizes features into a discrimination tree, one tree for 
each type of name (person, company, etc.). The tree, 
once built, typically contains 100+ nodes, each one 
inquiring about one feature in the text, within the 
locality of the current proper name of interest. 
An example of a tree which was generated for 
companies is shown in Figure 1. The context level 
for this example is 3, meaning that the feature in 
question must occur within the region starting 3 
words to the left of and ending 3 words to the right of 
the proper name's left boundary. A "(L)" or "(R)" 
following the feature name indicates that the feature 
must occur to the left of or to the right of the proper 
name's left boundary respectively. The numbers di- 
rectly beneath a node of the tree represent the num- 
ber of negative and positive examples present from 
the training set. These numbers are useful for associ- 
ating a confidence level with each classification. 
Definitions for the features in Figure 1 (and other 
abbreviations) can be found in the appendix. 
The training set used for this example contains 
1084 negative and 669 positive examples. To obtain 
the best initial split of the training set, the feature 
"CN_alias" is chosen. Recursively visiting and op- 
timally splitting each concurrent subset results in the 
generation of 97 nodes (not including leaf nodes). 
N N ...... N '- N -' P 
Figure 1. Company tree example (context is +/- 3). 
2.3 Architecture 
Figure 2 shows the working development system. 
The starting point is training text which has been pre- 
tagged with the locations of all proper names. The 
tokenizer separates punctuation from words. For 
non-token languages (no spaces between words), it 
also separates contiguous characters into constituent 
words. The part of speech (POS) tagger (Brill, 1992; 
Farwell et. al., 1994; Matsumoto et al., 1992) at- 
taches parts of speech. Thc set of derived features is 
attached. During the delimitation phase, proper 
names are delimited using a set of POS-based hand- 
coded tcmplates. Using ID3, a dccision tree is gen- 
erated based on the existing feature set and thc speci- 
fied level of context to be considered. The generated 
tree is applied to test data and scored. Manual analy- 
sis of the tree and scored result leads to the discovery 
of new features. The new features are added to the 
tokenized training text, and the process repeats. 
2.4 Cross Language Porting 
In order to work with another language, the follow- 
ing resources are needed: (1) pre-tagged training text 
in the new language using same tags as belore, (2) a 
tokenizer for non-token languages, (3) a POS tagger 
(plus translation of the tags to a standard POS con- 
vention), and (4) translation of designators and 
lexical (list-based) features. 
These language-specific modules are highlighted 
in Figure 2 with bold bordcrs. Feature translation 
occurs through the utilization of: on-line resources, 
dictionaries, atlases, bilingual speakers, etc. The 
remainder is constant across languages: a language 
independent core development system, and an opti- 
mally derived feature set for English. 
Also worth noting are the parts of development 
system that are executed by hand. These are shown 
shaded. Everything else is automatic. 
3 Experiment 
The system was first built for English and then 
ported to Spanish and Japanese. For English, the 
training text consisted of 50 messages obtained from 
the English Joint Ventures (EJV) domain MUC-5 
corpus of the US Advanced Research Projects 
Agency (ARPA). This data was hand-tagged with 
the locations of company names, person names, 
locations names, and dates. The test set consisted of 
10 new messages. 
Experimental results were obtained by applying 
the generated trees to test texts. The initial raw text 
is tokenized and tagged with parts of speech. All 
features necessary to apply rules and trees are at- 
tached. Phrasal template rules are applied in order to 
delimit proper names. Then trees for each proper 
name type are applied individually to the proper 
names in the featurized text. Proper names which are 
426 
Figure 2. Multilingual development system. 
voted into more than one class are handled by 
choosing the highest priority class. Priorities are 
determined based on the independent perlormance of 
each tree. For example, if person trces perform 
better independently than location trees, then a per- 
son classification will be chosen over a location 
classification. Also, designators have a large impact 
on resolving conflicts. 
3.1 English 
Various parameterizations were used for system 
development, including: (1) context depth, (2) 
feature set size, (3) training set size, and (4) incorpo- 
ration of hand-coded phrasal templates. 
Figure 3 shows ttle performance results for Eng- 
lish. The metrics used were recall (R), precision (P), 
and an averaging measure, P&R, defined as: 
P&R = 2*P*R/(P+R) (2) 
Obtained results for English compare to the English 
results of Rau (1992) and McDonald (1993). The 
weighted average of the P&R for companies, per- 
sons, locations, and dates is 94.0%. 
Ii RecaH 
• Precision 
companlos persons locations dates 
Figure 3. English performance results. 
The date grammar is rather small in comparison 
to other name classes, hence the performance for 
dates was perfect. Locations, by contrast, exhibited 
the lowest performance. This can be attributed 
mainly to: (1) locations are commonly associated 
with commas, which can create ambiguities with 
delimitation, and (2) locations made up a small 
percentage of all names in the training set, which 
could have resulted in overfitting of the built tree to 
the training data. 
Features strengths were measured for companies, 
persons, and locations. This experiment involved 
removing one feature at a time from the text used for 
testing and then reapplying the stone tree. Figure 4 
and Table 2 show performance results (P&R) when 
the three most powerful features are removed, one at 
a time, for companies, persons, and locations respec- 
tively. This experiment demonstrates tim power of 
designator features across all proper name types, and 
the importance of the alias feature for companies. 
1 
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
F1 F2 F3 F4 None 
Feature removed 
NCompanies 
I 
IPersons 
Locations j 
Figure 4. Feature strengths for English. 
"Fable 2. Strongest features for English. 
Feature Companies 
FI CAP 
F2 CN desig 
F3 CN alias 
F4 Hyphen 
Persons Locations 
P_desig CAP 
CAP L desi[j 
ATH_reg In 
F I L Re~ion 
3.2 Spanish 
Three experiments have been conducted for Spanish. 
In the first experiment, the English trees, generated 
427 
from the feature set optimized for English, are ap- 
plied to the Spanish text (E-E-S). In the second ex- 
periment, new Spanish-specific trees are generated 
from the feature set optimized for English and ap- 
plied to the Spanish text (S-E-S). The third experi- 
ment proceeds like the second, except that minor ad- 
justments and additions are made to the t'eature set 
with the goal of improving performance (S-S-S). 
The additional resources required for the first 
Spanish experiment (E-E-S) are a Spanish POS-tag- 
ger (Farwell et al., 1994) and also the translated fea- 
ture set (including POS) optimally derived for Eng- 
lish. The second and third Spanish experiments 
(S-E-S, S-S-S) require in addition pre-tagged Spanish 
training text using the same tags as for English. 
The obtained Spanish scores as compared to the 
scores from the initial English experiment (E-E-E) 
are shown in figure 5. 
1 
0.9 
0.8 
O,7 
O,6 
0,5 
0.4 
comparllas persons Iocatloos dato5 
E-E-E 
ill E-E-S 
! SES 
~:'% 
Figure 5. P&R scores lbr Spanish versus English. 
The additional Spanish specific features derived for 
S-S-S are shown in Table 3. Only a few new features 
added to the core feature set allows for significant 
pcrfommnce improvement. 
Table 3. Spanish specific features for S-S-S. 
Type Feature 
List Companies 
Ke~cword(s) 
Template Person 
Person 
Date 
Date 
Instances How 
man~' 
"IBM", "AT&T', ... 100 
"del" (OF THE) l 
< FN DE LN > 1 
< FN DE NNP > 1 
< Num OF MM > 1 
<Num OF MM OF Num> l 
3.3 Japanese 
The same three experiments conducted lor Spanish 
are being conducted for Japanese. The first two, 
E-E-J and J-E-J, have been completed; J-J-J is in 
progress. 
The additional resources required for the first 
Japanese experiment (E-E-J) are a Japanese tokenizer 
and POS-tagger (Matsumoto et al., 1992) and also 
the translated feature set optimally derived for Eng- 
lish. The second and third Japanese experiments 
(J-E-J, J-J-J) require in addition pre-taggcd Japanese 
training text using the same tags as for English. 
The obtained Japanese scores as compared to the 
scores from the initial English experiment (E-E-E) 
are shown in Figure 6. The weighted averages of the 
P&R measures across all languages, for companies, 
persons, locations, and dates, are shown in Figure 7. 
Table 4 shows comparisons to other work. 
companies persons locations dales 
I~ E-E'E J 
Figure 6. P&R scores for Japanese versus English. 
1 
o6°ii 107 ii ¸ iiiiit!i;i 
0.5 
0,4 
0.3 
0.2 
0.1 
0 i 
E'E~E E-E~S S-E-S S-S~S E4E~I J-E.I 
Figure 7. Weighted P&R scores comparison. 
Table 4. Performance comparison to other work. 
System Lang. Class R P P&R 
Rau English. Corn NA 95 NA 
PNF English Corn NA NA "Near 
(McDonald) Pers 100%" 
Loc 
Date 
P~mglyzer Spanish NA NA 80 NA 
MAJESTY Japanese Corn 84.3 81.4 82.8 
Pers 93.1 98.6 95.8 
Loc 92.6 96.8 94.7 
MNR English Corn 97.6 91.6 94.5 
(Gallippi) Pers 98.2 100 99.1 
Loc 85.7 91.7 88.6 
Date 100 100 100 
(Avg) 94.0 
MNR Spanish Corn 74.1 90.9 81.6 
Pers 97.4 79.2 87.4 
Loc 93.1 87.5 89.4 
Date 100 100 100 
(Avg) 89.2 
MNR Japanese Corn 60.0 6010 60.0 
Pers 86.5 84.9 85.7 
Loc 80.4 82.1 81.3 
Date 90.0 94.7 92.3 
(Avg) 83.1 
428 
4 Related Work 
Proper name recognition has been addressed by 
others (Farwell et al., 1994; Kitani & Mitamura, 
1994; Rau, 1992), with the goal of incorporating this 
capability into IR and MT systems. Related prob- 
lems have been studied which utilize contextual 
information and learning. Examples include posteo 
diting of documents (article seleclion) (Knight & 
Chander, 1994), word sense disambiguation (Black, 
1988; Siegel & McKeown, 1994), and discourse 
analysis (Soderland & Letmert, 1994). 
5 Future Work 
An investigation of the causes of i)erl'ormance degra- 
dation across languages will be conducted, with tim 
goal of lfinpointing and concurrently taking steps to 
minimize their effects. Other plans ir~clude using 
MI, techniques to \[:urther reduce the altlOUllt o| 
human effort: (1) automate the building of templates 
for delimitation, (2) automale the discovery of new 
features froni test results, and (3) expand the search 
space traversed by lhe tree building algorithm lo 
inchide splils on feature combinalions. 
Acknowledgments 
The author would like to offer special thanks and 
gratitude to \]~duard llovy for all of his support, 
direction, and ericouragcmcnt front the onset of this 
work. Thanks als() to Kevin Knight for his early 
suggestions, and to the lnfornmtion Sciences Institute 
for use of their facilities and resources. 
References 
Black, 1';. 1988. An I';xpcrinmnt in Computalional 
l)iscriminalion of English Word Senses. in IBM 
Journal of Research and l)evelopntent, 32(2). 
Breiman, l,., Friedman, J.H., Olshcn, R.A., and 
Stone, C..J. 1984. Classification and Regression 
77"ees. Wadsworth International Group. 
Brill, i';. 1992. A Simple Rule-Based Part of 
Speech Tagger. hi i'roceedings elthe Third Cot!i?n'o 
elite on Applied Natural Language Processing, ACL. 
Farwell, D., Hehnreich, S., Jin, W., Casper, M., 
Hargravc, J., Molina-Salgado, ll., and Weng, 1;. 
1994. Panglyzer: Spanish l,anguage Analysis 
System. in Proceedings of the Conference of the 
Association of Machine Translation in the Americas 
(ATMA). Columbia, MI). 
Jacobs, P.S., Krupta, C,., Rau, I,., Mauldin, M./.., 
Mitamura, T., Kitaui, T., Sider, 1. and Childs, L. 
1993. Gt!-CMU: l)escription of the SH()GUN 
System Used for MU('.-5. In l'roceeding.v o\[' the 
Fifth Message Undetwtanding UonJ'erenee (MUC-5). 
Morgan Katltrnann, pp. 109-120. 
Kitani, T. and Mitamura, T. 1994. An Accurate 
Morphological Almlysis and Proper Name ldcntifi- 
cation for Japanese Text l'rocessing. In 77"ansactions 
of hzJbrmation Processing Society of Japan, Vol. 35, 
No. 3, pp. 404-413. 
Knight, K. and Chandcr, I. 1994. Automated 
Poslediting (If Documents. In Proceedings of the 
Twe!/Th National Cotlference on Artificial Intelli- 
gence (AAAI), pp. 779-784. 
LeMert, W., McCarthy, J., Soderland, S., Riloff, 
E., Card|e, C., Peterson, J., Feng, F., Dolan, C., and 
Goldman, S. 1993. UMass/Hughes: Description of 
the CIRCUS System Used for MUC-5. In Proceed-. 
ings of the Fifth Message (h~derstanding Conference 
(MUC-5). Morgan Kaufinann, pp. 277-292. 
Matsumolo, Y., Kurohashi, S., Taegi, I{. and 
Nagao, M. 1992. JUMAN Users' Manual Vetwion 
0.8, Nagao Laboratory, Kyoto University. 
McDonald, I). 1993. internal and \[ixternal 
l{vidcnce in tim Identification and Semantic Catego- 
rizalion of Proper Names. In Proceedings of the 
SINGLFX workshop on. "Acquisition el Lexical 
Knowledge fiom Text", pp. 32.-43. 
Quin\[an, J.R. 1986. Induction of Decision Trees. 
\[n Machine LeatvUng, pp. 81-106. 
Rau, L.F. 1992. Extracting Company Names 
from '\['ext. In Proceedings of the Seventh Coq/~er- 
elite on Art(fieial Intelligence Applications, pp. 189- 
194.. 
Siegel, E.V. and McKeown, K.R. 1994. limer- 
gent Linguistic Rules l'rom Inducing l)ecision 'Frees: 
I)isaml)iguating Discourse Clue Words. In Pro- 
ceedings of the 7'we!flh National Conference on 
Artificial Intelligence (AAAI), pp. 820-826. 
Soderland, S. and l.ehnert, W. 1994. Corpus- 
Driven Knowledge Acquisition for l)iscourse Analy- 
sis. In Proceedings of the 7'we!/}h National Confer- 
ence on Artificial Intelligence (AAAI), pp. 827-832. 
Appendix A. Abbreviations 
Table 5. l)efinitions for abbreviations. 
Abbreviation Definition 
ACI,~ 
AT! 1 reg 
CAP 
('N alias 
CN dsg 
C(mnlry 
FN 
1" l l~ 
Hyphen 
IN region 
In 
I ,CS 
I,N 
L .desig 
NNP 
Notln 
PN_cml 
PN .2X I- 
PIing 
I' dcsig 
Region 
SO region 
Snt end & 
Acl'onylll 
Occurs ill <Aulhor> ... </Author> 
Capitalized 
LCS of full company name 
(\]Oll/pillly llalllc designator 
Counh'y nalne 
First (given) nmne 
First name + initial + lasl name 
! lyphen (punctuatiol0 
Occurs ill <IN> ... </IN> region 
1 ,cxical "in" 
Longcsl COllllll(HI std)slring 
l,ast (family) name 
Location designator 
Proper 11o1111 
General noun 
Proper Ilanlc clld delimiter 
l'ropcr lla\[llC occtirs 2-b times 
Punctuation 
l'cl'SOil designator 
Geogral)hicaI region name Occurs in <SO> ... </SO> region 
Sentence end boulRlary 
A n~!ersand dmracter 
429 
