Unsupervised Learning of Name Structure
From Coreference Data
 
Eugene Charniak
Brown Laboratory for Linguistic Information Processing
Department of Computer Science
Brown University, Box 1910, Providence, RI
ec@cs.brown.edu
Abstract
We present two methods for learning the struc-
ture of personal names from unlabeled data.
The  rst simply uses a few implicit constraints
governing this structure to gain a toehold on the
problem | e.g., descriptors come before  rst
names, which come before middle names, etc.
The second model also uses possible coreference
information. We found that coreference con-
straints on names improve the performance of
the model from 92.6% to 97.0%. We are in-
terested in this problem in its own right, but
also as a possible way to improve named entity
recognition (by recognizing the structure of dif-
ferent kinds of names) and as a way to improve
noun-phrase coreference determination.
1 Introduction
We present two methods for the unsupervised
learning of the structure of personal names as
found in Wall Street Journal text. More specif-
ically, we consider a \name" to be a sequence of
proper nouns from a single noun-phrase (as in-
dicated by Penn treebank-style parse trees). For
example, \Defense Secretary George W. Smith"
would be a name and we would analyze it into
the components \Defense Secretary" (a descrip-
tor), \George" (a  rst name), \W." (a middle
name, we do not distinguish between initials
and \true" names), and \Smith" (a last name).
We consider two unsupervised models for
learning this information. The  rst simply uses
a few implicit constraints governing this struc-
ture to gain a toehold on the problem | e.g.,
descriptors come before  rst names, which come
 
This research was supported in part by NSF grant
LIS SBR 9720368. The author would like to thank Mark
Johnson and the rest of the Brown Laboratory for Lin-
guistic Information Processing (BLLIP) for general ad-
vice and encouragement.
before middle names, etc. We henceforth call
this the \name" model. The second model also
uses possible coreference information. Typically
the same individual is mentioned several times
in the same article (e.g., we might later en-
counter \Mr. Smith"), and the pattern of such
references, and the mutual constraints among
them, could very well help our unsupervised
methods determine the correct structure. We
call this the \coreference" model. We were at-
tracted to this second model as it might o er a
small example of how semantic information like
coreference could help in learning structural in-
formation.
To the best of our knowledge there has not
been any previous work on learning personal
structure. We are aware of one previous case
of unsupervised learning of lexical information
from possible coreference, namely that of Ge et.
al. [5] where possible pronoun coreference was
used to learn the gender of nouns. In this case a
program with an approximately 65% accuracy
in determining the correct antecedent was used
to collect information on pronouns and their
possible antecedents. The gender of the pro-
noun was then used to suggest the gender of
the noun-phrase that was proposed as the an-
tecedent. The current work is quite di erent
in both goal and methods, but similar in spirit.
More generally this work is part of a growing
body of work on learning language-related in-
formation from unlabeled corpora [1,2,3,8,9,10,
11].
2 Problem De nition and Data
Preparation
We assume that people’s names have six (op-
tional) components as exempli ed in the follow-
ing somewhat contrived example:
Word Label Label Number
Defense descriptor 0
Secretary descriptor 0
Mr. honori c 1
John  rst-name 2
W. middle-name 3
Smith last-name 4
Jr. close 5
Our models make the following assumptions
about personal names:
 all words of label l (the label number) must
occur before all words of label l +1
 with the exception of descriptors, a maxi-
mum of one word may appear for each label
 every name must include either a  rst name
or a last name
 in a loose sense, honori csandclosesare
\closed classes", even if we do not know
which words are in the classes. We im-
plement this by requiring that words given
these labels must appear in our dictionary
only as proper names, and that they must
have appeared at least three times in the
training set used for the lexicon (sections
2-21 of the Penn Wall Street Journal tree-
bank)
Section 5 discusses these constraints and as-
sumptions.
The input to the name model is a noisy list
of personal names. This list is approximately
85% correct; that is, about 15% of the word
sequences are not personal names, but rather
non-names, or the names of other types of en-
tities. We obtained these names by running a
program inspired by that of Collins and Singer
[4] for unsupervised learning of named entity
recognition. This program takes as input pos-
sible names plus contextual information about
their occurrences. It then categorizes each name
as one of person, place,ororganization. A possi-
ble name is considered to be a sequence of one
or more proper nouns immediately dominated
by a noun-phrase where the last of the proper
nouns is the head (rightmost noun) of the noun
phrase. We used as input to this program the
parsed text found in the BLLIP WSJ 1987-89
WSJ Corpus | Release 1 [6]. Because of a mi-
nor error, the parser used in producing this cor-
pus had a unwarranted propensity to label un-
capitalized words as proper nouns. To correct
for this we only allowed capitalized words to be
considered proper nouns. In section 5 we note
an unintended consequence of this decision.
The coreference model for our tasks is also
given a list of all personal names (as de-
 ned above) in each Wall Street Journal arti-
cle. Although the BLLIP corpus has machine-
generated coreference markers, these are ig-
nored.
The output of both programs is an assign-
ment from each name to a sequence of labels,
one for each word in the name. Performance is
measured by the percent of words labeled cor-
rectly and percent of names for which all of the
labels are correct.
3 The Probability Models
We now consider the probability models that
underlie our learning mechanisms. Both models
are generative in that they assign probabilities
to all possible labelings of the names. (For the
coreference model the model generates all pos-
sible labelings given the proposed antecedent.)
Let
~
l be a sequence of label assignments to the
name ~n (a sequence of words). For the name
model we estimate
arg max
~
l
p(
~
l j ~n)=argmax
~
l
p(
~
l;~n)(1)
We estimate this latter probability by assum-
ing that the number of words assigned label l,
n(l), is independent of which other labels have
appeared. Our assumptions imply that with the
exception of descriptor, all labels may occur zero
or one times. We arbitrarily assume that there
may be zero to fourteen descriptors. We then
assume that the words in the name are inde-
pendent of one another given their label. Thus
we get the following equation:
p(
~
l;~n)=
Y
l=0;5
p(N(l)=n(l))
Y
i=0;n(l)
p(w(i) j l)
(2)
Here w(i)istheith word from ~n assigned the
label l in
~
l and N(l) is a random variable whose
value is the number of words in the name with
label l. To put this slightly di erently, we  rst
guess the number of words with each label l
according to the distribution p(N(l)=n(l)).
Given the ordering constraints, this completely
determines which words in ~n get which label.
We then guess each of the words according to
the distribution p(w(i) j l). The name model
does not use information concerning how often
each name occurs. (That is, it implicitly as-
sumes that all names occur equally often.)
We have also considered somewhat more com-
plex approximations to p(
~
l;~n). See section 5.
The coreference model is more complicated.
Here we estimate
arg max
~
l
p(
~
l j ~n;~c)=argmax
~
l
p(
~
l;~n j ~c)(3)
That is, for each name the program identi es
zero or one possible antecedent name. It does
this using a very crude  lter. The last word
of the proposed antecedent (unless that word is
\Jr.", in which case it looks at the second to last
word) must also appear in ~n as well. If no such
name exists, then ~c = 4 and we estimate the
distribution according to equation 2. If more
than one such name exists, we choose the  rst
one appearing in the article.
Even if there is such a name, the program
does not assume that the two names are, in fact,
coreferent. Rather, a hidden random variable R
determines how the two names relate. There are
three possibilities:
 ~c is not coreferent (R = 4), in which case
the probability is estimated according to
the ~c = 4 case.
 ~c is not coreferent but is a member of
the same family as ~n (e.g., \John Sebas-
tian Bach" and \Carl Philipp Emmanuel
Bach"). This case (R = f) is computed
as the non-coreference case, but the sec-
ond occurrence is given \credit" for the last
name.
 ~c is coreferent to ~n, in which case we com-
pute the probability as described below. In
this case we assume that any words shared
by the two names must appear with the
same label, and except for descriptions, la-
bels may not change between them (e.g.,
if ~c has a  rst name, then ~n can be given
a rstnameonlyifitisthesamewordas
that in ~c). This does not allow for nick-
names and other such cases, but they are
rare in the Wall Street Journal.
More formally, we have
p(~n;
~
l j ~c)=
X
R
p(R)p(~n;
~
l j R;~c)(4)
We then estimate p(~n;
~
l j R;~c) as follows:
 if R = 4, compute p(~n;
~
l) as estimated
from equation 2.
 if R = f,thenp(~n;
~
l)=p(s j
~
l(s)) where s is
thewordincommonthatcausedtheprevi-
ous name to be selected as a possible coref-
erent and
~
l(s) is the label assigned to s ac-
cording to
~
l.
 if R = ~c, use equation 8 below.
In equation 4 the R = 4 case is reasonably
straight-forward: we simply use equation 2 as
the non-coreferent distribution. For R = f,as
we noted earlier, we want to claim that the new
name is a member of the same family as that of
the earlier name. Thus, as we said earlier, we
get \credit" for the repeated family name. This
is why we take the non-coreferent probability
and divide by the probability of what we take
to be the family name.
This leaves the coreferent case. The basic
idea is that we view the labeling of new name
(
~
l) as a transformation of the labeling of the
old one (
~
l
0
). However, we do not know
~
l
0
so we
have to sum over all possible former labelings
L
0
. This is expressed as
p(~n;
~
l j~c)=
X
~
l
0
2L
0
p(
~
l
0
j ~c)p(~n;
~
l j
~
l
0
;~c)(5)
The  rst term, p(
~
l
0
j ~c), is easy to compute
from equation 2 using Bayes law. We now turn
our attention to the second term.
To establish a more detailed relationship be-
tween the old and new names we compute pos-
sible correspondences between the two names,
where a correspondence speci es for each word
in the old name if it is retained in the new name,
and if so, the word in the new name to which it
corresponds. Two words may correspond only
if they are the same lexical item. (The converse
does not hold.) Since in principle there can be
multiple correspondences, we introduce the cor-
respondences  by summing the probability over
all of them:
p(~n;
~
l j
~
l
0
;~c)=
X
 
p(~n;
~
l; j
~
l
0
;~c)(6)
 max
 
p(~n;
~
l; j
~
l
0
;~c)(7)
In the second equation we simplify by making
the assumption that the sum will be dominated
by one of the correspondences, a very good as-
sumption. Furthermore, as is intuitively plau-
sible, one can identify the maximum  with-
out actually computing the probabilities: it is
the  with the maximum number of words re-
tained from ~c. Henceforth we use  to denote
this maximum-probability correspondence.
By specifying  we divide the words of the
old name into two groups, those (R)thatare
retained in the new name and those (S)thatare
subtracted when going to the new name. Simi-
larly we have divided the words of the new name
into two classes, those retained and those added
(A). We then assume that the probability of a
word being subtracted or retained is indepen-
dent of the word and depends only on its label
(e.g., the probability of a subtraction given the
label l is p(s j l)). Furthermore, we assume that
the labels of words in R do not change between
~
l
0
and
~
l. Once we have pinned down R and S,
any words left in
~
l must be added. However,
we do not yet \know" the labels of those, so
we need a probability term p(l j a). Lastly, for
words that are added in the new name, we need
to guess the particular word corresponding to
the label type. This gives us the following dis-
tribution:
p(~n;
~
l; j ~c;
~
l
0
)=
Y
w2S
p(s j
~
l
0
(w))
Y
w2R
p(r j
~
l
0
(w))
Y
w2A
p(l j a)
p(w j l)(8)
Taken together, equations 2 | 8 de ne our
probability model.
4 Experiments
From the work on named entity recognition we
obtained a list of 145,670 names, of which 87,809
were marked as personal names. A second pro-
gram creates an ordered list of names that ap-
pear in each article in the corpus. The two  les,
names and article-name occurrences, are the in-
put to our procedures.
With one exception, all the probabilities re-
quired by the two models are initialized with
flat distributions | i.e., if a random variable
can take n possible values, each value is 1=n.
The probabilites so set are:
1. p(N(l)=n(l)) from equation 2 (the prob-
ability that label l appears n(l) times),
2. p(w(i) j l) from equation 2 (the probability
of generating w(i) given it has label l),
3. p(s j
~
l
0
(w)), p(r j
~
l
0
(w)), and p(a j
~
l(w))
from equation 8, the probabilities that a
the label
~
l(w) will be subtracted, retained,
or added when going from the old name to
the new name.
We then used the expectation-maximization
(EM) algorithm to re-estimate the values. We
initially decided to run EM for 100 iterations as
our benchmark. In practice no change in perfor-
mance was observed after about 15 iterations.
The one exception to the flat probability dis-
tribution rule is the probability distribution
p(R), the probability of an antecedent being
coreferent, a family relation, or non-coreferent.
This distribution was set at .993, .002, and .005
respectively for the three alternatives and the
values were not re-estimated by EM.
1
Figure
1 show some of the probabilities for individual
words given the possible labels.
The result shown in Figure 1 are basically
correct, with \Director" having a high proba-
bility as a descriptor, (0.0059), \Ms." having a
high probability as honori c (0.058), etc. Some
of the small non-zero probabilities are due to
genuine ambiguity (e.g., Fisher does occur as a
 rst name as well as a last name) but more of
it is due to small confusions in particular cases
(e.g., \Director" as a last-name, or \John" as
descriptor).
After EM training we evaluated the program
on 309 personal names from our names list that
we had annotated by hand. These names were
obtained by random selection of names labeled
1
These were the  rst values we tried and, as they
worked satisfactorily, we simply left them alone.
Word p(w j 0) p(w j 1) p(w j 2) p(w j 3) p(w j 4) p(w j 5)
Director 0.0059 0 2.0 10
−5
0 0.00016 0
Ms. 1.2 10
−7
0.058 0.0041 2.9 10
−14
00
John 0.0037 2.9 10
−6
0.038 9.7 10
−6
T. 0.0018 1.2 10
−12
.00032 0.02 0 0
Fisher 0 0 4.5 10
−5
7.4 10
−5
0.00073 0
I00 00 6.810
−5
0.16
Figure 1: Some example probabilities of words given labels after 15 EM iterations
as personal names by the name-entity recog-
nizer. If the named entity recognizer had mis-
takenly classi ed something as a personal name
it was not used in our test data.
For the name model we straightforwardly
used equation 2 to determine the most probable
label sequence
~
l for each name. Note, however,
that the testing data does not itself include any
information on whether or not the test name
was a  rst or subsequent occurrence of an indi-
vidual in the text. To evaluate the coreference
model we looked at the possible coreference data
to  nd if the test-data name was most common
as a  rst occurrence, or if not, which possible
antecedent was the most common. If  rst occur-
rence prevailed,
~
l was determined from equation
2, and otherwise it was determined using equa-
tion 3 with ~c set to the most common possible
coreferent for this name.
We compare the most probable labels
~
l for a
test example with the hand-labeled test data.
We report percentage of words that are given
the correct label and percentage of names that
are completely correct. The results of our ex-
periments are as follows:
Model Label% Name%
Name 92.6 85.1
Coreference 97.0 94.5
As can be seen, information about possible
coreference was a decided help in this task, lead-
ing to an error reduction of 59% for the number
of labels correct and 63% for names correct.
5 Error Analysis
The errors tend to arise from three situations:
the name disobeys the name structure assump-
tions upon which the program is based, the
name is anomalous in some way, or sparse data.
We consider each of these in turn.
Many of the names we encounter do not obey
our assumptions. Probably the most common
situation is last names that, contrary to our as-
sumption, are composed of more than word e.g.,
\Van Dam". Actually, a detail of our process-
ing has caused this case to be under-represented
in our data and testing examples. As noted in
Section 2, uncapitalized proper nouns were not
allowed. The most common extra last name is
probably \van," but all of these names were ei-
ther truncated or ignored because of our pro-
cessing step.
In principle, it should be possible to allow for
multiple last names, or alternatively have a new
label for \ rst of two last names". In practice,
it is of course the case that the more parameters
we give EM to  ddle with, the more mischief it
can get into. However, for a practical program
this is probably the most important extension
we envision.
Names may be anomalous while obeying our
restrictions at least in the letter if not the spirit.
Chinese names have something very much like
the  rst-middle-last name structure we assume,
but the family name comes  rst. This is partic-
ularly relevant for the coreferent model, since it
will be the family name that is repeated. There
is nothing in our model that prevents this, but
it is su ciently rare that the program gets \con-
fused". In a similar way, we marked both \Dr."
and \Sir" as honori cs in our test data. How-
ever, the Wall Street Journal treats them very
di erently from \Mr." in that the former tend
to be included even in the  rst mention of a
name, while the latter is not. Thus in some
cases our program labeled \Dr." and \Sir" as
descriptors.
Lastly, there are situations where we imag-
ine that if the program had more data (or if the
learning mechanisms were somehow \better") it
would get the example right. For example, the
name \Mikio Suzuki" appears only once in our
corpus, as does the word \Mikio". \Suzuki"
appears two times, the  rst being in \Yotaro
Suzuki" who is mentioned earlier in the same
article. Unfortunately, because \Mikio" does
not appear elsewhere, the program is at a loss
to decide which label to give it. However, be-
cause Yotaro is assumed to be a  rst name, the
program makes \Mikio Suzuki" coreferent with
\Yotaro Suzuki" by labeling \Mikio" descriptor.
As noted briefly in section 3, we have consid-
ered more complicated probabilities models to
replace equation 2. The most obvious of these is
to allow the distribution over numbers of words
for each label to be conditioned on the previ-
ous label | e.g., a bi-label model. This model
generally performed poorly, although the coref-
erence versions often performed as well as the
coreference model reported here. Our hypoth-
esis is that we are seeing problems similar to
those that have bedeviled applying EM to tasks
like part-of-speech tagging [7]. In such cases EM
typically tries to lower probabilities of the cor-
pus by using the tags to encode common word-
word combinations. As the models correspond-
ing to equations 2 and 8 do not include any
label-label probabilities, this problem does not
appear in these models.
6 Other Applications
It is probably clear to most readers that the
structure and probabilities learned by these
models, particularly the coreferent model, could
be used for tasks other than assigning structure
to names. For starters, we would imagine that
a named entity recognition program that used
information about name structure could do a
better job. The named entity recognition pro-
gram used to create the input looks at only a
few features of the context in which the name
appears, the complete name, and the individ-
ual words that appear in the name irrespective
of the other words. Since the di erent kinds
of names (person, company and location) di er
in structure from one another, a program that
simultaneously establishes both structure and
type would have an extra source of information,
thus enabling it to do a better job.
Our name-structure coreferent model is also
learning a lot of information that would be use-
ful for a program whose primary purpose is to
detect coreference. One way to see this is to
look at some of the probabilities that the pro-
gram learned. Consider the probability that we
will have an honori c in a  rst occurrence of a
name:
p(n(1) = 1) = :000044 (9)
This is very low. Contrast this with the prob-
ability that we add an honori c in the second
occurrence:
p(a j honori c) = 1 (10)
These dramatic probabilities are not, in fact,
accurate, as EM tends to exaggerate the e ects
by moving words that do not obey the trend out
of the honori c category. They are, however, in-
dicative of the fact that in the Wall Street Jour-
nal names are introduced without honori cs,
but subsequent occurrences tend to have them
(a fact we were not aware of at the start of this
research).
Another way to suggest the usefulness of
this research for coreference and named-entity
recognition is to consider the cases where our
program’s crude  lter suggests a possible an-
tecedent, but the probabilistic model of equa-
tion 4 rejects this analysis. The  rst 15 cases
are given in  gure 2. As can be seen, except for
\Mr. President" and \President Reagan", all
of the examples are either not coreferent or are
not people at all.
7 Conclusion
We have presented two methods for the un-
supervised discovery of personal-name struc-
ture. The two methods di er in that the second
uses multiple, possibly coreferent, occurrences
of names to constrain the problem. The meth-
ods perform at a level of 92.6 and 97.0 percent
accuracy respectively.
The methods are of potential interest in their
own right, e.g., to improve the level of detail
provided by Penn Treebank style parses. As
we have also noted, we should also be able to
use this research in the quest for better unsu-
pervised learning of named-entity recognition,
and the model that attends to coreference in-
formation can potentially be useful for programs
aimed directly at this latter problem.
Finally, many of us believe that the power
of unsupervised learning methods for linguistic
Vice President President Reagan
Mr. President President Reagan
Dean P. Guerin Guerin & Turner
Ronald Reagan White House Speaker
House James Wright
Rev. Leon H. Sullivan Sullivan Principles
Mr. Sullivan Sullivan Principles
Rev. Leon Sullivan Sullivan Principles
South Dakota Republican Party
Republican
Kansas Republican Republican Party
Cyril Wagner Jr. Wagner & Brown
General Preston R. Laurence A. Tisch
Tisch
Lt. Col. Oliver Whitney North
North Seymour Jr.
Republican White House Republicans
House
J. Seward Johnson Johnson & Johnson
Vice President President Nixon
Figure 2: The  rst 15 cases in which the coref-
erent model rejected the coreferent hypothesis
information will be proportional to the depth of
semantic or pragmatic analysis that goes into
the features they consider. The vastly superior
performance of the coreference model over the
basic name model moves this believe some small
distance from hope to hypothesis.
References
1. Berland, M. and Charniak, E. Finding
parts in very large corpora.InProceedings
of the ACL 1999. ACL, New Brunswick NJ,
1999, 57{64.
2. Brill, E. Unsupervised learning of disam-
biguation rules for part of speech tagging.In
Proceedings of the Third Workshop on Very
Large Corpora. 1995, 1{13.
3. Caraballo, S. A. Automatic construction
of a hypernym-labeled noun hierarchy from
text.InProceedings of the ACL 1999.ACL,
New Brunswick NJ, 1999.
4. Collins, M. and Singer, Y. Unsuper-
vised models for named entity classi cation.
In Proceedings of the 1999 Joint Sigdat Con-
ference on Empirical Methods in Natural
Language Processing and Very Large Cor-
pora. Association for Computational Linguis-
tics, 1999.
5. Ge, N., Hale, J. and Charniak, E. A
statistical approach to anaphora resolution.
In Proceedings of the Sixth Workshop on
Very Large Corpora. 1998, 161{171.
6. Linguistic Data Consortium. BLLIP
1987-1989 WSJ Corpus Release 1. 2000.
7. Merialdo, B. Tagging English text with a
probabilistic model. Computational Linguis-
tics 20 (1994), 155{172.
8. Riloff, E. Automatically generating ex-
traction patterns from untagged text.In
Proceedings of the Thirteenth National
Conference on Arti cial Intelligence. AAAI
Press/MIT Press, Menlo Park, 1996, 1044{
1049.
9. Riloff, E. and Shepherd, J. A corpus-
based approach for building semantic lexi-
cons.InProceedings of the Second Confer-
ence on Empirical Methods in Natural Lan-
guage Processing. 1997, 117{124.
10. Roark, B. and Charniak, E. Noun-
phrase co-occurrence statistics for semi-
automatic semantic lexicon construction.In
36th Annual Meeting of the Association for
Computational Linguistics and 17th Interna-
tional Conference on Computational Linguis-
tics. 1998, 1110{1116.
11. Yarowsky, D. Unsupervised word sense
disambiguation rivaling supervised methods.
In Proceedings of the 33rd Annual Meeting of
the Association for Computational Linguis-
tics. 1995, 189{196.
