Trained Named Entity Recognition Using Distributional Clusters
Dayne Freitag
HNC Software, LLC
3661 Valley Centre Drive
San Diego, CA 92130
DayneFreitag@fairisaac.com
Abstract
This work applies boosted wrapper induction
(BWI), a machine learning algorithm for informa-
tion extraction from semi-structured documents, to
the problem of named entity recognition. The de-
fault feature set of BWI is augmented with features
based on distributional term clusters induced from a
large unlabeled text corpus. Using no traditional lin-
guistic resources, such as syntactic tags or special-
purpose gazetteers, this approach yields results near
the state of the art in the MUC 6 named entity do-
main. Supervised learning using features derived
through unsupervised corpus analysis may be re-
garded as an alternative to bootstrapping methods.
1 Introduction
The problem of named entity recognition (NER) has
recently received increasing attention. Identifica-
tion of generic semantic categories in text—such
as mentions of people, organizations, locations, and
temporal and numeric expressions—is a necessary
first step in many applications of information ex-
traction, information retrieval, and question answer-
ing. To a large extent, knowledge-poor methods suf-
fice to yield good recognition performance. In par-
ticular, supervised learning can be used to produce
a system with performance at or near the state of the
art (Bikel et al., 1997).
In the supervised learning framework, a corpus
of (typically) a few hundred documents is annotated
by hand to identify the entities of interest. Features
of local context are then used to train a system to
distinguish instances from non-instances in novel
texts. Such features may include literal word tests,
patterns of orthography, parts of speech, seman-
tic categories, or membership in special-purpose
gazetteers.
While supervised training greatly facilitates the
development of a robust NER system, the re-
quirement of a substantial training corpus remains
an impediment to the rapid deployment of NER
in new domains or new languages. A number
of researchers have therefore sought to exploit
the availability of unlabeled documents, typically
by bootstrapping a classifier using automatic
labellings (Collins and Singer, 1999; Cucerzan and
Yarowsky, 1999; Thelen and Riloff, 2002).
bush peters reagan noriega ...
john robert james david ...
president chairman head owner ...
japan california london chicago ...
Table 1: Sample members of four clusters from the
Wall Street Journal corpus.
Here, we investigate a different approach. Us-
ing a distributional clustering technique called co-
clustering, we produce clusters which, intuitively,
should be useful for NER. Table 1 shows exam-
ple terms from several sample clusters induced us-
ing a collection of documents from the Wall Street
Journal (WSJ). Several papers have shown that dis-
tributional clustering yields categories that have
high agreement with part of speech (Schütze, 1995;
Clark, 2000). As the table illustrates, these clus-
ters also tend to have a useful semantic dimension.
Clustering on the WSJ portion of the North Ameri-
can News corpus yields two clusters that clearly cor-
respond to personal names, one for first names and
one for last names. As an experiment, we scanned
the MUC 6 NER data set for token sequences consisting
of zero or more members of the first name
cluster (or an initial followed by a period), followed
by one or more members of the last name cluster.
This simple procedure identified 64% of personal
names with 77% precision.
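The matching procedure just described can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the function name `find_names`, the inclusive-exclusive span convention, and the toy cluster contents (taken from Table 1) are assumptions.

```python
def find_names(tokens, first_names, last_names):
    """Return (start, end) spans, end exclusive, covering zero or more
    first-name cluster members (or an initial followed by a period),
    followed by one or more last-name cluster members."""
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        j = i
        # Consume the optional first-name part.
        while j < n:
            tok = tokens[j].lower()
            if tok in first_names:
                j += 1
            elif (len(tok) == 1 and tok.isalpha()
                  and j + 1 < n and tokens[j + 1] == "."):
                j += 2  # an initial followed by a period
            else:
                break
        # Require at least one member of the last-name cluster.
        k = j
        while k < n and tokens[k].lower() in last_names:
            k += 1
        if k > j:
            spans.append((i, k))
            i = k
        else:
            i += 1
    return spans


# Toy clusters (hypothetical excerpts in the spirit of Table 1).
FIRST = {"john", "robert", "james", "david"}
LAST = {"bush", "peters", "reagan", "noriega"}
tokens = ["mr", ".", "john", "f", ".", "reagan", "met",
          "bush", "peters", "in", "david"]
spans = find_names(tokens, FIRST, LAST)
```

Here the trailing "david" is not reported, because no member of the last-name cluster follows it.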
In this paper, we attempt to improve on this
result by converting the clusters into features to
be exploited by a general-purpose machine learn-
ing algorithm for information extraction. In Sec-
tion 2, we provide a brief description of Boosted
Wrapper Induction (BWI), a pattern learner that has
yielded promising results on semi-structured infor-
mation extraction problems (Freitag and Kushmer-
ick, 2000). In Section 3, we describe our clustering
approach and its particular application. Section 4
presents the results of our experiments. Finally, in
Section 5, we assess the significance of our contri-
bution and attempt to identify promising future di-
rections.
2 BWI
BWI decomposes the problem of recognizing field
instances into two Boolean classification problems:
recognizing field-initial and field-terminal tokens.
Given a target field, a separate classifier is learned
for each of these problems, and the distribution of
field lengths is modeled as a frequency histogram.
At application time, tokens that test positive for ini-
tial are paired with those testing positive for termi-
nal. If the length of a candidate instance, as defined
by such a pair, is determined to have non-zero like-
lihood using the length histogram, a prediction is
returned.
Each of the three parts of a full prediction—initial
boundary, terminal boundary, and length—is as-
signed a real-valued confidence. The confidence of
a boundary detection is its strength as determined by
AdaBoost, while that of the length assessment is the
empirical length probability, which is determined
using the length histogram. The confidence of the
full prediction is the product of these three individ-
ual confidence scores. In the event that overlapping
predictions are found in this way (a rare event, em-
pirically), the predictions with lower confidence are
discarded.
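As a rough sketch of the prediction step described above, the following code pairs boundary detections, filters them with the length histogram, and scores each candidate by the product of the three confidences. The function names, the dictionary-based inputs, and the convention that `start` and `end` index the first and last token of a field are assumptions made for illustration.

```python
from collections import Counter


def length_histogram(training_lengths):
    """Empirical probability of each field length seen in training."""
    counts = Counter(training_lengths)
    total = sum(counts.values())
    return {length: c / total for length, c in counts.items()}


def predict_fields(initial_scores, terminal_scores, length_prob):
    """initial_scores / terminal_scores map a token index to the
    confidence of a positive boundary detection at that token."""
    candidates = []
    for start, c_init in initial_scores.items():
        for end, c_term in terminal_scores.items():
            length = end - start + 1
            p_len = length_prob.get(length, 0.0)
            if length >= 1 and p_len > 0.0:
                # Confidence is the product of the three parts.
                candidates.append((c_init * c_term * p_len, start, end))
    # Visit candidates from most to least confident and discard any
    # that overlaps a prediction already kept.
    candidates.sort(reverse=True)
    kept = []
    for conf, start, end in candidates:
        if all(end < s or start > e for _, s, e in kept):
            kept.append((conf, start, end))
    return sorted(kept, key=lambda c: c[1])


hist = length_histogram([1, 2, 2, 2])
preds = predict_fields({3: 0.8}, {3: 0.4, 4: 0.9}, hist)
```

In this toy call, the two-token candidate (confidence 0.8 x 0.9 x 0.75) wins over the overlapping one-token candidate (0.8 x 0.4 x 0.25).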
In this section, we sketch those aspects of BWI
relevant to the current application. More details
are available in the paper in which BWI was de-
fined (Freitag and Kushmerick, 2000).
2.1 Boosting
BWI uses generalized AdaBoost to produce each
boundary classifier (Schapire and Singer, 1998).
Boosting is a procedure for improving the perfor-
mance of a “weak learner” by repeatedly applying
it to a training set, at each step modifying exam-
ple weights to emphasize those examples on which
the learner has done poorly in previous steps. The
output is a weighted collection of weak learner hy-
potheses. Classification involves having the individ-
ual hypotheses “vote,” with strengths proportional
to their weights, and summing overlapping votes.
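The reweighting and voting scheme can be illustrated with a short sketch. Note that this is plain (discrete) AdaBoost over a fixed pool of candidate hypotheses, not the generalized, confidence-rated variant that BWI uses; all names here are illustrative assumptions.

```python
import math


def adaboost(examples, labels, weak_learners, rounds):
    """Discrete AdaBoost. labels are +1/-1; each weak learner maps an
    example to +1/-1. Returns a list of (weight, hypothesis) pairs."""
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for x, y, w in zip(examples, labels, weights)
                       if h(x) != y)

        # The "weak learner": pick the hypothesis with the lowest
        # error under the current example weights.
        h = min(weak_learners, key=weighted_error)
        err = weighted_error(h)
        if err >= 0.5:
            break
        err = max(err, 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Emphasize the examples this hypothesis got wrong.
        weights = [w * math.exp(-alpha * y * h(x))
                   for x, y, w in zip(examples, labels, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def vote(ensemble, x):
    """Weighted vote of all hypotheses; the sign is the prediction."""
    return sum(alpha * h(x) for alpha, h in ensemble)


xs = [0, 1, 2, 3]
ys = [1, 1, -1, -1]
pool = [lambda x: 1 if x < 2 else -1, lambda x: 1]
model = adaboost(xs, ys, pool, rounds=3)
```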
Although this is the first application of BWI to
NER, boosting has previously been shown to work
well on this problem. Differing from BWI in the
details of the application, two recent papers nevertheless
demonstrate the effectiveness of the boosting
paradigm for NER in several languages (Carreras
et al., 2002; Wu et al., 2002), one of them achieving
the best overall performance in a comparison of
several systems (Sang, 2002).
Wildcard  Matches
Cap       Initial capital
AllCap    All capitals
Uncap     Initial lower case
Alpha     Entirely alphabetic characters
ANum      Entirely alpha-numeric characters
Punc      Punctuation
Num       Entirely numeric characters
Schar     Single alphabetic character
Any       Anything
Table 2: Default wildcards used in these experiments.
2.2 Boundary Detectors
The output of a single invocation of the weak learner
in BWI is always an individual pattern, called a
boundary detector. A detector has two parts, one
to match the text leading up to a boundary, the
other for trailing text. Each part is a list of zero
or more elements. In order for a boundary to
match a detector, the tokens preceding the bound-
ary (or following it) must match the corresponding
elements in sequence. For example, the detector
[ms .][jones] matches boundaries preceded
by the (case-normalized) two-token sequence “ms
.” and followed by the single token “jones”.
Detectors are grown iteratively, beginning with
an empty detector and repeatedly adding the ele-
ment that best increases the ability of the current
detector to discriminate true boundaries from false
ones, using a cost function sensitive to the exam-
ple weighting. A look-ahead parameter allows this
decision to be based on several additional context
tokens. The process terminates when no extensions
yield a higher score than the current detector.
2.3 Wildcards
The elements of the detector [ms .][jones] are
literal elements, which match tokens using case-
normalized string comparison. More interesting el-
ements can be introduced by defining token wild-
cards. Each wildcard defines some Boolean func-
tion over the space of tokens.
Table 2 lists the baseline wildcards. Using wild-
cards from this list, the example detector can be gen-
eralized to match a much broader range of bound-
aries (e.g., [ms <Any>][<Cap>]). By defin-
ing new wildcards, we can inject useful domain
knowledge into the inference process, potentially
improving the performance of the resulting extrac-
tor. For example, we might define a wildcard called
“Honorific” that matches any of “ms”, “mr”,
“mrs”, and “dr”.
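To make the matching semantics concrete, the following sketch implements literal elements, the wildcards of Table 2, and the hypothetical "Honorific" wildcard. The exact character tests behind each wildcard are our reading of the short descriptions in Table 2, and the representation of a detector as a pair of lists is an assumption.

```python
import string

# Each wildcard is a Boolean function over tokens. The precise tests
# are assumptions based on the descriptions in Table 2.
WILDCARDS = {
    "Cap": lambda t: t[:1].isupper(),
    "AllCap": lambda t: t.isalpha() and t.isupper(),
    "Uncap": lambda t: t[:1].islower(),
    "Alpha": lambda t: t.isalpha(),
    "ANum": lambda t: t.isalnum(),
    "Punc": lambda t: len(t) > 0 and all(c in string.punctuation for c in t),
    "Num": lambda t: t.isdigit(),
    "Schar": lambda t: len(t) == 1 and t.isalpha(),
    "Any": lambda t: True,
    # Domain-specific wildcard suggested in the text.
    "Honorific": lambda t: t.lower() in {"ms", "mr", "mrs", "dr"},
}


def element_matches(element, token):
    """Wildcards are written <Name>; anything else is a literal that
    is compared after case normalization."""
    if element.startswith("<") and element.endswith(">"):
        return WILDCARDS[element[1:-1]](token)
    return element == token.lower()


def detector_matches(detector, tokens, boundary):
    """detector = (prefix, suffix); boundary b lies between
    tokens[b - 1] and tokens[b]."""
    prefix, suffix = detector
    before = tokens[max(0, boundary - len(prefix)):boundary]
    after = tokens[boundary:boundary + len(suffix)]
    if len(before) != len(prefix) or len(after) != len(suffix):
        return False
    return (all(element_matches(e, t) for e, t in zip(prefix, before))
            and all(element_matches(e, t) for e, t in zip(suffix, after)))


tokens = ["Ms", ".", "Jones", "spoke"]
literal = (["ms", "."], ["jones"])
general = (["ms", "<Any>"], ["<Cap>"])
custom = (["<Honorific>", "<Punc>"], ["<Cap>"])
```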
2.4 Boundary Wildcards
In the original formulation of BWI, boundaries are
identified without reference to the location of the
opposing boundary. However, we might expect that
the end of a name, say, would be easier to identify if
we know where it begins. We can build detectors
that exploit this knowledge by introducing a spe-
cial wildcard (called Begin) that matches the be-
ginnings of names.
In these experiments, therefore, we modify
boundary detection in the following way. Instead
of two detector lists, we learn four—the two lists
as in the original formulation (call them $L_{\mathrm{begin}}$ and
$L_{\mathrm{end}}$), and two more lists ($L^*_{\mathrm{begin}}$ and $L^*_{\mathrm{end}}$). In
generating the latter two lists, we give the learner
access to these special wildcards (e.g., the wildcard
End in generating $L^*_{\mathrm{begin}}$).
At extraction time, $L_{\mathrm{begin}}$ and $L_{\mathrm{end}}$ are first
used to detect boundaries, as before. These detections
are then used to determine which tokens
match the “special” wildcards used by $L^*_{\mathrm{begin}}$ and
$L^*_{\mathrm{end}}$. Then, instead of pairing $L_{\mathrm{begin}}$ predictions
with those of $L_{\mathrm{end}}$, they are paired with those
made by $L^*_{\mathrm{end}}$ (and $L^*_{\mathrm{begin}}$ with $L_{\mathrm{end}}$). In informal
experiments, we found that this procedure
tended to increase F1 performance by several points
on a range of tasks. We adopt it uniformly in the
experiments reported here.
3 Co-Clustering
As in Brown et al. (1992), we seek a partition of
the vocabulary that maximizes the mutual infor-
mation between term categories and their contexts.
To achieve this, we use information theoretic co-
clustering (Dhillon et al., 2003), in which a space
of entities, on the one hand, and their contexts, on
the other, are alternately clustered to maximize mu-
tual information between the two spaces.
3.1 Background
The input to our algorithm is two finite sets of symbols,
say $X = \{x_1, x_2, \ldots, x_{n_x}\}$ (e.g., terms) and
$Y = \{y_1, y_2, \ldots, y_{n_y}\}$ (e.g., term contexts), together
with a set of co-occurrence count data consisting
of a non-negative integer $c_{x_i y_j}$ for every
pair of symbols $(x_i, y_j)$ from $X$ and $Y$. The output
is two partitions: $X^* = \{x^*_1, \ldots, x^*_{n_{x^*}}\}$ and
$Y^* = \{y^*_1, \ldots, y^*_{n_{y^*}}\}$, where each $x^*_i$ is a subset of
$X$ (a “cluster”), and each $y^*_i$ a subset of $Y$. The
co-clustering algorithm chooses the partitions $X^*$
and $Y^*$ to (locally) maximize the mutual information
between them, under a constraint limiting the
total number of clusters in each partition.
Recall that the entropy or Shannon information of
a discrete distribution is:
$$H_X = -\sum_{x} p(x) \log p(x). \tag{1}$$
This quantifies average improvement in one’s
knowledge upon learning the specific value of an
event drawn from $X$. It is large or small depending
on whether $X$ has many or few probable values.
The mutual information between random variables
$X$ and $Y$ can be written:
$$I_{XY} = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{2}$$
This quantifies the amount that one expects to learn
indirectly about $X$ upon learning the value of $Y$, or
vice versa.
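Both quantities can be estimated directly from count data. The sketch below is illustrative only; it uses base-2 logarithms (the text does not fix a base), and the function names and the dictionary representation of the co-occurrence counts are assumptions.

```python
import math
from collections import defaultdict


def entropy(counts):
    """Entropy (in bits) of the distribution given by a dict of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def mutual_information(cooc):
    """Mutual information (in bits) from a dict mapping (x, y) pairs
    to non-negative co-occurrence counts."""
    total = sum(cooc.values())
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), c in cooc.items():
        px[x] += c / total
        py[y] += c / total
    mi = 0.0
    for (x, y), c in cooc.items():
        if c > 0:
            pxy = c / total
            mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi


# Perfectly correlated symbols share one full bit of information;
# independent symbols share none.
correlated = {("a", "l"): 5, ("b", "r"): 5}
independent = {("a", "l"): 1, ("a", "r"): 1, ("b", "l"): 1, ("b", "r"): 1}
```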
3.2 The Algorithm
Let $X$ be a random variable over vocabulary terms
as found in some text corpus. We define $Y$ to
range over immediately adjacent tokens, encoding
co-occurrences in such a way as to distinguish left
from right occurrences.
Given co-occurrence matrices tabulated in this
way, we perform an approximate maximization of
$I_{X^*Y^*}$ using a simulated annealing procedure in
which each trial move takes a symbol $x$ or $y$ out
of the cluster to which it is tentatively assigned and
places it into another. Candidate moves are chosen
by selecting a non-empty cluster uniformly at ran-
dom, randomly selecting one of its members, then
randomly selecting a destination cluster other than
the source cluster. When temperature 0 is reached,
all possible moves are repeatedly attempted until no
further improvements are possible.
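The search can be sketched as follows. This is a simplified illustration under several assumptions that the text does not specify: a Metropolis acceptance rule, a caller-supplied temperature schedule, an equal chance of moving a term or a context, and recomputation of the objective from scratch after every move (a practical implementation would update it incrementally). At least two clusters per side are required.

```python
import math
import random
from collections import defaultdict


def cluster_mi(cooc, x_assign, y_assign):
    """Mutual information between the cluster variables induced by
    the two symbol-to-cluster assignments."""
    total = float(sum(cooc.values()))
    joint = defaultdict(float)
    for (x, y), c in cooc.items():
        joint[(x_assign[x], y_assign[y])] += c / total
    px, py = defaultdict(float), defaultdict(float)
    for (cx, cy), p in joint.items():
        px[cx] += p
        py[cy] += p
    return sum(p * math.log(p / (px[cx] * py[cy]))
               for (cx, cy), p in joint.items() if p > 0)


def propose_move(assign, n_clusters, rng):
    """Pick a non-empty cluster, one of its members, and a different
    destination cluster, each uniformly at random."""
    members = defaultdict(list)
    for sym, c in assign.items():
        members[c].append(sym)
    source = rng.choice(sorted(members))
    symbol = rng.choice(sorted(members[source]))
    dest = rng.choice([c for c in range(n_clusters) if c != source])
    return symbol, dest


def anneal(cooc, x_assign, y_assign, n_x, n_y, temps, moves_per_temp, seed=0):
    """Modifies the assignments in place and returns the final score."""
    rng = random.Random(seed)
    score = cluster_mi(cooc, x_assign, y_assign)
    for temp in temps:  # assumed schedule of positive temperatures
        for _ in range(moves_per_temp):
            assign, k = (x_assign, n_x) if rng.random() < 0.5 else (y_assign, n_y)
            symbol, dest = propose_move(assign, k, rng)
            old = assign[symbol]
            assign[symbol] = dest
            new = cluster_mi(cooc, x_assign, y_assign)
            if new >= score or rng.random() < math.exp((new - score) / temp):
                score = new
            else:
                assign[symbol] = old  # reject the move
    # Temperature 0: try every move until none improves the score.
    improved = True
    while improved:
        improved = False
        for assign, k in ((x_assign, n_x), (y_assign, n_y)):
            for symbol in sorted(assign):
                for dest in range(k):
                    old = assign[symbol]
                    if dest == old:
                        continue
                    assign[symbol] = dest
                    new = cluster_mi(cooc, x_assign, y_assign)
                    if new > score + 1e-12:
                        score, improved = new, True
                    else:
                        assign[symbol] = old
    return score


cooc = {("a1", "p"): 5, ("a2", "p"): 5, ("b1", "q"): 5, ("b2", "q"): 5}
xa = {"a1": 0, "a2": 1, "b1": 0, "b2": 1}
ya = {"p": 0, "q": 1}
final = anneal(cooc, xa, ya, n_x=2, n_y=2, temps=[], moves_per_temp=0)
```

With an empty temperature schedule only the final greedy phase runs; on this toy input it groups the two "a" terms together and the two "b" terms together.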
For efficiency and noise reduction, we first clus-
ter only the 5000 most frequent terms and context
terms. The remaining terms in the corpus vocabu-
lary are then added by assigning each term to the
cluster that maximizes the mutual information ob-
jective function.
4 Evaluation
We experimented with the MUC 6 named entity
data set, which consists of a training set of 318 doc-
uments, a validation set of 30 documents, and a test
set of 30 documents.
All documents are annotated to identify three
types of name (PERSON, ORGANIZATION,
LOCATION), two types of temporal expression
(DATE, TIME), and two types of numeric expression
(MONEY, PERCENT). It is common to report
performance in terms of precision, recall, and
their harmonic mean (F1), a convention to which we
adhere.
Field    Initial detector               Terminal detector
DATE     [][september]                  [in <Num>][<Punc>]
TIME     [][5 <ANum> . <ANum>]          [midnight][]
MONEY    [][$]                          [$ <Any> billion][]
PCT.     [<Alph> <Punc>][<Any> %]       [<Num> percentage <Alph>][]
PERSON   [mr <Any>][<Cap>]              [<Cap>][, vice]
ORG.     [][nissan]                     [inc <Any>][]
LOC.     [in][<Cap> , <Alph> <Punc>]    [germany][]
Table 3: Sample boundary detectors for the seven
MUC 6 fields produced by BWI using the baseline
feature set. An initial and terminal detector is shown
for each field.
4.1 Baseline
Using the wildcards listed in Table 2, we trained
BWI for 500 boosting iterations on each of the seven
entity fields. The output of each of these training
runs consists of $500 \times 2 = 1000$ boundary detectors.
Look-ahead was set to 3.
Table 3 shows a few of the boundary detectors
induced by this procedure. These detectors were
selected manually to illustrate the kinds of patterns
generated. Note how some of the detectors amount
to field-specific gazetteer entries. Others have more
interesting (and typically intuitive) structure. We
defer quantitative evaluation to the next section,
where a comparison with the cluster-enhanced ex-
tractors will be made.
4.2 Adding Cluster Features
The MUC 6 dataset was produced using articles
from the Wall Street Journal. In order to pro-
duce maximally relevant clusters, we used docu-
ments from the WSJ portion of the North Ameri-
can News corpus as input to co-clustering—some
119,000 documents in total. Note that there is a tem-
poral disparity between the MUC 6 corpus and this
clustering corpus, which has an undetermined im-
pact on performance.
Field    Initial detector      Terminal detector
PERS.    [<C95>][<C73>]        [<C144> <Any> <C106>][<Uncap>]
ORG.     [][<C178> express]    [bank <ANum> <C146>][]
LOC.     [][<C72> korea]       [<C160>][<Punc>]
Table 4: Sample boundary detectors for the seven
MUC 6 fields produced by BWI using the expanded
feature set.
Cluster  Most frequent members
72       general south north poor ...
73       john robert james david ...
95       says adds asks recalls ...
106      clinton dole johnson gingrich ...
144      mr ms dr sen ...
146      japan american china congress ...
160      washington texas california ...
178      american foreign local ...
Table 5: Most frequent members of clusters referenced
by detectors in Table 4.
We used this data to produce 200 clusters, as described
in Section 3. Treating each of these clusters
as an unlabeled gazetteer, we then defined
corresponding wildcards. For example, the
wildcard <C35> matches only terms belonging to
Cluster 35. In order to reduce the training time of a
given boundary learning problem, we tabulated the
frequency of wildcard occurrence within three tokens
of any occurrence of the target boundary and
omitted from training any wildcard testing true fewer
than ten times.1
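The frequency-based pruning of cluster wildcards might look like the sketch below. The function name, the input representation (token lists plus boundary positions), and the reading of "within three tokens" as three tokens on either side of the boundary are assumptions.

```python
from collections import Counter


def frequent_wildcards(documents, boundaries, term_cluster,
                       window=3, min_count=10):
    """documents: list of token lists; boundaries: for each document,
    the positions b of the target boundary (between tokens b - 1 and
    b); term_cluster: dict mapping a term to its cluster id.
    Returns the cluster wildcards that test true at least min_count
    times within `window` tokens of a target boundary."""
    counts = Counter()
    for tokens, bounds in zip(documents, boundaries):
        for b in bounds:
            for tok in tokens[max(0, b - window):b + window]:
                cluster = term_cluster.get(tok.lower())
                if cluster is not None:
                    counts[cluster] += 1
    return {f"<C{c}>" for c, n in counts.items() if n >= min_count}


# Toy data: ten copies of one short document with a boundary before
# "bush"; cluster ids follow Table 5.
docs = [["mr", "bush", "said"]] * 10
bounds = [[1]] * 10
clusters = {"mr": 144, "bush": 106}
kept = frequent_wildcards(docs, bounds, clusters)
```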
Table 4, which lists sample detectors from these
runs, includes some that are clearly impossible to
express using the baseline feature set. An exam-
ple is the first row, which matches a third-person
present-tense verb used in quote attribution, fol-
lowed by a first name (see Table 5). At the same
time, some of the new wildcards are employed triv-
ially, such as the use of <C178> in the field-initial
detector for the ORGANIZATION field.
Table 6 shows performance of the two variants
on the individual MUC 6 fields, tested over the
“dryrun” and “formal” test sets combined. In this
table, we scored each field individually using our
own evaluation software. An entity instance was
judged to be correctly extracted if a prediction precisely
identified its boundaries (ignoring “ALT” attributes).
Non-matching predictions and missed entities
were counted as false positives and false negatives,
respectively. We assessed the statistical significance
of precision and recall scores by computing
beta confidence intervals at the 95% level. In the
table, the higher precision or recall is in boldface if
its separation from the lower score is significant.
1For the TIME field, which occurs a total of six times in the
training set, this cut-off was a single occurrence.
Field     Variant  F1     Prec   Rec
DATE      Base     0.766  0.765  0.768
          Clust    0.782  0.776  0.789
TIME      Base     0.667  1.000  0.500
          Clust    0.667  1.000  0.500
MONEY     Base     0.938  0.926  0.949
          Clust    0.943  0.938  0.949
PERCENT   Base     0.922  0.855  1.000
          Clust    0.930  0.869  1.000
PERSON    Base     0.827  0.810  0.844
          Clust    0.892  0.859  0.927
ORG.      Base     0.587  0.811  0.460
          Clust    0.733  0.796  0.680
LOCATION  Base     0.726  0.675  0.785
          Clust    0.724  0.648  0.821
Table 6: Performance on the seven MUC 6 fields,
without (Base) and with (Clust) cluster-based features.
Significantly better precision or recall scores,
at the 95% confidence level, are in boldface.
Except for TIME and LOCATION, all fields ben-
efit from inclusion of the cluster features. TIME,
which is scarce in the training and test sets, is insen-
sitive to their inclusion. The effect on LOCATION
is more interesting. It shares in the general tendency
of cluster features to increase recall, but loses preci-
sion as a result.2 Although the increase in recall is
approximately the same as the loss in precision, the
F1 score, which is more heavily influenced by the
lower of precision and recall, drops slightly.
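The exact-match scoring and the behaviour of F1 on LOCATION can be checked with a few lines of code. This is an illustrative sketch, not the evaluation software used in the experiments; the function names and the representation of entities as (start, end) spans are assumptions.

```python
def harmonic_mean(p, r):
    """F1: the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def precision_recall_f1(predicted, gold):
    """Exact-boundary scoring: a prediction counts only if its span is
    identical to a gold span. Unmatched predictions are false
    positives; unmatched gold spans are false negatives."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall, harmonic_mean(precision, recall)


# LOCATION rows of Table 6: recall rises by about as much as
# precision falls, yet F1 drops slightly.
f1_base = harmonic_mean(0.675, 0.785)
f1_clust = harmonic_mean(0.648, 0.821)

# A toy document: one of three predictions matches one of two gold spans.
p, r, f = precision_recall_f1({(0, 2), (5, 6), (9, 9)}, {(0, 2), (5, 7)})
```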
While the effect of the cluster features on pre-
cision is inconsistent, they typically benefit recall.
This effect is most dramatic in the case of ORGA-
NIZATION, where, at the expense of a small drop in
precision, recall increases by more than 20 points.
The somewhat counter-intuitive improvements in
precision on some fields (particularly the significant
improvement on PERSON) are attributable to our
learning framework. Boosting for a sufficient number
of iterations forces a learner to account for all
boundary tokens through one or more detectors. To
the extent that the baseline’s features are unable to
account for as many of the boundary tokens, the
baseline learner is forced to learn a larger number of
over-specialized detectors that rely on questionable
patterns in the data. Depending on the task, these
detectors can lead to a larger proportion of false positives.
2Note, however, that none of the differences observed for
LOCATION are significant at the 95% level.
The relatively weak result for DATE comes as
a surprise. Inspection of the data leads us to at-
tribute this to two factors. On the one hand, there
is considerable temporal drift between the training
and test sets. Many of the dates are specific to
contemporaneous events; patterns based on specific
years, therefore, generalize in only a limited way.
At the same time, the notion of date, as understood
in the MUC 6 corpus, is reasonably subtle. Mean-
ing roughly “non-TIME temporal expression,” it in-
cludes everything from shorthand date expressions
to more interesting phrases, such as, “the first six
months of fiscal 1994.”
In passing we note a few potentially relevant id-
iosyncrasies in these experiments. Most significant
is a representational choice we made in tokenizing
the cluster corpus. In tallying frequencies we treated
all numeric expressions as occurrences of a special
term, “*num*”. Consequently, the tokens “1989”
and “10,000” are treated as instances of the same
term, and clustering has no opportunity to distin-
guish, say, years from monetary amounts.
The (perhaps) disappointing performance on the
relatively simple fields, TIME and PERCENT,
somewhat under-reports the strength of the learner.
As noted above, TIME occurs only very infre-
quently. Consequently, little training data is avail-
able for this field and mistakes (BWI missed one of
the three instances in the test set) have a large effect
on the TIME-specific scores. In the case of PER-
CENT, we ignored MUC instructions not to attempt
to recognize instances in tabular regions. One of
the documents contains a significant number of un-
labeled percentages in such a table. BWI duly rec-
ognized these—to the detriment of the reported pre-
cision.
4.3 MUC Evaluation
For comparison with numbers reported in the lit-
erature, we used the learned extractors to produce
mark-up and evaluated the result using the MUC 6
scorer. The MUC 6 evaluation framework differs
from ours in two key ways. Most importantly, all
entity types are to be processed simultaneously. We
benefit from this framework, since spurious predic-
tions for one entity type may be superseded by cor-
rect predictions for a related type. The opportunity
is greatest for the three name types; in inspecting
the false positives, we observed a number of confusions
among these fields.3 The MUC scorer is also
more lenient than ours, awarding points for extraction
of alternative strings and forgiving the inclusion
of certain functional tokens in the extracted text.
Field     Variant  F1    Prec  Rec
DATE      Base     0.91  0.91  0.91
          Clust    0.92  0.90  0.94
TIME      Base     0     0     0
          Clust    0     0     0
MONEY     Base     0.95  0.94  0.96
          Clust    0.95  0.95  0.96
PERCENT   Base     0.97  0.94  1.0
          Clust    1.0   1.0   1.0
PERSON    Base     0.88  0.91  0.86
          Clust    0.94  0.94  0.95
ORG.      Base     0.62  0.78  0.52
          Clust    0.79  0.84  0.74
LOCATION  Base     0.86  0.86  0.87
          Clust    0.86  0.80  0.92
ALL       Base     0.79  0.85  0.73
          Clust    0.87  0.88  0.86
Table 7: Performance on the markup task, as scored
by the MUC 6 scorer.
In moving to the multi-entity extraction setting,
the obvious approach is to collect predictions from
all extractors simultaneously. However, this re-
quires a strategy for dealing with overlapping pre-
dictions (e.g., a single text fragment labeled as both
a person and organization). We resolve such con-
flicts by preferring in each case the extraction with
the highest confidence. In order to render confi-
dence scores more comparable, we normalized the
weights of detectors making up each boundary clas-
sifier so they sum to one.
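A sketch of this conflict-resolution strategy follows. The tuple layout of a prediction and the function names are assumptions; the confidence values in the example are invented for illustration.

```python
def normalize_weights(detector_weights):
    """Rescale the detector weights of one boundary classifier so that
    they sum to one, making confidences comparable across fields."""
    total = sum(detector_weights)
    return [w / total for w in detector_weights]


def resolve_overlaps(predictions):
    """predictions: (confidence, start, end, field) tuples pooled from
    all field extractors, with inclusive token indices. Among
    overlapping predictions, only the most confident one is kept."""
    kept = []
    for pred in sorted(predictions, key=lambda p: p[0], reverse=True):
        _, start, end, _ = pred
        if all(end < s or start > e for _, s, e, _ in kept):
            kept.append(pred)
    return sorted(kept, key=lambda p: p[1])


pooled = [
    (0.4, 0, 1, "PERSON"),
    (0.6, 0, 1, "ORGANIZATION"),  # same span, higher confidence
    (0.3, 5, 5, "LOCATION"),
]
resolved = resolve_overlaps(pooled)
```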
A comparison of Table 7 with Table 6 suggests
the extent to which BWI benefits from the multi-
field mark-up setting. Note that, here, we used only
the “formal” test set for evaluation, in contrast with
the numbers in Table 6, which combine the two test
sets. The lift we observe from cluster features is also
in evidence here, and is most evident as an increase
in recall, particularly of PERSON and ORGANI-
ZATION. There is now also an increase in global
precision, attributable in large part to the benefit of
extracting multiple fields simultaneously.
The F1 score produced by BWI is comparable
to the best machine-learning-based results reported
elsewhere. Bikel et al. (1997) report summary
F1 of 0.93 on the same test set, but using
a model trained on 450,000 words. We count approximately
130,000 words in the experiments reported
here. The numbers reported by Bennett et
al. (1997) for PERSON, ORGANIZATION, and
LOCATION (F1 of 0.947, 0.815, and 0.925, respectively)
are slightly better than the numbers BWI
reaches on the same fields. Note, however, that the
features provided to their learner include syntactic
labels and carefully engineered semantic categories,
whereas we eschew knowledge- and labor-intensive
resources. This has important implications for the
portability of the approaches to new domains and
languages.
3For example, companies are occasionally named after people
(e.g., Liz Claiborne).
By taking a few post-processing steps, it is pos-
sible to realize further improvements. For example,
the learner occasionally identifies terms and phrases
which some simple rules can reliably reject. By sup-
pressing any prediction that consists entirely of a
stopword, we increase the precision of both ORGA-
NIZATION and LOCATION to 0.86 (from 0.84 and
0.80) and overall F1 to 0.88.
We can also exploit what Cucerzan and
Yarowsky (1999) call the one sense per discourse
phenomenon, the tendency of terms to have a fixed
meaning within a single document. By mark-
ing up unmarked strings that match extracted en-
tity instances in the same document, we can im-
prove the recall of some fields. We added this post-
processing step for the PERSON and ORGANI-
ZATION fields. This increased recall of PERSON
from 0.95 to 0.98 and of ORGANIZATION from
0.74 to 0.79 with minimal changes to precision and
a slight improvement in summary F1.
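The two post-processing steps can be sketched as follows. The function names, the (start, end, field) tuple layout with inclusive indices, and the treatment of multi-token predictions (suppressed only if every token is a stopword) are assumptions; the stopword list is left to the caller.

```python
def suppress_stopwords(predictions, tokens, stopwords):
    """Drop any prediction that consists entirely of stopwords."""
    return [(s, e, f) for s, e, f in predictions
            if not all(tokens[i].lower() in stopwords
                       for i in range(s, e + 1))]


def propagate_mentions(predictions, tokens,
                       fields=("PERSON", "ORGANIZATION")):
    """One sense per discourse: mark up unmarked strings in the same
    document that match an already extracted entity."""
    marked = set()
    for s, e, _ in predictions:
        marked.update(range(s, e + 1))
    result = list(predictions)
    for s, e, field in predictions:
        if field not in fields:
            continue
        phrase = tokens[s:e + 1]
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            span = range(i, i + n)
            if tokens[i:i + n] == phrase and not marked.intersection(span):
                result.append((i, i + n - 1, field))
                marked.update(span)
    return sorted(result)


tokens = ["Liz", "Claiborne", "said", "the", "firm",
          "backs", "Liz", "Claiborne", "."]
raw = [(0, 1, "ORGANIZATION"), (3, 3, "LOCATION")]
cleaned = suppress_stopwords(raw, tokens, {"the"})
final = propagate_mentions(cleaned, tokens)
```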
4.4 Analysis and Related Work
The promise of this general method—supervised
learning on a small training set using features derived
from a larger unlabeled set—lies in the support it
provides for rapid deployment in novel domains and
languages. Without relying on any linguistic re-
sources more advanced than a tokenizer and some
orthographic features, we can produce a NER mod-
ule using only a few annotated documents.
How few depends ultimately on the difficulty of
the domain. We might also expect the benefit of
distributional features to decrease with increasing
training set size. Figure 1 displays the F1 learning-curve
performance of BWI, both with and without
cluster features, on the two fields that benefit the
most from these features, PERSON and ORGANIZATION.
As expected, the difference appears to
be greatest on the low end of the horizontal axis
(although overfitting complicates the comparison). At
the same time, the improvement is fairly consistent
at all training set sizes. Either the baseline feature
set is ultimately too impoverished for this task, or,
more likely, the complete MUC 6 training set (318
documents) is small for this class of learner.
[Plot: formal test F1 (vertical axis, 0.1 to 0.9) against
number of training documents (horizontal axis, 0 to 350),
with curves for PERSON (clust.), PERSON (base),
ORG. (clust.), and ORG. (base).]
Figure 1: F1 as a function of training set size in
number of documents.
Techniques to lessen the need for annotation for
NER have received a fair amount of attention re-
cently. The prevailing approach to this problem is
a bootstrapping technique, in which, starting with a
few hand-labeled examples, the system iteratively
adds automatic labels to a corpus, training itself,
as it were. Examples of this are Cucerzan and
Yarowsky (1999), Thelen and Riloff (2002), and
Collins and Singer (1999).
These techniques address the same problem as
this paper, but are otherwise quite different from the
work described here. The labeling method (seed-
ing) is an indirect form of corpus annotation. The
promise of all such approaches is that, by starting
with a small number of seeds, reasonable results can
be achieved at low expense. However, it is difficult
to tell how much labeling corresponds to a given
number of seeds, since this depends on the cover-
age of the seeds. Note, too, that any bootstrapping
approach must confront the problem of instability;
poor initial decisions by a bootstrapping algorithm
can lead to large eventual performance degrada-
tions. We might expect a lightly supervised learner
with access to features based on a full-corpus anal-
ysis to yield more consistently strong results.
Of the three approaches mentioned above, only
Cucerzan and Yarowsky do not presuppose a syn-
tactic analysis of the corpus, so their work is perhaps
most comparable to this one. Of course, compar-
isons must be strongly qualified, given the different
labeling methods and data sets. Nevertheless, per-
formance of cluster-enhanced BWI at the low end
of the horizontal axis compares favorably with the
English F1 performance of 0.543 they report using
190 seed words. And, arguably, annotating 10-20
documents is no more labor intensive than assem-
bling a list of 190 seed words.
Strong corroboration for the approach advocated
in this paper is provided by Miller et al. (2004),
in which cluster-based features are combined with
a sequential maximum entropy model proposed in
Collins (2002) to advance the state of the art. In ad-
dition, using active learning, the authors are able to
reduce human labeling effort by an order of magni-
tude.
Miller et al. use a proprietary data set for training
and testing, so it is difficult to make a close
comparison of outcomes. At roughly comparable
training set sizes, they appear to achieve a score
of about 0.89 (F1) with a “conventional” HMM,
versus 0.93 using the discriminative learner trained
with cluster features (compared with 0.86 reached
by BWI). Both the HMM and Collins model are
constrained to account for an entire sentence in tag-
ging it, making determinations for all fields simulta-
neously, in contrast to the individual, local boundary
detections made by BWI. This characteristic proba-
bly accounts for the accuracy advantage they appear
to enjoy.
An interesting distinguishing feature of Miller
et al. is their use of hierarchical clustering. While
much is made of the ability of their approach to
accommodate different levels of granularity automatically,
no evidence is provided that the hierarchy
provides real benefit. At the same time, our work
shows that significant gains can be realized with a
single, sufficiently granular partition of terms. It is
known, moreover, that greedy agglomerative clus-
tering leads to partitions that are sub-optimal in
terms of a mutual information objective function
(see, for example, Brown et al. (1992)). Ultimately,
it is left to future research to determine how sensi-
tive, if at all, the NER gains are to the details of the
clustering.
5 Conclusion
There are several ways in which this work might be
extended and improved, both in its particular form
and in general:
• BWI models initial and terminal boundaries,
but ignores characteristics of the extracted
phrase other than its length. We are exploring
mechanisms for modeling relevant phrasal
structure.
• While global statistical approaches, such as
sequential averaged perceptrons or CRFs (McCallum
and Li, 2003), appear better suited to
the NER problem than local symbolic learners,
the two approaches search different hypothesis
spaces. Based on the surmise that, by combining
them, we can realize improvements over either
in isolation, we are exploring mechanisms
for integration.
• The distributional clusters we find are independent
of the problem to which we want to apply
them and may sometimes be inappropriate or
have the wrong granularity. We are exploring
ways to produce groupings that are sensitive to
the task at hand.
Our results clearly establish that an unsupervised
distributional analysis of a text corpus can produce
features that lead to enhanced precision and, espe-
cially, recall in information extraction. We have
successfully used these features in lieu of domain-
specific, labor-intensive resources, such as syntac-
tic analysis and special-purpose gazetteers. Distri-
butional analysis, combined with light supervision,
is an effective, stable alternative to bootstrapping
methods.
Acknowledgments
This material is based on work funded in whole or
in part by the U.S. Government. Any opinions, find-
ings, conclusions, or recommendations expressed in
this material are those of the authors, and do not
necessarily reflect the views of the U.S. Govern-
ment.

References
S.W. Bennett and C. Aone. 1997. Learning to tag multilingual texts through observation. In Proc. 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-2), August.
D.M. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proc. 5th Conference on Applied Natural Language Processing (ANLP-97), April.
P.F. Brown, V.J. Della Pietra, P.V. deSouza, J.C. Lai, and R.L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.
X. Carreras, L. Màrquez, and L. Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of CoNLL-2002, Taipei, Taiwan.
A. Clark. 2000. Inducing syntactic categories by context distribution clustering. In CoNLL 2000, September.
M. Collins and Y. Singer. 1999. Unsupervised models for named entity classification. In Proc. 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora.
M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP-2002.
S. Cucerzan and D. Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proc. 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, pages 90–99.
I. S. Dhillon, S. Mallela, and D. S. Modha. 2003. Information-theoretic co-clustering. Technical Report TR-03-12, Dept. of Computer Science, U. Texas at Austin.
D. Freitag and N. Kushmerick. 2000. Boosted wrapper induction. In Proc. 17th National Conference on Artificial Intelligence (AAAI-2000), August.
A. McCallum and W. Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proc. 7th Conference on Natural Language Learning (CoNLL-03).
S. Miller, J. Guinness, and A. Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of HLT/NAACL 04.
E. F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002.
R.E. Schapire and Y. Singer. 1998. Improved boosting algorithms using confidence-rated predictions. In Proc. 11th Annual Conference on Computational Learning Theory (COLT-98), pages 80–91, July.
H. Schütze. 1995. Distributional part-of-speech tagging. In Proc. 7th EACL Conference (EACL-95), March.
M. Thelen and E. Riloff. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proc. 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
D. Wu, G. Ngai, M. Carpuat, J. Larsen, and Y. Yang. 2002. Boosting for named entity recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan.
