How to get a Chinese Name (Entity): Segmentation and Combination Issues
Hongyan Jing Radu Florian Xiaoqiang Luo
Tong Zhang Abraham Ittycheriah
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
{hjing,raduf,xiaoluo,tzhang,abei}@us.ibm.com
Abstract
When building a Chinese named entity
recognition system, one must deal with
certain language-specific issues such as
whether the model should be based on
characters or words. While there is no
unique answer to this question, we discuss
in detail advantages and disadvantages of
each model, identify problems in segmen-
tation and suggest possible solutions, pre-
senting our observations, analysis, and
experimental results. The second topic
of this paper is classifier combination.
We present and describe four classifiers
for Chinese named entity recognition and
describe various methods for combining
their outputs. The results demonstrate that
classifier combination is an effective technique for improving system performance:
experiments over a large annotated corpus
of fine-grained entity types exhibit a 10%
relative reduction in F-measure error.
1 Introduction
Named entity (NE) recognition has drawn much at-
tention in recent years. It was a designated task
in a number of conferences, including the Mes-
sage Understanding Conferences (MUC-6, 1995;
MUC-7, 1997), the Information Retrieval and Ex-
traction Conference (IREX, 1999), the Conferences
on Natural Language Learning (Tjong Kim Sang,
2002; Tjong Kim Sang and De Meulder, 2003),
and the recent Automatic Content Extraction Con-
ference (ACE, 2002).
A variety of algorithms have been proposed for
NE recognition. Many of these algorithms are, in
principle, language-independent. However, when
applying these algorithms to languages such as
Chinese and Japanese, we must deal with certain language-specific issues. For example, should we build a character-based model or a word-based model? How do word segmentation errors affect NE recognition? How should word segmentation and NE recognition interact with each other? Besides word
segmentation related issues, Chinese does not have
capitalization, which is a very useful feature in iden-
tifying NEs in languages such as English, Spanish,
or Dutch. How does the lack of features such as cap-
italization affect the performance?
In the first part of this paper, we discuss these
language-specific issues in Chinese NE recogni-
tion. In particular, we use a hidden Markov model
(HMM) system as an example, and discuss various
issues related to applying the HMM classifier to Chi-
nese. The HMM classifier is similar to the one de-
scribed in (Bikel et al., 1999).
In the second part of this paper, we investigate
the combination of a set of diverse NE recognition
classifiers. Four statistical classifiers are combined
in the experiments, including the above-mentioned
hidden Markov model classifier, a transformation-
based learning classifier (Brill, 1995; Florian and
Ngai, 2001), a maximum entropy classifier (Ratna-
parkhi, 1999), and a robust risk minimization classi-
fier (Zhang et al., 2002).
The remainder of this paper is organized as fol-
lows: Section 2 describes the experiment data, Sec-
tion 3 discusses specific issues related to Chinese
NE recognition, Section 4 presents the four classi-
fiers and approaches to combining these classifiers.
2 Data
We used three annotated Chinese corpora in our ex-
periments.
The IBM-FBIS Corpus
The Foreign Broadcast Information Service (FBIS)
offers an extensive collection of translations and
transcriptions of open source information monitored
worldwide on diverse topics such as military af-
fairs, politics, economics, and science and technol-
ogy. The IBM-FBIS corpus consists of approxi-
mately 3,000 Chinese articles obtained from FBIS
(about 3.2 million Chinese characters in total). This
corpus was tagged by a native Chinese speaker with
32 NE categories, such as person, location, organi-
zation, country, people, date, time, percentage, car-
dinal, ordinal, product, substance, and salutation.
There are approximately 300,000 NEs in the entire
corpus, 16% of which are labeled as person, 16% as
organization, and 11% as location.
The IBM-CT Corpus
The Chinese Treebank (Xia et al., 2000), avail-
able from Linguistic Data Consortium, consists of
a 100,000 word (approximately 160,000 characters)
corpus annotated with word segmentation, part-of-
speech tags, and syntactic bracketing. It includes
325 articles from Xinhua newswire between 1994
and 1998. The same Chinese annotator who worked
on the above IBM-FBIS data also annotated the Chi-
nese Treebank data with NE information, henceforth
the IBM-CT corpus, using the same 32 categories as
mentioned above.
The IEER data
The National Institute of Standards and Technology organized the Information Extraction – Entity
Recognition (IEER) evaluation, which involves en-
tity recognition from textual information sources in
both English and Mandarin. The Mandarin training
data consists of approximately 10 hours of broad-
cast news transcripts comprised of approximately
390 stories. The test data also contains transcripts of
broadcast news1. The training data includes approx-
imately 170,000 characters and the test data includes
approximately 6,500 characters. Ten categories of
NEs were annotated, such as person, location, orga-
nization, date, duration, and measure.
1Other types of test data were also used in IEER evaluation,
including newswire text and real automatic speech recognition
transcripts, but we did not use them in our experiments.
3 Language-Specific Issues in Chinese NE
Recognition
Chinese does not have delimiters between words,
so a key design issue in Chinese NE recognition
is whether to build a character-based model or a
word-based model. In this section, we use a hid-
den Markov model NE recognition system as an ex-
ample to discuss language-specific issues in Chinese
NE recognition.
3.1 The Hidden Markov Model Classifier
NE recognition can be formulated as a classification
task, where the goal is to label each token with a
tag indicating whether it belongs to a specific NE
or is not part of any NE. The HMM classifier used
in our experiments follows the algorithm described
in (Bikel et al., 1999). It performs sequence clas-
sification by assigning each token either one of the
NE types or the label “O” to represent “outside any
NE”. The states in the HMM are organized into re-
gions, one region for each type of NE plus one for
“O”. Within each of the regions, a statistical lan-
guage model is used to compute the likelihood of
words occurring within that region. The transition
probabilities are smoothed by deleted interpolation,
and the decoding is performed using the Viterbi al-
gorithm.
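The decoding step can be illustrated with a minimal sketch. It assumes a plain first-order HMM with one state per NE type plus "O", which is a simplification of the region-based model of Bikel et al. (1999): the within-region language models and the deleted-interpolation smoothing are not shown. The toy probabilities and the names `viterbi`, `logs`, `STATES`, `LOG_START`, `LOG_TRANS`, and `LOG_EMIT` are hypothetical and serve only as an illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the highest-scoring state sequence for the observations.

    log_start[s], log_trans[p][s] and log_emit[s][o] hold log-probabilities;
    symbols missing from log_emit[s] are treated as impossible in state s.
    """
    if not obs:
        return []
    # best[t][s]: best log-score of any path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s].get(obs[0], NEG_INF) for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at position t
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s].get(o, NEG_INF)
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def logs(probs):
    """Convert a dict of probabilities into log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

# Toy character-based model (hypothetical numbers, not estimated from data).
STATES = ["PER", "O"]
LOG_START = logs({"PER": 0.5, "O": 0.5})
LOG_TRANS = {"PER": logs({"PER": 0.5, "O": 0.5}), "O": logs({"PER": 0.2, "O": 0.8})}
LOG_EMIT = {"PER": logs({"王": 0.5, "明": 0.5}), "O": logs({"说": 0.9, "明": 0.1})}

tags = viterbi(list("王明说"), STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

In this toy setting the first two characters are tagged as a person name and the third as "O"; a real system would estimate all three tables from annotated data.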
3.2 Character-Based, Word-Based, and
Class-Based Models
To build a model for identifying Chinese NEs, we
need to determine the basic unit of the model: char-
acter or word. On one hand, the word-based model
is attractive since it allows the system to inspect a
larger window of text, which may lead to more informative decisions. On the other hand, a word segmenter is error-prone, and its errors may propagate and result in errors in NE recognition.
Two systems, a character-based HMM model and
a word-based HMM model, were built for compar-
ison. The word segmenter used in our experiments
relies on dictionaries and surrounding words in lo-
cal context to determine the word boundaries. Dur-
ing training, the NE boundaries were provided to the
word segmenter; the latter is restricted to enforce
word boundaries at each entity boundary. Therefore,
at training time, the word boundaries are consistent
with the entity boundaries. At test time, however,
the segmenter could create words which do not agree
with the gold-standard entity boundaries.
Corpus    Model      Prec     Rec      F (β=1)
IBM-FBIS  Character  74.36%   80.24%   77.19
IBM-FBIS  Word       72.46%   75.97%   74.17
IBM-FBIS  Class      72.74%   76.20%   74.43
IEER      Character  74.57%   78.01%   76.25
IEER      Word       77.51%   65.22%   70.83
IEER      Class      77.21%   64.36%   70.20
Table 1: Performance of the character-based HMM
model, the word-based HMM model, and the class-
based HMM model. (The precision, recall, and F-
measure presented in this table and throughout this
paper are based on correct identification of all the
attributes of an NE, including boundary, content,
and type.)
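For reference, the F-measure reported in the tables is the harmonic mean of precision and recall (with β = 1). The sketch below shows the standard formula; the function name `f_measure` is ours, and the input values are copied from the first row of Table 1.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (both given in percent)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Character-based model on IBM-FBIS (precision and recall from Table 1).
fbis_char_f = f_measure(74.36, 80.24)
```

Because the published precision and recall are themselves rounded, recomputed values may differ from the printed F-measure in the last digit for some rows.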
The performance of the character-based model and of the word-based model is shown in Table 1. The
two corpora used in the evaluation, the IBM-FBIS
corpus and the IEER corpus, differ greatly in data
size and the number of NE types. The IBM-FBIS
training data consists of 3.1 million characters and
the corresponding test data has 270,000 characters.
As we can see from the table, for both corpora, the
character-based model outperforms the word-based
model, with a lead of roughly 3.0 (IBM-FBIS) to 5.4 (IEER) points in F-measure. The performance gap between the two models is larger for the IEER data than for the IBM-FBIS data.
We also built a class-based NE model. After word
segmentation, class tags such as number, chinese-
name, foreign-name, date, and percent are used to
replace words belonging to these classes. Whether
a word belongs to a specific class is identified by
a rule-based normalizer. The performance of the
class-based HMM model is also shown in Table 1.
For the IBM-FBIS corpus, the class-based model
outperforms the word-based model; for the IEER
corpus, the class-based model is worse than the
word-based model. In both cases, the performance
difference between the word-based model and the
class-based model is very small. The character-
based model outperforms the class-based model in
both tests.
A more careful analysis indicates that although
the word-based model performs worse than the
character-based model overall in our evaluation, it
performs better for certain NE types. For instance,
the word-based model has a better performance
for the organization category than the character-
based model in both tests. While the character-
based model has an F-measure of 65.07 (IBM-FBIS)
and 64.76 (IEER) for the organization category,
the word-based model achieves F-measure scores of
69.14 (IBM-FBIS) and 72.38 (IEER) respectively.
One reason may be that organization names tend to
contain many characters, and since the word-based
model allows the system to analyze a larger window
of text, it is more likely to make a correct guess.
We can integrate the character-based model and the
word-based model by combining the decisions from
the two models. For instance, if we use the de-
cisions of the word-based model for the organiza-
tion category, but use the decisions of the character-
based model for all the other categories, the over-
all F-measure goes up to 76.91 for the IEER data,
higher than using either the character-based or word-
based model alone. Another way to integrate the
two models is to use a hybrid model – starting with
a word-based model and backing off to character-
based model if the word is unknown.
3.3 Granularity of Word Segmentation
We believe that one main reason for the lower per-
formance of the word-based model is that the word
granularity defined by the word segmenter is not
suitable for the HMM model to perform the NE
recognition task. What exactly constitutes a Chinese
word has been a topic of major debate. We are in-
terested in what is the best word granularity for our
particular task.
To illustrate the word granularity problem for NE
tagging, we take person names as an example. Our
word segmenter marks a person’s name as one word,
consistent with the convention used by the Chinese
treebank and many other word segmentation sys-
tems. While this may be useful in other applications,
it is certainly not a good choice for our NE model.
Chinese names typically contain two or three char-
acters, with family name preceding first name. Only
a limited set of characters are used as family names,
while the first name can be any character(s). There-
fore, the family name is a very important and use-
ful feature in identifying an NE in the person cate-
gory. By combining the family name and the first
name into one word, this important feature is lost to
the word-based model. In our tests, the word-based
model performs much worse for the person category
than the character-based model. We believe that, for
the purpose of NE recognition, it is better to separate
the family name from the first name in word segmen-
tation, although this is not the convention used in the
Chinese treebank.
Other examples include the segmentation of
words indicating dates, countries, locations, percent-
ages, measures, and ordinals. For instance, “July
4th” is expressed by four characters “7th month 4th
day” in Chinese. The word segmenter marks the
four characters as a single word; however, the sec-
ond and the last character are actually good features
for indicating date, since the dates are usually ex-
pressed using the same structure (e.g., “March 25th”
is expressed by “3rd month 25th day” in Chinese).
For reasons similar to the above, we believe that it
is better to separate characters representing “month”
and “day”, rather than combining the four characters into one word. A similar problem can be observed in English with tokens such as “61-year-old man” if one is interested in identifying a person’s age, in which case “year” and “old” are good features for prediction.
The above analysis suggests that a better way to
apply a word segmenter in an NE system is to first
adapt the segmenter so that the segmentation granu-
larity is more appropriate to the particular task and
model. As a guideline, characters that are good features for identifying NEs should not be combined with other characters into a word. Additional examples include characters expressing “percent” and characters representing monetary measures.
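One simple way to approximate this guideline is to post-process the segmenter output so that cue characters always form their own tokens. The sketch below is only an illustration under our own assumptions: the cue set `CUE_CHARS` (month, day, and percent signs) and the function `split_on_cues` are hypothetical, and a practical system would need a more careful, context-sensitive rule set.

```python
# Hypothetical cue characters: 月 (month), 日 (day), and percent signs.
CUE_CHARS = set("月日%％")

def split_on_cues(words):
    """Re-segment a word list so that every cue character is its own token."""
    out = []
    for word in words:
        buf = ""
        for ch in word:
            if ch in CUE_CHARS:
                if buf:            # flush the characters collected so far
                    out.append(buf)
                    buf = ""
                out.append(ch)     # the cue character stands alone
            else:
                buf += ch
        if buf:
            out.append(buf)
    return out

# "July 4th" segmented as a single word by the original segmenter.
july_4th = split_on_cues(["7月4日"])
```

With this re-segmentation, the characters for "month" and "day" are exposed to the word-based model as separate, recurring tokens.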
3.4 The Effect of Segmentation Errors
Word segmentation errors can lead to mistakes in
NE recognition. Suppose an NE consists of four characters $(C_1, C_2, C_3, C_4)$; if the word segmentation merges $C_1$ with a character preceding it,
then this NE cannot be correctly identified by the
word-based model since the boundary will be incor-
rect. Besides inducing NE boundary errors, incor-
rect word segmentation also leads to wrong match-
ings between training examples and testing exam-
ples, which may result in mistakes in identifying en-
tities.
We computed the upper bound for the word-based
model for the IBM-FBIS test presented in Table 1.
The upper bound of performance is computed by dividing the number of NEs whose boundaries are also recognized as word boundaries by the word segmenter by the total number of NEs in the corpus; we take this ratio as the upper bound on precision, recall, and therefore F-measure. For the IBM-FBIS test data in Table 1, the upper bound of the word-based model is 95.7 F-measure.
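The upper-bound computation can be sketched as follows, assuming that both entities and words are given as character-offset spans with exclusive end offsets. The span representation and the function name `boundary_upper_bound` are our assumptions for illustration.

```python
def boundary_upper_bound(entity_spans, word_spans):
    """Fraction of entities whose start and end both coincide with word boundaries.

    Spans are (start, end) character offsets with `end` exclusive.
    """
    if not entity_spans:
        return 0.0
    word_starts = {start for start, _ in word_spans}
    word_ends = {end for _, end in word_spans}
    recoverable = sum(
        1 for start, end in entity_spans
        if start in word_starts and end in word_ends
    )
    return recoverable / len(entity_spans)

# Three words covering offsets 0-6; the second entity ends inside the last word.
bound = boundary_upper_bound([(0, 2), (3, 5)], [(0, 2), (2, 3), (3, 6)])
```

An entity counted as unrecoverable here cannot be produced by a word-based tagger, whatever labels it assigns.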
We also did the following experiment to measure
the effect of word segmentation errors: we gave
the boundaries of NEs in the test data to the word
segmenter and forced it to mark entity boundaries
as word boundaries. This eliminates the word seg-
mentation errors that inevitably result in NE bound-
ary errors. For the IBM-FBIS data, the word-based
HMM model achieves 76.60 F-measure when the
entity boundaries in the test data are given, and the
class-based model achieves 77.77 F-measure, higher
than the 77.19 F-measure by the character-based
model in Table 1. For the IEER data, the F-measure
of the word-based model improves from 70.83 to
73.74 when the entity boundaries are given, and the
class-based model improves from 70.20 to 72.47.
This suggests that with the improvement in Chi-
nese word segmentation, the word-based model may
achieve comparable or better performance than the
character-based model.
3.5 Lexical Features
Capitalization in English gives good evidence of
names. Our HMM classifier for English uses a set
of word-features to indicate whether a word contains all capitalized letters, only digits, or capitalized letters and a period, as described in (Bikel et al., 1999). However, Chinese does not have capitalization. When we applied the HMM system to Chinese, we retained such features since Chinese text also includes digits and roman words (such as in product or company names). In an attempt to investigate the
usefulness of such features for Chinese, we removed
them from the system and observed very little dif-
ference in overall performance (0.4 difference in F-
measure).
3.6 Sensitivity to Corpus and Training Size
Variation
To test the robustness of the model, we trained the
system on the 100,000 word IBM-CT data and tested
on the same IBM-FBIS data. The character-based
model achieves 61.36 F-measure and the word-
based model achieves 58.40 F-measure, compared
to 77.19 and 74.17, respectively, using the 20 times
larger IBM-FBIS training set. This represents an ap-
proximately 20% relative reduction in performance
when trained on a related yet different and consider-
ably smaller training set. We plan to investigate fur-
ther the relation between corpus type and size and
performance.
4 Classifier Combination
This section investigates the combination of a set of
classifiers for NE recognition. We first introduce the
classifiers used in our experiments and then describe
the combination methods.
4.1 The Classifiers
Besides the HMM classifier mentioned in the previ-
ous section, the following three classifiers were used
in the experiments.
4.1.1 The Transformation-Based Learning
(fnTBL) Classifier
Transformation-based learning is an error-driven algorithm which has two major steps: it starts by assigning some classification to each example, and then automatically proposes, evaluates and selects the classification changes that maximally decrease the number of errors.
TBL has some attractive qualities that make it suitable for language-related tasks: it can automatically integrate heterogeneous types of knowledge, without the need for explicit modeling (similar to Snow (Dagan et al., 1997), Maximum Entropy, decision trees, etc.); and it is error-driven, thus directly minimizing the ultimate evaluation measure: the error rate. The TBL toolkit used in this experiment is described in (Florian and Ngai, 2001).
4.1.2 The Maximum Entropy Classifier
(MaxEnt)
The model used here is based on the maxi-
mum entropy model used for shallow parsing (Rat-
naparkhi, 1999). A sentence with NE tags is
converted into a shallow tree: tokens not in any
NE are assigned an “O” tag, while tokens within
an NE are represented as constituents whose la-
bel is the same as the NE type. For exam-
ple, the annotated sentence “I will fly to (LO-
CATION New York) (DATEREF tomorrow)” is
represented as a tree “(S I/O will/O fly/O to/O
(LOCATION New/LOCATION York/LOCATION)
(DATEREF tomorrow/DATEREF) )”. Once an NE
is represented as a shallow tree, NE recognition can
be realized by performing shallow parsing.
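The conversion from an annotated sentence to the bracketed shallow tree can be sketched as below. The input representation (a token list plus non-overlapping `(start, end, label)` spans with exclusive ends) and the function name `to_shallow_tree` are our assumptions; the output format follows the example given above.

```python
def to_shallow_tree(tokens, entities):
    """Render tokens and NE spans as a bracketed shallow tree string.

    entities: non-overlapping (start, end, label) token spans, end exclusive.
    """
    by_start = {start: (end, label) for start, end, label in entities}
    parts = []
    i = 0
    while i < len(tokens):
        if i in by_start:
            end, label = by_start[i]
            inner = " ".join(f"{tok}/{label}" for tok in tokens[i:end])
            parts.append(f"({label} {inner})")   # one constituent per NE
            i = end
        else:
            parts.append(f"{tokens[i]}/O")       # token outside any NE
            i += 1
    return "(S " + " ".join(parts) + " )"

tree = to_shallow_tree(
    "I will fly to New York tomorrow".split(),
    [(4, 6, "LOCATION"), (6, 7, "DATEREF")],
)
```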
We use the tagging and chunking model described
in (Ratnaparkhi, 1999) for shallow parsing. In the
tagging model, the context consists of a window of
five tokens (including the token being tagged and
two tokens to its left and two tokens to its right) and
two tags to the left of the current token. Five groups
of feature templates are used: token unigram, token
bigram, token trigram, tag unigram and tag bigram
(all within the context window). In the chunking
model, the context is limited to a window of three
subtrees: the previous, current and next subtree. Un-
igram and bigram chunk (or tag) labels are used as
features.
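To make the tagging context concrete, the sketch below enumerates one plausible instantiation of the five template groups for a single token. The exact templates of Ratnaparkhi (1999) are not listed here, so the feature names, the padding symbols, and the choice of all contiguous n-grams inside the window are our assumptions.

```python
def tagging_features(tokens, prev_tags, i):
    """Features for token i: n-grams over a 5-token window plus the 2 previous tags."""
    def tok(j):
        return tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    window = [tok(i + d) for d in range(-2, 3)]           # offsets -2 .. +2
    t1 = prev_tags[-1] if len(prev_tags) >= 1 else "<S>"  # tag at i-1
    t2 = prev_tags[-2] if len(prev_tags) >= 2 else "<S>"  # tag at i-2

    feats = []
    for k in range(5):    # token unigrams
        feats.append(f"w[{k - 2}]={window[k]}")
    for k in range(4):    # token bigrams
        feats.append(f"w[{k - 2},{k - 1}]={window[k]}|{window[k + 1]}")
    for k in range(3):    # token trigrams
        feats.append(f"w[{k - 2}..{k}]={window[k]}|{window[k + 1]}|{window[k + 2]}")
    feats.append(f"t[-1]={t1}")           # tag unigrams
    feats.append(f"t[-2]={t2}")
    feats.append(f"t[-2,-1]={t2}|{t1}")   # tag bigram
    return feats

sentence = "I will fly to New York tomorrow".split()
feats = tagging_features(sentence, ["O", "O", "O", "O", "LOCATION"], 5)
```

For the token "York", the window covers "to New York tomorrow" plus one padding symbol, and the two previous tags are "O" and "LOCATION".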
4.1.3 The Robust Risk Minimization (RRM)
Classifier
This system is a variant of the text chunking system described in Zhang et al. (2002), where the NE recognition problem is regarded as a sequential token-based tagging problem. We denote by $\{w_i\}$ ($i = 0, 1, \ldots, m$) the sequence of tokenized text, which is the input to our system. In token-based tagging, the goal is to assign a class-label $t_i$ to every token $w_i$.
In our system, this is achieved by estimating the conditional probability $P(t_i = c \mid x_i)$ for every possible class-label value $c$, where $x_i$ is a feature vector associated with token $i$. The feature vector $x_i$ can depend on previously predicted class-labels $\{t_j\}_{j<i}$, but the dependency is typically assumed to be local. Given such a conditional probability model, in the decoding stage, we estimate the best possible sequence of $t_i$'s using a dynamic programming approach.
In our system, the conditional probability model has the following parametric form:
$$P(t_i = c \mid x_i, \{t_{i-\ell}, \ldots, t_{i-1}\}) = T(w_c^T x_i + b_c),$$
where $T(y) = \min(1, \max(0, y))$ is the truncation of $y$ into the interval $[0, 1]$. $w_c$ is a linear weight vector and $b_c$ is a constant. Parameters $w_c$ and $b_c$ can be estimated from the training data.
This classification method is based on approxi-
mately minimizing the risk function. The general-
ized Winnow method used in (Zhang et al., 2002)
implements such a robust risk minimization method.
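The parametric form above amounts to a linear score per class that is clipped into the interval [0, 1], i.e. T(y) = min(1, max(0, y)). A minimal sketch of the scoring step follows; the dense-list feature representation and the function names are our assumptions, and the estimation of the weights (the generalized Winnow training) is not shown.

```python
def truncate(y: float) -> float:
    """Clip y into the interval [0, 1]."""
    return min(1.0, max(0.0, y))

def class_probability(weights, bias, features):
    """Truncated linear score for one class: T(w . x + b)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return truncate(score)

def best_class(model, features):
    """Pick the class with the highest truncated score.

    model: dict mapping class label -> (weight vector, bias); hypothetical layout.
    """
    return max(model, key=lambda c: class_probability(*model[c], features))

toy_model = {"PER": ([0.5, 0.25], 0.0), "O": ([0.0, 0.25], 0.0)}
p_per = class_probability(*toy_model["PER"], [1.0, 1.0])
```

In the full system the per-token scores feed a dynamic programming search over tag sequences rather than the independent argmax shown in `best_class`.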
4.1.4 Performance of the Classifiers
We compared the performance of the four classi-
fiers by training and testing them on the same data
sets. We divided the IBM-FBIS corpus into three
subsets: 2.8 million characters for training, 330,000
characters for development testing, and 330,000
characters for testing. Table 2 shows the results of
each classifier for the development test set and the
evaluation set. The RRM and fnTBL classifiers are the best performers for the test set, followed by MaxEnt. The HMM classifier lags behind the best system by around 6 to 7 points in F-measure. The presented results are for character-based models.

            Development Test                   Test
            Precision  Recall   F-measure     Precision  Recall   F-measure
Baseline1   17.52%     24.92%   20.58         14.04%     20.52%   16.67
Baseline2   77.86%     45.26%   57.24         71.06%     40.22%   51.37
HMM         71.38%     80.92%   75.85         71.73%     77.55%   74.53
MaxEnt      83.82%     78.08%   80.85         85.43%     75.22%   80.00
fnTBL       76.85%     81.55%   80.08         82.06%     80.68%   81.36
RRM         82.83%     81.06%   81.89         84.53%     78.64%   81.48
Table 2: Baselines and performance of the four classifiers.

For comparison, we also computed two baselines:
one in which each character is labeled with its most
frequent label (Baseline1 in Table 2), and one in
which each entity that was seen in training data is
labeled with its most frequent classification (Base-
line2 in Table 2 - this baseline is computed using
the software provided with the CoNLL-2003 shared
task (Tjong Kim Sang and De Meulder, 2003)).
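The first baseline can be sketched in a few lines. The training data layout (a flat list of `(character, label)` pairs), the fallback label "O" for unseen characters, and the function names are our assumptions.

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_chars):
    """Map every character to the label it carries most often in training data."""
    counts = defaultdict(Counter)
    for ch, label in tagged_chars:
        counts[ch][label] += 1
    return {ch: c.most_common(1)[0][0] for ch, c in counts.items()}

def predict(model, chars, default="O"):
    """Label each character independently; unseen characters get `default`."""
    return [model.get(ch, default) for ch in chars]

training = [("北", "LOC"), ("京", "LOC"), ("北", "O"), ("北", "LOC"), ("的", "O")]
model = train_most_frequent(training)
labels = predict(model, "北京的人")
```

Because each character is labeled in isolation, this baseline ignores context entirely, which is consistent with its low scores in Table 2.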
4.2 Combination
The four classifiers differ in multiple dimensions,
making them good candidates for combination. We
explored various ways to combine the results from
the four classifiers.
4.2.1 Combination algorithms
For the first part of the experimental setup, we consider the following classification framework: given the (probabilistic) output $P_i(\cdot \mid w)$ of $n$ classifiers $\mathcal{C}_1, \ldots, \mathcal{C}_n$, the classifier combination problem can be viewed as a probability interpolation problem: compute the class probability distribution conditioned on the joint classifier output:
$$P(C \mid w, C_1^n) = f\big( (P_i(C \mid w, C_i))_{i=1 \ldots n} \big) \qquad (1)$$
where $C_i$ is the $i$th classifier's output, $w$ is an observable context (e.g., a word trigram) and $f$ is a combination function. A commonly used combining scheme is through linear interpolation of the classifiers' class probability distributions:
$$P(C \mid w, C_1^n) \approx \sum_{i=1}^{n} P(C \mid w, i, C_i) \cdot P(i \mid w) = \sum_{i=1}^{n} P_i(C \mid w, C_i) \cdot \lambda_i(w) \qquad (2)$$
The weights $\lambda_i(w)$ encode the importance given to classifier $i$ in combination, for the context $w$, and $P_i(C \mid w, C_i)$ is an estimation of the probability that the correct classification is $C$, given that the output of the classifier $i$ on context $w$ is $C_i$. These parameters from Equation (2) can be estimated, if needed, on development data.
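The linear interpolation of Equation (2) can be sketched as a weighted sum of per-classifier class distributions. The dictionary representation and the function name `interpolate` are our assumptions; the weights here are context-independent for simplicity.

```python
def interpolate(distributions, weights):
    """Weighted sum of class distributions, one distribution per classifier.

    distributions: list of dicts mapping class label -> probability.
    weights: one interpolation weight per classifier.
    """
    combined = {}
    for dist, lam in zip(distributions, weights):
        for label, prob in dist.items():
            combined[label] = combined.get(label, 0.0) + lam * prob
    return combined

# Two classifiers with equal weights (toy numbers).
combined = interpolate(
    [{"PER": 0.5, "O": 0.5}, {"PER": 1.0}],
    [0.5, 0.5],
)
winner = max(combined, key=combined.get)
```

If the weights sum to one and every input is a proper distribution, the result is again a probability distribution over classes.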
Table 3 presents the combination results, for
different ways of estimating the interpolation pa-
rameters. A simple combination method is the
equal voting method (van Halteren et al., 2001;
Tjong Kim Sang et al., 2000), where the parameters are computed as $\lambda_i(w) = \frac{1}{n}$ and $P_i(C \mid w, C_i) = \delta(C, C_i)$, $\delta$ being the Kronecker operator:
$$\delta(x, y) = \begin{cases} 1 & x = y \\ 0 & x \neq y \end{cases}$$
In other words, each of the classifiers votes with equal weight for the class that is most likely under its model, and the class receiving the largest number of votes wins (i.e., it is selected as the classification output). However, this procedure may lead to ties, where some classifications receive an identical number of votes; one usually resorts to randomly selecting one of the tied candidates in this case. Table 3 presents the average results obtained by this method, together with the variance obtained over 30 trials. To make the decision deterministically, the weights associated with the classifiers can be chosen as $\lambda_i(w) = 1 - P_i(\mathit{error})$. In this method,
presented in Table 3 as weighted voting, better per-
forming classifiers will have a higher impact on the
final classification.
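Both voting schemes can be sketched with one function: equal voting uses the weight 1/n for every classifier, while weighted voting uses one minus the classifier's error rate. The function name `vote`, the random tie-breaking via an optional `rng`, and the toy error rates are our assumptions.

```python
import random
from collections import defaultdict

def vote(outputs, weights=None, rng=None):
    """Combine one predicted label per classifier by (weighted) voting.

    With weights=None every classifier gets weight 1/n (equal voting).
    Ties are broken by a random choice among the tied labels.
    """
    if weights is None:
        weights = [1.0 / len(outputs)] * len(outputs)
    scores = defaultdict(float)
    for label, lam in zip(outputs, weights):
        scores[label] += lam
    top = max(scores.values())
    tied = sorted(label for label, s in scores.items() if s == top)
    if len(tied) == 1:
        return tied[0]
    return (rng or random).choice(tied)

equal = vote(["PER", "PER", "ORG", "O"])
# Weighted voting: weight = 1 - error rate (toy error rates 0.20 and 0.25).
weighted = vote(["PER", "ORG"], weights=[1 - 0.20, 1 - 0.25])
```

With distinct weights, exact ties become unlikely, which is why the weighted variant yields a deterministic decision in practice.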
In the previously described methods, also known as voting, each classifier gave its entire vote to one classification: its own output. However, Equation (2) allows for classifiers to give partial credit to alternative classifications, through the probability $P_i(C \mid w, C_i)$. In the experiments described here, this value is computed directly on the development data. However, the space of possible choices for $C$, $w$ and $C_i$ is large enough to make the estimation unreliable, so we use two approximations, named Model 1 and Model 2 in Table 3: $P_i(C \mid w, C_i) = P_i(C \mid w_0)$ (footnote 2) and $P_i(C \mid w, C_i) = P_i(C \mid C_i)$, respectively. Both probability distributions are estimated as smoothed relative frequencies on the development data. Interestingly, both methods underperform the equal voting method, a fact which can be explained by inspecting the results in Table 2: the fnTBL method has an accuracy (computed on development data) lower than the MaxEnt accuracy, but it outperforms the latter on the test data. Since the parameters $P_i(C \mid w, C_i)$ are computed on the development data, they are probably favoring the MaxEnt method, resulting in lower performance. On the other hand, the equal voting method does not suffer from this problem, as its parameters are not dependent on the development data.

                      Precision     Recall        F-measure
Best Classifier       84.53         78.64         81.48
Equal weight voting   85.2 ± 0.03   81.7 ± 0.02   83.3 ± 0.02
Weighted voting       86.04         80.63         83.24
Model 1               83.07         82.10         82.58
Model 2               86.3          77.55         81.69
RRM                   89.01         79.61         84.05
RRM + flags           88.96         79.84         84.18
RRM + IOB1 + IOB2     88.5          80.46         84.29
Oracle                91.6          92.3          91.95
Table 3: Classifier combination results.

In a last set of experiments, we extend the classification framework to a larger space, in which we compute the conditional class probability distribution conditioned on an arbitrary set of features:
$$P(C \mid f_1^m) = f\big( (P_i(C \mid w, f_1^m))_{i=1 \ldots n} \big) \qquad (3)$$
This setup allows us to still use the classifications
of individual systems as features, but also allows for
other types of conditioning features – for instance,
one can use the output of any classifier (e.g., POS
tags, text chunk labels, etc) as features.
In the described experiments, we use the RRM method to compute the function $f$ in Equation (3), allowing the system to select a good performing combination of features. At training time, the system was fed the output of each classifier on the development data, and, in subsequent experiments, the system was also fed a flag stream which briefly identifies some of the tokens (numbers, romanized characters, etc) and the output of each system in a different NE encoding scheme.
2 $w_0$ is the word associated with the context $w$.
In all the voting experiments, the NEs were en-
coded in an IOB1 scheme, since it seems to be the
most amenable to combination. Briefly, the IOB
general encoding scheme associates a label with
each word, corresponding to whether it begins a specific entity, continues the entity, or is outside any entity. Tjong Kim Sang and Veenstra (1999) describe the IOB schemes in detail. The final experiment also
has access to the output of systems trained in the
IOB2 encoding. The addition of each feature type
resulted in better performance, with the final result
yielding a 10% relative decrease of F-measure error
when compared with the best performing system3.
Table 3 also includes an upper-bound on the classi-
fier combination performance - the performance of
the switch oracle, which selects the correct classifi-
cation if at least one classifier outputs it.
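The difference between the two encodings mentioned above can be made concrete with a conversion routine. Under the usual definitions, IOB1 uses a B- tag only for an entity that immediately follows another entity of the same type, while IOB2 marks the first token of every entity with B-. The function name `iob1_to_iob2` and the tag strings are our assumptions.

```python
def iob1_to_iob2(tags):
    """Convert an IOB1 tag sequence (e.g. I-PER, B-PER, O) to IOB2."""
    converted = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
        else:
            prefix, label = tag.split("-", 1)
            prev_label = prev.split("-", 1)[1] if prev != "O" else None
            if prefix == "I" and prev_label != label:
                # An I- tag after "O" or after a different type starts an entity.
                converted.append("B-" + label)
            else:
                converted.append(tag)
        prev = tag
    return converted

iob2 = iob1_to_iob2(["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-ORG"])
```

Training the same classifier on both encodings gives two slightly different views of the entity boundaries, which can then be offered to the combiner as separate feature streams.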
Table 3 shows that, at least for the examined
types of combination, using a robust feature-based
classifier to compute the classification distribution
yields better performance than combining the classifications through either voting or weighted interpolation. The RRM-based classifier is able to incorporate heterogeneous information from multiple sources, obtaining a 2.8 absolute F-measure improvement versus the best performing classifier and a 1.0 F-measure gain over the next best method.
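The switch oracle used as the upper bound in Table 3 can be sketched at the token level: it outputs the correct label whenever at least one classifier proposes it. The fallback to the first classifier's output, the token-level accuracy measure, and the function names are our assumptions; the figures in Table 3 are entity-level precision, recall, and F-measure.

```python
def switch_oracle(gold, system_outputs):
    """Choose the gold label whenever at least one system produced it.

    gold: list of correct labels; system_outputs: per-token tuples of labels,
    one label per classifier. Falls back to the first system otherwise.
    """
    picked = []
    for gold_label, outputs in zip(gold, system_outputs):
        picked.append(gold_label if gold_label in outputs else outputs[0])
    return picked

def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["PER", "O", "LOC"]
outputs = [("O", "PER"), ("O", "O"), ("ORG", "O")]
oracle = switch_oracle(gold, outputs)
oracle_acc = accuracy(gold, oracle)
```

No combination method that only selects among the classifiers' outputs can exceed this oracle, which makes it a natural ceiling for the voting and interpolation schemes.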
5 Related Work
Sun et al. (2002) propose to use a class-based model for Chinese NE recognition. Specifically, they use a character-based trigram model for the class person, a word-based model for the class location, and a more complicated model for the class organization. This decision is consistent with our observation that the character-based model performs better
than the word-based model for classes such as per-
son, but is worse for classes such as organization.
Sekine and Eriguchi (2000) provide an overview of Japanese NE recognition, presenting the results of 15 systems that participated in an evaluation project for Information Retrieval and Information Extraction (IREX, 1999). Utsuro et al. (2002) study combining outputs of multiple Japanese NE systems by stacking. A second stage classifier (in this case, a decision list) is trained to combine the outputs from first stage classifiers. This is similar
3Measured as
a143a145a144
a104a124a146a145a147
a143 .
in spirit to our application of the RRM classifier for
combining classifier outputs.
Classifier combination has been shown to be effective in improving the performance of NLP applications, and has been investigated by Brill and Wu (1998) and
van Halteren et al. (2001) for part-of-speech tag-
ging, Tjong Kim Sang et al. (2000) for base noun
phrase chunking, and Florian et al. (2003a) for
word sense disambiguation. Among the investigated
techniques were voting, probability interpolation,
and classifier stacking. We also applied the classifier
combination technique discussed in this paper to
English and German (Florian et al., 2003b).
6 Conclusion
In this paper, we discuss two topics related to
Chinese NE recognition: dealing with language-specific
issues such as word segmentation, and combining
multiple classifiers to improve system performance.
In the described experiments, the character-based
model consistently outperforms the word-based
model – one major reason is that the segmentation
granularity might not be suited for this particular
task. Combining four statistical classifiers (a hidden
Markov model classifier, a transformation-based
learning classifier, a maximum entropy classifier,
and a robust risk minimization classifier)
significantly improves system performance, yielding
a 10% relative reduction in F-measure error over the
best performing classifier.
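A relative reduction in F-measure error can be computed as in the short sketch below. It assumes the common definition in which the error of a system is 100 minus its F-measure; the function name and the scores in the usage line are hypothetical and are not results from this paper.

```python
def relative_error_reduction(f_baseline, f_combined):
    """Fraction of the baseline F-measure error (100 - F) removed by the
    combined system, with F-measures given on a 0-100 scale."""
    baseline_error = 100.0 - f_baseline
    return (f_combined - f_baseline) / baseline_error


# Hypothetical scores: moving from F = 80.0 to F = 82.0 removes
# 2 of the 20 error points of the baseline.
reduction = relative_error_reduction(80.0, 82.0)
```

Under this definition the same absolute gain corresponds to a larger relative reduction when the baseline F-measure is higher.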
Acknowledgments
We would especially like to thank Fei Xia for helpful
discussions and for providing the Chinese word
segmenter. We also thank Weijing Zhu for his
assistance in data preparation, and Nanda Kambhatla,
Todd Ward and Salim Roukos for their support and
suggestions.
This work was partially supported by the Defense
Advanced Research Projects Agency and monitored
by SPAWAR under contract No. N66001-99-2-8916
and by the Knowledge Discovery and Dissemination
Project sponsored by the National Science Foundation.
The views and findings contained in this material
are those of the authors and do not necessarily
reflect the position or policy of the Government, and
no official endorsement should be inferred.

References
ACE. 2002. Automatic content extraction.
http://www.ldc.upenn.edu/Projects/ACE/.
D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. 1999. An
algorithm that learns what’s in a name. Machine Learning,
34(1-3):211–231.
E. Brill and J. Wu. 1998. Classifier combination for improved
lexical disambiguation. Proceedings of COLING-ACL’98,
pages 191–195, August.
E. Brill. 1995. Transformation-based error-driven learning and
natural language processing: A case study in part of speech
tagging. Computational Linguistics, 21(4):543–565.
I. Dagan, Y. Karov, and D. Roth. 1997. Mistake-driven learning
in text categorization. In Proceedings of EMNLP-1997.
R. Florian and G. Ngai. 2001. Fast Transformation-
Based Learning Toolkit. Johns Hopkins University,
http://nlp.cs.jhu.edu/~rflorian/fntbl/documentation.html.
R. Florian, S. Cucerzan, C. Schafer, and D. Yarowsky. 2003a.
Combining classifiers for word sense disambiguation. Jour-
nal of Natural Language Engineering, 8(4).
R. Florian, A. Ittycheriah, H. Jing, and T. Zhang. 2003b. Named
entity recognition through classifier combination. In Pro-
ceedings of CoNLL-2003. Edmonton, Canada.
IREX. 1999. Information retrieval and extraction exercise.
http://nlp.cs.nyu.edu/irex/index-e.html.
MUC-6. 1995. The sixth message understanding conference.
http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html.
MUC-7. 1997. The seventh message understanding conference.
http://www.itl.nist.gov/iad/894.02/related_projects/muc/proceedings/muc_7_toc.html.
A. Ratnaparkhi. 1999. Learning to parse natural language with
maximum entropy models. Machine Learning, 34:151–178.
S. Sekine and Y. Eriguchi. 2000. Japanese named entity
extraction evaluation - analysis of results. In Proceedings of
COLING-2000, Saarbrücken, Germany.
J. Sun, J. Gao, L. Zhang, M. Zhou, and C. Huang. 2002. Chi-
nese named entity identification using class-based language
model. In Proceedings of COLING-2002, Taiwan.
E. F. Tjong Kim Sang and F. De Meulder. 2003. Introduction to
the CoNLL-2003 shared task: Language-independent named
entity recognition. In Proceedings of CoNLL-2003.
E. F. Tjong Kim Sang and J. Veenstra. 1999. Representing text
chunks. In Proceedings of EACL’99.
E. F. Tjong Kim Sang, W. Daelemans, H. Dejean, R. Koeling,
Y. Krymolowski, V. Punyakanok, and D. Roth. 2000. Applying
system combination to base noun phrase identification.
In Proceedings of COLING 2000, pages 857–863.
E. F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002
shared task: Language-independent named entity recogni-
tion. In Proceedings of CoNLL-2002. Taiwan.
T. Utsuro, M. Sassano, and K. Uchimoto. 2002. Combining out-
puts of multiple Japanese named entity chunkers by stacking.
In Proceedings of EMNLP-2002, Pennsylvania.
H. van Halteren, J. Zavrel, and W. Daelemans. 2001. Improving
accuracy in word class tagging through the combination
of machine learning systems. Computational Linguistics,
27(2):199–230.
F. Xia, M. Palmer, N. Xue, M. E. Okurowski, J. Kovarik,
S. Huang, T. Kroch, and M. Marcus. 2000. Developing
guidelines and ensuring consistency for Chinese text
annotation. In Proceedings of the 2nd International Conference on
Language Resources and Evaluation, Athens, Greece.
T. Zhang, F. Damerau, and D. E. Johnson. 2002. Text chunking
based on a generalization of Winnow. Journal of Machine
Learning Research, 2:615–637.
