Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 150–153,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Boosting for Chinese Named Entity Recognition
Xiaofeng YU Marine CARPUAT Dekai WU*
Human Language Technology Center
HKUST
Department of Computer Science and Engineering
University of Science and Technology
Clear Water Bay, Hong Kong
{xfyu,marine,dekai}@cs.ust.hk
Abstract
We report an experiment in which a high-
performance boosting based NER model
originally designed for multiple European
languages is instead applied to the Chi-
nese named entity recognition task of the
third SIGHAN Chinese language process-
ing bakeoff. Using a simple character-
based model along with a set of features
that are easily obtained from the Chi-
nese input strings, the system described
employs boosting, a promising and the-
oretically well-founded machine learning
method to combine a set of weak classi-
fiers together into a final system. Even
though we did no other Chinese-specific
tuning, and used only one-third of the
MSRA and CityU corpora to train the
system, reasonable results are obtained.
Our evaluation results show that 75.07 and
80.51 overall F-measures were obtained
on MSRA and CityU test sets respectively.
1 Introduction
Named entity recognition (NER), which includes
the identification and classification of certain
proper nouns, such as person names, organiza-
tions, locations, temporal, numerical and mon-
etary phrases, plays an important part in many
natural language processing applications, such
as machine translation, information retrieval, in-
formation extraction and question answering.
Much of the NER research was pioneered in
the MUC/DUC and Multilingual Entity Task
(MET) evaluations, as a result of which signif-
icant progress has been made and many NER
∗This work was supported in part by DARPA GALE
contract HR0011-06-C-0023, and by the Hong Kong Re-
search Grants Council (RGC) research grants RGC6083/99E,
RGC6256/00E, and DAG03/04.EG09.
systems of fairly high accuracy have been con-
structed. In addition, the shared tasks of CoNLL-
2002 and CoNLL-2003 helped spur the devel-
opment toward more language-independent NER
systems, by evaluating four types of entities (peo-
ple, locations, organizations and names of miscel-
laneous entities) in English, German, Dutch and
Spanish.
However, these are all European languages, and
Chinese NER appears to be significantly more
challenging in a number of important respects.
We believe some of the main reasons to be as
follows: (1) Unlike European languages, Chi-
nese lacks capitalization information which plays
a very important role in identifying named enti-
ties. (2) There is no space between words in Chi-
nese, so ambiguous segmentation interacts with
NER decisions. Consequently, segmentation er-
rors will affect the NER performance, and vice
versa. (3) Unlike European languages, Chinese al-
lows an open vocabulary for proper names of per-
sons, eliminating another major source of explicit
clues used by European language NER models.
This paper presents a system that introduces
boosting to Chinese named entity identification
and classification. Our primary aim was to con-
duct a controlled experiment to test how well
the boosting based models we designed for Eu-
ropean languages would fare on Chinese, without
major modeling alterations to accommodate Chi-
nese. We evaluated the system using data from
the third SIGHAN Chinese language processing
bakeoff, the goal of which was to perform NER
on three types of named entities: PERSON, LO-
CATION and ORGANIZATION.1 Three training
corpora from MSRA, CityU and LDC were given.
TheMSRAandLDCcorporaweresimplifiedChi-
nese texts while the CityU corpus was traditional
1Except in the LDC corpus, which contains four types
of entities: PERSON, LOCATION, ORGANIZATION and
GEOPOLITICAL.
150
Chinese. In addition, the competition also spec-
ified open and closed tests. In the open test, the
participants may use any other material including
material from other training corpora, proprietary
dictionaries, and material from the Web besides
the given training corpora. In the closed test, the
participants can only use the three training cor-
pora. No other material or knowledge is allowed,
including part-of-speech (POS) information, ex-
ternally generated word-frequency counts, Arabic
and Chinese numbers, feature characters for place
names, common Chinese surnames, and so on.
The approach we used is based on selecting a
number of features, which are used to train several
weak classifiers. Using boosting, which has been
showntoperformwellonotherNLPproblemsand
is a theoretically well-founded method, the weak
classifiers are then combined to perform a strong
classifier.
2 Boosting
The main idea behind the boosting algorithm is
that a set of many simple and moderately accu-
rate weak classifiers (also called weak hypothe-
ses) can be effectively combined to yield a sin-
gle strong classifier (also called the final hypoth-
esis). The algorithm works by training weak clas-
sifiers sequentially whose classification accuracy
is slightly better than random guessing and finally
combining them into a highly accurate classifier.
Each weak classifier searches for the hypothesis in
the hypotheses space that can best classify the cur-
rent set of training examples. Based on the eval-
uation of each iteration, the algorithm reweights
thetrainingexamples, forcingthenewlygenerated
weak classifier to give higher weights to the exam-
ples that are misclassified in the previous iteration.
The boosting algorithm was originally created to
deal with binary classification in supervised learn-
ing. The boosting algorithm is simple to imple-
ment, does feature selection resulting in a rela-
tively simple classifier, and has fairly good gen-
eralization.
Based on the boosting framework, our system
uses the AdaBoost.MH algorithm (Schapire and
Singer, 1999) as shown in Figure 1, an n-ary clas-
sification variant of the original well-known bi-
nary AdaBoost algorithm (Freund and Schapire,
1997). The original AdaBoost algorithm was de-
signedforthebinaryclassificationproblembutdid
not fulfill the requirements of the Chinese NER
Input: A training set Tr = {< d1,C1 >,...,< dg,Cg >}
where Cj ⊆ C = {c1,...,cm} for all j = 1,...,g.
Output: A final hypothesis Φ(d,c) =summationtextSs=1 αsΦs(d,c).
Algorithm: LetD1(dj,ci) = 1mg for all j = 1,...,g and
for all i = 1,...,m. For s = 1,...,S do:
• pass distribution Ds(dj,ci)to the weak classifier;
• derive the weak hypothesis Φs from the weak
classifier;
• choose αs ∈ R;
• set Ds+1(dj,ci) = Ds(dj,ci)exp(−αsCj[ci]Φs(dj,ci))Zs
where
Zs =summationtextm
i=1
summationtextg
j=1 Ds(dj,ci )exp( −αsCj [ci]Φs(dj,ci))
is a normalization factor chosen so thatsummationtextm
i=1
summationtextg
j=1 Ds+1(dj,ci) = 1.
Figure 1: The AdaBoost.MH algorithm.
task. AdaBoost.MH has shown its usefulness on
standard machine learning tasks through exten-
sive theoretical and empirical studies, where dif-
ferent standard machine learning methods have
been used as the weak classifier (e.g., Bauer and
Kohavi (1999), Opitz and Maclin (1999), Schapire
(2002)). It also performs well on a number of nat-
ural language processing problems, including text
categorization (e.g., Schapire and Singer (2000),
Sebastiani et al. (2000)) and word sense disam-
biguation (e.g., Escudero et al. (2000)). In partic-
ular, it has also been demonstrated that boosting
can be used to build language-independent NER
models that perform exceptionally well (Wu et al.
(2002), Wu et al. (2004), Carreras et al. (2002)).
The weak classifiers used in the boosting algo-
rithmcomefromawiderangeofmachinelearning
methods. Wehavechosentouseasimpleclassifier
called a decision stump in the algorithm. A deci-
sion stump is basically a one-level decision tree
where the split at the root level is based on a spe-
cific attribute/value pair. For example, a possible
attribute/value pair could beW2 =�/.
3 Experiment Details
In order to implement the boosting/decision
stumps, we used the publicly available software
AT&T BoosTexter (Schapire and Singer, 2000),
which implements boosting on top of decision
stumps. For preprocessing we used an off-the-
shelf Chinese lexical analysis system, the open
source ICTCLAS (Zhang et al., 2003), to segment
and POS tag the training and test corpora.
151
3.1 Data Preprocessing
The training corpora provided by the SIGHAN
bakeoff organizers were in the CoNLL two col-
umn format, with one Chinese character per line
and hand-annotated named entity chunks in the
second column.
In order to provide basic features for training
thedecisionstumps,thetrainingcorporawereseg-
mented and POS tagged by ICTCLAS, which la-
bels Chinese words using a set of 39 tags. This
module employs a hierarchical hidden Markov
model (HHMM) and provides word segmentation,
POS tagging and unknown word recognition. It
performs reasonably well, with segmentation pre-
cision recently evaluated at 97.58%.2 The recall
rateofunknownwordsusingroletaggingwasover
90%.
We note that about 200 words in each train-
ing corpora remained untagged. For these words
we simply assigned the most frequently occurring
tags in each training corpora.
3.2 Feature Set
Theboosting/decisionstumpswereabletoaccom-
modate a large number of features. The primitive
features we used were:
• The current character and its POS tag.
• The characters within a window of 2 charac-
ters before and after the current character.
• The POS tags within a window of 2 charac-
ters before and after the current character.
• The chunk tags (gold standard named entity
label during the training) of the previous two
characters.
The chunk tag is the BIO representation, which
was employed in the CoNLL-2002 and CoNLL-
2003 evaluations. In this representation, each
character is tagged as either the beginning of a
named entity (B tag), a character inside a named
entity (I tag), or a character outside a named entity
(O tag).
When we used conjunction features, we found
that they helped the NER performance signifi-
cantly. The conjunction features used are basi-
cally conjunctions of 2 consecutive characters and
2 consecutive POS tags. We also found that a
2Results from the recent official evaluation in the national
973 project.
Table 1: Dev set results on MSRA and CityU.
Precision Recall Fβ=1
MSRA
LOC 82.00% 85.93% 83.92
ORG 76.99% 61.44% 68.34
PER 89.33% 74.47% 81.22
Overall 82.62% 76.45% 79.41
CityU
LOC 88.62% 81.69% 85.02
ORG 82.50% 66.44% 73.61
PER 84.05% 84.58% 84.31
Overall 86.46% 79.26% 82.71
Table 2: Test set results on MSRA, CityU, LDC.
Precision Recall Fβ=1
MSRA
LOC 84.98% 80.94% 82.91
ORG 72.82% 57.78% 64.43
PER 82.89% 59.91% 69.55
Overall 81.95% 69.26% 75.07
CityU
LOC 88.65% 83.58% 86.04
ORG 83.75% 57.25% 68.01
PER 86.11% 76.42% 80.98
Overall 86.92% 74.98% 80.51
LDC
LOC 65.84% 76.51% 70.78
ORG 53.69% 39.52% 45.53
PER 80.29% 68.97% 74.20
Overall 67.20% 65.54% 66.36
LDC (w/GPE)
GPE 0.00% 0.00% 0.00
LOC 1.94% 37.74% 3.70
ORG 53.69% 39.52% 45.53
PER 80.29% 68.97% 74.20
Overall 30.58% 29.82% 30.19
larger context window (3 characters instead of 2
before and after the current character) to be quite
helpful to performance.
Apart from the training and test corpora, we
considered the gazetteers from LDC which con-
tain about 540K persons, 242K locations and 98K
organization names. Named entities in the train-
ing corpora which appeared in the gazetteers were
identified lexically or by using a maximum for-
ward match algorithm. Once named entities have
been identified, each character can then be anno-
tated with an NE chunk tag. The boosting learner
152
can view the NE chunk tag as an additional fea-
ture. Here we used binary gazetteer features. If
the character was annotated with an NE chunk
tag, its gazetteer feature was set to 1; otherwise
it was set to 0. However we found that adding bi-
nary gazetteer features does not significantly help
the performance when conjunction features were
used. In fact, it actually hurt the performance
slightly.
The features used in the final experiments were:
• The current character and its POS tag.
• The characters within a window of 3 charac-
ters before and after the current character.
• The POS tags within a window of 3 charac-
ters before and after the current character.
• A small set of conjunctions of POS tags and
characters within a window of 3 characters of
the current character.
• The BIO chunk tags of the previous 3 charac-
ters.
4 Results
Table 1 presents the results obtained on the MSRA
and CityU development test set. Table 2 presents
theresultsobtainedontheMSRA,CityUandLDC
test sets. These numbers greatly underrepresent
what could be expected from the boosting model,
since we only used one-third of MSRA and CityU
training corpora due to limitations of the boost-
ing software. Another problem for the LDC cor-
pus was training/testing mismatch: we did not
train any models at all with the LDC training cor-
pus, which was the only training set annontated
with geopolitical entities (GPE). Instead, for the
LDC test set, we simply used the system trained
on the MSRA corpus. Thus, when we consider
the geopolitical entity (GPE), our low overall F-
measure on the LDC test set cannot be interpreted
meaningfully.3 Even so, using only one-third of
the training data, the results on the MSRA and
CityU test sets are reasonable: 75.07 and 80.51
overall F-measures were obtained on the MSRA
and CityU test sets, respectively.
5 Conclusion
We have described an experiment applying a
boosting based NER model originally designed
3Our LDC test result was scored twice by the organizer.
for multiple European languages instead to the
Chinese named entity recognition task. Even
though we only used one-third of the MSRA and
CityU corpora to train the system, the model
produced reasonable results, obtaining 75.07 and
80.51 overall F-measures on MSRA and CityU
test sets respectively.
Having established this baseline for compari-
son against our multilingual European language
boosting based NER models, our next step will be
to incorporate Chinese-specific attributes into the
model to compare with.

References
Eric Bauer and Ron Kohavi. An empirical comparison of
voting classification algorithms: Bagging, boosting, and
variants. Machine Learning, 36:105–142, 1999.
Xavier Carreras, Llu´ıs M`arquez, and Llu´ıs Padr´o. Named en-
tity extraction using AdaBoost. In Computational Natural
Language Learning (CoNLL-2002), at COLING-2002,
pages 171–174, Taipei, Sep 2002.
Gerard Escudero, Llu´ıs M`arquez, and German Rigau. Boost-
ing applied to word sense disambiguation. In 11th Euro-
pean Conference on Machine Learning (ECML-00), pages
129–141, 2000.
Yoav Freund and Robert E. Schapire. A decision-theoretic
generalization of on-line learning and an application to
boosting. Computer and System Sciences, 55(1):119–139,
1997.
David Opitz and Richard Maclin. Popular ensemble meth-
ods: An empirical study. Journal of Artificial Intelligence
Research, 11:169–198, 1999.
Robert E. Schapire and Yoram Singer. Improved boosting
algorithms using confidence-rated predictions. Machine
Learning, 37(3):297–336, 1999.
Robert E. Schapire and Yoram Singer. Boostexter: A
boosting-based system for text categorization. Machine
Learning, 39(2-3):135–168, 2000.
Robert E. Schapire. The boosting approach to machine learn-
ing: An overview. In MSRI workshop on Nonlinear Esti-
mation and Classification, 2002.
Fabrizio Sebastiani, Alessandro Sperduti, and Nicola Val-
dambrini. An improved boosting algorithm and its appli-
cation to automated text categorization. In Proceedings
of 9th ACM International Conference on Information and
Knowledge Management, pages 78–85, 2000.
Dekai Wu, Grace Ngai, Marine Carpuat, Jeppe Larsen, and
Yongsheng Yang. Boosting for named entity recognition.
In Computational Natural Language Learning (CoNLL-
2002), at COLING-2002, pages 195–198, Taipei, Sep
2002.
Dekai Wu, Grace Ngai, and Marine Carpuat. Why nitpicking
works: Evidence for Occam’s razor in error correctors. In
20th International Conference on Computational Linguis-
tics (COLING-2004), Geneva, 2004.
Hua Ping Zhang, Qun Liu, Xue-Qi Cheng, Hao Zhang, and
Hong Kui Yu. Chinese lexical analysis using Hierarchi-
cal Hidden Markov Model. In Proceedings of the second
SIGHAN workshop on Chinese language processing, vol-
ume 17, pages 63–70, 2003.
