Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 146–149,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Chinese Word Segmentation and Named Entity Recognition by 
Character Tagging 
 
Kun Yu1      Sadao Kurohashi2     Hao Liu1     Toshiaki Nakazawa1 
 Graduate School of Information Science and Technology, The University of 
Tokyo, Tokyo, Japan, 113-86561 
Graduate School of Informatics, Kyoto University, Kyoto, Japan, 606-85012 
 {kunyu, liuhao, nakazawa}@kc.t.u-tokyo.ac.jp1 
kuro@i.kyoto-u.ac.jp2 
  
Abstract 
This paper describes our word segmenta-
tion system and named entity recognition 
(NER) system for participating in the 
third SIGHAN Bakeoff. Both of them are 
based on character tagging, but use dif-
ferent tag sets and different features. 
Evaluation results show that our word 
segmentation system achieved 93.3% and 
94.7% F-score in UPUC and MSRA open 
tests, and our NER system got 70.84% 
and 81.32% F-score in LDC and MSRA 
open tests. 
1 Introduction 
Dealing with word segmentation as character 
tagging showed good results in last SIGHAN 
Bakeoff (J.K.Low et al.,2005). It is good at un-
known word identification, but only using char-
acter-level features sometimes makes mistakes 
when identifying known words (T.Nakagawa, 
2004). Researchers use word-level features 
(J.K.Low et al.,2005) to solve this problem. 
Based on this idea, we develop a word segmenta-
tion system based on character-tagging, which 
also combine character-level and word-level fea-
tures. In addition, a character-based NER module 
and a rule-based factoid identification module 
are developed for post-processing.  
Named entity recognition based on character-
tagging has shown better accuracy than word-
based methods (H.Jing et al.,2003). But the small 
window of text makes it difficult to recognize the 
named entities with many characters, such as 
organization names (H.Jing et al.,2003). Consid-
ering about this, we developed a NER system 
based on character-tagging, which combines 
word-level and character-level features together. 
In addition, in-NE probability is defined in this 
system to remove incorrect named entities and 
create new named entities as post-processing. 
2 Character Tagging for Word 
Segmentation and NER 
2.1 Basic Model 
We look both word segmentation and NER as 
character tagging, which is to find the tag se-
quence T* with the highest probability given a 
sequence of characters S=c1c2…cn.  
)|(maxarg* STPT
T
=  (1) 
Then we assume that the tagging of one char-
acter is independent of each other, and modify 
formula 1 as 
∏
==
=
=
=
n
i
ii
tttT
nn
tttT
ctP
ccctttPT
n
n
1...
2121
...
*
)|(maxarg     
)...|...(maxarg
21
21  
(2) 
 Beam search (n=3) (Ratnaparkhi,1996) is ap-
plied for tag sequence searching, but we only 
search the valid sequences to ensure the validity 
of searching result. SVM is selected as the basic 
classification model for tagging because of its 
robustness to over-fitting and high performance 
(Sebastiani, 2002). To simplify the calculation, 
the output of SVM is regarded as P(ti|ci). 
2.2 Tag Definition 
Four tags ‘B, I, E, S’ are defined for the word 
segmentation system, in which ‘B’ means the 
character is the beginning of one word, ‘I’ means 
the character is inside one word, ‘E’ means the 
character is at the end of one word and ‘S’ means 
the character is one word by itself. 
For the NER system, different tag sets are de-
fined for different corpuses. Table 1 shows the 
146
tag set defined for MSRA corpus. It is the prod-
uct of Segment-Tag set and NE-Tag set, because 
not only named entities but also words are seg-
mented in this corpus. Here NE-Tag ‘O’ means 
the character does not belong to any named enti-
ties. For LDC corpus, because there is no seg-
mentation information, we delete NE-Tag ‘O’ 
but add tag ‘NONE’ to indicate the character 
does not belong to any named entities (Table 2). 
Table 1 Tags of NER for MSRA corpus 
Segment-Tag NE-Tag 
B, I, E, S g152 PER, LOC, ORG, O 
Table 2 Tags of NER for LDC corpus 
Segment Tag NE Tag 
B, I, E, S g152 PER, LOC, ORG, GPE + NONE 
2.3 Feature Definition 
First, some features based on characters are 
defined for the two tasks, which are: 
(a) Cn (n=-2,-1,0,1,2) 
(b) Pu(C0) 
Feature Cn (n=-2,-1,0,1,2) mean the Chinese 
characters appearing in different positions (the 
current character and two characters to its left 
and right), and they are binary features. A char-
acter list, which contains all the characters in the 
lexicon introduced later, is used to identify them. 
Besides of that, feature Pu(C0) means whether C0 
is in a punctuation character list. It is also binary 
feature and all the punctuations in the punctua-
tion character list come from Penn Chinese Tree-
bank 5.1 (N.Xue et al.,2002). 
In addition, we define some word-level fea-
tures based on a lexicon to enlarge the window 
size of text in the two tasks, which are:  
(c) Wn (n=-1,0,1) 
Feature Wn (n=-1,0,1) mean the lexicon words 
in different positions (the word containing C0 
and one word to its left and right) and they are 
also binary features. We select all the possible 
words in the lexicon that satisfy the requirements, 
not like only selecting the longest one in 
(J.K.Low et al.,2005). To create the lexicon, we 
use following steps. First, a lexicon from NICT 
(National Institute of Information and Communi-
cations Technology, Japan) is used as the basic 
lexicon, which is extracted from Peking Univer-
sity Corpus of the second SIGHAN Bakeoff 
(T.Emerson, 2005), Penn Chinese Treebank 4.0 
(N.Xue et al.,2002), a Chinese-to-English Word-
list1 and part of NICT corpus (K.Uchimoto et 
al.,2004; Y.J.Zhang et al.,2005). Then, all the 
words containing digits and letters are removed 
                                                 
1 http://projects.ldc.upenn.edu/Chinese/  
from this lexicon. At last, all the punctuations in 
Penn Chinese Treebank 5.1 (N.Xue et al.,2002) 
and all the words in the training data of UPUC 
and MSRA corpuses are added into the lexicon.  
Besides of above features, some extra features 
are defined only for NER task. 
First, we add some character-based features to 
improve the accuracy of person name recognition, 
which are CNn (n=-2,-1,0,1,2). They mean 
whether C n (n=-2,-1,0,1,2)g3belong to a Chinese 
surname list. All of them are binary features. The 
Chinese surname list contains the most famous 
100 Chinese surnames, such as g17225, g19077, g4397, g7458 
(Zhao, Qian, Sun, Li). 
Then, we add some word-based features to 
help identify the organization name, which are 
WORGn (n=-1,0,1). They mean whether W n (n= 
-1,0,1) belong to an organization suffix list. All 
of them are also binary features. The organiza-
tion suffix list is created by extracting the last 
word from all the organization names in the 
training data of both MSRA and LDC corpuses. 
3 Post-processing 
Besides of the basic model, a NER module 
and a factoid identification module are developed 
in our word segmentation system for post-
processing. In addition, we define in-NE prob-
ability to delete the incorrect named entities and 
identify new named entities in the post-
processing phrase of our NER system. 
3.1 Named Entity Recognition for Word 
Segmentation 
In this module, if two or more segments in the 
outputs of basic model are recognized as one 
named entity, we combine them as one segment.  
This module uses the same basic NER model 
as what we introduced in the previous section. 
But it only identifies person and location names, 
because organization names often contain more 
than one word. In addition, to keep the high ac-
curacy of person name recognition, the features 
about organization suffixes are not used here.  
3.2 Factoid Identification for Word Seg-
mentation 
Rules are used to identify the following fac-
toids among the segments from the basic word 
segmentation model:  
NUMBER: Integer, decimal, Chinese number 
PERCENT: Percentage and fraction 
DATE: Date 
FOREIGN: English words 
147
Table 3 shows some rules defined here. 
Table 3 Some Rules for Factoid Identification 
Factoid Rule 
NUMBER If previous segment ends with DIGIT and current segment starts with DIGIT, then combine them. 
PERCENT If previous segment is composed of DIGIT and current segment equals ‘%’, then combine them. 
DATE 
If previous segment is composed of DIGIT and 
current segment is in the list of ‘g5192, g7388, g7097, g2507 
(Year, Month, Day, Day)’, then combine them. 
FOREIGN Combine the consequent letters as one segment. 
(DIGIT means both Arabic and Chinese numerals) 
3.3 NER Deletion and Creation 
In-word probability has been used in unknown 
word identification successfully (H.Q.Li et al., 
2004). Accordingly, we define in-NE probability 
to help delete and create named entities (NE). 
Formula 3 shows the definition of in-NE prob-
ability for character sequence cici+1…ci+n. Here ‘# 
of cici+1…ci+n as NE’ is defined as TimeInNE and 
the occurrence of cici+1…ci+n in different type of 
NE is treated differently. 
data in testing ... of #
NE as ... of #)...(
1
1
1
niii
niii
niiiInNE ccc
ccccccP
++
++
++ =
 (3) 
Then, we use some criteria to delete the incor-
rect NE and create new possible NE, in which 
different thresholds are set for different tasks. 
Criterion 1: If PInNE(cici+1…ci+n) of one NE 
type is lower than TDel, and TimeInNE(cici+1…ci+n) 
of the same NE type is also lower than TTime, then 
delete this type of NE composed of cici+1…ci+n.  
Criterion 2: If PInNE(cici+1…ci+n) of one NE 
type is higher than TCre, and in other places the 
character sequence cici+1…ci+n does not belong to 
any NE, then create a new NE containing 
cici+1…ci+n with this NE type.  
4 Evaluation Results and Discussion 
4.1 Evaluation Setting 
SVMlight (T.Joachims, 1999) was used as 
SVM tool. In addition, we used the MSRA train-
ing corpus of NER task in this Bakeoff to train 
our NER post-processing module. 
4.2 Results of Word Segmentation 
We attended the open track of word segmenta-
tion task for two corpuses: UPUC and MSRA. 
Table 4 shows the evaluation results. 
Table 4 Results of Word Segmentation Task (in percentage %) 
Corpus Pre. Rec. F-score Roov Riv 
UPUC 94.4 92.2 93.3 68.0 97.0 
MSRA 94.0 95.3 94.7 50.3 96.9 
The F-score of our word segmentation system 
in UPUC corpus ranked 4th (same as that of the 
3rd group) among all the 8 participants. And it 
was only 1.1% lower than the highest one and 
0.2% lower than the second one. It showed that 
our character-tagging approach was feasible. But 
the F-score of MSRA corpus was only higher 
than one participant in all the 10 groups (the 
highest one was 97.9%). Error analysis shows 
that there are two main reasons.  
First, in MSRA corpus, they tend to segment 
one organization name as one word, such as g13666
g3281g1025g3281g2842g1262(China Chamber of Commerce in 
USA). But our basic segmentation model seg-
mented such word into several words, e.g. g13666g3281/
g1025g3281/g2842g1262(USA/China/Chamber of Commerce), 
and our post-processing NER module does not 
consider about organization names.  
Second, our factoid identification rule did not 
combine the consequent DATE factoids into one 
word, but they are combined in MSRA corpus. 
For example, our system segmented the wordg7214
g990 9g7114g6984 (9 o’clock in the evening) into three 
parts g7214g990/9 g7114/g6984 (Evening/9 o’clock/Exact). 
This error can be solved by revising the rules for 
factoid identification. 
Besides of that, we also found although our 
large lexicon helped identify the known word 
successfully, it also decreased the recall of OOV 
words (our Riv of UPUC corpus ranked 2nd, with 
only 0.6% decrease than the highest one, but 
Roov ranked 4th, with 8.8% decrease than the 
highest one). The large size of this lexicon is 
looked as the main reason.  
Our lexicon contains 221,407 words, in which 
6,400 words are single-character words. It made 
our system easy to segment one word into sev-
eral words, for example word g13475g8994g13464 (Economy 
Group) in UPUC corpus was segmented intog13475
g8994 (Economy) and g13464(Group). Moreover, the 
large size of this lexicon also brought errors of 
combining two words into one word if the word 
was in the lexicon. For example, words g2494 (Only) 
and g7389 (Have) in MSRA corpus were identified 
as one word because there existed the wordg2494g7389 
(Only) in our lexicon. We will reduce our lexi-
con to a reasonable size to solve these problems. 
4.3 Results of NER 
We also attended the open track of NER task 
for both LDC corpus and MSRA corpus. Table 5 
and Table 6 give the evaluation results.  
There were only 3 participants in the open 
track of LDC corpus and our group got the best 
F-score. In addition, among all the 11 partici-
pants for MSRA corpus, our system ranked 6th 
148
by F-score. It showed the validity of our charac-
ter-tagging method for NER. But for location 
name (LOC) in LDC corpus, both the precision 
and recall of our NER system were very low. It 
was because there were too few location names 
in the training data (there were only 476 LOC in 
the training data, but 5648 PER, 5190 ORG and 
9545 GPE in the same data set). 
Table 5 Results of NER Task for LDC corpus (in percentage %) 
 PER LOC ORG GPE Overall 
Pre. 83.29 58.52 61.48 78.66 76.16 
Rec. 66.93 18.87 45.19 79.94 66.21 
F-score 74.22 28.57 52.09 79.30 70.84 
Table 6 Results of NER Task for MSRA corpus (in percentage %) 
 PER LOC ORG Overall 
Pre. 90.76 85.62 73.90 84.68 
Rec. 76.13 85.41 65.74 78.22 
F-score 82.80 85.52 69.58 81.32 
Besides of that, error analysis shows there are 
four types of main errors in the NER results. 
First, some organization names were very long 
and can be divided into several words, in which 
parts of them can also be looked as named enti-
ties. In such case, our system only recognized the 
small parts as named entities. For example,  g2716g1327
g3835g4410g17165g8503g9177g1008g1134g11752g12362g1025g5527 (Fei Zhengqing 
Eastern Asia Research Center of Harvard Univ.) 
was an organization name. But our system rec-
ognized it asg2716g1327g3835g4410(Harvard Univ.)/ORG+g17165
g8503g9177 (Fei Zheng Qing)/PER+ g1008g1134 (Eastern 
Asia)/LOC+ g11752g12362g1025g5527(Research Center)/ORG. 
Adding more context features may be useful to 
resolve this issue. 
In addition, our system was not good at recog-
nizing foreign person names, such as g17194g4584g11343 
(Riordan), and abbreviations, such as g8943g5078 (Los 
Angeles), if they seldom or never appeared in 
training corpus. It is because the use of the large 
lexicon decreased the unknown word identifica-
tion ability of our NER system simultaneously. 
Third, the in-NE probability used in post-
processing is helpful to identify named entities 
which cannot be recognized by the basic model. 
But it also recognized some words which can 
only be regarded as named entities in the local 
context incorrectly. For example, our system 
recognizedg2347g1152 (Najing) as GPE in g17877g2052g2347g1152g2319
g8847 (Send to Najing for remedy) in LDC corpus. 
We will consider about adding the in-NE prob-
ability as one feature into the basic model to 
solve this problem. 
At last, in LDC corpus, they combine the at-
tributive of one named entity (especially person 
and organization names) with the named entity 
together. But our system only recognized the 
named entity by itself. For example, our system 
only recognized g2028g7702g14471 (Liu Gui Fang) as PER 
in the reference person name g993g11705g1881g5785g11352g2028g7702g14471 
(Liu Gui Fang who does not know the inside). 
5 Conclusion and Future Work 
Through the participation of the third 
SIGHAN Bakeoff, we found that tagging charac-
ters with both character-level and word-level fea-
tures was effective for both word segmentation 
and NER. While, this work is only our 
preliminary attempt and there are still many 
works needed to do in the future, such as the 
control of lexicon size, the use of extra 
knowledge (e.g. pos-tag), the feature definition, 
and so on. In addition, our word segmentation 
system only combined the NER module as post-
processing, which resulted in that lots of infor-
mation from NER module cannot be used by the 
basic model. We will consider about combining 
the NER and factoid identification modules into 
the basic word segmentation model by defining 
new tag sets in our future work. 
Acknowledgement 
We would like to thank Dr. Kiyotaka Uchi-
moto for providing the NICT lexicon. 

References
T.Emerson. 2005. The Second International Chinese Word Seg-
mentation Bakeoff. In the 4th SIGHAN Workshop. pp. 123-133. 
H.Jing et al. 2003. HowtogetaChineseName(Entity): Segmentation 
and Combination Issues. In EMNLP 2003. pp. 200-207. 
T.Joachims. 1999. Making large-scale SVM learning practical. 
Advances in Kernel Methods - Support Vector Learning. MIT-
Press. 
H.Q.Li et al. 2004. The Use of SVM for Chinese New Word Identi-
fication. In IJCNLP 2004. pp. 723-732. 
J.K.Low et al. 2005. A Maximum Entropy Approach to Chinese 
Word Segmentation. In the 4th SIGHAN Workshop. pp. 161-164. 
T.Nakagawa. 2004. Chinese and Japanese Word Segmentation 
Using Word-level and Character-level Information. In COLING 
2004. pp. 466-472. 
A.Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-
Speech Tagging. In EMNLP 1996. 
F.Sebastiani. 2002. Machine learning in automated text categoriza-
tion. ACM Computing Surveys. 34(1): 1-47. 
K.Uchimoto et al. 2004. Multilingual Aligned Parallel Treebank 
Corpus Reflecting Contextual Information and its Applications. 
In Proceedings of the MLR 2004. pp. 63-70. 
N.Xue et al. 2002. Building a Large-Scale Annotated Chinese Cor-
pus. In COLING 2002. 
Y.J.Zhang et al. 2005. Building an Annotated Japanese-Chinese 
Parallel Corpus – A part of NICT Multilingual Corpora. In Pro-
ceedings of the MT SummitX. pp. 71-78. 
