Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 181–184,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Chinese Word Segmentation and Named Entity Recognition Based on 
Conditional Random Fields Models 
 
 
Yuanyong Feng Le Sun Yuanhua Lv 
Institute of Software, Chinese Academy of Sciences, Beijing, 100080, China 
{yuanyong02, sunle, yuanhua04}@ios.cn 
 
 
Abstract 
This paper mainly describes NER@ISCAS, a Chinese named entity
recognition (NER) system that integrates text, part-of-speech, and
small-vocabulary-character-lists features for the MSRA NER open track
under the framework of the Conditional Random Fields (CRFs) model.
The techniques used for the closed NER and word segmentation tracks
are also presented.
1 Introduction 
The system NER@ISCAS is designed under the Conditional Random Fields
(CRFs; Lafferty et al., 2001) framework. It integrates multiple
features based on single Chinese characters or space-separated ASCII
words. The system designed earlier (Feng et al., 2005) is used for the
MSRA NER open track this year. The output of an external
part-of-speech tagging tool and some carefully collected small-scale
character lists are used as external knowledge.
The closed word segmentation and named entity recognition tracks are
also based on this system, with some adjustments.
The remainder of this paper is organized as follows. Section 2
introduces the Conditional Random Fields model. Section 3 presents the
details of our Chinese NER system integrating multiple features.
Section 4 describes feature extraction for the closed tracks. Section 5
gives the evaluation results. We end the paper with conclusions and
future work.
2 Conditional Random Fields Model 
Conditional random fields are undirected graphi-
cal models for calculating the conditional prob-
ability for output vertices based on input ones. 
While sharing the same exponential form with 
maximum entropy models, they have more effi-
cient procedures for complete, non-greedy finite-
state inference and training.  
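Given an observation sequence $\mathbf{o} = \langle o_1, o_2, \ldots, o_T \rangle$,
the linear-chain CRFs model, based on the assumption of first-order
Markov chains, defines the probability of the corresponding state
sequence $\mathbf{s}$ as follows (Lafferty et al., 2001):

$$p_\Lambda(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z_{\mathbf{o}}} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t) \right) \qquad (1)$$
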
where $\Lambda$ is the model parameter set, $Z_{\mathbf{o}}$ is the
normalization factor over all state sequences, $f_k$ is an arbitrary
feature function, and $\lambda_k$ is the learned feature weight. A
feature function takes the value 0 in most cases and the value 1 in
some designated cases. For example, the value of a feature named
“MAYBE-SURNAME” is 1 if and only if $s_{t-1}$ is OTHER, $s_t$ is PER,
and the $t$-th character in $\mathbf{o}$ is a common surname.
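The MAYBE-SURNAME example can be written out as a small binary feature function. This is only an illustrative sketch: the surname set is a tiny excerpt (the four surname examples listed in Table 1), the sentence is made up, and positions are 0-based here rather than 1-based as in the text.

```python
from typing import Sequence

# Illustrative excerpt only: the four surname examples shown in Table 1.
COMMON_SURNAMES = {"李", "吴", "郑", "王"}


def maybe_surname(prev_state: str, state: str, obs: Sequence[str], t: int) -> int:
    """Binary feature f_k(s_{t-1}, s_t, o, t) for the MAYBE-SURNAME example.

    Returns 1 only if the previous state is OTHER, the current state is PER,
    and the character at position t (0-based) is a common surname.
    """
    return int(
        prev_state == "OTHER" and state == "PER" and obs[t] in COMMON_SURNAMES
    )


# Made-up sentence used purely for illustration.
obs = list("记者王小明报道")
print(maybe_surname("OTHER", "PER", obs, 2))  # the feature fires on "王"
print(maybe_surname("PER", "PER", obs, 2))    # wrong previous state
print(maybe_surname("OTHER", "PER", obs, 3))  # "小" is not in the surname set
```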
The inference and training procedures of CRFs can be derived directly
from their equivalents in HMMs. For instance, the forward variable
$\alpha_t(s_i)$ defines the probability of being in state $s_i$ at
time $t$ given the observation sequence $\mathbf{o}$. Assuming that we
know the probabilities $\alpha_0(s_i)$ of each possible value $s_i$
for the beginning state, we have

$$\alpha_{t+1}(s_i) = \sum_{s'} \alpha_t(s') \exp\left( \sum_{k} \lambda_k f_k(s', s_i, \mathbf{o}, t) \right) \qquad (2)$$

In a similar way, we can obtain the backward variables and the
Baum-Welch algorithm.
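The forward recursion in equation (2) can be sketched in a few lines. The state set, the two toy features, their weights, and the choice of OTHER as the beginning state are all assumptions made for illustration, and positions are 0-based. When the initial values put all mass on a single beginning state, the final forward values sum to the normalization factor of equation (1).

```python
import math
from typing import Callable, Dict, List, Sequence

Feature = Callable[[str, str, Sequence[str], int], float]


def forward(
    states: List[str],
    obs: Sequence[str],
    features: List[Feature],
    weights: List[float],
    alpha0: Dict[str, float],
) -> Dict[str, float]:
    """Apply the forward recursion once per observation position."""
    alpha = dict(alpha0)
    for t in range(len(obs)):
        new_alpha = {}
        for s_i in states:
            total = 0.0
            for s_prev in states:
                # Weighted feature sum for the transition s_prev -> s_i at t.
                score = sum(
                    w * f(s_prev, s_i, obs, t) for f, w in zip(features, weights)
                )
                total += alpha[s_prev] * math.exp(score)
            new_alpha[s_i] = total
        alpha = new_alpha
    return alpha


# Toy setup (hypothetical features and weights, not the paper's model).
states = ["OTHER", "PER"]
surnames = {"王"}
features = [
    lambda sp, s, o, t: float(sp == "OTHER" and s == "PER" and o[t] in surnames),
    lambda sp, s, o, t: float(sp == "PER" and s == "PER"),
]
weights = [2.0, 0.5]
obs = list("王明说")

alpha = forward(states, obs, features, weights, {"OTHER": 1.0, "PER": 0.0})
z = sum(alpha.values())  # normalization factor for this toy model
print(alpha, z)
```

The backward variables follow the same pattern with the recursion run from the end of the sequence.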
3 Chinese NER Using CRFs Model Integrating Multiple Features for Open Track
In our system the text feature, part-of-speech 
(POS) feature, and small-vocabulary-character-
lists (SVCL) feature are combined under a uni-
fied CRFs framework. 
The text feature includes single Chinese characters and runs of
continuous digits or letters.
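A minimal sketch of this tokenization, assuming that a run of ASCII digits and/or letters forms one token and every other non-space character is its own token. The regular expression and the sample string are illustrative and are not taken from the system itself.

```python
import re
from typing import List

# A run of ASCII digits/letters is one token; any other non-space
# character (for example a Chinese character) is a token by itself.
TOKEN_RE = re.compile(r"[0-9A-Za-z]+|\S")


def tokenize(text: str) -> List[str]:
    """Split text into the tokens that the text feature is built from."""
    return TOKEN_RE.findall(text)


print(tokenize("2006年7月在Sydney举行"))
# ['2006', '年', '7', '月', '在', 'Sydney', '举', '行']
```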
The POS feature is an important feature that carries some syntactic
information. Our POS tag set follows the criterion of modern Chinese
corpora construction (Yu, 1999), which contains 39 tags.
The last feature is based on lists. We first list all digits and
English letters in Chinese. Then the character features most
frequently used in Chinese NER are collected, including 100
single-character surnames, 100 location tail characters, and 40
organization tail characters. The total number of items in our lists
is less than 600. Together the lists make up a list feature (SVCL).
Some examples from these lists are given in Table 1.
 
 
Each token is represented by its feature vector, which combines the
features just discussed. Once all token feature values (possibly
including context features) are determined, the observation sequence
is fed into the model.
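One way to picture the per-token feature vector is sketched below. The SVCL entries are only the example characters from Table 1, the POS tags are placeholder strings standing in for the output of the external tagger, and the context window of one token on each side is an assumption; the text does not state the actual window size.

```python
from typing import Dict, List

# Tiny excerpt of the SVCL lists, using example characters from Table 1.
SVCL = {
    "chsurn": set("李吴郑王"),
    "loctch": set("区国岛海台庄冲"),
    "orgtch": set("府团校协局办军"),
}


def svcl_value(token: str) -> str:
    """Look up the SVCL value of a token; fall back to 'other'."""
    if token.isascii() and token.isdigit():
        return "digit"
    if token.isascii() and token.isalpha():
        return "letter"
    for value, chars in SVCL.items():
        if token in chars:
            return value
    return "other"


def token_features(
    tokens: List[str], pos_tags: List[str], window: int = 1
) -> List[Dict[str, str]]:
    """Build one feature dict per token, including neighbours in the window."""
    rows = []
    for t in range(len(tokens)):
        row = {}
        for d in range(-window, window + 1):
            i = t + d
            inside = 0 <= i < len(tokens)
            row[f"text[{d}]"] = tokens[i] if inside else "<PAD>"
            row[f"pos[{d}]"] = pos_tags[i] if inside else "<PAD>"
            row[f"svcl[{d}]"] = svcl_value(tokens[i]) if inside else "<PAD>"
        rows.append(row)
    return rows


tokens = list("王明在海南")
pos_tags = ["nr", "nr", "p", "ns", "ns"]  # placeholder tags, not tagger output
rows = token_features(tokens, pos_tags)
print(rows[0])
```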
Each token state is a combination of the type of the named entity it
belongs to and its boundary type within that entity. The entity types
are person name (PER), location name (LOC), organization name (ORG),
date expression (DAT), time expression (TIM), numeric expression
(NUM), and not named entity (OTH). The boundary types are simply
Beginning, Inside, and Outside (BIO).
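The resulting state set can be enumerated as follows. The label spelling ("B-PER", "I-PER", and a single "O" for tokens of type OTH) is an assumed convention, and the entity spans in the usage line are invented for illustration.

```python
from typing import List, Tuple

ENTITY_TYPES = ["PER", "LOC", "ORG", "DAT", "TIM", "NUM"]

# Assumed naming: "B-X" / "I-X" inside an entity of type X, "O" for OTH.
STATES = ["O"] + [f"{b}-{t}" for t in ENTITY_TYPES for b in ("B", "I")]


def spans_to_states(n: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, type) spans into one state per token."""
    states = ["O"] * n
    for start, end, etype in spans:
        states[start] = f"B-{etype}"
        for i in range(start + 1, end):
            states[i] = f"I-{etype}"
    return states


print(len(STATES))  # 13
print(spans_to_states(5, [(0, 2, "PER"), (3, 5, "LOC")]))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']
```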
4 Feature Extraction for Closed Tracks
In the closed tracks, only character and word list features extracted
from the training data are applied for word segmentation. In the NER
track we also include a named entity list extracted from the training
data.
To extract the list feature, we simply search each text string among
the list items by forward maximum-length matching.
Taking the word segmentation task as an example, when a text string
$c_1 c_2 \ldots c_n$ is given, we tag each character in a BIO-WL
style. If $c_i c_{i+1} \ldots c_j$ matches an item $I$ of length
$j-i+1$ and no other item $I'$ of length $k$ ($k > j-i+1$) in the list
matches $c_i c_{i+1} \ldots c_j \ldots c_{k+i-1}$, then the characters
are tagged as follows:

c_i     c_{i+1}   …   c_j
B-WL    I-WL      …   I-WL

If no item in the list matches a head subpart of the string, then
$c_i$ is tagged as 0.
The tagging operation iterates on the remaining part until all
characters are tagged.
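The tagging procedure described above can be sketched as a forward maximum-length matcher. The word list and sentence below are invented for illustration; in the closed tracks the list would be extracted from the training data. The unmatched tag is written "0" exactly as in the text.

```python
from typing import Iterable, List

UNMATCHED = "0"  # tag for characters not covered by any list item, as in the text


def tag_bio_wl(text: str, word_list: Iterable[str]) -> List[str]:
    """Tag each character B-WL / I-WL / 0 by forward maximum-length matching."""
    words = set(word_list)
    max_len = max((len(w) for w in words), default=0)
    tags: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, so a match is accepted only
        # when no longer list item also matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in words:
                tags.extend(["B-WL"] + ["I-WL"] * (length - 1))
                i += length
                break
        else:
            # No list item matches the head of the remaining string.
            tags.append(UNMATCHED)
            i += 1
    return tags


print(tag_bio_wl("中国人民很好", {"中国", "中国人", "人民"}))
# ['B-WL', 'I-WL', 'I-WL', '0', '0', '0']
```

Note that the greedy choice of "中国人" leaves "民" unmatched even though "人民" is in the list; this is the expected behaviour of forward maximum matching.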
5 Evaluation 
5.1 Results 
The system used for our MSRA NER open track submission had some bugs
and was trained on a much smaller training set than the full set the
organizers provided. The results are therefore very low, as shown in
Table 2:
 
Accuracy 96.28% 
Precision 83.20% 
Recall 67.03% 
FB1 74.24%
Table 2. MSRA NER Open 
 
After we fixed the bugs and retrained on the full training corpus, the
results are as follows:
 
Accuracy 98.24% 
Precision 89.38% 
Recall 83.07% 
FB1 86.11%
Table 3. MSRA NER Open (retrained) 
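As a quick consistency check, and assuming FB1 denotes the usual F-measure with β = 1 (the harmonic mean of precision and recall), the FB1 values in Tables 2 and 3 can be recomputed from the reported precision and recall. The function name is ours.

```python
def f_beta1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F-measure with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_beta1(83.20, 67.03), 2))  # 74.24, matching Table 2
print(round(f_beta1(89.38, 83.07), 2))  # 86.11, matching Table 3
```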
 
All submissions on the closed tracks are trained on 80% of the
training corpora; the remaining 20% is used for development. The
results are shown in Table 4 and Table 5:
 
Value     Description          Examples
digit     Arabic digit(s)      1, 2, 3
letter    Letter(s)            A, B, C, ..., a, b, c
          (Continuous digits and/or letters: the sequence is regarded as a single token)
chseq     Chinese order 1      ㈠, ⑴, ①, Ⅰ
chdigit   Chinese digit        １, 壹, 一
tianseq   Chinese order 2      甲, 乙, 丙, 丁
chsurn    Surname              李, 吴, 郑, 王
notname   Not name             将, 对, 那, 的, 是, 说
loctch    LOC tail character   区, 国, 岛, 海, 台, 庄, 冲
orgtch    ORG tail character   府, 团, 校, 协, 局, 办, 军
other     Other case           情, 规, 息, ！, 。
Table 1.  Some Examples of SVCL Feature
Measure \ Corpus   UPUC    CityU   CKIP    MSRA
Recall             0.922   0.952   0.939   0.933
Precision          0.912   0.954   0.929   0.942
FB1                0.917   0.953   0.934   0.937
OOV Recall         0.680   0.747   0.606   0.640
IV Recall          0.945   0.960   0.954   0.943
Table 4. WS Close
 
Measure MSRA CityU LDC 
Accuracy 92.44 97.80 93.82 
Precision 81.64 92.76 81.43 
Recall 31.24 81.81 59.53 
FB1 45.19 86.94 68.78
Table 5. NER Close 
 
The low scores on the MSRA NER track arise because we chose a much
smaller training data file encoded in CP936 (about 7% of the full data
set). This file may be an incomplete output produced when the
organizers converted the data from another encoding scheme.
5.2 Errors  from NER Track 
The NER errors in our system are mainly as follows:
• Abbreviations
Abbreviations are very common among the errors. A significant portion
of them are mentioned before their corresponding full names. Some
common abbreviations have no corresponding full names appearing in the
document. Here are some examples:
R¹:针对大陆人民申请进入  金  妈  地
区， [内政部警政署入出境管理局  ORG] 
[金门  GPE]、 [妈祖  GPE]服务站定于
明天……  
K:针对大陆人民申请进入  [金  GPE] 
[妈  GPE]地区， [内政部警政署入出境
管理局  ORG][金门  GPE]、 [妈祖  
GPE]服务站定于明天……  
R: 总后 [嫩江基地  LOC]的先进事迹  
K: [总后嫩江基地  LOC]的先进事迹  
R: [中  丹  LOC]兩國  
K: [中  LOC][丹  LOC]兩國  
In the current system, recognition depends entirely on the
linear-chain CRFs model, which relies heavily on local window
observation features; no abbreviation list or special abbreviation
recognition is involved. Because of the lack of constraint checking on
distant entity mentions, the system fails to capture the interaction
among similar text fragments across sentences.

¹ R stands for system response, K for key.
• Concatenated Names 
For many reasons, Chinese names in titles and some sentences,
especially in news, are not separated. The system often fails to
determine the right boundaries and the appropriate entity type.
For example:
R:身边还有 [张龙  赵虎  PER]王朝 [马
汉  PER] 四个卫士  
K:身边还有 [张龙  PER][赵虎  
PER][王朝  PER][马汉  PER] 四个卫
士  
R:将 [瓦西里斯  LOC]与 [奥纳西斯  
PER]比较  
K:将 [瓦西里斯  PER]与 [奥纳西斯  
PER]比较  
• Hints 
Though it helps to recognize an entity in most cases, the
small-vocabulary-list hint feature may sometimes recommend a wrong
decision. For instance, the common surname character “王” in the
following sentence is wrongly labeled when no word segmentation
information is given:
R:[希腊  LOC]船 [王  康斯坦塔科普洛
斯  PER] 
K:[希腊  LOC]船  王 [康斯坦塔科普洛
斯  PER] 
Other errors of this type may result from fail-
ing to identify verbs and prepositions, such as: 
R:[中共中央  致  中国致公党十一大  
ORG]的贺词……向 [致公党  ORG]的同
志们……  
K:[中共中央  ORG]致 [中国致公党十
一大  ORG]的贺词……向 [致公党  ORG]
的同志们……  
R:全国保护明天行动组委会  举行表彰会  
K:[全国保护明天行动组委会  ORG]举行表
彰会  
R:包公  赶驴  
K:[包公  PER] 赶驴  
• Other Types: 
R:特别助理  由喜贵  等也同机抵达。  
K:特别助理 [由喜贵  PER]等也同机抵
达。  
R:脸谱上还有  日  月  的图案  
K:脸谱上还有 [日  LOC][月  LOC]的
图案  
6 Conclusions and Future Work 
We have described NER@ISCAS, a Chinese named entity recognition system
that integrates text, part-of-speech, and
small-vocabulary-character-lists features for the MSRA NER open track
under the framework of the Conditional Random Fields (CRFs) model.
Although it provides a unified framework for integrating multiple
flexible features and achieves global optimization over the input text
sequence, the popular linear-chain Conditional Random Fields model
often fails to capture semantic relations among recurring mentions and
adjoining entities in a concatenated structure.
The cases involving exact recurrence and shortened occurrence suggest
that more effort should go into feature engineering or post-processing
for abbreviation and recurrence recognition. Further effort could be
devoted to common patterns, such as paraphrase, counting, and
constraints on Chinese person name lengths.
From the current point of view, enriching the hint lists is also
desirable.
Acknowledgment 
This work is supported by the National Science 
Fund of China under contract 60203007. 

References 
Chinese 863 program. 2005. Results on Named 
Entity Recognition. The 2004HTRDP Chinese 
Information Processing and Intelligent Hu-
man-Machine Interface Technology Evalua-
tion. 
Yuanyong Feng, Le Sun and Junlin Zhang. 2005.
Early Results for Chinese Named Entity Recognition
Using Conditional Random Fields Model, HMM and
Maximum Entropy. IEEE Natural Language Processing &
Knowledge Engineering. Beijing: Publishing House,
BUPT. pp. 549-552.
John Lafferty, Andrew McCallum, and Fernando 
Pereira. 2001. Conditional Random Fields: 
Probabilistic Models for Segmenting and La-
beling Sequence Data. ICML. 
Shiwen Yu. 1999. Manual on Modern Chinese
Corpora Construction. Institute of Computational
Language, Peking University. Beijing.
