Chinese Named Entity Identification Using Class-based Language Model¹
Jian Sun*, Jianfeng Gao#, Lei Zhang**, Ming Zhou#, Changning Huang#
* Beijing University of Posts & Telecommunications, China, jiansun_china@hotmail.com
# Microsoft Research Asia, {jfgao, mingzhou, cnhuang}@microsoft.com
** Tsinghua University, China
¹ This work was done while the author was visiting Microsoft Research Asia.
Abstract
We consider here the problem of Chinese
named entity (NE) identification using a
statistical language model (LM). In this
research, word segmentation and NE 
identification have been integrated into a
unified framework that consists of several 
class-based language models. We also adopt a 
hierarchical structure for one of the LMs so 
that the nested entities in organization names 
can be identified. The evaluation on a large test 
set shows consistent improvements. Our
experiments show further improvement after
integrating linguistic heuristic information, a
cache-based model and NE abbreviation
identification.
1 Introduction
NE identification is the key technique in many
applications such as information extraction, 
question answering, machine translation and so 
on. English NE identification has achieved
great success. For Chinese, however, the task is
very different: there are no spaces to mark word
boundaries and no standard definition of a word.
Chinese NE identification and word segmentation
therefore interact with each other.
This paper presents a unified approach
that integrates these two steps using a
class-based LM and applies Viterbi search to
select the globally optimal solution. The
class-based LM consists of two sub-models, 
namely the context model and the entity model. 
The context model estimates the probability of 
generating a NE given a certain context, and
the entity model estimates the probability of a 
sequence of Chinese characters given a certain 
kind of NE. In this study, we are interested in 
three kinds of Chinese NE that are most 
commonly used, namely person name (PER), 
location name (LOC) and organization name 
(ORG). We have also adopted a variety of
approaches to improve the LM. In addition, a
hierarchical structure is employed for the
organization LM so that PER and LOC nested in
an ORG can be identified.
The evaluation is conducted on a large test 
set in which NEs have been manually tagged. 
The experimental results show consistent
improvements over existing methods. Our
experiments show further improvement after
integrating linguistic heuristic information, a
cache-based model and NE abbreviation
identification. The precision
of PER, LOC, ORG on the test set is 79.86%, 
80.88%, 76.63%, respectively; and the recall is
87.29%, 82.46%, 56.54%, respectively.
2 Related Work
Recently, research on English NE
identification has focused on machine-learning
approaches, including the hidden Markov model
(HMM), the maximum entropy model, decision trees
and transformation-based learning (Bikel et al,
1997; Borthwick et al, 1999; Sekine et al,
1998). Some systems have been deployed in real
applications.
Research on Chinese NE identification is,
however, still at an early stage. Some
researchers apply methods of English NE
identification to Chinese. Yu et al (1997)
applied the HMM approach, in which NE
identification is formulated as a tagging
problem solved with the Viterbi algorithm. In
general, current approaches to NE identification
(e.g. Chen, 1997) contain two separate steps:
word segmentation and NE identification.
Word segmentation errors then propagate into
the NE identification results. Zhang
(2001) put forward a class-based LM for
Chinese NE identification. We develop
this idea further with some new features, which
leads to a new framework that integrates Chinese
word segmentation and NE identification using a
class-based language model (LM).
3 Class-based LM for NE Identification
The n-gram LM is a stochastic model which
predicts the next word given the previous n-1
words by estimating the conditional probability
$P(w_n \mid w_1 \ldots w_{n-1})$. In practice, the trigram
approximation $P(w_i \mid w_{i-2} w_{i-1})$ is widely used,
assuming that the word $w_i$ depends only on the two
preceding words $w_{i-2}$ and $w_{i-1}$. Brown et al (1992)
put forward and discussed n-gram models 
based on classes of words. In this section, we 
will describe how to use class-based trigram 
model for NE identification. Each kind of NE 
(including PER, LOC and ORG) is defined as a 
class in the model. In addition, we differentiate 
the transliterated person name (FN) from the 
Chinese person name since they have different 
constitution patterns. The four classes of NE 
used in our model are shown in Table 1. All 
other words are also defined as individual 
classes themselves (i.e. one word as one class). 
Consequently, there are |V|+4 classes in our
model, where |V| is the size of the vocabulary.
Table 1: Classes defined in class-based model
Tag  Description
PN   Chinese person name
FN   Transliterated person name
LN   Location name
ON   Organization name
3.1 The Language Modeling
3.1.1 Formulation
Given a Chinese character sequence $S = s_1 \ldots s_n$,
the task of Chinese NE identification is to find
the optimal class sequence $C^* = c_1 \ldots c_m$ ($m \le n$)
that maximizes the probability $P(C \mid S)$. It can be
expressed as Equation (1), and we call it the
class-based model.
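The step from $P(C \mid S)$ to a product of a context term and an entity term is the usual Bayes decomposition. A short derivation, written under the standard assumption that $P(S)$ is constant for a given input and can therefore be dropped from the maximization:

```latex
\begin{aligned}
C^* &= \arg\max_{C} P(C \mid S) \\
    &= \arg\max_{C} \frac{P(C)\, P(S \mid C)}{P(S)} \\
    &= \arg\max_{C} P(C)\, P(S \mid C)
\end{aligned}
```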
The class-based model consists of two
sub-models: the context model $P(C)$ and the
entity model $P(S \mid C)$. The context model
gives the probability of generating an NE
class given a (previous) context. $P(C)$ is a
prior probability, which is computed
according to Equation (2):
$$P(C) \cong \prod_{i=1}^{m} P(c_i \mid c_{i-2} c_{i-1}) \qquad (2)$$
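As an illustration of Equation (2), the following minimal sketch estimates a class trigram model by relative frequency and scores a class sequence. The function names, the two-symbol BOS padding and the absence of smoothing are assumptions made for this example, not details given in the paper.

```python
from collections import Counter
from math import log


def train_context_model(class_sequences):
    """Collect trigram and history counts from class sequences.

    Each sequence mixes NE class tags (PN, FN, LN, ON) with ordinary
    words, since every ordinary word is its own class.
    """
    tri, bi = Counter(), Counter()
    for seq in class_sequences:
        # Assumption: pad with two BOS symbols and one EOS symbol.
        padded = ["BOS", "BOS"] + list(seq) + ["EOS"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi


def context_log_prob(seq, tri, bi):
    """log P(C) as a product of class trigrams (unsmoothed MLE).

    Returns -inf when a trigram was never seen in training.
    """
    padded = ["BOS", "BOS"] + list(seq) + ["EOS"]
    total = 0.0
    for i in range(2, len(padded)):
        t = tuple(padded[i - 2:i + 1])
        if tri[t] == 0:
            return float("-inf")
        total += log(tri[t] / bi[t[:2]])
    return total
```

A real system would smooth these estimates; the sketch keeps raw relative frequencies so that the correspondence with Equation (2) stays visible.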
$P(C)$ can be estimated using an NE-labeled
corpus. The entity model can be parameterized
by Equation (3):
$$\begin{aligned}
P(S \mid C) &= P(s_1 \ldots s_n \mid c_1 \ldots c_m) \\
&\cong P([s_{c_1\text{-start}} \ldots s_{c_1\text{-end}}] \ldots [s_{c_m\text{-start}} \ldots s_{c_m\text{-end}}] \mid c_1 \ldots c_m) \\
&\cong \prod_{j=1}^{m} P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid c_j)
\end{aligned} \qquad (3)$$
The entity model estimates the generative
probability of the Chinese character sequence in
each square-bracket pair (i.e. from position
$c_j$-start to $c_j$-end) given the specific NE class.
For different classes, we define different
entity models.
For the class of PER (including PN and
FN), the entity model is a character-based
trigram model as shown in Equation (4).
$$P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid c_j = \mathrm{PER}) = \prod_{k=c_j\text{-start}}^{c_j\text{-end}} P(s_k \mid s_{k-2}, s_{k-1}, c_j = \mathrm{PER}) \qquad (4)$$
where s can be any character occurring in a
person name. For example, the generative
probability of the character sequence 李大鹏 (Li
Dapeng) is much larger than that of the
three-character phrase meaning "many years",
given PER, since 李 is a commonly used family
name, and 大 and 鹏 are commonly used in given
names. The probabilities
can be estimated with the person name list.
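The character-based trigram of Equation (4) can be sketched as follows. The toy name list, the `<s>` padding for the first two characters and the lack of smoothing are assumptions for illustration; the paper only states that the probabilities are estimated from a person name list.

```python
from collections import Counter


def train_char_trigram(names):
    """Count character trigrams and their two-character histories."""
    tri, bi = Counter(), Counter()
    for name in names:
        # Assumption: two start symbols give the first characters a history.
        chars = ["<s>", "<s>"] + list(name)
        for k in range(2, len(chars)):
            tri[tuple(chars[k - 2:k + 1])] += 1
            bi[tuple(chars[k - 2:k])] += 1
    return tri, bi


def name_prob(candidate, tri, bi):
    """P(candidate | PER) as a product of character trigrams (unsmoothed)."""
    chars = ["<s>", "<s>"] + list(candidate)
    p = 1.0
    for k in range(2, len(chars)):
        t = tuple(chars[k - 2:k + 1])
        if bi[t[:2]] == 0:
            return 0.0  # unseen history
        p *= tri[t] / bi[t[:2]]
    return p
```

With a hypothetical list such as `["李大鹏", "李大明", "王大鹏"]`, a string that starts with a listed family name scores well above an arbitrary string of the same length, which is the contrast the example in the text describes.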
For the class of LOC, the entity model is a
word-based trigram model as shown in
Equation (5).
The class-based model of Section 3.1.1 is given by Equation (1):
$$C^* = \arg\max_{C} P(C \mid S) = \arg\max_{C} P(C) \times P(S \mid C) \qquad (1)$$
The LOC entity model is:
$$P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid c_j = \mathrm{LOC}) = \max_{W} P(w_1 \ldots w_l \mid c_j = \mathrm{LOC}) \cong \max_{W} \left[ \prod_{k=1}^{l} P(w_k \mid w_{k-2}, w_{k-1}, c_j = \mathrm{LOC}) \right] \qquad (5)$$
where $W = w_1 \ldots w_l$ is a possible segmentation
of the character sequence
$s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}$.
For the class of ORG, the construction is
much more complicated because an ORG often
contains PER and/or LOC. For example, the
ORG "中国国际航空公司" (Air China
Corporation) contains the LOC "中国" (China).
Applications such as question answering and
information extraction benefit if
nested NEs can be identified as well. In order to
identify PER and LOC nested in ORG², we
further adopt class-based LMs for ORG,
with three sub-models: a class generative
model and two entity models, namely the person
name model and the location name
model in ORG. The entity model of
ORG is shown in Equation (6), which is almost
the same as Equation (1).
$$\begin{aligned}
& P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid c_j = \mathrm{ORG}) \\
&= \max_{C'} \left[ P(C' \mid c_j = \mathrm{ORG}) \times P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid C', c_j = \mathrm{ORG}) \right] \\
&= \max_{C'} \left[ P(c'_1 \ldots c'_k \mid c_j = \mathrm{ORG}) \times P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid c'_1 \ldots c'_k, c_j = \mathrm{ORG}) \right] \\
&\cong \max_{C'} \left[ \prod_{i=1}^{k} P(c'_i \mid c'_{i-2}, c'_{i-1}, c_j = \mathrm{ORG}) \times \prod_{i=1}^{k} P([s_{c'_i\text{-start}} \ldots s_{c'_i\text{-end}}] \mid c'_i, c_j = \mathrm{ORG}) \right]
\end{aligned} \qquad (6)$$
where $C' = c'_1 \ldots c'_k$ is the sequence of classes
corresponding to the Chinese character
sequence.
In addition, if $c_j$ is a normal word,
$$P([s_{c_j\text{-start}} \ldots s_{c_j\text{-end}}] \mid c_j) = 1. \qquad (7)$$
Based on the context model and entity
models, we can compute the probability $P(C \mid S)$
and obtain the optimal class sequence. The
Chinese PER and transliterated PER share the
same context class model when computing the
probability.
² For simplification, only person and location
names nested in organization names are identified.
Person names nested in location names are not
identified because of their low frequency.
3.1.2 Models Estimation
As discussed in 3.1.1, there are two kinds of 
probabilities to be estimated: P(C) and P(S|C) . 
Both probabilities are estimated using 
Maximum Likelihood Estimation (MLE) with 
the annotated training corpus.
The parser NLPWin³ was used to tag the
training corpus. As a result, the corpus was
annotated with NE marks. Four lists were
extracted from the annotated corpus, and each
list corresponds to one NE class. The context
model $P(C)$ was trained with the annotated
corpus and the four entity models were trained
with the corresponding NE lists. Figure 1
shows the training process (begin-of-sentence
(BOS) and end-of-sentence (EOS) markers are added).
Sentence tagged by NLPWin (English gloss):
  <LOC>U.S.</LOC> president <PER>Bush</PER> arrived in
  <LOC>P.R. China</LOC> by flight No.1 of
  <ORG>Air China Corp.</ORG>
Context class sequence: BOS LN 总统 PN w ON w w LN EOS
  (w stands for an ordinary word, which forms its own class)
LN list: 美国, 中国
FN list: 布什
ON list: 中国国际航空公司
ON class list: LN 国际航空公司
Figure 1: Example of Training Process
3.1.3 Decoder
Given a sequence of Chinese characters, the 
decoding process consists of the following 
three steps:
Step 1: All possible word segmentations are
generated using a Chinese lexicon
containing 120,050 entries. The lexicon is
used only for segmentation and carries no
NE tags, even if a word is a PER, LOC or
ORG. For example, 北京 (Beijing) is not
tagged as LOC.
Step 2: NE candidates are generated from any
one or more segmented character strings, and
the corresponding generative probability of
each candidate is computed using the entity
models described in Equations (4)-(7).
Step 3: Viterbi search is used to select the
hypothesis with the highest probability as
the best output. Furthermore, in order to
identify nested named entities, a two-pass
Viterbi search is adopted. The inner Viterbi
search corresponds to Equation (6) and
the outer one to Equation (1).
After the two-pass search, the word
segmentation and the named entities
(including nested ones) are obtained.
³ The NLPWin system is a natural language processing
system developed by Microsoft Research.
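The outer search of Step 3 can be sketched as a dynamic program over a lattice of candidates. This is a simplified illustration, not the authors' decoder: it uses a class bigram context instead of the trigram of Equation (2), omits the inner pass for ORG, and the names `viterbi`, `candidates` and `context_logp` are hypothetical.

```python
def viterbi(n, candidates, context_logp):
    """Find the best path through a candidate lattice over n characters.

    candidates: list of (start, end, cls, entity_logp) covering the
        characters [start, end), with end > start. Ordinary words use the
        word itself as cls and entity_logp = 0.0, as in Equation (7).
    context_logp(prev_cls, cls): log P(cls | prev_cls), a bigram
        simplification of the context model.
    Returns (best_log_prob, [(start, end, cls), ...]).
    """
    by_start = {}
    for cand in candidates:
        by_start.setdefault(cand[0], []).append(cand)

    # best[(position, last_class)] = (score, backpointer)
    best = {(0, "BOS"): (0.0, None)}
    for pos in range(n):
        states = [(key, val) for key, val in best.items() if key[0] == pos]
        for (p, prev_cls), (score, _) in states:
            for start, end, cls, ent_lp in by_start.get(pos, []):
                new = score + context_logp(prev_cls, cls) + ent_lp
                key = (end, cls)
                if key not in best or new > best[key][0]:
                    best[key] = (new, ((p, prev_cls), (start, end, cls)))

    finals = [(score + context_logp(cls, "EOS"), (p, cls))
              for (p, cls), (score, _) in best.items() if p == n]
    if not finals:
        return float("-inf"), []
    best_score, key = max(finals)

    # Follow backpointers from the end of the sentence to BOS.
    path = []
    while best[key][1] is not None:
        prev_key, edge = best[key][1]
        path.append(edge)
        key = prev_key
    return best_score, path[::-1]
```

Because word candidates and NE candidates compete in the same lattice, the selected path yields the segmentation and the NE labels at the same time, which is the point of the unified framework.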
3.2 Improvement
There are some problems with the framework 
of NE identification using the class-based LM.
First, redundant candidate NEs are generated
in the decoding process, which results in a very
large search space. The second problem is that
data sparseness will seriously influence the 
performance. Finally, the abbreviation of NEs 
cannot be handled effectively. In the following 
three subsections, we provide solutions to the 
three problems mentioned above.
3.2.1 Heuristic Information
In order to overcome the redundant candidate
generation problem, heuristic information
is introduced into the class-based LM. The
following resources were used: (1) a Chinese
family name list containing 373 entries (e.g. 张
(Zhang), 王 (Wang)); (2) a transliterated name
character list containing 618 characters (e.g. 什
(shi), 顿 (dun)); and (3) an ORG keyword list
containing 1,355 entries (e.g. 大学 (university),
公司 (corporation)).
The heuristic information is used to
constrain the generation of NE candidates. For
PER (PN), only candidates beginning with
a family name are considered. For PER (FN), a
candidate is generated only if all of its
characters belong to the transliterated name
character list. For ORG, a candidate is excluded
if it does not contain an ORG keyword.
We do not use LOC keywords to
generate LOC candidates because
many LOCs do not end with keywords.
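The three constraints above amount to a filter applied to each candidate before it enters the search. A minimal sketch follows; the function name and the small resource sets in the usage note are hypothetical stand-ins for the 373-entry, 618-character and 1,355-entry lists.

```python
def keep_candidate(text, cls, family_names, translit_chars, org_keywords):
    """Return True if an NE candidate passes the heuristic constraints."""
    if cls == "PN":
        # Chinese person name: must begin with a listed family name.
        return any(text.startswith(name) for name in family_names)
    if cls == "FN":
        # Transliterated name: every character must be in the list.
        return len(text) > 0 and all(ch in translit_chars for ch in text)
    if cls == "ON":
        # Organization: must contain at least one ORG keyword.
        return any(keyword in text for keyword in org_keywords)
    # Location candidates are not constrained by keywords.
    return True
```

For instance, with `family_names={"张", "王"}` and `org_keywords={"大学", "公司"}`, the candidate `"北京大学"` is kept as ON while `"北京"` is rejected as ON but still allowed as LN.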
3.2.2 Cache Model
The cache entity model can address the data
sparseness problem by adjusting the parameters
continually as NE identification proceeds. The
basic idea is to accumulate the Chinese character
or word n-grams that have appeared so far in the
document and use them to create a local dynamic
entity model such as $P_{bicache}(w_i \mid w_{i-1})$ and
$P_{unicache}(w_i)$. We can interpolate the cache
entity model with the static entity
LM $P_{static}(w_i \mid w_1 \ldots w_{i-2} w_{i-1})$:
$$P_{cache}(w_i \mid w_1 \ldots w_{i-2}, w_{i-1}) = \lambda_1 P_{unicache}(w_i) + \lambda_2 P_{bicache}(w_i \mid w_{i-1}) + (1 - \lambda_1 - \lambda_2) P_{static}(w_i \mid w_1 \ldots w_{i-2} w_{i-1}) \qquad (8)$$
where $\lambda_1, \lambda_2 \in [0, 1]$ are interpolation weights
determined on the held-out data set.
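Equation (8) can be illustrated with a small cache that is updated as the document is processed. The class name `CacheModel`, its methods and the weights in the usage note are assumptions for this sketch; the static probability is passed in as a number because the static model is trained separately.

```python
from collections import Counter


class CacheModel:
    """Unigram/bigram cache interpolated with a static LM probability."""

    def __init__(self, lam1, lam2):
        if lam1 < 0 or lam2 < 0 or lam1 + lam2 > 1:
            raise ValueError("weights must be non-negative and sum to at most 1")
        self.lam1, self.lam2 = lam1, lam2
        self.uni = Counter()    # counts of cached units
        self.hist = Counter()   # counts of units seen as a bigram history
        self.bi = Counter()     # counts of cached bigrams
        self.total = 0

    def observe(self, prev, word):
        """Add one unit (with its predecessor) to the cache."""
        self.uni[word] += 1
        self.hist[prev] += 1
        self.bi[(prev, word)] += 1
        self.total += 1

    def prob(self, word, prev, p_static):
        """Interpolated probability of word given prev, as in Equation (8)."""
        p_uni = self.uni[word] / self.total if self.total else 0.0
        p_bi = self.bi[(prev, word)] / self.hist[prev] if self.hist[prev] else 0.0
        rest = 1.0 - self.lam1 - self.lam2
        return self.lam1 * p_uni + self.lam2 * p_bi + rest * p_static
```

A typical use is to call `observe` for each unit of an NE once it has been identified and `prob` when scoring later candidates in the same document, so that repeated names receive a higher entity probability.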
3.2.3 Dealing with Abbreviation
We found that many errors result from
abbreviations of person, location
and organization names. Therefore, different
strategies are adopted to deal with
abbreviations of different kinds of NEs. For
PER, if a Chinese surname is followed by a
title, the surname is tagged as PER. For
example, in the phrase glossed as "President Zuo",
the surname is tagged as <PER>Zuo</PER> and the
title is left untagged. For LOC, if at least
two location abbreviations occur consecutively,
each individual location abbreviation is tagged as
LOC. For example, 中日关系 (Sino-Japan
relation) is tagged as
<LOC>中</LOC><LOC>日</LOC>关系. For ORG, if
an organization abbreviation is followed by a LOC,
which is in turn followed by an organization
keyword, the three units are tagged as one ORG.
For example, 中共北京市委 (Chinese
Communist Party Committee of Beijing) is
tagged as <ORG>中共<LOC>北京</LOC>市委</ORG>.
At present, we have collected 112
organization abbreviations and 18 location
abbreviations.
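The LOC rule above can be sketched as a scan for runs of abbreviation characters. The function name and the assumption that every location abbreviation is a single character are choices made for this illustration.

```python
def tag_loc_abbreviations(text, loc_abbrevs):
    """Return (start, end) spans of location abbreviations to tag as LOC.

    A character is tagged only when it belongs to a run of at least two
    consecutive location abbreviations; each character in the run becomes
    its own LOC span.
    """
    spans = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] in loc_abbrevs:
            j += 1
        if j - i >= 2:
            spans.extend((k, k + 1) for k in range(i, j))
        i = max(j, i + 1)
    return spans
```

With `loc_abbrevs={"中", "日", "美"}`, the string `"中日关系"` yields two single-character spans, while a lone abbreviation character followed by ordinary text yields none.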
4 Experiments
4.1 Evaluation Metric
We conduct evaluations in terms of precision (P) 
and recall (R).
$$P = \frac{\text{number of correctly identified NE}}{\text{number of identified NE}} \qquad (9)$$
$$R = \frac{\text{number of correctly identified NE}}{\text{number of all NE}} \qquad (10)$$
We also used the F-measure, which is defined 
as a weighted combination of precision and 
recall as Equation (11):
$$F = \frac{(\beta^2 + 1) \times P \times R}{(\beta^2 \times P) + R} \qquad (11)$$
where $\beta$ is the relative weight of precision and
recall.
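Equation (11) is straightforward to compute. In the sketch below the function name is hypothetical and the default $\beta = 1$ is an assumption; the paper does not state the value used, although $\beta = 1$ reproduces the F values listed for PER and ORG in Table 8.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted combination of precision and recall, as in Equation (11)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# PER row of Table 8: P = 79.86, R = 87.29
print(round(f_measure(79.86, 87.29), 2))  # 83.41
```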
There are two differences between MET 
evaluation and ours. First, we include nested 
NE in our evaluation whereas MET does not.
Second, in our evaluation, only NEs with 
correct boundary and type label are considered 
correct identifications. In MET, the
evaluation is somewhat more flexible: for example,
an NE may be counted as partially correct if the
label is correct but the boundary is wrongly
detected.
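The strict matching criterion can be made concrete by representing each NE as a (start, end, type) triple and counting only exact matches. The function name and the toy spans in the usage note are hypothetical; nested NEs are simply additional triples in both sets.

```python
def precision_recall(predicted, gold):
    """Precision and recall with exact boundary and type matching.

    predicted, gold: iterables of (start, end, type) triples.
    """
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall
```

For example, a predicted PER spanning (4, 7) does not match a gold PER spanning (4, 6), so it counts as an error for precision and as a miss for recall, with no partial credit.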
4.2 Data Sets
The training text corpus contains data from
People's Daily (Jan.-Jun. 1998). It contains
357,544 sentences (about 9,200,000 Chinese
characters). This corpus includes 104,487
Chinese PER, 51,708 transliterated PER,
218,904 LOC, and 87,391 ORG. These figures
were obtained after the corpus was parsed with
NLPWin.
We built wide-coverage test data
according to guidelines⁴ that are the same
as those of 1999 IEER. The test set (shown in
Table 2) contains half a million Chinese
characters; it is a balanced test set covering 11
domains. The test set contains 11,844 sentences,
49.84% of which contain at least one NE.
Characters inside NEs account for
8.448% of all Chinese characters.
The test data is much larger
than the MET test data and the IEER data.
⁴ The difference between IEER's guidelines and
ours is that person and location names nested in
organization names are tagged in our guidelines.
Table 2: Statistics of Open-Test
                     Number of NE Tokens
ID  Domain           PER    LOC    ORG    Size (byte)
1   Army              65    202     25     19k
2   Computer          75    156    171     59k
3   Culture          548    639     85    138k
4   Economy          160    824    363    108k
5   Entertainment    672    575    139    104k
6   Literature       464    707    122     96k
7   Nation           448   1193    250    101k
8   People          1147    912    403    116k
9   Politics         525   1148    218    122k
10  Science          155    204     87     60k
11  Sports           743   1198    628    114k
    Total           5002   7758   2491   1037k
4.3 Training Data Preparation
The training data produced by NLPWin contains
some noise, for two reasons. First, the NE
guideline used by NLPWin is different from
the one we used. For example, in NLPWin, 北京市
(Beijing City) is tagged as <LOC>北京</LOC>市,
whereas 北京市 should be a LOC
in our definition. Second, there are some errors
in the NLPWin results. We used 18 rules to
correct the frequent errors. The following
shows some examples.
Table 4 shows the quality of our training
corpus.
Table 4   Quality of Training Corpus 
NE P (%) R (%) F (%)
PER 61.05 75.26 67.42
LOC 78.14 71.57 74.71
ORG 68.29 31.50 43.11
Total 70.07 66.08 68.02
4.4 Experiments
We conduct incrementally the following four 
experiments:
(1) Class-based LM, we view the results as 
baseline performance;
(2) Integrating heuristic information into (1);
(3) Integrating Cache-based LM with (2);
(4) Integrating NE abbreviation processing 
with (3).
Examples of the correction rules of Section 4.3:
LN + Location Key → LN
LN + l + LN + **w → ON
(中|美|…)国人 → LN + 人
4.4.1 Class-based LM (Baseline)
Based on the basic class-based models 
estimated with the training data, we can get the 
baseline performance, as is shown in Table 5. 
Comparing Table 4 and Table 5, we found that 
the performance of baseline is better than the 
quality of training data.
Table 5    Baseline Performance
NE P (%) R (%) F (%)
PER 65.70 84.37 73.87
LOC 82.73 76.03 79.24
ORG 56.55 38.56 45.86
Total 72.61 72.44 72.53
4.4.2 Integrating Heuristic Information
In this part, we want to see the effects of using 
heuristic information. The results are shown in 
Table 6. In experiments, we found that by 
integrating the heuristic information, we not 
only achieved more efficient decoding, but also 
obtained higher NE identification precision. For 
example, the precision of PER increases from 
65.70% to 77.63%, and precision of ORG 
increases from 56.55% to 81.23%.  The reason 
is that adopting heuristic information reduces 
the noise influence.
However, we noticed that the recall of PER
and ORG decreased a bit (compare Table 5 and
Table 6). There are two reasons.
First, organization names without organization
ending keywords were not marked as ORG.
Second, Chinese names without surnames were
also missed.
Table 6 Results of Heuristic Information Integrated 
into the Class-based LM
NE P (%) R (%) F (%)
PER 77.63 80.89 79.23
LOC 80.05 80.80 80.42
ORG 81.23 36.65 50.51
Total 79.26 73.41 76.23
4.4.3 Integrating Cache-based LM
Table 7 shows the evaluation results after 
cache-based LM was integrated. From Table 6 
and Table 7, we found that almost all the 
precision and recall of PER, LOC, ORG have 
obtained slight improvements.
Table 7   Results of our system
NE P (%) R (%) F (%)
PER 79.12 82.06 80.57
LOC 80.11 81.27 80.69
ORG 79.71 39.89 53.17
Total 79.72 74.58 77.06
4.4.4 Integrating with NE Abbreviation Processing
In this experiment, we integrated NE
abbreviation processing. As shown in Table 8,
the recall of PER, LOC and ORG increased from
82.06%, 81.27% and 39.89% (Table 7) to
87.29%, 82.46% and 56.54%,
respectively.
Table 8   Results of our system
NE P (%) R (%) F (%)
PER 79.86 87.29 83.41
LOC 80.88 82.46 81.66
ORG 76.63 56.54 65.07
Total 79.99 79.68 79.83
4.4.5 Summary
From the above data, we observe that (1) the
class-based LM performs better than the training
data automatically produced by the parser; (2)
distinct improvements are achieved by using
heuristic information; and (3) our
method of dealing with abbreviations increases
the recall of NEs.
In addition, the cache-based LM improves
performance only slightly. The reason is as
follows. The cache-based LM is based on the
hypothesis that a word used in the recent past is
much more likely to be used soon than either its
overall frequency in the language or a 3-gram
model would suggest (Kuhn, 1990). However,
we found that the same NE often changes its
form within the same document. For example,
the same NE appears successively as 中共北京市委
(Chinese Communist Party Committee of Beijing),
北京市委 (Committee of Beijing City) and 市委
(Committee).
Furthermore, we notice that the
segmentation dictionary has an important
impact on the performance of NE
identification. Adding more words to the
dictionary does not necessarily help. For
example, because 中国人 (Chinese) is in our
dictionary, it is quite likely that 中国
(China) inside 中国人 is missed.
5 Evaluation with MET2 and IEER Test Data
We also evaluated on the MET2 test data and 
IEER test data. The results are shown in Table 
9. The results on MET2 are lower than the
best results reported for MUC7 (PER: Precision 66%,
Recall 92%; LOC: Precision 89%, Recall 91%;
ORG: Precision 89%, Recall 88%,
http://www.itl.nist.gov). We suggest the
following reasons. The main
reason is that our class-based LM was
estimated from a general-domain corpus, which
is quite different from the domain of the MUC
data. Moreover, we did not use an NE dictionary.
Another reason is that our NE definitions are
slightly different from those of MET2.
Table 9 Results on MET2 and IEER
        MET2 Data                 IEER Data
NE      P (%)   R (%)   F (%)     P (%)   R (%)   F (%)
PER     65.86   94.25   77.54     79.38   84.43   81.83
LOC     77.42   89.60   83.07     79.09   80.18   79.63
ORG     88.47   75.33   81.38     88.03   62.30   72.96
Total   77.89   86.09   81.79     80.82   76.78   78.75
6 Conclusions & Future work
In this research, Chinese word segmentation
and NE identification have been integrated into
a framework using class-based language
models (LM). We adopted a hierarchical
structure in the ORG model so that the nested
entities in organization names can be identified.
Another characteristic is that our NE
identification does not use an NE dictionary
when decoding.
The evaluation on a large test set shows
consistent improvements. The integration of 
heuristic information improves the precision 
and recall of our system. The cache-based LM
increases the recall of NE identification to
some extent. Moreover, the rules dealing
with abbreviations of NEs have
dramatically increased the performance. The precision of
PER, LOC, ORG on the test set is 79.86%, 
80.88%, 76.63%, respectively; and the recall is
87.29%, 82.46%, 56.54%, respectively.
In future work, we will first focus
on NE coreference using language models.
Second, we intend to extend our model to
include a part-of-speech tagging model to
improve performance. At present, the
class-based LM is trained on a general-domain
corpus, and we may need to fine-tune the model for a
specific domain.
ACKNOWLEDGEMENT
I would like to thank Ming Zhou, Jianfeng 
Gao, Changning Huang, Andi Wu, Hang Li
and other colleagues from Microsoft Research 
for their help. And I want to thank especially 
Lei Zhang from Tsinghua University for his 
help in developing the ideas.

References 

Borthwick A. (1999) A Maximum Entropy
Approach to Named Entity Recognition. PhD 
Dissertation

Bikel D., Schwartz R., Weischedel R. (1997) An
algorithm that learns what's in a name. Machine 
Learning 34, pp. 211-231

Brown, P. F., DellaPietra, V. J., deSouza, P. V., Lai, 
J. C., and Mercer, R. L. (1992). Class-based 
n-gram models of natural language. Computational 
Linguistics, 18(4):468--479.

Chinchor N. (1997) MUC-7 Named Entity Task
Definition Version 3.5. Available from
ftp.muc.saic.com/pub/MUC/MUC7-guidelines

Chen H.H., Ding Y.W., Tsai S.C. and Bian G.W. 
(1997) Description of the NTU System Used for 
MET2

Gao J.F., Goodman J., Li M.J., Lee K.F. (2001)  
Toward a unified Approach to Statistical Language 
Modeling for Chinese. To appear in ACM 
Transaction on Asian Language Processing

Kuhn R., Mori. R.D. (1990) A Cache-Based 
Natural Language Model for Speech Recognition. 
IEEE Transaction on Pattern Analysis and Machine 
Intelligence.Vol.12. No. 6. pp 570-583

Mikheev A., Grover C. and Moens M. (1997)
Description of the LTG System Used for MUC-7

Sekine S., Grishman R. and Shinou H. (1998), A 
decision tree method for finding and classifying 
names in Japanese texts, Proceedings of the Sixth 
Workshop on Very Large Corpora, Canada 

Yu S.H., Bai S.H. and Wu P. (1997) Description of 
the Kent Ridge Digital Labs System Used for 
MUC-7

Zhang L. (2001) Study on Chinese Proofreading 
Oriented Language Modeling, PhD Dissertation
