Probabilistic named entity verification
Yi-Chung Lin and Peng-Hsiang Hung
Advanced Technology Center, Computer and Communications Laboratories,
Industrial Technology Research Institute, Taiwan
{lyc,phhung}@itri.org.tw
Abstract
Named entity (NE) recognition is an impor-
tant task for many natural language applica-
tions, such as Internet search engines,
document indexing, information extraction
and machine translation. Moreover, in ori-
ental languages (such as Chinese, Japanese
and Korean), NE recognition is even more
important because it significantly affects the
performance of word segmentation, the most
fundamental task for understanding the texts
in oriental languages. In this paper, a prob-
abilistic verification model is designed for
verifying the correctness of a named entity
candidate. This model assesses the confi-
dence level of a candidate not only accord-
ing to the candidate’s structure but also ac-
cording to its context. In our design, the
clues for confidence measurement are col-
lected from both positive and negative ex-
amples in the training data in a statistical
manner. Experimental results show that the
proposed method significantly improves the
F-measure of Chinese personal name recog-
nition from 86.5% to 94.4%.
Introduction
Named entity (NE) recognition (or proper name
recognition) is a task to find the entities of per-
son, location, organization, date, time, percent-
age and monetary value in text documents. It is
an important task for many natural language
applications, such as Internet search engines,
document indexing, information extraction and
machine translation. Moreover, in oriental lan-
guages (such as Chinese, Japanese and Korean),
NE recognition is even more important because
it significantly affects the performance of word
segmentation, the most fundamental task for
understanding the texts in oriental languages.
Therefore, a high-accuracy NE recognition
method is highly demanded for most natural
language applications in various languages.
There are two major approaches to NE
recognition: the handcrafted approach (Grish-
man, 1995) and the statistical approach (Bikel,
1997; Chen, 1998; Yu, 1998). In the first ap-
proach, a system usually relies on a large num-
ber of handcrafted rules. This kind of systems
can be rapid prototyped but are hard to scale up.
In fact, there will be numerous exceptions for
most handcrafted rules. It is generally expensive
and impossible to code for every exception we
can imagine, not to mention those exceptions we
are not able to think of. Another serious problem
with the handcrafted approach is that systems
are hard to be ported across different domains
and different languages. Porting a handcrafted
system usually means rewriting all its rules.
Therefore, the statistical approach is becoming
more and more popular because of its cost-
effectiveness in scaling up and porting systems.
In general, the statistical approach to NE
recognition can be viewed as a two-stage proc-
ess. First, according to dictionaries and/or pat-
tern matching rules, the input text is tokenized
into tokens. Each token may be a word or an NE
candidate which can consist of more than one
word. Then, the most likely token sequence is
selected according to a statistical model, such as
Markov model (Bikel, 1997; Yu, 1998) or
maximum entropy model (Borthwick, 1999).
Although, the statistical NE recognition is much
more scalable and portable, its performance is
still not satisfactory. The insufficient cover-
age/precision of pattern matching rules and
unknown words are the major sources of errors.
Furthermore, the role of the statistical model is
to assess the relative possibilities of all possible
token sequences and select the most probable
one. The scores obtained from the statistical
model can be used for a comparison of compet-
ing token sequences, but not for an assessment
of the probability that a spotted named entity is
correct.
To reduce the recognition errors, we pro-
pose a probabilistic verification model to verify
the correctness of a named entity. This model
assesses the confidence level of a named entity
candidate not only according to the candidate’s
structure but also according to its contexts. In
our design, the clues for confidence measure-
ment are collected from both positive and nega-
tive examples in the training data. Therefore, the
confidence measure has strong discriminant
power for judging the correctness of a named
entity. In the experiments of Chinese personal
name recognition, the proposed verification
model increases the F-measure from 86.5% to
94.4%, which corresponds to 58.5% error re-
duction rate, where “error rate” is defined as
“100% F-measure− ”.
1. Named Entity Verification
As mentioned before, there are several kinds of
named entities, including person, location, or-
ganization, date, time, percentage and monetary
value. In the following description, we use the
task of verifying Chinese personal name as an
example. However, our proposed method is also
applicable on verifying other kinds of named
entities in different languages.
Before introducing our approach, we first
describe the notations that will be used. In this
proposal, a random variable is written with a
boldface italic letter. An outcome of a random
variable is written with the same italic letter but
in normal face. For example, an outcome of the
random variable o is denoted as o .Ifthereis
no confusion, we usually use ()P o to denote
the probability ()P o=o . A symbol sequence
“
1
,,
n
xx " ” is denoted as “
1
n
x ”. Likewise, “
,
,1
Yn
Y
x ”
denotes the sequence “
,1,
,,
Y Y n
x x " ”.
The task of verifying a named entity can-
didate is viewed as making an acceptance or
rejection decision according to the text segment
consisting of the candidate and its context.
Without loss of generality, a text segment is
considered as an outcome of the random vector
, ,,
,1,1,1
[, ,]
L xCyRz
LC R
=Oooo. The outcome of each ran-
dom variable in O is one basic element of text.
In Chinese, the basic elements of text are Chine-
se characters. However, in English, the basic
elements are English words. Figure 1 shows an
example of a text segment in which the size of
the candidate to be verified is 3 (i.e., consists of
three Chinese characters) and the sizes of its left
context and right context are set to 2 (i.e., two
Chinese characters).
Figure 2 depicts the flowchart of our veri-
fication approach. First, the candidate in the
input text segment is parsed by a predefined
grammar. If the candidate is ill-formed (i.e., fail
Figure 1: Example of a text segment.
Candidate
Parsing
Confidence
Measurement
Ill-formed?
Yes
Reject
No
cm δ<
Yes
Reject
No
Accept
Text
Segment
Figure 2: Flowchart of the verification
method.
      	     

G471
shi
,1L
o
G7cf
zhang
Gb2d
ma
G92e
ying
G390
jiu
G7c4
biao
G4a2
shi
,2L
o
,1C
o
,2C
o
,3C
o
,1R
o
,2R
o
   	  
   	  

Left Context Right Context
Candidate
to be parsed), it will be rejected immediately.
Otherwise, the text segment is passed to the
confidence measurement module to assess the
confidence level that the candidate in the text
segment is to be a named entity. If the confi-
dence measure is less than a predefined thresh-
old, the candidate will be rejected. Otherwise, it
will be accepted.
2. Confidence Measurement
The basic idea of our approach is to formulate
the confidence measurement problem as the
problem of hypothesis testing. The null hypothe-
sis
0
H in which the candidate is a name is
tested against the alternative hypothesis
1
H in
which the candidate is not a name. According to
Neyman-Pearson Lemma, an optimal hypothesis
test involves the evaluation of the following log
likelihood ratio:
,,,
,1 ,1 ,1
,,,
,1 ,1 ,1 0
,,,
,1 ,1 ,1 1
,,,
,1 ,1 ,1 0
,,,
,1,1,11
(, ,)
(, , |)
log
(, , |)
log ( , , | )
log ( , , | )
Lx Cy Rz
LC R
Lx Cy Rz
LC R
Lx Cy Rz
LC R
Lx Cy Rz
LC R
Lx Cy Rz
LC R
LLR o o o
Po o o H
Po o o H
Po o o H
P ooo H
=
=
−
(1)
where
,,,
,1,1,1 0
(, , |)
Lx Cy Rz
LC R
P ooo His the likelihood of
the candidate and its left and right contexts
given the hypothesis that the candidate is a name
and
,,,
,1,1,11
(, , |)
Lx Cy Rz
LC R
P ooo H is the likelihood of
the candidate and its left and right contexts
given the hypothesis that the candidate is not a
name. The hypothesis test is performed by com-
paring the log likelihood ratio
, ,,
,1,1,1
(, ,)
L xCyRz
LC R
LLR o o o to a predefined critical
threshold δ .If
,,,
,1 ,1 ,1
(, ,)
Lx Cy Rz
LC R
LLR o o o δ≥ ,the
null hypothesis will be accepted. Otherwise, it
will be rejected.
In our design, as shown in Figure 3, the
confidence measurement module consists of two
models, named NE model and anti-NE model.
The NE model is used to assess the value of
,,,
,1,1,1 0
log ( , , | )
Lx Cy Rz
LC R
P ooo Hand the anti-NE model
is used to assess the value of
,,,
,1,1,11
log ( , , | )
Lx Cy Rz
LC R
P ooo H.
2.1. NE Model
The purpose of the NE model is to evaluate the
value of
,,,
,1,1,1 0
log ( , , | )
Lx Cy Rz
LC R
P ooo H, the log like-
lihood of the candidate and its left and right
contexts given the hypothesis that the candidate
is a name. Since it is infeasible to directly esti-
mate the probability
,,,
,1,1,1 0
(, , |)
Lx Cy Rz
LC R
P ooo H,itis
approximated as follows:
, ,, ,,,
,1,1,1 0 0,1,1,1
,,,
0,10,10,1
(, , |) ( ,)
()()()
L xCyRz LxCyRz
LC R LC R
Lx Cy Rz
LC
Po o o H P o o o
Po Po Po
≡
≈
(2)
where the subscript of
0
()P ⋅ indicates the prob-
ability is evaluated given that the null hypothesis
is true. The probability
,
0 ,1
()
L x
L
P o is further ap-
proximated according to the bigram model as
follows:
,
0 ,1 0 , , 1
1
() (| )
x
Lx
LLiLi
i
Po Po o
−
=
≈
∏
(3)
where
0 ,1 ,0 0 ,1
(|) ()
LL L
P oo Po≡ . One should notice
that we do not assume that the random sequence
,
,1
Lx
L
o is time invariant. For example, the prob-
ability
,,1
(| )
Li Li
P x y
−
==oo is not assumed to
be equal to
,2 ,1
(|)
LL
P xy==oo for 3i ≥ .
Likewise, the probability
,
0 ,1
()
R z
R
P o is also fur-
ther approximated as follows:
,
0 ,1 0 , , 1
1
() (| )
z
Rz
RRiRi
i
Po Po o
−
=
≈
∏
(4)
where
0 ,1 ,0 0 ,1
(|) ()
RR R
P oo Po≡ .
The probability corresponding to the can-
didate is evaluated by applying the SCFG (Sto-
NE
()S ⋅
G2be
cm
NE Model
anti-NE
()S ⋅
Anti-NE Model
G2c0G2c4
Text
Segment
Figure 3: Block diagram of the confi-
dence measurement module.
chastic Context-free Grammar) model (Fujisaki,
1989) as follows:
,
0,1 0
00
() ()
max ( ) max ( | )
Cy
C
T
TT
AT
Po PT
P TPA
α
α
→∈
=
≈=
∑
∏
(5)
where T stands for one possible parse tree that
derive the candidate, A α→ indicates a rule in
the parse tree T , A stands for the left-hand-
side symbol of the rule and α stands for the
sequence of right-hand-side symbols of the rule.
Figure 4 shows an example of a parse tree of the
Chinese personal name candidate “Gb2dG92eG390(ma
ying jiu)”, where “Gb2d(ma)” is the surname and
“G92eG390(ying jiu)” is the given name. In this figure,
the symbol “S” denotes the start symbol, the
symbol “SNG” denotes the nonterminal deriving
surname characters and the symbol “GNC”
denotes the nonterminal deriving given name
characters. As a result, according to equations (2)
-(5), the scoring function in the NE model is
defined as equation (6) to assess the log likeli-
hood of the text segment “
, ,,
,1,1,1
,,
L xCyRz
L CR
o oo”given
the null hypothesis that “
,
,1
Cy
C
o ”isaname.
,,,
NE ,1 ,1 ,1
0 ,,1 0,,1
11
0
(, ,)
log ( | ) log ( | )
max log ( | )
Lx Cy Rz
LC R
xz
Li Li Ri Ri
ii
T
AT
Sooo
Po o Po o
PA
α
α
−−
==
→∈
=+
+
∑∑
∑
(6)
where T is one possible parse tree that derive
the candidate “
,
,1
Cy
C
o ”.
2.2. Anti-NE Model
The purpose of the anti-NE model is to evaluate
the value of
,,,
,1,1,11
log ( , , | )
Lx Cy Rz
LC R
P ooo H,thelog
likelihood of the candidate and its left and right
contexts given the hypothesis that the candidate
is not a name. Since it is infeasible to directly
estimate the probability
,,,
,1,1,11
(, , |)
Lx Cy Rz
LC R
P ooo H,it
is approximated as follows:
, ,, ,,,
,1,1,11 1,1,1,1
1 , ,1 1 , ,1
11
1, ,1
1
(, , |) ( ,)
(| ) (| )
(| )
L xCyRz LxCyRz
LC R LC R
yx
Li Li Ci Ci
ii
z
Ri Ri
i
Po o o H P o o o
Po o Po o
Po o
−−
==
−
=
=
≈×
×
∏∏
∏
(7)
where
,0,R C y
o o≡ ,
,0,C L x
o o≡ ,and
1 ,1 ,0
(|)
LL
P oo
1 ,1
()
L
P o≡ . Therefore, the following scoring
function is used in the anti-NE model to assess
the log likelihood of the text segment
“
, ,,
,1,1,1
,,
L xCyRz
L CR
o oo” given the alternative hypothesis
that “
,
,1
Cy
C
o ” is not a name.
,,,
anti-NE ,1 ,1 ,1
1 ,,1 1,,1
11
1, ,1
1
(, ,)
log ( | ) log ( | )
log ( | )
Lx Cy Rz
LC R
yx
Li Li Ci Ci
ii
z
Ri Ri
i
Sooo
Po o Po o
Po o
−−
==
−
=
=+
+
∑∑
∑
(8)
3. Experiment Setup
The proposed named entity verification method
is used to recognize Chinese personal names
from news. In Chinese, most of the personal
names consist of three Chinese characters. The
first character is a surname. The last two char-
acters are a given name. Therefore, our prelimi-
nary experiments are focused on recognizing the
personal names of three Chinese characters.
In our experiments, the training corpus
consists of about 14,339,000 Chinese characters
collected from economy and industry news. This
corpus should be annotated to estimate the prob-
abilistic parameters of the scoring functions
NE
()S ⋅ and
anti-NE
()S ⋅ . However, labeling such
large amount of data is too costly or prohibited
even if it is possible. Therefore, labeling
Gb2d
(ma)
G92e
(ying)
G390
(jiu)
SNC GNC GNC
S
Figure 4: A parse tree of the Chinese per-
sonal name “Gb2dG92eG390(ma ying jiu)”.
methods that can be bootstrapped from a little
seed data or a few seed rules (Collins, 1999;
Cucerzan, 1999) are highly demanded to auto-
matically annotate the training data. In the fol-
lowing section, we propose an EM-style boot-
strapping procedure (Cucerzan, 1999) for anno-
tating the training data automatically.
3.1. EM-Style Bootstrapping
The Expectation-Maximization (EM) algorithm
(Moon, 1996) has been widely used to estimate
model parameters from incomplete data in many
different applications. In this section, an EM-
style bootstrapping procedure is proposed to
automatically annotate the named entities in the
training corpus. It iteratively uses the proposed
verification model to label the training corpus
(expectation step), and then uses the labeled
training corpus to re-estimate the parameters of
the verification model (maximization step).
Figure 5 shows the flowchart of the bootstrap-
ping procedure. First, we collect the names of
541 famous people, including government offi-
cers and CEOs of big companies. These names
are used as seed names of the name corpus.
Then, the news is automatically annotated ac-
cording to the name corpus. The annotated cor-
pus is used to estimate the probabilistic
parameters of the scoring functions. Afterward,
the proposed verification procedure is used to
verify every possible name candidate in the
news. The candidates whose confidence meas-
ures are larger than a predefined threshold are
determined to be names. Currently, if the confi-
dence measures of two overlapped candidates
(such as “ma ying jiu” and “ying jiu biao” in
Figure 1) pass the threshold, both of them are
determined as names. Although this strategy is
inadequate, it does not make too much trouble
because the chance to get overlapped names is
very small in our experiments. Finally, these
guessed names are added to the name corpus
which will be used to annotate the news in next
iteration.
In our case, after four iterations, the size
of name corpus is enlarged from 541 to 6,296, as
shown in Table 1. The total occurrence frequen-
cy of these 6,296 names in the training corpus is
40,345.
3.2. Baseline Model
In the past, many researchers have studied the
problem of Chinese personal name recognition.
Chang (1994) used the 0-order Markov model to
segment a text into words, including Chinese
personal names. In his approach, a name prob-
ability model is proposed to estimate the prob-
abilities of Chinese personal names. Sproat
(1994) proposed to recognize Chinese personal
names with the stochastic finite-state word seg-
mentation algorithm. His approach is similar to
Chang’s, except that the name probability model
is slightly different. In addition to name prob-
ability, Chen (1998) also add extra scores to a
name candidate according to context clues (such
as position, title, speech-act verbs). In the re-
searches mentioned above, the reported F-
measure performances on recognizing Chinese
personal names are somewhere between 70%
and 86%. Since these performances are meas-
News
Name
Corpus
Name
Verification
Guessed
Names
Seed
Names
News
Annotation
Parameter
Estimation
Figure 5: EM-style bootstrapping.
Iteration
Nunber of
distinct names
Total frequency
of names
0 541 18310
1 3389 31157
2 5327 37423
3 6055 39977
4 6296 40345
Table 1: Number of distinct names in
the name corpus and total frequency
of names in the annotated news dur-
ing the bootstrapping iteration.
ured based on different data, higher reported
performance does not imply better. In fact, the
name probability models used in these re-
searches are very similar. Their performances
should be comparable to each other. Therefore,
in this paper, Chang’s approach, whose reported
F-measure is 86%, is chosen as the baseline
model.
The baseline model is additionally
equipped with a dictionary of 72,333 Chinese
words. The prior probabilities of words are esti-
mated from Academia Sinica Balanced Corpus,
which contains about 2 million Chinese words.
4. Experimental Results and Discussions
Both the baseline model and the proposed name
entity verification model (named NEV model)
are tested on the same testing corpus. The testing
corpus, also collected from economy and indus-
try news, consists of about 737,000 characters.
This corpus is annotated manually and contains
totally 2,545 Chinese personal names.
The F-measure of the baseline model is
86.5% (as indicated by the dashed line in Figure
6). The precision and recall rates of the baseline
model are 79.1% and 95.5% respectively. Alt-
hough the recall rate of the baseline model is
high, the precision rate is pretty low. Over 20%
of the name candidates proposed by the baseline
model are incorrect.
In our experiments, the sizes of the left-
and right-context windows of the NEV model
are set to 2. In Figure 6, the solid line with trian-
gle markers depicts the F-measure of the NEV
model versus the iteration number of bootstrap-
ping. The F-measure saturates after 3 iterations.
After 4 iterations, the F-measure of the NEV
model reaches 94.4%. The corresponding preci-
sion and recall rates are 96.4% and 92.5% re-
spectively. Compared with the baseline model,
the precision rate is greatly improved from
79.1% to 96.4% with a little sacrifice in recall
rate. The F-measure is improved from 86.5% to
94.4%, which corresponds to 58.5% error re-
duction rate, where “error rate” is defined as
“100% F-measure− ”.
Table 2 lists three examples of the mis-
recognized names made by the baseline model.
These examples clearly show that the baseline
model tends to incorrectly group consecutive
single characters, either from unknown words or
single-character words, into names. In the first
two examples, the single characters come from
the unknown location name “G44cG16e5G670Gaaf(ga luo
lai na; Carolina) and the unknown company
name “G16e5G5de(luo ji; Logitech)”. The single char-
acters in the last example are single-character
words “G6d1(gi; quarter)”, “G4c4(quan; all)” and “G90d
(mei; USA)”.
Without the inadequate strong tendency of
grouping single characters, the NEV model is
able to avoid the misrecognition errors made by
the baseline model. The NEV model assesses the
confidence measure of each name candidate
according to the context around the candidate. In
Table 2, the name candidates in the shaded
boxes are rejected by the NEV model because
0 1 2 3 4
0.85
0.90
0.95
1.00
Iteration
F−measure
NEV
Baseline
Figure 6: The performances of baseline
and name entity verification (NEV).
G3b5
da
G1277
chang
G16e5
luo
G5de
ji
G4e7
zai
G450
qu
G503
nian
...
(In last year, the big company Logitech ...)
Gb4f
dong
Gc83
di
G38b
yi
G6d1
ji
G4c4
quan
G90d
mei
Gd2e
lao
G7a3
zhe
...
(In the first quarter, the workers in all USA ...)
G447
bei
G44c
ga
G16e5
luo
G670
lai
Gaaf
na
G500
zhou
G8c1
wei
G66f
li
G430
yi
(take North Carolina State as an example)
Table 2: Examples of the incorrect Chi-
nese personal names (in the shaded boxes)
produced by the baseline model.
their confidence measures are too low.
To sum up, the experimental results de-
monstrate that the contextual information, either
from positive examples or from negative exam-
ples, is very helpful for named entity verification.
Besides, the superiority of the NEV model also
shows that the proposed probabilistic score
functions
NE
()S ⋅ and
anti-NE
()S ⋅ are effective in
providing the scores to produce a reliable confi-
dence measure. Especially, the proposed named
entity verification approach does not require any
dictionary in advance.
Conclusion
Named entity (NE) recognition is an important
task for many natural language applications,
such as Internet search engines, document in-
dexing, information extraction and machine
translation. Moreover, in oriental languages
(such as Chinese, Japanese and Korean), NE
recognition is even more important because it
significantly affects the performance of word
segmentation, the most fundamental task for
understanding the texts in oriental languages.
In this paper, a probabilistic verification
model is proposed to verify the correctness of a
named entity. This model assesses the confi-
dence level of a name candidate not only ac-
cording to the candidate’s structure but also
according to its contexts. The clues for confi-
dence measurement are collected from both
positive and negative examples in the training
data. Therefore, the confidence measure has
strong discriminant power for judging the cor-
rectness of a named entity. In the experiments of
Chinese personal name recognition, the pro-
posed verification model greatly increases the
precision rate from 79.1% to 96.4% with a little
sacrifice in recall rate. The F-measure is im-
proved from 86.5% to 94.4%, which corre-
sponds to 58.5% error reduction rate, where
“error rate” is defined as “100% F-measure− ”.
Acknowledgements
This paper is a partial result of Project
A311XS1211 conducted by ITRI under sponsor-
ship of the Ministry of Economic Affairs, R.O.C.
Especially thanks to the CKIP group of Acade-
mia Sinica for providing the Academia Sinica
Balanced Corpus.

References
Bikel, D., Miller S., Schwartz R., and Weischedel R.
(1997) Nymble: A High-performance Learning
Name Finder. In Proceedings of the Fifth Confer-
ence on Applied Natural Language Processing, pp.
194–201.
Borthwick, A. (1999) A Maximum Entropy Approach
to Named Entity Recognition.Ph.D.Thesis,New
York University.
ChangJ.,ChenS.,KerS.,ChenY.andLiuJ.(1994)
A Multiple-Corpus Approach to Recognition of
Proper Names in Chinese Texts. Computer Proc-
essing of Chinese and Oriental Languages, Vol. 8,
No. 1, pp. 75-85.
Chen, H., Ding Y., Tsai S. and Bian G. (1998) De-
scription of the NTU System used for MET2.in
Proceedings of the 7th Message Understanding
Conference (MUC-7)
Collins, M. and Singer, Y. (1999) Unsupervised
Models for Named Entity Classification.InPro-
ceedings of the Joint SIGDAT Conference on Em-
pirical Methods in Natural Language Processing
and Very Large Corpora, pp. 100-110.
Cucerzan S. and Yarowsky D. (1999) Language
independent named entity recognition combining
morphological and contextual evidence. In Pro-
ceedings of the Joint SIGDAT Conference on Em-
pirical Methods in Natural Language Processing
and Very Large Corpora, pp. 90-99.
Fujisaki, T., Jelinek F., Cocke J., Black E. and Nishi-
no T. (1989), A Probabilistic Parsing Method for
Sentence Disambiguation. In Proceedings of the
International Workshop on Parsing Technologies,
pp. 85-94.
Grishman, R. (1995) The NYU system for MUC-6 or
where's the syntax? In Proceedings of the 6th Mes-
sage Understanding Conference (MUC-6), pp. 167-
175.
Moon, T. K. (1996) The Expectation-Maximization
Algorithm, IEEE Signal Processing Magazine, No-
vember, 1996, pp. 47-60.
Sproat R. and Chang N. (1994) A Stochastic Finite-
State Word-Segmentation Algorithm for Chinese.
In Proceeding of 32nd Annual Meeting of the As-
sociation for Computational Linguistics, pp. 66-73.
Yu,S.,BaiS.andWuP.(1998)Description of the
Kent Ridge Digital Labs System Used for MUC-7.
In Proceedings of the 7th Message Understanding
Conference (MUC-7)
