Multi-Criteria-based Active Learning for Named Entity Recognition 
Dan Shen†‡1 Jie Zhang†‡ Jian Su† Guodong Zhou† Chew-Lim Tan‡ 
† Institute for Infocomm Technology 
21 Heng Mui Keng Terrace 
Singapore 119613 
‡ Department of Computer Science 
National University of Singapore 
3 Science Drive 2, Singapore 117543 
{shendan,zhangjie,sujian,zhougd}@i2r.a-star.edu.sg 
{shendan,zhangjie,tancl}@comp.nus.edu.sg 
                                                                 
1 Current address of the first author: Universität des Saarlandes, Computational Linguistics Dept., 66041 Saarbrücken, Germany 
dshen@coli.uni-sb.de 
 
 
 
 
Abstract 
In this paper, we propose a multi-criteria -
based active learning approach and effec-
tively apply it to named entity recognition. 
Active learning targets to minimize the 
human annotation efforts by selecting ex-
amples for labeling.  To maximize the con-
tribution of the selected examples, we 
consider the multiple criteria: informative-
ness, representativeness and diversity  and 
propose measures to quantify them.  More 
comprehensively, we incorporate all the 
criteria using two selection strategies, both 
of which result in less labeling cost than 
single -criterion-based method.  The results 
of the named entity recognition in both 
MUC-6 and GENIA show that the labeling 
cost can be reduced by at least 80% with-
out degrading the performance. 
1 Introduction 
In the machine learning approaches of natural la n-
guage processing (NLP), models are generally 
trained on large annotated corpus.  However, anno-
tating such corpus is expensive and time-
consuming, which makes it difficult to adapt an 
existing model to a new domain.  In order to over-
come this difficulty, active learning (sample sele c-
tion) has been studied in more and more NLP 
applications such as POS tagging (Engelson and 
Dagan 1999), information extraction (Thompson et 
al. 1999), text classif ication (Lewis and Catlett 
1994; McCallum and Nigam 1998; Schohn and 
Cohn 2000; Tong and Koller 2000; Brinker 2003), 
statistical parsing (Thompson et al. 1999; Tang et 
al. 2002; Steedman et al. 2003), noun phrase 
chunking (Ngai and Yarowsky 2000), etc. 
Active learning is based on the assumption that 
a small number of annotated examples and a large 
number of unannotated examples are available.  
This assumption is valid in most NLP tasks.  Dif-
ferent from supervised learning in which the entire 
corpus are labeled manually, active learning is to 
select the most useful example for labe ling and add 
the labeled example  to training set to retrain model.  
This procedure is repeated until the model achieves 
a certain level of performance.  Practically, a batch 
of examples are selected at a time, called batched-
based sample sele ction (Lewis and Catlett 1994) 
since it is time consuming to retrain the model if 
only one new example is added to the training set.  
Many existing work in the area focus on two ap-
proaches: certainty-based methods (Thompson et 
al. 1999; Tang et al. 2002; Schohn and Cohn 2000; 
Tong and Koller 2000; Brinker 2003) and commit-
tee-based methods (McCallum and Nigam 1998; 
Engelson and Dagan 1999; Ngai and Yarowsky 
2000) to select the most informative examples for 
which the current model are most uncertain. 
Being the first piece of work on active learning 
for name entity recognition (NER) task, we target 
to minimize the human annotation efforts yet still 
reaching the same level of performance as a super-
vised learning approach.  For this purpose, we 
make a more comprehensive consideration on the 
contribution of individual examples, and more im-
portantly maximizing the contrib ution of a batch 
based on three criteria : informativeness, represen-
tativeness and diversity. 
First, we propose three scoring functions to 
quantify the informativeness of an example , which 
can be used to select the most uncertain examples.  
Second, the representativeness measure is further 
proposed to choose the example s representing the 
majority.  Third, we propose two diversity consid-
erations (global and local) to avoid repetition 
among the examples of a batch.  Finally, two com-
bination strategies with the above three criteria are 
proposed to reach the maximum effectiveness on 
active learning for NER. 
We build our NER model using Support Vec-
tor Machines (SVM).  The experiment shows that 
our active learning methods achieve a promising 
result in this NER task.  The results in both MUC-
6 and GENIA show that the amount of the labeled 
training data can be reduced by at least 80% with-
out degrading the quality of the named entity rec-
ognizer.  The contributions not only come from the 
above measures, but also the two sample selection 
strategies which effectively incorporate informa-
tiveness, representativeness and diversity criteria.  
To our knowledge, it is the first work on consider-
ing the three criteria all together for active learning.  
Furthermore, such measures and strategies can be 
easily adapted to other active learning tasks as well.  
 
2 Multi-criteria for NER Active Learning 
Support Vector Machines (SVM) is a powerful 
machine learning method, which has been applied 
successfully in NER tasks, such as (Kazama et al. 
2002; Lee et al. 2003).  In this paper, we apply ac-
tive learning methods to a simple  and effective 
SVM model to recognize one class of names at a 
time, such as protein names, person names, etc.  In 
NER, SVM is to cla ssify a word into positive class 
“1” indicating that the word is a part of an entity, 
or negative class “-1” in dicating that the word is 
not a part of an entity.  Each word in SVM is rep-
resented as a high-dimensional feature vector in-
cluding surface word information, orthographic 
features, POS feature and semantic trigger features 
(Shen et al. 2003).  The semantic trigger features 
consist of some special head nouns for an entity 
class which is supplied by users.  Furthermore, a 
window (size = 7), which represents the local con-
text of the target word w, is also used to classify w.   
However, for active learning in NER, it is not 
reasonable to select a single word without context 
for human to label.  Even if we require human to 
label a single word, he has to make an addition 
effort to refer to the context of the word.  In our 
active learning process, we select a word sequence 
which consists of a machine-annotated named en-
tity and its context rather than a single word.  
Therefore, all of the measures we propose for ac-
tive learning should be applied to the machine-
annotated named entities and we have to further 
study how to extend the measures for words to 
named entities.  Thus, the active learning in SVM-
based NER will be more complex than that in sim-
ple classification tasks, such as text classif ication 
on which most SVM active learning works are 
conducted (Schohn and Cohn 2000; Tong and 
Koller 2000; Brinker 2003).  In the next part, we 
will introduce informativeness, representativeness 
and diversity measures for the SVM-based NER. 
2.1 Informativeness 
The basic idea of informativeness criterion is simi-
lar to certainty-based sample selection methods, 
which have been used in many previous works.  In 
our task, we use a distance-based measure to 
evaluate the informativeness of a word and extend 
it to the measure of an entity using three scoring 
functions.  We prefer the examples with high in-
formative degree for which the current model are 
most uncertain. 
2.1.1 Informativeness Measure for Word 
In the simplest linear form, training SVM is to find 
a hyperplane that can separate the posit ive and 
negative examples in training set with maximum 
margin.  The margin is defined by the distance of 
the hyperplane to the nearest of the positive and 
negative examples.  The training examples which 
are closest to the hyperplane are called support 
vectors.  In SVM, only the support vectors are use-
ful for the classific ation, which is different from 
statistical models.  SVM training is to get these 
support vectors and their weights from training set 
by solving quadratic programming problem.  The 
support vectors can later be used to classify the test 
data. 
Intuitively, we consider the informativeness of 
an example  as how it can make effect on the sup-
port vectors by adding it to training set.  An exam-
ple may be informative for the learner if the 
distance of its feature vector to the hyperplane is 
less than that of the support vectors to the hyper-
plane (equal to 1).  This intuition is also justified 
by (Schohn and Cohn 2000; Tong and Koller 2000) 
based on a version space analysis.  They state that 
labeling an example that lies on or close to the hy-
perplane is guaranteed to have an effect on the so-
lution.  In our task, we use the distance to measure 
the informativeness of an example. 
The distance of a word’s feature vector to the 
hyperplane is computed as follows: 
1
()(,)
N
iii
i
Distykba
=
=+∑wsw  
where w is the feature vector of the word, ai, yi, si 
corresponds to the weight, the class and the feature 
vector of the ith support vector respectively.  N is 
the number of the support vectors in current model. 
We select the example with minimal Dist, 
which indicates that it comes closest to the hyper-
plane in feature space.  This example is considered 
most informative for current model. 
2.1.2 Informativeness Measure for Named 
Entity 
Based on the above informativeness measure for a 
word, we compute the overall informativeness de-
gree of a named entity NE.  In this paper, we pro-
pose three scoring functions as follows. Let NE = 
w1…wN in which wi is the feature vector of the ith 
word of NE. 
• Info_Avg: The informativeness of NE is 
scored by the average distance of the words in 
NE to the hyperplane.  
 ()1()
i
i
NE
InfoNEDist
∈
=−∑
w
w  
 where, wi is the feature vector of the ith word in 
NE. 
• Info_Min: The informativeness of NE is 
scored by the minimal distance of the words in 
NE. 
 ()1{()}
i
iNEInfoNEMinDist∈=−w w  
• Info_S/N: If the distance of a word to the hy-
perplane is less than a threshold a (= 1 in our 
task), the word is considered with short dis-
tance.  Then, we compute the proportion of the 
number of words with short distance to the to-
tal number of words in the named entity and 
use this proportion to quantify the informa-
tiveness of the named entity.  
 (())() i iNENUMDistInfoNE
N
a
∈
<
= w
w  
In Section 4.3, we will evaluate the effective-
ness of these scoring functions. 
2.2 Representativeness 
In addition to the most informative example , we 
also prefer the most representative example .  The 
representativeness of an example can be evaluated 
based on how many examples there are similar or 
near to it.  So, the examples with high representa-
tive degree are less likely to be an outlier.  Adding 
them to the training set will have effect on a large 
number of unlabeled examples.  There are only a 
few works considering this selection criterion 
(McCallum and Nigam 1998; Tang et al. 2002) and 
both of them are specific to their tasks, viz. text 
classification and statistical parsing.  In this section, 
we compute the simila rity between words using a 
general vector-based measure, extend this measure 
to named entity level using dynamic time warping 
algorithm and quantify the representativeness of a 
named entity by its density. 
2.2.1 Similarity Measure  between Words 
In general vector space model, the similarity be-
tween two vectors may be measured by computing 
the cosine value of the angle  between them.  The 
smaller the angle is, the more similar between the 
vectors are.  This measure, called cosine-similarity 
measure, has been widely used in information re-
trieval tasks (Baeza-Yates and Ribeiro-Neto 1999).    
In our task, we also use it to quantify the similarity 
between two words.  Particularly, the calculation in 
SVM need be projected to a higher dimensional 
space by using a certain kernel function (,)ijK ww .  
Therefore, we adapt the cosine-similarity measure 
to SVM as follows: 
(,)(,)
(,)(,)
ij
ij
iijj
kSim
kk=
wwww
wwww
 
where, wi and wj are the feature vectors of the 
words i and j.  This calculation is also supported by 
(Brinker 2003)’s work.  Furthermore, if we use the 
linear kernel (,)ijijk =⋅wwww, the measure is 
the same as the traditional cosine similarity meas-
ure cos ij
ij
q ⋅= ⋅wwww and may be regarded as a 
general vector-based similarity measure. 
2.2.2 Similarity Meas ure  between Named En-
tities 
In this part, we compute the similarity between two 
machine-annotated named entities given the simi-
larities between words.  Regarding an entity as a 
word sequence, this work is analogous to the 
alignment of two sequences.  We employ the dy-
namic time warping (DTW) algorithm (Rabiner et 
al. 1978) to find an optimal alig nment between the 
words in the sequences which maximize the accu-
mulated similarity degree between the sequences.  
Here, we adapt it to our task.  A sketch of the 
modified algorithm is as follows. 
Let NE1 = w11w12…w1n…w1N, (n = 1,…, N) and 
NE2 = w21w22…w2m…w2M, (m = 1,…, M) denote two 
word sequences to be matched.  NE1 and NE2 con-
sist of M and N words respectively.  NE1(n) = w1n 
and NE2(m) = w2m.  A similarity value Sim(w1n ,w2m) 
has been known for every pair of words (w1n,w2m) 
within NE1 and NE2.  The goal of DTW is to find a 
path, m = map(n), which map n onto the corre-
sponding m such that the accumulated similarity 
Sim* along the path is maximized. 
12{()}
1
*{((),(())}
N
mapn n
SimMaxSimNEnNEmapn
=
= ∑  
A dynamic programming method is used to deter-
mine the optimum path map(n).  The accumulated 
similarity SimA to any grid point (n, m) can be re-
cursively calculated as 
12(,)(,)(1,)AnmAqmSimnmSimwwMaxSimnq≤=+−
Finally, *(,)ASimSimNM=  
Certainly, the overall similarity measure Sim* 
has to be normalized as longer sequences normally 
give higher similarity value.  So, the similarity be-
tween two sequences NE1 and NE2 is calc ulated as 
12
*(,)
(,)
SimSimNENE
MaxNM=
 
2.2.3 Representativeness Measure for Named 
Entity 
Given a set of machine-annotated named entities 
NESet = {NE1, … , NEN}, the representativeness of 
a named entity NEi in NESet is quantified by its 
density.  The density of NEi is defined as the aver-
age similarity between NEi and all the other enti-
ties NEj in NESet as follows. 
(,)
() 1
ij
ji
i
SimNENE
DensityNE N≠= −
∑
 
If NEi has the largest density among all the entities 
in NESet, it can be regarded as the centroid of NE-
Set and also the most representative examples in 
NESet. 
2.3 Diversity 
Diversity criterion is to maximize the training util-
ity of a batch.  We prefer the batch in which the 
examples have high variance to each other.  For 
example , given the batch size 5, we try not to se-
lect five repetitious examples at a time.  To our 
knowledge, there is only one work (Brinker 2003) 
exploring this criterion.  In our task, we propose 
two methods: local and global, to make the exam-
ples diverse enough in a batch.   
2.3.1 Global Consideration 
For a global consideration, we cluster all named 
entities in NESet based on the similarity measure 
proposed in Section 2.2.2.  The named entities in 
the same cluster may be considered similar to each 
other, so we will select the named entities from 
different clusters at one time.  We employ a K-
means clustering algorithm (Jelinek 1997), which 
is shown in Figure 1. 
Given: 
NESet = {NE1, … , NEN} 
Suppose: 
The number of clusters is K 
Initialization: 
Randomly equally partition {NE1, … , NEN} into K 
initial clusters Cj (j = 1, … , K). 
Loop until the number of changes for the centroids of 
all clusters is less than a threshold 
• Find the centroid of each cluster Cj (j = 1, … , K). 
 arg((,))
j ij
ji
NEC NEC
NECentmaxSimNENE
∈ ∈
= ∑  
• Repartition {NE1, … , NEN} into K clusters.  NEi 
will be assigned to Cluster Cj if 
 
(,)(,),ijiwSimNENECentSimNENECentwj≥≠ 
Figure 1: Global Consideration for Diversity: K-
Means Clustering algorithm 
In each round, we need to compute the pair-
wise similarities within each cluster to get the cen-
troid of the cluster.  And then, we need to compute 
the similarities between each example and all cen-
troids to repartition the example s.  So, the algo-
rithm is time-consuming.  Based on the assumption 
that N examples are uniformly distributed between 
the K clusters, the time complexity of the alg o-
rithm is about O(N2/K+NK) (Tang et al. 2002).  In 
one of our experiments, the size of the NESet (N) is 
around 17000 and K is equal to 50, so the time 
complexity is about O(106).  For efficiency, we 
may filter the entities in NESet before clustering 
them, which will be further discussed in Section 3.  
2.3.2 Local Consideration 
When selecting a machine-annotated named entity, 
we compare it with all previously selected named 
entities in the current batch.  If the similarity be-
tween them is above a threshold ß, this example  
cannot be allowed to add into the batch.  The order 
of selecting examples is based on some measure, 
such as informativeness measure, representative-
ness measure or their combination.  This local se-
lection method is shown in Figure 2.  In this way, 
we avoid selecting too similar examples (simila rity 
value ≥  ß) in a batch.  The threshold ß may be the 
average similarity between the examples in NESet. 
 
Given: 
NESet = {NE1, … , NEN} 
BatchSet with the maximal size K. 
Initialization:  
BatchSet = empty 
Loop until BatchSet is full 
• Select NEi based on some measure from NESet. 
• RepeatFlag = false; 
• Loop from j = 1 to CurrentSize(BatchSet)  
 If (,)ijSimNENE b≥ Then 
 RepeatFlag = true; 
 Stop the Loop; 
• If RepeatFlag == false Then 
add NEi into BatchSet 
• remove NEi from NESet 
Figure 2: Local Consideration for Diversity 
 
This consideration only requires O(NK+K2) 
computational time.  In one of our experiments (N 
˜ 17000 and K = 50), the time complexity is about 
O(105).  It is more efficient than clustering alg o-
rithm described in Section 2.3.1.  
 
3 Sample Selection strategies 
In this section, we will study how to combine and 
strike a proper balance between these criteria, viz. 
informativeness, representativeness and diversity, 
to reach the maximum effectiveness on NER active 
learning.  We build two strategies to combine the 
measures proposed above.  These strategies are 
based on the varying priorities of the criteria and 
the varying degrees to satisfy the criteria . 
• Strategy 1: We first consider the informative-
ness criterion.  We choose m examples with the 
most informativeness score from NESet to an in-
termediate set called INTERSet.  By this pre-
selecting, we make the selection process faster in 
the later steps since the size of INTERSet is much 
smaller than that of NESet.  Then we cluster the 
examples in INTERSet and choose the centroid of 
each cluster into a batch called BatchSet.  The cen-
troid of a cluster is the most representative exam-
ple in that cluster since it has the largest density.  
Furthermore, the examples in different clusters 
may be considered diverse to each other.  By this 
means, we consider representativeness and diver-
sity criteria at the same time.  This strategy is 
shown in Figure 3.  One limitation of this strategy 
is that clustering result may not reflect the distrib u-
tion of whole sample space since we only cluster 
on INTERSet for efficiency.  The other is that since 
the representativeness of an example is only evalu-
ated on a cluster.  If the cluster size is too small, 
the most representative example in this cluster may 
not be representative in the whole sample space. 
 
Given: 
NESet = {NE1, … , NEN} 
BatchSet with the maximal size K. 
INTERSet with the maximal size M 
Steps :  
• BatchSet = ∅  
• INTERSet = ∅  
• Select M entities with most Info score from NESet 
to INTERSet. 
• Cluster the entities in INTERSet into K clusters 
• Add the centroid entity of each cluster to BatchSet 
Figure 3: Sample Selection Strategy 1 
 
• Strategy 2: (Figure 4) We combine the infor-
mativeness and representativeness criteria  using 
the functio ()(1)()iiInfoNEDensityNEll+− , in 
which the Info and Density  value of NEi are nor-
malized first.  The individual importance of each 
criterion in this function is adjusted by the trade-
off parameter l ( 01l≤≤) (set to 0.6 in our 
experiment).  First, we select a candidate example 
NEi with the maximum value of this function from 
NESet.  Second, we consider diversity criterion 
using the local method in Section 3.3.2.  We add 
the candidate example  NEi to a batch only if NEi is 
different enough from any previously selected ex-
ample  in the batch.  The threshold ß is set to the 
average pair-wise similarity of the entities in NE-
Set. 
 
Given: 
NESet = {NE1, … , NEN} 
BatchSet with the maximal size K. 
Initialization:  
BatchSet = ∅  
Loop until BatchSet is full 
• Select NEi which have the maximum value for the 
combination function between Info score and Den-
sity socre from NESet. 
arg(()(1)())
i
iiiNENESetNEMaxInfoNEDensityNEll∈=+−  
• RepeatFlag = false; 
• Loop from j = 1 to CurrentSize(BatchSet)  
 If (,)ijSimNENE b≥ Then 
 RepeatFlag = true; 
 Stop the Loop; 
• If RepeatFlag == false Then 
add NEi into BatchSet 
• remove NEi from NESet 
Figure 4: Sample Selection Strategy 2 
 
4 Experimental Results and Analysis 
4.1 Experiment Settings  
In order to evaluate the effectiveness of our sele c-
tion strategies, we apply them to recognize protein 
(PRT) names in biomedical domain using GENIA 
corpus V1.1 (Ohta et al. 2002) and person (PER), 
location (LOC), organization (ORG) names in 
newswire domain using MUC-6 corpus.  First, we 
randomly split the whole corpus into three parts: an 
initial training set to build an in itial model, a test 
set to evaluate the performance of the model and 
an unlabeled set to select examples.  The size of 
each data set is shown in Table 1.  Then, iteratively, 
we select a batch of examples following the sele c-
tion strategie s proposed, require human experts to 
label them and add them into the training set.  The 
batch size K = 50 in GENIA and 10 in MUC-6.  
Each example is defined as a machine-recognized 
named entity and its context words (previous 3 
words and next 3 words). 
Domain Class Corpus Initial Training Set Test Set Unlabeled Set 
Biomedical PRT GENIA1.1 10 sent. (277 words) 900 sent. (26K words) 8004 sent. (223K words) 
PER 5 sent. (131 words) 7809 sent. (157K words) 
LOC 5 sent. (130 words) 7809 sent. (157K words) 
 
Newswire 
ORG 
 
MUC-6 
 5 sent. (113 words) 
 
602 sent. (14K words) 
 7809 sent. (157K words) 
Table 1: Experiment settings for active learning using GENIA1.1(PRT) and MUC-6(PER,LOC,ORG) 
The goal of our work is to minimize the human 
annotation effort to learn a named entity recognizer 
with the same performance level as supervised 
learning.  The performance of our model is evalu-
ated using “precision/recall/F-measure”. 
4.2 Overall Result in GENIA and MUC-6 
In this section, we evaluate our selection strategies 
by comparing them with a random sele ction 
method, in which a batch of examples is randomly 
selected iteratively, on GENIA and MUC-6 corpus.  
Table 2 shows the amount of training data needed 
to achieve the performance of supervised learning 
using various selection methods, viz. Random, 
Strategy1 and Strategy2.  In GENIA, we find: 
• The model achieves 63.3 F-measure using 223K  
words in the supervised learning. 
• The best performer is Strategy2 (31K words), 
requiring less than 40% of the training data that 
Random (83K words) does and 14% of the train-
ing data that the supervised learning does. 
• Strategy1 (40K words) performs slightly worse 
than Strategy2, requiring 9K more words.  It is 
probably because Strategy1 cannot avoid select-
ing outliers if a cluster is too small. 
• Random (83K words) requires about 37% of the 
training data that the supervised learning does.  It 
indicates that only the words in and around a 
named entity are useful for classific ation and the 
words far from the named entity may not be 
helpful. 
 
Class Supervised Random Strategy1 Strategy2 
PRT 223K (F=63.3) 83K 40K 31K 
PER 157K (F=90.4) 11.5K 4.2K 3.5K 
LOC 157K (F=73.5) 13.6K 3.5K 2.1K 
ORG 157K (F=86.0) 20.2K 9.5K 7.8K 
Table 2: Overall Result in GENIA and MUC-6 
Furthermore, when we apply our model to news-
wire domain (MUC-6) to recognize person, loca-
tion and organization names, Strategy1 and 
Strategy2 show a more promising result by com-
paring with the supervised learning and Random, 
as shown in Table 2.  On average, about 95% of 
the data can be reduced to achieve the same per-
formance with the supervised learning in MUC-6.  
It is probably because NER in the newswire do-
main is much simpler than that in the biomedical 
domain (Shen et al. 2003) and named entities are 
less and distributed much sparser in the newswire 
texts than in the biomedical texts. 
 
4.3 Effectiveness of Informativeness-based 
Selection Method 
In this section, we investigate the effectiveness of 
informativeness criterion in NER task.  Figure 5 
shows a plot of training data size versus F-measure 
achieved by the informativeness-based measures in 
Section 3.1.2: Info_Avg, Info_Min  and Info_S/N as 
well as Random.  We make the comparisons in 
GENIA corpus.  In Figure 5, the horizontal line is 
the performance level (63.3 F-measure) achieved 
by supervised learning (223K words).  We find 
that the three informativeness-based measures per-
form similarly and each of them outperforms Ra n-
dom.  Table 3 highlights the various data sizes to 
achieve the peak performance using these selection 
methods.  We find that Random (83K words) on 
average requires over 1.5 times as much as data to 
achieve the same performance as the informative-
ness-based sele ction methods (52K words). 
 
0.5
0.55
0.6
0.65
0 20 40 60 80K words
F
Supervised
Random
Info_Min
Info_S/N
Info_Avg
 Figure 5: Active learning curves: effectiveness of the three in-
formativeness-criterion-based selections comparing with the 
Random selection. 
Supervised Random Info_Avg Info_Min Info_ S/N 
223K 83K 52.0K 51.9K 52.3K 
Table 3: Training data sizes for various selection methods to 
achieve the same performance level as the supervised learning 
 
4.4 Effectiveness of Two Sample Selection 
Strategies 
In addition to the informativeness criterion, we 
further incorporate representativeness and diversity 
criteria into active learning using two strategies 
described in Section 3.  Comparing the two strate-
gies with the best result of the single -criterion-
based selection methods Info_Min , we are to jus-
tify that representativeness and diversity are also 
important factors for active learning.  Figure 6 
shows the learning curves for the various methods: 
Strategy1, Strategy2 and Info_Min.  In the begin-
ning iterations (F-measure < 60), the three methods 
performed similarly.  But with the larger training 
set, the efficiencies of Stratety1 and Strategy2 be-
gin to be evident.  Table 4 highlights the final re-
sult of the three methods.  In order to reach the 
performance of supervised learning, Strategy1 
(40K words) and Strategyy2 (31K words) require 
about 80% and 60% of the data that Info_Min 
(51.9K) does.  So we believe the effective combi-
nations of informativeness, representativeness and 
diversity will help to learn the NER model more 
quickly and cost less in annotation. 
0.5
0.55
0.6
0.65
0 20 40 60 K words
F
Supervised
Info_Min
Strategy1
Strategy2
 Figure 6: Active learning curves: effectiveness of the two 
multi-criteria-based selection strategies comparing with the 
informativeness-criterion-based selection (Info_Min). 
Info_Min Strategy1 Strategy2 
51.9K 40K 31K 
Table 4: Comparisons of training data sizes for the multi-
criteria-based selection strategies and the informativeness-
criterion-based selection (Info_Min) to achieve the same per-
formance level as the supervised learning. 
 
5 Related Work 
Since there is no study on active learning for NER 
task previously, we only introduce general active 
learning methods here.  Many existing active learn-
ing methods are to select the most uncertain exam-
ples using various measures (Thompson et al. 1999; 
Schohn and Cohn 2000; Tong and Koller 2000; 
Engelson and Dagan 1999; Ngai and Yarowsky 
2000).  Our informativeness-based measure is 
similar to these works.  However these works just 
follow a single criterion.  (McCallum and Nigam 
1998; Tang et al. 2002) are the only two works 
considering the representativeness criterion in ac-
tive learning.  (Tang et al. 2002) use the density 
information to weight the selected examples while 
we use it to select examples.  Moreover, the repre-
sentativeness measure we use is relatively general 
and easy to adapt to other tasks, in which the ex-
ample selected is a sequence of words, such as text 
chunking, POS ta gging, etc.  On the other hand, 
(Brinker 2003) first incorporate diversity in active 
learning for text classification.  Their work is simi-
lar to our local consideration in Section 2.3.2.  
However, he didn’t further explore how to avoid 
selecting outliers to a batch.  So far, we haven’t 
found any previous work integrating the informa-
tiveness, representativeness and diversity all to-
gether. 
 
6 Conclusion and Future Work 
In this paper, we study the active learning in a 
more complex NLP task, named entity recognition.  
We propose a multi-criteria -based approach to se-
lect examples based on their informativeness, rep-
resentativeness and diversity, which are 
incorporated all together by two strategies (local 
and global).  Experiments show that, in both MUC-
6 and GENIA, both of the two strategies combin-
ing the three criteria outperform the single criterion 
(informativeness).  The labeling cost can be sig-
nificantly reduced by at least 80% comparing with 
the supervised learning.  To our best knowledge, 
this is not only the first work to report the empir i-
cal results of active learning for NER, but also the 
first work to incorporate the three criteria all to-
gether for selecting examples. 
Although the current experiment results are 
very promising, some parameters in our experi-
ment, such as the batch size K and the l in the 
function of strategy 2, are decided by our experi-
ence in the domain.  In practical application, the 
optimal value of these parameters should be de-
cided automatically based on the training process.  
Furthermore, we will study how to overcome the 
limitation of the strategy 1 discussed in Section 3 
by using more effective clustering algorithm.  An-
other interesting work is to study when to stop ac-
tive learning.  
 
References 
R. Baeza-Yates and B. Ribeiro-Neto. 1999. Mod-
ern Information Retrieval. ISBN 0-201-39829-X. 
K. Brinker. 2003. Incorporating Diversity in Ac-
tive Learning with Support Vector Machines. In 
Proceedings of ICML, 2003. 
S. A. Engelson and I. Dagan. 1999. Committee-
Based Sample Selection for Probabilistic Classi-
fiers. Journal of Artifical Intelligence Research. 
F. Jelinek. 1997. Statistical Methods for Speech 
Recognition. MIT Press. 
J. Kazama, T. Makino, Y. Ohta and J. Tsujii. 2002. 
Tuning Support Vector Machines for Biomedi-
cal Named Entity Recognition. In Proceedings 
of the ACL2002 Workshop on NLP in Biomedi-
cine. 
K. J. Lee, Y. S. Hwang and H. C. Rim. 2003. Two-
Phase Biomedical NE Recognition based on 
SVMs.  In Proceedings of the ACL2003 Work-
shop on NLP in Biomedicine. 
D. D. Lewis and J. Catlett. 1994. Heterogeneous 
Uncertainty Sampling for Supervised Learning. 
In Proceedings of ICML, 1994. 
A. McCallum and K. Nigam. 1998. Employing EM 
in Pool-Based Active Learning for Text Classi-
fication. In Proceedings of ICML, 1998. 
G. Ngai and D. Yarowsky. 2000. Rule Writing or 
Annotation: Cost-efficient Resource Usage for 
Base Noun Phrase Chunking. In Proceedings of 
ACL, 2000. 
T. Ohta, Y. Tateisi, J. Kim, H. Mima and J. Tsujii. 
2002. The GENIA corpus: An annotated re-
search abstract corpus in molecular biology do-
main. In Proceedings of HLT 2002. 
L. R. Rabiner, A. E. Rosenberg and S. E. Levinson. 
1978. Considerations in Dynamic Time Warping 
Algorithms for Discrete Word Recognition.  In 
Proceedings of IEEE Transactions on acoustics, 
speech and signal processing. Vol. ASSP-26, 
NO.6. 
D. Schohn and D. Cohn. 2000. Less is More: Ac-
tive Learning with Support Vector Machines. In 
Proceedings of the 17th International Confer-
ence on Machine Learning. 
D. Shen, J. Zhang, G. D. Zhou, J. Su and C. L. Tan. 
2003. Effective Adaptation of a Hidden Markov 
Model-based Named Entity Recognizer for Bio-
medical Domain. In Proceedings of the 
ACL2003 Workshop on NLP in Biomedicine. 
M. Steedman, R. Hwa, S. Clark, M. Osborne, A. 
Sarkar, J. Hockenmaier, P. Ruhlen, S. Baker and 
J. Crim. 2003. Example Selection for Bootstrap-
ping Statistical Parsers. In Proceedings of HLT-
NAACL, 2003. 
M. Tang, X. Luo and S. Roukos. 2002. Active 
Learning for Statistical Natural Language Pars-
ing. In Proceedings of the ACL 2002. 
C. A. Thompson, M. E. Califf and R. J. Mooney. 
1999. Active Learning for Natural Language 
Parsing and Information Extraction. In Proceed-
ings of ICML 1999. 
S. Tong and D. Koller. 2000. Support Vector Ma-
chine Active Learning with Applications to Text 
Classification. Journal of Machine Learning Re-
search. 
V. Vapnik. 1998. Statistical learning theory. 
N.Y.:John Wiley. 
 
