Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 69–72,
New York, June 2006. c©2006 Association for Computational Linguistics
MMR-based Active Machine Learning  
for Bio Named Entity Recognition 
Seokhwan Kim
1
Yu Song
2
Kyungduk Kim
1
Jeong-Won Cha
3
Gary Geunbae Lee
1
1
 Dept. of Computer Science and Engineering, POSTECH, Pohang, Korea 
2
 AIA Information Technology Co., Ltd. Beijing, China 
3
 Dept. of Computer Science, Changwon National University, Changwon, Korea 
megaup@postech.ac.kr, Song-Y.Song@AIG.com, getta@postech.ac.kr
jcha@changwon.ac.kr, gblee@postech.ac.kr
Abstract 
This paper presents a new active learning 
paradigm which considers not only the 
uncertainty of the classifier but also the 
diversity of the corpus. The two measures 
for uncertainty and diversity were com-
bined using the MMR (Maximal Marginal 
Relevance) method to give the sampling 
scores in our active learning strategy. We 
incorporated MMR-based active machine-
learning idea into the biomedical named-
entity recognition system. Our experimen-
tal results indicated that our strategies for 
active-learning based sample selection 
could significantly reduce the human ef-
fort. 
1 Introduction 
Named-entity recognition is one of the most ele-
mentary and core problems in biomedical text min-
ing. To achieve good recognition performance, we 
use a supervised machine-learning based approach 
which is a standard in the named-entity recognition 
task. The obstacle of supervised machine-learning 
methods is the lack of the annotated training data 
which is essential for achieving good performance. 
Building a training corpus manually is time con-
suming, labor intensive, and expensive. Creating 
training corpora for the biomedical domain is par-
ticularly expensive as it requires domain specific 
expert knowledge. 
One way to solve this problem is through active 
learning method to select the most informative 
samples for training. Active selection of the train-
ing examples can significantly reduce the neces-
sary number of labeled training examples without 
degrading the performance. 
Existing work for active learning explores two 
approaches: certainty or uncertainty-based methods 
(Lewis and Gale 1994; Scheffer and Wrobel 2001; 
Thompson et al. 1999) and committee-based 
methods (Cohn et al. 1994; Dagan and Engelson 
1995; Freund et al. 1997; Liere and Tadepalli 
1997). Uncertainty-based systems begin with an 
initial classifier and the systems assign some un-
certainty scores to the un-annotated examples. The 
k examples with the highest scores will be anno-
tated by human experts and the classifier will be 
retrained. In the committee-based systems, diverse 
committees of classifiers were generated. Each 
committee member will examine the un-annotated 
examples. The degree of disagreement among the 
committee members will be evaluated and the ex-
amples with the highest disagreement will be se-
lected for manual annotation. 
Our efforts are different from the previous ac-
tive learning approaches and are devoted to two 
aspects: we propose an entropy-based measure to 
quantify the uncertainty that the current classifier 
holds. The most uncertain samples are selected for 
human annotation. However, we also assume that 
the selected training samples should give the dif-
ferent aspects of learning features to the classifica-
tion system. So, we try to catch the most 
representative sentences in each sampling. The 
divergence measures of the two sentences are for 
the novelty of the features and their representative 
levels, and are described by the minimum similar-
ity among the examples. The two measures for un-
certainty and diversity will be combined using the 
MMR (Maximal Marginal Relevance) method 
(Carbonell and Goldstein 1998) to give the sam-
pling scores in our active learning strategy. 
69
We incorporate MMR-based active machine-
learning idea into the POSBIOTM/NER (Song et 
al. 2005) system which is a trainable biomedical 
named-entity recognition system using the Condi-
tional Random Fields (Lafferty et al. 2001) ma-
chine learning technique to automatically identify 
different sets of biological entities in the text. 
2 MMR-based Active Learning for Bio-
medical Named-entity Recognition 
2.1 Active Learning 
We integrate active learning methods into the 
POSBIOTM/NER (Song et al. 2005) system by the 
following procedure: Given an active learning 
scoring strategy S and a threshold value th, at each 
iteration t, the learner uses training corpus TMt   to 
train the NER module Mt. Each time a user wants 
to annotate a set of un-labeled sentences U, the 
system first tags the sentences using the current 
NER module Mt. At the same time, each tagged 
sentence is assigned with a score according to our 
scoring strategy S. Sentences will be marked if its 
score is larger than the threshold value th. The tag 
result is presented to the user, and those marked 
ones are rectified by the user and added to the 
training corpus. Once the training data accumulates 
to a certain amount, the NER module Mt will be 
retrained. 
2.2 Uncertainty-based Sample Selection 
We evaluate the uncertainty degree that the current 
NER module holds for a given sentence in terms of 
the entropy of the sentence. Given an input se-
quence o, the state sequence set S is a finite set. 
And  is the probability distribu-
tion over S. By using the equation for CRF 
(Lafferty et al. 2001) module, we can calculate the 
probability of any possible state sequence s given 
an input sequence o. Then the entropy of  
is defined to be: 
Sso|s ∈
∧
  ),(p
)( o|s
∧
p
∑ ∧∧
−=
s
o|so|s )]([log)(
2
PPH  
The number of possible state sequences grows 
exponentially as the sentence length increases. In 
order to measure the uncertainty by entropy, it is 
inconvenient and unnecessary to compute the 
probability of all the possible state sequences. In-
stead we implement N-best Viterbi search to find 
the N state sequences with the highest probabilities. 
The entropy H(N) is defined as the entropy of the 
distribution of the N-best state sequences: 
∑
∑∑
=
=
∧
∧
=
∧
∧
⎥
⎥
⎦
⎤
⎢
⎢
⎣
⎡
−=
N
i
N
i
i
i
N
i
i
i
P
P
P
P
NH
1
1
2
1
)(
)(
log
)(
)(
)(
o|s
o|s
o|s
o|s
.  (1) 
The range of the entropy H(N) is [0, 
N
1
log
2
− ] which varies according to different N. 
We could use the equation (2) to normalize the 
H(N) to [0, 1]. 
N
NH
NH
1
log
)(
)(
2
−
=′ .  (2) 
2.3 Diversity-based Sample Selection 
We measure the sentence structure similarity to 
represent the diversity and catch the most represen-
tative ones in order to give more diverse features to 
the machine learning-based classification systems. 
We propose a three-level hierarchy to represent 
the structure of a sentence. The first level is NP 
chunk, the second level is Part-Of-Speech tag, and 
the third level is the word itself. Each word is rep-
resented using this hierarchy structure. For exam-
ple in the sentence "I am a boy", the word "boy" is 
represented as w
r
=[NP, NN, boy]. The similarity 
score of two words is defined as: 
)()(
),(2
)(
21
21
21
wDepthwDepth
wwDepth
wwsim
rr
rr
rr
+
∗
=⋅  
Where ),(
21
wwDepth
rr
 is defined from the top 
level as the number of levels that the two words are 
in common. Under our three-level hierarchy 
scheme above, each word representation has depth 
of 3. 
The structure of a sentence S is represented as 
the word representation vectors ],  ,,[
21 N
www
r
K
rr
. 
We measure the similarity of two sentences by the 
standard cosine-similarity measure. The similarity 
score of two sentences is defined as: 
2211
21
21
),(
SSSS
SS
SSsimilarity
rrrr
rr
rr
⋅⋅
⋅
= , 
∑∑
⋅=⋅
ij
ji
wwsimSS )(
2121
rr
rr
. 
70
2.4 MMR Combination for Sample Selection 
We would like to score the sample sentences with 
respect to both the uncertainty and the diversity. 
The following MMR (Maximal Marginal Rele-
vance) (Carbonell and Goldstein 1998) formula is 
used to calculate the active learning score: 
),(Similaritymax                   
)1(),(yUncertaint)(
jiTs
i
def
i
ss
Mssscore
Mj
∈
∗
−−∗= λλ
  (3) 
where si is the sentence to be selected, Uncertainty 
is the entropy of si given current NER module M, 
and Similarity indicates the divergence degree be-
tween the si and the sentence sj in the training cor-
pus TM of M. The combination rule could be 
interpreted as assigning a higher score to a sen-
tence of which the NER module is uncertain and 
whose configuration differs from the sentences in 
the existing training corpus. The value of parame-
ter λ  coordinates those two different aspects of 
the desirable sample sentences. 
After initializing a NER module M and an ap-
propriate value of the parameterλ , we can assign 
each candidate sentence a score under the control 
of the uncertainty and the diversity. 
3 Experiment and Discussion 
3.1 Experiment Setup 
We conducted our active learning experiments us-
ing pool-based sample selection (Lewis and Gale 
1994). The pool-based sample selection, in which 
the learner chooses the best instances for labeling 
from a given pool of unlabelled examples, is the 
most practical approach for problems in which 
unlabelled data is relatively easily available. 
For our empirical evaluation of the active learn-
ing methods, we used the training and test data 
released by JNLPBA (Kim et al. 2004). The train-
ing corpus contains 2000 MEDLINE abstracts, and 
the test data contains 404 abstracts from the 
GENIA corpus. 100 abstracts were used to train 
our initial NER module. The remaining training 
data were taken as the pool. Each time, we chose k 
examples from the given pool to train the new 
NER module and the number k varied from 1000 
to 17000 with a step size 1000. 
We test 4 different active learning methods: Ran-
dom selection, Entropy-based uncertainty selection, 
Entropy combined with Diversity, and Normalized 
Entropy (equation (2)) combined with Diversity. 
When we compute the active learning score using 
the entropy based method and the combining 
methods we set the values of parameter N (from 
equation (1)) to 3 and λ  (from equation (3)) to 0.8 
empirically. 
 
Fig1. Comparison of active learning strategies with the ran-
l in the y-axis shows the 
per
bin  
ies consistently outperform 
the
dom selection 
3.2 Results and Analyses 
The initial NER module gets an F-score of 52.54, 
while the F-score performance of the NER module 
using the whole training data set is 67.19. We plot-
ted the learning curves for the different sample 
selection strategies. The interval in the x-axis be-
tween the curves shows the number of examples 
selected and the interva
formance improved. 
We compared the entropy, entropy combined 
with sentence diversity, normalized entropy com-
ed with sentence diversity and random selection.
The curves in Figure 1 show the relative per-
formance. The F-score increases along with the 
number of selected examples and receives the best 
performance when all the examples in the pool are 
selected. The results suggest that all three kinds of 
active learning strateg
 random selection.  
The entropy-based example selection has im-
proved performance compared with the random 
selection. The entropy (N=3) curve approaches to 
the random selection around 13000 sentences se-
lected, which is reasonable since all the methods 
choose the examples from the same given pool. As 
71
the number of selected sentences approaches the 
pool size, the performance difference among the 
different methods gets small. The best performance 
of the entropy strategy is 67.31 when 17000 exam-
ple
the
 normalized combined strategy 
behaves the worst. 
4 Conclusion 
ction could significantly reduce 
the human effort. 
by Minis-
try of Commerce, Industry and Energy. 
s are selected. 
Comparing with the entropy curve, the com-
bined strategy curve shows an interesting charac-
teristic. Up to 4000 sentences, the entropy strategy 
and the combined strategy perform similarly. After 
the 11000 sentence point, the combined strategy 
surpasses the entropy strategy. It accords with our 
belief that the diversity increases the classifier's 
performance when the large amount of samples is 
selected.  The normalized combined strategy dif-
fers from the combined strategy. It exceeds the 
other strategies from the beginning and maintains 
 best performance up until 12000 sentence point. 
   The entropy strategy reaches 67.00 in F-score 
when 11000 sentences are selected. The combined 
strategy receives 67.17 in F-score while 13000 sen-
tences are selected, while the end performance is 
67.19 using the whole training data. The combined 
strategy reduces 24.64 % of training examples 
compared with the random selection. The normal-
ized combined strategy achieves 67.17 in F-score 
when 11000 sentences are selected, so 35.43% of 
the training examples do not need to be labeled to 
achieve almost the same performance as the end 
performance. The normalized combined strategy's 
performance becomes similar to the random selec-
tion strategy at around 13000 sentences, and after 
14000 sentences the
 
We incorporate active learning into the biomedical 
named-entity recognition system to enhance the 
system's performance with only small amount of 
training data. We presented the entropy-based un-
certainty sample selection and combined selection 
strategies using the corpus diversity. Experiments 
indicate that our strategies for active-learning 
based sample sele
Acknowledgement  
This research was supported as a Brain Neuroin-
formatics Research Program sponsored 
References 
Carbonell J., & Goldstein J. (1998). The Use of MMR, 
Diversity-Based Reranking for Reordering Docu-
ments and Producing Summaries. In Proceedings of 
the 21st Annual International ACM-SIGIR Confer-
ence on Research and Development in Information 
Retrieval, pages 335-336. 
Cohn, D. A., Atlas, L., & Ladner, R. E. (1994). Improv-
ing generalization with active learning, Machine 
Learning, 15(2), 201-221. 
Dagan, I., & Engelson S. (1995). Committee-based 
sampling for training probabilistic classifiers. In Pro-
ceedings of the Twelfth International Conference on 
Machine Learning, pages 150-157, San Francisco, 
CA, Morgan Kaufman. 
Freund Y., Seung H.S., Shamir E., & Tishby N. (1997). 
Selective sampling using the query by committee al-
gorithm, Machine Learning, 28, 133-168. 
Kim JD., Ohta T., Tsuruoka Y., & Tateisi Y. (2004). 
Introduction to the Bio-Entity Recognition Task at 
JNLPBA, Proceedings of the International Workshop 
on Natural Language Processing in Biomedicine and 
its Application (JNLPBA). 
Lafferty, J., McCallum, A., & Pereira, F. (2001). Condi-
tional random fields: Probabilistic models for seg-
menting and labeling sequence data. In Proc. of the 
18th International Conf. on Machine Learning, pages 
282-289, Williamstown, MA, Morgan Kaufmann. 
Lewis D., & Gale W. (1994). A Sequential Algorithm 
for Training Text Classifiers, In: Proceedings of the 
Seventeenth Annual International ACM-SIGIR Con-
ference on Research and Development in Information 
Retrieval. pp. 3-12, Springer-Verlag. 
Liere, R., & Tadepalli, P. (1997). Active learning with 
committees for text categorization, In proceedings of 
the Fourteenth National Conference on Artificial In-
telligence, pp. 591-596 Providence, RI. 
Scheffer T., & Wrobel S. (2001). Active learning of 
partially hidden markov models. In Proceedings of 
the ECML/PKDD Workshop on Instance Selection. 
Song Y., Kim E., Lee G.G., & Yi B-k. (2005). 
POSBIOTM-NER: a trainable biomedical named-
entity recognition system. Bioinformatics, 21 (11): 
2794-2796. 
Thompson C.A., Califf M.E., & Mooney R.J. (1999). 
Active Learning for Natural Language Parsing and 
Information Extraction, In Proceedings of the Six-
teenth International Machine Learning Conference, 
pp.406-414, Bled, Slovenia. 
72
