Active Annotation
Andreas Vlachos
William Gates Building
Computer Laboratory
University of Cambridge
av308@cl.cam.ac.uk
Abstract
This paper introduces a semi-supervised
learning framework for creating training
material, namely active annotation. The
main intuition is that an unsupervised
method is used to initially annotate imper-
fectly the data and then the errors made
are detected automatically and corrected
by a human annotator. We applied ac-
tive annotation to named entity recogni-
tion in the biomedical domain and encour-
aging results were obtained. The main
advantages over the popular active learn-
ing framework are that no seed annotated
data is needed and that the reusability of
the data is maintained. In addition to the
framework, an efficient uncertainty esti-
mation for Hidden Markov Models is pre-
sented.
1 Introduction
Training material is always an issue when applying
machine learning to deal with information extrac-
tion tasks. It is generally accepted that increasing
the amount of training data used improves perfor-
mance. However, training material comes at a cost,
since it requires annotation.
As a consequence, when adapting existing methods
and techniques to a new domain, researchers and
users are faced with the absence of annotated
material that could be used for training. A
good example is the biomedical domain, which has
attracted the attention of the NLP community
relatively recently (Kim et al., 2004). Even though
biomedical text is plentiful, very little of it is
annotated; the GENIA corpus (Kim et al., 2003) is
one such annotated resource.
A very popular and well investigated framework
in order to cope with the lack of training mate-
rial is the active learning framework (Cohn et al.,
1995; Seung et al., 1992). It has been applied
to various NLP/IE tasks, including named entity
recognition (Shen et al., 2004) and parse selec-
tion (Baldridge and Osborne, 2004) with rather im-
pressive results in reducing the amount of anno-
tated training data. However, some criticism of ac-
tive learning has been expressed recently, concern-
ing the reusability of the data (Baldridge and Os-
borne, 2004).
This paper presents a framework in order to deal
with the lack of training data for NLP tasks. The
intuition behind it is that annotated training data is
produced by applying an (imperfect) unsupervised
method, and then the errors inserted in the annota-
tion are detected automatically and reannotated by
a human annotator. The main difference compared
to active learning is that instead of selecting unla-
beled instances for annotation, possible erroneous
instances are selected for checking and correction
if they are indeed erroneous. We will refer to this
framework as “active annotation” in the rest of the
paper.
The structure of this paper is as follows. In Sec-
tion 2 we describe the software and the dataset used.
Section 3 explores the effect of errors in the training
data and motivates the active annotation framework.
In Section 4 we describe the framework in detail,
while Section 5 presents a method for estimating
uncertainty for HMMs. Section 6 presents results from
applying the active annotation. Section 7 compares
the proposed framework to active learning and Sec-
tion 8 attempts an analysis of its performance. Fi-
nally, Section 9 suggests some future work.
2 Experimental setup
The data used in the experiments that follow are
taken from the BioNLP 2004 named entity recog-
nition shared task (Kim et al., 2004). The text pas-
sages have been annotated with five classes of en-
tities, “DNA”, “RNA”, “protein”, “cell type” and
“cell line”. In our experiments, following the ex-
ample of Dingare et al. (2004), we simplified the an-
notation to one entity class, namely “gene”, which
includes the DNA, RNA and protein classes. In order
to evaluate the performance on the task, we used
the evaluation script supplied with the data, which
computes the F-score ($F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$) for
each entity class. It must be noted that all tokens
of an entity must be recognized correctly in order to
count as a correct prediction. A partially recognized
entity counts both as a precision and recall error. In
all the experiments that follow, the official split of
the data in training and testing was maintained.
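The exact-match criterion described above can be illustrated with a minimal Python sketch. This is not the official evaluation script; it assumes, purely for illustration, that each token carries either the label "gene" or "O" and that an entity is a maximal run of "gene" tokens. The function names are hypothetical.

```python
def entity_spans(labels, entity="gene"):
    """Return the set of (start, end) spans of maximal runs labelled `entity`."""
    spans, start = set(), None
    for i, lab in enumerate(labels):
        if lab == entity and start is None:
            start = i
        elif lab != entity and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(labels)))
    return spans


def exact_match_f1(gold, predicted, entity="gene"):
    """Precision, recall and F1 where only fully matching entities count."""
    gold_spans = entity_spans(gold, entity)
    pred_spans = entity_spans(predicted, entity)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["O", "gene", "gene", "O", "gene"]
pred = ["O", "gene", "O", "O", "gene"]
# The partially recognized first entity is both a precision and a recall error.
print(exact_match_f1(gold, pred))
```

In this toy example one of two predicted entities and one of two gold entities match exactly, so the partially recognized entity lowers both precision and recall.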
The named entity recognition system used in our
experiments is the open source NLP toolkit
Lingpipe (see footnote 1). The named entity
recognition module is an HMM using Witten-Bell
smoothing. In our experiments, using the data
mentioned earlier, it achieved 70.06% F-score.
3 Effect of errors
Noise in the training data is a common issue when
training machine learning methods for NLP tasks. It
can have a significant effect on performance, as was
pointed out by Dingare et al. (2004), where the performance
of the same system on the same task (named entity
recognition in the biomedical domain) was lower
when using noisier material. The effect of noise in
the data used to train machine learning algorithms
for NLP tasks has been explored by Osborne (2002),
using the task of shallow parsing as the case study
and a variety of learners. The impact of different
types of noise was explored and learner specific
extensions were proposed in order to deal with it.
(Footnote 1: http://www.alias-i.com/lingpipe/)
In our experiments we explored the effect of noise
in training the selected named entity recognition
system, keeping in mind that we are going to use an
unsupervised method to create the training material.
The kind of noise we expect is mislabelled instances.
In order to simulate the behavior of a hypothetical
unsupervised method, we corrupted the training data
artificially using the following models:
• LowRecall: Change tokens labelled as entities
to non-entities. It must be noted that in this
model, due to the presence of multi-token en-
tities precision is reduced too, albeit less than
recall.
• LowRecall_WholeEntities: Change the labeling
of whole entities to non-entities. In this
model, precision is kept intact.
• LowPrecision: Change tokens labelled as non-
entities to entities.
• Random: Entities and non-entities are changed
randomly. It can be viewed alternatively as a
random tagger which labels the data with some
accuracy.
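The four noise models listed above can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the experiments: it assumes per-token labels "gene" and "O", treats a maximal run of "gene" tokens as one entity, and takes the noise level `p` as the probability of changing each candidate label. All function names are hypothetical.

```python
import random

ENTITY, OTHER = "gene", "O"


def _runs(labels):
    """Yield (start, end) for each maximal run of ENTITY tokens."""
    start = None
    for i, lab in enumerate(labels):
        if lab == ENTITY and start is None:
            start = i
        elif lab != ENTITY and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(labels)


def low_recall(labels, p, rng):
    """Change each entity token to a non-entity with probability p."""
    return [OTHER if lab == ENTITY and rng.random() < p else lab for lab in labels]


def low_recall_whole_entities(labels, p, rng):
    """Change every token of an entity to a non-entity with probability p."""
    noisy = list(labels)
    for start, end in _runs(labels):
        if rng.random() < p:
            noisy[start:end] = [OTHER] * (end - start)
    return noisy


def low_precision(labels, p, rng):
    """Change each non-entity token to an entity with probability p."""
    return [ENTITY if lab == OTHER and rng.random() < p else lab for lab in labels]


def random_noise(labels, p, rng):
    """Flip each token label, entity or not, with probability p."""
    flip = {ENTITY: OTHER, OTHER: ENTITY}
    return [flip[lab] if rng.random() < p else lab for lab in labels]


rng = random.Random(0)  # fixed seed so a corruption run can be repeated
labels = ["O", "gene", "gene", "O", "gene"]
print(low_recall_whole_entities(labels, 0.5, rng))
```

Passing an explicit random generator makes it straightforward to repeat the corruption several times with different seeds, as was done five times per setting in the experiments.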
The level of noise inserted is adjusted by specify-
ing the probability with which a candidate label is
changed. In all the experiments in this paper, for a
particular model and level of noise, the corruption
of the dataset was repeated five times in order to
produce more reliable results. In practice, the be-
havior of an unsupervised method is likely to be a
mixture of the above models. However, given that
the method would tag the data with a certain per-
formance, we attempted through our experiments to
identify which of these (extreme) behaviors would
be less harmful. In Figure 1, we present graphs
showing the effect of noise inserted with the above
models. The experimental procedure was to add
noise to the training data according to a model, eval-
uate the performance of the hypothetical tagger that
produced it, train Lingpipe on the noisy training data
and evaluate the performance of the latter on the test
data. The process was repeated for various levels
of noise. In the top graph, the F-score achieved by
Lingpipe (F-ling) is plotted against the F-score of
the hypothetical tagger (F-tag), while in the bottom
graph the F-score achieved by Lingpipe (F-ling) is
plotted against the number of erroneous classifica-
tions made by the hypothetical tagger.
[Figure 1 plots. Top: F-ling against F-tag. Bottom: F-ling against raw errors (in thousands). Curves in both graphs: random, lowrecall, wholeentities, lowprecision.]
Figure 1: F-score achieved by Lingpipe is plotted
against (a) the F-score of the hypothetical tagger in
the top graph and (b) the number of errors made by
the hypothetical tagger in the bottom graph.
A first observation is that limited noise does not
affect the performance significantly, a phenomenon
that can be attributed to the capacity of the machine
learning method to deal with noise. From the point
of view of correcting mistakes in the training data
this suggests that not all mistakes need to be cor-
rected. Another observation is that while the performance
for all the models follows similar curves when
plotted against the F-score of the hypothetical tag-
ger, the same does not hold when plotted against the
number of errors. While this can be attributed to the
unbalanced nature of the task (very few entity to-
kens compared to non-entities), it also suggests that
the raw number of errors in the training data is not
a good indicator for the performance obtained by
training on it. However, it represents the effort re-
quired to obtain the maximum performance from the
data by correcting it.
4 Active Annotation
In this section we present a detailed description of
the active annotation framework. Initially, we have
a pool of unlabeled data, D, whose instances are
annotated using an unsupervised method u, which does
not need training data. As expected, a significant
number of errors is introduced during this process. A
list L is created containing the tokens that have not
been checked by a human annotator. Then, a super-
vised learner s is used to train a model M on this
noisy training material. A query module q, which
uses the model created by s decides which instances
of D will be selected to be checked for errors by
a human annotator. The selected instances are re-
moved from L so that q does not select them again
in future. The learner s is then trained on this par-
tially corrected training data and the sequence is re-
peated from the point of applying the querying mod-
ule q. The algorithm written in pseudocode appears
in Figure 2.
Data D, unsupervised tagger u,
supervised learner s, query module q.
Initialization:
Apply u to D.
Create list of instances L.
Loop:
Using s train a model M on D.
Using q and M select a batch of instances B
to be checked.
Correct the instances of B in D.
Remove the instances of B from L.
Repeat until:
L is empty or annotator stops.
Figure 2: Active annotation algorithm
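The loop of Figure 2 can also be written as a short Python sketch. The callables `u`, `s`, `q` and `oracle` stand in for the unsupervised tagger, the supervised learner, the query module and the human annotator; their signatures are assumptions made here for illustration, as is the final retraining step after the loop.

```python
def active_annotation(data, u, s, q, oracle, batch_size, max_rounds=None):
    """Sketch of the active annotation loop.

    data:   list of instances (e.g. tokens)
    u:      data -> list of initial (noisy) labels
    s:      (data, labels) -> trained model
    q:      (model, data, labels, unchecked, batch_size) -> indices to check
    oracle: index -> correct label, playing the role of the human annotator
    """
    labels = list(u(data))               # Initialization: apply u to D
    unchecked = set(range(len(data)))    # the list L of unchecked instances
    rounds = 0
    while unchecked and (max_rounds is None or rounds < max_rounds):
        model = s(data, labels)          # train M on the current labels
        batch = q(model, data, labels, unchecked, batch_size)
        if not batch:                    # nothing selected: stop early
            break
        for i in batch:
            labels[i] = oracle(i)        # annotator checks and corrects
        unchecked.difference_update(batch)
        rounds += 1
    # Assumption: retrain once more so the returned model reflects all corrections.
    return labels, s(data, labels)
```

Here `max_rounds` models the annotator deciding to stop before every instance has been checked.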
Comparing it with active learning, the similarities
are apparent. Both frameworks have a loop in which
a query module q, using a model produced by the
learner, selects instances to be presented to a human
annotator. The efficiency of active annotation can be
measured in two ways, both of them used in evalu-
ating active learning. The first is to measure the re-
duction in the checked instances needed in order to
achieve a certain level of performance. The second
is the increase in performance for a fixed number
of checked instances. Following the active learning
paradigm, a baseline for active annotation is random
selection of instances to be checked.
There are though some notable differences. Dur-
ing initialization, an unsupervised method u is re-
quired to provide an initial tagging on the data D.
This is an important restriction which is imposed
by the lack of any annotated data. Even under this
restriction, there are some options available, espe-
cially for tasks which have compiled resources. One
option is to use an unsupervised learning algorithm,
such as the one presented by Collins & Singer (1999),
where a seed set of rules is used to bootstrap a rule-
based named entity recognizer. A different approach
could be the use of a dictionary-based tagger, as in
Morgan et al. (2003). It must be noted that the unsu-
pervised method used to provide the initial tagging
does not need to generalize to any data (a common
problem for such methods); it only needs to perform
well on the data used during active annotation. Gen-
eralization on unseen data is an attribute we hope
that the supervised learning method s will have af-
ter training on the annotated material created with
active annotation.
The query module q is also different from the cor-
responding module in active learning. Instead of se-
lecting unlabeled informative instances to be anno-
tated and added to the training data, its purpose is
to identify likely errors in the imperfectly labelled
training data, so that they are checked and corrected
by the human annotator.
In order to perform error-detection, we chose
to adapt the approach of Nakagawa and Mat-
sumoto (2002) which resembles uncertainty based
sampling for active learning. According to their
paradigm, likely errors in the training data are in-
stances that are “hard” for the classifier and incon-
sistent with the rest of the data. In our case, we used
the uncertainty of the classifier as the measure of the
“hardness” of an instance. As an indication of in-
consistency, we used the disagreement of the label
assigned by the classifier with the current label of the
instance. Intuitively, if the classifier disagrees with
the label of an instance used in its training, it indi-
cates that there have been other similar instances in
the training data that were labelled differently. Re-
turning to the description of active annotation, the
query module q ranks the instances in L first by their
inconsistency and then by decreasing uncertainty of
the classifier. As a result, instances that are
inconsistent with the rest of the data and hard for the classi-
fier are selected first, then those that are inconsistent
but easy for the classifier, then the consistent ones
but hard for the classifier and finally the consistent
and easy ones.
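The ranking performed by the query module can be expressed as a two-part sort key. The sketch below is illustrative only; it assumes that the current labels, the classifier's predicted labels and a per-instance uncertainty score are available as index-aligned sequences, and the function name is hypothetical.

```python
def rank_for_checking(unchecked, current, predicted, uncertainty):
    """Order unchecked instances: inconsistent before consistent,
    and within each group by decreasing classifier uncertainty."""
    return sorted(
        unchecked,
        # False sorts before True, so disagreements (inconsistent) come first.
        key=lambda i: (predicted[i] == current[i], -uncertainty[i]),
    )


current = ["O", "O", "gene", "gene"]
predicted = ["gene", "O", "O", "gene"]
uncertainty = [0.1, 0.9, 0.8, 0.2]
print(rank_for_checking([0, 1, 2, 3], current, predicted, uncertainty))
```

Dropping the first component of the key gives pure uncertainty based selection, which ignores label consistency.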
While this method of detecting errors resembles
uncertainty sampling, there are other approaches
that could have been used instead and they can be
very different. Sjöbergh and Knutsson (2005)
inserted artificial errors and trained a classifier to
recognize them. Dickinson and Meurers (2003)
proposed methods based on n-grams occurring with
different labellings in the corpus. Therefore, while it is
reasonable to expect some correlation between the
selections of active annotation and active learning
(hard instances are likely to be erroneously anno-
tated by the unsupervised tagger), the task of select-
ing hard instances is quite different from detecting
errors. The use of the disagreement between tag-
gers for selecting candidates for manual correction
is reminiscent of corrected co-training (Pierce and
Cardie, 2001). However, the main difference is that
corrected co-training results in a manually annotated
corpus, while active annotation allows automatically
annotated instances to be kept.
5 HMM uncertainty estimation
In order to perform error detection according to the
previous section we need to obtain uncertainty es-
timations over each token from the named entity
recognition module of Lingpipe. For each token t
and possible label l, Lingpipe estimates the follow-
ing Hidden Markov Model from the training data:
$$P(t[n], l[n] \mid l[n-1], t[n-1], t[n-2]) \tag{1}$$
When annotating a certain text passage, the tokens
are fixed and the joint probability of Equation 1 is
computed for each possible combination of labels.
From Bayes’ rule, we obtain:
$$P(l[n] \mid t[n], l[n-1], t[n-1], t[n-2]) = \frac{P(t[n], l[n] \mid l[n-1], t[n-1], t[n-2])}{P(t[n] \mid l[n-1], t[n-1], t[n-2])} \tag{2}$$
For a fixed token sequence t[n], t[n − 1], t[n − 2]
and previous label l[n − 1], the denominator on the
right-hand side of Equation 2 is a fixed value. Therefore,
under these conditions, we can write:
$$P(l[n] \mid t[n], l[n-1], t[n-1], t[n-2]) \propto P(t[n], l[n] \mid l[n-1], t[n-1], t[n-2]) \tag{3}$$
From Equation 3 we obtain an approximation for
the conditional distribution of the current label (l[n])
conditioned on the previous label (l[n − 1]) for a
fixed sequence of tokens. It must be stressed that
the latter restriction is very important. The resulting
distribution from Equation 3 cannot be compared
across different token sequences. However,
for the purpose of computing the uncertainty over
a fixed token sequence it is a reasonable approxi-
mation. One way to estimate the uncertainty of the
classifier is to calculate the conditional entropy of
this distribution. The conditional entropy for a dis-
tribution P(X|Y ) can be computed as:
$$H[X \mid Y] = -\sum_{y} P(Y = y) \sum_{x} P(X = x \mid Y = y) \log P(X = x \mid Y = y) \tag{4}$$
In our case, X is l[n] and Y is l[n − 1]. Equation 4
can be interpreted as the weighted sum of
the entropies of P(l[n]|l[n − 1]) for each value
of l[n − 1], in our case the weighted sum of en-
tropies of the distribution of the current label for
each possible previous label. The probabilities for
each tag (needed for P(l[n − 1])) are not calcu-
lated directly from the model. P(l[n]) corresponds
to P(l[n]|t[n],t[n − 1],t[n − 2]), but since we are
considering a fixed token sequence, we approxi-
mate its distribution using the conditional proba-
bility P(l[n]|t[n],l[n − 1],t[n − 1],t[n − 2]), by
marginalizing over l[n− 1].
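A minimal sketch of this uncertainty estimate for a single token is given below. It is not Lingpipe code. It assumes the model exposes, for each previous label, unnormalised scores proportional to the conditional distribution of the current label (as in Equation 3), and that an estimate of the distribution over the previous label is supplied by the caller. Both inputs and the function name are hypothetical.

```python
import math


def conditional_entropy(scores, prev_probs):
    """Weighted sum over previous labels of the entropy of the current label.

    scores[prev][cur]: unnormalised score proportional to
        P(l[n] = cur | t[n], l[n-1] = prev, t[n-1], t[n-2])
    prev_probs[prev]:  estimate of P(l[n-1] = prev)
    """
    total = 0.0
    for prev, row in scores.items():
        z = sum(row.values())
        if z == 0:
            continue  # no probability mass for this previous label
        entropy = 0.0
        for score in row.values():
            p = score / z  # normalise for the fixed token sequence
            if p > 0:
                entropy -= p * math.log(p)
        total += prev_probs[prev] * entropy
    return total


scores = {
    "O": {"O": 0.2, "gene": 0.2},     # undecided after a non-entity
    "gene": {"O": 0.0, "gene": 0.5},  # certain after an entity token
}
prev_probs = {"O": 0.5, "gene": 0.5}
print(conditional_entropy(scores, prev_probs))
```

Normalising each row is only valid because the token sequence is fixed, which is the restriction stressed above.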
Again, it must be noted that the above calculations
are to be used in estimating uncertainty over a single
word. One property of the conditional entropy is that
it estimates the uncertainty of the predictions for the
current label given knowledge of the previous tag,
which is important in our application because we
need the uncertainty over each label independently
from the rest of the sequence. This is confirmed by
the theory, from which we know that for a condi-
tional distribution of X given Y the following equa-
tion holds, H[X|Y ] = H[X,Y ] − H[Y ], where H
denotes the entropy.
A different way of obtaining uncertainty estimations
from HMMs in the framework of active learning
has been presented by Scheffer et al. (2001).
There, the uncertainty is estimated by the margin
between the two most likely predictions that would
result in a different current label, explicitly:
$$M = \max_{i,j} \{P(t[n] = i \mid t[n-1] = j)\} - \max_{k,l,\, k \neq i} \{P(t[n] = k \mid t[n-1] = l)\} \tag{5}$$
Intuitively, the margin M is the difference be-
tween the two highest scored predictions that dis-
agree. The lower the margin, the higher the uncertainty
of the HMM on the token in question. A drawback
of this method is that it doesn't take into account
the distribution of the previous label. It is possible
that the two highest scored predictions are obtained
for two different previous labels. It may also
be the case that a highly scored label can be obtained
given a very improbable previous label. Finally, an
alternative that we did not explore in this work is
the Field Confidence Estimation (Culotta and Mc-
Callum, 2004), which allows the estimation of con-
fidence over sequences of tokens, instead of single-
ton tokens only. However, in this work confidence
estimation over singleton tokens is sufficient.
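For comparison, the margin of Equation 5 can be sketched as follows. This is an illustration under the assumption that the scores for the current label given each previous label are available as a nested dictionary; it is not the implementation of Scheffer et al. (2001), and the function name is hypothetical.

```python
def margin(scores):
    """Difference between the two highest scores that predict different
    current labels, maximising over all previous labels.

    scores[prev][cur]: score of current label `cur` given previous label `prev`
    """
    best_by_label = {}
    for row in scores.values():
        for cur, p in row.items():
            best_by_label[cur] = max(p, best_by_label.get(cur, 0.0))
    if len(best_by_label) < 2:
        raise ValueError("the margin needs at least two candidate labels")
    ranked = sorted(best_by_label.values(), reverse=True)
    return ranked[0] - ranked[1]


scores = {
    "O": {"O": 0.9, "gene": 0.1},
    "gene": {"O": 0.3, "gene": 0.7},
}
# The best score (0.9) and the runner-up (0.7) come from different previous labels.
print(margin(scores))
```

The toy input shows the drawback discussed above: the two scores being compared are conditioned on different previous labels, and the likelihood of those previous labels plays no role.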
6 Experimental Results
In this section we present results from applying ac-
tive annotation to biomedical named entity recogni-
tion. Using the noise models described in Section 3,
we corrupted the training data and then using Ling-
pipe as the supervised learner we applied the algo-
rithm of Figure 2. The batch of tokens selected to be
checked in each round was 2000 tokens. As a base-
line for comparison we used random selection of to-
kens to be checked. The results for various noise
models and levels are presented in the graphs of Fig-
ure 3. In each of these graphs, the performance of
Lingpipe trained on the partially corrected material
(F-ling) is plotted against the number of checked in-
stances, under the label “entropy”.
In all the experiments, active annotation signifi-
cantly outperformed random selection, with the ex-
ception of 50% Random, where the high level of
noise (the F-score of the hypothetical tagger that
provided the initial data was 0.1) affected the initial
[Figure 3 plots. Each panel shows F-ling against tokens_checked (in thousands).
Random 10%: curves random, margin, entropy.
LowRecall_WholeEntities 20%: curves random, margin, entropy.
LowRecall 50%: curves random, margin, entropy.
LowRecall 20%: curves random, uncertainty, entropy.
LowPrecision 20%: curves random, uncertainty, entropy.
Random 50%: curves random, uncertainty, entropy.]
Figure 3: F-score achieved by Lingpipe is plotted
against the number of checked instances for various
models and levels of noise.
judgements of the query module on which instances
should be checked. After having checked some por-
tion of the dataset though, active annotation started
outperforming random selection. In the graphs for
the 10% Random, 20% LowRecall_WholeEntities
and 50% LowRecall noise models, the curves
labelled “margin” show the performance obtained
using the uncertainty estimation of Scheffer
et al. (2001). Even though active annotation using
this method performs better than random
selection, active annotation using conditional entropy
performs significantly better. These results provide
evidence of the theoretical advantages of conditional
entropy described earlier. We also ran experiments
using pure uncertainty based sampling (i.e. with-
out checking the consistency of the labels) on se-
lecting instances to be checked. The performance
curves appear under the label “uncertainty” for the
20% LowRecall, 50% Random and 20% LowPreci-
sion noise models. The uncertainty was estimated
using the method described in Section 5. As ex-
pected, uncertainty based sampling performed reasonably
well, better than random selection but worse
than using labelling consistency, except for the ini-
tial stage of 20% LowPrecision.
7 Active Annotation versus Active
Learning
In order to compare active annotation to active
learning, we ran active learning experiments using the
same dataset and software. The paradigm employed
was uncertainty based sampling, using the uncertainty
estimation presented in Section 5. HMMs require
annotated sequences of tokens, therefore annotating
whole sentences seemed the natural choice, as
in Becker et al. (2005). While token-level selections
could be used in combination with EM, as in
Scheffer et al. (2001), constructing a corpus of
individual tokens would result in a corpus that would
be very difficult to reuse, since it would be
partially labelled. We employed the two standard
options for selecting sentences: selecting the sentences
with the highest average uncertainty over the tokens,
or selecting the sentences containing the most
uncertain token. As cost metric we used the number of
tokens, which allows a more straightforward
comparison with active annotation.
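The two sentence selection options can be sketched as below. The per-token uncertainties are assumed to be given as one list per (non-empty) sentence, and the token budget with its stopping rule is an assumption introduced here to reflect the token-based cost metric. The function names are hypothetical.

```python
def sentence_score(token_uncertainties, mode="ave"):
    """Score a sentence by its average or its maximum token uncertainty."""
    if mode == "ave":
        return sum(token_uncertainties) / len(token_uncertainties)
    if mode == "max":
        return max(token_uncertainties)
    raise ValueError(f"unknown mode: {mode}")


def select_sentences(uncertainties, token_budget, mode="ave"):
    """Pick sentence indices by decreasing score until the next sentence
    would exceed the token budget (assumed stopping rule)."""
    order = sorted(
        range(len(uncertainties)),
        key=lambda i: sentence_score(uncertainties[i], mode),
        reverse=True,
    )
    chosen, used = [], 0
    for i in order:
        n_tokens = len(uncertainties[i])
        if used + n_tokens > token_budget:
            break
        chosen.append(i)
        used += n_tokens
    return chosen


uncertainties = [[0.1, 0.1, 0.9], [0.5, 0.5], [0.2]]
print(select_sentences(uncertainties, token_budget=3, mode="ave"))
print(select_sentences(uncertainties, token_budget=3, mode="max"))
```

In the toy input the two modes disagree: the first sentence contains the single most uncertain token, while the second has the higher average.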
In Figure 4 (left graph), each active learning
experiment is started by selecting a random sentence as
seed data, repeating the seed selection 5 times. The
random selection is repeated 5 times for each seed
selection. As in (Becker et al., 2005), selecting the
sentences with the highest average uncertainty (ave)
performs better than selecting those with the most
uncertain token (max).
In the right graph, we compare the best active
learning method with active annotation. Apparently,
the performance of active annotation is highly de-
pendent on the performance of the unsupervised tag-
ger used to provide us with the initial annotation of
the data. In the graph, we include curves for two
of the noise models reported in the previous sec-
tion, LowRecall20% and LowRecall50% which cor-
respond to tagging performance of 0.66 / 0.69 / 0.67
and 0.33 / 0.43 / 0.37 respectively, in terms of Re-
call / Precision / F. We consider such tagging per-
formances feasible with a dictionary-based tagger,
since Morgan et al. (2003) report performance of
[Figure 4 plots. Both panels show F-ling against tokens_checked (in thousands).
Left, "Active learning": curves ave, max, random.
Right, "AA vs AL": curves AA-20%, AA-50%, AL-ave.]
Figure 4: Left, comparison among various active
learning methods. Right, comparison of active
learning and active annotation.
0.88 / 0.78 / 0.83 with such a method.
These results demonstrate that active annotation,
given a reasonable starting point, can achieve reduc-
tions in the annotation cost comparable to those of
active learning. Furthermore, active annotation
produces an actual corpus, albeit a noisy one. As
pointed out by Baldridge & Osborne (2004), active
learning reduces the amount of training material
needed, but it selects data that might not be useful
for training a different learner. Active annotation,
in contrast, is likely to preserve correct instances
that might not be useful to the machine learning
method used to create the corpus, but may be beneficial
to a different method. Furthermore, producing an actual
corpus can be very important when adding new fea-
tures to the model. In the case of biomedical NER,
one could consider adding document-level features,
such as whether a token has been seen as part of a
gene name earlier in the document. With the cor-
pus constructed using active learning this is not fea-
sible, since it is unlikely that all the sentences of a
document are selected for annotation. Also, if one
intended to use the same corpus for a different task,
such as anaphora resolution, again the imperfectly
annotated corpus constructed using active annota-
tion can be used more efficiently than the partially
annotated one produced by active learning.
8 Selecting errors
In order to investigate further the behavior of ac-
tive annotation, we evaluated the performance of the
trained supervised method against the number of er-
rors corrected by the human annotator. The aim of
this experiment was to verify whether the improve-
ment in performance compared to random selection
is due to selecting “informative” errors to correct, or
due to the efficiency of the error detection technique.
[Figure 5 plots. Left: F-ling against errors corrected (in thousands). Right: errors corrected (in thousands) against tokens checked (in thousands). Curves in both graphs: random, entropy, reverse.]
Figure 5: Left: F-score achieved by Lingpipe
is plotted against the number of corrected errors.
Right: Errors corrected plotted against the number
of checked tokens.
In Figure 5, we present such graphs for the 10%
Random noise model. Similar results were obtained
with different noise models. As can be observed on
the left graph, the errors corrected initially during
random selection are far more informative compared
to those corrected at the early stages of active anno-
tation (labelled “entropy”). The explanation for this
is that using the error detection method described in
Section 4, the errors that are detected are those on
which the supervised method s disagrees with the
training material, which implies that even if such an
instance is indeed an error then it didn’t affect s.
Therefore, correcting such errors will not improve
the performance significantly. Informative errors are
those that s has learnt to reproduce with high cer-
tainty. However, such errors are hard to detect be-
cause similar attributes are exhibited usually by cor-
rectly labelled instances. This can be verified by the
curves labelled “reverse” in the graphs of Figure 5,
in which the ranking of the instances to be selected
was reversed, so that instances where the supervised
method agrees confidently with the training material
are selected first. The fact that errors with high
uncertainty are less informative than those with low
uncertainty suggests that active annotation, while
related to active learning, is sufficiently different
from it. The right graph suggests that the error-detection
performance during active annotation is much better
than that of random selection. Therefore, the per-
formance of active annotation could be improved by
preserving the high error-detection performance and
selecting more informative errors.
9 Future work
This paper described active annotation, a semi-
supervised learning framework that reduces the ef-
fort needed to create training material, which is
very important in adapting existing trainable meth-
ods to new domains. Future work should investi-
gate the applicability of the framework in a variety
of NLP/IE tasks and settings. We intend to apply this
framework to NER for biomedical literature from
the FlyBase project for which no annotated datasets
exist.
While we have used the number of instances
checked by a human annotator to measure the cost
of annotation, this might not be representative of the
actual cost. The task of checking and possibly cor-
recting instances differs from annotating them from
scratch. In this direction, experiments in realistic
conditions with human annotators should be carried
out. We also intend to explore the possibility of
grouping similar mistakes detected in a round of ac-
tive annotation, so that the human annotator can cor-
rect them with less effort. Finally, alternative error-
detection methods should be investigated.
Acknowledgments
The author was funded by BBSRC, grant number
38688. I would like to thank Ted Briscoe and Bob
Carpenter for their feedback and comments.

References
J. Baldridge and M. Osborne. 2004. Active learning and
the total cost of annotation. In Proceedings of EMNLP
2004, Barcelona, Spain.
M. Becker, B. Hachey, B. Alex, and C. Grover.
2005. Optimising selective sampling for bootstrapping
named entity recognition. In Proceedings of the
Workshop on Learning with Multiple Views, ICML.
D. A. Cohn, Z. Ghahramani, and M. I. Jordan. 1995.
Active learning with statistical models. In Advances
in Neural Information Processing Systems, volume 7.
M. Collins and Y. Singer. 1999. Unsupervised models
for named entity classification. In Proceedings of the
Joint SIGDAT Conference on EMNLP and VLC.
A. Culotta and A. McCallum. 2004. Confidence
estimation for information extraction. In Proceedings
of HLT 2004, Boston, MA.
M. Dickinson and W. D. Meurers. 2003. Detecting errors
in part-of-speech annotation. In Proceedings of EACL
2003, pages 107–114, Budapest, Hungary.
S. Dingare, J. Finkel, M. Nissim, C. Manning, and
C. Grover. 2004. A system for identifying named
entities in biomedical text: How results from two
evaluations reflect on both the system and the evaluations. In
The 2004 BioLink meeting at ISMB.
J. D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. Genia
corpus - a semantically annotated corpus for
bio-textmining. In ISMB (Supplement of Bioinformatics).
J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier,
editors. 2004. Proceedings of JNLPBA, Geneva.
A. Morgan, L. Hirschman, A. Yeh, and M. Colosimo.
2003. Gene name extraction using FlyBase resources.
In Proceedings of the ACL 2003 Workshop on NLP in
Biomedicine, pages 1–8.
T. Nakagawa and Y. Matsumoto. 2002. Detecting errors
in corpora using support vector machines. In
Proceedings of COLING 2002.
M. Osborne. 2002. Shallow parsing using noisy and
non-stationary training material. J. Mach. Learn. Res.,
2:695–719.
D. Pierce and C. Cardie. 2001. Limitations of co-training
for natural language learning from large datasets. In
Proceedings of EMNLP 2001, pages 1–9.
T. Scheffer, C. Decomain, and S. Wrobel. 2001. Active
hidden Markov models for information extraction.
Lecture Notes in Computer Science, 2189:309+.
H. S. Seung, M. Opper, and H. Sompolinsky. 1992.
Query by committee. In Proceedings of COLT 1992.
D. Shen, J. Zhang, J. Su, G. Zhou, and C. L. Tan. 2004.
Multi-criteria-based active learning for named entity
recognition. In Proceedings of ACL 2004, Barcelona.
J. Sjöbergh and O. Knutsson. 2005. Faking errors to
avoid making errors: Machine learning for error
detection in writing. In Proceedings of RANLP 2005.
