Exploiting Context for Biomedical Entity Recognition:
From Syntax to the Web
Jenny Finkel,* Shipra Dingare,† Huy Nguyen,*
Malvina Nissim,† Christopher Manning,* and Gail Sinclair†
*Department of Computer Science
Stanford University
Stanford, CA 93405-9040
United States
{jrfinkel|htnguyen|manning}
@cs.stanford.edu
†Institute for Communicating and
Collaborative Systems
University of Edinburgh
Edinburgh EH8 9LW
United Kingdom
{sdingar1|mnissim|csincla1}
@inf.ed.ac.uk
Abstract
We describe a machine learning system for the
recognition of names in biomedical texts. The sys-
tem makes extensive use of local and syntactic fea-
tures within the text, as well as external resources
including the web and gazetteers. It achieves an F-
score of 70% on the Coling 2004 NLPBA/BioNLP
shared task of identifying five biomedical named en-
tities in the GENIA corpus.
1 Introduction
The explosion of information in the fields of molec-
ular biology and genetics has provided a unique
opportunity for natural language processing tech-
niques to aid researchers and curators of databases
in the biomedical field by providing text mining
services. Yet typical natural language processing
tasks such as named entity recognition, informa-
tion extraction, and word sense disambiguation are
particularly challenging in the biomedical domain
with its highly complex and idiosyncratic language.
With the increasing use of shared tasks and shared
evaluation procedures (e.g., the recent BioCreative,
TREC, and KDD Cup), it is rapidly becoming clear
that performance in this domain is markedly lower
than the field has come to expect from the standard
domain of newswire. The Coling 2004 shared task
focuses on the problem of Named Entity Recogni-
tion, requiring participating systems to identify the
five named entities of protein, RNA, DNA, cell line,
and cell type in the GENIA corpus of MEDLINE
abstracts (Ohta et al., 2002). In this paper we de-
scribe a machine learning system incorporating a di-
verse set of features and various external resources
to accomplish this task. We describe our system in
detail and also discuss some sources of error.
2 System Description
Our system is a Maximum Entropy Markov Model,
which further develops a system earlier used for the
CoNLL 2003 shared task (Klein et al., 2003) and the
2004 BioCreative critical assessment of information
extraction systems, a task that involved identifying
gene and protein name mentions but not distinguish-
ing between them (Dingare et al., 2004). Unlike
the above two tasks, many of the entities in the cur-
rent task do not have good internal cues for distin-
guishing the class of entity: various systematic pol-
ysemies and the widespread use of acronyms mean
that internal cues are lacking. The challenge was
thus to make better use of contextual features, in-
cluding local and syntactic features, and external re-
sources in order to succeed at this task.
2.1 Local Features
We used a variety of features describing the imme-
diate content and context of each word, including
the word itself, the previous and next words, word
prefixes and suffix of up to a length of 6 characters,
word shapes, and features describing the named en-
tity tags assigned to the previous words. Word
shapes refer to a mapping of each word onto equiva-
lence classes that encodes attributes such as length,
capitalization, numerals, greek letters, and so on.
For instance, “Varicella-zoster” would become Xx-
xxx, “mRNA” would become xXXX, and “CPA1”
would become XXXd. We also incorporated part-of-
speech tagging, using the TnT tagger(Brants, 2000)
retrained on the GENIA corpus gold standard part-
of-speech tagging. We also used various interaction
terms (conjunctions) of these base-level features in
various ways. The full set of local features is out-
lined in Table 1.
2.2 External Resources
We made use of a number of external resources, in-
cluding gazetteers, web-querying, use of the sur-
rounding abstract, and frequency counts from the
British National Corpus.
88
Word Features wi, wi−1, wi+1
Disjunction of 5 prev words
Disjunction of 5 next words
TnT POS POSi, POSi−1, POSi+1
Prefix/suffix Up to a length of 6
Abbreviations abbri
abbri−1 + abbri
abbri + abbri+1
abbri−1 + abbri + abbri+1
Word Shape shapei, shapei−1, shapei+1
shapei−1 + shapei
shapei + shapei+1
shapei−1 + shapei + shapei+1
Prev NE NEi−1, NEi−2 + NEi−1
NEi−3 + NEi−2 + NEi−1
Prev NE + Word NEi−1 + wi
Prev NE + POS NEi−1 + POSi−1 + POSi
NEi−2 + NEi−1 + POSi−2 +
POSi−1 + POSi
Prev NE + Shape NEi−1 + shapei
NEi−1 + shapei+1
NEi−1 + shapei−1 + shapei
NEi−2 + NEi−1 + shapei−2 +
shapei−1 + shapei
Paren-Matching Signals when one parenthesis
in a pair has been assigned a
different tag than the other in a
window of 4 words
Table 1: Local Features (+ indicates conjunction)
2.2.1 Frequency
Many entries in gazetteers are ambiguous words,
occasionally used in the sense that the gazetteer
seeks to represent, but at least as frequently not.
So while the information that a token was seen in
a gazetteer is an unreliable indicator of whether it
is an entity, less frequent words are less likely to be
ambiguous than more frequent ones. Additionally,
more frequent words are likely to have been seen
often in the training data and the system should be
better at classifying them, while less frequent words
are a common source of error and their classifica-
tion is more likely to benefit from the use of external
resources. We assigned each word in the training
and testing data a frequency category correspond-
ing to its frequency in the British National Corpus,
a 100 million word balanced corpus, and used con-
junctions of this category and certain other features.
2.2.2 Gazetteers
Our gazetteer contained only gene names and was
compiled from lists from biomedical websites (such
as LocusLink) as well as from the Gene Ontol-
ogy and the data provided for the BioCreative 2004
tasks. The final gazetteer contained 1,731,496 en-
tries. Because it contained only gene names, and for
the reasons discussed earlier, we suspect that it was
not terribly useful for identifying the presences of
entities, but rather that it mainly helped to establish
the exact beginning and ending point of multi-word
entities recognized mainly through other features.
2.2.3 Web
For each of the named entity classes, we built in-
dicative contexts, such as “X mRNA” for RNA, or
“X ligation” for protein. For each entity X which
had a frequency lower than 10 in the British Na-
tional Corpus, we submitted instantiations of each
pattern to the web, using the Google API, and ob-
tained the number of hits. The pattern that returned
the highest number of hits determined the feature
value (e.g., “web-protein”, or “web-RNA”). If no
hits were returned by any pattern, a value “O-web”
was assigned. This value was also assigned to all
words whose frequency was higher than 10 (using
yet another value for words with higher frequency
did not improve the tagger’s performance).
2.2.4 Abstracts
A number of NER systems have made effective use
of how the same token was tagged in different parts
of the same document (see (Curran and Clark, 2003)
and (Mikheev et al., 1999)). A token which appears
in an unindicative context in one sentence may ap-
pear in a very obvious context in another sentence
in the same abstract. To leverage this we tagged
each abstract twice, providing for each token a fea-
ture indicating whether it was tagged as an entity
elsewhere in the abstract. This information was
only useful when combined with information on fre-
quency.
2.3 Deeper Syntactic Features
While the local features discussed earlier are all
fairly surface level, our system also makes use of
deeper syntactic features. We fully parsed the train-
ing and testing data using the Stanford Parser of
(Klein and Manning, 2003) operating on the TnT
part-of-speech tagging – we believe that the un-
lexicalized nature of this parser makes it a partic-
ularly suitable statistical parser to use when there
is a large domain mismatch between the training
material (Wall Street Journal text) and the target
domain, but have not yet carefully evaluated this.
Then, for each word in the sentence which is in-
side a noun phrase, the head and governor of the
noun phrase are extracted. These features are not
very useful when identifying only two classes (such
as GENE and OTHER in the BioCreative task), but
they were quite useful for this task because of the
large number of classes which the system needed to
distinguish between. Because the classifier is now
89
choosing between classes where members can look
very similar, longer distance information can pro-
vide a better representation of the context in which
the word appears. For instance, the word phospho-
rylation occurs in the training corpus 492 times, 482
of which it is was classified as other. However,
it is the governor of 738 words, of which 443 are
protein, 292 are other and only 3 are cell
line.
We also made use of abbreviation matching to
help ensure consistency of labels. Abbreviations
and long forms were extracted from the data using
the method of (Schwartz and Hearst, 2003). This
data was combined with a list of other abbreviations
and long forms extracted from the BioCreative 2004
task. Then all occurrences of either the long or short
forms in the data was labeled. These labels were in-
cluded in the system as features and helped to im-
prove boundary detection.
2.4 Adjacent Entities
When training our classifier, we merged the B- and
I- labels for each class, so it did not learn how to
differentiate between the first word of a class and
internal word. There were several motivations for
doing this. Foremost was memory concerns; our fi-
nal system trained on just the six classes had 1.5
million features – we just did not have the resources
to train it over more classes without giving up many
of our features. Our second motivation was that by
merging the beginning and internal labels for a par-
ticular class, the classifier would see more examples
of that class and learn better how to identify it. The
drawback of this move is that when two entities be-
longing to the same class are adjacent, our classifier
will automatically merge them into one entity. We
did attempt to split them back up using NP chunks,
but this severely reduced performance.
3 Results and Discussion
Our results on the evaluation data and a confusion
matrix are shown in Tables 2 and 4. Table 4 sug-
gests areas for further work. Collapsing the B- and
I- tags does cost us quite a bit. Otherwise confusions
between some named entity and being nothing are
most of the errors, although protein/DNA and cell-
line/cell-type confusions are also noticeable.
Analysis of performance in biomedical Named
Entity Recognition tends to be dominated by the
perceived poorness of the results, stemming from
the twin beliefs that performance of roughly ninety
percent is the state-of-the-art and that performance
of 100% (or close to that) is possible and the goal
to be aimed for. Both of these beliefs are ques-
tionable, as the top MUC 7 performance of 93.39%
Entity Precision Recall F-Score
Fully Correct
protein 77.40% 68.48% 72.67%
DNA 66.19% 69.62% 67.86%
RNA 72.03% 65.89% 68.83%
cell line 59.00% 47.12% 52.40%
cell type 62.62% 76.97% 69.06%
Overall 71.62% 68.56% 70.06%
Left Boundary Correct
protein 82.89% 73.34% 77.82%
DNA 68.47% 72.01% 70.19%
RNA 75.42% 68.99% 72.06%
cell line 63.80% 50.96% 56.66%
cell type 63.93% 78.57% 70.49%
Overall 75.72% 72.48% 74.07%
Right Boundary Correct
protein 84.70% 74.96% 79.53%
DNA 74.43% 78.29% 76.31%
RNA 78.81% 72.09% 75.30%
cell line 70.2% 56.07% 62.34%
cell type 71.68% 88.10% 79.05%
Overall 79.65% 76.24% 77.91%
Table 2: Results on the evaluation data
(Mikheev et al., 1998) in the domain of newswire
text used an easier performance metric where incor-
rect boundaries were given partial credit, while both
the biomedical NER shared tasks to date have used
an exact match criterion where one is doubly penal-
ized (both as a FP and as a FN) for incorrect bound-
aries. However, the difference in metric clearly can-
not account entirely for the performance discrep-
ancy between newswire NER and biomedical NER.
Biomedical NER appears to be a harder task due
to the widespread ambiguity of terms out of con-
text, the complexity of medical language, and the
apparent need for expert domain knowledge. These
are problems that more sophisticated machine learn-
ing systems using resources such as ontologies and
deep processing might be able to overcome. How-
ever, one should also consider the inherent “fuzzi-
ness” of the classification task. The few existing
studies of inter-annotator agreement for biomedi-
cal named entities have measured agreement be-
tween 87%(Hirschman, 2003) and 89%(Demetrious
and Gaizauskas, 2003). As far as we know there
are no inter-annotator agreement results for the GE-
NIA corpus, and it is necessary to have such results
before properly evaluating the performance of sys-
tems. In particular, the fact that BioNLP sought to
distinguish between gene and protein names, when
these are known to be systematically ambiguous,
and when in fact in the GENIA corpus many enti-
ties were doubly classified as “protein molecule or
90
DNA RNA cell line cell type protein
gold\ans B- I- B- I- B- I- B- I- B- I- O
B-DNA 723 39 0 0 1 0 0 0 154 1 138
I-DNA 52 1390 0 0 0 0 0 0 19 71 257
B-RNA 1 0 89 3 0 0 0 0 14 0 11
I-RNA 0 1 5 164 0 0 0 0 2 0 15
B-cell line 3 0 0 0 319 41 37 5 12 1 82
I-cell line 0 6 0 0 24 713 5 104 0 14 123
B-cell type 1 0 0 0 164 22 1228 90 31 5 380
I-cell type 0 0 0 0 13 383 88 2101 8 27 371
B-protein 48 5 10 3 20 1 19 3 4200 192 566
I-protein 6 66 0 11 0 10 2 25 245 3630 779
O 170 240 25 26 85 142 184 132 1042 656 78945
Table 3: Our confusion matrix over the evaluation data
human B-cell type
monocytes I-cell type
human O
monocytes B-cell type
macrophages B-cell type
primary B-cell type JJ
T I-cell type NN
lymphocytes I-cell type NNS
primary O JJ
peripheral B-cell type JJ
blood I-cell type NN
lymphocytes I-cell type NNS
Table 4: Examples of annotation inconsistencies
region” and “DNA molecule or region”, suggests
that inter-annotator agreement could be low, and
that many entities in fact have more than one classi-
fication.
One area where GENIA appears inconsistent is
in the labeling of preceding adjectives. The data
was selected by querying for the term human, yet
the term is labeled inconsistently, as is shown in Ta-
ble 4. Of the 1790 times the term human occurred
before or at the beginning of an entity in the train-
ing data, it was not classified as part of the entity
110 times. In the test data, there is only on instance
(out of 130) where the term is excluded. Adjectives
are excluded approximately 25% of the time in both
the training and evaluation data. There are also in-
consistencies when two entities are separated by the
word and.
4 Acknowledgements
This paper is based on work supported in part by a
Scottish Enterprise Edinburgh-Stanford Link Grant
(R36759), as part of the SEER project, and in part
the National Science Foundation under the Knowl-
edge Discovery and Dissemination program.

References
Thorsten Brants. 2000. TnT – a statistical part-of-speech
tagger. In ANLP 6, pages 224–231.
James R. Curran and Stephen Clark. 2003. Language
independent NER using a maximum entropy tagger.
In Proceedings of the Seventh Conference on Natural
Language Learning (CoNLL-03), pages 164–167.
George Demetrious and Rob Gaizauskas. 2003. Corpus
resources for development and evaluation of a biolog-
ical text mining system. In Proceedings of the Third
Meeting of the Special Interest Group on Text Mining,
Brisbane, Australia, July.
Shipra Dingare, Jenny Rose Finkel, Christopher Man-
ning, Malvina Nissim, and Beatrice Alex. 2004. Ex-
ploring the boundaries: Gene and protein identifica-
tion in biomedical text. In Proceedings of the BioCre-
ative Workshop.
Lynette Hirschman. 2003. Using biological resources to
bootstrap text mining.
Dan Klein and Christopher D. Manning. 2003. Accurate
unlexicalized parsing. In ACL 41, pages 423–430.
Dan Klein, Joseph Smarr, Huy Nguyen, and Christo-
pher D. Manning. 2003. Named entity recognition
with character-level models. In CoNLL 7, pages 180–
183.
Andrei Mikheev, Claire Grover, and Mark Moens. 1998.
Description of the LTG system used for MUC-7. In
Proceedings of MUC-7.
Andrei Mikheev, Marc Moens, and Claire Grover. 1999.
Named entity recognition without gazetteers. In Pro-
ceedings of the ninth conference on European chap-
ter of the Association for Computational Linguistics,
pages 1–8. Association for Computational Linguis-
tics.
Tomoko Ohta, Yuka Tateisi, Hideki Mima, and Jun’ichi
Tsujii. 2002. GENIA corpus: an annotated research
abstract corpus in molecular biology domain. In Pro-
ceedings of he Human Language Technology Confer-
ence, pages 73–77.
Ariel Schwartz and Marti Hearst. 2003. A simple al-
gorithm for identifying abbreviation definitions in
biomedical text. In Pacific Symposium on Biocomput-
ing, Kauai, Jan.
