Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining
Biological Semantics, pages 32–37, Detroit, June 2005. c©2005 Association for Computational Linguistics
MedTag: A Collection of Biomedical Annotations
L.H. Smitha0 , L. Tanabea0 , T. Rind escha1 , W.J. Wilbura0
a0 National Center for Biotechnology Information
a1 Lister Hill National Center for Biomedical Communications
NLM, NIH, 8600 Rockville Pike, Bethesda, MD 20894
a2 lsmith,tanabe,wilbur
a3 @ncbi.nlm.nih.gov
rind esch@nlm.nih.gov
Abstract
We present a database of annotated
biomedical text corpora merged into a
portable data structure with uniform con-
ventions. MedTag combines three cor-
pora, MedPost, ABGene and GENETAG,
within a common relational database data
model. The GENETAG corpus has been
modi ed to re ect new de nitions of
genes and proteins. The MedPost cor-
pus has been updated to include 1,000
additional sentences from the clinical
medicine domain. All data have been up-
dated with original MEDLINE text ex-
cerpts, PubMed identi ers, and tokeniza-
tion independence to facilitate data accu-
racy, consistency and usability.
The data are available in  at  les along
with software to facilitate loading the
data into a relational SQL database
from ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith
/MedTag/medtag.tar.gz.
1 Introduction
Annotated text corpora are used in modern computa-
tional linguistics research and development to  ne-
tune computer algorithms for analyzing and classi-
fying texts and textual components. Two important
factors for useful text corpora are 1) accuracy and
consistency of the annotations, and 2) usability of
the data. We have recently updated the text corpora
we use in our research with respect to these criteria.
Three different corpora were combined. The AB-
Gene corpus consists of over 4 000 sentences anno-
tated with gene and protein named entities. It was
originally used to train the ABGene tagger to recog-
nize gene/protein names in MEDLINE records, and
recall and precision rates in the lower 70 percentile
range were achieved (Tanabe and Wilbur, 2002).
The MedPost corpus consists of 6 700 sentences,
and is annotated with parts of speech, and gerund
arguments. The MedPost tagger was trained on
3 700 of these sentences and achieved an accuracy
of 97.4% on the remaining sentences (Smith et. al.,
2004). The GENETAG corpus for gene/protein
named entity identi cation, consists of 20 000 sen-
tences and was used in the BioCreative 2004 Work-
shop (Yeh et. al., 2005; Tanabe et. al., 2005) (only
15 000 sentences are currently released, the remain-
ing 5 000 are being retained for possible use in a fu-
ture workshop). Training on a portion of the data,
the top performing systems achieved recall and pre-
cision rates in the lower 80 percentile range. Be-
cause of the scarcity of good annotated data in the
realm of biomedicine, and because good perfor-
mance has been obtained using this data, we feel
there is utility in presenting it to a wider audience.
All of the MedTag corpora are based on MED-
LINE abstracts. However, they were queried at dif-
ferent times, and used different (but similar) algo-
rithms to perform tokenization and sentence seg-
mentation. The original annotations were assigned
to tokens, or sequences of tokens, and extensively
reviewed by the authors at different times for the dif-
ferent research projects.
The main goals in combining and updating these
32
MedTag  Collection 
MedPost ABGene GENETAG 
6,700 sentences 
5,700 molecular biology 
1,000 clinical medicine 
60 part-of-speech tags 
97.4% accuracy 
1 annotator 
4,265 sentences 
Molecular Biology 
Single Genes/Proteins tagged 
1 annotator 
15,000 sentences 
50% Molecular Biology 
50% Other Biomedical 
All Genes/Proteins tagged 
3 annotators 
EXCERPT 
ID 
PubMed  ID 
Corpus code 
Original  text 
Citation 
ANNOTATION 
ID 
Corpus code 
Character offsets 
Annotated text 
Data Model 
SQL Relational 
Database with 
Web Interface 
MEDLINE 
MEDLINE 
Figure 1: Component corpora, common data model
and main record types of the MedTag collection.
corpora into a single corpus were to
1. update the text for all corpora to that currently
found in MEDLINE, storing a correct citation
and the original, untokenized text for each ex-
cerpt
2. eliminate tokenization dependence
3. put all text and annotations into a common
database format
4. provide programs to convert from the new cor-
pus format to the data formats used in previous
research
2 Merging the Corpora
We describe what was done to merge the original
corpora, locating original sources and modifying the
text where needed. An overview is given in Figure
1. Some basic statistics are given in Table 1.
2.1 Identifying Source Data
The original data of the three corpora were assem-
bled and the text was used to search MEDLINE to
Corpus sentences tokens most frequent tag
GENETAG-05 15,000 418,246 insulin GENE(112)
MedPost 6,700 181,626 the DD(8,507)
AbGene 4,265 123,208 cyclin GENE(165)
MedPost
Adj Adv Aux Noun Punct Verb
14,648 4,553 56,262 60,732 21,806 23,625
GENETAG-05
GENE ALTGENE
24,562 19,216
ABGene
GENE ALTGENE
8,185 0
Table 1: MedTag Corpora. GENE = gene and pro-
tein names, ALTGENE = acceptable alternatives for
gene and protein names. MedPost tagset contains
60 parts of speech which have been binned here for
brevity.
 nd the closest match. An exact or near exact match
was found for all but a few excerpts. For only a
few excerpts, the MEDLINE record from which the
excerpt was originally taken had been removed or
modi ed and an alternative sentence was selected.
Thus, each excerpt in the database is taken from a
MEDLINE record as it existed at one time in 2004.
In order to preserve the reference for future work,
the PubMed ID and citation data were also retrieved
and stored with each excerpt. Each excerpt in the
current database roughly corresponds to a sentence,
although the procedure that extracted the sentence is
not speci ed.
2.2 Eliminating Tokenization Dependence
In the original ABGene and GENETAG corpora, the
gene and protein phrases were speci ed by the to-
kens contained in the phrase, and this introduced
a dependence on the tokenization algorithm. This
created problems for researchers who wished to use
a different tokenization. To overcome this depen-
dence, we developed an alternative way of specify-
33
ing phrases. Given the original text of an excerpt,
the number of non-whitespace characters to the start
of the phrase does not depend on the tokenization.
Therefore, all annotations now refer to the  rst and
last character of the phrase that is annotated. For
example the protein serum LH in the excerpt
There was no correlation between serum
LH and chronological or bone age in this
age group, which suggests that the corre-
lation found is not due to age-related par-
allel phenomena.
is speci ed as characters 28 to 34 (the  rst character
is 0).
2.3 Data Model
There are two main record types in the database,
EXCERPT and ANNOTATION. Each EXCERPT
record stores an identi er and the original corpus
code (abgene, medpost, and genetag) as well as sub-
corpus codes that were de ned in the original cor-
pora. The original text, as it was obtained from
MEDLINE, is also stored, and a human readable ci-
tation to the article containing the reference.
Each ANNOTATION record contains a reference
to the excerpt (by identi er and corpus), the char-
acter offset of the  rst and last characters of the
phrase being annotated (only non-whitespace char-
acters are counted, starting with 0), and the corre-
sponding annotation. The annotated text is stored
for convenience, though it can be obtained from
the corresponding excerpt record by counting non-
whitespace characters.
The data is provided as an ASCII  le in a standard
format that can be read and loaded into a relational
database. Each record in the  le begins with a line
of the form a4a5a4 table name where table name is the
name of the table for that record. Following the table
name is a series of lines with the form  eld: value
where  eld is the name of the  eld and value is the
value stored in that  eld.
Scripts are provided for loading the data into a re-
lational database, such as mysql or ORACLE. SQL
queries can then be applied to retrieve excerpts and
annotations satisfying any desired condition. For
example, here is an SQL query to retrieve excerpts
from the MedPost corpus containing the token p53
and signaling or signalling
Figure 2: A screen capture of the annotator’s inter-
face and the GENETAG-05 annotations for a sen-
tence.
select text from excerpt
where text like ’%p53%’
and text rlike ’signa[l]*ing’;
2.4 Web Interface
A web-based corpus editor was used to enter and
review annotations. The code is being made avail-
able, as is, and requires that the data are loaded into a
mysql database that can be accessed by a web server.
The interface supports two annotation types: Med-
Post tags and arbitrary phrase annotations. MedPost
tags are selectable from a pull-down menu of pre-
programmed likely tags. For entering phrase anno-
tations, the user highlights the desired phrase, and
pressing the enter key computes and saves the  rst
and last character offsets. The user can then enter
the annotation code and an optional comment be-
fore saving it in the database. A screen dump of the
phrase annotations for a sentence in the genetag cor-
pus is shown in  gure 2.
The data from the database was dumped to the  at
 le format for this release. We have also included
some  les to accommodate previous users of the
corpora. A perl program, alt eval.perl is in-
34
cluded that replaces the GENETAG evaluation pro-
gram using non-whitespace character numbers in-
stead of token numbers. Copies of the ABGene and
MedPost corpora, in the original formats, are also
included.
3 Updates of Component Corpora
3.1 MedPost Update
The MedPost corpus (Smith et. al., 2004) originally
contained 5 700 tokenized sentences. An additional
1 000 annotated sentences have been added for this
release. Each sentence in the MedPost corpus is
fully tokenized, that is, divided into non-overlapping
annotated portions, and each token is annotated with
one of 60 part of speech tags (see Table 1). Minor
corrections to the annotations have been made since
the original release.
Since most of the original corpus, and all of the
sentences used for training the MedPost tagger, were
in the area of molecular biology, we added an addi-
tional 1 000 sentences selected from random MED-
LINE abstracts on the subject of clinical medicine.
As a preliminary result, the trained MedPost tag-
ger achieves approximately 96.9% accuracy, which
is comparable to the 97.4% accuracy achieved on the
subset of 1 000 sentences selected randomly from all
of MEDLINE. An example of a sentence from the
clinical medicine collection is
EvidenceNN isVBZ nowRR availableJJ
toTO showVVI aDD bene cialJJ effectNN
ofII beza brateNN onII retardingVVGN
atheroscleroticJJ processesNNS andCC inII
reducingVVGN riskNN ofII coronaryJJ heartNN
diseaseNN .
In addition to the token-level annotations, all
of the gerunds in the MedPost corpus (these are
tagged VVGN) were also examined and it was noted
whether the gerund had an explicit subject, direct
object, or adjective complement. This annotation
is stored with an annotation of type gerund. To il-
lustrate, the two gerunds in the previous example,
retarding and reducing both have direct objects (re-
tarding processes and reducing risk), and the gerund
tag is entered as  o . The gerund annotations have
been used to improve a noun phrase bracketer able
to recognize gerundive phrases.
3.2 GENETAG Update
GENETAG is a corpus of MEDLINE sentences that
have been annotated with gene and protein names.
The closest related work is the GENIA corpus (Kim
et. al., 2003). GENIA provides detailed coverage of
a large number of semantic entities related to a spe-
ci c subset of human molecular biology, whereas
GENETAG provides gene and protein name anno-
tations only, for a wide range of organisms and
biomedical contexts (molecular biology, genetics,
biochemistry, clinical medicine, etc.)
We are including a new version of GENE-
TAG, GENETAG-05, as part of the MedTag sys-
tem. GENETAG-05 differs from GENETAG in
four ways: 1) the de nition of a gene/protein en-
tity has been modi ed, 2) signi cant annotation er-
rors in GENETAG have been corrected, 3) the con-
cept of a non-speci c entity has been re ned, and 4)
character-based indices have been introduced to re-
duce tokenization problems. We believe that these
changes result in a more accurate and robust corpus.
GENETAG-05 maintains a wide de nition of a
gene/protein entity including genes, proteins, do-
mains, sites, sequences, and elements, but exclud-
ing plasmids and vectors. The speci city con-
straint requires that a gene/protein name must be
included in the tagged entity. This constraint has
been applied more consistently in GENETAG-05.
Additionally, plain sequences like ATTGGCCTT-
TAAC are no longer tagged, embedded names are
tagged (ras-mediated), and signi cantly more terms
have been judged to violate the speci city constraint
(growth factor, proteases, protein kinase, ribonu-
clease, snoRNA, rRNA, tissue factor, tumor anti-
gen, complement, hormone receptors, nuclear fac-
tors, etc.).
The original GENETAG corpus contains some en-
tities that were erroneously tagged as gene/proteins.
Many of these errors have been corrected in the up-
dated corpus. Examples include camp-responsive
elements, mu element, VDRE, melanin, dentin,
myelin, auxin, BARBIE box, carotenoids, and cel-
lulose. Error analysis resulted in the updated anno-
tation conventions given in Table 1.
Enzymes are a special class of proteins that cat-
alyze biochemical reactions. Enzyme names have
varying degrees of speci city, so the line drawn for
35
tagging purposes is based on online resources1 as
well as background knowledge. In general, tagged
enzymes refer to more speci c entities than un-
tagged enzymes (tyrosine kinase vs. protein kinase,
ATPase vs. protease). Enzymes that can refer to
either DNA or RNA are tagged if the reference is
speci ed (DNA endonuclease vs. endonuclease).
Enzymes that do not require DNA/RNA distinction
are tagged (lipase vs. ligase, cyclooxygenase vs.
methylase). Non-speci c enzymes are tagged if they
clearly refer to a gene or protein, as in (1).
1) The structural gene for hydrogenase en-
codes a protein product of molecular mass
45820 Da.
Semantic constraints in GENETAG-05 are the
same as those for GENETAG. To illustrate, the name
in (2) requires rabies because RIG implies that the
gene mentioned in this sentence refers to the rabies
immunoglobulin, and not just any immunoglobulin.
In (3), the word receptor is necessary to differen-
tiate IGG receptor from IGG, a crucial biological
distinction. In (4), the number 1 is needed to ac-
curately describe a speci c type of tumor necrosis
factor, although tumor necrosis factor alone might
be adequate in a different context.
2) rabies immunoglobulin (RIG)
3) IGG receptor
4) Tumor necrosis factor 1
Application of the semantic constraint can result in
apparent inconsistencies in the corpus (immunoglob-
ulin is suf cient on its own in some sentences in the
corpus, but is insuf cient in (2)). However, we be-
lieve it is important that the tagged entity retain its
true meaning in the sentence context.
4 Recommended Uses
We have found the component corpora of MedTag
to be useful for the following functions:
1) Training and evaluating part-of-speech
taggers
2) Training and evaluating gene/protein
named entity taggers
1http://cancerweb.ncl.ac.uk/omd/copyleft.html
http://www.onelook.com/
3) Developing and evaluating a noun
phrase bracketer for PubMed phrase
indexing
4) Statistical analysis of grammatical
usage in medical text
5) Feature generation for machine learn-
ing
The MedPost tagger was recently ported to Java
and is currently being employed in MetaMap, a pro-
gram that maps natural language text into the UMLS
(Aronson,A.R., 2001).
5 Conclusion
We have merged three biomedical corpora into a col-
lection of annotations called MedTag. MedTag uses
a common relational database format along with a
web interface to facilitate annotation consistency.
We have identi ed the MEDLINE excerpts for each
sentence and eliminated tokenization dependence,
increasing the usability of the data. In GENETAG-
05, we have clari ed many grey areas for annotation,
providing better guidelines for tagging these cases.
For users of previous versions of the component cor-
pora, we have included programs to convert from the
new standardized format to the formats used in the
older versions.

References
Aronson, A. R. 2001. Effective mapping of biomedical
text to the UMLS Metathesaurus: the MetaMap pro-
gram. Proc. AMIA Symp., 1721.
Kim, J.-D., Ohta, T., Tateisi, Y. and Tsujii, J. 2003. GE-
NIA corpus: a semantically annotated corpus for bio-
textmining. Bioinformatics, 19: 180 - 182.
Tanabe, L and Wilbur, WJ. 2002. Tagging gene and
protein names in biomedical text. Bioinformatics, 18,
1124-1132.
Tanabe L, Xie N, Thom, LH, Matten W, Wilbur, WJ:
GENETAG: a tagged gene corpus for gene/protein
named entity recognition. BMC Bioinformatics 2005.
Smith, L, Rind esch, T, and Wilbur, WJ. 2004. MedPost:
a part of speech tagger for biomedical text. Bioinfor-
matics, 20(13) 2320-2321.
Yeh A, Hirschman L, Morgan A, Colosimo M: BioCre-
AtIvE task 1A: gene mention  nding evaluation. BMC
Bioinformatics 2005.
