Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL 06, pages 114–115,
New York City, June 2006. c©2006 Association for Computational Linguistics
Biomedical Term Recognition
With the Perceptron HMM Algorithm
Sittichai Jiampojamarn and Grzegorz Kondrak and Colin Cherry
Department of Computing Science,
University of Alberta,
Edmonton, AB, T6G 2E8, Canada
{sj,kondrak,colinc}@cs.ualberta.ca
Abstract
We propose a novel approach to the iden-
ti cation of biomedical terms in research
publications using the Perceptron HMM
algorithm. Each important term is iden-
ti ed and classi ed into a biomedical con-
cept class. Our proposed system achieves
a 68.6% F-measure based on 2,000 train-
ing Medline abstracts and 404 unseen
testing Medline abstracts. The system
achieves performance that is close to the
state-of-the-art using only a small feature
set. The Perceptron HMM algorithm pro-
vides an easy way to incorporate many po-
tentially interdependent features.
1 Introduction
Every day, new scienti c articles in the biomedi-
cal  eld are published and made available on-line.
The articles contain many new terms and names
involving proteins, DNA, RNA, and a wide vari-
ety of other substances. Given the large volume of
the new research articles, it is important to develop
systems capable of extracting meaningful relation-
ships between substances from these articles. Such
systems need to recognize and identify biomedical
terms in unstructured texts. Biomedical term recog-
nition is thus a step towards information extraction
from biomedical texts.
The term recognition task aims at locating
biomedical terminology in unstructured texts. The
texts are unannotated biomedical research publica-
tions written in English. Meaningful terms, which
may comprise several words, are identi ed in order
to facilitate further text mining tasks. The recogni-
tion task we consider here also involves term clas-
si cation, that is, classifying the identi ed terms
into biomedical concepts: proteins, DNA, RNA, cell
types, and cell lines.
Our biomedical term recognition task is de ned
as follows: given a set of documents, in each docu-
ment,  nd and mark each occurrence of a biomedi-
cal term. A term is considered to be annotated cor-
rectly only if all its composite words are annotated
correctly. Precision, recall and F-measure are deter-
mined by comparing the identi ed terms against the
terms annotated in the gold standard.
We believe that the biomedical term recogni-
tion task can only be adequately addressed with
machine-learning methods. A straightforward dic-
tionary look-up method is bound to fail because
of the term variations in the text, especially when
the task focuses on locating exact term boundaries.
Rule-based systems can achieve good performance
on small data sets, but the rules must be de ned
manually by domain experts, and are dif cult to
adapt to other data sets. Systems based on machine-
learning employ statistical techniques, and can be
easily re-trained on different data. The machine-
learning techniques used for this task can be divided
into two main approaches: the word-based methods,
which annotate each word without taking previous
assigned tags into account, and the sequence based
methods, which take other annotation decisions into
account in order to decide on the tag for the current
word.
We propose a biomedical term identi cation
114
system based on the Perceptron HMM algo-
rithm (Collins, 2004), a novel algorithm for HMM
training. It uses the Viterbi and perceptron algo-
rithms to replace a traditional HMM’s conditional
probabilities with discriminatively trained parame-
ters. The method has been successfully applied to
various tasks, including noun phrase chunking and
part-of-speech tagging. The perceptron makes it
possible to incorporate discriminative training into
the traditional HMM approach, and to augment it
with additional features, which are helpful in rec-
ognizing biomedical terms, as was demonstrated in
the ABTA system (Jiampojamarn et al., 2005). A
discriminative method allows us to incorporate these
features without concern for feature interdependen-
cies. The Perceptron HMM provides an easy and
effective learning algorithm for this purpose.
The features used in our system include the part-
of-speech tag information, orthographic patterns,
word pre x and suf x character strings. The ad-
ditional features are the word, IOB and class fea-
tures. The orthographic features encode the spelling
characteristics of a word, such as uppercase letters,
lowercase letters, digits, and symbols. The IOB and
class features encode the IOB tags associated with
biomedical class concept markers.
2 Results and discussion
We evaluated our system on the JNLPBA Bio-Entity
recognition task. The training data set contains
2,000 Medline abstracts labeled with biomedical
classes in the IOB style. The IOB annotation method
utilizes three types of tags: <B> for the beginning
word of a term, <I> for the remaining words of a
term, and <O> for non-term words. For the purpose
of term classi cation, the IOB tags are augmented
with the names of the biomedical classes; for ex-
ample, <B-protein> indicates the  rst word of
a protein term. The held-out set was constructed
by randomly selecting 10% of the sentences from
the available training set. The number of iterations
for training was determined by observing the point
where the performance on the held-out set starts to
level off. The test set is composed of new 404 Med-
line abstracts.
Table 1 shows the results of our system on all  ve
classes. In terms of F-measure, our system achieves
Class Recall Precision F-measure
protein 76.73 % 65.56 % 70.71 %
DNA 63.07 % 64.47 % 63.76 %
RNA 64.41 % 59.84 % 62.04 %
cell type 64.71 % 76.35 % 70.05 %
cell line 54.20 % 52.02 % 53.09 %
ALL 70.93 % 66.50 % 68.64 %
Table 1: The performance of our system on the test
set with respect to each biomedical concept class.
the average of 68.6%, which a substantial improve-
ment over the baseline system (based on longest
string matching against a lists of terms from train-
ing data) with the average of 47.7%, and over the
basic HMM system, with the average of 53.9%. In
comparison with the results of eight participants at
the JNLPBA shared tasks (Kim et al., 2004), our
system ranks fourth. The performance gap between
our system and the best systems at JNLPBA, which
achieved the average up to 72.6%, can be attributed
to the use of richer and more complete features such
as dictionaries and Gene ontology.
3 Conclusion
We have proposed a new approach to the biomedical
term recognition task using the Perceptron HMM al-
gorithm. Our proposed system achieves a 68.6% F-
measure with a relatively small number of features
as compared to the systems of the JNLPBA partici-
pants. The Perceptron HMM algorithm is much eas-
ier to implement than the SVM-HMMs, CRF, and
the Maximum Entropy Markov Models, while the
performance is comparable to those approaches. In
the future, we plan to experiment with incorporat-
ing external resources, such as dictionaries and gene
ontologies, into our feature set.

References
M. Collins. 2004. Discriminative training methods for
hidden markov models: Theory and experiments with
perceptron algorithms. In Proceedings of EMNLP.
S. Jiampojamarn, N. Cercone, and V. Keselj. 2005. Bi-
ological named entity recognition using n-grams and
classi cation methods. In Proceedings of PACLING.
J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier.
2004. Introduction to the bio-entity recognition task at
JNLPBA. In Proceedings of JNLPBA.
