Extracting the Names of Genes and Gene Products with a 
Hidden Markov Model 
Nigel Collier, Chikashi Nobata and Jun-ichi Tsujii 
l)el)artm(mt of Information Science 
(h'aduate School of Science 
University of Tokyo, Hongo-7-3-1 
Bunkyo-ku, Tokyo 113, .Japan 
E-maih {nigel, nova, tsuj ££}@£s. s. u-tokyo, ac. jp 
Abstract 
\~e report the results of a study into the use 
of a linear interpolating hidden Marker model 
(HMM) for the task of extra.('ting lxw\]mi(:al |;er- 
minology fl:om MEDLINE al)stra('ts and texl;s 
in the molecular-bioh)gy domain. Tiffs is the 
first stage isl a. system that will exl;ra('l; evenl; 
information for automatically ut)da.ting 1)ioh)gy 
databases. We trained the HMM entirely with 
1)igrams based (m lexical and character fea- 
tures in a relatively small corpus of 100 MED- 
LINE abstract;s that were ma.rked-ul) l)y (lo- 
main experts wil;h term (:lasses su(:h as t)rol;eins 
and DNA. I.Jsing cross-validation methods we 
a(:\]fieved a,n \].e-score of 0.73 and we (',xmnine the 
('ontrilmtion made by each 1)art of the interl)o- 
lation model to overconfing (la.ta Sl)arsen('.ss. 
1 Introduction 
Ill the last few ye~trs there has t)een a great in- 
vestment in molecula.r-l)iology resear(:h. This 
has yielded many results l;\]la.1;, 1;ogel;her wil;h 
a migration of m:c\]fival mal;erial to the inter- 
net, has resulted in an exl)losion in l;tm nuns- 
\])el7 of research tmbli('ations aa~ailat)le in online 
databases. The results in these 1)al)ers how- 
ever arc not available ill a structured fornmt and 
have to 1)e extracted and synthesized mammlly. 
Updating databases such as SwissProt (Bairoch 
mid Apweiler, 1.997) this way is time (:onsmning 
and nmans l;h~tt the resull;s are not accessible so 
conveniently to he11) researchers in their work. 
Our research is aimed at autonmti(:ally ex- 
tra(:ting facts Kern scientific abstracts and flfll 
papers ill the molecular-biology domain and us- 
ing these to update databases. As the tirst stage 
in achieving this goal we have exl)lored th(; use 
of a generalisable, supervised training method 
based on hidden Markov models (ItMMs) (Ra- 
biner and .\]uang, 1986) fbr tim identification mid 
classitieation of technical expressions ill these 
texts. This task can 1)e considered to be similar 
to the named c.ntity task in the MUC evaluation 
exercises (MUC, 1995). 
In our current work we are using abstracts 
available fl:om PubMed's MEDLINE (MED- 
\],INE, 1999). The MEDLINE (lnta.l)ase is an 
online collection of al)straets for pul)lished jour- 
nal articles in biology mid medicine and con- 
tains more than nine million articles. 
With the rapid growth in the mlmbcr of tmb- 
\]ished l)al)ers in the field of moh;('ular-biolog 3, 
there has been growing interest in the at)pli- 
cation of informa.tion extra(:tion, (Sekimizu et 
al., 1998) (Collier et al., 1999)(Thomas et al., 
1999) (Craven and Kmnlien, 1999), to help solve 
souse (sf the t)robhmss that are associated with 
information overload. 
In the remainder of this i)aper we will first 
of all (ratline the t)ackground to the task and 
then d(~s('ril)e t;hc basics of ItMMs and the fi)r- 
real model wc are using. The following sections 
give an outline of a. lse\v tagged ('orlms (Ohta et 
al., 1999) thnt our team has deveh)i)ed using al)- 
stra('ts taken from a sub-domain of MEDLINF, 
and the results of our experinmnts on this cor- 
lmS. 
2 Background 
Ileeent studies into the use of SUl)ervised 
learning-t)ased models for the n~mled entity task 
in the miero-lsioh)gy domain have. shown that 
lnodels based on HMMs and decision trees such 
as (Nol)al;~t et al., 1999) ~,r(; much more gener- 
alisable and adaptable to slew classes of words 
than systems based on traditional hand-lmilt 
1)attexns a.nd domain specific heuristic rules 
such as (Fukuda et al., 1998), overcoming the 
1)rol)lems associated with data sparseness with 
the help of sophisticated smoothing algorithms 
201 
(Chen and Goodman, 1996). 
HMMs can be considered to be stochastic fi- 
nite state machines and have enjoyed success 
in a number of felds including speech recogni- 
tion and part-of-speech tagging (Kupiec, 1992). 
It has been natural therefore that these mod- 
els have been adapted tbr use in other word- 
class prediction tasks such as the atoned-entity 
task in IE. Such models are often based on n- 
grams. Although the assumption that a word's 
part-of speech or name class can be predicted 
by the previous n-1 words and their classes is 
counter-intuitive to our understanding of lin- 
guistic structures and long distance dependen- 
cies, this simple method does seem to be highly 
effective ill I)ractice. Nymble (Bikel et al., 
1997), a system which uses HMMs is one of the 
most successflfl such systems and trains on a 
corpus of marked-up text, using only character 
features in addition to word bigrams. 
Although it is still early days for the use of 
HMMs for IE, we can see a number of trends 
in the research. Systems can be divided into 
those which use one state per class such as 
Nymble (at the top level of their backoff model) 
and those which automatically learn about the 
model's structure such as (Seymore et al., 1999). 
Additionally, there is a distinction to be made 
in the source of the knowledge for estimating 
transition t)robabilities between models which 
are built by hand such as (Freitag and McCal- 
lure, 1999) and those which learn fl'om tagged 
corpora in the same domain such as the model 
presented in this paper, word lists and corpora 
in different domains - so-called distantly-labeled 
data (Seymore et al., 1999). 
2.1 Challenges of name finding in 
molecular-biology texts 
The names that we are trying to extract fall into 
a number of categories that are often wider than 
the definitions used for the traditional named- 
entity task used in MUC and may be considered 
to share many characteristics of term recogni- 
tion. 
The particular difficulties with identit)dng 
and elassit~qng terms in the molecular-biology 
domain are all open vocabulary and irrgeular 
naming conventions as well as extensive cross- 
over in vocabulary between classes. The irreg- 
ular naming arises in part because of the num- 
ber of researchers from difli;rent fields who are 
TI - Activation of <PROTEIN> JAK kinases 
</PROTEIN> and <PROTEIN>STAT pTvteins 
</PR, OTEIN> by <PROTEIN> interlcukin - 2 
</PROTEIN> and <PROTEIN> intc~fc~vn alph, a 
</PROTEIN> , but not the <PROTEIN> T cell 
antigen receptor <~PROTEIN> , in <SOURCE.ct> 
h, uman T lymphoeytes </SOURCE.et> . 
AB The activation of <PROTEIN> Janus 
protein t,.flvsine kinascs </PROTEIN> ( 
<PROTEIN> JAI(s </PROTEIN> ) and 
<PROTEIN> signal transducer and ac- 
tivator of transcription </PROTEIN> ( 
<PROTEIN> STAT </PROTEIN> ) pro- 
reins by <PROTEIN> intcrIcukin ( IL ) 2 
</PROTEIN> ,thc <PROTEIN> T cell antigen 
receptor </PROTEIN> ( <PROTEIN> TCR 
</PROTEIN> ) and <PROTEIN> intc~fcrvn 
(IFN) alpha </PROTEIN> was czplorcd in 
<SOURCE.ct> human periph, cral blood- derived 
T cclls </SOURCE.et> and the <SOURCE.el> 
leukemic T cell line Kit225 </SOURCE.el> . 
Figure 1: Example MEDLINE sentence marked 
up in XML for lfiochemical named-entities. 
working on the same knowledge discovery area 
as well as the large number of substances that 
need to be named. Despite the best, etforts of 
major journals to standardise the terminology, 
there is also a significant problem with syn- 
onymy so that often an entity has more tlm.n 
one name that is widely used. The class cross- 
over of terms arises because nlally prot(:ins are 
named after DNA or RNA with which they re- 
act. 
All of the names which we mark up must be- 
long to only one of the name classes listed in 
Table 1. We determined that all of these name 
classes were of interest to domain experts and 
were essential to our domain model for event 
extraction. Example sentences from a nmrked 
ut) abstract are given in Figure 1. 
We decided not to use separate states ibr 
pre- and post-class words as had been used in 
some other systems, e.g. (Freitag and McCal- 
lure, 1999). Contrary to our expectations, we 
observed that our training data provided very 
poor maximum-likelihood probabilities for these 
words as class predictors. 
We found that protein predictor words had 
the only significant evidence and even this was 
quite weak, except in tlm case of post-class 
words which included a mmfi)er of head nouns 
such as "molecules" or "heterodimers". In our 
202 
Class ~/: Examl)le l)escription 
P1K)TEIN 21.25 .MK ki'n,a.se 
\])NA 358 IL-2 \]rlvmotcr 
\]{NA 30 771I?, 
S()UI{CF,.cl 93 le'ukemic T cell line Kit225 
S()UI\],CE.(:t 417 h,'wm, an T lymphocytes 
SOURCE.too 21 ,%hizosacch, aromyces pombc 
S()URCE.mu 64 mice 
SOURCE.vi 90 ItJV-1 
S()UI{CE.sl 77 membrane 
S()UI{CE.ti 37 central 'ner,vo'us system 
UNK t,y~vsine ph, osphovylal, ion 
t)ro{xfiils~ protein groups, 
families~ cOral)loxes and Slll)Sl;I'llCI;lll'eS. 
I)NAs I)NA groups, regions and genes 
RNAs I~NA groups, regions and genes 
cell line 
(:ell type 
lllOll()-organism 
multiorganism 
viruses 
sublocat;ion 
tissue 
lmckground words 
Table l: Named (mtilsy (:lasses. ~/: indi(:at(ts tsfic ~mmt)cr of XMI, tagged terms in our (:orpus of 100 
abstracts. 
early experiments using IIMMs that in(:orpo- 
rated pro- and 1)ost-c\]ass states we \[imnd tha.t 
pcrforlnance was signiticantly worse than wil;h- 
Ollt; sll(;h si;at;cs an(l st) w('. formulated the ~uodcl 
as g,~ivcll ill S(;(;\[;iOll :/. 
~,.f(Qi,..~,l < _,Ffi,..~.,, >) + 
(1) 
and for all other words and their name classes 
as tbllows: 
3 Mx.'tzho d 
The lmrl)osc of our mod(;1 is Io lind t;hc n,osl: 
likely so(tilth, liCe of name classes (C) lbr a given 
se(tucncc of wor(ls (W). The set of name ('lasses 
inchutcs the 'Unk' name (:lass whi('h we use li)r 
1)ackgromM words not 1)elonging to ally ()\[ the 
interesting name classes given in Tal)lc 1 and 
t;hc given st;qu(m(:e of words which w(~ ,>('. spans 
a single s(,Jd;cn('c. The task is thcrcfor(~ 1(} max- 
intize Pr((TIH:). \¥c iml)lem(mt a I\]MM to es- 
timate this using th('. Markov assuml)tion that 
Pr(CII¥ ) can be t'(mnd from t)igrams of ha.me 
classes. 
In th('. following model we (:onsid(u" words to 
1)c ordered pairs consisting of a. surface word, 
W, and a. word tbature, 1", given as < W, F >. 
The word features thcms('Jvcs arc discussed in 
Section 3.1. 
As is common practice, we need to (:alculatc 
the 1)rol)abilities for a word sequence for the 
first; word's name class and every other word 
diflbrently since we have no initial nalnt>class 
to make a transition frolll. Accordingly we use 
l;he R)llowing equation to (:alculatc the ilfitial 
name (:lass probability, 
~,,J'(Cz,..~,,I < wi~,..~, 19~,.,~,, >) + 
I',,.( G 
)~o.1' ( G 
A ,./' (G 
;v~.f ( G 
5:I./'(G 
), ~./' ( G 
),,~.I(G) 
< Wt,l,} >,< l lS,_,,l,i ~ >,G J) :- 
< 1'15., I,~,, >, < I,V~_~, l,)_j >, G.-~) + 
< _, l'i >, < 115_ l, Ft,- ~ >, Ct-., ) + 
< 115, Fi >, < _, P~,_~ >, G ~) + 
< _, l,) >, < ._, 1% ~ >, C~__~) + 
(2) 
whc,:c f(I) is ('alculatcd with nmxinluln- 
likelihood estimates from counts on training 
data, so that tbr example, 
.f(G,I < 1,~5,1,i >,< I,t,~_,, F~_~ >,G-~) - 
T(< I lS, 1,~ >, G., < 1'15_,, 1~}_~ >, G.-@, 
T(< l'lZj,,l~J >,< \['Vt-l,Ft-I >,Ct-l) ~3) 
Where T() has been found from counting the 
events in thc training cortms. In our current 
sysl;oln \vc SC\[; t;tlc C()llSt;&lltS ~i }lJld o- i \])y halld 
all(l let ~ ai = 1.0, ~ Ai = 1.0, a0 > al k O-2, 
A0 > A I... _> As. Tile current name-class Ct 
is conditioned oil the current word and fea- 
t;llrc~ thc I)rcviolls name-class, ~*t--l: and t)rc- 
vious word an(t tbaturc. 
Equations 1 and 2 implement a linear- 
interpolating HMM that incorporates a mmfl)cr 
203 
of sub-models (rethrred to fl'om now by their 
A coefficients) designed to reduce the effects of 
data sparseness. While we hope to have enough 
training data to provide estimates tbr all model 
parameters, in reality we expect to encounter 
highly fl'agmented probability distributions. In 
the worst case, when even a name class pair 
has not been observed beibre in training, the 
model defaults at A5 to an estimate of name 
class unigrams. We note here that the bigram 
 model has a non-zero probability asso- 
ciated with each bigram over the entire vocal)- 
ulary. 
Our model differs to a backoff formulation be- 
cause we tbund that this model tended to suffer 
fl'om the data sparseness problem on our small 
training set. Bikel et al for example consid- 
ers each backoff model to be separate models, 
starting at the top level (corresl)onding approx- 
imately to our Ao model) and then falling back 
to a lower level model when there not enough 
evidence. In contrast, we have combined these 
within a single 1)robability calculation tbr state 
(class) transitions. Moreover, we consider that 
where direct bigram counts of 6 or more occur 
in the training set, we can use these directly to 
estimate the state transition probability and we 
nse just the ,~0 model in this case. For counts 
of less than 6 we smooth using Equation 2; this 
can be thought of as a simt)le form of q)nck- 
eting'. The HMM models one state per name 
(:lass as well as two special states tbr the start 
and end ofa sentence. 
Once the state transition l)rol)abilities have 
been calcnlated according to Equations 1 and 2, 
the Viterbi algorithm (Viterbi, 1967) is used to 
search the state space of 1)ossible name class as- 
signments. This is done in linear time, O(MN 2) 
for 54 the nunfl)er of words to be classified and 
N the number of states, to find the highest prob- 
ability path, i.e. to maxinfise Pr(W, C). In our 
exl)eriments 5/i is the length of a test sentence. 
The final stage of our algorithm that is used 
after name-class tagging is complete is to use 
~ clean-up module called Unity. This creates a 
frequency list of words and name-classes tbr a 
docmnent and then re-tags the document using 
the most frequently nsed name class assigned by 
the HMM. We have generally tbund that this 
improves F-score performance by al)out 2.3%, 
both tbr re-tagging spuriously tagged words and 
Word Feature Exmnl)le 
DigitNmnber 15 
SingleCap M 
GreekLetter alpha 
CapsAndDigits I2 
TwoCaps RalGDS 
LettersAndDigits p52 
hfitCap Interleukin 
LowCaps ka,t)paB 
Lowercase kinases 
IIyphon 
Backslash / 
OpenSquare \[ 
CloseSquare \] 
Colon 
SemiColon 
Percent % 
Oi) enParen ( 
CloseParen ) 
Comma 
FullStop 
Deternliner the 
Conjmmtion and 
Other * + 
Table 2: Word tbatures with examples 
tbr finding untagged words in mlknown contexts 
that had been correctly tagged elsewhere in the 
text. 
3.1 Word features 
Table 2 shows the character t'eatnres that we 
used which are based on those given for Nymble 
and extended to give high pertbrmance in both 
molecular-biology and newswire domains. The 
intnition is that such features provide evidence 
that helps to distinguish nmne classes of words. 
Moreover we hyt)othesize that such featnres 
will help the model to find sinfilarities between 
known words that were tbnnd in the training 
set and unknown words (of zero frequency in 
the training set) and so overcome the unknown 
word t)rol)lem. To give a simple example: if we 
know that LMP - 1 is a member of PROTEIN 
and we encounter AP - 1 for the first time in 
testing, we can make a fairly good guess about 
the category of the unknown word 'LMP' based 
on its sharing the same feature TwoCaps with 
the known word 'AP' and 'AP's known relation- 
ship with '- 1'. 
Such unknown word evidence is captured in 
submodels A1 through ),3 in Equation 2. \¥e 
204 
consider that character information 1)rovides 
more mealfingflll distinctions between name 
(;\]asses than for examI)le part-of-speech (POS), 
since POS will 1)redominmltly 1)e noun fi)r all 
name-class words. The t'catures were chosen 
to be as domain independent as possit)le, with 
the exception of Ilyphon and Greel,:Letter which 
have t)articular signitieance for the terminology 
in this dolnain. 
4 Experiments 
4.1 Training and testing set 
The training set we used in our experiments 
('onsisted of 100 MEI)II, INI~ al)stra(:ts, marked 
Ul) ill XS/\[L l)y a (lonmin ext)ert for the name 
('lasses given in Tal)le 1. The mmfl)er of NEs 
that were marked u 1) by class are also given in 
Tfl)le 1 and the total lmmber of words in the 
corlms is 299/\]:0. The al)stracts were chosen from 
a sul)(lomain of moleeular-1)iology that we for- 
mulated by s(',ar(;hing under the terms h/uman, 
blood cell, trav,.scription ,/'actor in the 1)utiMed 
datal)asc, This yiel(l('.(t al)t)roximately 33(10 al/- 
stracts. 
4.2 Results 
The results are given as F-scores, a (;Ollllll()ll 
measurement for a(:(:ura(:y in tlw, MUC con- 
ferences that eonfl)ines r(;(:all and 1)re(:ision. 
These are eah:ulated using a standard MUC tool 
(Chinchor, 1995). F-score is d('.iin(~d as 
'2 x lS"(eci.sion x l~cc, ll F - .~cor. = (4) 
l)'rccisio~, + \]¢,cc(dl 
The tirst set ot7 experiments we did shows the 
effectiveness of the mode.1 for all name (:lasses 
and is smnmarized in Table 3. We see that data 
sparseness does have an etfe('t~ with 1)roteins - 
the most mlmerous (;lass in training - getting 
the best result and I/,NA - the snmllc, st training 
(:lass - getting the worst result. The tal)le also 
shows the ett'eetiveness of the character feature 
set, whi('h in general adds 10.6% to the F-score. 
This is mainly due to a t)ositive effect on words 
in the 1)R,OTEIN and DNA elases, but we also 
see that memt)ers of all SOURCE sul)-('lasses 
sufl'er from featurization. 
We have atteml)ted to incorl)orate generali- 
sation through character t'eatm:es and linear in- 
teri)olation, which has generally \])een quite su(:- 
cessful. Nevertheless we were (:urious to see just 
Class Base llase-l'eatures 
PROTEIN 0.759 0.670 (-11.7%) 
DNA 0.472 0.376 (-20.3%) 
\]~NA 0.025 0.OOO (-leo.o%) 
SOURCE(all) 0.685 0.697 (+1.8%) 
S()UI{CE.cl 0.478 0.503 (+5.2%) 
SOURCE.el 0.708 0.752 (+6.2%) 
SOURCE.me 0.200 0.311 (+55.5%) 
SOURCE.mu 0.396 0.402 (+1.5%) 
SOURCE.vi 0.676 0.713 (+5.5%) 
S()URCI,Lsl 0.540 0.549 (+1.7%) 
SOURCE.ti 0.206 0.216 (+4.9%) 
All classes 0.728 0.651 (-10.6%) 
q)d)le 3: Named entity acquisition results us- 
ing 5-fi)ld cross validation on 100 XML tagged 
MEI)I~INE al/stra(:ts, 80 for training and 20 fin. 
testing, l\]ase-J'(',at'urc.s uses no character feature 
inibrmation. 
)~ Mode\[ No. 
# Texts 0 1 2 3 4 5 
80 
40 
20 
10 
5 
0.06 0.22 0.10 0.67 0.93 1.0 
0.06 0.19 0.10 0.63 0.94 1.0 
().()~l 0.15 0.09 0.59 0.89 1.0 
0.03 0.12 0.08 0.52 0.83 1.0 
0.02 0.09 0.06 0.41 0.68 1.0 
Tal)le 4: M(',an lmml)er of successflll calls to sul)- 
m(i(t(;ls during testing as a fl'aetion of total mnn- 
1)er (If stale transitions in the Viterl)i latti(:e, g/: 
T(!xis indicates the mmfl)er of al)stra(:ts used ill 
training. 
whi(:h t)arts of the model were contributing to 
the bigram s(:ores. Table 4 shows the l)ercent- 
age of bigranls which could be mat('hed against 
training t)igrams. The result indicate tha~ a 
high 1)ereentage of dire(:t bigrams in the test 
eorl)uS never al)t)(;ar in the training (:oft)us and 
shows tha, t our HMM model is highly depel> 
(l(mt on smoothing through models ~kl and )~:~. 
\¥e can take another view of the training data 
1)y 'salalni-slieing' the model so that only evi- 
(tenee from 1)art of the model is used. Results 
are shown in Tat)le 5 and support the eonchl- 
sion that models Al, A2 and Aa are. crucial at 
this sir,(; of training data, although we would 
expect their relative ilnportance to fifil as we 
have more (tircct observations of bigrams with 
larger training data sets. 
Tal)le 6 shows the rolmstness of the model 
205 
I Backoff models 
\[ F-score (all classes) 0.728 0.722 0.644 0.572 0.576 \] 
Table 5: F-scores using different nfixtures of models tested on 100 abstracts, 80 training and 20 
testing. 
I # Texts 80 40 20 10 5 \] 
I F-score 0.728 0.705 0.647 0.594 0.534\] 
Table 6: 
training 
stracts). 
F-score for all classes agMnst size of 
corpus (in number of MEDLINE ab- 
for data sparseness, so that even with only 10 
training texts the model can still make sensible 
decisions about term identification and classi- 
fication. As we would expect;, the table ;flso 
clearly shows that more training data is better, 
and we have not yet reached a peak in pertbr- 
inance. 
5 Conclusion 
HMMs are proving their worth for various 
tasks in inibrmation extraction and the results 
here show that this good performance can be 
achieved across domains, i.e. in molecular- 
biology as well as rising news paper reports. The 
task itself', while being similar to named entity 
in MUC, is we believe more challenging due to 
the large nunfl)er of terms which are not proper 
nouns, such as those in the source sub-classes as 
well as the large lexieal overlap between classes 
such as PROTEIN and DNA. A usefifl line of 
work in the future would be to find empirical 
methods for comparing difficulties of domains. 
Unlike traditional dictionary-based lnethods, 
the method we have shown has the advantage of 
being portable and no hand-made patterns were 
used. Additiolmlly, since the character tbatures 
are quite powerful, yet very general, there is lit- 
tle need for intervention to create domain spe- 
cific features, although other types of features 
could be added within the interpolation frame- 
work. Indeed the only thing that is required is 
a quite small corpus of text containing entities 
tagged by a domain expert. 
Currently we have optinfized the ,k constants 
by hand but clearly a better way would be to do 
this antomatically. An obvious strategy to use 
would be to use some iterative learning method 
such as Expectation Maximization (Dempster 
et al., 1977). 
The model still has limitations, most obvi- 
ously when it needs to identity, term boundaries 
for phrases containing potentially ambiguous lo- 
cal structures such as coordination and pa.ren- 
theses. For such cases we will need to add post- 
processing rules. 
There are of course many NF, models that 
are not based on HMMs that have had suc- 
cess in the NE task at the MUC conferences. 
Our main requirement in implementing a model 
for the domain of molecular-biology has been 
ease of development, accuracy and portability 
to other sub-domains since molecular-biology it- 
self is a wide field. HMMs seemed to be the 
most favourable option at this time. Alterna- 
tives that have also had considerable success 
are decision trees, e.g. (Nobata et al., 1.999) 
and maximum-entropy. The maximum entropy 
model shown in (Borthwick et al., 1998) in par- 
ticular seems a promising approach because of 
its ability to handle overlapping and large fea- 
ture sets within n well founded nmthenmtical 
ti'amework. However this implementation of the 
method seems to incorporate a number of hand- 
coded domain specitic lexical Datures and dic- 
tionary lists that reduce portability. 
Undoubtedly we could incorporate richer tba- 
tures into our model and based on the evidence 
of others we would like to add head nouns as 
one type of feature in the future. 
Acknowledgements 
We would like to express our gratitude to Yuka 
Tateishi and Tomoko Ohta of the Tsujii labora- 
tory for their efforts to produce the tagged cor- 
tins used in these experiments and to Sang-Zoo 
Lee also of the Tsujii laboratory tbr his com- 
ments regarding HMMs. We would also like to 
thank the anonymous retirees tbr their helpflfl 
comments. 
206 

References 

A. Bairoch and R. Apweiler. 1997. The SWISS- 
PF\[OT 1)r{)t{~in sequence data bank and its 
new SUl)l)lement 15:EMBL. Nucleic Acids Re- 
search, 25:31-36. 

D. Bikel, S. Miller, R. Schwartz, and 
R. Wesichedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied 
Natural Language Processing, pages 194 201. 

A. Borthwick, J. Sterling, E. Agichtein, and 
R,. Grishman. 1998. Exploiting diverse 
knowledge sources via maximum entropy in 
named entity recognition. In Proceedings 
of the Workshop on Very Large Corpora 
(WVLC'98). 

S. Chen and J. Goodman. 1996. An empirical 
study of smoothing techniques for  
modeling. 31st Annual Meeting of the Association of Computational Linguistics, Calffonia, USA, 24-27 .hme. 

N. Chinchor. 1995. MUC-5 evaluation metrics. 
In Proceedings of the Fifth MEssage Understanding Conference (MUC-5), Baltimore,, 
Maryland, USA., 1)ages 69 78. 

N. Collier, H.S. Park, N. Ogata, Y. Tateishi, 
C. Nobata, T. Ohta, T. Sekimizu, H. Imai, 
and J. Tsujii. 1999. The GENIA project: 
corpus-based knowledge acquisition and information extraction from genome research papers, in Proceedings of the Annual Meeting 
of the European chapter of the Association for 
Computational Linguistics (EACL '99), 3 uuc. 

M. Craven and 3, Kumlien. 1999. Construct- 
ing bioh}gical knowh;{tg{; t}ases t)y extracting 
information from text sour(:es. In \]}~vc(:(,Aings 
of the 7th, hl, tcrnational CoTff(:rence on Intelli- 
gent Systcmps for Molecular Biology (ISMB- 
99), Heidellmrg, Germmly, August 6 10. 

A.P. Dempster, N.M. Laird, and D.B. Rubins. 
1977. Maximmn likelihood from incoml)lete 
data via the EM algorithm. ,\]ou'rnal of the 
Royal Statistical Society (B), 39:1-38. 

l). Freitag and A. McCMlum. 1999. Intbrma- 
tion extraction with HMMs and shrinkage. 
In Proceedings of the AAAl'99 Worl~.~h, op ou, 
Machine Learning for IT~:formation Extrac- 
tion, Orlando, Florida, July 19th. 

K. Fuku(la, T. Tsunoda, A. 2)mmra, and 
T. Takagi. 1998. ~12)ward intbrmation extrac- 
tion: identifying l)rotein names from biologi- 
eal papers. Ill PTvcccdings of thc Pac'lific Sym- 
posium on Biocomp'uting'98 (PSB'98), .Jan- 
1uAYy. 

.1. Kupiec. 1992. l/obust Imrt-ofspeech tag- 
ging using a hidden markov model. Computer 
Speech and Lang'aagc, 6:225-242. 

MEI)LINE. 1999. The PubMed 
datal)ase can be t'(mnd at:. 
httt)://www.ncbi.nhn.nih.gov/Pul}Med/. 

DAIIPA. 1995. l}roceeding.s o.fl th, c Sixth, 
Message Understanding Cou:fcrcnce(MUC-6), 
Cohmdfia, MI), USA, Nove, nfl}er. Morgan 
Nail\['\] llallll. 

C. Nobata, N. Collier, and J. Tsujii. 1999. Automatic term identification and classification 
in biology texts. In Proceedings of the Natural Language Pacific Rim Symposium (NL- 
PRS'2000), November. 

Y. Ohta, Y. Tateishi, N. Collie'r, C. Nobata, K. Ibushi, and J. Tsujii. 1999. A 
semantically annotated corpus from MEDLINE abstracts. In Proceedings of the Tenth 
Workshop on Genome Informatics. Universal 
A{:ademy Press, Inc., 14 15 Deccntl)er. 

l~. llabiner and B..\]uang. 1!)86. An intro{tu{:- 
ti(m to hidden Markov too(Ms. H'2EE ASSP 
Magazi',,(',, 1}ages d 16, Jammry. 

T. Sekilnizu, H. Park, and J. 'l'sujii. 1998. 
I{lenti\[ying l;he interaction 1)etween genes an{1 
gOlle i}ro(lucts \]}ase(l on f\]'e(lue\]My seen verbs 
in n\]e{tline al)si;rael;s. Ill ~(:'li,()?ll,('~ \]~ffor'm, al, ics'. 
Univcrsa,1 Academy Press, Inc. 

K. Seymore, A. MeCallum, and l{. I{oscnfeld. 
1999. Learning hidden Markove strucl:ure 
for informati{m (,xtraction. In \])wcccdings of 
the AAAl'99 Workshop on Macfli'n,(: Lcarni'n 9 
for l',fo'rmation E:draction, Orland{}, Flori{ta., 
July 19th. 

.J. Thomas, D. Milward, C. Ouzounis, S. Pul- 
man, and M. Carroll. 1999. Automatic ex- 
traction of 1)rotein interactions fl'om s{'ien- 
tific abstracts. In Proceedings of the I}ac'll/ic 
Symposium on Biocomputing'99 (PSB'99), 
Hawaii, USA, Jmmary 4-9. 

A. 3. Vit(;rbi. 1967. Error l){mnds for {:onvolu- 
tions e{}{les and an asyml)totically optimum 
deco(ling algorithm. IEEE Tran,s'actiou,.s' on 
I~formation Theory, IT-13(2):260 269. 
