Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL 06, pages 33–40,
New York City, June 2006. c©2006 Association for Computational Linguistics
A Priority Model for Named Entities 
 
 
Lorraine Tanabe W. John Wilbur 
National Center for Biotechnology 
Information 
National Center for Biotechnology 
Information 
Bethesda, MD 20894 Bethesda, MD 20894 
tanabe@ncbi.nlm.nih.gov wilbur@ncbi.nlm.nih.gov 
  
 
 
 
Abstract 
We introduce a new approach to named 
entity classification which we term a Pri-
ority Model. We also describe the con-
struction of a semantic database called 
SemCat consisting of a large number of 
semantically categorized names relevant 
to biomedicine. We used SemCat as train-
ing data to investigate name classification 
techniques. We generated a statistical lan-
guage model and probabilistic context-
free grammars for gene and protein name 
classification, and compared the results 
with the new model.  For all three meth-
ods, we used a variable order Markov 
model to predict the nature of strings not 
represented in the training data.  The Pri-
ority Model achieves an F-measure of 
0.958-0.960, consistently higher than the 
statistical language model and probabilis-
tic context-free grammar.   
1 Introduction 
Automatic recognition of gene and protein names 
is a challenging first step towards text mining the 
biomedical literature. Advances in the area of gene 
and protein named entity recognition (NER) have 
been accelerated by freely available tagged corpora 
(Kim et al., 2003, Cohen et al., 2005, Smith et al., 
2005, Tanabe et al., 2005).  Such corpora have 
made it possible for standardized evaluations such 
as Task 1A of the first BioCreative Workshop 
(Yeh et al., 2005). 
Although state-of-the-art systems now perform 
at the level of 80-83% F-measure, this is still well 
below the range of 90-97% for non-biomedical 
NER.  The main reasons for this performance dis-
parity are 1) the complexity of the genetic nomen-
clature and 2) the confusion of gene and protein 
names with other biomedical entities, as well as 
with common English words. In an effort to allevi-
ate the confusion with other biomedical entities we 
have assembled a database consisting of named 
entities appearing in the literature of biomedicine 
together with information on their ontological 
categories. We use this information in an effort to 
better understand how to classify names as repre-
senting genes/proteins or not.  
2 Background  
A successful gene and protein NER system must 
address the complexity and ambiguity inherent in 
this domain.  Hand-crafted rules alone are unable 
to capture these phenomena in large biomedical 
text collections.  Most biomedical NER systems 
use some form of language modeling, consisting of 
an observed sequence of words and a hidden se-
quence of tags.  The goal is to find the tag se-
quence with maximal probability given the 
observed word sequence.  McDonald and Pereira 
(2005) use conditional random fields (CRF) to 
identify the beginning, inside and outside of gene 
and protein names.   GuoDong et al. (2005) use an 
ensemble of one support vector machine and two 
Hidden Markov Models (HMMs).  Kinoshita et al. 
(2005) use a second-order Markov model.  Dingare 
et al. (2005) use a maximum entropy Markov 
model (MEMM) with large feature sets.   
33
NER is a difficult task because it requires both 
the identification of the boundaries of an entity in 
text, and the classification of that entity.  In this 
paper, we focus on the classification step.  Spasic 
et al. (2005) use the MaSTerClass case-based rea-
soning system for biomedical term classification.  
MaSTerClass uses term contexts from an annotated 
corpus of 2072 MEDLINE abstracts related to nu-
clear receptors as a basis for classifying new 
terms.  Its set of classes is a subset of the UMLS 
Semantic Network (McCray, 1989), that does not 
include genes and proteins.  Liu et al. (2002) clas-
sified terms that represent multiple UMLS con-
cepts by examining the conceptual relatives of the 
concepts.  Hatzivassiloglou et al. (2001) classified 
terms known to belong to the classes Protein, Gene 
and/or RNA using unsupervised learning, achieving 
accuracy rates up to 85%.  The AZuRE system 
(Podowski et al., 2004) uses a separate modified 
Naive Bayes model for each of 20K genes.  A term 
is disambiguated based on its contextual similarity 
to each model. Nenadic et al. (2003) recognized 
the importance of terminological knowledge for 
biomedical text mining. They used the C/NC-
methods, calculating both the intrinsic characteris-
tics of terms (such as their frequency of occurrence 
as substrings of other terms), and the context of 
terms as linear combinations.  These biomedical 
classification systems all rely on the context sur-
rounding named entities. While we recognize the 
importance of context, we believe one must strive 
for the appropriate blend of information coming 
from the context and information that is inherent in 
the name itself.  This explains our focus on names 
without context in this work.  
We believe one can improve gene and protein 
entity classification by using more training data 
and/or using a more appropriate model for names.  
Current sources of training data are deficient in 
important biomedical terminologies like cell line 
names.  To address this deficiency, we constructed 
the SemCat database, based on a subset of the 
UMLS Semantic Network enriched with categories 
from the GENIA Ontology (Kim et al, 2003), and a 
few new semantic types. We have populated Sem-
Cat with over 5 million entities of interest from 
Figure 1. SemCat Physical Object Hierarchy.  White = UMLS SN, Light Grey = GENIA semantic 
types, Dark Grey = New semantic types. 
34
standard knowledge sources like the UMLS 
(Lindberg et al., 1993), the Gene Ontology (GO) 
(The Gene Ontology Consortium, 2000), Entrez 
Gene (Maglott et al., 2005), and GENIA, as well as 
from the World Wide Web.  In this paper, we use 
SemCat data to compare three probabilistic frame-
works for named entity classification.   
3 Methods 
We constructed the SemCat database of biomedical 
entities, and used these entities to train and test 
three probabilistic approaches to gene and protein 
name classification: 1) a statistical language model 
with Witten-Bell smoothing, 2) probabilistic con-
text-free grammars (PCFGs) and 3) a new ap-
proach we call a Priority Model for named entities. 
As one component in all of our classification algo-
rithms we use a variable order Markov Model for 
strings.   
3.1 SemCat Database Construction 
The UMLS Semantic Network (SN) is an ongoing 
project at the National Library of Medicine.  Many 
users have modified the SN for their own research 
domains.  For example, Yu et al. (1999) found that 
the SN was missing critical components in the ge-
nomics domain, and added six new semantic types 
including Protein Structure and Chemical Com-
plex. We found that a subset of the SN would be 
sufficient for gene and protein name classification, 
and added some new semantic types for better cov-
erage.  We shifted some semantic types from 
suboptimal nodes to ones that made more sense 
from a genomics standpoint. For example, there 
were two problems with Gene or Genome. Firstly, 
genes and genomes are not synonymous, and sec-
ondly, placement under the semantic type Fully 
Formed Anatomical Structure is suboptimal from a 
genomics perspective. Since a gene in this context 
is better understood as an organic chemical, we 
deleted Gene or Genome, and added the GENIA 
semantic types for genomics entities under Or-
ganic Chemical. The SemCat Physical Object hier-
archy is shown in Figure 1.  Similar hierarchies 
exist for the SN Conceptual Entity and Event trees.  
A number of the categories have been supple-
mented with automatically extracted entities from 
MEDLINE, derived from regular expression pat-
tern matching.  Currently, SemCat has 77 semantic 
types, and 5.11M non-unique entries. Additional 
entities from MEDLINE are being manually classi-
fied via an annotation website.  Unlike the Ter-
mino database (Harkema et al. (2004), which 
contains terminology annotated with morpho-
syntactic and conceptual information, SemCat cur-
rently consists of gazetteer lists only.  
For our experiments, we generated two sets of 
training data from SemCat, Gene-Protein (GP) and 
Not-Gene-Protein (NGP).  GP consists of specific 
terms from the semantic types DNA MOLECULE, 
PROTEIN MOLECULE, DNA FAMILY, 
PROTEIN FAMILY, PROTEIN COMPLEX and 
PROTEIN SUBUNIT.  NGP consists of entities 
from all other SemCat types, along with generic 
entities from the GP semantic types.  Generic enti-
ties were automatically eliminated from GP using 
pattern matching to manually tagged generic 
phrases like abnormal protein, acid domain, and 
RNA.  
Many SemCat entries contain commas and pa-
rentheses, for example, “receptors, tgf beta.”  A 
better form for natural language processing would 
be “tgf beta receptors.”  To address this problem, 
we automatically generated variants of phrases in 
GP with commas and parentheses, and found their 
counts in MEDLINE.  We empirically determined 
the heuristic rule of replacing the phrase with its 
second most frequent variant, based on the obser-
vation that the most frequent variant is often too 
generic.  For example, the following are the phrase 
variant counts for “heat shock protein (dnaj)”: 
 
• heat shock protein (dnaj) 0 
• dnaj heat shock protein  84 
• heat shock protein  122954 
• heat shock protein dnaj  41 
 
Thus, the phrase kept for GP is dnaj heat shock 
protein.  
After purifying the sets and removing ambigu-
ous full phrases (ambiguous words were retained), 
GP contained 1,001,188 phrases, and NGP con-
tained 2,964,271 phrases. From these, we ran-
domly generated three train/test divisions of 90% 
train/10% test (gp1, gp2, gp3), for the evaluation.   
3.2    Variable Order Markov Model for Strings 
As one component in our classification algorithms 
we use a variable order Markov Model for strings.  
Suppose C represents a class and 
123
...
n
x xx x repre-
35
sents a string of characters. In order to estimate the 
probability that 
123
...
n
x xx x belongs to  we apply 
Bayes’ Theorem to write 
C
 
()
( ) ( )
()
123
123
123
... |
| ...
...
n
n
n
pxxx x CpC
pC xxx x
pxxx x
=
     (1) 
 
Because 
( )
123
...
n
pxxx x does not depend on the 
class and because we are generally comparing 
probability estimates between classes, we ignore 
this factor in our calculations and concentrate our 
efforts on evaluating ()( )
123
... |
n
p xxx x C p C . 
First we write 
 
()(
123 123 1
1
... | | ... ,
n
nk
k
pxxx x C px xxx x C
−
=
=
∏
)
k
)
     (2) 
 
which is an exact equality. The final step is to give 
our best approximation to each of the num-
bers (
123 1
|..,
kk
p xxxxx C
−
. To make these ap-
proximations we assume that we are given a set of 
strings and associated probabilities ()
{ }
1
,
M
ii
i
sp
=
 
where for each i ,  and 0
i
p >
i
p  is assumed to 
represent the probability that  belongs to the 
class C .  Then for the given string 
i
s
123
...
n
x xx x and 
a given  we let  be the smallest integer for 
which 
k 1r ≥
12
...
rr r k
x xx x
++
 is a contiguous substring in 
at least one of the strings .  Now let 
i
s N′ be the 
set of all i  for which 
12
...
rr r k
x xx x
++
 is a substring 
of  and let  be the set of all  for which 
i
s N i
12
...
1rr r k
x xx x
++ −
 is a substring of . We set 
i
s
()
123 1
| ... ,
i
iN
kk
i
iN
p
px xxx x C
p
′∈
−
∈
=
∑
∑
. (3) 
In some cases it is appropriate to assume that 
( )
p C  is proportional to 
1
M
i
i
p
=
∑
 or there may be 
other ways to make this estimate. This basic 
scheme works well, but we have found that we can 
obtain a modest improvement by adding a unique 
start character to the beginning of each string. This 
character is assumed to occur nowhere else but as 
the first character in all strings dealt with including 
any string whose probability we are estimating.  
This forces the estimates of probabilities near the 
beginnings of strings to come from estimates based 
on the beginnings of strings.  We use this approach 
in all of our classification algorithms. 
 
Table 1. Each fragment in the left column appears in the 
training data and the probability in the right column 
represents the probability of seeing the underlined por-
tion of the string given the occurrence of the initial un-
underlined portion of the string in a training string.  
GP 
!apoe
7
9.55 10
−
×  
oe-e
3
2.09 10
−
×  
e-epsilon
2
4.00 10
−
×  
( )|p apoe epsilon GP−  
11
7.98 10
−
×  
( )|p GP apoe epsilon−  0.98448 
NGP 
!apoe
8
8.88 10
−
×  
poe-
2
1.21 10
−
×  
oe-e
2
6.10 10
−
×  
e-epsilon
3
6.49 10
−
×  
( )|p apoe epsilon NGP−  
13
4.25 10
−
×  
( )|p NGP apoe epsilon−  0.01552 
In Table 1, we give an illustrative example of 
the string apoe-epsilon which does not appear in 
the training data.  A PubMed search for apoe-
epsilon gene returns 269 hits showing the name is 
known. But it does not appear in this exact form in 
SemCat. 
3.3   Language Model with Witten-Bell Smooth-
ing 
A statistical n-gram model is challenged when a 
bigram in the test set is absent from the training 
set, an unavoidable situation in natural language 
due to Zipf’s law.  Therefore, some method for 
assigning nonzero probability to novel n-grams is 
required.  For our language model (LM), we used 
Witten-Bell smoothing, which reserves probability 
mass for out of vocabulary values (Witten and 
Bell, 1991, Chen and Goodman, 1998).  The dis-
counted probability is calculated as 
 
   
)...()...(#
)...(#
)...(
ˆ
1111
1
11
−+−−+−
+−
−+−
+
=
iniini
ini
ini
wwDww
ww
wwP
   (4)   
 
36
where  is the number of distinct 
words that can appear after in the 
training data. Actual values assigned to tokens out-
side the training data are not assigned uniformly 
but are filled in using a variable order Markov 
Model based on the strings seen in the training 
data.  
)...(
11 −+− ini
wwD
11
...
−+− ini
ww
3.4   Probabilistic Context-Free Grammar 
The Probabilistic Context-Free Grammar 
(PCFG) or Stochastic Context-Free Grammar 
(SCFG) was originally formulated by Booth 
(1969).  For technical details we refer the reader to 
Charniak (1993). For gene and protein name classi-
fication, we tried two different approaches.  In the 
first PCFG method (PCFG-3), we used the follow-
ing simple productions: 
 
1) CATP → CATP CATP 
2) CATP → CATP postCATP 
3) CATP → preCATP CATP 
 
CATP refers to the category of the phrase, GP 
or NGP.  The prefixes pre and post refer to begin-
nings and endings of the respective strings.  We 
trained two separate grammars, one for the positive 
examples, GP, and one for the negative examples, 
NGP.  Test cases were tagged based on their score 
from each of the two grammars. 
In the second PCFG method (PCFG-8), we 
combined the positive and negative training exam-
ples into one grammar.  The minimum number of 
non-terminals necessary to cover the training sets 
gp1-3 was six {CATP, preCATP, postCATP, Not-
CATP, preNotCATP, postNotCATP}. CATP 
represents a string from GP, and NotCATP repre-
sents a string from NGP.  We used the following 
production rules: 
 
1) CATP → CATP CATP 
2) CATP → CATP postCATP 
3) CATP → preCATP CATP 
4) CATP → NotCATP CATP 
5) NotCATP → NotCATP NotCATP 
6) NotCATP → NotCATP postNotCATP 
7) NotCATP→ preNotCATP NotCATP 
8) NotCATP → CATP NotCATP 
 
It can be seen that (4) is necessary for strings like 
“human p53,” and (8) covers strings like “p53 
pathway.” 
     In order to deal with tokens that do not ap-
pear in the training data we use variable order 
Markov Models for strings. First the grammar is 
trained on the training set of names. Then any to-
ken appearing in the training data will have as-
signed to it the tags appearing on the right side of 
any rule of the grammar (essentially part-of-speech 
tags) with probabilities that are a product of the 
training.  We then construct a variable order 
Markov Model for each tag type based on the to-
kens in the training data and the assigned prob-
abilities for that tag type. These Models (three for 
PCFG-3 and six for PCFG-8) are then used to as-
sign the basic tags of the grammar to any token not 
seen in training.  In this way the grammars can be 
used to classify any name even if its tokens are not 
in the training data.  
3.5   Priority Model 
There are problems with the previous ap-
proaches when applied to names. For example, 
suppose one is dealing with the name “human liver 
alkaline phosphatase” and class  represents pro-
tein names and class  anatomical names. In that 
case a language model is no more likely to favor 
 than . We have experimented with PCFGs 
and have found the biggest challenge to be how to 
choose the grammar. After a number of attempts 
we have still found problems of the “human liver 
alkaline phosphatase” type to persist. 
1
C
2
C
1
C
2
C
The difficulties we have experienced with lan-
guage models and PCFGs have led us to try a dif-
ferent approach to model named entities. As a 
general rule in a phrase representing a named en-
tity a word to the right is more likely to be the head 
word or the word determining the nature of the 
entity than a word to the left. We follow this rule 
and construct a model which we will call a Priority 
Model. Let  be the set of training data (names) 
for class  and likewise  for . Let 
1
T
1
C
2
T
2
C { }
A
t
α
α∈
 
denote the set of all tokens used in names con-
tained in . Then for each token 
1
TT∪
2
, tA
α
α∈ , 
we assume there are associated two probabilities 
p
α
 and q
α
 with the interpretation that p
α
 is the 
37
probability that the appearance of the token t
α
 in a 
name indicates that name belongs to class  and 
1
C
q
α
 is the probability that t
α
 is a reliable indicator 
of the class of a name. Let  be 
composed of the tokens on the right in the given 
order. Then we compute the probability 
() () ()12 k
ntt t
αα α
=…
 
()
() ()
( )
() () ()
( )1 1
221
.|1 1
kk
jiii
j
pC n p q q p q
αααα =
==
=−+ −
∑∏∏
k
j
j
α
+
      (5)                                         
 
This formula comes from a straightforward in-
terpretation of priority in which we start on the 
right side of a name and compute the probability 
the name belongs to class  stepwise. If  is 
the rightmost token we multiple the reliability 
 times the significance 
1
C
()k
t
α
()k
q
α ()k
p
α
 to obtain 
, which represents the contribution of 
. The remaining or unused probability is 
 and this is passed to the next token to the 
left, . The probability  is scaled by 
the reliability and then the significance of 
() ()kk
qp
αα
()k
t
α
()
1
k
q
α
−
()1k
t
α − ()
1
k
q
α
−
()1k
t
α −
 to 
obtain , which is the contri-
bution of  toward the probability that the 
name is of class . The remaining probability is 
now  and this is again 
passed to the next token to the left, etc. At the last 
token on the left the reliability is not used to scale 
because there are no further tokens to the left and 
only significance 
() ( ) ( )1
(1 )
kk k
qq p
αα α−
−
1−
)
kα
()1k
t
α −
1
C
()
()
()
(
1
1
k
qq
α−
−−
()1
p
α
 is used.  
We want to choose all the parameters p
α
 and 
q
α
 to maximize the probability of the data. Thus 
we seek to maximize 
 
 () ( )()
12
1
log | log 2 |
nT nT
F pC n pC n
∈∈
=+
∑∑
.
 (6) 
                 
Because probabilities are restricted to be in the 
interval 
[ ]
0,1 , it is convenient to make a change of 
variables through the definitions 
, 
11
xy
x
e
pq
ee
αα
y
e
α α
α
==
++
. (7) 
Then it is a simple exercise to show that 
() (1, 1
dp dq
pp qq
dx dy
αα
)
α αα
αα
=− =−
α
. (8) 
From (5), (6), and (8) it is straightforward to com-
pute the gradient of  as a function of F x
α
 and y
α
 
and because of (8) it is most naturally expressed in 
terms of p
α
 and q
α
. Before we carry out the op-
timization one further step is important. Let B  
denote the subset of Aα∈  for which all the oc-
currences of t
α
 either occur in names in  or all 
occurrences occur in names in . For any such 
1
T
2
T α  
we set 1q
α
=  and if all occurrences of t
α
 are in 
names in   we set 
1
T 1p
α
= , while if all occur-
rences are in names in  we set .  These 
choices are optimal and because of the form of (8) 
it is easily seen that 
2
T 0p
α
=
0
FF
xy
αα
∂ ∂
= =
∂∂
 (9) 
for such an α . Thus we may ignore all the Bα∈  
in our optimization process because the values of 
p
α
 and q
α
 are already set optimally. We therefore 
carry out optimization of  using the F
, , x yA
αα
Bα∈ − . For the optimization we have 
had good success using a Limited Memory BFGS 
method (Nash et al., 1991). 
 
When the optimization of  is complete we 
will have estimates for all the 
F
p
α
 and q
α
, Aα∈ . 
We still must deal with tokens t
β
 that are not in-
cluded among the t
α
.  For this purpose we train 
variable order Markov Models 
1
MP  based on the 
weighted set of strings ()
{ }
,
A
tp
αα
α∈
 and 
2
MP  
based on ( )
{ }
,1
A
tp
αα
α∈
− . Likewise we train 
1
MQ  based on ( )
{ }
,
A
tq
αα
α∈
 and 
2
MQ  based on 
( ){ },1
A
tq
αα
α∈
− . Then if we allow 
( )
i
mp t
β
 to 
represent the prediction from model 
i
MP  and 
( )
i
mq t
β
 that from model 
i
MQ , we set 
38
 
()
() ()
( )
() ()
1
12 12
, 
mp t mq t
pq
mp t mp t mq t mq t
β
ββ
ββ β
==
+
β
+
(10) 
This allows us to apply the priority model to 
any name to predict its classification based on 
equation 5.  
4 Results 
We ran all three methods on the SemCat sets gp1, 
gp2 and gp3.  Results are shown in Table 2.  For 
evaluation we applied the standard information 
retrieval measures precision, recall and F-measure.   
_
(_ _)
rel ret
precision
relretnonrelret
=
+−
 
_
(_ _ _)
rel ret
recall
rel ret rel not ret
=
+
 
 
2* *
()
precision recall
F measure
precision recall
−=
+
 
 
For name classification, rel_ret refers to true posi-
tive entities, non-rel_ret to false positive entities 
and rel_ not_ret to false negative entities. 
 
Table 2. Three-fold cross validation results. P = Preci-
sion, R = Recall, F = F-measure. PCFG = Probabilistic 
Context-Free Grammar, LM = Bigram Model with Wit-
ten-Bell smoothing,  PM = Priority Model. 
Method Run P R F 
PCFG-3 gp1 0.883 0.934 0.908 
 gp2 0.882 0.937 0.909 
 gp3 0.877 0.936 0.906 
PCFG-8 gp1 0.939 0.966 0.952 
 gp2 0.938 0.967 0.952 
 gp3 0.939 0.966 0.952 
LM gp1 0.920 0.968 0.944 
 gp2 0.923 0.968 0.945 
 gp3 0.917 0.971 0.943 
PM gp1 0.949 0.968 0.958 
 gp2 0.950 0.968 0.960 
 gp3 0.950 0.967 0.958 
5 Discussion 
Using a variable order Markov model for strings 
improved the results for all methods (results not 
shown).  The gp1-3 results are similar within each 
method, yet it is clear that the overall performance 
of these methods is PM > PCFG-8 > LM > PCFG-
3. The very large size of the database and the very 
uniform results obtained over the three independ-
ent random splits of the data support this conclu-
sion.  
The improvement of PCFG-8 over PCFG-3 can 
be attributed to the considerable ambiguity in this 
domain. Since there are many cases of term over-
lap in the training data, a grammar incorporating 
some of this ambiguity should outperform one that 
does not. In PCFG-8, additional production rules 
allow phrases beginning as CATPs to be overall 
NotCATPs, and vice versa.   
The Priority Model outperformed all other meth-
ods using F-measure.  This supports our impres-
sion that the right-most words in a name should be 
given higher priority when classifying names.  A 
decrease in performance for the model is expected 
when applying this model to the named entity ex-
traction (NER) task, since the model is based on 
terminology alone and not on the surrounding 
natural language text.  In our classification experi-
ments, there is no context, so disambiguation is not 
an issue. However, the application of our model to 
NER will require addressing this problem. 
 SemCat has not been tested for accuracy, but 
we retain a set of manually-assigned scores that 
attest to the reliability of each contributing list of 
terms.  Table 2 indicates that good results can be 
obtained even with noisy training data.   
6 Conclusion 
In this paper, we have concentrated on the infor-
mation inherent in gene and protein names versus 
other biomedical entities.  We have demonstrated 
the utility of the SemCat database in training prob-
abilistic methods for gene and protein entity classi-
fication.  We have also introduced a new model for 
named entity prediction that prioritizes the contri-
bution of words towards the right end of terms. 
The Priority Model shows promise in the domain 
of gene and protein name classification.  We plan 
to apply the Priority Model, along with appropriate 
contextual and meta-level information, to gene and 
protein named entity recognition in future work.  
We intend to make SemCat freely available. 
 
39
Acknowledgements 
This research was supported in part by the Intra-
mural Research Program of the NIH, National Li-
brary of Medicine. 

References  
T. L. Booth.  1969.  Probabilistic representation of for-
mal languages.  In:  IEEE Conference Record of the 
1969 Tenth Annual Symposium on Switching and 
Automata Theory, 74-81. 
 
Stanley F. Chen and Joshua T. Goodman. 1998.  An 
empirical study of smoothing techniques for lan-
guage modeling. Technical Report TR-10-98, Com-
puter Science Group, Harvard University. 
 
Eugene Charniak.  1993.  Statistical Language Learn-
ing.  The MIT Press,  Cambridge, Massachusetts. 
 
K. Bretonnel Cohen, Lynne Fox, Philip V. Ogren and 
Lawrence Hunter. 2005. Corpus design for biomedi-
cal natural language processing. Proceedings of the 
ACL-ISMB Workshop on Linking Biological Litera-
ture, Ontologies and Databases, 38-45. 
 
The Gene Ontology Consortium.  2000. Gene Ontology: 
tool for the unification of biology, Nat Genet. 25: 25-
29. 
Henk Harkema, Robert Gaizauskas, Mark Hepple, An-
gus Roberts, Ian Roberts, Neil Davis and Yikun Guo.  
2004.  A large scale terminology resource for bio-
medical text processing.  Proc BioLINK 2004, 53-60. 
Vasileios Hatzivassiloglou, Pablo A. Duboué and An-
drey Rzhetsky.  2001.  Disambiguating proteins, 
genes, and RNA in text: a machine learning ap-
proach.  Bioinformatics 17 Suppl 1:S97-106. 
J.-D. Kim, Tomoko Ohta, Yuka Tateisi and Jun-ichi 
Tsujii. 2003. GENIA corpus--semantically annotated 
corpus for bio-textmining. Bioinformatics 19 Suppl 
1:i180-2.  
 
Donald A. Lindberg, Betsy L. Humphreys and Alexa T. 
McCray. 1993. The Unified Medical Language Sys-
tem. Methods Inf Med 32(4):281-91. 
 
Hongfang Liu, Stephen B. Johnson, and Carol Fried-
man.  2002.  Automatic resolution of ambiguous terms 
based on machine learning and conceptual relations in 
the UMLS.  J Am Med Inform Assoc 9(6): 621–636. 
 
Donna Maglott, Jim Ostell, Kim D. Pruitt and Tatiana 
Tatusova. 2005. Entrez Gene: gene-centered informa-
tion at NCBI. Nucleic Acids Res. 33:D54-8.   
 
Alexa T. McCray. 1989. The UMLS semantic network. 
In: Kingsland LC (ed). Proc 13rd Annu Symp Com-
put Appl Med Care. Washington, DC: IEEE Com-
puter Society Press, 503-7. 
 
Ryan McDonald and Fernando Pereira.  2005.  Identify-
ing gene and protein mentions in text using condi-
tional random fields.  BMC Bioinformatics 6 Supp 
1:S6. 
 
S. Nash and J. Nocedal. 1991. A numerical study of the 
limited memory BFGS method and the truncated-
Newton method for large scale optimization, SIAM J. 
Optimization1(3): 358-372. 
 
Goran Nenadic, Irena Spasic and Sophia Ananiadou.  
2003.  Terminology-driven mining of biomedical lit-
erature.  Bioinformatics 19:8, 938-943. 
 
Raf M. Podowski, John G. Cleary, Nicholas T. Gon-
charoff, Gregory Amoutzias and William S. Hayes.  
2004.  AZuRE, a scalable system for automated term 
disambiguation of gene and protein Names IEEE 
Computer Society Bioinformatics Conference, 415-
424. 
 
Lawrence H. Smith, Lorraine Tanabe, Thomas C. Rind-
flesch and W. John Wilbur. 2005. MedTag: A collec-
tion of biomedical annotations. Proceedings of the 
ACL-ISMB Workshop on Linking Biological Litera-
ture, Ontologies and Databases, 32-37.  
 
Lorraine Tanabe, Natalie Xie, Lynne H. Thom, Wayne 
Matten and W. John Wilbur. 2005. GENETAG: a 
tagged corpus for gene/protein named entity recogni-
tion. BMC Bioinformatics 6 Suppl 1:S3.  
 
I. Witten and T. Bell, 1991. The zero-frequency prob-
lem:  Estimating the probabilities of novel events in 
adaptive text compression. IEEE Transactions on In-
formation Theory 37(4).  
 
Alexander Yeh, Alexander Morgan, Mark Colosimo and 
Lynette Hirschman. 2005. BioCreAtIvE Task 1A: 
gene mention finding evaluation. BMC Bioinformat-
ics 6 Suppl 1:S2.  
 
Hong Yu, Carol Friedman, Andrey Rhzetsky and 
Pauline Kra. 1999. Representing genomic knowledge 
in the UMLS semantic network. Proc AMIA Symp. 
181-5. 
