Assessing the correlation between contextual patterns and
biological entity tagging.
M.KRALLINGER, M.PADR ON, C.BLASCHKE, A.VALENCIA
Protein Design Group,
National Center of Biotechnology (CNB-CSIC),
Cantoblanco,
E-28049 Madrid,
martink, mpadron, blaschke, valencia@cnb.uam.es
Abstract
The tagging of biological entities, and in partic-
ular gene and protein names, is an essential step
in the analysis of textual information in Molec-
ular Biology and Biomedicine. The problem is
harder than was originally thought because of
the highly dynamic nature of the research area,
in which new genes and their functions are con-
stantly being discovered, and because of the lack
of commonly accepted standards. An impres-
sive collection of techniques has been used to
detect protein and gene names in the last four-
 ve years, ranging from typical NLP to purely
bioinformatics approaches. We explore here the
relationship between protein/gene names and
expressions used to characterize protein/gene
function. These expressions are captured in a
collection of patterns derived from an original
set of manually derived expressions, extended
to cover lexical variants and  ltered with known
cases of association patterns/ names. Apply-
ing these patterns to a large collection of cu-
rated sentences, we found a signi cant number
of patterns with a very strong tendency to ap-
pear only in sentences in which a protein/gene
name is simultaneously present. This approach
is part of a larger e ort to incorporate contex-
tual information so as to make biological infor-
mation less ambiguous.
1 Introduction
Molecular Biology and biomedical research cov-
ers a broad variety of research topics, connected
to the function of genes and proteins. The infor-
mation on the experimental characterization of
essential functional aspects of these genes and
proteins is manually extracted from primary sci-
enti c publications by  eld-speci c databases.
This process requires highly specialist person-
nel, and is costly and time-consuming. Indeed,
only a small number of genes and proteins have
been annotated with information directly re-
lated to experiments, whereas in the immense
majority of cases the annotations are trans-
ferred from other similar entries. The anno-
tations provided by the databases are a valu-
able source for large-scale analysis, but are in-
evitably incomplete at the level of detailed func-
tion and experimental results.
It is in the context of fast-growing biblio-
graphic information (over 12 million references
are collected in the PubMed database, with an
average of 500,000 new references added every
year) and annotation of the function of genes
and proteins that Text Mining and Information
Extraction systems become important tools
for biological research (Blaschke and Valencia,
2001).
Since the  rst papers were published in
this  eld in the late 90’s, it has become clear
that the detection of gene and protein names
(gene tagging) is a key  rst step towards Text
Mining systems becoming really useful.
The detection of names is particularly com-
plex in the domain of Molecular Biology, for a
number of reasons:
(1) Sociological, since names are perceived as
associated with the recognition of the groups
that  rst discovered them.
(2) As biologists tend not to adopt available
naming standards, often the disease related
to a gene disorder has the same name as the
gene itself (homonyms). This can be only be
addressed using context based sense disam-
biguation procedures.
(3) Gene names or symbols are often the same
as common English terms. For instance, many
D. melanogaster gene names, such as ’hedge-
hog’, lead to lexical ambiguity. (4) Symbols
and abbreviations are commonly used without
any control. This gives rise to the problems of
acronym disambiguation and expansion. There
is no high-quality gene acronym dictionary.
(5) Proteins are related by a process of evalua-
tion, which creates ontological associations that
36
are mixed with the various levels of knowledge
for di erent members of the protein families.
(6) The  eld itself is still evolving, and the
catalogue of genes even for the genomes already
sequenced, such as the Human one, is still
incomplete.
Our own assessment of the evolution of
gene names shows that names evolve over
time into a complex system with scale-free
behavior, with the presence of a few very
oft-quoted names (attractors) and many very
seldom quoted ones. The system itself is in a
critical state and the fate of current names is
unpredictable (Ho mann and Valencia, 2003).
A signi cant number of applications have
been developed to identify gene names and sym-
bols in the biomedical literature (see (Tanabe
and Wilbur, 2002; Yu et al., 2002; Proux et al.,
1998; Krauthammer et al., 2000) for four dif-
ferent methodological approaches). In order to
assess the performance of di erent approaches
the BioCreative challenge was carried out. The
recent BioCreative challenge showed that gene
and protein names can be detected by several
techniques, with a signi cant success that can
be as high as 80% for the best-performing sys-
tems (Blaschke et al. in preparation; and special
issue of BMC Bioinformatics on the BioCreative
challenge cup, in preparation). However, detec-
tion of the remaining 20% of names is really im-
portant for many operations. Therefore, there
is signi cant room for improvement, and a clear
need for new approaches able to use alternative
sources of information.
We explore here a new avenue for the detec-
tion of gene and protein names by using con-
textual information, since in many cases gene
tagging requires knowledge of context (context-
based approach for disambiguation). We pre-
viously explored relevant information by cre-
ating context-based sentence sliding windows
for entity-relation extraction (Krallinger and
Padron, 2004).
We propose here to detect those sentences de-
scribing the function of genes and proteins in
the literature that are good candidates for con-
taining unambiguous information about corre-
sponding gene and protein names.
To detect these sentences, we relied on the
identi cation of typical expressions (patterns)
associated with the description of protein func-
tion in the literature. Context information in
the form of heuristically extracted sentence pat-
terns, known as frames, proved useful in the
past for deriving protein interactions automat-
ically (Blaschke et al., 1999; Blaschke and Va-
lencia, 2001) from protein co-occurrences.
The approach proposed here is based on the
extension of heuristically derived trigger words
(Rilo , 1993; Agichtein and Gravano, 2000) and
the  ltering of patterns using previously gene-
indexed sentences. The extraction patterns ob-
tained were then ranked, using a validation set
of gene-indexed sentences and sentences lacking
the gene symbols. Precision-ranked extraction
patterns and indexing of sentences using those
patterns allowed ranking of these sentences ac-
cording to whether they contained relevant in-
formation for protein indexing and annotation.
2 Methods
In the case of complex domains, such as Molec-
ular Biology and Biomedicine, a prohibitively
large training set is generally required in order
to mine scienti c literature. Often inter-domain
portable methods do not perform well enough.
Nevertheless, within relevant sentences contain-
ing protein or gene names, commonly used pat-
terns often describe or de ne relevant aspects of
those entities.
Figure 1: Flow chart of the main steps for constructing
the extraction pattern set.
Therefore, a list or dictionary of such ex-
traction patterns was developed, starting with
a small list of trigger words which, after sev-
37
eral processing and  ltering steps, resulted in
a ranked list of protein-speci c extraction pat-
terns (see Figure 1).
2.1 Set of trigger words
First, a domain expert manually analyzed gene-
indexed sentences to extract key words that
could trigger potential extraction patterns. The
expert used heuristics based on background
knowledge of the domain. These trigger words
(Rilo , 1993; Agichtein and Gravano, 2000)
constituted frequent word types which, in the
context of other word types (often prepositions
or articles), displayed a strong association with
the given gene or protein entity. Trigger words
thus made up a sort of concept node 1 by scan-
ning through gene-indexed sentences. Most of
these trigger words were in fact verbal phrases
(e.g. transitive verbs), which were often encoun-
tered in sentences de ning or describing relevant
features of genes and gene products. There-
fore, only trigger words which helped describe
or de ne relevant aspects of the protein were ex-
tracted. These trigger words are also useful for
computerized annotation of extraction of pro-
teins. Among the trigger words were 507 verbs,
127 adjectives and 265 nouns.
2.2 Heuristic trigger word extension
The trigger words were then extended and com-
bined by the domain expert using context-
based heuristics to extract a seed set of ini-
tial extraction patterns and a set of regular
extraction expressions. An example (1) of
the heuristic trigger words used was ’encod-
ing’. Among the resulting expert-derived ex-
traction patterns were: ’, encoding a’, ’, encod-
ing the’, ’encoding a’,’encoding the’,"encoding
a <PROT>’,’gene , encoding’ and ’protein, en-
coding’. Here <PROT> represents a previously
gene tagged word type.
2.3 Automatic extension of seed
extraction patterns
To extend the set of extraction patterns and to
expand the regular expressions to obtain de ned
patterns, a rule-based system was used. Among
the extension rules for these seed patterns were
preposition substitutions, comma addition be-
fore verbs, article insertion before certain nouns
and pattern fusions. Some of the patterns gen-
erated were revised manually and inconsisten-
1Concept nodes are essentially case frames which are
triggered through a lexical item and its corresponding
linguistic context (Rilo , 1993)
cies were removed. Examples of the extraction
patterns based on the seed patterns provided in
example (1) were ’the gene encoding the’ , ’, a
gene encoding’, ’, a gene encoding the’,’, gene
encoding a’,’, the gene encoding’,’, the gene en-
coding a’,’the gene encoding the’. Some of the
extensions did not correspond to natural lan-
guage and some were too long. Thus, in a sec-
ond step, those patterns not encountered in free
text, namely the initial set of gene-indexed sen-
tences, were removed.
Figure 2: Emprirical and random distribution of the
pattern to gene name average o sets.
2.4 Temporary extraction pattern
 ltering.
The extended set of extraction patterns had to
be analyzed as to whether they really corre-
sponded to patterns encountered in sentences
in which gene and protein names or symbols
were found. Therefore we tagged, using exact
pattern matching, the temporary set of extrac-
tion patterns to a set of previously gene-indexed
sentences.
These sentences contained gene symbols of
the yeast S. cerevisiae provided by the SGD
database. A total of 36,543 sentences were gen-
erated with the use of a re ned gene tagger. In
general, as yeast genes are easier to tag and on
the whole do not correspond to common En-
glish word types, they became a high-quality
gene-indexed data set for further analysis using
o set statistics.
A total of 769 patterns were matched to these
sentences and the rest of the patterns were dis-
38
carded. To determine whether those matched
patterns had a distance association to the gene
names, we calculated the empirical average o -
set of each pattern. The distances used for the
o set calculation were measured in word tokens.
Thus average empirical o set  de was calcu-
lated by
 de =
Pn
i=1 di
n (1)
where n is the number of occurrences of the
given pattern in the gene indexed sentences and
di is the observed o set.
Taking into account the sentence length, the
individual pattern length and the gene posi-
tion within the pattern matching sentences also
a random o set was calculated for each pat-
tern occurrence and the average random o set
for each pattern,  dr was calculated (see Fig-
ure 2). In order to determine whether the aver-
age empirical distance of the patterns were sig-
ni cantly di erent from the corresponding ran-
dom o sets, the distributions were further an-
alyzed. A chi-square test was applied to verify
that both, the empirical and the random o set
distributions were normally distributed. Then
we used the Kullback-Leibler divergence to mea-
sure how di erent the two probability distribu-
tions were :
D(pkq) =
X
i
pilog2(pi=qi) (2)
where p corresponds to the normal distribution
of the empirical average o set and q to the nor-
mal distribution of the random average o set.
In our case the distributions showed a large KL
divergence. This means that  de is signi cantly
smaller when compared to  dr (i.e. the patterns
are closer to the gene names).
To be able to use the average o set di er-
ences of the patterns as a  ltering criterion we
calculated the distribution of the di erences  i
between  dr and the corresponding  de (see Fig-
ure 3). Only patterns with  i > 0 passed the
selection  lter.
2.5 Permanent extraction pattern
ranking.
After the  ltering of the temporary extrac-
tion patterns using gene/protein indexed sen-
tences, it was important to determine the preci-
sion of the extraction patterns for gene-indexed
sentences, compared with sentences without
mentions of genes. Therefore, two validation
sets were constructed: one containing a set of
Figure 3: Di erence between the average random and
average empirical o set. The extraction patterns which
did not pass the  ltering step (di erence > 0 are dis-
played in red. The remaining pattern set (blue) consti-
tuted the permanent pattern set which was used for the
f-score ranking.
Data set Total
Initial trigger words 899
Seed heuristic patterns 472,427
Extended heuristic patterns 525,408
Filtered heuristic patterns 53,185
Temporary patterns 769
Permanent patterns 655
Gene indexed sentences 36,543
Validation sentences (+)) 45,119
Validation sentences (-) 45,119
Table 1: Overview of the used dataset of pat-
terns and sentences.
gene indexed sentences using gene names con-
tained in the SwissProt database, and the other
consisting of the remaining sentences, without
those symbols. The sets were used to calculate
recall and precision:
R = TPTP + FN (3)
P = TPTP + FP (4)
The corresponding f-score is given by
F - score = 1 1
P + (1 -  )
1
P
(5)
Where P is precision, R is recall and  , which
consists in a weighting factor for precision and
39
recall, here  = 0.5, were both precision and
recall had the same weight. Regarding the ob-
tained f-score or the precision we could rank the
permanent extraction patterns.
Figure 4: F-score plot for the extraction patterns in
the validation set: f-score for each pattern (numbered)
3 Results
The total number of initial patterns was
472,427, the extended version included more
than 525,408 patterns and the reduced  ltered
 nal set included 655 patterns. Even though
these patterns clearly do not include all pos-
sible mentions of functions in texts, it is also
clear that they provide a good statistical base
(32,641 sentences detected in a corpus of 36,543
sentences) for screening the sentences to search
for protein names.
To assess the relationship between patterns
and names, we compared the frequency of the
patterns in two sets of sentences, one containing
and one not containing gene names. A highly
relevant number of patterns appears more fre-
quently in sentences containing names (324 of
the 518 patterns). A subset of these, 202, ap-
pears only in sentences where a gene or protein
name is present. This subset is an ideal candi-
date for enhancing the discriminative power of
gene/protein detection systems.
The permanent extraction patterns displayed
in general a very high precision (see Figure 5),
but the recall for an individual pattern was
relatively low. Nevertheless, most of the pat-
terns were matched to the validation sentences
(79.08%) and 13,799 sentences of the gene in-
dexed validation set had at least one pattern.
There were a total of 59.82% of high precision
patterns, i.e. with a precision greater then 0.8,
of which 64.86% had a precision of 1. Thus
the patterns are in general very speci c for gene
containing sentences.
Figure 5: Precision plot for the extraction patterns in
the validation set: PRECISION for each pattern (num-
bered)
Patterns P R F-score
protein is required 1 5.73E-05 1.15E-04
was localized on 1 5.73E-05 1.15E-04
gene is essential for 1 1.15E-04 2.29E-04
located in the 0.93 3.26E-03 6.51E-03
Table 2: Sample of high precision patterns.
In table 2 some of the top scoring precision
patterns can be seen. Most of them contained
a verb as trigger word, in contrast to the lower
scoring precision patterns (table 3), which often
corresponded to patterns were the trigger word
corresponded to a noun.
Patterns P R F-score
the human 0.544 0.015 0.029
, acts as 0.5 5.72E-05 0.0001
is associated with 0.494 0.005 0.010
role for 0.463 0.003 0.006
Table 3: Sample of low precision patterns.
Moreover most of the high scoring patterns
had a di erence between the random and the
empirical average o set greater then 8, while in
cases of low scoring precision if was mainly be-
40
low 2.5. Therefore, the use of o set calculation
as a  ltering step to extract co-occurrences of
gene-indexed sentences is seen as promising.
For sample sentences of true pattern-matching
cases, see:
(a)Although generally involved with detoxi-
 cation, overexpression of one family mem-
ber, cytochrome P450 1B1 (CYP1B1),
has been associated with human epithelial
tumors [PMID:12813131]
(b) For example, we have identi ed a novel
gene called mta1 (rat) or MTA1 (human)
that appears to be involved in mammary cell
motility and growth regulation [PMID:9891220]
(c) PEX13 protein has an SH3 docking site that
binds to the PTS-1 receptor [PMID:11405337]
(d) We have previously reported the identi ca-
tion of human PEX13, the gene encoding the
docking factor for the PTS1 receptor, or PEX5
protein [PMID:9878256]
The above examples show correctly identi-
 ed gene containing sentences using extraction
patterns. The extraction patterns are un-
derlined, while the relevant gene symbols are
displayed in bold.
After detailed analysis of the positive
matched patterns, we found that certain pat-
terns were more suited to annotating functional
implications of disease-related features of genes
or proteins (see example a). Other patterns
were more suited to extracting descriptions of
the participation of proteins in distinct biolog-
ical processes (example b). In addition, func-
tional descriptions and protein-protein interac-
tion information, useful for deriving functional
annotation data and protein de nitions, were
associated with certain patterns (see c and d).
4 Conclusions
We have described here a new approach for the
identi cation of sentences containing informa-
tion relating to gene and protein names in bio-
logical literature. Our proposal is based on the
detection of sentences that contain information
relating to protein (or gene) function as an in-
dicator of the presence of protein/gene names.
To identify these sentences, we used a
pattern-based approach that encapsulates the
characteristic ways in which function is de-
scribed in text. To generate the set of pat-
terns describing functions, we started with an
initial set of manually derived patterns, which
was extended to cover a number of lexical varia-
tions. This larger set was  ltered by matching of
the patterns using previously gene-indexed sen-
tences. The trigger word extension idea is based
on the proposal by (Rilo , 1993; Agichtein and
Gravano, 2000). Among the extraction patterns
with high precision, a signi cant number con-
tained verbs as trigger words. This corroborates
previous studies that used verbs to extract bio-
logical interactions (Hatzivassiloglou and Weng,
2002; Sekimizu et al., 1998).
We plan to analyze further the patterns used
in the study in order to explore the di eren-
tial behavior of verb-containing patterns and
noun-containing patterns for protein annota-
tion extractions. The use of verbs to trigger
extraction patterns for biological interactions
have already been explored (Hatzivassiloglou
and Weng, 2002; Sekimizu et al., 1998), but
their use for protein indexing and annotation
extraction was not previously studied in detail.
The overall performance of extraction patterns
for interactions and for annotation extraction
seems to be similar.
Most of the extraction patterns used showed
no dependency on the organism that was the
source of the genes, with the exceptions of the
patterns containing the trigger words human,
yeast, mammalian and mouse. Therefore, the
majority of the extraction patterns could be
used for extraction of genes from a broad range
of organisms, and especially aid in disambigua-
tion of  y genes. As the extraction patterns
can be applied without prior gene indexing, they
could be used to enhance compound gene-name
indexing, to extract rare typographical variants
of existing gene names (not deposited in anno-
tation databases) or even to mine the literature
to discover new genes not yet described in cur-
rent annotation databases.
The main focus in extraction patterns was pre-
cision, which was attained through a pipeline of
 ltering steps. The use of a larger set of ini-
tial trigger words might further increase recall
in some cases.
We also plan to explore the use of the infor-
mation of the patterns to improve the capac-
ity of our current entity recognition systems.
In particular, we would like to do this in the
context of our system for detecting associations
41
between proteins and their functions. In the
recent BioCreative challenge, it was clear that
our system could be substantially improved by
enhancing its name recognition capacity. This
could be done by incorporating the frames as
additional context information into the previ-
ously developed subset strategy (Krallinger and
Padron, 2004).
Finally, we also plan to compare extraction pat-
terns with automatically derived n-grams from
previously gene-indexed sentences, in order to
 nd which features are best suited for itera-
tive bootstrapping to create new extraction pat-
terns.
5 Acknowledgements
This research was sponsored by DOC, doctoral
scholarship of the Austrian Academy of Sciences
and the ORIEL (IST-2001-32688) and TEM-
BLOR (QLRT-2001-00015) projects. We are
grateful to R. Ho mann for providing the  l-
tering set of gene-indexed sentences.

References
E. Agichtein and L. Gravano. 2000. Snowball:
Extracting relations from large plain-text col-
lections. Proc. 5th ACM International Con-
ference on Digital Libraries., pages 85{94.
C. Blaschke and A. Valencia. 2001. The poten-
tial use of SUISEKI as a protein interaction
discovery tool. Genome Inform Ser Work-
shop Genome Inform., 12:123{134.
C. Blaschke, A. Andrade, M, C. Ouzounis,
and A. Valencia. 1999. Automatic extrac-
tion of biological information from scienti c
text: protein-protein interactions. Proc Int
Conf Intell Syst Mol Biol., pages 60{67.
V. Hatzivassiloglou and W. Weng. 2002. Learn-
ing anchor verbs for biological interaction
patterns from published text articles. Int J
Med Inf., 67:19{32.
R. Ho mann and A. Valencia. 2003. Life cycles
of successful genes. Trends Genet, 19:79{81.
M. Krallinger and M. Padron. 2004. Prediction
of GO annotation by combining entity spe-
ci c sentence sliding window pro les. Proc.
BioCreative Challenge Evaluation Workshop
2004.
M. Krauthammer, A. Rzhetsky, P. Morozov,
and C. Friedman. 2000. Using BLAST for
identifying gene and protein names in journal
articles. Gene, 259:245{252.
D. Proux, F. Rechenmann, L. Julliard, V.V. Pil-
let, and B. Jacq. 1998. Detecting Gene Sym-
bols and Names in Biological Texts: A First
Step toward Pertinent Information Extrac-
tion. Genome Inform Ser Workshop Genome
Inform, 9:72{80.
E. Rilo . 1993. Automatically Constructing a
Dictionary for Information Extraction Tasks.
Proceedings of the Eleventh National Confer-
ence on Arti cial Intelligence., pages 811{
816.
T. Sekimizu, H.S. Park, and J. Tsujii. 1998.
Identifying the Interaction between Genes
and Gene Products Based on Frequently Seen
Verbs in Medline Abstracts. Genome Inform
Ser Workshop Genome Inform., 9:62{71.
L. Tanabe and W.J. Wilbur. 2002. Tagging
gene and protein names in biomedical text.
Bioinformatics, 18:1124{1132.
H. Yu, V. Hatzivassiloglou, C. Friedman,
A. Rzhetsky, and W.J. Wilbur. 2002. Au-
tomatic Extraction of Gene and Protein Syn-
onyms from MEDLINE and Journal Articles.
Proc AMIA Symp., pages 919{23.
