Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 732–739, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Multi-way Relation Classification:
Application to Protein-Protein Interactions
Barbara Rosario
SIMS
UC Berkeley
Berkeley, CA 94720
rosario@sims.berkeley.edu
Marti A. Hearst
SIMS
UC Berkeley
Berkeley, CA 94720
hearst@sims.berkeley.edu
Abstract
We address the problem of multi-way re-
lation classification, applied to identifica-
tion of the interactions between proteins
in bioscience text. A major impediment
to such work is the acquisition of appro-
priately labeled training data; for our ex-
periments we have identified a database
that serves as a proxy for training data.
We use two graphical models and a neu-
ral net for the classification of the inter-
actions, achieving an accuracy of 64%
for a 10-way distinction between relation
types. We also provide evidence that exploiting
the sentences surrounding a citation
to a paper can yield higher accuracy
than using other sentences.
1 Introduction
Identifying the interactions between proteins is one
of the most important challenges in modern ge-
nomics, with applications throughout cell biology,
including expression analysis, signaling, and ratio-
nal drug design. Most biomedical research and
new discoveries are available electronically but only
in free text format, so automatic mechanisms are
needed to convert text into more structured forms.
The goal of this paper is to address this difficult
and important task, the extraction of the interactions
between proteins from free text. We use graphical
models and a neural net that were found to achieve
high accuracy in the related task of extracting the
relation types that might hold between the entities
“treatment” and “disease” (Rosario and Hearst, 2004).
Labeling training and test data is time-consuming
and subjective. Here we report on results using an
existing curated database, the HIV-1 Human Protein
Interaction Database1, to train and test the classifica-
tion system. The accuracies obtained by the classi-
fication models proposed are quite high, confirming
the validity of the approach. We also find support
for the hypothesis that the sentences surrounding ci-
tations are useful for extraction of key information
from technical articles (Nakov et al., 2004).
In the remainder of this paper we discuss related
work, describe the dataset, and show the results of
the algorithm on documents and sentences.
2 Related work
There has been little work in general NLP on trying
to identify different relations between entities. Many
papers that claim to be doing relationship recogni-
tion in actuality address the task of role extraction:
(usually two) entities are identified and the relation-
ship is implied by the co-occurrence of these enti-
ties or by some linguistic expression (Agichtein and
Gravano, 2000; Zelenko et al., 2002).
The ACE competition2 has a relation recognition
subtask, but assumes a particular type of relation
holds between particular entity types (e.g., if the two
entities in question are an EMP and an ORG, then
an employment relation holds between them; which
type of employment relation depends on the type of
entity, e.g., staff person vs partner).
1www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/index.html
2http://www.itl.nist.gov/iaui/894.01/tests/ace/
In the BioNLP literature there have recently
been a number of attempts to automatically extract
protein-protein interactions from PubMed abstracts.
Some approaches simply report that a relation exists
between two proteins but do not determine which
relation holds (Bunescu et al., 2005; Marcotte et al.,
2001; Ramani et al., 2005), while most others start
with a list of interaction verbs and label only those
sentences that contain these trigger words (Blaschke
and Valencia, 2002; Blaschke et al., 1999; Rind-
flesch et al., 1999; Thomas et al., 2000; Sekimizu et
al., 1998; Ahmed et al., 2005; Phuong et al., 2003;
Pustejovsky et al., 2002). However, as Marcotte et
al. (2001) note, “... searches for abstracts contain-
ing relevant keywords, such as interact*, poorly dis-
criminate true hits from abstracts using the words
in alternate senses and miss abstracts using different
language to describe the interactions.”
Most of the existing methods also suffer from low
recall because they use hand-built specialized tem-
plates or patterns (Ono et al., 2001; Corney et al.,
2004). Some systems use link grammars in conjunc-
tion with trigger verbs instead of templates (Ahmed
et al., 2005; Phuong et al., 2003). Every paper eval-
uates on a different test set, and so it is quite difficult
to compare systems.
In this paper, we use state-of-the-art machine
learning methods to determine the interaction types
and to extract the proteins involved. We do not use
trigger words, templates, or dictionaries.
3 Data
We use the information from a domain-specific
database to gather labeled data for the task of classi-
fying the interactions between proteins in text. The
manually-curated HIV-1 Human Protein Interaction
Database provides a summary of documented inter-
actions between HIV-1 proteins and host cell pro-
teins, other HIV-1 proteins, or proteins from disease
organisms associated with HIV or AIDS. We use this
database also because it contains information about
the type of interactions, as opposed to other protein
interaction databases (BIND, MINT, DIP, for exam-
ple3) that list the protein pairs interacting, without
3DIP lists only the protein pairs, BIND has only some in-
formation about the method used to provide evidence for the
interaction, and MINT does have interaction type information
but the vast majority of the entries (99.9% of the 47,000 pairs)
Interaction #Triples Interaction #Triples
Interacts with 1115 Complexes with 45
Activates 778 Modulates 43
Stimulates 659 Enhances 41
Binds 647 Stabilizes 34
Upregulates 316 Myristoylated by 34
Imported by 276 Recruits 32
Inhibits 194 Ubiquitinated by 29
Downregulates 124 Synergizes with 28
Regulates 86 Co-localizes with 27
Phosphorylates 81 Suppresses 24
Degrades 73 Competes with 23
Induces 52 Requires 22
Inactivates 51
Table 1: Number of triples for the most common
interactions of the HIV-1 database, after removing
the distinction in directionality and the triples with
more than one interaction.
specifying the type of interactions.
In this database, the definitions of the interactions
depend on the proteins involved and the articles de-
scribing the interactions; thus there are several def-
initions for each interaction type. For the interac-
tion bind and the proteins ANT and Vpr, we find
(among others) the definition “Interaction of HIV-
1 Vpr with human adenine nucleotide translocator
(ANT) is presumed based on a specific binding in-
teraction between Vpr and rat ANT.”
The database contains 65 types of interactions and
809 proteins for which there is interaction informa-
tion, for a total of 2224 pairs of interacting proteins.
For each documented protein-protein interaction the
database includes information about:
• A pair of proteins (PP),
• The interaction type(s) between them (I), and
• PubMed identification numbers of the journal
article(s) describing the interaction(s) (A).
A protein pair PP can have multiple interactions
(for example, AIP1 binds to HIV-1 p6 and also is
incorporated into it) for an average of 1.9 interactions
per PP and a maximum of 23 interactions for the
pair CDK9 and tat p14.
We refer to the combination of a protein pair PP
and an article A as a “triple.” Our goal is to
automatically associate to each triple an interaction
have been assigned the same type of interaction (aggregation).
These databases are all manually curated.
type. For the example above, the triple “AIP1 HIV-
1-p6 14519844” is assigned the interaction binds
(14519844 being the PubMed number of the paper
providing evidence for this interaction)4.
Journal articles can contain evidence for multi-
ple interactions: there are 984 journal articles in the
database and on average each article is reported to
contain evidence for 5.9 triples (with a maximum
number of 90 triples).
In some cases the database reports multiple dif-
ferent interactions for a given triple. There are
5369 unique triples in the database and of these 414
(7.7%) have multiple interactions. We exclude these
triples from our analysis; however, we do include articles
and PPs with multiple interactions. In other
words, we tackle cases such as the example above
of the pair AIP1, HIV-1-p6 (that can both bind and
incorporate) as long as the evidence for the different
interactions is given by two different articles.
Some of the interactions differ only in the direc-
tionality (e.g., regulates and regulated by, inhibits
and inhibited by, etc.); we collapsed these pairs of
related interactions into one5. Table 1 shows the
list of the 25 interactions of the HIV-1 database for
which there are more than 10 triples.
For these interactions and for a random subset of
the protein pairs PP (around 45% of the total pairs
in the database), we downloaded the corresponding
full-text papers. From these, we extracted all and
only those sentences that contain both proteins from
the indicated protein pair. We assigned each of these
sentences the corresponding interaction I from the
database (“papers”).
Nakov et al. (2004) argue that the sentences sur-
rounding citations to related work, or citances, are a
useful resource for bioNLP. Building on that work,
we use citances as an additional form of evidence
to determine protein-protein interaction types. For a
given database entry containing PubMed article A,
4To be precise, there are for this PP (as there often are)
multiple articles (three in this case) describing the interaction
binds, thus we have the following three triples to which we
associate binds: “AIP1 HIV-1-p6 14519844,” “AIP1 HIV-1-p6
14505570” and “AIP1 HIV-1-p6 14505569.”
5We collapsed these pairs because the directionality of the
interactions was not always reliable in the database. This implies
that for some interactions, we are not able to infer the different
roles of the two proteins; we considered only the pair
“prot1 prot2” or “prot2 prot1,” not both. However, our algorithm
can detect which proteins are involved in the interactions.
protein pair PP, and interaction type I, we downloaded
a subset of the papers that cite A. From these
citing papers, we extracted all and only those sentences
that mention A explicitly; we further filtered
these to include all and only the sentences that contain
PP. We labeled each of these sentences with
interaction type I (“citances”).
There are often many different names for the same
protein. We use LocusLink6 protein identification
numbers and synonym names for each protein, and
extract the sentences that contain an exact match for
(some synonym of) each protein. By being conser-
vative with protein name matching, and by not doing
co-reference analysis, we miss many candidate sen-
tences; however this method is very precise.
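This conservative matching step can be sketched roughly as follows; the synonym sets and the simple substring test are illustrative assumptions, not the actual LocusLink-based matching code:

```python
# Rough sketch of the conservative sentence-extraction step: keep a
# sentence only if it contains an exact match for some synonym of each
# protein of the pair. The synonym sets are invented examples, and the
# simple substring test stands in for the actual matching code.
def mentions(sentence, synonyms):
    """True if the sentence contains any synonym verbatim."""
    return any(name in sentence for name in synonyms)

def extract_sentences(sentences, synonyms_p1, synonyms_p2):
    """Keep only sentences mentioning both proteins of the pair."""
    return [s for s in sentences
            if mentions(s, synonyms_p1) and mentions(s, synonyms_p2)]

sents = [
    "Vpr binds the adenine nucleotide translocator (ANT).",
    "Vpr localizes to the nucleus.",
]
print(extract_sentences(sents, {"Vpr"},
                        {"ANT", "adenine nucleotide translocator"}))
```

As in the text, such matching is precise but misses sentences that use coreference or unlisted name variants.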
On average, for “papers,” we extracted 0.5 sen-
tences per triple (maximum of 79) and 50.6 sen-
tences per interaction (maximum of 119); for “ci-
tances” we extracted 0.4 sentences per triple (with a
maximum of 105) and 49.2 sentences per interaction
(162 maximum). We required a minimum number
(40) of sentences for each interaction type for both
“papers” and “citances”; the 10 interactions of Table
2 met this requirement. We used these sentences to
train and test the models described below7.
Since all the sentences extracted from one triple
are assigned the same interaction, we ensured that
sentences from the same triple did not appear in both
the testing and the training sets. Roughly 75% of the
data were used for training and the rest for testing.
As mentioned above, the goal is to automatically
associate to each triple an interaction type. The task
tackled here is actually slightly more difficult: given
some sentences extracted from article A, assign to
A an interaction type I and extract the proteins PP
involved. In other words, for the purpose of classification,
we act as if we do not have information
about the proteins that interact. However, given the
way the sentence extraction was done, all the sentences
extracted from A contain the PP.
6LocusLink was recently integrated into En-
trez Gene, a unified query environment for genes
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene).
7We also looked at larger chunks of text, in particular, we
extracted the sentence containing the PP along with the
previous and the following sentences, and the three consecutive
sentences that contained the PP (the proteins could appear in
any of the sentences). However, the results obtained by using
these larger chunks were consistently worse.
Interaction Papers Citances
Degrades 60 63
Synergizes with 86 101
Stimulates 103 64
Binds 98 324
Inactivates 68 92
Interacts with 62 100
Requires 96 297
Upregulates 119 98
Inhibits 78 84
Suppresses 51 99
Total 821 1322
Table 2: Number of interaction sentences extracted.
Figure 1: Dynamic graphical model (DM) for pro-
tein interaction classification (and role extraction).
A hand-assessment of the individual sentences
shows that not every sentence that mentions the target
proteins PP actually describes the interaction I
(see Section 5.4). Thus the evaluation on the test set
is done at the document level (to determine if the
algorithm can predict the interaction that a curator
would assign to a document as a whole given the
protein pair).
Note that we assume here that the papers that pro-
vide the evidence for the interactions are given – an
assumption not usually true in practice.
4 Models
For assigning interactions, we used two generative
graphical models and a discriminative model. Fig-
ure 1 shows the generative dynamic model, based
on previous work on role and relation extraction
(Rosario and Hearst, 2004) where the task was to ex-
tract the entities TREATMENT and DISEASE and
the relationships between them. The nodes labeled
“Role” represent the entities (in this case the choices
are PROTEIN and NULL); the children of the role
nodes are the words (which act as features), thus
there are as many role states as there are words in the
sentence; this model consists of a Markov sequence
of states where each state generates one or multiple
observations. This model makes the additional as-
sumption that there is an interaction present in the
sentence (represented by the node “Inter.”) that gen-
erates the role sequence and the observations. (We
assume here that there is a single interaction for each
sentence.) The “Role” nodes can be observed or
hidden. The results reported here were obtained us-
ing only the words as features (i.e., in the dynamic
model of Figure 1 there is only one feature node
per role) and with the “Role” nodes hidden (i.e., we
had no information regarding which proteins were
involved). Inference is performed with the junction
tree algorithm8.
We used a second type of graphical model, a sim-
ple Naive Bayes, in which the node representing the
interaction generates the observable features (all the
words in the sentence). We did not include role in-
formation in this model.
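A minimal sketch of such a Naive Bayes classifier, assuming bag-of-words features; the add-one smoothing and the toy training data are illustrative stand-ins (the paper uses absolute discounting):

```python
# Minimal sketch of the Naive Bayes model described above: the node
# for the interaction generates all the words of the sentence. The
# add-one smoothing here is an illustrative stand-in for the paper's
# absolute discounting, and the toy data are invented.
import math
from collections import Counter, defaultdict

def train_nb(labeled_sentences):
    """labeled_sentences: list of (list_of_words, interaction_label)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for words, label in labeled_sentences:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify_nb(words, word_counts, label_counts, vocab):
    """Return the label maximizing log P(label) + sum log P(word|label)."""
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [
    (["tat", "degrades", "the", "protein"], "degrades"),
    (["vpr", "binds", "ant"], "binds"),
    (["nef", "binds", "cd4"], "binds"),
]
model = train_nb(data)
print(classify_nb(["tat", "degrades", "cd4"], *model))
```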
We defined joint probability distributions over
these models, estimated using maximum likelihood
on the training set with a simple absolute discount-
ing smoothing method. We performed 10-fold cross
validation on the training set and we chose the
smoothing parameters for which we obtained the
best classification accuracies (averaged over the ten
runs) on the training data; the results reported here
were obtained using these parameters on the held-
out test sets9.
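Absolute discounting for one class's word distribution can be sketched as follows; the discount value and the uniform redistribution of the freed mass over unseen words are assumptions about the exact variant used:

```python
# Illustrative sketch of absolute-discounting smoothing: subtract a
# fixed discount d from each seen count and redistribute the freed
# probability mass over unseen vocabulary words. The discount is the
# kind of parameter the paper tunes by cross validation.
def smoothed_probs(counts, vocab, d=0.5):
    """counts: word -> count for one class; d: discount, 0 < d <= min count."""
    total = sum(counts.values())
    seen = {w: c for w, c in counts.items() if c > 0}
    unseen = [w for w in vocab if w not in seen]
    freed = d * len(seen) / total          # probability mass freed by discounting
    probs = {w: (c - d) / total for w, c in seen.items()}
    if unseen:
        for w in unseen:                   # spread freed mass over unseen words
            probs[w] = freed / len(unseen)
    else:
        for w in probs:                    # nothing unseen: return the mass uniformly
            probs[w] += freed / len(probs)
    return probs

p = smoothed_probs({"binds": 3, "to": 1}, vocab={"binds", "to", "vpr"})
print(p)
```

The probabilities still sum to one: the discounted mass taken from seen words ("binds", "to") is exactly what the unseen word ("vpr") receives.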
In addition to these two generative models, we
also used a discriminative model, a neural network.
We used the Matlab package to train a feed-forward
network with conjugate gradient descent. The net-
work has one hidden layer, with a hyperbolic tangent
function, and an output layer representing the rela-
tionships. A logistic sigmoid function is used in the
output layer. The network was trained for several
choices of numbers of hidden units; we chose the
best-performing networks based on training set er-
ror. We then tested these networks on held-out test-
ing data. The features were words, the same as those
used for the graphical models.
8Using Kevin Murphy’s BNT package:
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html.
9We did not have enough data to require that the sentences in
the training and test sets of the cross validation procedure originate
from disjoint triples (they do originate from disjoint triples
in the final held-out data). This may result in a less than optimal
choice of the parameters for the aggregate measures described
below.
All Papers Citances
Mj Cf Mj Cf Mj Cf
DM 60.5 59.7 57.8 55.6 53.4 54.5
NB 58.1 61.3 57.8 55.6 55.7 54.5
NN 63.7 – 44.4 – 55.8 –
Key 20.1 – 24.4 – 20.4 –
KeyB 25.8 – 40.0 – 26.1 –
Base. 21.8 11.1 26.1
Table 3: Accuracies for classification of the 10
protein-protein interactions of Table 2. DM: dy-
namic model, NB: Naive Bayes, NN: neural net-
work. Baselines: Key: trigger word approach,
KeyB: trigger word with backoff, Base: the accu-
racy of choosing the most frequent interaction.
The task is the following: given a triple consisting
of a PP and an article, extract the sentences
from the article that contain both proteins. Then,
predict for the entire document one of the interactions
of Table 2 given the sentences extracted for
that triple. This is a 10-way classification problem,
which is significantly more complex than much of
the related work, in which the task is to make a binary
prediction (see Section 2).
5 Results
The evaluation was done on a document-by-
document basis. During testing, we choose the inter-
action using the following aggregate measures that
use the constraint that all sentences coming from the
same triple are assigned the same interaction.
• Mj: For each triple, for each sentence of the
triple, find the interaction that maximizes the
posterior probability of the interaction given
the features; then assign to all sentences of
this triple the most frequent interaction among
those predicted for the individual sentences.
• Cf: Retain all the conditional probabilities (do
not choose an interaction per sentence), then,
for each triple, choose the interaction that maximizes
the sum over all the triple’s sentences.
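As a concrete sketch, the two aggregation rules can be written as follows; the per-sentence posterior dictionaries and the interaction labels are invented toy values, not output of the paper's models:

```python
# Sketch of the Mj and Cf aggregation rules described above.
# Each sentence of a triple is assumed to come with a posterior
# distribution over interaction labels (a dict label -> probability).
from collections import Counter

def aggregate_mj(sentence_posteriors):
    """Mj: take the argmax label per sentence, then the most
    frequent label across the triple's sentences."""
    per_sentence = [max(p, key=p.get) for p in sentence_posteriors]
    return Counter(per_sentence).most_common(1)[0][0]

def aggregate_cf(sentence_posteriors):
    """Cf: sum the posteriors over all of the triple's sentences
    and pick the label with the largest total."""
    totals = Counter()
    for p in sentence_posteriors:
        totals.update(p)
    return max(totals, key=totals.get)

posts = [
    {"binds": 0.60, "inhibits": 0.40},
    {"binds": 0.20, "inhibits": 0.80},
    {"binds": 0.55, "inhibits": 0.45},
]
print(aggregate_mj(posts))  # per-sentence winners: binds, inhibits, binds
print(aggregate_cf(posts))  # summed mass: binds 1.35 vs inhibits 1.65
```

Note that the two rules can disagree, as in this toy example: Mj follows the per-sentence majority, while Cf lets one confident sentence outweigh two weak ones.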
Table 3 reports the results in terms of classifi-
cation accuracies averaged across all interactions,
for the cases “all” (sentences from “papers” and
“citances” together), only “papers” and only “ci-
tances”. The accuracies are quite high; the dy-
namic model achieves around 60% for “all,” 58%
for “papers” and 54% for “citances.” The neural
net achieves the best results for “all” with around
64% accuracy. From these results we can make the
following observations: all models greatly outper-
form the baselines; the performances of the dynamic
model DM, the Naive Bayes NB and the NN are very
similar; for “papers” the best results were obtained
with the graphical models; for “all” and “citances”
the neural net did best. The use of “citances” allowed
the gathering of additional data (and therefore
a larger training set) that led to higher accuracies
(see “papers” versus “all”).
In the confusion matrix in Table 4 we can see the
accuracies for the individual interactions for the dynamic
model DM, using “all” and “Mj.” For three
interactions this model achieves perfect accuracy.
5.1 Hiding the protein names
In order to ensure that the algorithm was not over-
fitting on the protein names, we ran an experiment
in which we replaced the protein names in all sen-
tences with the token “PROT NAME.” For example,
the sentence: “Selective CXCR4 antagonism by Tat”
became: “Selective PROT NAME2 antagonism by
PROT NAME1.”
Table 5 shows the results of running the models
on this data. For “papers” and “citances” there
is always a decrease in the classification accuracy
when we remove the protein names, showing that
the protein names do help the classification. The
differences in accuracy in the two cases using “citances”
are much smaller than the differences using
“papers,” at least for the graphical models. This suggests
that citation sentences may be more robust for
some language processing tasks and that the models
that use “citances” better learn the linguistic context
of the interactions. Note how in this case the graphical
models always outperform the neural network.
5.2 Using a “trigger word” approach
As mentioned above, much of the related work in
this field makes use of “trigger words” or “interac-
tion words” (see Section 2). In order to (roughly)
compare our work and to build a more realistic base-
line, we created a list of 70 keywords that are repre-
Prediction Acc.
Truth D SyW St B Ina IW R Up Inh Su (%)
Degrades (D) 5 0 0 0 0 0 0 0 0 0 100.0
Synergizes with (SyW) 0 1 0 0 0 1 0 3 3 0 12.5
Stimulates (St) 0 0 4 0 0 0 6 0 1 0 36.4
Binds (B) 0 0 0 18 0 4 1 1 3 0 66.7
Inactivates (Ina) 0 0 0 0 9 0 0 0 0 0 100.0
Interacts with (IW) 0 0 4 3 0 5 1 0 1 2 31.2
Requires (R) 0 0 0 0 0 3 3 0 1 1 37.5
Upregulates (Up) 0 0 0 2 1 0 0 12 2 0 70.6
Inhibits (Inh) 0 0 0 3 0 0 1 1 12 0 70.6
Suppresses (Su) 0 0 0 0 0 0 0 0 0 6 100.0
Table 4: Confusion matrix for the dynamic model DM for “all,” “Mj.” The overall accuracy is 60.5%. The
numbers indicate the number of articles A (each paper has several relevant sentences).
All Papers Citances
Mj Cf Diff Mj Cf Diff Mj Cf Diff
DM 60.5 60.5 0.7% 44.4 40.0 -25.6% 52.3 53.4 -2.0%
NB 59.7 59.7 0.1% 46.7 51.1 -11.7% 53.4 53.4 -3.1%
NN 51.6 -18.9% 44.4 0% 50.0 -10.4%
Table 5: Accuracies for the classification of the 10 protein-protein interactions of Table 2 with the protein
names removed. Columns marked Diff show the difference in accuracy (in percentages) with respect to the
original case of Table 3, averaged over all evaluation methods.
sentative of the 10 interactions. For example, for
the interaction degrade some of the keywords are
“degradation,” “degrade,” for inhibit we have “inhib-
ited,” “inhibitor,” “inhibitory” and others. We then
checked whether a sentence contained such key-
words. If it did, we assigned to the sentence the
corresponding interaction. If it contained more than
one keyword corresponding to multiple interactions
consisting of the generic interact with plus a more
specific one, we assigned the more specific interac-
tion; if the two predicted interactions did not include
interact with but two more specific interactions, we
did not assign an interaction, since we wouldn’t
know how to choose between them. Similarly, we
assigned no interaction if there were more than two
predicted interactions or no keywords present in the
sentence. The results are shown in the rows labeled
“Key” and “KeyB” in Table 3. Case “KeyB” is the
“Key” method with back-off: when no interaction
was predicted, we assigned to the sentence the most
frequent interaction in the training data. As before,
we calculated the accuracy when we force all the
sentences from one triple to be assigned the most
frequent interaction among those predicted for the
individual sentences.
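A minimal sketch of this keyword rule; the keyword lists are invented stand-ins for the authors' 70-word list:

```python
# Sketch of the "Key" trigger-word baseline described above. The
# keyword lists are invented examples, not the authors' actual list.
KEYWORDS = {
    "degrades": {"degradation", "degrade", "degrades"},
    "inhibits": {"inhibited", "inhibitor", "inhibitory", "inhibits"},
    "interacts with": {"interact", "interacts", "interaction"},
}

def key_baseline(sentence, keywords=KEYWORDS):
    """Return the predicted interaction for a sentence, or None when
    the trigger words are absent or ambiguous."""
    tokens = set(sentence.lower().split())
    hits = {rel for rel, words in keywords.items() if tokens & words}
    if len(hits) == 1:
        return hits.pop()
    if len(hits) == 2 and "interacts with" in hits:
        hits.discard("interacts with")  # prefer the more specific relation
        return hits.pop()
    return None  # no keyword, two specific relations, or more than two

print(key_baseline("Vpr is an inhibitor of this interaction"))
```

The KeyB variant simply backs off to the most frequent training-set interaction whenever this rule returns no prediction.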
KeyB is more accurate than Key, and although
the KeyB accuracies are higher than the other baselines,
they are significantly lower than those obtained
with the trained models. The low accuracies
of the trigger-word based methods show that the relation
classification task is nontrivial, in the sense
that not all the sentences contain the most obvious
word for the interactions, and suggest that the trigger
word approach is insufficient.
5.3 Protein extraction
The dynamic model of Figure 1 has the appealing
property of simultaneously performing interaction
recognition and protein name tagging (also known
as role extraction): the task consists of identifying
all the proteins present in the sentence, given a se-
quence of words. We assessed a slightly different
task: the identification of all (and only) the proteins
present in the sentence that are involved in the inter-
action.
The F-measure10 achieved by this model for this
task is 0.79 for “all,” 0.67 for “papers” and 0.79 for
“citances”; again, the model parameters were cho-
sen with cross validation on the training set, and “ci-
10The F-measure is a weighted combination of precision and
recall. Here, precision and recall are given equal weight, that is,
F-measure = (2 × Precision × Recall)/(Precision + Recall).
tances” had superior performance. Note that we did
not use a dictionary: the system learned to recog-
nize the protein names using only the training data.
Moreover, our role evaluation is quite strict: every
token is assessed and we do not assign partial credit
for constituents for which only some of the words
are correctly labeled. We also did not use the in-
formation that all the sentences extracted from one
triple contain the same proteins.
Given these strong results (both F-measure and
classification accuracies), we believe that the dynamic
model of Figure 1 is a good model for performing
both name tagging and interaction classification
simultaneously, or either of these tasks alone.
5.4 Sentence-level evaluation
In addition to assigning interactions to protein pairs,
we are interested in sentence-level semantics, that
is, in determining the interactions that are actually
expressed in the sentence. To test whether the infor-
mation assigned to the entire document by the HIV-
1 database record can be used to infer information
at the sentence level, an annotator with biological
expertise hand-annotated the sentences from the ex-
periments. The annotator was instructed to assign
to each sentence one of the interactions of Table 2,
“not interacting,” or “other” (if the interaction be-
tween the two proteins was not one of Table 2).
Of the 2114 sentences that were hand-labeled,
68.3% of them disagreed with the HIV-1 database la-
bel, 28.4% agreed with the database label, and 3.3%
were found to contain multiple interactions between
the proteins. Among the 68.3% of the sentences
for which the labels did not agree, 17.4% had the
vague interact with relation, 7.4% did not contain
any interaction and 43.5% had an interaction differ-
ent from that specified by the triple11. In Table 6
we report the mismatch between the two sets of la-
bels. The total accuracy of 38.9%12 provides a use-
ful baseline for using a database for the labeling at
the sentence level. It may be the case that certain
interactions tend to be biologically related and thus
11For 28% of the triples, none of the sentences extracted from
the target paper were found by the annotator to contain the in-
teraction given by the database. We read four of these papers
and found sentences containing that interaction, but our system
had failed to extract them.
12The accuracy without the vague interact with is 49.4%.
All Papers Citan.
DM 48.9 28.9 47.9
NB 47.1 33.3 53.4
NN 52.9 36.7 63.2
Key 30.5 18.9 38.3
KeyB 46.2 36.3 52.6
Base 36.3 34.4 37.6
Table 7: Classification accuracies when the models
are trained and tested on the hand labeled sentences.
tend to co-occur (upregulate and stimulate or inacti-
vate and inhibit, for example).
We investigated a few of the cases in which the
labels were “suspiciously” different, for example a
case in which the database interaction was stimulate
but the annotator found the same proteins to be related
by inhibit as well. It turned out that the authors
of the article assigned stimulate found little evidence
for this interaction (favoring inhibit instead), suggesting
an error in the database. In another case the database
interaction was require but the authors of the article,
while supporting this, found that under certain con-
ditions (when a protein is too abundant) the interac-
tion changes to one of inhibit. Thus we were able
to find controversial facts about protein interactions
just by looking at the confusion matrix of Table 6.
We trained the models using these hand-labeled
sentences in order to determine the interaction ex-
pressed for each sentence (as opposed to for each
document). This is a difficult task; for some sen-
tences it took the annotator several minutes to un-
derstand them and decide which interaction applied.
Table 7 shows the results on running the classi-
fication models on the six interactions for which
there were more than 40 examples in the training
sets. Again, the sentences from “papers” are espe-
cially difficult to classify; the best result for “papers”
is 36.7% accuracy versus 63.2% accuracy for “ci-
tances.” In this case the difference in performance
of “papers” and “citances” is larger than for the pre-
vious task of document-level relation classification.
6 Conclusions
We tackled an important and difficult task, the clas-
sification of different interaction types between pro-
teins in text. A solution to this problem would
have an impact on a variety of important challenges
in modern biology. We used a protein-interaction
Annotator
Database D SyW St B Ina R Up Inh Su IW Ot No
Degrades (D) 44 0 2 5 6 5 2 0 23 9 11 6
Synergizes with (SyW) 0 78 3 14 0 13 8 0 0 26 31 11
Stimulates (St) 0 5 23 12 0 8 7 5 1 26 60 18
Binds (B) 0 6 9 118 0 25 8 10 1 129 77 22
Inactivates (Ina) 0 0 4 25 0 2 4 33 6 14 27 11
Requires (R) 0 5 29 20 0 63 8 54 0 85 80 33
Upregulates (Up) 0 4 24 0 0 0 124 2 0 21 32 4
Inhibits (Inh) 0 8 4 8 2 2 2 43 9 24 37 19
Suppresses (Su) 3 0 0 1 5 0 0 42 34 33 24 4
Interacts with (IW) 0 1 5 28 1 12 6 1 1 49 27 28
Accuracy 93.6 72.9 22.3 51.1 0 48.5 73.4 22.7 45.3 11.8
Table 6: Confusion matrix comparing the hand-assigned interactions and those extracted from the HIV-1
database. Ot: sentences for which the annotator found an interaction different from those in Table 2. No:
sentences for which the annotator found no interaction. The bottom row shows the accuracy of using the
database to label the individual sentences.
database to automatically gather labeled data for this
task, and implemented graphical models that can
simultaneously perform protein name tagging and
relation identification, achieving high accuracy on
both problems. We also found evidence support-
ing the hypothesis that citation sentences are a good
source of training data, most likely because they pro-
vide a concise and precise way of summarizing facts
in the bioscience literature.
Acknowledgments. We thank Janice Hamer for her
help in labeling examples and other biological in-
sights. This research was supported by a grant from
NSF DBI-0317510 and a gift from Genentech.
References
E. Agichtein and L. Gravano. 2000. Snowball: Extracting rela-
tions from large plain-text collections. Proc. of DL ’00.
S. Ahmed, D. Chidambaram, H. Davulcu, and C. Baral. 2005.
Intex: A syntactic role driven protein-protein interaction ex-
tractor for bio-medical text. In Proceedings ISMB/ACL Bi-
olink 2005.
C. Blaschke and A. Valencia. 2002. The frame-based module
of the suiseki information extraction system. IEEE Intelli-
gent Systems, 17(2).
C. Blaschke, M.A. Andrade, C. Ouzounis, and A. Valencia.
1999. Automatic extraction of biological information from
scientific text: Protein-protein interactions. Proc. of ISMB.
R. Bunescu, R. Ge, R. Kate, E. Marcotte, R. J. Mooney, A. K.
Ramani, and Y. W. Wong. 2005. Comparative experiments
on learning information extractors for proteins and their
interactions. Artificial Intelligence in Medicine, 33(2).
D. Corney, B. Buxton, W. Langdon, and D. Jones. 2004. Bio-
rat: extracting biological information from full-length pa-
pers. Bioinformatics, 20(17).
E. Marcotte, I. Xenarios, and D. Eisenberg. 2001. Mining liter-
ature for protein-protein interactions. Bioinformatics, 17(4).
P. Nakov, A. Schwartz, and M. Hearst. 2004. Citances: Cita-
tion sentences for semantic analysis of bioscience text. In
Proceedings of the SIGIR’04 workshop on Search and Dis-
covery in Bioinformatics.
T. Ono, H. Hishigaki, A. Tanigami, and T. Takagi. 2001. Auto-
mated extraction of information on protein-protein interac-
tions from the biological literature. Bioinformatics, 17(1).
T. Phuong, D. Lee, and K-H. Lee. 2003. Learning rules to ex-
tract protein interactions from biomedical text. In PAKDD.
J. Pustejovsky, J. Castano, and J. Zhang. 2002. Robust rela-
tional parsing over biomedical literature: Extracting inhibit
relations. Proc. of Pac Symp Biocomputing.
C. Ramani, E. Marcotte, R. Bunescu, and R. Mooney. 2005.
Using biomedical literature mining to consolidate the set of
known human protein-protein interactions. In Proceedings
ISMB/ACL Biolink 2005.
T. Rindflesch, L. Hunter, and L. Aronson. 1999. Mining molec-
ular binding terminology from biomedical text. Proceedings
of the AMIA Symposium.
Barbara Rosario and Marti A. Hearst. 2004. Classifying se-
mantic relations in bioscience texts. In Proc. of ACL 2004.
T. Sekimizu, H.S. Park, and J. Tsujii. 1998. Identifying the
interaction between genes and gene products based on fre-
quently seen verbs in medline abstracts. Gen. Informat., 9.
J. Thomas, D. Milward, C. Ouzounis, and S. Pulman. 2000.
Automatic extraction of protein interactions from scientific
abstracts. Proceedings of the Pac Symp Biocomput.
D. Zelenko, C. Aone, and A. Richardella. 2002. Kernel meth-
ods for relation extraction. Proceedings of EMNLP 2002.
