Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL 06, pages 104–111,
New York City, June 2006. c©2006 Association for Computational Linguistics
Using Dependency Parsing and Probabilistic Inference to Extract Rela-
tionships between Genes, Proteins and Malignancies Implicit Among 
Multiple Biomedical Research Abstracts 
 
Ben Goertzel Hugo Pinto Ari Heljakka 
Applied Research Lab for National 
and Homeland Security 
Novamente LLC Novamente LLC 
Virginia Tech 1405 Bernerd Place 1405 Bernerd Place 
Arlington VA 22216 
 
Rockville MD 20851 Rockville MD 20851 
ben@goertzel.org hugo@vettalabs.com heljakka@iki.fi 
 
Izabela Freire Goertzel Mike Ross Cassio Pennachin 
Novamente LLC SAIC Novamente LLC 
1405 Bernerd Place 5971 Kingstowne Village Parkway 1405 Bernerd Place 
Rockville MD 20851 Kingstowne, VA 22315 Rockville MD 20851 
izabela@goertzel.org miross@objectsciences.com cassio@vettalabs.com
 
 
Abstract 
We describe BioLiterate, a prototype software 
system which infers relationships involving re-
lationships between genes, proteins and ma-
lignancies from research abstracts, and has ini-
tially been tested in the domain of the molecu-
lar genetics of oncology. The architecture uses 
a natural language processing module to ex-
tract entities, dependencies and simple seman-
tic relationships from texts, and then feeds 
these features into a probabilistic reasoning 
module which combines the semantic relation-
ships extracted by the NLP module to form 
new semantic relationships.  One application 
of this system is the discovery of relationships 
that are not contained in any individual ab-
stract but are implicit in the combined knowl-
edge contained in two or more abstracts.  
1 Introduction 
Biomedical literature is growing at a breakneck 
pace, making the task of remaining current with all 
discoveries relevant to a given research area nearly 
impossible without the use of advanced NLP-based 
tools (Jensen et al, 2006). Two classes of tools that 
provide great value in this regard are those that 
help researchers find relevant documents and sen-
tences in  large bodies of biomedical texts (Müller, 
2004; Schuler, 1996; Tanabe, 1999), and those that 
automatically extract knowledge from a set of 
documents (Smalheiser and Swanson, 1998; 
Rzhetsky et al, 2004). Our work falls into the latter 
category.  We have created a prototype software 
system called BioLiterate, which applies depend-
ency parsing and advanced probabilistic inference 
to the problem of combining semantic relationships 
extracted from biomedical texts, have tested this 
system via experimentation on research abstracts in 
the domain of the molecular genetics of oncology. 
In order to concentrate our efforts on the infer-
ence aspect of biomedical text mining, we have 
built our BioLiterate system on top of a number of 
general NLP and specialized bioNLP components 
created by others.  For example, we have handled 
entity extraction -- perhaps the most mature exist-
ing bioNLP technology (Kim, 2004) -- via incorpo-
rating a combination of existing open-source tools.  
And we have handled syntax parsing via integrat-
104
ing a modified version of the link parser (Sleator 
and Temperley, 1992). 
 The BioLiterate system is quite general in ap-
plicability, but in our work so far we have focused 
on the specific task of extracting relationships re-
garding interactions between genes, proteins and 
malignancies contained in, or implicit among mul-
tiple, biomedical research abstracts.  This applica-
tion is critical because the extraction of pro-
tein/gene/disease relationships from text is neces-
sary for the discovery of metabolic pathways and 
non-trivial disease causal chains, among other ap-
plications (Nédellec, 2005; Davulcu, 2005, Ah-
med, 2005).   
Systems extracting these sorts of relationships 
from text have been developed using a variety of 
technologies, including support vector machines 
(Donaldson et al, 2003), maximum entropy models 
and graph algorithms (McDonald, 2005), Markov 
models and first order logic (Riedel, 2005) and 
finite state automata (Hakenberg, 2005).   How-
ever, these systems are limited in the relationships 
that they can extract.  Most of them focus on rela-
tionships described in single sentences.  The results 
we report here support the hypothesis that the 
methods embodied in BioLiterate, when developed 
beyond the prototype level and implemented in a 
scalable way, may be significantly more powerful, 
particularly in the extraction of relationships whose 
textual description exists in multiple sentences or 
multiple documents. 
Overall, the extraction of both entities and sin-
gle-sentence-embodied inter-entity relationships 
has proved far more difficult in the biomedical 
domain than in other domains such as newspaper 
text (Nédellec, 2005; Jing et al, 2003; Pyysalo, 
2004).  One reason for this is the lack of resources, 
such as large tagged corpora, to allow statistical 
NLP systems to perform as well as in the news 
domain. Another is that biomedical text has many 
features that are quite uncommon or even non-
existent in newspaper text (Pyysalo, 2004), such as 
numerical post-modifiers of nouns (Serine 38), 
non-capitalized entity names (…ftsY is solely ex-
pressed during...), hyphenated verbs (X cross-links 
Y), nominalizations, and uncommon usage of pa-
rentheses (sigma(H)-dependent expression of 
spo0A).  While recognizing the critical importance 
of overcoming these issues more fully, we have not 
addressed them in any novel way in the context of 
our work on BioLiterate, but have rather chosen to 
focus attention on the other end of the pipeline: 
using inference to piece together relationships ex-
tracted from separate sentences, to construct new 
relationships implicit among multiple sentences or 
documents.  
The BioLiterate system incorporates three main 
components: an NLP system that outputs entities, 
dependencies and basic semantic relations; a prob-
abilistic reasoning system (PLN = Probabilistic 
Logic Networks); and a collection of hand-built 
semantic mapping rules used to mediate between 
the two prior components. 
  One of the hypotheses underlying our work is 
that the use of probabilistic inference in a bioNLP 
context may allow the capturing of relationships 
not covered by existing systems, particularly those 
that are implicit or spread among several abstracts.  
This application of BioLiterate is reminiscent of 
the Arrowsmith system (Smalheiser and Swanson, 
1998), which is focused on creating novel bio-
medical discoveries via combining pieces of in-
formation from different research texts; however, 
Arrowsmith is oriented more toward guiding hu-
mans to make discoveries via well-directed litera-
ture search, rather than more fully automating the 
discovery process via unified NLP and inference. 
Our work with the BioLiterate prototype has 
tentatively validated this hypothesis via the pro-
duction of interesting examples, e.g. of conceptu-
ally straightforward deductions combining prem-
ises contained in different research papers.
1
  Our 
future research will focus on providing more sys-
tematic statistical validation of this hypothesis. 
2 System Overview 
For the purpose of running initial experiments 
with the BioLiterate system, we restricted our at-
tention to texts from the domain of molecular ge-
netics of oncology, mostly selected from the Pub-
MEd subset selected for the PennBioNE project 
(Mandel, 2006).  Of course, the BioLiterate archi-
tecture in general is not restricted to any particular 
type or subdomain of texts. 
The system is composed of a series of compo-
nents arranged in a pipeline: Tokenizer !Gene, 
                                                           
1
  It is worth noting that inference which appear conceptually to be “straight-
forward deductions” often manifest themselves within BioLiterate as PLN 
inference chains with 1-2 dozen inferences.  This is mostly because of the rela-
tively complex way in which logical relationships emerge from semantic map-
ping, and also because of the need for inferences that explicitly incorporate 
“obvious” background knowledge. 
105
Protein and Malignancy Tagger ! Nominalization 
Tagger ! Sentence Extractor ! Dependency Ex-
tractor !  Relationship Extractor ! Semantic 
Mapper ! Probabilistic Reasoning System. 
Each component, excluding the semantic map-
per and probabilistic reasoner, is realized as a 
UIMA (Götz and Suhre, 2004) annotator, with in-
formation being accumulated in each document as 
each phase occurs.
2
The gene/protein and malignancy taggers collec-
tively constitute our “entity extraction” subsystem. 
Our entity extraction subsystem and the tokenizer 
were adapted from PennBioTagger (McDonald et 
al, 2005; Jin et al, 2005; Lerman et al, 2006).  The 
tokenizer uses a maximum entropy model trained 
upon biomedical texts, mostly in the oncology do-
main. Both the protein and malignancy taggers 
were built using conditional random fields. 
  The nominalization tagger detects nominaliza-
tions that represent possible relationships that 
would otherwise go unnoticed. For instance, in the 
sentence excerpt “… intracellular signal transduc-
tion leading to transcriptional activation…” both 
“transduction” and “activation” are tagged. The 
nominalization tagger uses a set of rules based on 
word morphology and immediate context. 
Before a sentence passes from these early proc-
essing stages into the dependency extractor, which 
carries out syntax parsing, a substitution process is 
carried out in which its tagged entities are replaced 
with simple unique identifiers. This way, many 
text features that often impact parser performance 
are left out, such as entity names that have num-
bers or parenthesis as post-modifiers.
The dependency extractor component carries out 
dependency grammar parsing via a customized 
version of the open-source Sleator and Temperley 
link parser (1993). The link parser outputs several 
parses, and the dependencies of the best one are 
taken.
3
 
The relationship extractor component is com-
posed of a number of template matching algo-
rithms that act upon the link parser’s output to pro-
duce a semantic interpretation of the parse. This 
component detects implied quantities, normalizes 
passive and active forms into the same representa-
                                                           
2
 The semantic mapper will be incorporated into the UIMA framework in a later 
revision of the software.
3
 We have experimented with using other techniques for selecting dependencies, 
such as getting the most frequent ones, but variations in this aspect did not 
impact our results significantly.
tion and assigns tense and number to the sentence 
parts. Another way of conceptualizing this compo-
nent is as a system that translates link parser de-
pendencies into a graph of semantic primitives 
(Wierzbicka, 1996), using a natural semantic meta-
language (Goddard, 2002). 
Table 1 below shows some of the primitive se-
mantic relationships used, and their associated link 
parser links: 
 
subj Subject S, R, RS 
Obj Direct object O, Pv, B 
Obj-2 Indirect object O, B 
that Clausal Complement TH, C 
to-do  Subject Raising Complement 
(do)  
I, TO, Pg 
Table 1.  Semantic Primitives and Link Parser Links 
 
For a concrete example, suppose we have the 
sentences: 
 
a) Kim kissed Pat.  
b) Pat was kissed by Kim. 
 
Both would lead to the extracted relationships: 
 
subj(kiss, Kim), obj(kiss, Pat) 
 
For a more interesting case consider: 
 
c) Kim likes to laugh. 
d) Kim likes laughing. 
 
Both will have a to-do (like, laugh) seman-
tic relation. 
Next, this semantic representation, together with 
entity information, is feed into the Semantic Map-
per component, which applies a series of hand-
created rules whose purpose is to transform the 
output of the Relationship Extractor into logical 
relationships that are fully abstracted from their 
syntactic origin and suitable for abstract inference.  
The need for this additional layer may not be ap-
parent a priori, but arises from the fact that the 
output of the Relationship Extractor is still in a 
sense “too close to the syntax.”  The rules used 
within the Relationship Extractor are crisp rules 
with little context-dependency, and could fairly 
easily be built into a dependency parser (though 
the link parser is not architected in such a way as 
to make this pragmatically feasible); on the other 
106
hand, the rules used in the Semantic Mapper are 
often dependent upon semantic information about 
the words being interrelated, and would be more 
challenging to integrate into the parsing process. 
As an example, the semantic mapping rule 
 
by($X,$Y) & Inh($X, transitive_event) ! 
subj ($X,$Y) 
 
maps the relationship by(prevention, inhi-
bition), which is output by the Relationship Ex-
tractor, into the relationship subj(prevention, inhi-
bition), which is an abstract conceptual relation-
ship suitable for semantic inference by PLN.  It 
performs this mapping because it has knowledge 
that “prevention” inherits (Inh) from the semantic 
category transitive_event, which lets it guess 
what the appropriate sense of “by” might be. 
Finally, the last stage in the BioLiterate pipeline 
is probabilistic inference, which is carried out by 
the Probabilistic Logic Networks
4
 (PLN) system 
(Goertzel et al, in preparation) implemented within 
the Novamente AI Engine integrated AI architec-
ture (Goertzel and Pennachin, 2005; Looks et al, 
2004).   PLN is a comprehensive uncertain infer-
ence framework that combines probabilistic and 
heuristic truth value estimation formulas within a 
knowledge representation framework capable of 
expressing general logical information, and pos-
sesses flexible inference control heuristics includ-
ing forward-chaining, backward-chaining and rein-
forcement-learning-guided approaches.   
Among the notable aspects of PLN is its use of 
two-valued truth values: each PLN statement is 
tagged with a truth value containing at least two 
components, one a probability estimate and the 
other a “weight of evidence” indicating the amount 
of evidence that the probability estimate is based 
on.  PLN contains a number of different inference 
rules, each of which maps a premise-set of a cer-
tain logical form into a conclusion of a certain 
logical form, using an associated truth-value for-
mula to map the truth values of the premises into 
the truth value of the conclusion. 
The PLN component receives the logical rela-
tionships output by the semantic mapper, and per-
forms reasoning operations on them, with the aim 
at arriving at new conclusions implicit in the set of 
relationships fed to it.  Some of these conclusions 
                                                           
4
 Previously named Probabilistic Term Logic
may be implicit in a single text fed into the system; 
others may emerge from the combination of multi-
ple texts. 
In some cases the derivation of useful conclu-
sions from the semantic relationships fed to PLN 
requires “background knowledge” relationships not 
contained in the input texts.  Some of these back-
ground knowledge relationships represent specific 
biological or medical knowledge, and others repre-
sent generic “commonsense knowledge.”  The 
more background knowledge is fed into PLN, the 
broader the scope of inferences it can draw.   
One of the major unknowns regarding the cur-
rent approach is how much background knowledge 
will need to be supplied to the system in order to 
enable truly impressive performance across the full 
range of biomedical research abstracts.  There are 
multiple approaches to getting this knowledge into 
the system, including hand-coding (the approach 
we have taken in our BioLiterate work so far) and 
automated extraction of relationships from relevant 
texts beyond research abstracts, such as databases, 
ontologies and textbooks.  While this is an ex-
tremely challenging problem, we feel that due to 
the relatively delimited nature of the domain, the 
knowledge engineering issues faced here are far 
less severe than those confronting projects such as 
Cyc (Lenat, 1986; Guha, 1990; Guha, 1994) and 
SUMO (Niles, 2001) which seek to encode com-
monsense knowledge in a broader, non-domain-
specific way.  
3 A Practical Example 
We have not yet conducted a rigorous statistical 
evaluation of the performance of the BioLiterate 
system.  This is part of our research plan, but will 
involve considerable effort, due to the lack of any 
existing evaluation corpus for the tasks that Bio-
Literate performs.  For the time being, we have 
explored BioLiterate’s performance anecdotally 
via observing its behavior on various example “in-
ference problems” implicit in groups of biomedical 
abstracts.  This section presents one such example 
in moderate detail (full detail being infeasible due 
to space limitations). 
Table 2 shows two sentences drawn from differ-
ent PubMed abstracts, and then shows the conclu-
sions that BioLiterate draws from the combination 
of these two sentences.  The table shows the con-
clusions in natural language format, but the system 
107
actually outputs conclusions in logical relationship 
form as detailed below. 
 
Premise 1 Importantly, bone loss was almost 
completely prevented by p38 MAPK 
inhibition.  (PID 16447221) 
Premise 2 Thus, our results identify DLC as a 
novel inhibitor of the p38 pathway and 
provide a molecular mechanism by 
which cAMP suppresses p38 activa-
tion and promotes apoptosis. (PID 
16449637) 
(Uncertain) 
Conclusions 
DLC prevents bone loss. 
cAMP prevents bone loss. 
Table 2.  An example conclusion drawn by BioLiterate 
via combining relationships extracted from sentences 
contained in different PubMed abstracts.  The PID 
shown by each premise sentence is the PubMed ID of 
the abstract from which it was drawn. 
 
Tables 3-4 explore this example in more detail.  
Table 3 shows the relationship extractor output, 
and then the semantic mapper output, for the two 
premise sentences.   
 
Premise 1 
Rel Ex. 
Output 
_subj-n(bone, loss)  
_obj(prevention, loss)  
_subj-r(almost, completely)  
_subj-r(completely, prevention)  
by(prevention, inhibition)  
_subj-n(p38 MAPK, inhibition) 
 
Premise 2 
Sem Map 
Output 
subj (prevention, inhibition) 
obj (prevention, loss) 
obj (inhibition, p38_MAPK) 
obj (loss, bone) 
Premise 1 
Rel Ex 
Output 
_subj(identify, results)  
as(identify, inhibitor)  
_obj(identify, DLC)  
_subj-a(novel, inhibitor)  
of(inhibitor, pathway)  
_subj-n(p38, pathway) 
Premise 2 
Sem Map 
Output 
subj (inhibition, DLC) 
obj (inhibition, pathway) 
inh(pathway, p38) 
Table 3.  Intermediary processing stages for the two 
premise sentences in the example in Table 2.   
 
Table 4 shows a detailed “inference trail” consti-
tuting part of the reasoning done by PLN to draw 
the inference “DLC prevents bone loss” from these 
extracted semantic relationships, invoking back-
ground knowledge from its knowledge base as ap-
propriate.  
The notation used in Table 4 is so that, for in-
stance, Inh inhib  inhib  is synonymous with 
inh(inhib , inhib ) and denotes an Inheri-
tance relationship between the terms inhibition  
and inhibition  (the textual shorthands used in the 
table are described in the caption).  The logical 
relationships used are Inheritance, Implication, 
AND (conjunction) and Evaluation.  Evaluation is 
the relation between a predicate and its arguments; 
e.g. Eval subj(inhib , DLC) means that the 
subj predicate holds when applied to the list (in-
hib , DLC).  These particular logical relation-
ships are reviewed in more depth in (Goertzel and 
Pennachin, 2005; Looks et al, 2004).  Finally, in-
dent notation is used to denote argument structure, 
so that e.g.  
1 2
1 2
1
2
2
2
 
R 
 A 
 B 
 
is synonymous with R(A,B). 
PLN is an uncertain inference system, which 
means that each of the terms and relationships used 
as premises, conclusions or intermediaries in PLN 
inference come along with uncertain truth values.  
In this case the truth value of the conclusion at the 
end of Table 4 comes out to <.8,.07>, which indi-
cates that the system guesses the conclusion is true 
with probability .8, and that its confidence that this 
probability assessment is roughly correct is .07.  
Confidence values are scaled between 0 and 1: .07 
is a relatively low confidence, which is appropriate 
given the speculative nature of the inference.  Note 
that this is far higher than the confidence that 
would be attached to a randomly generated rela-
tionship, however. 
The only deep piece of background knowledge 
utilized by PLN in the course of this inference is 
the knowledge that: 
 
Implication 
AND 
  Inh X  causal_event 
1
  Inh X
2
 causal_event 
subj(X
1
, X
3
) 
subj(X
2
, X
1
) 
subj(X
2
,X
3
) 
 
which encodes the transitivity of causation in terms 
of the subj relationship.  The other knowledge 
108
used consisted of simple facts such as the inheri-
tance of inhibition and prevention from the cate-
gory causal_event. 
 
  
Premises Rule 
Conclusion 
Inh inhib
1
, inhib 
Inh inhib
2
, inhib 
 
Abduction 
Inh inhib
1
, inhib
2 
<.19, .99> 
Eval subj (prev
1
, inhib
1
)  
Inh inhib
1
, inhib
2
Similarity 
Substitution 
Eval subj (prev
1
   inhib
2
)  <1, 
.07> 
Inh inhib
2
, inhib  
Inh inhib, causal_event 
 
Deduction 
Inh inhib
2
, causal_event <1,1> 
Inh inhib
2
, causal_event 
Inh prev
1
, causal_event 
Eval subj (prev
1
, inhib
2
)  
Eval subj (inhib
2
, DLC) 
 
 
 
AND 
AND <1, .07> 
    Inh inhib
2
, causal_event 
    Inh prev
1
, causal_event 
    Eval subj (prev
1
, inhib
2
)  
    Eval subj (inhib
2
, DLC) 
 
 
ForAll (X
0
, X
1
, X
2
)  
    Imp  
        AND 
            Inh X
0
, causal_event 
            Inh X
1
, causal_event 
            Eval subj (X
1
, X
0
)  
            Eval subj (X
0
, X
2
)          
        Eval subj (X
1
,  X
2
)  
 
AND 
    Inh inhib
2
, causal_event 
    Inh prev
1
, causal_event 
    Eval subj (prev
1
, inhib
2
)  
    Eval subj (inhib
2
, DLC)     
 
 
 
 
 
 
 
 
 
 
 
 
 
Unification 
Eval subj (prev
1
, inhib
2
)  <1,.07> 
Imp  
    AND 
        Inh inhib
2
, causal_event 
        Inh prev
1
, causal_event 
        Eval subj (prev
1
, inhib
2
)  
        Eval subj (inhib
2
, DLC)         
    Eval subj (prev
1
, DLC) 
 
 
 
Implication 
Breakdown 
(Modus 
Ponens)  
Eval subj (prev
1
, DLC)  <.8, .07> 
 
Table 4.   Part of the PLN inference trail underlying 
Example 1.  This shows the series of inferences leading 
up to the conclusion that the prevention act prev1 is 
carried out by the subject DLC.  A shorthand notation is 
used here: Eval = Evaluation, Imp = Implication, Inh = 
Inheritance, inhib = inhibition, prev = prevention.  For 
instance, prev
1
 and prev
2
 denote terms that are particular 
instances of the general concept of prevention. Relation-
ships used in premises along the trail, but not produced 
as conclusions along the trail, were introduced into the 
trail via the system looking in its knowledge base to 
obtain the previously computed truth value of a relation-
ship, which was found via prior knowledge or a prior 
inference trail. 
4 Discussion 
We have described a prototype bioNLP system, 
BioLiterate, aimed at demonstrating the viability of 
using probabilistic inference to draw conclusions 
based on logical relationships extracted from mul-
tiple biomedical research abstracts using NLP 
technology.  The preliminary results we have ob-
tained via applying BioLiterate in the domain of 
the genetics of oncology suggest that the approach 
is potentially viable for the extraction of hypotheti-
cal interactions between genes, proteins and ma-
lignancies from sets of sentences spanning multiple 
abstracts.  One of our foci in future research will 
be the rigorous validation of the performance of 
the BioLiterate system in this domain, via con-
struction of an appropriate evaluation corpus.  
In our work with BioLiterate so far, we have 
identified a number of examples where PLN is able 
to draw biological conclusions by combining sim-
ple semantic relationships extracted from different 
biological research abstracts.  Above we reviewed 
one of these examples.  This sort of application is 
particularly interesting because it involves soft-
ware potentially creating relationships that may not 
have been explicitly known by any human, because 
they existed only implicitly in the connections be-
tween many different human-written documents.  
In this sense, the BioLiterate approach blurs the 
boundary between NLP information extraction and 
automated scientific discovery. 
Finally, by experimenting with the BioLiterate 
prototype we have come to some empirical conclu-
sions regarding the difficulty of several parts of the 
pipeline.  First, entity extraction remains a chal-
lenge, but not a prohibitively difficult one.  Our 
system definitely missed some important relation-
ships because of imperfect entity extraction but 
this was not the most problematic component. 
Sentence parsing was a more serious issue for 
BioLiterate performance. The link parser in its 
pure form had very severe shortcomings, but we 
were able to introduce enough small modifications 
to obtain adequate performance. Substituting un-
109
common and multi-word entity names with simple 
noun identifiers (a suggestion we drew from Pyy-
salo, 2004) reduced the error rate significantly, via 
bypassing problems related to wrong guessing of 
unknown words, improper handling of parentheses, 
and excessive possible-parse production. Other 
improvements we may incorporate in future in-
clude augmenting the parser’s dictionary to include 
biomedical terms (Slozovits, 2003), pre-processing 
so as to split long and complex sentences into 
shorter, simpler ones (Ding et al, 2003), modifying 
the grammar to handle with unknown constructs, 
and changing the link parser’s ranking system (Py-
ysalo, 2004). 
The inferences involved in our BioLiterate work 
so far have been relatively straightforward for PLN 
once the premises have been created.  More com-
plex inferences may certainly be drawn in the bio-
medical domain, but the weak link inference-wise 
seems to be the provision of inference with the ap-
propriate premises, rather than the inference proc-
ess itself.   
The most challenging aspects of the work in-
volved semantic mapping and the supplying of 
relevant background knowledge.  The creation of 
appropriate semantic mapping rules can be subtle 
because these rules sometimes rely on the semantic 
categories of the words involved in the relation-
ships they transform.  The execution of even com-
monsensically simple biomedical inferences often 
requires the combination of abstract and concrete 
background knowledge.  These are areas we will 
focus on in our future work, as achieving a scalable 
approach will be critical in transforming the cur-
rent BioLiterate prototype into a production-
quality system capable of assisting biomedical re-
searchers to find appropriate information, and of 
drawing original and interesting conclusions by 
combining pieces of information scattered across 
the research literature. 
Acknowledgements 
This research was partially supported by a contract 
with the NIH Clinical Center in September-
November 2005, arranged by Jim DeLeo. 

References  
Chan-Goo Kang and Jong C. Park. 2005. Generation of 
Coherent Gene Summary with Concept-Linking Sen-
tences. Proceedings of the International Symposium 
on Languages in Biology and Medicine (LBM), 
pages 41-45, Daejeon, Korea, November, 2005. 
Claire Nédellec. 2005. Learning Language in Logic - 
Genic Interaction Extraction Challenge. Proceedings 
of The 22nd International Conference on Machine 
Learning, Bonn, Germany.  
Cliff Goddard. 2002. The On-going Development of the 
NSM Research Program. Ch 5 (pp. 301-321) of 
Meaning and Universal Grammar - Theory and Em-
pirical Findings. Volume II. Amsterdam: John Ben-
jamins. 
Davulcu, H et Al. 2005. IntEx?: A Syntactic Role 
Driven Protein-Protein Interaction Extractor for Bio-
Medical Text. Proceedings of the ACL-ISMB Work-
shop on Linking Biological Literature, Ontologies 
and Databases: Mining Biological Semantics. De-
troit.  
Donaldson, Ian,  Joel Martin, Berry de Bruijn, Cheryl 
Wolting et al. 2003. PreBIND and Textomy - mining 
the biomedical literature for protein-protein interac-
tions using a support vector machine. BMC Bioin-
formatics, 4:11,  
Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky. 
2001. A. GENIES: a natural-language processing 
system for the extraction of molecular pathways from 
journal articles. Bioinformatics Jun;17 Suppl 1:S74-
82. 
Goertzel, Ben and Cassio Pennachin. 2005. Artificial 
General Intelligence.  Springer-Verlag.  
Goertzel, Ben, Matt Ikle’, Izabela Goertzel and Ari Hel-
jakka.  2006.  Probabilistic Logic Networks.  In 
preparation. 
Götz, T and Suhre, O. 2004. Design and implementation 
of the UIMA Common Analysis System. IBM Systems 
Journal. V 43, number 3. pages 476-489 . 
Guha, R. V., & Lenat, D. B. 1994. Enabling agents to 
work together. Communications of the ACM, 37(7), 
127-142. 
Guha, R.V. and Lenat,D.B. 1990. Cyc: A Midterm Re-
port. AI Magazine 11(3):32-59.
Hakenberg, . et al. 2205. LLL'05 Challenge: Genic In-
teraction Extraction -- Identification of Language 
Patterns Based on Alignment and Finite State Auto-
mata. Proceedings of The 22nd International Confer-
ence on Machine Learning, Bonn, Germany. 2005.  
Hoffmann, R., Valencia, A. 2005. Implementing the 
iHOP concept for navigation of biomedical litera-
ture. Bioinformatics 21(suppl. 2), ii252-ii258 (2005). 
Ian Niles and Adam Pease. 2001. Towards a Standard 
Upper Ontology.  In Proceedings of the 2nd Interna-
tional Conference on Formal Ontology in Informa-
tion Systems (FOIS-2001), Ogunquit, Maine, Octo-
ber 2001 
Jensen, L.J., Saric, J and Bork, P. 2006. Literature Min-
ing for the biologist: from information retrieval to 
biological discovery. Nature Reviews. Vol 7. pages 
119-129. Natura Publishing Group. 2006. 
Jing Ding. 2003. Extracting biomedical interactions 
with from medline using a link grammar parser. Pro-
ceedings of 15th IEEE international Conference on 
Tools With Artificial Intelligence.  
Kim, Jim-Dong et al. 2004. Introduction to the Bio-NLP 
Entity Task at JNLPBA 2004. In Proceedings of 
JNLPBA 2004. 
Lenat, D., Prakash, M., & Shepard, M. 1986. CYC: Us-
ing common sense knowledge to overcome brittleness 
and knowledge acquisition bottlenecks. AI Magazine, 
6(4), 65-85 
Lerman, K , McDonal, R., Jin, Y. and Pancoast, E. Uni-
versity of Pennsylvania BioTagger. 2006. 
http://www.seas.upenn.edu/~ryantm/software/BioTag
ger/  
Looks, Moshe, Ben Goertzel and Cassio Pennachin. 
2004.  Novamente: An Integrative Approach to Arti-
ficial General Intelligence.  AAAI Symposium on 
Achieving Human-Level Intelligence Through Inte-
grated Systems and Research, Washington DC, Oc-
tober 2004 
Mandel, Mark. 2006. Mining the Bibliome.  February, 
2006 http://bioie.ldc.upenn.edu
Mark A. Greenwood, Mark Stevenson, Yikun Guo, 
Henk Harkema, and Angus Roberts.  2005. Auto-
matically Acquiring a Linguistically Motivated Genic 
Interaction Extraction System. In Proceedings of the 
4th Learning Language in Logic Workshop (LLL05), 
Bonn, Germany. 
McDonald, F. Pereira, S. Kulick, S. Winters, Y. Jin and 
P. White. 2005.  Simple Algorithms for Complex Re-
lation Extraction with Applications to Biomedical IE. 
R. 43rd Annual Meeting of the Association for Com-
putational Linguistics, 2005. 
Müller, H. M., Kenny, E. E. and Sternberg, P. W. 2004. 
Textpresso: An Ontology-Based Information Re-
trieval and Extraction System for Biological Litera-
ture. PLoS Biol 2(11): e309 
Pyysalo, S. et al. 2004. Analisys of link Grammar on 
Biomedical Dependency Corpus Targeted at Protein-
Protein Interactions. In Proceedings of JNLPBA 
2004.  
Riedel, et al. 2005. Genic Interaction Extraction with 
Semantic and Syntactic Chains. Proceedings of The 
22nd International Conference on Machine Learning, 
Bonn, Germany.  
Ryan McDonald and Fernando Pereira. 2005. Identify-
ing gene and protein mentions in text using condi-
tional random fields. BMC Bioinformatics 2005, 
6(Suppl 1):S6  
Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra 
P, Morris M, Yu H, Duboue PA, Weng W, Wilbur 
WJ, Hatzivassiloglou V, Friedman C. 2004. Gene-
Ways: a system for extracting, analyzing, visualizing, 
and integrating molecular pathway data. Journal of 
Biomedical Informatics 37(1):43-53.  
Sleator, Daniel and Dave Temperley. 1993. Parsing 
English with a Link Grammar. Third International 
Workshop on Parsing Technologies,  Tilburg, The 
Netherlands.  
Smalheiser, N. L and Swanson D. R. 1996. Linking es-
trogen to Alzheimer's disease: an informatics ap-
proach. Neurology 47(3):809-10.  
Smalheiser, N. L and Swanson, D. R. 1998. Using AR-
ROWSMITH: a computer-assisted approach to for-
mulating and assessing scientific hypotheses. Com-
put Methods Programs Biomed. 57(3):149-53.  
Syed Ahmed et al. 2005. IntEx: A Syntactic Role Driven 
Protein-Protein Interaction Extractor for Bio-
Medical Text.  Proc. of BioLink '2005, Detroit, 
Michigan, June 24, 2005  
Szolovits, Peter. 2003. Adding a medical lexicon to an 
English parser. Proceedings of 2003 AMIA Annual 
Symposium. Bethesda. MD.  
Tanabe, L. U. Scherf, L. H. Smith, J. K. Lee, L. Hunter 
and J. N. Weinstein. 1999. MedMiner: an Internet 
Text-Mining Tool for Biomedical Information, with 
Application to Gene Expression Profiling.  BioTech-
niques 27:1210-1217. 
Wierzbicka, Anna. 1996. Semantics, Primes and Uni-
versals. Oxford University Press. 
