Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL 06, pages 9–16,
New York City, June 2006. c©2006 Association for Computational Linguistics
Ontology-Based Natural Language Query Processing for the Biological 
Domain 
 
Jisheng Liang, Thien Nguyen, Krzysztof Koperski, Giovanni Marchisio 
Insightful Corporation 
1700 Westlake Ave N., Suite 500, Seattle, WA, USA 
{jliang,thien,krisk,giovanni}@insightful.com 
 
 
Abstract 
This paper describes a natural language 
query engine that enables users to search 
for entities, relationships, and events that 
are extracted from biological literature. 
The query interpretation is guided by a 
domain ontology, which provides a map-
ping between linguistic structures and 
domain conceptual relations. We focus on 
the usability of the natural language inter-
face to users who are used to keyword-
based information retrieval. Preliminary 
evaluation of our approach using the 
GENIA corpus and ontology shows prom-
ising results. 
1 Introduction 
New scientific research methods have greatly in-
creased the volume of data available in the biologi-
cal domain. A growing challenge for researchers 
and health care professionals is how to access this 
ever-increasing quantity of information [Hersh 
2003]. The general public has even more trouble 
following current and potential applications. Part 
of the difficulty lies in the high degree of speciali-
zation of most resources. There is thus an urgent 
need for better access to current data and the vari-
ous domains of expertise. Key considerations for 
improving information access include: 1) accessi-
bility to different types of users; 2) high precision; 
3) ease of use; 4) transparent retrieval across het-
erogeneous data sources; and 5) accommodation of 
rapid language change in the domain.  
 
Natural language searching refers to approaches 
that enable users to express queries in explicit 
phrases, sentences, or questions. Current informa-
tion retrieval engines typically return too many 
documents that a user has to go through. Natural 
language query allows users to express their in-
formation need in a more precise way and retrieve 
specific results instead of ranked documents. It 
also benefits users who are not familiar with do-
main terminology.   
 
With the increasing availability of textual informa-
tion related to biology, including MEDLINE ab-
stracts and full-text journal articles, the field of 
biomedical text mining is rapidly growing. The 
application of Natural Language Processing (NLP) 
techniques in the biological domain has been fo-
cused on tagging entities, such as genes and pro-
teins, and on detecting relations among those 
entities. The main goal of applying these tech-
niques is database curation. There has been a lack 
of effort or success on improving search engine 
performance using NLP and text mining results. In 
this effort, we explore the feasibility of bridging 
the gap between text mining and search by  
• Indexing entities and relationships ex-
tracted from text, 
• Developing search operators on entities 
and relationships, and 
• Transforming natural language queries to 
the entity-relationship search operators.  
 
The first two steps are performed using our exist-
ing text analysis and search platform, called InFact 
[Liang 2005; Marchisio 2006]. This paper con-
cerns mainly the step of NL query interpretation 
and translation. The processes described above are 
all guided by a domain ontology, which provides a 
conceptual mapping between linguistic structures 
and domain concepts/relations. A major drawback 
to existing NL query interfaces is that their linguis-
tic and conceptual coverage is not clear to the user 
9
[Androutsopoulos 1995]. Our approach addresses 
this problem by pointing out which concepts or 
syntactic relations are not mapped when we fail to 
find a consistent interpretation.  
Figure 1 shows the query processing and 
retrieval process.
 
There has been skepticism about the usefulness of 
natural language queries for searching on the web 
or in the enterprise. Users usually prefer to enter 
the minimum number of words instead of lengthy 
grammatically-correct questions. We have devel-
oped a prototype system to deal with queries such 
as “With what genes does AP-1 interact?” The 
queries do not have to be standard grammatical 
questions, but rather have forms such as: “proteins 
regulated by IL-2” or “IL-2 inhibitors”. We apply 
our system to a corpus of molecular biology litera-
ture, the GENIA corpus. Preliminary experimental 
results and evaluation are reported. 
2 Overview of Our Approach 
Molecular biology concerns interaction events be-
tween proteins, drugs, and other molecules. These 
events include transcription, translation, dissocia-
tion, etc. In addition to basic events which focus on 
interactions between molecules, users are also in-
terested in relationships between basic events, e.g. 
the causality between two such events [Hirschman 
2002]. In order to produce a useful NL query tool, 
we must be able to correctly interpret and answer 
typical queries in the domain, e.g.:  
• What genes does transcription factor X 
regulate?  
• With what genes does gene G physically 
interact?   
• What proteins interact with drug D?  
• What proteins affect the interaction of an-
other protein with drug D? 
 
Figure 1 shows the process diagram of our system. 
The query interpretation process consists of two 
major steps: 1) Syntactic analysis – parsing and 
decomposition of the input query; and 2) Semantic 
analysis – mapping of syntactic structures to an 
intermediate conceptual representation. The analy-
sis uses an ontology to extract domain-specific en-
tities/relations and to resolve linguistic ambiguity 
and variations. Then, the extracted semantic ex-
pression is transformed into an entity-relationship 
query language, which retrieves results from pre-
indexed biological literature databases. 
 
Natural Language 
Query 
 
 
 
Parsing &  
Decomposition
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2.1 Incorporating Domain Ontology 
Domain ontologies explicitly specify the meaning 
of and relation between the fundamental concepts 
in an application domain. A concept represents a 
set or class of entities within a domain. Relations 
describe the interactions between concepts or a 
concept's properties. Relations also fall into two 
broad categories: taxonomies that organize con-
cepts into “is-a” and “is-a-member-of” hierarchy, 
and associative relationships [Stevens 2000]. The 
associative relationships represent, for example, 
the functions and processes a concept has or is in-
volved in. A domain ontology also specifies how 
knowledge is related to linguistic structures such as 
grammars and lexicons. Therefore, it can be used 
by NLP to improve expressiveness and accuracy, 
and to resolve the ambiguity of NL queries.  
 
There are two major steps for incorporating a do-
main ontology: 1) building/augmenting a lexicon 
for entity tagging, including lexical patterns that 
specify how to recognize the concept in text; and 
2) specifying syntactic structure patterns for ex-
tracting semantic relationships among concepts.  
The existing ontologies (e.g. UMLS, Gene Ontol-
ogy) are created mainly for the purpose of database 
Entity-Relationship 
Markup & Indexing
Semantic Analysis
Syntactic Structure 
 
Domain 
Ontology
Semantic Expression 
Translation 
Entity-Relationship 
Query 
Text 
Corpus 
10
annotation and consolidation. From those ontolo-
gies, we could extract concepts and taxonomic re-
lations, e.g., is-a. However there is also a need for 
ontologies that specify relevant associative rela-
tions between concepts, e.g. “Protein acetylate Pro-
tein.” In our experiment we investigate the 
problem of augmenting an existing ontology (i.e. 
GENIA) with associative relations and other lin-
guistic information required to guide the query in-
terpretation process.  
2.2 Query Parsing and Normalization 
Our NL parser performs the steps of tokenization, 
part-of-speech tagging, morphological processing, 
lexical analysis, and identification of phrase and 
grammatical relations such as subjects and objects. 
The lexical analysis is based on a customizable 
lexicon and set of lexical patterns, providing the 
abilities to add words or phrases as dictionary 
terms, to assign categories (e.g. entity types), and 
to associate synonyms and related terms with dic-
tionary items. The output of our parser is a de-
pendency tree, represented by a set of dependency 
relationships of the form (head, relation, modifier).   
 
In the next step, we perform syntactic decomposi-
tion to collapse the dependency tree into subject-
verb-object (SVO) expressions. The SVO triples 
can express most types of syntactic relations be-
tween various entities within a sentence. Another 
advantage of this triple expression is that it be-
comes easier to write explicit transformational 
rules that encode specific linguistic variations.  
 
 
Figure 2 shows the subject-action-object triplet.
 
 
 
 
 
 
 
Verb modifiers in the syntactic structure may in-
clude prepositional attachment and adverbials. The 
modifiers add context to the event of the verb, in-
cluding time, location, negation, etc. Subject/object 
modifiers include appositive, nominative, genitive, 
prepositional, descriptive (adjective-noun modifi-
cation), etc. All these modifiers can be either con-
sidered as descriptors (attributes) or reformulated 
as triple expressions by assigning a type to the pair.  
Linguistic normalization is a process by which lin-
guistic variants that contain the same semantic 
content are mapped onto the same representational 
structure. It operates at the morphological, lexical 
and syntactic levels. Syntactic normalization in-
volves transformational rules that recognize the 
equivalence of different structures, e.g.: 
• Verb Phrase Normalization – elimination 
of tense, modality and voice. 
• Verbalization of noun phrases – e.g. Inhi-
bition of X by Y  � Y inhibit X.  
 
For example, queries such as: 
Proteins activated by IL-2 
      What proteins are activated by IL-2? 
      What proteins does IL-2 activate? 
      Find proteins that are activated by IL-2 
are all normalized into the relationship: 
IL-2 > activate > Protein 
 
As part of the syntactic analysis, we also need to 
catch certain question-specific patterns or phrases 
based on their part-of-speech tags and grammatical 
roles, e.g. determiners like “which” or “what”, and 
verbs like “find” or “list”. 
2.3 Semantic Analysis 
The semantic analysis typically involves two steps: 
1) Identifying the semantic type of the entity 
sought by the question; and 2) Determining addi-
tional constraints by identifying relations that 
ought to hold between a candidate answer entity 
and other entities or events mentioned in the query 
[Hirschman 2001]. The semantic analysis attempts 
to map normalized syntactic structures to semantic 
entities/relations defined in the ontology. When the 
system is not able to understand the question, the 
cause of failure will be explained to the user, e.g. 
unknown word or syntax, no relevant concepts in 
the ontology, etc. The output of semantic analysis 
is a set of relationship triplets, which can be 
grouped into four categories: 
Subject Action Object
 
Events, including interactions between entities and 
inter-event relations (nested events), e.g. 
    Inhibition(“il-2”, “erbb2”) 
    Inhibition(protein, Activation(DEX, IkappaB)) 
 
Event Attributes, including attributes of an inter-
action event, e.g. 
Subject 
Modifier 
Action 
Modifier 
Object 
Modifier
11
    Location(Inhibition(il-2, erbb2), “blood cell”) 
 
Entity Attributes, including attributes of a given 
entity, e.g. 
     Has-Location(“erbb2”, “human”) 
 
Entity Types, including taxonomic paths of a 
given entity, e.g. 
    Is-A(“erbb2”, “Protein”) 
 
A natural language query will be decomposed into 
a list of inter-linked triplets. A user’s specific in-
formation request is noted as “UNKNOWN.” 
 
Starting with an ontology, we determine the map-
ping from syntactic structures to semantic rela-
tions. Given our example “IL-2 > activate > 
Protein”, we recognize “IL-2” as an entity, map the 
verb “activate” to a semantic relation “Activation,” 
and detect the term “protein” as a designator of the 
semantic type “Protein.” Therefore, we could eas-
ily transform the query to the following triplets: 
• Activation(IL-2, UNKNOWN) 
• Is-A(UNKNOWN, Protein)  
 
Given a syntactic triplet of subject/verb/object or 
head/relation/modifier, the ontology-driven seman-
tic analysis performs the following steps: 
1. Assign possible semantic types to the pair 
of terms,  
2. Determine all possible semantic links be-
tween each pair of assigned semantic types 
defined in the ontology,  
3. Given the syntactic relation (i.e. verb or 
modifier-relation) between the two con-
cepts, infer and validate plausible inter-
concept semantic relationships from the set 
determined in Step 2, 
4. Resolve linguistic ambiguity by rejecting 
inconsistent relations or semantic types. 
 
It is simpler and more robust to identify the query 
pattern using the extracted syntactic structure, in 
which linguistic variations have been normalized 
into a canonical form, rather than the original ques-
tion or its full parse tree. 
2.4 Entity-Relationship Indexing and 
Search 
In this section, we describe the annotation, index-
ing and search of text data. In the off-line indexing 
mode, we annotate the text with ontological con-
cepts and relationships. We perform full linguistic 
analysis on each document, which involves split-
ting of text into sentences, sentence parsing, and 
the same syntactic and semantic analysis as de-
scribed in previous sections on query processing. 
This step recognizes names of proteins, drugs, and 
other biological entities mentioned in the texts. 
Then we apply a document-level discourse analysis 
procedure to resolve entity-level coreference, such 
as acronyms/aliases and pronoun anaphora. Sen-
tence-level syntactic structures (subject-verb-
object triples) and semantic markups are stored in a 
database and indexed for efficient retrieval. 
 
In the on-line search mode, we provide a set of 
entity-relationship (ER) search operators that allow 
users to search on the indexed annotations. Unlike 
keyword search engines, we employ a highly ex-
pressive query language that combines the power 
of grammatical roles with the flexibility of Boo-
lean operators, and allows users to search for ac-
tions, entities, relationships, and events. We 
represent the basic relationship between two enti-
ties with an expression of the kind: 
Subject Entity > Action > Object Entity 
We can optionally constrain this expression by 
specifying modifiers or using Boolean logic. The 
arrows in the query refer to the directionality of the 
action. For example,  
Entity 1 <> Action <> Entity 2 
will retrieve all relationships involving Entity 1 
and Entity 2, regardless of their roles as subject or 
object of the action. An asterisk (*) can be used to 
denote unknown or unspecified sources or targets, 
e.g. “Il-2 > inhibit > *”. 
 
In the ER query language we can represent and 
organize entity types using taxonomy paths, e.g.: 
    [substance/compound/amino_acid/protein]  
    [source/natural/cell_type]  
The taxonomic paths can encode the “is-a” relation 
(as in the above examples), or any other relations 
defined in a particular ontology (e.g. the “part-of” 
relation). When querying, we can use a taxonomy 
path to specify an entity type, e.g. [Pro-
tein/Molecule], [Source], and the entity type will 
automatically include all subpaths in the taxonomic 
12
hierarchy. The complete list of ER query features 
that we currently support is given in Table 1.  
 
ER Query Features Descriptions and Examples 
Relationships be-
tween two entities or 
entity types 
The query “il-2 <> * <> Ap1” 
will retrieve all relationships 
between the two entities.  
Events involving 
one or more entities 
or types 
The query “il-2 > regulate > 
[Protein]” will return all in-
stances of il-2 regulating a 
protein.  
Events restricted to a 
certain action type - 
categories of actions 
that can be used to 
filter or expand 
search 
The query “[Protein] > [Inhi-
bition] > [Protein]” will re-
trieve all events involving two 
proteins that are in the nature 
of inhibition. 
 
Boolean Operators 
- AND, OR, NOT 
Example: Il-2 OR “interleukin 
2” > inhibit or suppress >* 
Phrases such as “interleukin 
2” can be included in quotes. 
Prepositional Con-
straints 
- Filter results by 
information found in 
a prepositional 
modifier. 
Query Il-2 > activate > [pro-
tein]^[cell_type] 
will only return results men-
tioning a cell type location 
where the activation occurs.  
 
Local context con-
straints - Certain 
keyword(s) must 
appear near the rela-
tionship (within one 
sentence). 
Example: LPS > induce > NF-
kappaB CONTEXT 
CONTAINS “human T cell” 
 
Document keyword 
constraints - Docu-
ments must contain 
certain keyword(s) 
Example: Alpha-lipoic acid > 
inhibit > activation DOC 
CONTAINS “AIDS” OR 
“HIV” 
Document metadata 
constraints 
Restrict results to documents 
that contain the specified 
metadata values. 
Nested Search Allow users to search the re-
sults of a given search. 
Negation Filtering Allow users to filter out ne-
gated results that are detected 
during indexing. 
Table 1 lists various types of ER queries 
2.5 Translation to ER Query 
We extract answers through entity-relational 
matching between the NL query and syntac-
tic/semantic annotations extracted from sentences. 
Given the query’s semantic expression as de-
scribed in Section 2.3, we translate it to one or 
more entity-relationship search operators. The dif-
ferent types of semantic triplets (i.e. Event, Attrib-
ute, and Type) are treated differently when being 
converted to ER queries.  
• The Event relations can be converted di-
rectly to the subject-action-object queries. 
• The inter-event relations are represented as 
local context constraints. 
• The Event Attributes are translated to 
prepositional constraints.  
• The Entity Attribute relations could be ex-
tracted either from same sentence or from 
somewhere else within document context, 
using the nested search feature. 
• The Entity Type relations are specified in 
the ontology taxonomy. 
 
For our example, “proteins activated by il-2”, we 
translate it into an ER query: “il-2 > [activation] > 
[protein]”. Figure 3 shows the list of retrieved sub-
ject-verb-object triples that match the query, where 
each triple is linked to a sentence in the corpus. 
3 Experiment Results 
We tested our approach on the GENIA corpus and 
ontology. The evaluation presented in this section 
focuses on the ability of the system to translate NL 
queries into their normalized representation, and 
the corresponding ER queries. 
3.1 Test Data 
The GENIA corpus contains 2000 annotated 
MEDLINE abstracts [Ohta 2002]. The main reason 
we chose this corpus is that we could extract the 
pre-annotated biological entities to populate a do-
main lexicon, which is used by the NL parser. 
Therefore, we were able to ensure that the system 
had complete terminology coverage of the corpus. 
During indexing, we used the raw text data as input 
by stripping out the annotation tags.  
 
The GENIA ontology has a complete taxonomy of 
entities in molecular biology. It is divided into sub-
stance and source sub-hierarchies. The substances 
include sub-paths such as nucleic_acid/DNA and 
amino_acid/protein. Sources are biological loca-
tions where substances are found and their reac-
tions take place. They are also hierarchically sub-
classified into organisms, body parts, tissues, cells 
13
or cell types, etc. Our adoption of the GENIA on-
tology as a conceptual model for guiding query 
interpretation is described as follows. 
           
Entities - For gene and protein names, we added 
synonyms and variations extracted from the Entrez 
Gene database (previously LocusLink). 
 
Interactions – The GENIA ontology does not con-
tain associative relations. By consulting a domain 
expert, we identified a set of relations that are of 
particular interest in this domain. Some examples 
of relevant relations are: activate, bind, interact, 
regulate. For each type of interaction, we created a 
list of corresponding action verbs. 
 
Entity Attributes - We identified two types of 
entity attributes: 
1. Location, e.g. body_part, cell_type, etc. 
identified by path [genia/source] 
 
 
Figure 3 shows our natural language query interface. The retrieved subject-verb-object relationships 
are displayed in a tabular format. The lower screenshot shows the document display page when user 
clicks on the last result link <interleukin 2, activate, NF-kappa B>. The sentence that contains the 
result relationship is highlighted. 
 
2. Subtype of proteins/genes, e.g. enzymes, 
transcription factors, etc., identified by 
types like protein_family_or_group, 
DNA_family_or_group 
 
Event Attributes - Locations were the only event 
attribute we supported in this experiment.  
 
Designators - We added a mapping between each 
semantic type and its natural language names. For 
example, when a term such as "gene" or "nucleic 
acid" appears in a query, we map it to the taxo-
nomic path: [Substance/compound/nucleic_acid] 
3.2 Evaluation 
14
To demonstrate our ability to interpret and answer 
NL queries correctly, we selected a set of 50 natu-
ral language questions in the molecular biology 
domain. The queries were collected by consulting a 
domain expert, with restrictions such as: 
1. Focusing on queries concerning entities 
and interaction events between entities. 
2. Limiting to taxonomic paths defined 
within the GENIA ontology, which does 
not contain important entities such as 
drugs and diseases. 
 
For each target question, we first manually created 
the ground-truth entity-relationship model. Then, 
we performed automatic question interpretation 
and answer retrieval using the developed software 
prototype. The extracted semantic expressions 
were verified and validated by comparison against 
the ground-truth. Our system was able to correctly 
interpret all the 50 queries and retrieve answers 
from the GENIA corpus. In the rest of this section, 
we describe a number of representative queries. 
 
Query on events:
    With what genes does ap-1 physically interact?  
Relations: 
    Interaction(“ap-1”, UNKOWN) 
    IS-A(UNKNOWN, “Gene”) 
ER Query:
    ap-1 <>[Interaction] <> [nucleic_acid] 
 
Queries on association: 
    erbb2 and il-2 
    what is the relation between erbb2 and il-2? 
Relations:  
    Association(“erbb2”, “il-2”) 
ER Query:  
    Erbb2 <>*<>il-2 
 
Query of noun phrases:  
    Inhibitor of erbb2 
Relation:  
    Inhibition(UNKNOWN, “erbb2”)  
ER Query:  
    [substance] > [Inhibition] > erbb2 
 
Query on event location: 
    In what cell types is il-2 activated? 
Relations: 
    Activation (*, “Il-2”) 
    Location (Activation(), [cell_type]) 
ER Query: 
    * > [Activation] > il-2 ^ [cell_type] 
 
Entity Attribute Constraints 
An entity’s properties are often mentioned in a 
separate place within the document. We translate 
these types of queries into DOC_LEVEL_AND of 
multiple ER queries. This AND operator is cur-
rently implemented using the feature of nested 
search. For example, given query:  
    What enzymes does HIV-1 Tat suppress? 
we recognize the word "enzyme" is associated with 
the path: [protein/protein_family_or_group], and 
we consider it as an attribute constraint. 
 
Relations: 
    Inhibition (“hiv-1 tat”, UNKNOWN)  
    IS-A(UNKNOWN, “Protein”) 
    HAS-ATTRIBUTE (UNKNOWN, “enzyme”)    
ER query: 
    ( hiv-1 tat > [Inhibition]> [protein] )  
    DOC_LEVEL_AND 
    ( [protein] > be > enzyme ) 
 
One of the answer sentences is displayed below:  
“Thus, our experiments demonstrate that the C-
terminal region of HIV-1 Tat is required to sup-
press Mn-SOD expression” 
while Mn-SOD is indicated as an enzyme in a dif-
ferent sentence: 
“… Mn-dependent superoxide dismutase (Mn-  
SOD), a mitochondrial enzyme … ” 
 
Inter-Event Relations 
The inter-event relations or nested event queries 
(CLAUSE_LEVEL_AND) are currently imple-
mented using the ER query’s local context con-
straints, i.e. one event must appear within the local 
context of the other.  
 
Query on inter-event relations:  
What protein inhibits the induction of Ikappa-
Balpha by DEX? 
Relations: 
    Inhibition ([protein], Activation()) 
    Activation (“DEX”, “IkappaBalpha”) 
ER Query: 
    ( [protein] > [Inhibition] > * )    
    CLAUSE_LEVEL_AND  
    ( DEX > [Activation] > IkappaBalpha ) 
 
15
One of the answer sentences is: 
“In both cell types, the cytokine that inhibits the 
induction of IkappaBapha by DEX, also rescues 
these cells from DEX-induced apoptosis.” 
4 Discussions 
We demonstrated the feasibility of our approach 
using the relatively small GENIA corpus and on-
tology. A key concern with knowledge or semantic 
based methods is the scalability of the methods to 
larger set of data and queries. As future work, we 
plan to systematically measure the effectiveness of 
the approach based on large-scale experiments in 
an information retrieval setting, as we increase the 
knowledge and linguistic coverage of our system. 
 
We are able to address the large data size issue by 
using InFact as an ingestion and deployment plat-
form. With a distributed architecture, InFact is ca-
pable of ingesting large data sets (i.e. millions of 
MEDLINE abstracts) and hosting web-based 
search services with a large number of users.  We 
will investigate the scalability to larger knowledge 
coverage by adopting a more comprehensive on-
tology (i.e. UMLS [Bodenreider 2004]). In addi-
tion to genes and proteins, we will include other 
entity types such as drugs, chemical compounds, 
diseases and phenotypes, molecular functions, and 
biological processes, etc. A main challenge will be 
increasing the linguistic coverage of our system in 
an automatic or semi-automatic way. 
 
Another challenge is to encourage keyword search 
users to use the new NL query format and the 
semi-structured ER query form.  We are investigat-
ing a number of usability enhancements, where the 
majority of them have been implemented and are 
being tested.  
 
For each entity detected within a query, we provide 
a hyperlink that takes the user to an ontology 
lookup page. For example, if the user enters "pro-
tein il-2", we let the user know that we recognize 
"protein" as a taxonomic path and "il-2" as an en-
tity according to the ontology. If a relationship 
triplet has any unspecified component, we provide 
recommendations (or tips) that are hyperlinks to 
executable ER queries. This allows users who are 
not familiar with the underlying ontology to navi-
gate through most plausible results. When the user 
enters a single entity of a particular type, we dis-
play a list of relations the entity type is likely to be 
involved in, and a list of other entity types that are 
usually associated to the given type. Similarly, we 
define a list of relations between each pair of entity 
types according to the ontology. The relations are 
ranked according to popularity. When the user en-
ters a query that involves two entities, we present 
the list of relevant relations to the user. 
Acknowledgements: This research was sup-
ported in part by grant number 1 R43 LM008464-
01 from the NIH. The authors thank Dr. David 
Haynor for his advice on this work; the anonymous 
reviewers for their helpful comments; and Yvonne 
Lam for helping with the manuscript. 

References  
Androutsopoulos I, Ritchie GD and Thanisch P. “Natu-
ral Language Interfaces to Databases – An Introduc-
tion”, Journal of Natural Language Engineering, Vol 
1, pp. 29-81, 1995. 
Bodenreider O. The Unified Medical Language System 
(UMLS): Integrating Biomedical Terminology. Nu-
cleic Acids Research, 2004. 
Hersh W and Bhupatiraju RT. “TREC Genomics Track 
Overview”, In Proc. TREC, 2003, pp. 14-23. 
Hirschman L and Gaizauskas R. Natural Language 
Question Answering: The View from Here. Natural 
Language Engineering, 2001. 
Hirschman L, Park JC, Tsujii J, Wong L and Wu CH. 
Accomplishments and Challenges in Literature Data 
Mining for Biology. Bioinformatics Review, Vol. 18, 
No. 12, 2002, pp. 1553-1561. 
Liang J, Koperski K, Nguyen T, and Marchisio G. Ex-
tracting Statistical Data Frames from Text. ACM 
SIGKDD Explorations, Volume 7, Issue 1, pp. 67 – 
75, June 2005. 
Marchisio G, Dhillon D, Liang J, Tusk C, Koperski K, 
Nguyen T, White D, and Pochman L. A Case Study 
in Natural Language Based Web Search. To appear 
in Text Mining and Natural Language Processing. A 
Kao and SR Poteet (Editors). Springer 2006. 
Ohta T, Tateisi Y, Mima H, and Tsujii J. GENIA Cor-
pus: an Annotated Research Abstract Corpus in Mo-
lecular Biology Domain. In Proc. HLT 2002. 
Stevens R, Goble CA, and Bechhofer S. Ontology-based 
Knowledge Representation for Bioinformatics. Brief-
ings in Bioinformatics, November 2000. 
