TRANSPORTABLE NATURAL-LANGUAGE INTERFACES TO DATABASES 
by 
Gary G. Hendrix and William H. Lewis 
SRI International 
333 Ravenswood Avenue 
Menlo Park, California 94025 
I INTRODUCTION 
Over the last few years a number of 
application systems have been constructed that 
allow users to access databases by posing questions 
in natural languages, such as English. When used 
in the restricted domains for which they have been 
especially designed, these systems have achieved 
reasonably high levels of performance. Such 
systems as LADDER \[2\], PLANES \[10\], ROBOT \[1\], 
and REL \[9\] require the encoding of knowledge 
about the domain of application in such constructs 
as database schemata, lexicons, pragmatic grammars, 
and the like. The creation of these data 
structures typically requires considerable effort 
on the part of a computer professional who has had 
special training in computational linguistics and 
the use of databases. Thus, the utility of these 
systems is severely limited by the high cost 
involved in developing an interface to any 
particular database. 
This paper describes initial work on a 
methodology for creating natural-language 
processing capabilities for new domains without the 
need for intervention by specially trained experts. 
Our approach is to acquire logical schemata and 
lexical information through simple interactive 
dialogues with someone who is familiar with the 
form and content of the database, but unfamiliar 
with the technology of natural-language interfaces. 
To test our approach in an actual computer 
environment, we have developed a prototype system 
called TED (Transportable English Datamanager). As 
a result of our experience with TED, the NL group 
at SRI is now undertaking the development of a much 
more ambitious system based on the same philosophy 
\[4\]. 
II RESEARCH PROBLEMS 
Given the demonstrated feasibility of 
language-access systems, such as LADDER, major 
research issues to be dealt with in achieving 
transportable database interfaces include the 
following: 
* Information used by transportable systems 
must be cleanly divided into database- 
independent and database-dependent 
portions. 
* Knowledge representations must be 
established for the database-dependent part 
in such a way that their form is fixed and 
applicable to all databases and their 
content readily acquirable. 
* Mechanisms must be developed to enable the 
system to acquire information about a 
particular application from nonlinguists. 
III THE TED PROTOTYPE 
We have developed our prototype system (TED) 
to explore one possible approach to these problems. 
In essence, TED is a LADDER-like natural-language 
processing system for accessing databases, combined 
with an "automated interface expert" that 
interviews users to learn the language and logical 
structure associated with a particular database and 
that automatically tailors the system for use with 
the particular application. TED allows users to 
create, populate, and edit their own new local 
databases, to describe existing local databases, or 
even to describe and subsequently access 
heterogeneous (as in \[5\]) distributed databases. 
Most of TED is based on and built from 
components of LADDER. In particular, TED uses the 
LIFER parser and its associated support packages 
\[3\], the SODA data access planner \[5\], and the 
FAM file access manager \[6\]. All of these support 
packages are independent of the particular database 
used. In LADDER, the data structures used by these 
components are hand-generated for a particular 
database by computer scientists. In TED, however, 
they are created by TED's automated interface 
expert. 
Like LADDER, TED uses a pragmatic grammar; but 
TED's pragmatic grammar does not make any 
assumptions about the particular database being 
accessed. It assumes only that interactions with 
the system will concern data access or update, and 
that information regarding the particular database 
will be encoded in data structures of a prescribed 
form, which are created by the automated interface 
expert. 
The executive level of TED accepts three kinds 
of input: questions stated in English about the 
data in files that have been previously described 
to the system; questions posed in the SODA query 
language; and single-word commands that initiate 
dialogues with the automated interface expert. 
IV THE AUTOMATED INTERFACE EXPERT 
A. Philosophy 
TED's mechanism for acquiring information 
about a particular database application is to 
conduct interviews with users. For such interviews 
to be successful, 
* There must be a range of readily understood 
questions that elicit all the information 
needed about a new database. 
* The questions must be both brief and easy 
to understand. 
* The system must appear coherent, eliciting 
required information in an order 
comfortable to the user. 
* The system must provide substantial 
assistance, when needed, to enable a user 
to understand the kinds of responses that 
are expected. 
Not all of these points can be covered herein, but the 
sample transcript shown at the end of this paper, 
in conjunction with the following discussion, 
suggests the manner of our approach. 
The work reported herein was supported by the Advanced Research Projects Agency of the Department of Defense 
under contracts N00039-79-C-0118 and N00039-80-C-0645 with the Naval Electronic Systems Command. The views and 
conclusions contained in this document are those of the authors and should not be interpreted as representative 
of the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency of the 
U.S. Government. 
B. Strategy 
A key strategy of TED is to first acquire 
information about the structure of files. Because 
the semantics of files is relatively well 
understood, the system thereby lays the foundation 
for subsequently acquiring information about the 
linguistic constructions likely to be used in 
questions about the data contained in the file. 
One of the single-word commands accepted by 
the TED executive system is the command NEW, which 
initiates a dialogue prompting the user to supply 
information about the structure of a new data file. 
The NEW dialogue allows the user to think of the 
file as a table of information and asks relatively 
simple questions about each of the fields (columns) 
in the file (table). 
For example, TED asks for the heading names of 
the columns, for possible synonyms for the heading 
names, and for information about the types of 
values (numeric, Boolean, or symbolic) that each 
column can contain. The heading names generally 
act like relational nouns, while the information 
about the type of values in each column provides a 
clue to the column's semantics. The heading name 
of a symbolic column tends to be the generic name 
for the class of objects referred to by the values 
of that column. Heading names for Boolean columns 
tend to be the names of properties that database 
objects can possess. If a column contains numbers, 
this suggests that there may be some scale with 
associated adjectives of degree. To allow the 
system to answer questions requiring the 
integration of information from multiple files, the 
user is also asked about the interconnections 
between the file currently being defined and other 
files described previously. 
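As a rough illustration of how answers about a file's columns could be turned into lexical entries, consider the minimal Python sketch below. It is not TED's implementation: the FieldSpec record, the lexicon_entries function, the mapping table, and the synonyms TYPE and PHYSICIAN are all assumptions introduced here; only the category names are taken from the pragmatic grammar described in Section V.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    """One column of a file, as a NEW-style dialogue might record it."""
    heading: str
    kind: str  # "symbolic", "numeric", or "boolean"
    synonyms: list = field(default_factory=list)


# Assumed mapping from a column's value type to the category of its heading.
CATEGORY_BY_KIND = {
    "symbolic": "<SYM.ATTR>",
    "numeric": "<NUM.ATTR>",
    "boolean": "<BOOL.ATTR>",
}


def lexicon_entries(file_name, fields):
    """Map each heading and synonym to (category, file, field heading)."""
    lexicon = {}
    for spec in fields:
        category = CATEGORY_BY_KIND[spec.kind]
        for word in [spec.heading, *spec.synonyms]:
            lexicon[word.upper()] = (category, file_name, spec.heading)
    return lexicon


# Illustrative answers only; the synonyms are invented for this sketch.
ship_fields = [
    FieldSpec("NAME", "symbolic"),
    FieldSpec("CLASS", "symbolic", ["TYPE"]),
    FieldSpec("AGE", "numeric"),
    FieldSpec("DOCTOR", "boolean", ["PHYSICIAN"]),
]
lexicon = lexicon_entries("*SHIP*", ship_fields)
print(lexicon["PHYSICIAN"])
```

Keeping the field heading in every entry means that a synonym used in a question can always be traced back to the column it names.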
C. Examples from a Transcript 
In the sample transcript at the end of this 
paper, the user initiates a NEW dialogue at Point 
A. The automated interface expert then takes the 
initiative in the conversation, asking first for 
the name of the new file, then for the names of the 
file's fields. The file name will be used to 
distinguish the new file from others during the 
acquisition process. The field names are entered 
into the lexicon as the names of attributes and are 
put on an agenda so that further questions about 
the fields may be asked subsequently of the user. 
At this point, TED still does not know what 
type of objects the data in the new file concern. 
Thus, as its next task, TED asks for words that 
might be used as generic names for the subjects of 
the file. Then, at Point E, TED acquires 
information about how to identify one of these 
subjects to the user and, at Point F, determines 
what kinds of pronouns might be used to refer to 
one of the subjects. (As regards ships, TED is 
fooled, because ships may be referred to by "she.") 
TED is programmed with the knowledge that the 
identifier of an object must be some kind of name, 
rather than a numeric quantity or Boolean value. 
Thus, TED can assume a priori that the NAME field 
given in Interaction E is symbolic in nature. At 
Point G, TED acquires possible synonyms for NAME. 
TED then cycles through all the other fields, 
acquiring information about their individual 
semantics. At Point H, TED asks about the CLASS 
field, but the user doesn't understand the 
question. By typing a question mark, the user 
causes TED to give a more detailed explanation of 
what it needs. Every question TED asks has at 
least two levels of explanation that a user may 
call upon for clarification. For example, the user 
again has trouble at J, whereupon he receives an 
extended explanation with an example. See T also. 
Depending upon whether a field is symbolic, 
arithmetic or Boolean, TED makes different forms of 
entries in its lexicon and seeks to acquire 
different types of information about the field. 
For example, as at Points J, K and Y, TED asks 
whether symbolic field values can be used as 
modifiers (usually in noun-noun combinations). For 
arithmetic fields, TED looks for adjectives 
associated with scales, as is illustrated by the 
sequence OPQR. Once TED has a word such as OLD, it 
assumes MORE OLD, OLDER and OLDEST may also be 
used. (GOOD-BETTER-BEST requires special 
intervention.) 
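The derivation of degree forms from a base adjective can be sketched in a few lines of Python. This is only a guess at the kind of rule involved, not TED's code: the function degree_forms and the IRREGULAR table are hypothetical, and the sketch ignores spelling changes such as consonant doubling (BIG, BIGGER).

```python
# Irregular adjectives must be listed by hand, mirroring the
# "special intervention" needed for GOOD-BETTER-BEST.
IRREGULAR = {"GOOD": ("BETTER", "BEST")}


def degree_forms(adjective):
    """Return the comparative and superlative forms assumed for an adjective."""
    word = adjective.upper()
    if word in IRREGULAR:
        comparative, superlative = IRREGULAR[word]
        return {"comparatives": [comparative], "superlative": superlative}
    # Drop a final E so that LARGE yields LARGER rather than LARGEER.
    stem = word[:-1] if word.endswith("E") else word
    return {
        "comparatives": [f"MORE {word}", stem + "ER"],
        "superlative": stem + "EST",
    }


print(degree_forms("old"))
```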
Note the aggressive use of previously acquired 
information in formulating new questions to the 
user (as in the use of AGE, and SHIP at Point P). 
We have found that this aids considerably in 
keeping the user focused on the current items of 
interest to the system and helps to keep 
interactions brief. 
Once TED has acquired local information about 
a new file, it seeks to relate it to all known 
files, including the new file itself. At Points Z 
through B+, TED discovers that the *SHIP* file may 
be joined with itself. That is, one of the 
attributes of a ship is yet another ship (the 
escorted ship), which may itself be described in 
the same file. The need for this information is 
illustrated by the query the user poses at Point 
G+. 
To better illustrate linkages between files, 
the transcript includes the acquisition of a second 
file about ship classes, beginning at Point J+. 
Much of this dialogue is omitted but, at L+, TED 
learns there is a link between the *SHIP* and 
*CLASS* files. At M+ it learns the direction of 
this link; at N+ and O+ it learns the fields upon 
which the join must be made; at P+ it learns the 
attributes inherited through the link. This 
information is used, for example, in answering the 
query at S+. TED converts the user's question 
"What is the speed of the hoel?" into "What is the 
speed of the class whose CNAME is equal to the 
CLASS of the hoel?" 
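The effect of such a link can be sketched in Python as a lookup that falls back on the linked file when the requested attribute is not stored with the ship itself. Everything in this sketch is an assumption made for illustration: the row values are invented, the key field of the class file is written CNAME, and the LINK record and attribute_of function are hypothetical names rather than TED's data structures.

```python
# Invented sample rows; they are not data from the paper.
SHIP = [
    {"NAME": "HOEL", "CLASS": "ADAMS", "AGE": 18},
    {"NAME": "REEVES", "CLASS": "LEAHY", "AGE": 16},
]
CLASS = [
    {"CNAME": "ADAMS", "SPEED": 30},
    {"CNAME": "LEAHY", "SPEED": 32},
]

# What the link dialogue is assumed to record: the join fields and
# the attributes a ship inherits from its class.
LINK = {"from_field": "CLASS", "to_field": "CNAME", "inherited": ["SPEED"]}


def attribute_of(ship_name, attribute):
    """Return an attribute of a ship, following the link when necessary."""
    ship = next(row for row in SHIP if row["NAME"] == ship_name)
    if attribute in ship:
        return ship[attribute]
    if attribute in LINK["inherited"]:
        key = ship[LINK["from_field"]]
        linked = next(row for row in CLASS if row[LINK["to_field"]] == key)
        return linked[attribute]
    raise KeyError(attribute)


print(attribute_of("HOEL", "SPEED"))
```

The question about the speed of the hoel is thus answered from the class record whose key matches the ship's CLASS value.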
Of course, the whole purpose of the NEW 
dialogues is to make it possible for users to ask 
questions of their databases in English. Examples 
of English inputs accepted by TED are shown at 
Points E+ through I+, and S+ and T+ in the 
transcript. Note the use of noun-noun 
combinations, superlatives and arithmetic. 
Although not illustrated, TED also supports all the 
available LADDER facilities of ellipsis, spelling 
correction, run-time grammar extension and 
introspection. 
V THE PRAGMATIC GRAMMAR 
The pragmatic grammar used by TED includes 
special syntactic/semantic categories that are 
acquired by the NEW dialogues. In our actual 
implementation, these have rather awkward names, 
but they correspond approximately to the following: 
* <GENERIC> is the category for the generic 
names of the objects in files. Lexical 
properties for this category include the 
name of the relevant file(s) and the names 
of the fields that can be used to identify 
one of the objects to the user. See 
transcript Points D and E. 
* <ID.VALUE> is the category for the 
identifiers of subjects of individual 
records (i.e., key-field values). For 
example, for the *SHIP* file, it contains 
the values of the NAME field. See 
transcript Point E. 
* <MOD.VALUE> is the category for the values 
of database fields that can serve as 
modifiers. See Points J and K. 
* <NUM.ATTR>, <SYM.ATTR>, and <BOOL.ATTR> are 
numeric, symbolic and Boolean attributes, 
respectively. They include the names of 
all database fields and their synonyms. 
* <+NUM.ADJ> is the category for adjectives 
(e.g. OLD) associated with numeric fields. 
Lexical properties include the name of the 
associated field and files, as well as 
information regarding whether the adjective 
is associated with greater (as in OLD) or 
lesser (as in YOUNG) values in the field. 
See Points P, Q and R. 
* <COMP.ADJ> and <SUPERLATIVE> are derived 
from <+NUM.ADJ>. 
Shown below are some illustrative pragmatic 
production rules for nonlexical categories. As in 
the foregoing examples, these are not exactly the 
rules used by TED, but they do convey the nature of 
the approach. 
<S> -> <PRESENT> THE <ATTR> OF <ITEM> 
           what is the age of the reeves 
       HOW <+NUM.ADJ> <BE> <ITEM> 
           how old is the youngest ship 
       <WHDET> <ITEM> <HAVE> <FEATURE> 
           what leahy ships have a doctor 
       <WHDET> <ITEM> <BE> <COMPLEMENT> 
           which ships are older than reeves 
<PRESENT> -> WHAT <BE> 
             PRINT 
<ATTR> -> <NUM.ATTR> 
          <SYM.ATTR> 
          <BOOL.ATTR> 
<ITEM> -> <GENERIC> 
              ships 
          <ID.VALUE> 
              reeves 
          THE <ITEM> 
              the oldest ship 
          <MOD.VALUE> <ITEM> 
              leahy ships 
          <SUPERLATIVE> <ITEM> 
              fastest ship with a doctor 
          <ITEM> <WITH> <FEATURE> 
              ship with a speed greater than 12 
<FEATURE> -> <BOOL.ATTR> 
                 doctor / poisonous 
             <NUM.ATTR> <NUM.COMP> <NUMBER> 
                 age of 15 
             <NUM.ATTR> <NUM.COMP> <ITEM> 
                 age greater than reeves 
<NUM.COMP> -> <COMP.ADJ> THAN 
              OF 
              <GREATER> THAN 
<COMPLEMENT> -> <COMP.ADJ> THAN <ITEM> 
                <COMP.ADJ> THAN <NUMBER> 
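To make the shape of these rules concrete, the following minimal Python sketch parses only the first <S> pattern and a subset of the <ITEM> alternatives. It is neither LIFER nor TED's grammar: the tiny lexicon, the function names parse_item and parse_question, and the tuple-based parse trees are assumptions made for illustration.

```python
# A toy lexicon of the kind the acquisition dialogues are meant to fill.
LEXICON = {
    "ship": "<GENERIC>", "ships": "<GENERIC>",
    "reeves": "<ID.VALUE>",
    "leahy": "<MOD.VALUE>",
    "oldest": "<SUPERLATIVE>", "youngest": "<SUPERLATIVE>",
    "age": "<NUM.ATTR>", "class": "<SYM.ATTR>",
}
ATTR_CATEGORIES = {"<NUM.ATTR>", "<SYM.ATTR>", "<BOOL.ATTR>"}


def parse_item(tokens, i):
    """Parse an <ITEM> starting at tokens[i]; return (tree, next_i) or None."""
    if i >= len(tokens):
        return None
    word = tokens[i]
    category = LEXICON.get(word)
    if word == "the":  # <ITEM> -> THE <ITEM>
        rest = parse_item(tokens, i + 1)
        return rest and (("THE", rest[0]), rest[1])
    if category in ("<MOD.VALUE>", "<SUPERLATIVE>"):  # modifier + <ITEM>
        rest = parse_item(tokens, i + 1)
        return rest and ((category, word, rest[0]), rest[1])
    if category in ("<GENERIC>", "<ID.VALUE>"):  # lexical <ITEM>
        return (category, word), i + 1
    return None


def parse_question(text):
    """Parse <PRESENT> THE <ATTR> OF <ITEM>, with <PRESENT> = WHAT IS or PRINT."""
    tokens = text.lower().rstrip("?").split()
    if tokens[:2] == ["what", "is"]:
        i = 2
    elif tokens[:1] == ["print"]:
        i = 1
    else:
        return None
    if len(tokens) < i + 3 or tokens[i] != "the" or tokens[i + 2] != "of":
        return None
    attr = tokens[i + 1]
    if LEXICON.get(attr) not in ATTR_CATEGORIES:
        return None
    item = parse_item(tokens, i + 3)
    if item is None or item[1] != len(tokens):
        return None
    return {"attribute": attr, "item": item[0]}


print(parse_question("What is the age of the reeves?"))
```

Because nothing in the two functions mentions ships, retargeting the sketch to another file would require changing only the LEXICON table, which is the property the rules above are designed to have.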
These pragmatic grammar rules are very much 
like the ones used in LADDER \[2\], but they differ 
from those of LADDER in two critical ways. 
(1) They capture the pragmatics of accessing 
databases without forcibly including 
information about the pragmatics of any 
one particular set of data. 
(2) They use syntactic/semantic categories 
that support the processes of accessing 
databases, but that are domain-independent 
and easily acquirable. 
It is worth noting that, even when a particular 
application requires the introduction of 
special-purpose rules, the basic pragmatic grammar 
used by TED provides a starting point from which 
domain-specific features can be added. 
VI DIRECTIONS FOR FURTHER WORK 
The TED system represents a first step toward 
truly portable natural-language interfaces to 
database systems. TED is only a prototype, 
however, and much additional work will be required 
to provide adequate syntactic and conceptual 
coverage, as well as to increase the ease with 
which systems may be adapted to new databases. 
A severe limitation of the current TED system 
is its restricted range of syntactic coverage. For 
example, TED deals only with the verbs BE and HAVE, 
and does not know about units (e.g., the Waddel's 
age is 15.5, not 15.5 YEARS). To remove this 
limitation, the SRI NL group is currently adapting 
Jane Robinson's extensive DIAGRAM grammar \[7\] for 
use in a successor to TED. In preparation for the 
latter, we are experimenting with verb acquisition 
dialogues such as the following: 
> VERB 
Please conjugate the verb 
(e.g. fly flew flown) > EARN EARNED EARNED 
EARN is: 
1 intransitive (John dines) 
2 transitive (John eats dinner) 
3 ditransitive (John cooks Mary dinner) 
(Choose the most general pattern) > 2 
who or what is EARNED? > A SALARY 
who or what EARNS A SALARY? > AN EMPLOYEE 
can A SALARY be EARNED by AN EMPLOYEE? > YES 
can A SALARY EARN? > NO 
can AN EMPLOYEE EARN? > NO 
Ok, an EMPLOYEE can EARN a SALARY 
What database field identifies an EMPLOYEE? > NAME 
What database field identifies a SALARY? > SALARY 
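One way to picture what such a dialogue yields is a single record per verb. The Python sketch below collects the answers from the EARN dialogue into a hypothetical VerbFrame structure; the record layout, the field names, and the frame_from_answers function are assumptions for illustration and are not taken from TED or its successor.

```python
from dataclasses import dataclass

# Menu numbers as offered in the dialogue above.
PATTERNS = {1: "intransitive", 2: "transitive", 3: "ditransitive"}


def strip_article(phrase):
    """Drop a leading A/AN so that 'A SALARY' is stored as 'SALARY'."""
    words = phrase.split()
    return " ".join(words[1:]) if words and words[0] in ("A", "AN") else phrase


@dataclass(frozen=True)
class VerbFrame:
    base: str
    past: str
    participle: str
    pattern: str
    subject: str
    subject_field: str
    obj: str
    object_field: str
    passive_ok: bool


def frame_from_answers(answers):
    """Build a VerbFrame from the user's replies, keyed by question."""
    base, past, participle = answers["conjugation"].split()
    return VerbFrame(
        base=base,
        past=past,
        participle=participle,
        pattern=PATTERNS[answers["pattern"]],
        subject=strip_article(answers["subject"]),
        subject_field=answers["subject_field"],
        obj=strip_article(answers["object"]),
        object_field=answers["object_field"],
        passive_ok=answers["passive"] == "YES",
    )


# The replies given in the EARN dialogue.
answers = {
    "conjugation": "EARN EARNED EARNED",
    "pattern": 2,
    "object": "A SALARY",
    "subject": "AN EMPLOYEE",
    "passive": "YES",
    "subject_field": "NAME",
    "object_field": "SALARY",
}
frame = frame_from_answers(answers)
print(frame)
```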
The greatest challenge to extending systems 
like TED is to increase their conceptual coverage. 
As pointed out by Tennant \[8\], users who are 
accorded natural-language access to a database 
expect not only to retrieve information directly 
stored there, but also to compute "reasonable" 
derivative information. For example, if a database 
has the location of two ships, users will expect 
the system to be able to provide the distance 
between them--an item of information not directly 
recorded in the database, but easily computed from 
the existing data. In general, any system that is 
to be widely accepted by users must not only 
provide access to primary information, but must 
also enhance the latter with procedures that 
calculate secondary attributes from the data 
actually stored. Data enhancement procedures are 
currently provided by LADDER and a few other 
hand-built systems, but work is needed now to devise 
means for allowing system users to specify their 
own database enhancement functions and to couple 
these with the natural-language component. 
A second issue associated with conceptual 
coverage is the ability to access information 
extrinsic to the database per se, such as where the 
data are stored and how the fields are defined, as 
well as information about the status of the query 
system itself. 
In summary, systems such as LADDER are of 
limited utility unless they can be transported to 
new databases by people with no significant formal 
training in computer science. Although the 
development of user-specifiable systems with 
extensive conceptual and syntactic coverage 
continues to pose a challenge to research, a 
polished version of the TED prototype, even with 
its limited coverage, would appear to have high 
potential as a useful tool for data access. 
REFERENCES 
1. L. R. Harris, "User Oriented Data Base Query 
with the ROBOT Natural Language Query System," 
Proc. Third International Conference on Very 
Large Data Bases, Tokyo (October 1977). 
2. G. G. Hendrix, E. D. Sacerdoti, D. Sagalowicz, 
and J. Slocum, "Developing a Natural Language 
Interface to Complex Data," ACM Transactions 
on Database Systems, Vol. 3, No. 2 (June 
1978). 
3. G. G. Hendrix, "Human Engineering for Applied 
Natural Language Processing," Proc. 5th 
International Joint Conference on Artificial 
Intelligence, Cambridge, Massachusetts (August 
1977). 
4. G. G. Hendrix, D. Sagalowicz and E. D. 
Sacerdoti, "Research on Transportable English- 
Access Media to Distributed and Local Data 
Bases," Proposal ECU 79-103, Artificial 
Intelligence Center, SRI International, Menlo 
Park, California (November 1979). 
5. R. C. Moore, "Handling Complex Queries in a 
Distributed Data Base," Technical Note 170, 
Artificial Intelligence Center, SRI 
International, Menlo Park, California (October 
1979). 
6. P. Morris and D. Sagalowicz, "Managing Network 
Access to a Distributed Data Base," Proc. 
Second Berkeley Workshop on Distributed Data 
Management and Computer Networks, Berkeley, 
California (May 1977). 
7. J. J. Robinson, "DIAGRAM: A Grammar for 
Dialogues," Technical Note 205, Artificial 
Intelligence Center, SRI International, 
Menlo Park, California (February 1980). 
8. H. Tennant, "Experience with the Evaluation of 
Natural Language Question Answerers," Proc. 
Sixth International Joint Conference on 
Artificial Intelligence, Tokyo, Japan (August 
1979). 
9. F. B. Thompson and B. H. Thompson, "Practical 
Natural Language Processing: The REL System as 
Prototype," pp. 109-168, M. Rubinoff and M. C. 
Yovits, eds., Advances in Computers 13 
(Academic Press, New York, 1975). 
10. D. Waltz, "Natural Language Access to a Large 
Data Base: An Engineering Approach," Proc. 4th 
International Joint Conference on Artificial 
Intelligence, Tbilisi, USSR, pp. 868-872 
(September 1975). 
[Sample transcript of a TED session, with the labeled points (A through T+) cited in the text; illegible in the source.] 

