Problems with Domain-Independent Natural Language Database Access Systems
Steven P. Shwartz
Cognitive Systems Inc. 
234 Church Street 
New Haven, Ct. 06510
In the past decade, a number of natural language
database access systems have been constructed
(e.g. Hendrix 1976; Waltz et al. 1976; Sacerdoti
1978; Harris 1979; Lehnert and Shwartz 1982;
Shwartz 1982). The level of performance achieved
by natural language database access systems varies
considerably, with the more robust systems
operating within a narrow domain (i.e., content
area) and relying heavily on domain-specific
knowledge to guide the language understanding
process. Transporting a system constructed for one
domain into a new domain is extremely
resource-intensive because a new set of
domain-specific knowledge must be encoded.
In order to reduce the cost of transportation,
a great deal of current research has focussed on
building natural language access systems that are
domain-independent. More specifically, these
systems attempt to use syntactic knowledge in
conjunction with knowledge about the structure of
the database as a substitute for conceptual
knowledge regarding the database content area. In
this paper I examine the issue of whether or not it
is possible to build a natural language database
access system that achieves an acceptable level of
performance without including domain-specific
conceptual knowledge.
A performance criterion for natural language access
systems.
The principal motivation for building natural
language systems for database access is to free the
user from the need for data processing instruction.
A natural language front end is a step above the
"English-like" query systems that presently
dominate the commercial database retrieval field.
English-like query systems allow the user to phrase
requests as English sentences, but permit only a
restricted subset of English and impose a rigid
syntax on user requests. These English-like query
systems are easy to learn, but a training period is
still required for the user to learn to phrase
requests that conform to these restrictions.
However, the training period is often very brief, and
natural language systems can be considered superior
only if no computer-related training or knowledge
is required of the user.
This criterion can only be met if no restric- 
tions are placed on user queries. A user who has 
previously relied on a programmer-technician to 
code formal queries for information retrieval 
should be permitted to phrase information retrieval
requests to the program in exactly the same way as
to the technician. That is, whatever the techni- 
cian would understand, the program should 
understand. For example, a natural language front 
end to a stock market database should understand 
that 
(1) Did IBM go up yesterday? 
refers to PRICE and not VOLUME. However, the
system need not understand requests that a
programmer-technician would be unable to process, e.g.
(2) Is GENCO a likely takeover target? 
That is, the programmer-technician working for an
investment firm would not be expected to know how
to process requests that require "expert" knowledge
and neither should a natural language front end.
If, however, a natural language system cannot
achieve the level of performance of a
programmer-technician it will seem stupid because it
does not meet a user's expectations for an English
understanding system.
The "programmer-technician criterion" cannot
possibly be met by a domain-independent natural
language access system because language
understanding requires domain-specific world knowledge.
On a theoretical level, the need for a knowledge base
in a natural language processing system has been
well-documented (e.g. Schank & Abelson 1977;
Lehnert 1978; Dyer 1982). It will be argued
below that in an applied context, a system that
does not have a conceptual knowledge base can
produce at best only a shallow level of understanding
and one that does not meet the criterion specified
above. Further, the domain-independent approach
creates a host of problems that are simply
non-existent in knowledge-based systems.
Problems for domain-independent systems: inference,
ambiguity, and anaphora.
Inferential processing is an integral part of 
natural language understanding. Consider the
following requests from PEARL (Lehnert and Shwartz
1982; Shwartz 1982) when it operates in the domain
of geological map generation:
(3) Show me all oil wells from 1970 to 1980.
(4) Show me all oil wells from 8000 to 7000.
(5) Show me all oil wells 1 to 2000.
(6) Show me all oil wells 40 to 41, 80 to 81.
A programmer-technician in the petrochemical
industry would infer that (3) refers to drilling
dates, (4) refers to well depth, (5) refers to the
map scale, and (6) refers to latitude/longitude
specifications.
Correct processing of these requests requires
inferential processing that is based on knowledge of
the petrochemical industry. That is, these
conventions are not in everyone's general working
knowledge of the English language. Yet they are
standard usage for people who communicate with each
other about drilling data, and any system that
claims to provide a natural language interface to a
data base of drilling data must have the knowledge
to correctly process requests such as these.
Without such inferential processing, the user is
required to spell out everything in detail,
something that is simply not necessary in normal
English discourse.
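As a rough illustration of what encoding such conventions might involve, the following minimal sketch maps the bare numeric ranges in requests (3) through (6) to the attribute a programmer-technician would infer. The function name, the attribute labels, and the rules themselves are assumptions made only for this illustration; they are not PEARL's actual rules.

```python
def interpret_ranges(ranges):
    """Guess which attribute bare numeric ranges refer to.

    The rules below are illustrative stand-ins for domain conventions
    of the petrochemical industry; they are assumptions, not PEARL's.
    """
    if len(ranges) == 2:
        # Two ranges together are read as a latitude/longitude window.
        return "LATITUDE_LONGITUDE"
    low, high = ranges[0]
    if low == 1:
        # "1 to 2000" is read as a map scale.
        return "MAP_SCALE"
    if all(1900 <= value <= 1999 for value in (low, high)):
        # Values that look like years are read as drilling dates.
        return "DRILLING_DATE"
    # Any other single range is read as a well depth.
    return "WELL_DEPTH"


requests = {
    3: [(1970, 1980)],
    4: [(8000, 7000)],
    5: [(1, 2000)],
    6: [(40, 41), (80, 81)],
}
for number, ranges in requests.items():
    print(number, interpret_ranges(ranges))
```

None of these rules follows from English syntax or from the layout of a database; each one records a convention of the content area, which is the point of the examples above.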
Another problem for any natural language
understanding system is the processing of ambiguous
words. In some cases disambiguation can be
performed syntactically. In other cases, the
structure of the database can provide the information
necessary for word sense disambiguation (more on
this below). However, in many cases disambiguation
can only be performed if domain-specific world
knowledge is available. For example, consider the
processing of the word "sales" in (7), (8), (9) and (10).
(7) What is the average mark up for sales of stereo 
equipment? 
(8) What is the average mark down for sales of 
stereo equipment? 
(9) What is the average mark up during sales of 
stereo equipment? 
(10) What is the average mark down during sales of
stereo equipment? 
These four requests, which are so nearly identical
both lexically and syntactically, have very
distinct meanings that derive from the fact that the
correct sense of "sales" in (7) is quite different
from the sense of "sales" intended in (8), (9), and
(10). Most people have little difficulty
determining which sense of "sales" is intended in these
sentences, and neither would a knowledge-based
understander. The key to the disambiguation process
involves world knowledge regarding retail sales.
Anaphora poses similar problems.
For example, suppose the following requests were 
submitted to a personnel data base: 
(11) List all salesmen with retirement plans along 
with their salaries. 
(12) List all offices with women managers along 
with their salaries. 
While these requests are syntactically identical,
the referents for "their" in (11) and (12) occupy
different syntactic positions. As human
information processors, we have no trouble understanding
that salaries are associated with people, so
retirement plans and offices are never considered
as possible referents. Again, domain-specific
world knowledge is helpful in understanding these
requests.
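The conceptual constraint at work here (salaries belong to people) can be sketched in a few lines. The semantic type labels and the two lookup tables are hypothetical and serve only to illustrate the argument; they do not describe the representation used by any of the systems cited.

```python
# Hypothetical semantic types for the noun phrases in requests (11) and (12).
SEMANTIC_TYPE = {
    "salesmen": "PERSON",
    "retirement plans": "BENEFIT",
    "offices": "LOCATION",
    "women managers": "PERSON",
}

# Hypothetical world knowledge: the kind of entity that can own an attribute.
ATTRIBUTE_OWNER = {"salaries": "PERSON"}


def resolve_their(candidates, attribute):
    """Keep only the candidate referents that can own the attribute."""
    required_type = ATTRIBUTE_OWNER[attribute]
    return [phrase for phrase in candidates if SEMANTIC_TYPE[phrase] == required_type]


print(resolve_their(["salesmen", "retirement plans"], "salaries"))
# -> ['salesmen']
print(resolve_their(["offices", "women managers"], "salaries"))
# -> ['women managers']
```

The same rule picks the first noun phrase in (11) and the second in (12), which is why syntactic position alone cannot decide the referent.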
Structural knowledge as a substitute for conceptual
knowledge.
One of the innovations to emerge from the
construction of domain-independent systems is a clever
mechanism that extracts domain-specific knowledge
from the structure of the data base. For example,
the resolution of the pronoun "their" in both (11)
and (12) above could be accomplished by using only
structural (rather than conceptual) knowledge of
the domain. Suppose, for instance, that the payroll
database for (11) were structured such that SALARY
and RETIREMENT-PLANS were fields within a SALESMAN
file. It would then be possible to infer that
"their" refers to "salesmen" in (11) by noting that
SALARY is a field in the SALESMAN file, but that
SALARY is not an entry in a RETIREMENT-PLANS file.
Unfortunately, this approach has limited utility
because it relies on a fortuitous database
structure. Consider what would happen if the data
base had a top-level EMPLOYEES file (rather than
individual files for each type of employee) with
fields for JOB-TYPE, SALARY, COMMISSIONS, and
RETIREMENT-PLANS. With this database organization,
it would not be possible to determine, on the basis
of the structure of the database, that "their"
refers to "salesmen" and not "secretaries" in
(13) List all salesmen who have secretaries along
with their commissions.
To the naive user, however, the meaning of this
sentence is perfectly clear. A person who
couldn't determine the referent of "their" in (13)
would not be perceived as having an adequate
command of the English language and the same would be
true for a computer system that did not understand
the request.
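The structural heuristic, and the way it breaks down once all employees share one file, can be sketched as follows. The dictionary layout and the function name are assumptions chosen for illustration; only the file and field names come from the examples above.

```python
def structural_referents(candidates, attribute, schema):
    """Return the candidate nouns whose file has the attribute as a field.

    candidates maps each candidate noun phrase to the file that would
    hold it; schema maps each file name to its set of field names.
    """
    return [
        noun
        for noun, file_name in candidates.items()
        if attribute in schema.get(file_name, set())
    ]


# Request (11) with one file per employee type: the structure decides.
per_type_schema = {"SALESMAN": {"SALARY", "RETIREMENT-PLANS"}}
request_11 = {"salesmen": "SALESMAN", "retirement plans": "RETIREMENT-PLANS"}
print(structural_referents(request_11, "SALARY", per_type_schema))
# -> ['salesmen']

# Request (13) with a single EMPLOYEES file: both candidates survive.
single_file_schema = {
    "EMPLOYEES": {"JOB-TYPE", "SALARY", "COMMISSIONS", "RETIREMENT-PLANS"}
}
request_13 = {"salesmen": "EMPLOYEES", "secretaries": "EMPLOYEES"}
print(structural_referents(request_13, "COMMISSIONS", single_file_schema))
# -> ['salesmen', 'secretaries']
```

In the second case the structure offers no basis for preferring one candidate; choosing correctly requires knowing that commissions are paid to salesmen.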
Pitfalls associated with the domain-independent approach.
In a knowledge-based system such as PEARL, a
natural language request is parsed into a
conceptual representation of the meaning of the request.
The retrieval routine is then generated from this
conceptual representation. As a result, the parser
is independent of the logical structure of the
database. That is, the same parser can be used for
databases with different logical structures, but
the same information content. Further, the same
parser can be used whether the required information
is located in a single file or in multiple files.
In a domain-independent system, the parser is
entirely dependent on the structure of the database
for domain-specific knowledge. As a result, one
must restructure the parser for databases with
identical content but different logical structure.
Similarly, the output of the parser must be very
different when the required information is
contained in multiple files rather than a single file.
Because of their lack of conceptual knowledge
regarding the database, domain-independent systems
rely heavily on key words or phrases to indicate
which database field is being referred to. For
example,
(14) What is Bill Smith's job title?
might be easily processed by simply retrieving the
contents of a JOB-TITLE field. Different ways of
referring to job title can also be handled as
synonyms. However, domain-independent systems get
into deep trouble when the database field that
needs to be accessed is not directly indicated by
key words or phrases in the input request. For
example,
(15) Is John Jones the child of an alumnus? 
is easily processed if there exists a 
CHILD-OF-AN-ALUMNUS field, but the query 
(16) Is one of John Jones' parents an alumnus?
contains no key word or phrase to indicate that the
CHILD-OF-AN-ALUMNUS field should be accessed. In a
knowledge-based system, the retrieval routine is
generated from a conceptual representation of the
meaning of the user query and therefore key words
or phrases are not required. A related problem
occurs with queries involving aggregation or
quantity. For example,
(17) How many employees are in the sales depart- 
ment? 
might require retrieving the value of a particular
field (e.g. NUMBER-OF-EMPLOYEES), or it might
require totalling the number of records in the
EMPLOYEE file that have the correct DEPARTMENT field
value, or, if the departments are broken down into
offices, it might require totalling the
NUMBER-OF-EMPLOYEES field for each office. In a
domain-independent system, the correct parse depends
upon the structure of the database and is therefore
difficult to handle in a general way. In a
knowledge-based system such as PEARL, the different
database structures would simply require altering
the mapping between the conceptual representation
of the parse and the retrieval query.
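A minimal sketch of this separation follows: one conceptual representation of request (17), and three interchangeable mappings to retrieval, one per database organization mentioned above. The dictionary-based representation, the function names, and the sample data are assumptions for illustration and do not reflect PEARL's internal format.

```python
# One conceptual representation of request (17), independent of any schema.
concept = {"measure": "COUNT", "entity": "EMPLOYEE", "department": "SALES"}


def count_from_department_field(db, c):
    """Schema A: the department record stores the count directly."""
    return db["DEPARTMENT"][c["department"]]["NUMBER-OF-EMPLOYEES"]


def count_from_employee_records(db, c):
    """Schema B: count EMPLOYEE records with the right DEPARTMENT value."""
    return sum(1 for rec in db["EMPLOYEE"] if rec["DEPARTMENT"] == c["department"])


def count_from_office_totals(db, c):
    """Schema C: total the per-office counts for the department."""
    return sum(
        office["NUMBER-OF-EMPLOYEES"]
        for office in db["OFFICE"]
        if office["DEPARTMENT"] == c["department"]
    )


db_a = {"DEPARTMENT": {"SALES": {"NUMBER-OF-EMPLOYEES": 3}}}
db_b = {"EMPLOYEE": [
    {"NAME": "Adams", "DEPARTMENT": "SALES"},
    {"NAME": "Baker", "DEPARTMENT": "SALES"},
    {"NAME": "Clark", "DEPARTMENT": "SALES"},
    {"NAME": "Davis", "DEPARTMENT": "RESEARCH"},
]}
db_c = {"OFFICE": [
    {"DEPARTMENT": "SALES", "NUMBER-OF-EMPLOYEES": 2},
    {"DEPARTMENT": "SALES", "NUMBER-OF-EMPLOYEES": 1},
    {"DEPARTMENT": "RESEARCH", "NUMBER-OF-EMPLOYEES": 5},
]}

print(count_from_department_field(db_a, concept))  # -> 3
print(count_from_employee_records(db_b, concept))  # -> 3
print(count_from_office_totals(db_c, concept))     # -> 3
```

Only the mapping function changes from one database to the next; the representation produced by the parser stays the same.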
Finally, this reliance on database structure
can lead to wrong answers. A classic example is
Harris' (1979) "snowmobile problem". When Harris'
ROBOT system interfaces with a file containing
information about homeowner's insurance, the word
"snowmobile" is defined as any number > 0 in the
"snowmobile field" of an insurance policy record.
This means that as far as ROBOT is concerned, the
question "How many snowmobiles are there?" is no
different from "How many policies have snowmobile
coverage?" However, the correct answers to the two
questions will often be very different. If the
first question is asked and the second question is
answered, the result is an incorrect answer. If
the first question cannot be answered due to the
structure of the database, the system should inform
the user that this is the case.
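The difference between the two questions is easy to see on a small example. The sketch below assumes, purely for illustration, that the snowmobile field holds the number of snowmobiles covered by each policy; the records and function names are hypothetical and are not taken from ROBOT.

```python
# Hypothetical policy records; SNOWMOBILE is assumed to hold the number
# of snowmobiles covered by the policy.
policies = [
    {"POLICY": 1, "SNOWMOBILE": 0},
    {"POLICY": 2, "SNOWMOBILE": 2},
    {"POLICY": 3, "SNOWMOBILE": 1},
]


def policies_with_coverage(records):
    """'How many policies have snowmobile coverage?'"""
    return sum(1 for rec in records if rec["SNOWMOBILE"] > 0)


def total_snowmobiles(records):
    """'How many snowmobiles are there?'"""
    return sum(rec["SNOWMOBILE"] for rec in records)


print(policies_with_coverage(policies))  # -> 2
print(total_snowmobiles(policies))       # -> 3
```

A system that defines "snowmobile" as a field value greater than zero can only compute the first quantity, and so returns 2 when the user has asked for 3.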
Conclusions.
I have argued above that conceptually-based
domain-specific knowledge is absolutely essential
for natural language database access systems.
Systems that rely on database structure for this
domain-specific knowledge will not achieve an
acceptable level of performance -- i.e. operate at
the level of understanding of a
programmer-technician.
Because of the requirement for domain-specific
knowledge, conceptually-based systems are
restricted to limited domains and are not readily portable
to new content areas. However, eliminating the
domain-specific conceptual knowledge is throwing
the baby out with the bath water. The
conceptually-based domain-specific knowledge is the key to
robust understanding.
The approach of the PEARL project with regard
to the transportability problem is to try to
identify areas of discourse that are common to most
domains and to build robust modules for natural
language analysis within these domains. Examples
of such domains are temporal reference, location
reference, and report generation. These modules
are knowledge-based and can be used by a wide
variety of domains to help extract the conceptual
content of a request.
REFERENCES 
Dyer, M. (1982). In-Depth Understanding: A Computer
Model of Integrated Processing for Narrative
Comprehension. Yale University, Computer
Science Dept., Research Report #219.
Harris, L. R. (1979). Experience with ROBOT in 12
commercial natural language data base query
applications. Proceedings of the 6th International
Joint Conference on Artificial Intelligence.
Hendrix, G. G. (1976). LIFER: A natural language
interface facility. SRI Tech. Note 135. Dec.
1976.
Lehnert, W. (1978). The Process of Question
Answering. Lawrence Erlbaum Associates,
Hillsdale, New Jersey.
Lehnert, W. and Shwartz, S. (1982). Natural
Language Data Base Access with Pearl.
Proceedings of the Ninth International Conference on
Computational Linguistics, Prague,
Czechoslovakia.
Sacerdoti, E. D. (1978). A LADDER user's guide.
Technical Note 163. SRI Project 6891.
Schank, R. C. and Abelson, R. (1977). Scripts,
Plans, Goals and Understanding. Lawrence
Erlbaum Associates, Hillsdale, New Jersey.
Shwartz, S. (1982). PEARL: A Natural Language
Analysis System for Information Retrieval
(submitted to AAAI-82/applications division).
Waltz, D. L., Finin, T., Green, F., Conrad, F.,
Goodman, B., Hadden, G. (1976). The planes
system: natural language access to a large data
base. Coordinated Science Lab., Univ. of
Illinois, Urbana, Tech. Report T-34, (July 1976).
