EVALUATION OF NATURAL LANGUAGE INTERFACES TO DATA BASE SYSTEMS 
Bozena Henisz Thompson 
California Institute of Technology 
INTRODUCTION
Is evaluation, like beauty, in the eye of the beholder?
The answer is far from simple because it depends on who
is considered to be the proper beholder. Evaluators may
range from casual users to society as a whole, with
system builders, sophisticated users, linguists, grant
providers, system buyers, and others in between. The
members of this panel are system builders and linguists
-- or rather the two fused into one -- but, I believe,
interested in all or almost all actual or potential
bodies of evaluators. One of our colleagues expressed a
forceful opinion while being a member of a similar panel
at last year's ACL conference: "Those of us on this
panel and other researchers in the field simply don't
have the right to determine whether a system is
practical. Only the users of such a system can make that
determination. Only a user can decide whether the NL
[natural language] capability constitutes sufficient
added value to be deemed practical. Only a user can
decide if the system's frequency of inappropriate
response is sufficiently low to be deemed practical.
Only a user can decide whether the overall NL
interaction, taken in toto, offers enough benefits over
alternative formal interactions to be deemed practical" [1].
It is hard for me to disagree, since I argued as
forcefully on the basis of my study of users' evaluation of
machine translation [2] -- a study which was prompted by
the evaluations of the quality of machine translation as
viewed by linguists and users, ranging from 35%
acceptable for the former to 90% for the latter. What the
study also showed was that the practicality of the
output could indeed only be judged by the users, since even
incomplete and stylistically very inelegant translations
were found quite useful in practice because they, on the
one hand, provided, however crudely, the information
sought by the users, and, on the other hand, the users
themselves brought knowledge that made the texts far
more understandable and useful than might appear to a
nonspecialist linguist. But this endorsement on my part
of the user as the ultimate judge in evaluations does
not preclude my fully subscribing to Norm Sondheimer's
[3] introductory comments to this panel stating that to
"make progress as a field, we need to be able to
evaluate." We are now less likely to confuse the issue of the
evaluation by people like ourselves and the judgment of
the users, less likely to be surprised at the
discrepancies, and less likely to be surprised at the users'
acceptance of the limitations of our NL interfaces.
Also, we are far more aware of the fact that evaluations
of "worth" or "quality" have to be conducted in the
contexts of the actual, perceived needs. In extensive
studies on evaluation of innovations, Mosteller [4], the
recently retired president of AAAS, found that
"successful innovators better understand user needs; [and] pay
more attention to marketing ...." The same source,
however, leads me to the notorious difficulties of
evaluation given the wide range of evaluators and their
purposes. We are all undoubtedly convinced of the value
of NLI for the society as a whole, but the evaluation of
experiments with these interfaces is another matter.
Mosteller was faced with social, sociomedical, and
medical fields. Let me recount some of the studies he and
his team made for reasons which will soon become
obvious. His team scored a given program on a scale from
plus two to minus two with zero meaning there was
essentially no gain. Accordingly, a study of delinquent
girls that identified them but failed to prevent them
from delinquency received a zero. Likewise, a zero was
assigned to a probation experiment for conviction
for public drunkenness in which three methods were
used: (1) no treatment, (2) an alcoholic clinic, and
(3) Alcoholics Anonymous. Since the "no treatment"
group performed somewhat better, short-term referrals
were considered of no value. A minus one was given to a
study whose results were opposite to those hoped for: a
major insurance company increased outpatient benefits in
the hope of decreasing hospital costs, but the
outpatient group's hospital stays increased. Finally, a
double plus was awarded to an experiment involving the Salk
vaccine, which was, predictably, very successful. Now
this kind of evaluation may be justified when the needs
of the society are at stake. I have gone into these
details, however, for the purpose of expressing the
opinion, in which I know I'm not alone, that negative
results are as important as positive ones, that
evaluation in our case is almost equivalent to the amount of
information obtained in an experiment. An experiment
whose results would be totally predictable would be
almost useless, but one with results different from
those hoped for might be embarrassing but very valuable.
Another comment prompted by those evaluations is that
the application of any rigid, fine scale is totally
inappropriate in the case of NLI evaluations.
NLI EVALUATIONS 
A. METHODOLOGY AND SOME RESULTS 
It had been widely taken for granted some time ago that
an NLI is as good as its grammar, and a grammar is as
good as it is extensive. The specific needs of users,
the requirements of special tasks and the like took a
back seat. The nature of human discourse was yet to be
explored. Happily, we have been in a different
situation for some time. When the REL [5, 6, 7] system was
getting into a reasonably sturdy shape with respect to
speed and bugs, I started planning experiments to test
it. There was important literature about discourse,
especially in sociology, such as the work of Schegloff.
It was thus clear that successful NLI experiments had to
be based on knowledge of human discourse. It was also
clear that that was the way to make the interface more
natural. This assumption has already been fruitful:
the NL interface in POL [9], a successor to REL, has
already been extensively improved as a result of the
REL-related experiments.
Experiments were made in three modes: in addition to
face-to-face (F-F) and human-to-computer,
terminal-to-terminal (T-T) communication was examined,
since at present that is the only practical mode of
accessing the computer. Through early 1980, over 80
subjects, 80,000 words, and over 50 hours were analyzed
in great detail. In the fall of 1980, another 13
subjects were tested in the computational mode only,
adding approximately 20 hours. From the start, the
experiments were encouraging, although limited to two
modes: F-F and T-T. Interactions not only showed a great
deal of structure but extensive similarities in both
modes, the most important being the constancy of the
share of words in sentences (about 70%); the length of
sentences (about 7 words); the existence of fragments
(70% of messages in F-F and 50% in T-T containing them);
and phatics (10% of total for F-F and 5% for T-T). Thus
similarities between the modes were a candidate for
consideration in experiments in the computational mode,
the T-T mode being seemingly quite far removed from
natural F-F. The sentence having historically been the
unit of analysis (and since phatics were considered of
lesser importance from the computational viewpoint,
although of great interest in general), my attention
turned to fragments. REL allowed for three non-sentence
type structures: "NP?" (including number parsed into
NP); "all/none or number" answers; and definitions
introducible by the user which make it possible to
include individual knowledge and terminology. The
analysis of F-F and T-T protocols, however, showed the
existence of other fragment categories, finally analyzed
into a dozen categories (see [8]). Since they constitute
a considerable amount of F-F conversations and even T-T
protocols, they clearly had to be watched for in
computational experiments.
The experiments for actually observing user-system
interaction were conducted in the winter term of 1979/80
and produced 21 protocols, the analysis of which was
compared with results of eight F-F and four T-T
experiments. Another 13 computational experiments done in the
fall confirmed the results of the earlier ones. The
task in all three modes was a real one: loading cargo
onto a ship, the data coming from the actual environment
of loading U.S. navy ships by a group in San Diego,
California. In the F-F and T-T experiments, two persons
were involved -- one given cargo items to be loaded, the
other information about decks (details in [8]). In the
computational mode (H-C) the ship data was in the
computer and the list of cargo to be loaded was handed to
the subjects, all with Caltech background. Details
being available elsewhere and space limited here, only
some major results are given here. Table 1 shows the
comparison of the three modes.
TABLE 1
                          F-F            T-T            H-C
Sentence length           6.8            6.1            7.8
Message length            9.5           10.3            7.0
Fragment length           2.7            2.8            2.8
% words in sentences     68.8           72.8           89.3
% words in fragments     17.2           21.1           10.7
                        Total  Avg.    Total  Avg.    Total  Avg.
Messages                 5574   697      310    78     1093    52
Parsed & nonparsed                                     1615    77
Sentences                5302   663      385    77      882    42
Fragments                3253   402      230    58      211    10
Phatics (including
connectors & tags)       4842   605      148    37       46     2
                        Total          Total          Total
Words in messages       49800           3285           8525
Words in sentences      34266           2393           6880
Words in fragments       8584            694            823
As can be seen, several statistics show similarities:
sentence length, message length, fragment length,
percentage of words in sentences and fragments. The
closeness of the average of messages in T-T and parsed and
nonparsed inputs in H-C is striking.
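
The two percentage rows of Table 1 can be re-derived from the word totals in its last three rows. The Python sketch below does so; the dictionary layout and the function name are assumptions made for this illustration, not part of the original study. For F-F and T-T the printed percentages are reproduced with words in messages as the denominator. For H-C they are reproduced only when the denominator is words in sentences plus words in fragments; that reading is inferred from the arithmetic and is not stated in the text.

```python
# Word totals copied from the last three rows of Table 1.
WORD_TOTALS = {
    "F-F": {"messages": 49800, "sentences": 34266, "fragments": 8584},
    "T-T": {"messages": 3285, "sentences": 2393, "fragments": 694},
    "H-C": {"messages": 8525, "sentences": 6880, "fragments": 823},
}


def word_shares(totals, denominator="messages"):
    """Return (% words in sentences, % words in fragments), rounded to 0.1.

    denominator="messages" divides by all words in messages;
    denominator="sentences+fragments" divides by the sum of the two parts.
    """
    if denominator == "messages":
        base = totals["messages"]
    elif denominator == "sentences+fragments":
        base = totals["sentences"] + totals["fragments"]
    else:
        raise ValueError(f"unknown denominator: {denominator}")
    return (
        round(100 * totals["sentences"] / base, 1),
        round(100 * totals["fragments"] / base, 1),
    )
```

Calling `word_shares(WORD_TOTALS["F-F"])` and `word_shares(WORD_TOTALS["T-T"])` gives the printed F-F and T-T values, while `word_shares(WORD_TOTALS["H-C"], "sentences+fragments")` gives the printed H-C values.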
Table 2 (the meaning of abbreviations is given below the
table) deals with fragments. It is mostly
self-explanatory, as is the absence of definitions from F-F
and T-T (although some abbreviations used there fall in
this category) and the absence of some other categories
from T-T and H-C. At least two comments, however, are
necessary. The surprisingly low use of terse questions
in H-C may be accounted for by the tendency toward a
formal style in computational interaction. The
definitions used were often of quite complex character,
although far fewer than could be hoped for due
apparently to lack of familiarity with this capability.
The complex character of definitions undoubtedly had
some effect on the length of sentences in the H-C mode.
TABLE 2
         F-F            T-T            H-C
       Total     %    Total     %    Total     %
E        532  16.4       10   4.3
ADD      425  13.1       41  17.8
CORR      56   1.7
COMP      95   2.9        2    .9
SELF     114   3.5
TR       571  17.6       67  29.1
TQ       411  12.5       31  13.4
TI       297   9.1       48  20.9
FS       413  12.7       23  10.0
TRUN     339  10.4        9   3.9
DEF
P       4842            148
C       1935             34
T         31
                                        91  37.8
                                        67  27.8
                                        30  12.4
                                        53  22.0
Abbreviations
E (Echo): An exact or partial repetition of usually
the other speaker's string. Often an NP, but it
may be an elliptical structure of various forms.
ADD (Added Information): An elliptical structure,
often NP, used to clarify or complete a previous
utterance, often one's own, e.g., "It doesn't say
anything here about weight, or breaking things
down. Except for crushables.", "It's smaller.
36"x20"x17"." Spelling out words was included
here.
CORR (Correction): This may be done by either speaker.
If done by the same speaker it is related to false
start, but semantic considerations suggest a
correction, e.g., "Those are 30, uh, 48 length by
40 width by 14 height."
COMP (Completion): Completion of the other speaker's
utterance, distinguished from interruption by the
cooperative nature of the utterance, e.g., "A: I've
got a lot of... I've got B: two pages. A: Yeah."
SELF (Talking to Oneself): Mutterings, even to the
point of undecipherability, not intended for the
other person.
TR (Terse Reply): An elliptical reply, often NP,
e.g., "No.", "Probably meters.", "50 and 7.62."
TQ (Terse Question): An elliptical question, often
NP, e.g., "Why?", "How about pyrotechnics?", "Which
ones?"
TI (Terse Information): A rather elusive category,
neither question, reply nor command, an elliptical
statement but one often requiring an action.
FS (False Start): These are also abandoned
utterances, but immediately followed by usually
syntactically and semantically related ones, e.g., "They
may, they may be identical classes.", "Well, the
height, the next largest height I've got is 34."
TRUN (Truncated): An incomplete utterance, voluntarily
abandoned.
DEF (Definition): E.g., "Define: ED: each deck of the
Alamo."
P (Phatics): The largest subgroup of fragments whose
name is borrowed from Malinowski's term "phatic
communion" with which he referred to those vocal
utterances that serve to establish social relations
rather than the direct purpose of communication.
This term has been broadened to include all
fragments which help keep the channel of communication
open, such as "Well", "Wait", and even "You
turkey". Two subcategories of phatics are:
C (Dialogue Connectors): Words such as "Then",
"And", "Because" (at the beginning of a message or
utterance).
T (Tag Questions): E.g., "They're all under 60,
aren't they?"
B. SYSTEM PERFORMANCE, SYNTAX USED, SPECIAL STRATEGIES,
AND ERROR ANALYSIS
System performance can obviously be evaluated in a 
number of ways, but without good response time
meaningful experiments are impossible. When much data is
involved in processing, a delay of a few minutes can
probably be tolerated, but the vast majority of requests
should be responded to within seconds. The latter was
the case in my experiments. Fairly complex messages of
about 12 words were responded to in about 10 seconds.
The system clearly has to be reasonably free of bugs -- 
in my case, 12 bugs were hit in the total of 1615 parsed 
and nonparsed messages. The adequate extent of natural 
language syntax is impossible to determine. Table 3 
shows the syntax used by my subjects. 
TABLE 3
SENTENCE TYPES
                                              Total      %
All sentences                                   882
Simple sentences, e.g., "List the decks
of the Alamo."                                  651   73.8
Sentences with pronouns, e.g., "What is
its length?", "What is in its
pyrotechnic locker?"                             30    3.4
Sentences with quantifier(s), e.g.,
"List the class of each cargo."                  71    8.0
Sentences with conjunctions, e.g., "What
is the maximum stow height and bale
cube of the pyrotechnic locker of the
AL?"                                             88   10.0
Sentences with quantifier and
conjunction(s), e.g., "List hatch width
and hatch length of each deck of the
Alamo."                                          23    2.6
Sentences with relative clause, e.g.,
"List the ships that have water."                 6     .7
Sentences with relative clause (or
related construction) and comparator,
e.g., "List the ships with a beam less
than 1000."                                       6     .7
Sentences with quantifier and relative
clause, e.g., "List height of each
content whose class is class IV."                 2    .23
Sentences with quantifier, conjunction
and relative clause, e.g., "List length,
width and height of each content whose
class is ammunition."                             2    .23
Sentences with quantifiers and
comparator, e.g., "How many ships have
a beam greater than 1000?"                        3    .34
Wh-questions                                          75.0
Yes/no questions                                       1.0
Commands                                              19.0
Statements (data addition)                             5.0
Considering the wide range of REL syntax [7], the
paucity of complex sentences is surprising. The use of
definitions which often involved complex constructions
(relative clauses, conjunctions, even quantifiers) had a
definite influence. So did, undoubtedly, the task
situation causing optimization of work methods. The
influence of the specific nature of the task would
require additional studies, but the special device
provided by the system (a loading prompt sequence -- which
was not analyzed) was employed by every subject.
Devices such as these obviously are a great aid in
accomplishing tasks. They should be tested extensively to
determine how they can augment the naturalness of NLIs.
Other reasons for the relatively simple syntax used were
special strategies: paraphrasing into simpler syntax
even though a sentence did not parse for other reasons;
"success strategy" resulting in repetitious simple
sentences; or possibly just "baby talk" due to the
suspicion of the computer's limitations.
An interesting fact to note is that similar results with
respect to syntax were obtained in the experiments with
USL, the "sister system" of REL developed by IBM
Heidelberg [10] -- with German used as NL in two studies of
high school students: predominance of wh-questions (317
in total of 451); not many relative clauses (66);
commands (35); conjunctions (26); quantifiers (15);
definitions (11); comparisons (2); yes/no questions (1).
An evaluation which would not include an analysis of
unparsed input would at best be of limited value. It
was shown in Table 1 that 1093 out of 1615 or about two
thirds were parsed in my experiments.
TABLE 4
                     Total      %
Vocabulary             161   36.1
Punctuation             72   16.1
Syntax                  62   13.9
Spelling                61   13.6
Transmission            32    7.2
Definition format       30    6.7
Lack of response        16    3.6
Bugs                    12    2.7
Table 4 summarizes the categories of errors. The
predominance of vocabulary is not surprising, but the
relatively small number of syntactic errors is. In part
this may be due to the method of scoring in which errors
were counted only once, so if a sentence contained an
unknown vocabulary item (e.g. "On what decks of the Alamo
cargo be stored?") but would have failed on syntactic
grounds as well, it would fall in the vocabulary
category. A comparison can be made here with Damerau's
study [11] of the use of the TQA system by the city
planning department in White Plains, at least with
regard to the total of queries to those completed: 788
to 513. So, again, roughly two thirds were parsed. In
other categories "parsing failure" is 147, "lookup
failures" 119, "nothing in data base" 61, "program
error" 39, but this only points to the general
difficulties of comparisons of system performance.
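
The "roughly two thirds" figures and the percentage column of Table 4 follow directly from the counts quoted above. The Python sketch below recomputes them; the constant and function names are assumptions made for this illustration.

```python
# Counts quoted in the text and in Table 4.
REL_PARSED, REL_TOTAL = 1093, 1615     # parsed vs. parsed + nonparsed (H-C)
TQA_COMPLETED, TQA_TOTAL = 513, 788    # completed vs. total queries [11]

ERROR_COUNTS = {
    "Vocabulary": 161,
    "Punctuation": 72,
    "Syntax": 62,
    "Spelling": 61,
    "Transmission": 32,
    "Definition format": 30,
    "Lack of response": 16,
    "Bugs": 12,
}


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


def error_shares(counts):
    """Share of each error category in the total number of scored errors."""
    total = sum(counts.values())
    return {name: percent(n, total) for name, n in counts.items()}
```

Here `percent(REL_PARSED, REL_TOTAL)` gives 67.7 and `percent(TQA_COMPLETED, TQA_TOTAL)` gives 65.1, both close to two thirds. The recomputed shares agree with Table 4 except for Spelling, which rounds to 13.7 against the printed 13.6; the printed value may have been truncated rather than rounded.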
SOME CONCLUSIONS 
Norm Sondheimer suggested some questions we might try to
answer. What has been learned about user needs? What
are the most important linguistic phenomena to allow
for? What other kinds of interactions? Error analysis
points in the obvious directions of user needs, and so
do the types of sentences employed. While it is
justified to quit the search for an almost perfect
grammar, it would be a mistake to constrain it to the
constructions used. Improved naturalness can be achieved
with diagnostics, definitions, and devices geared to
specific tasks such as special prompting sequences. Some
tasks clearly require math in the NLI. How good are
systems? An objective measurement is probably
impossible, but the percentage of requests processed
might give some idea. In the case of a task situation
such as loading cargo items, the percentage of task
completion may signal both system performance and user
satisfaction. System response times are a very important
measure. The questionnaire method can be and has been
used (in the case of MT and USL), but as yet there is
too little experience to measure user satisfaction.
Users seem very good at adapting to systems. They
paraphrase, use success strategy, simplify syntax, use
special devices -- what they really do is maximize their
performance with respect to a given task.
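
The three measures named above (percentage of requests processed, percentage of task completion, and response time) could be computed from a simple interaction log. The Python sketch below is purely illustrative: the Request record, its fields, the choice of the median as the response-time summary, and the sample data are all assumptions made for the example and do not come from the REL experiments.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One user input in a hypothetical session log."""
    text: str
    parsed: bool             # did the system process the request?
    response_seconds: float  # time until the system answered


def percent_processed(log):
    """Percentage of requests the system was able to process."""
    return 100 * sum(r.parsed for r in log) / len(log)


def percent_task_completed(items_done, items_assigned):
    """Percentage of assigned task items (e.g. cargo items) completed."""
    return 100 * items_done / items_assigned


def median_response_seconds(log):
    """Median response time over processed requests only."""
    return median(r.response_seconds for r in log if r.parsed)


# Invented sample data, for illustration only.
sample_log = [
    Request("List the decks of the Alamo.", True, 4.0),
    Request("What is its length?", True, 6.0),
    Request("Wher is the locker?", False, 2.0),
    Request("List the class of each cargo.", True, 8.0),
]
```

None of these numbers is an objective measure of quality on its own; as the text argues, they only give some idea of system performance and, indirectly, of user satisfaction.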
What have we learned about running evaluations? It is
important to know what to look for, therefore the need
for good knowledge of human-to-human discourse. Good
system response times are a sine qua non. Controlled
experiments have the advantage of being replicable, a
crucial factor in arriving at evaluation criteria.
Determining user bias and experience may be important,
but even more so is user training. Controlled
experiments can show what methods are most effective (e.g. a
manual or study of protocols?). Study of user comments
-- phatic material -- gives some measure of user
(dis)satisfaction (I have seen "You lie," but I have yet
to see "Good boy, you!"). Clearly, the best indication
of user satisfaction is whether he or she uses the
system again. Extensive long-term studies are needed for
that.
What should the future look like? Task oriented
situations seem to be a promising environment for NLI. The
standards of NL systems performance will be set by the
users. Future evaluations? As Antoine de Saint-Exupéry
wrote, "As for the Future, your task is not to foresee,
but to enable it."
REFERENCES 
1. Harris, Larry R. "Prospects of Practical Natural
Language Systems." Proceedings of the 18th Annual
Meeting of the Association for Computational
Linguistics, June 1980, p. 129.
2. Henisz-Dostert, B.; Macdonald, R. R.; and Zarechnak,
M. Machine Translation. The Hague: Mouton,
1979.
3. Sondheimer, N. K. "Evaluation of Natural Language
Interfaces to Data Base Systems." Proceedings of
the 19th Annual Meeting of the Association for
Computational Linguistics, June 1981.
4. Mosteller, F. "Innovation and Evaluation." Science
(February 27, 1981): 881-886.
5. Thompson, F. B. and Thompson, Bozena H. "Practical
Natural Language Processing: The REL System as
Prototype." In Advances in Computers, ed. M.
Rubinoff and M. C. Yovits. Vol. 13. New York:
Academic Press, 1975.
6. Thompson, Bozena H. and Thompson, F. B. "Rapidly
Extendable Natural Language." Proceedings of the
1978 National Conference of the ACM, pp. 173-182.
7. Thompson, Bozena H. REL English for the User.
Pasadena: California Institute of Technology, 1978.
8. Thompson, Bozena H. "Linguistic Analysis of
Natural Language Communication with Computers."
COLING 80: Proceedings of the 8th International
Conference on Computational Linguistics, Tokyo,
October 1980, pp. 190-201.
9. Thompson, Bozena H. and Thompson, F. B. "Shifting
to a Higher Gear in a Natural Language System."
Proceedings of the National Computer Conference,
May 1981.
10. Lehmann, Hubert; Ott, Nikolaus; Zoeppritz,
Magdalena. "User Experiments with Natural Language
for Data Base Access." COLING 78: Proceedings of
the 7th International Conference on Computational
Linguistics. Bergen, August 1978.
11. Damerau, Fred J. The Transformational Question
Answering (TQA) System: Operational Statistics -
1978. RC 7739. Yorktown Heights: IBM T. J. Watson
Research Center, June 1979.
