SELECTIVE PLANNING OF INTERFACE EVALUATION~ 
William C. Mann 
USC Information Sciences Institute 
1 The Scope of Evaluations 
The basic ides behind evaluation is 8 simple one: An object is 
produced and then subjected to trials of its I~trformance. Observing the 
trials revesJs things about the character of the object, and reasoning 
about those observations leads tO stJ=tements about the "value" of the 
object, a collection of such statements bein.3 &n "evaluation." An 
evaluation thus dlffe~ from a description, a critique or an estimate. 
For our purl:)oses here, the object is a database system with a natural 
language interface for users. Ideally. the trials are an instrumented 
variant of normal uSage. The character of the users, their tasks, the 
data, and so forth are reDreeentative of the intended use of the system. 
In thinking about evaluations we need to be clear about the intended 
scope. Is it the whole system that is to be evaluated, or just the natural 
language interface portion, or pos~bly both? The decision is crucial for 
planning the evaluation and understanding the results. As we will see. 
choice of the whole system as the scope of evaluation leads t O ver~ 
different designs than the choice of the interface module. It is unlikely 
that an evaluation which is supposed to cover both scopes will cover 
both well. 
2 Different Plans for Different Consumers 
We can't expect a single form or method of evaluation to be suitable for 
all uses. In planning to evaluate (or not to evaluate) it heil~ a great deal 
to identify the potential usor of the evaluation. 
There are some obvious prlncipis¢ 
1. If we can't identify the consumer of the evaluation, don't 
evaluate. 
2. If something other than sn evaluation meets the 
consumer's needs better, plan tO use it instearl. 
Who are the potential consumers? Clearly they ate not the same as the 
sDonsors, who have often lost interest by the time an evaluation is 
timely. Instead, they are: 
1. Organizations that Might Use the System ..- These 
consumers need a good overview of what the system can 
do. Their evaluation must be hotistic, not an evaluation of a 
module or of particular techniqueS, They need informal 
information, and possibly a formal system evaluation as 
well. 
However, they may do beet with no evaluation at all. 
Communication theorists point out that there has never 
been s comprehensive effectivenees study of the 
telephone. Telephone service is sold without such 
evaluations. 
2. Public Observers of the Art ..." ScienOata and the 
general public alike have shown a great intermit in AI, and a 
legitimate concern over its social effects- The interest is 
especially great in natural language precepting. However, 
neatly all of them are like obsorvem of the recent space 
shuttle: They can understand liftoff, landing and some of 
the discus=dons of the heat of re(retry, but the critical details 
are completely out of reach. Rather than carefully 
controlled evaluations, the public needs competent and 
honest interpretations of the action. 
3. The Implementers' Egos --. Human self-acceptance and 
enjoyment of life are worthwhile goals, even for system 
designers and iml=lementers, We aJl have e~o needs. The 
trouble with using evaluations to meet them is that they can 
give only too little, too late. Praise and encouragement 
aJong the way would be not only more timely, but more 
efficient. Implementers who plan an evaluation as their 
vindication or grand demonstration will almost surely be 
frustrated. The evaluation can serve them no better than 
receiving an academic degree serves a student. If the 
process of getting it hasn't been enjoyable, the final 
certification won't helD. 
4. The Cultural Imperative ... There may be no potential 
consumers of the evaluation at all, but the scientific 
subculture may require one anyway. We seem to have 
asCenDed this one far more successfully than some fields of 
psychology, but we should Still avoid evaluations performed 
out of social habit. Otherwise We will have something like a 
school graduation, a big. eiaJoorete, exbenalve NO,OP. 
5. The Fixers -°- These I:~ople, almost inevitably some of 
the implementers, are interested in tuning up the system to 
meet the needs of real usem. They must move from the 
implementation environment, driven by expectation and 
intuition, to a more taoistic world in which those 
expectations are at least vulnerable. 
Such Customers cannot be served by the sort of broad 
holistic performance test the" may serve the public or the 
organization that is about to acquire the system. Instead, 
they need detailed, specific exercises of the sort that will 
support a causal model of how the system really functions. 
The best sort of evaluation will function as a tutor, providing 
lots of ¢oecifi¢, well distributed, detailed information. 
6. The Research and Developmeht Community ... 
These are the AI and system development Deople from 
outside of the project. They are like the engineers for Ford 
who test Dstsuns on the track. Like the implementerso they 
need dch detail to support causal models. Simple, ho(iStic 
evaluations are entirely inadequate. 
7. The Inspector --- There is another model of how 
evaluations function. Its premises differ grossly from those 
u~d adore. In this model, the results of the evaluation, 
whatever they are, can be discarded because they have 
nothing tO do with the real effects. The effects come from 
the threat of an evaluation, and they are like the threat 
of a military inspection. All of the valuable effects are 
complete before the ins~oection takes piece. 
Of course, in s mature and stable culture, the insl:~cted 
learns to know what to expect, and the parties cart 
develop the game to a high state of irrelevance. Perhaps in 
AI the ins~Cter could still do some good. 
33 
t" 
Both the imptemantere and the researchers need a special kind of test. 
and for the same reeson: to support deaign, l The value of 
evaluations for them is in its influence on future design activity. 
There are two interesting psttems in the observations above. The first 
is on the differing needs of "insiders" and "outsiders." 
• The "outsiders" (public observers, potential 
organi;r.ations) need evaluations of the entire system, in 
relatively simple terms, well supplemented by informal 
interpretation and demonstration. 
• The "insiders," researcher~ in the same field, fixers and 
implementera, need complex, detailed evaluations that lead 
to many separate insights about the system at hand• They 
are much more ready to cope with such complexity, and the 
value of their evaluation de~enas on having it. 
These neede are so different, and their characteristics so contradictor./. 
that we should expect that to serve both neeOs would require bNO 
different evaluations. 
The second pattsm concerns relative benefits• The benefits of 
evaluations for "insiders" are immediate, tangible and hard to obtain in 
any other way. They are potentially of great value, especially in 
directing design. 
In contrast, the benefits of evaluations to "outsiders" are tenuous and 
arguable. The option of performing an evaluation is often dominated by 
better methods and the option of not evaluating is sometimes attractive. 
The significance of this contrast is this: 
SYSTEM EVALUATION BENEFITS PRINCIPALLY 
THOSE WHO ARE WITHIN THE SYSTEM DEVELOPMENT 
FIELD: iMPt.EMENTERS, RESEARCHERS, SYSTEM 
DESIGNERS AND OTHER MEMBERS OF THE 
TECHNICAL COMMUNITY. 2 
It seems oiovious that evaluationa should therefore be planned 
Dnncipally for this community• 
As a result, the outcomes of evalustione tend to be ex~'emely 
conditional. The most defensible con¢luaione are the most conditional- 
• they say "This is what happena with these u~4, these questions, this 
much system load..." Since those conditions will never cooccur again, 
such results are rather useless. 
The key to doing better is in creating results which can be generalizsd. 
Evaluation plans are in tension between the possibility of creating highly 
credible but insignificant results on one hand and the I=osalbiUty of 
creating broad, general results without a credible amount of Support on 
the other. 
f know no general solution to the problem of making evaluation results 
ganeraliza/Die and significant. We can observe what others have done, 
even in this book, and proceed in a case by case manner. Focusing our 
attention on results for design will halb. 
Design proceeds from causal models of its subieot matter. Evaluation 
results should therefora be interpreted in cesual mode. There is a 
tendency, particularly when statistical results are involved, to avoid 
causal interpretations. This comes in ~ from the view that it is part of 
the nature of statistical models to not supbort causal intor~retetions. 
Avoiding causal interpretation is formally defensible, but entirely 
inappropriate. If the evaluation is to have effects and value, causal 
interpretationa will be made• They are inevitable in the normal course of 
successful activity. They must be made, and so these interpret,=tions 
should be made by those best qualified to do so. 
Who should make me first causal interpretation of an e~tmtion? Not 
the consumers of the evaluation, but the evaluetors themselves. They 
are in the best position tO do so, and the act of stating the interDrets~on 
ia a kind of che~ on its plal~libility. 
By identifying the consumer, focumn 0 on consequences for dui~n, and 
providing causal interpretabons of r~its, we can crest,, v,,,usiole 
evaluations. 
3 The Kay Problem: Generalization 
We have already noticed that evaluations can become very complex, 
with both good and bad effects. The complexity comes from the tssk: 
Useful systems are complex, the knowledge they contain is complex, 
users are complex and natural language is complex. Beyond all thaL 
planning • test from which reliable conclusions can be drawn is itself a 
comptex matter. 
l~n the face of so much complexity, it is hoDelees to try to soan the full 
range of the phenomena of interest. One must sample in a many. 
dimensional sO=ace, hoping to focus attention where conclusions are 
both ac, cesalble anG ,significant. 
II~mgn hire. -,, m mo~ ~ ¢ons~m almost entirety of recleB~n. 
2Th,q is no( to say that ~e anl not le~timate, important neecls anmng 
"ou~ecl'. Son~mn@ musZ select lmon O commmcmlly offered am~cs¢ CXOCum new 
¢o~or sy.Jcems and so form. U~or~k'un4mtecy. me imvaiim ~mation lec~mgy dole 
nm e~m mmoteht sa~-oach • meth~l~ogy lot msetm 0 such ~ For ezamQle, 
is nothing com~IrlOCe to c43m1~r i0ef~cnmlrkin 9 methods for intm~cl~wl natuttl 
languag(l im~/lu:R. It is not thM "ou1~m~der~" don't hlve imoortant needs: rlm~r, vm anl 
~any ~Wi~e= to n~m m41~ nml~l. 
34 
