A Formal Basis for Performance Evaluation 
of Natural Language Understanding Systems 
Giovanni Guida 1 and Giancarlo Mauri 2 
Istituto di Matematica, Informatica e Sistemistica 
Università di Udine
Udine, Italy 
The task of evaluating the performance of a natural language understanding system, 
despite its largely recognized relevance, is still poorly defined. It mostly relies on intuitive 
reasoning and lacks a sound theoretical foundation. This paper sets a formal and quantita- 
tive proposal for this task. In particular, a measure of performance that allows the basic 
input-output characteristics of a system to be evaluated is introduced first at an abstract 
level. The definition of concrete measures is then obtained by assigning actual values to 
the functional parameters of the abstract definition; some particular cases are shown and 
discussed in detail. Finally, the task of measuring performance in practice is considered, 
and a model for experimental performance evaluation is presented. Comparison with 
related works is also briefly discussed; open problems and promising directions for future 
research are outlined. A limited case study experimentation with the model proposed is 
presented in the appendix. 
1. Introduction 
Research on natural language processing has recently
been characterized by the design and implementation of a
number of experimental systems. Recent survey re- 
ports (Waltz 1977, Kaplan 1982) mention more than 
one hundred items among the most successful and 
relevant systems in the classical application fields of 
data base inquiry, machine translation, question an- 
swering, and man-machine interfacing. 
This trend is not surprising in the context of research whose specific aim is to provide automated tools for understanding or translating natural languages; but it is evident even in natural language research with a more theoretical flavour. The
successful construction of a good performing system is 
1 Address: 
Prof. Giovanni Guida 
Dipartimento di Elettronica 
Politecnico di Milano 
P.zza Leonardo da Vinci, 32 
I-20133 MILANO, Italy
Also with Milan Polytechnic Artificial Intelligence Project, Milano, 
Italy. 
2 Also with Istituto di Cibernetica, Università di Milano, Milano, Italy.
in fact often considered as the most evident proof of 
the validity of a theory, and, therefore, designing running systems is routine, and sometimes even the specific goal, of several researchers.
The task of evaluating the performance of a given 
system and that of comparing the behaviour of differ- 
ent systems appears, therefore, to be a fundamental 
issue. Despite its largely recognized relevance (Woods
1977, Tennant 1980), measuring the performance of a 
system for natural language processing is still poorly 
defined. It mostly relies on intuitive reasoning and 
lacks a sound theoretical foundation. As Tennant 
clearly points out (1980), there is a nearly complete 
absence of meaningful evaluation in current natural 
language processing research. This leaves several cru- 
cial questions unanswered: 
• What is the relevance and value of the obtained results?
• How general are the proposed solutions? 
• How do they compare with other proposals? 
• What problems are still open? 
• What directions have to be followed? 
• What issues are to be faced in the progress of the 
research? 
Copyright 1984 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted 
provided that the copies are not made for direct commercial advantage and the CL reference and this copyright notice are included on the 
first page. To copy otherwise, or to republish, requires a fee and/or specific permission. 
0362-613X/84/030015-16$03.00
Computational Linguistics, Volume 10, Number 1, January-March 1984 15 
The lack of evaluation constitutes a serious obsta- 
cle to the development of a sound technology in natu- 
ral language processing. 
The purpose of this paper is to provide a formal 
and quantitative model for the performance evaluation 
task. In particular, we give a formal definition of 
"understanding power", and we propose some techni- 
ques for measuring this feature in practice. Our pro- 
posal is based on several assumptions we discuss be- 
low. 
First, we assume as object of our attention only 
that module of a natural language system that is devot- 
ed to understanding natural language, that is, to map- 
ping input expressions into formal internal representa- 
tions. This can clearly include several kinds of proc- 
essing activities, such as linguistic analysis, reasoning, 
inferencing, etc.; but must have as ultimate goal the 
construction of a correct internal representation, not 
the production of any type of service to the end user 
of the natural language system. Thus, for example, a 
question answering system (Tennant 1979) does not 
belong to the class of natural language understanding 
systems that concern us; instead, it is the natural lan- 
guage interface it contains that meets exactly our re- 
quirements. 
Second, we assume the following naive notion of 
performance: the extent to which a system is able to 
correctly understand natural language expressions in a 
given application domain. The resources needed by 
the system to accomplish its task are irrelevant in this 
case. In other words, we want to capture and measure 
the "power" of the system, in terms of how much and 
how well it is capable of understanding, not its
"efficiency", that is, how much it costs (for example, in terms of time and memory requirements) to understand what it is capable of understanding.
Third, we want to define a measure of performance 
that allows the evaluation of the input-output charac- 
teristics of a particular system in a given domain. This 
kind of measure is clearly inappropriate for revealing and
testing features, such as the power of a model as opposed
to that of a particular implementation of it, the appli- 
cability of the model to other domains, its extensibili- 
ty, etc., which are more closely related to the internal 
structure and mode of operation of a system, rather 
than to its input-output behaviour. The goal of evalu- 
ating such more general properties, worked on by Ten- 
nant (1980) through the method of abstract analysis 
(mainly based on taxonomies of conceptual, linguistic, 
and implementational issues), is not considered in this 
work. 
This paper is organized in the following way. In 
section 2 we discuss in an intuitive, yet precise, way 
the basic concepts involved in the performance evalua- 
tion problem, in order to have a sufficiently clear 
specification of what we want to formalize. Then, in 
section 3, we give an abstract definition of the formal 
model, and in section 4 we discuss some actual cases 
of particular interest. Section 5 presents some techni- 
ques that could be used to measure in practice the 
performance of a natural language understanding sys- 
tem. In section 6 we discuss some concluding re- 
marks, and present open problems and promising top- 
ics for future research. A limited case study experi- 
mentation with the model proposed is presented in the 
appendix. 
2. Basic Definitions and Statement of the 
Problem 
Let us introduce some background definitions needed 
to clearly state the problem of performance evaluation, 
as discussed in this work. The model of natural lan- 
guage understanding we are going to define is so con- 
ceived as to include only those very few features that 
are relevant for the purpose of performance evaluation 
and is strictly tailored to this particular goal. 
Let an expression of a natural language be any fi- 
nite sequence of legal words and punctuation marks 
from the given language. Let A be the set of all ex- 
pressions of a natural language. 
Note that the above definition is very loose and 
does not take into account the structure of the expres- 
sions. So an expression can be a sentence, a dialogue, 
a meaningless sequence of words, the whole content of 
a book, or just a single word. Introducing a more 
definite notion of expression is not necessary at this 
point for our purpose of stating the problem of per- 
formance evaluation. 
Although the above definition includes expressions of arbitrary (finite) length, so that A contains infinitely many expressions, in a more pragmatic approach
the length of existing expressions of a natural language 
at a given moment of its history has an upper bound. 
Therefore, it makes sense to restrict our attention to a 
finite subset E of A, containing all expressions of 
length less than or equal to an appropriately fixed 
integer n. 
Let L be the set of all meaningful expressions of a 
natural language, that is, of all expressions to which 
humans attach a meaning. Note that L is defined on a 
purely semantic basis, so that expressions of L do not 
have to be syntactically correct with respect to any 
fixed syntax, and that, generally, more than one mean- 
ing may be attached to the same expression, that is, 
expressions are not required to be univocal. 
Let S be the set of all possible meanings that can 
be attached to expressions of E. 
We do not face here the problems of what S actual- 
ly contains or of how S could be represented explicitly 
(which mostly pertain to cognitive psychology); let us 
assume S merely as that basic datum, shared by all 
humans speaking a given language, which allows effec- 
tive interpersonal communication. 
We call the semantics of a natural language the total function f: E → 2^S (into 2^S), which associates to each expression of E the set of all its possible meanings.
Clearly the function f can be computed by any 
person who can understand perfectly the natural lan- 
guage to which the expressions of E belong (theoreti- 
cal problems concerning subjective interpretation and 
disagreement between different people are not consid- 
ered here). 
Moreover, f(e) = ∅ denotes that no meaning is associated to the expression e, and hence e ∈ L iff f(e) ≠ ∅.
Each expression e ∈ E such that |f(e)| ≤ 1 is called a univocal expression.
Let now D be a nonempty subset of S that contains 
meanings all related to a unique subject ("what we are 
speaking of", "the topic of the discourse", "the con- 
ceptual competence of a natural language understand- 
ing system"); we call D a domain. 
Let f_D be the restriction of f to D, defined as:

f_D(e) = f(e) ∩ D, for any e ∈ E.

Let L_D = E − f_D⁻¹(∅) be the restriction of L to D.
It is obvious that L_D ⊆ L ⊆ E.
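These definitions can be made concrete with a small sketch. The miniature expression set, meanings, and domain below are invented purely for illustration and are not drawn from the paper:

```python
# Toy model of the sets E, L, L_D and the functions f and f_D.
# All expressions, meanings, and the domain D are invented examples.

# E: a finite set of expressions (here, short strings).
E = {"open valve", "close valve", "valve", "blue quickly"}

# f: E -> 2^S, the semantics: each expression maps to the set of its
# possible meanings; the empty set marks a meaningless expression.
f = {
    "open valve":   {"cmd_open"},
    "close valve":  {"cmd_close"},
    "valve":        {"cmd_open", "device_valve"},  # ambiguous expression
    "blue quickly": set(),                         # meaningless expression
}

# L: the meaningful expressions, i.e. those with at least one meaning.
L = {e for e in E if f[e]}

# D: a domain, a subset of the meaning space S (here: valve commands).
D = {"cmd_open", "cmd_close"}

# f_D(e) = f(e) ∩ D, the restriction of the semantics to D.
def f_D(e):
    return f[e] & D

# L_D = E − f_D⁻¹(∅): the expressions with at least one meaning in D.
L_D = {e for e in E if f_D(e)}

assert L_D <= L <= E   # L_D ⊆ L ⊆ E, as observed in the text
```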
Let us now try to formalize the concept of natural 
language understanding system. 
The main problem is that of giving a formal representation to the informally defined domain D. To this purpose, we take a finite set of symbols B, called the alphabet, and then we construct a set R of sequences of arbitrary finite length over B (that is, R ⊆ B*), in such a way that to every element d ∈ D an element of R, r = h_D(d), is associated by a bi-univocal function h_D. The sequence r = h_D(d) is called the representation of d, while the set R is called a representation language for D.
Obviously, the inverse map h_D⁻¹ is a total function h_D⁻¹: R → D, which associates to every sequence of R its informal meaning in D. Both h_D and h_D⁻¹ are known to humans, in the sense that they are able to compute them.
We are now able to formalize the naive notion of 
natural language understanding system in the following 
way. 
Let D ⊆ S be a domain and R a representation language for D. A natural language understanding system U_{R/D} in R on D is an algorithm that computes a total function g^U_{R/D}: E → 2^R ∪ {⊥} (into 2^R ∪ {⊥}), where ⊥ is called the undefined symbol. g^U_{R/D}(e) = ⊥ denotes that U is unable to assign a meaning to the expression e, that is, that it fails in computing g^U_{R/D}(e) (not that e has no meaning in the domain D!).
Note that in the above definition we have assumed that a system U_{R/D} should accept as input not only expressions of L_D but, generally, all expressions of E.
The reason for this choice is that a basic feature of 
natural language understanding is also to recognize 
that some expressions are meaningless (they belong to 
E − L) or are in no way related to a given domain D
(they are in L − L_D). Clearly, this feature is often less
important than the capability of correctly understand- 
ing expressions of LD, but this can be appropriately 
taken into account when defining a measure of per- 
formance. 
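The totality of g^U_{R/D} over E, including the undefined symbol ⊥, can be sketched as follows; the tiny lookup table and representation language are hypothetical stand-ins for a real understanding algorithm:

```python
# Minimal sketch of an understanding system U_{R/D} as a total function
# g: E -> 2^R ∪ {⊥}. The table, R = {"OPEN", "CLOSE"}, and all inputs
# are invented for illustration.

BOTTOM = "⊥"   # the undefined symbol: U fails on this input

def g(e):
    """A toy understanding algorithm over R = {"OPEN", "CLOSE"}."""
    table = {
        "open valve":   {"OPEN"},
        "close valve":  {"CLOSE"},
        "blue quickly": set(),   # U recognizes e as having no meaning in D
    }
    # g is total: on any expression outside the table, U fails (⊥)
    # rather than guessing a meaning.
    return table.get(e, BOTTOM)

assert g("open valve") == {"OPEN"}
assert g("blue quickly") == set()   # "no meaning", distinct from failure
assert g("valve") == BOTTOM         # failure, not "no meaning in D"
```

Note the distinction the text insists on: returning the empty set asserts that e has no meaning in D, while ⊥ only reports that the system failed.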
Measuring the performance of a natural language understanding system U_{R/D} may now be defined as evaluating how well U_{R/D} is capable of explicitly representing in R the meaning of expressions of E.
To define such a notion in quantitative terms, we can first extend the bi-univocal function h_D: D → R to the function (bi-univocal if ⊥ is not considered)

h̄_D: 2^D → 2^R ∪ {⊥},

defined by:

h̄_D(x) = ∪_{d∈x} {h_D(d)}, for x ∈ 2^D.

Figure 1 illustrates the definitions of the functions f, f_D, h_D, h̄_D, and g^U_{R/D} presented above.
Considering now the three functions f_D, h̄_D, and g^U_{R/D} defined above, if we denote ḡ_D = h̄_D ∘ f_D, the performance of U_{R/D} can then be expressed as the degree of precision to which g^U_{R/D} approaches ḡ_D over E.
This task raises, however, some difficult problems. 
Two basic questions are:
(i) how to define the "difference" between g^U_{R/D} and ḡ_D over E in such a way as to match the intuitive notion of performance;
(ii) how to measure such a "difference" in practice, that is, through an effective experimental procedure.
[Figure 1. Relationships between the functions f, f_D, h_D, h̄_D, and g^U_{R/D}.]
Both of these problems are discussed in the follow- 
ing sections (the former in sections 3 and 4, and the 
latter in section 5). 
3. A Theoretical Framework 
Before tackling the core topic of this section in a for- 
mal way, let us examine from an intuitive point of 
view the basic requirements for a measure of perform- 
ance π to be reasonably acceptable. The primary goal is that it should allow consistent comparison among different systems, in the sense that if π(U_1) = π(U_2) the behaviour of the two systems U_1 and U_2 should be sufficiently similar, and that if π(U_1) > π(U_2), U_1 should perform better than U_2.
Furthermore, this comparison should be as fine and
precise as possible, in such a way as to capture all the
essential features of the behaviour of a system U in a 
given domain. 
Finally, comparison might be between two different 
systems, between two versions of the same system, 
between a system and a given set of issues, or between 
a system and an independent scale (Tennant 1980). 
To capture the intuitive notion of performance 
according to the above requirements, at least two 
points of view seem worth considering. First, a meas- 
ure of performance should give a numerical value for the "distance" between the two functions g^U_{R/D} and ḡ_D, that is, the measure should allow us to formalize how nearly g^U_{R/D}(e) approaches ḡ_D(e) for any e ∈ E, or, more explicitly, how well each expression e ∈ E is un-
derstood by the system U. Second, it should weight 
this notion of "distance" in such a way as to take into 
account the fact that, generally, it is not equally important to understand well every expression in E; for
example, it could be reasonable to suppose that correct 
understanding of expressions in L D is far more rele- 
vant than in E--LD, or that correct understanding is 
more important for frequently used expressions than 
for unusual and rare ones. 
According to the above remarks, an appropriate notion of performance π will depend on two basic parameters:
(i) the shifting μ between g^U_{R/D}(e) and ḡ_D(e) for any e ∈ E;
(ii) the importance ρ for any expression e ∈ E to be correctly understood.
Different choices of μ and ρ clearly provide different notions of performance, π[μ,ρ], that fit different needs for capturing particular classes of features in a natural language understanding system.
Let us now go further in defining an appropriate 
formal framework embedding the above ideas. In the following, we shall omit in f_D, g^U_{R/D}, and ḡ_D the superscript U and the subscripts R/D and D whenever this will not cause ambiguities.
Let R be a representation language for a domain D ⊆ S. A shifting function μ on R is a function

μ: (2^R ∪ {⊥}) × 2^R → [0,1],

such that:
- for each pair (r,r'), μ(r,r') = 0 iff r = r';
- there exists a pair (r,r') such that μ(r,r') = 1.
From an intuitive point of view, μ(g(e), ḡ(e)) represents the "difference" between the (set of) meaning(s) of e computed by a natural language understanding system U, which is expressed by g(e), and its correct (set of) meaning(s) ḡ(e). Hence, the value μ(g(e), ḡ(e)) = 0 denotes perfect understanding of e, while μ(g(e), ḡ(e)) = 1 denotes the worst case of misunderstanding of e.
Given the set E of all expressions of a natural language of length less than or equal to an appropriately fixed integer n, an importance function ρ on E is a function

ρ: E → [0,1].
Intuitively, ρ(e) represents the importance that the meaning of e is correctly understood by the system U. The value ρ(e) = 0 denotes that it is not at all important whether e is understood correctly or incorrectly; values of ρ(e) greater than 0 denote greater importance for e to be understood correctly. Given a shifting function μ on R and an importance function ρ on E, a performance measure π for natural language understanding systems U_{R/D} is the function

π[μ,ρ]: {U_{R/D}} → [0,1],

defined by:

π[μ,ρ](U_{R/D}) = ( Σ_{e∈E} μ(g^U_{R/D}(e), ḡ_D(e)) · ρ(e) ) / ( Σ_{e∈E} ρ(e) ).
Clearly, π ranges from the value 0, in the case where all expressions of E are correctly understood, to the value 1, in the case where all expressions are completely (that is, in the worst manner) misunderstood,
independently of the choice of ρ (of course, ρ ≡ 0 is not allowed, being meaningless).
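The definition of π[μ,ρ] can be computed directly on a small finite E. The example below uses the boolean shifting and constant importance discussed in section 4; all data are invented toy values:

```python
# Direct computation of π[μ,ρ](U): a ρ-weighted average of the
# shifting μ(g(e), ḡ(e)) over E, following the definition in the text.
# The expressions and meaning sets below are invented toy values.

def pi(E, mu, rho, g, g_bar):
    num = sum(mu(g(e), g_bar(e)) * rho(e) for e in E)
    den = sum(rho(e) for e in E)
    return num / den            # ρ ≡ 0 is excluded by definition

# Boolean shifting μ1 and constant importance ρ(e) = 1.
mu1 = lambda r1, r2: 0 if r1 == r2 else 1
rho1 = lambda e: 1

E = ["a", "b", "c", "d"]
g_bar = {"a": {"A"}, "b": {"B"}, "c": {"C"}, "d": {"D"}}.get  # correct
g     = {"a": {"A"}, "b": {"B"}, "c": {"X"}, "d": "⊥"}.get    # system U

score = pi(E, mu1, rho1, g, g_bar)   # 2 of 4 expressions misunderstood
assert score == 0.5
```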
π[μ,ρ] provides a very synthetic representation of the performance of U that can be useful in several cases of evaluation and comparison. A richer and more informative picture of the performance of a system U, fully coherent with the above definitions, can be obtained in the following way for the cases where the ranges of μ and ρ are finite. For given shifting μ and importance ρ, let range(μ) = {δ_1, ..., δ_n} and range(ρ) = {ω_1, ..., ω_m}. Then we pose:

E_{i,j} = {e | μ(g(e), ḡ(e)) = δ_i and ρ(e) = ω_j},

for i ∈ {1, ..., n} and j ∈ {1, ..., m}.
Clearly, ∪ E_{i,j} = E and all E_{i,j} are pairwise disjoint. Therefore, {E_{i,j}} is a partitioning of E.
Now let:

p_{i,j} = |E_{i,j}| / |E|,

for i ∈ {1, ..., n} and j ∈ {1, ..., m}. (We recall that E has been assumed to be finite, and hence so is E_{i,j} ⊆ E.)
The n × m matrix [p_{i,j}] is called the μ-ρ-profile of U and provides a far more informative representation of the performance of U than the value π[μ,ρ]. In fact, [p_{i,j}] allows one to discover and analyse the specific features of the system, going beyond the global value π[μ,ρ].
The relation between [p_{i,j}] and π[μ,ρ] is straightforward:

π[μ,ρ] = ( Σ_{i=1}^{n} Σ_{j=1}^{m} p_{i,j} · δ_i · ω_j ) · |E| / ( Σ_{e∈E} ρ(e) ).

Note that [p_{i,j}] depends on μ and ρ only through the partitioning {E_{i,j}} they induce on E, but it is independent of the actual values of δ_i and ω_j.
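The profile construction and its relation to π can be checked numerically on toy data; the shifting, importance values, and expressions below are invented for the example:

```python
# Sketch of the μ-ρ-profile [p_{i,j}] and a numerical check of its
# relation to π[μ,ρ], for finite ranges of μ and ρ (toy data).
from itertools import product

E = ["a", "b", "c", "d"]
g_bar = {"a": {"A"}, "b": {"B"}, "c": {"C"}, "d": {"D"}}.get  # correct
g     = {"a": {"A"}, "b": {"B"}, "c": {"X"}, "d": {"D"}}.get  # system U

mu  = lambda r1, r2: 0.0 if r1 == r2 else 1.0   # range(μ) = {0, 1}
rho = lambda e: 0.5 if e in ("a", "c") else 1.0 # range(ρ) = {0.5, 1}

deltas = [0.0, 1.0]          # δ_1, ..., δ_n
omegas = [0.5, 1.0]          # ω_1, ..., ω_m

# p_{i,j} = |E_{i,j}| / |E|, where E_{i,j} collects the expressions
# with shifting value δ_i and importance value ω_j.
p = {(i, j): sum(1 for e in E
                 if mu(g(e), g_bar(e)) == d and rho(e) == w) / len(E)
     for (i, d), (j, w) in product(enumerate(deltas), enumerate(omegas))}

# The profile reproduces π[μ,ρ] exactly, as the identity in the text states.
pi_direct = (sum(mu(g(e), g_bar(e)) * rho(e) for e in E)
             / sum(rho(e) for e in E))
pi_from_profile = (sum(p[i, j] * deltas[i] * omegas[j] for i, j in p)
                   * len(E) / sum(rho(e) for e in E))
assert abs(pi_direct - pi_from_profile) < 1e-12
```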
Different choices of μ and ρ clearly provide different measures of performance that can be compared, in general, only on a qualitative and intuitive basis. Therefore, evaluating the performance of a system U requires first the definition of μ and ρ, and then the computation of π[μ,ρ]. Clearly, the most critical of these two steps is, from a conceptual point of view, the first, as it completely determines the "goodness" of the measure and its actual matching with the desired intuitive requirements. The second is difficult only from the computational point of view, since E is usually very large and, hence, it is not possible to evaluate the sum in the definition of π[μ,ρ] in a direct, exhaustive way.
In the next section we discuss in detail the problem of appropriately defining μ and ρ, while section 5 is devoted to the topic of actually computing π[μ,ρ].
4. Some Significant Choices of Shifting and 
Importance Parameters 
Having discussed in the previous section an abstract 
theory of performance evaluation, we now deal with 
some implementations of it that may be of practical 
interest. Clearly, an implementation is obtained by 
assigning actual functions as values for the 
(functional) parameters μ and ρ in the definition of π.
Different choices of μ and ρ will yield different models for performance evaluation and will allow one to analyse different features of the systems to be evaluated. Since μ and ρ are fully independent parameters, we
shall deal with each separately. 
Let us begin with the shifting function μ; in order that only the effect of μ be relevant to π, we shall suppose throughout the following discussion that ρ has the constant value ρ(e) = 1 for any e ∈ E.
The simplest case is that where μ may assume only two (boolean) values, 0 and 1, denoting a correct and a wrong understanding, respectively. Such a boolean shifting function is denoted by μ_1 and formally defined by:

μ_1(r',r'') = 0 if r' = r''
              1 if r' ≠ r''

for any pair (r',r'') ∈ (2^R ∪ {⊥}) × 2^R.
The intuitive meaning of μ_1, when used to evaluate a natural language understanding system U, is straightforward: π[μ_1,1](U) = x denotes the fraction of expressions of E that U is unable to understand correctly (clearly, 1 − x is the fraction of expressions correctly understood by U).
The above definition of μ_1 is very crude; in fact, systems with the same π[μ_1,1] can show very different behaviour, and, furthermore, π[μ_1,1](U_1) < π[μ_1,1](U_2) does not generally ensure that U_1 performs better than U_2.
A slight improvement can be obtained by splitting the case r' ≠ r'' into two subcases that cover, when evaluating U, the following situations:
(i) U is unable to assign a meaning to an expression e (that is, it fails); hence, g(e) = r' = ⊥ ≠ r'' = ḡ(e);
(ii) U assigns to an expression e a meaning that is not the correct one; hence, g(e) = r' ≠ r'' = ḡ(e), with g(e) ≠ ⊥.
It seems quite reasonable that generally case (i) is less serious than case (ii), so that we can propose a new definition of shifting μ_2:

μ_2(r',r'') = 0 if r' = r''
              δ if r' = ⊥
              1 if r' ≠ ⊥ and r' ≠ r''

where δ ∈ (0,1).
Clearly, the choice of δ strongly affects the values of π[μ_2,1](U) and will depend on how much we want to distinguish between cases (i) and (ii) mentioned above.
Going further to propose more fitting definitions of μ, we may want to analyze in more detail the case r' ≠ ⊥ and r' ≠ r''. Recalling that r' and r'' are sets of strings in R, we can distinguish the following cases:
(i) U assigns to an expression e the value ∅ (that is, no meaning), while it has a well-defined meaning;
(ii) U assigns to an expression e a proper nonempty subset of its meanings;
(iii) U assigns to an expression e all its correct meanings and, in addition, other incorrect ones;
(iv) U assigns to an expression e a proper nonempty subset of its meanings and, in addition, other incorrect ones;
(v) U assigns to an expression e a nonempty set of meanings that is fully different from the correct one.
Formally, we can define the shifting μ_3 that covers all such situations by:

μ_3(r',r'') = 0   if r' = r''
              δ_1 if r' = ⊥
              δ_2 if r' = ∅ and r'' ≠ ∅
              δ_3 if r' ≠ ∅ and r' ⊂ r''
              δ_4 if r' ⊃ r'' and r'' ≠ ∅
              δ_5 if r' ∩ r'' ≠ ∅ and r' − r'' ≠ ∅ and r'' − r' ≠ ∅
              1   if r' ≠ ∅ and r' ∩ r'' = ∅

where δ_i ∈ (0,1), for i = 1, 2, 3, 4, 5.
It could reasonably be assumed that δ_1 < δ_2 < δ_3 < δ_4 < δ_5, since the situations to which they are attached are generally considered as denoting increasing degrees of misunderstanding (note that μ_3 deals in great detail with the case of ambiguous understanding, where at least one of r' or r'' is not a singleton).
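The case analysis defining μ_3 maps directly onto set operations. The sketch below uses illustrative values for δ_1, ..., δ_5 (any increasing values in (0,1) would do; these particular numbers are not from the paper):

```python
# A sketch of the shifting μ3 from the text, with illustrative values
# δ1 < δ2 < δ3 < δ4 < δ5. The δ values are a free design choice.

BOTTOM = "⊥"
D1, D2, D3, D4, D5 = 0.1, 0.2, 0.4, 0.6, 0.8

def mu3(r1, r2):
    """r1 = g(e): the system's answer; r2 = ḡ(e): the correct meanings."""
    if r1 == r2:
        return 0.0        # perfect understanding
    if r1 == BOTTOM:
        return D1         # U fails on e
    if not r1 and r2:
        return D2         # U sees no meaning, but e has one
    if r1 and r1 < r2:
        return D3         # proper nonempty subset of the meanings
    if r1 > r2 and r2:
        return D4         # all correct meanings plus spurious ones
    if r1 & r2 and r1 - r2 and r2 - r1:
        return D5         # partial overlap plus spurious extras
    return 1.0            # nonempty and fully disjoint: worst case

assert mu3({"A"}, {"A"}) == 0.0
assert mu3(BOTTOM, {"A"}) == D1
assert mu3({"A"}, {"A", "B"}) == D3
assert mu3({"X"}, {"A"}) == 1.0
```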
Along the line of reasoning shown in the above 
definitions, several other improvements are possible. 
For example, we can further refine the above case (v), r' ≠ ∅ and r' ∩ r'' = ∅, by taking into account the actual structure of the elements of r' and r''. R being a well-defined formal language, we can first define an appropriate notion of "distance" between elements of R, and then extend it to nonempty disjoint elements of 2^R.
This kind of refinement is particularly significant 
when both r' and r" are singletons, that is, under- 
standing is not ambiguous, as is often the case. Also, 
it generally allows far more meaningful definitions of 
shifting, thus further approaching the intuitive notion 
of "distance" as "degree of understanding". 
Let us turn our attention now to the importance function ρ.
For this function too, a first simple proposal can be a boolean definition: no importance at all is assigned to expressions in E − L_D, and the same (non-null) importance to every expression in L_D. So we can define ρ_1 as:

ρ_1(e) = 1 if e ∈ L_D
         0 if e ∉ L_D

for each e ∈ E.
A refinement of ρ_1 can be obtained by analyzing the case e ∈ L_D and taking into account the frequency of use of expressions in L_D. This will give more importance to the correct understanding of more frequently used expressions and less importance to that of rare or unusual ones. From the human point of view, it is obvious that more frequent texts are used, and hence understood, by a larger number of people.
Therefore, it seems meaningful to consider a system that understands quite well the relatively small number of most common texts but fails on the most unusual ones to be better than a system that understands a lot of very rare texts but often fails in understanding the most common ones.
Formally, we can define the frequency of the expressions of E as a map z: E → [0,1], with the constraint that Σ_{e∈E} z(e) = 1. Then, we can define a new importance function ρ_2 such that:

ρ_2(e) = z(e) if e ∈ L_D
         0    otherwise

The frequency function z(e) can be effectively determined by collecting, through an appropriate experimental activity, a meaningful bag of texts T, in which each e ∈ E appears with a given integer multiplicity m(e), and then by computing
z(e) = m(e) / Σ_{e∈E} m(e).
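The estimation procedure just described can be sketched directly; the tiny corpus T, the domain, and the expressions are invented for the example:

```python
# Estimating the frequency z(e) = m(e) / Σ m(e) from a bag of texts T,
# and building the importance function ρ2. Corpus and domain are
# invented toy data.
from collections import Counter

T = ["open valve", "open valve", "open valve",
     "close valve", "close valve", "status report"]  # bag: repeats count
L_D = {"open valve", "close valve"}   # expressions meaningful in D

m = Counter(T)                        # multiplicity m(e) of each text
total = sum(m.values())
z = {e: m[e] / total for e in m}      # frequencies, summing to 1

def rho2(e):
    """ρ2(e) = z(e) if e ∈ L_D, 0 otherwise."""
    return z.get(e, 0.0) if e in L_D else 0.0

assert abs(sum(z.values()) - 1.0) < 1e-12
assert rho2("open valve") == 0.5      # 3 of the 6 occurrences in T
assert rho2("status report") == 0.0   # frequent or not, outside L_D
```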
A totally different criterion that could be used to re- 
fine the definition of importance functions is structural 
complexity of the expressions of E (or of LD). 
A very crude notion of structural complexity is simply given by the length of an expression e. In this case, given a chain 0 = ℓ_0 < ℓ_1 < ... < ℓ_{m−1} of m nonnegative integers, we can partition E into m classes:

E_1 = {e | ℓ_0 < |e| ≤ ℓ_1}
E_2 = {e | ℓ_1 < |e| ≤ ℓ_2}
...
E_m = {e | ℓ_{m−1} < |e|}.

Then, a new importance function ρ_3 is defined by:

ρ_3(e) = ω_i iff e ∈ E_i,

where ω_i ∈ [0,1], for i = 1, ..., m.
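This length-based partition is easy to realize in code; the thresholds and weights below are illustrative choices, not values from the paper:

```python
# Sketch of the length-based importance ρ3: partition E by expression
# length using a chain 0 = ℓ0 < ℓ1 < ... < ℓ_{m-1}, then weight each
# class E_i with a chosen ω_i. Thresholds and weights are illustrative.

ells   = [0, 3, 8]          # the chain ℓ0 < ℓ1 < ℓ2 (lengths in words)
omegas = [1.0, 0.6, 0.2]    # ω_1, ω_2, ω_3: short texts weighted most

def rho3(e):
    """ρ3(e) = ω_i iff e falls in the length class E_i."""
    length = len(e.split())              # |e| measured in words
    for ell, w in zip(ells[1:], omegas):
        if length <= ell:                # E_i = {e | ℓ_{i-1} < |e| ≤ ℓ_i}
            return w
    return omegas[-1]                    # E_m = {e | ℓ_{m-1} < |e|}

assert rho3("open valve") == 1.0                      # 2 words: E_1
assert rho3("please open the main valve now") == 0.6  # 6 words: E_2
```

The decreasing ω_i reflect the remark below that, in domains such as man-machine interaction, short texts are far more frequent than long ones.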
It is worth noting that the length of a text is not inde- 
pendent of its frequency of use; we feel that in several 
application domains (such as, for example, man- 
machine interaction) short texts are much more fre- 
quent than long ones and that texts exceeding a given 
length are not used at all. 
A more refined notion of structural complexity of 
an expression may be given by taking into account its 
syntactic structure, defined on the basis of an appro- 
priate set of characteristic features - see, for example, 
the classification proposed in Tennant (1980). E can be partitioned into different and disjoint classes E_i according to the set of syntactical features the expressions match, and an importance function ρ_4 can be defined as above:

ρ_4(e) = ω_i iff e ∈ E_i,

where ω_i ∈ [0,1], for i = 1, ..., m.
Let us note that, contrary to the relation illustrated above between the length of a text and its frequency,
it seems reasonable to consider syntactical complexity 
as fully independent of frequency; in fact, quite com- 
plex syntactical features (such as ellipsis, anaphora, 
broken text, etc.) are frequently found in several ap- 
plication domains. 
Finally, a couple of other possible choices for as- 
signing the importance function p are worth mention- 
ing: one based on the notions of "information 
content" or "structural complexity" according to Kol- 
mogorov (1965, 1968), and the other based on the 
concept of "semantic complexity" of an expression, 
which could be formally defined, for example, in the 
represented domain R. However, some more theoretical work on these notions is necessary before we can use them for our needs; hence we will not develop them further here.
5. Measuring Performance in Practice
In the preceding sections, some theoretical tools for 
measuring the performance of a natural language un- 
derstanding system have been illustrated. At this 
point we have to put them to work: that is, we must 
discuss how the performance of a system can be actu- 
ally evaluated and how the comparison between two 
different systems can be carried out. 
We distinguish two steps in the process of perform- 
ance evaluation: 
(i) to assign the functions μ and ρ;
(ii) to compute π[μ,ρ].
Let us examine in detail each of the two points. 
The choice of the shifting function μ depends only on the degree to which we want to refine the notion of error in understanding and on the varying importance we want to assign to each type of error. Hence choosing appropriate values for μ in order to analyse particular features of the system to be evaluated is often only a matter of subjective judgment. Also, the definition of μ is strongly dependent on the representation language R for the domain D: the richer and more structured R is, the more refined and subtle are the possible definitions of μ.
The choice of the importance function ρ, on the contrary, can generally be based on more objective arguments, once an appropriate ranking among the desired understanding capabilities of the system to be evaluated has been defined. For example, in the case where the frequency of texts is taken into account, an appropriate experimental activity can provide reliable statistical estimates of the frequency z(e) of each expression e ∈ E, thus allowing the effective computation of ρ(e). (Problems connected with the choice of a meaningful sample to estimate z(e) - which could well include millions of millions of expressions - are not dealt with here, since they are more related to statistics than to computational linguistics.)
Clearly, the choice of μ and ρ fully determines the numerical value of π[μ,ρ] (or of the matrix [p_{i,j}]) in correspondence to a given system U. How a change in μ or ρ can affect π[μ,ρ] is generally impossible to pre-
dict, since this strongly depends on the particular fea- 
tures of U. Therefore, evaluating a system with differ-
ent choices of μ or ρ can indeed provide a clearer im-
age of its performance. Although the comparison
between different values of π obtained with different
pairs (μ,ρ) is often only a matter of intuitive reason-
ing, an interesting particular case that can be conven-
iently dealt with formally is briefly sketched below.
A shifting function μ' is a refinement of a shifting
function μ (μ' ⊒ μ) iff:
- range(μ) = {δ_1, ..., δ_n}, with δ_1 < δ_2 < ... < δ_n;
- range(μ') = {δ'_1, ..., δ'_n'}, with δ'_1 < δ'_2 < ... < δ'_n' and n' > n;
- the partitioning {E'_i} of E induced by μ' is a re-
finement of the partitioning {E_i} of E induced by μ;
- for each class E_i = ∪_{t∈T} E'_t, where
T = {t_1, ..., t_i} ⊆ {1, ..., n'}, with t_1 < t_2 < ... < t_i:
δ_{i-1} < δ'_{t_1} < δ'_{t_2} < ... < δ'_{t_i} = δ_i.
In an analogous way we can define the refinement
ρ' of an importance function ρ (ρ' ⊒ ρ).
A pair (μ',ρ') is a refinement of a pair (μ,ρ) (we
write (μ',ρ') ⊒ (μ,ρ)) iff μ' ⊒ μ and ρ' ⊒ ρ.
It is straightforward to prove that:

For any system U and any two pairs (μ,ρ) and
(μ',ρ'), (μ',ρ') ⊒ (μ,ρ) implies π[μ',ρ'](U) ≤ π[μ,ρ](U).

For example, the shifting function μ_3 in section 4
refines μ_2, which in turn refines μ_1, that is,
μ_1 ⊑ μ_2 ⊑ μ_3. For the importance function ρ, on the other
hand, none of the functions ρ_1, ρ_2, ρ_3, ρ_4 in
section 4 is a refinement of any other.
It is worth noting that, when defining appropriate
pairs (μ,ρ) to evaluate a system, there are basically
two ways of reasoning for comparing different choices:
the first is to start from a basic proposal and
to proceed through successive refinements until the
desired degree of precision and detail is reached; the
second consists in proposing functions correspond-
ing to several different points of view and then inte-
grating them into a well-balanced synthesis.
Generally, the first approach is appropriate for the
definition of μ, while the second can be utilized
for the choice of ρ.
Let us turn now to the problem of computing
π[μ,ρ], once μ and ρ have been assigned.
Obviously, it is unrealistic to compute the exact
value of π by considering the behaviour of the system
with respect to every expression e ∈ E. Hence, a se-
quence of test cases has to be considered (Gold 1967).
Figure 2 shows a model for experimental perform-
ance evaluation. A GENERATOR provides at each
time instant i (i = 1, 2, ...) an expression e_i ∈ E. Then,
the system U to be evaluated computes the meaning
g(e_i), which is compared through μ with the correct mean-
ing g̃(e_i) supplied by an EVALUATOR (a human assumed
to be able to compute g̃).
Finally, the value ρ(e_i) is computed, and the current
value of

π_i = ( Σ_{j=1..i} μ(g(e_j), g̃(e_j)) · ρ(e_j) ) / ( Σ_{j=1..i} ρ(e_j) )

is determined.
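The loop just described can be sketched in a few lines; this is a minimal Python illustration, with the stream of expressions, the system U, the human EVALUATOR, μ, and ρ all supplied as placeholder callables (none of these names come from an actual system).

```python
def running_performance(expressions, system, evaluator, mu, rho):
    """Yield the running value pi_i after each test expression e_i."""
    num = 0.0   # accumulates mu(g(e_j), g_correct(e_j)) * rho(e_j)
    den = 0.0   # accumulates rho(e_j)
    for e in expressions:
        g = system(e)          # meaning computed by the system U
        g_true = evaluator(e)  # correct meaning from the EVALUATOR
        num += mu(g, g_true) * rho(e)
        den += rho(e)
        yield num / den if den else 0.0

# Toy usage: mu is 0 on exact match and 1 otherwise, rho is uniform.
pi = list(running_performance(
    ["a", "b", "c"],
    system=str.upper,                                  # a deliberately bad "system"
    evaluator=lambda e: e.upper() if e != "c" else "c",
    mu=lambda g, gt: 0.0 if g == gt else 1.0,
    rho=lambda e: 1.0))
# pi[-1] is the weighted fraction of misunderstood expressions
```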
The major problem with the computation of π is
the design of the GENERATOR, that is, the choice of 
the sample of E to be used for the evaluation of the 
system U. 
The mathematically simplest case is the one where
a subset B ⊆ E is randomly generated on the basis of a
given probability distribution on E (for example, equi-
probability); then,

π_B = ( Σ_{e∈B} μ(g(e), g̃(e)) · ρ(e) ) / ( Σ_{e∈B} ρ(e) )

is a random variable such that E(π_B) = π for reason-
able distributions, where E(π_B) denotes the expecta-
tion of π_B. The value of E(π_B) may be estimated by
means of statistical techniques such as, for example,
maximum likelihood estimation. We do not give a
detailed account of such techniques here; when needed,
they can easily be found in classical works on statistics
and sampling theory (Kobayashi 1978; Cox, Hinkley
1974; Mood, Graybill 1980).
A different technique would be that of fixing a
confidence interval, and then establishing the number
n of tests to be generated in order to obtain the value
of π[μ,ρ] within the given confidence level, by means,
for example, of χ² techniques.
In addition to these elementary statistical methods, 
more sophisticated sampling techniques can be used. 
This requires us first to choose a partitioning of E into 
meaningful classes, and then to define a sample strati- 
fied according to the considered partitioning. In this 
Figure 2. A model for experimental performance evaluation.
(Diagram: a GENERATOR supplies each expression e_i both to the system U to be evaluated and to the EVALUATOR; the resulting meanings g(e_i) and g̃(e_i) are combined with ρ(e_i) to update the running value π_i.)
case, the GENERATOR might not work on a purely 
random basis. 
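A stratified GENERATOR of this kind might be sketched as follows, assuming the partitioning of E is given as a dict from (hypothetical) class names to the expressions in each class; each sampled expression is weighted by its class size over the per-class sample size, so that larger classes contribute proportionally more. This is our own illustration under those assumptions, not a method from the paper.

```python
import random

def stratified_pi(strata, system, evaluator, mu, rho, per_stratum, seed=0):
    """Estimate pi from a sample stratified over a partitioning of E."""
    rng = random.Random(seed)           # seeded for reproducibility
    num = den = 0.0
    for exprs in strata.values():
        k = min(per_stratum, len(exprs))
        weight = len(exprs) / k         # expand the sample to its class size
        for e in rng.sample(exprs, k):
            num += weight * mu(system(e), evaluator(e)) * rho(e)
            den += weight * rho(e)
    return num / den

# Toy usage: the system echoes its input; "commands" are all misunderstood.
strata = {"queries": ["q1", "q2"], "commands": ["c1", "c2"]}
pi = stratified_pi(strata,
                   system=lambda e: e,
                   evaluator=lambda e: e if e.startswith("q") else e + "!",
                   mu=lambda g, gt: 0.0 if g == gt else 1.0,
                   rho=lambda e: 1.0,
                   per_stratum=2)
```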
All the above-mentioned techniques are independ-
ent of the choice of ρ, and do not take into account
specific goals that could be assigned to performance
evaluation (for example, syntactic capabilities, linguis-
tic or conceptual competence, etc.). Such general-
purpose methods can sometimes yield an evaluation
that is too global and not meaningful enough. Moreover,
the sample to be used for the computation of π is gen-
erally very large and hard to collect.
Special purpose evaluation, centered on the analysis
of some specific features of U, can often be more
interesting and easier to implement. In this case, the
specific goal of the measurement should be carefully
taken into account in the definition of ρ, and both the
goal and ρ should direct the choice of the appropriate
sample of E to be used for the experimental computa-
tion of π. More precisely, an experimental (special
purpose) evaluation session could be organized as
follows:
1. precisely identifying the system U, the domain
D, and the representation language R;
2. defining the goals of the evaluation;
3. deciding which samples of E to collect and how to
collect them;
4. defining μ;
5. defining ρ (and how to compute it for the chosen
samples);
6. computing π (and/or [p_i,j]).
Note that several μ and ρ could generally be consid-
ered for a careful experimentation. Moreover, steps 3,
4, and 5 might require, in critical cases, specific pre-
experimentation and some refinement loops for appro-
priate tuning.
In the appendix, a limited case study experimenta- 
tion is briefly discussed. 
6. Discussion and Future Research Directions 
In this paper we have presented a model for perform- 
ance evaluation of natural language understanding 
systems. The main task of this model is that of pro- 
viding a basis for a quantitative measure of how well a 
system can understand natural language, thus allowing 
an objective and experimental comparison of the per- 
formance of different systems. 
Before discussing some open problems and illustrat- 
ing the main lines of future research, let us briefly 
discuss some further features of our approach by com- 
paring it to the classical work by Tennant (1979, 
1980) and by Finin, Goodman, and Tennant (1979). 
Tennant's proposal is based on the three main con- 
cepts of habitability, completeness, and abstract analy- 
sis. This last point is not considered here, as ex- 
plained in section 1 (see further in this section for its 
possible relevance to future work); we therefore focus 
on the first two. From a naive point of view, habita- 
bility is used to test whether or not the system does 
what it was designed to do; completeness is introduced 
to test whether or not the system meets users' require- 
ments. More precisely, Tennant introduces the two 
notions of coverage and completeness to denote, respec- 
tively, the capabilities (both conceptual and linguistic) 
that the designer has put within a system, and 
(similarly to Woods, Kaplan, Nash-Webber 1972 
though differing from Woods 1977) the degree to 
which the capabilities expected by a set of users can 
actually be found in the system coverage. Further- 
more, habitability denotes (quite differently from Watt 
1968) the degree to which a system can actually ex- 
hibit the capabilities that it was designed to have. 
Our approach is based on a slightly different model 
and provides in some sense a refinement of the above 
concepts. 
We denote by the term competence the capabilities 
that a system is actually able to show, while by the 
term coverage we refer, according to Tennant, to the 
theoretical capabilities that a system should have as a 
consequence of its design specifications. 
More precisely, the conceptual coverage of a sys- 
tem UR/D is formalized in our model by the domain 
D, which represents, in fact, the range of concepts that 
are within the domain of discourse of a given applica- 
tion. 
The linguistic coverage clearly includes L_D but,
generally, is not limited to L_D, since understanding a
language in a given domain also implies the capability
of recognizing that some expressions are not meaning-
ful in that domain.
In general, for a given importance function ρ, we
can assume that the linguistic coverage is defined by:

L'_D = {e | e ∈ E and ρ(e) > Λ},

where Λ (0 < Λ < 1) is a fixed bound.
The linguistic competence can then be defined as:

L''_D = {e | e ∈ L'_D and g(e) = g̃(e)},
and the conceptual competence as:

D'' = ∪_{e∈L''_D} f_D(e).

The above concepts are summarized in Figure 3.
Our definition of performance π[μ,ρ] tries to give a
global idea of how well the competence of a system
(without distinction between conceptual and linguistic
aspects) approaches its coverage. This measure is
quite similar to, and provides a refinement of, the
concept of habitability, involving also to some extent
the notion of completeness. In fact, both the choice
of D as an adequate domain and the definition of ρ as
a suitable importance function (and, therefore, of
L'_D) implicitly refer to a set of users and hence to
completeness.
It is apparent that the proposal introduced in this
paper demands further work, both theoretical and
experimental, in order to provide fully adequate tools for
performance evaluation.
First of all, some of the concepts presented here
have to be further discussed and expanded. For exam-
ple, in the definition of π, we have normalized it with
respect to ρ by setting:

π = ( Σ_{e∈E} μ(g(e), g̃(e)) · ρ(e) ) / ( Σ_{e∈E} ρ(e) ).

A different choice could be:

π = ( Σ_{e∈E} μ(g(e), g̃(e)) · ρ(e) ) / |E|,

where μ and ρ are given the same importance (in this
case the value π = 1 would be reached only when all
expressions of E are fully misunderstood, that is,
μ ≡ 1, and when it is important to the highest degree
that each of them be correctly understood, that is,
ρ ≡ 1). While we have preferred the first defini-
tion here, arguments could be given in favour of the sec-
ond.
A second critical point is the definition of the
μ-ρ-profile [p_i,j]. This could be further extended so
as to provide a picture along several dimensions (for
example: frequency, syntactic complexity, informa-
tion content, etc.). Third, it is worthwhile reconsidering
and improving the notion of refinement: in fact, the
present definition is not stable with respect to the
choice of μ and ρ. That is, it could be that, given two
systems U and U':

π[μ,ρ](U) < π[μ,ρ](U')

and, for some refinement (μ',ρ') of (μ,ρ):

π[μ',ρ'](U) > π[μ',ρ'](U'),
so that the refinement of the evaluation criteria may 
give an inversion of the first evaluation. A formal 
development of the three points mentioned above will 
Figure 3. Coverage and competence of a natural language understanding system.
(Diagram: at the system design level, the conceptual coverage D and the linguistic coverage of the natural language understanding system U; at the system performance level, the corresponding conceptual and linguistic competence; HABITABILITY relates coverage to competence, while COMPLETENESS relates coverage to the capabilities expected by the users.)
be part of a future paper. 
As for the main directions in the devel-
opment of the current research activity, we mention:
• experimentation with the model proposed in the
evaluation of large systems;
• development of appropriate sampling techniques for
the experimental evaluation of π;
• experimentation with several different choices of μ
and ρ;
• design of techniques for special purpose evaluation
(choice of the goal, definition of μ and ρ, sampling,
etc.);
• analysis of the adequacy of the notion of μ-ρ-profile
for representing all interesting details of the per-
formance of a system.
Beyond these issues we also point out two more 
ambitious and promising problems; they will be faced 
in future work. The approach to performance evalua- 
tion presented in this paper has two major limitations: 
first, it is only concerned with input-output behaviour 
and does not take into account the internal model on 
which a system is based; second, it does not deal with 
the efficiency of the natural language understanding 
process. As far as the former topic is concerned, it is 
clear that, except in the case where commercial appli- 
cations are considered, one is primarily interested in 
models rather than in particular implementations. It is 
far more significant that a model, a knowledge repre- 
sentation method, and a parsing algorithm have been 
designed to build natural language understanding sys- 
tems rather than that a specific system has been con- 
structed in a particular domain for a particular use. 
Tennant (1980) (see also Woods 1977) proposes a 
method, called abstract analysis, to organize in an in- 
formal but disciplined way the evaluation, through 
taxonomies of conceptual, linguistic, and implementa- 
tional issues, of the internal behaviour of a natural 
language system (including analysis of failure causes, 
domain dependent features, knowledge base complete- 
ness and closure, algorithm deficiencies, extensibility, 
etc.). A very demanding research issue that could 
substantially contribute to the development of the 
research on natural language processing is the defini- 
tion of more formal methods that, starting from the 
above proposal, allow a "deep" evaluation and com- 
parison of systems on the basis of their internal struc- 
ture and mode of operation, as opposed to the "surface"
measure of their input-output behaviour, as considered 
in the present paper. 
Concerning the latter topic, efficiency, two aspects 
seem worth considering: the experimental measure of 
the efficiency of a specific system in understanding 
natural language that could appropriately complete the 
concept of performance defined in the present work; 
and the theoretical evaluation of the complexity of the 
general model underlying the construction of a particu- 
lar system, which could possibly complete the notion 
of "deep" evaluation mentioned above. 
Acknowledgements 
We are grateful to the anonymous referees for their 
useful criticism and suggestions. 
We would also like to acknowledge the appreciated 
support provided by CSELT Laboratories (Torino, 
Italy) with the experimentation of the PARNAX sys- 
tem. 

References 
Comino, R.; Gemello, R.; Guida, G.; Rullent, C.; Sisto, L.; and 
Somalvico, M. 1983 Understanding Natural Language 
Through Parallel Processing of Syntactic and Semantic Knowl- 
edge: An Application to Data Base Query. In Proc. 8th Int. 
Joint Conference on Artificial Intelligence. Karlsruhe, West Ger- 
many: 663-667. 
Cox, D.R. and Hinkley, D.V. 1974 Theoretical Statistics. Chapman 
and Hall, London. 
Finin, T.; Goodman, B.; and Tennant, H. 1979 JETS: Achieving 
Completeness Through Coverage and Closure. In Proc. 6th Int. 
Joint Conference on Artificial Intelligence. Tokyo, Japan: 275- 
281. 
Gold, E.M. 1967 Language Identification in the Limit. Informa- 
tion and Control 10: 447-474. 
Kaplan, J. 1982 Special Section: Natural Language Processing. 
ACM SIGART Newsletter 79:27-109 and 80: 59-61. 
Kolmogorov, A.N. 1965 Three Approaches to the Concept of
"The Amount of Information". Probl. of Information
Transmission 1 (1): 3-11.
Mood, R.S. and Graybill, H.J. 1980 Introduction to Statistics.
McGraw-Hill, Englewood Cliffs, New Jersey.
Tennant, H. 1979 Experience with the Evaluation of Natural
Language Question Answerers. In Proc. 6th Int. Joint Confer-
ence on Artificial Intelligence. Tokyo, Japan: 874-876.
Tennant, H. 1980 Evaluation of Natural Language Processes.
Report T-103. Coordinated Science Laboratory, University of
Illinois, Urbana, Illinois.
Waltz, D. 1977 Natural Language Interfaces. ACM SIGART
Newsletter 61: 16-64.
Watt, W.C. 1968 Habitability. American Documentation: 338-351.
Woods, W.A. 1977 A Personal View of Natural Language Under- 
standing. ACM SIGART Newsletter 61: 17-20. 
Woods, W.A.; Kaplan, R.M.; and Nash-Webber, B. 1972 The 
Lunar Sciences Natural Language Information System: Final 
Report. Report 2378. Bolt Beranek and Newman, Cambridge, 
Massachusetts. 
