Evaluation Metrics for Generation 
Srinivas Bangalore and Owen Rambow and Steve Whittaker 
AT&T Labs - Research 
180 Park Ave, PO Box 971 
Florham Park, NJ 07932-0971, USA 
{srini,r~mbow, stevew}@reSearch, art tom 
Abstract 
Certain generation applications may profit from the 
use of stochastic methods. In developing stochastic 
methods, it is crucial to be able to quickly assess 
the relative merits of different approaches or mod- 
els. In this paper, we present several types of in- 
trinsic (system internal) metrics which we have used 
for baseline quantitative assessment. This quanti- 
tative assessment should then be augmented to a 
fuller evaluation that examines qualitative aspects. 
To this end, we describe an experiment that tests 
correlation between the quantitative metrics and hu- 
man qualitative judgment. The experiment confirms 
that intrinsic metrics cannot replace human evalu- 
ation, but some correlate significantly with human 
judgments of quality and understandability and can 
be used for evaluation during development. 
1 Introduction 
For many applications in natural  genera- 
tion (NLG), the range of linguistic expressions that 
must be generated is quite restricted, and a gram- 
mar for a surface realization component can be fully 
specified by hand. Moreover, iLL inany cases it is 
very important not to deviate from very specific out- 
put in generation (e.g., maritime weather reports), 
in which case hand-crafted grammars give excellent 
control. In these cases, evaluations of the generator 
that rely on human judgments (Lester and Porter, 
I997) or on human annotation of the test corpora 
(Kukich, 1983) are quite sufficient .... 
However. in other NLG applications the variety of 
the output is much larger, and the demands on 
the quality of the output are solnewhat less strin- 
gent. A typical example is NLG in the context of 
(interlingua- or transfer-based) inachine translation. 
Another reason for relaxing the quality of the out- 
put may be that not enough time is available to de- 
velop a full gramnlar for a new target,  in 
NLG. ILL all these cases, stochastic methods provide 
an alternative to hand-crafted approaches to NLG. 
1 
To our knowledge, the first to use stochastic tech- 
niques in an NLG realization module were Langkilde 
and Knight (1998a) and (~998b) (see also (Langk- 
ilde, 2000)). As is the case for stochastic approaches 
in natural  understanding, the research and 
development itself requires an effective intrinsic met- 
ric in order to be able to evaluate progress. 
In this paper, we discuss several evaluation metrics 
that we are using during the development of FERGUS 
(Flexible Empiricist/Rationalist Generation Using 
Syntax). FERCUS, a realization module, follows 
Knight and Langkilde's seminal work in using an 
n-gram  model, but we augment it with a 
tree-based stochastic model and a lexicalized syntac- 
tic grammar. The metrics are useful to us as rela- 
tive quantitative assessments of different models we 
experiment with; however, we do not pretend that 
these metrics in themselves have any validity. In- 
stead, we follow work done in dialog systems (Walker 
et al., 1997) and attempt to find metrics which on 
tim one hand can be computed easily but on the 
other hand correlate with empirically verified human 
judgments in qualitative categories such as readabil- 
ity. 
The structure of the paper is as follows. In Section 2, 
we briefly describe the architecture of FEacUS, and 
some of the modules. In Section 3 we present four 
metrics and some results obtained with these met- 
rics. In Section 4 we discuss the for experimental 
validation of the metrics using human judgments, 
and present a new metric based on the results of 
these experiments. In Section 5 we discuss some of 
the 'many problematic issues related to the use Of 
metrics and our metrics in particular, and discuss 
on-going work. 
2 System Overview 
FERGUS is composed of three mQdules: .the Tree 
Chooser, tile Unraveler, and the Linear Precedence 
(LP) Chooser (Figure 1). Tile input to the system is 
a dependency tree as shown in Figure 2. t Note that 
the nodes are unordered and are labeled only with 
lexemes, not with any sort of syntactic annotations. 2 
The Tree Chooser uses a stochastic tree model to 
choose syntactic properties (expressed as trees in a 
Tree Adjoining Grammar) for the nodes in the in- 
put structure. This step can be seen as analogous to 
"supertagging" -(Bangalore-und doshh 1:999);. except 
that now supertags (i.e., names of trees which en- 
code the syntactic properties of a lexical head) must 
be found for words in a tree rather than for words 
in a linear sequence. The Tree Chooser makes the 
siinplifying assumptions that the choice of a tree for 
a node depends only on its daughter nodes, thus al- 
lowing for a top-down algorithm. The Tree Chooser 
draws on a tree model, which is a analysis in terms 
of syntactic dependency for 1,000,000 words of the 
Wall Street Journal (WSJ). 3 
The supertagged tree which is output from the Tree 
Chooser still does not fully determine the surface 
string, because there typically are different ways to 
attach a daughter node to her mother (for example, 
an adverb can be placed in different positions with 
respect to its verbal head). The Unraveler therefore 
uses the XTAG grammar of English (XTAG-Group, 
1999) to produce a lattice of all possible lineariza- 
tions that are compatible with the supertagged tree. 
Specifically, the daughter nodes are ordered with re- 
spect to the head at each level of the derivation tree. 
In cases where the XTAG grammar allows a daugh- 
ter node to be attached at more than one place in 
the mother supertag (as is the case in our exam- 
ple for was and for; generaUy, such underspecifica- 
tion occurs with adjuncts and with arguments if their 
syntactic role is not specified), a disjunction of all 
these positions is assigned to the daughter node. A 
bottom-up algorithm then constructs a lattice that 
encodes the strings represented by each level of the 
derivation tree. The lattice at the root of the deriva- 
tion tree is the result of the Unraveler. 
Finally. the LP Chooser chooses the most likely 
traversal of this lattice, given a linear  
1The sentence generated by this tree is a predicativenoun 
construction. The XTAG grammar analyzes these as being 
headed by the noun,rather-than by.the copula, and we fol- 
low the XTAG analysis. However, it would of course also be 
possible to use a graminar that allows for the copula-headed 
analysis. 
21n the system that we used in the experiments described 
in Section 3. all words (including function words) need to be 
present in the input representation, fully inflected. Further- 
more, there is no indication of syntactic role at all. This is of 
course unrealistic f~r applications see ,Section 5 for further 
renlarks. 
:3This wa~s constructed from the Penn Tree Bank using 
some heuristics, sirice the. l)enn Tree Bank does not contain 
full head-dependerit information; as a result of the tlse of 
heuristics, the Tree Model is tint fully correct. 
2 
I 
TAG Derivation Tree 
without Supertags 
i 
One single semi-specified~ 
TAG Deri~tion Trees 
Word Lattice 
i \[ cPc.oo=, \] 
l 
String 
Figure 1: Architecture of FERGUS 
estimate 
there was no cost for 
I 
phase 
the second 
Figure 2: Input to FERGUS 
model (n-gram). The lattice output from the Un- 
raveler encodes all possible word sequences permit- 
ted by the supertagged dependency structure. \Ve 
rank these word sequences in the order of their likeN- 
hood by composing the lattice with a finite-state ma- 
chine representing a trigram  model. This 
model has been constructed from the 1.000,0000 
words WSJ training corpus. We pick the best path 
through the lattice resulting from the composition 
using the Viterbi algorithm, and this top ranking 
word sequence is the output of the LP Chooser and 
the generator. 
When we tally the results we obtain the score shown 
in the first column of Table 1. 
Note that if there are insertions and deletions, the 
number of operations may be larger than the number 
of tokens involved for either one of the two strings. 
As a result, the simple string accuracy metric may 
3 Baseline-Qua_ntitmtive,Metrics ...,:-~.--..~,-:..,be.:..~eg~i~ee (t:hoagk:it, As, nevel:-greater._than 1, of 
We have used four different baseline quantitative 
metrics for evaluating our generator. The first two 
metrics are based entirely on the surface string. The 
next two metrics are based on a syntactic represen- 
tation of the sentence. 
3.1 String-Based Metrics 
We employ two metrics that measure the accuracy 
of a generated string. The first metric, simple ac- 
curacy, is the same string distance metric used for 
measuring speech recognition accuracy. This met- 
ric has also been used to measure accuracy of MT 
systems (Alshawi et al., 1998). It is based on string 
edit distance between the output of the generation 
system and the reference corpus string. Simple ac- 
curacy is the number of insertion (I), deletion (D) 
and substitutions (S) errors between the reference 
strings in the test corpus and the strings produced by 
the generation model. An alignment algorithm us- 
ing substitution, insertion and deletion of tokens as 
operations attempts to match the generated string 
with the reference string. Each of these operations 
is assigned a cost value such that a substitution op- 
eration is cheaper than the combined cost of a dele- 
tion and an insertion operation. The alignment al- 
gorithm attempts to find the set of operations that 
minimizes the cost of aligning the generated string 
to tile reference string. Tile metric is summarized 
in Equation (1). R is the number of tokens in the 
target string. 
course). 
The simple string accuracy metric penalizes a mis- 
placed token twice, as a deletion from its expected 
position and insertion at a different position. This is 
particularly worrisome in our case, since in our eval- 
uation scenario the generated sentence is a permuta- 
tion of the tokens in the reference string. We there- 
fore use a second metric, Generation String Ac- 
curacy, shown in Equation (3), which treats dele- 
tion of a token at one location in the string and the 
insertion of the same token at another location in 
the string as one single movement error (M). This 
is in addition to the remaining insertions (I') and 
deletions (D'). 
(3) Generation String Accuracy = 
( 1 -- M~-/~.P-~--~-~) 
In our example sentence (2), we see that the inser- 
tion and deletion of no can be collapsed into one 
move. However, the wrong positions of cost and of 
phase are not analyzed as two moves, since one takes 
the place of the other, and these two tokens still re- 
sult in one deletion, one substitution, and one inser- 
tion. 5 Thus, the generation string accuracy depe- 
nalizes simple moves, but still treats complex moves 
(involving more than one token) harshly. Overall, 
the scores for the two metrics introduced so far are 
shown in the first two columns of Table 1. 
3.2 Tree-Based Metrics 
(1) Simple String Accuracy = (1 I+*)+s I¢ ) \Vhile tile string-b~u~ed metrics are very easy to ap- 
ply, they have the disadvantage that they do not 
reflect the intuition that all token moves are not Consider tile fifth)wing example. The target sentence 
is on top, tile generated sentence below. Tile third equally "bad". Consider the subphrase estimate for 
line represents the operation needed to. transfor m .. phase the second of the sentence in (2). \Vhile this is 
one sentence into another: a period is used t.o indi- bad;it seems better:tiara art alternative such as es- 
cate that no operation is needed. 4 
(2) There was no cost estimate for tile 
There was estimate for l)hase tile 
d (1 i 
second phase 
second no cost 
i s 
• I Note that the metric is symmetric, 
timate phase for tile second. Tile difference between 
the two strings is that the first scrambled string, but 
not tile second, can be read off fl'om tile dependency 
tree for the sentence (as shown ill Figure 2) with- 
out violation of projectivity, i.e., without (roughly 
STiffs shows the importance of the alignment algorithm in 
the definition of Ihese two metrics: had it. not, aligned phase 
and cost as a substitution (but each with an empty position 
in the other~string-:instead),, then ~khe simple string accuracy 
would have 6 errors instead of 5, but the generation string 
accuracy would have 3 errors instead of ,1, 
speaking) creating discontinuous constituents. It 
has long been observed (though informally) that the 
dependency trees of a vast majority of sentences in 
the s of the world are projective (see e.g. 
(Mel'euk, 1988)), so that a violation of projectivity 
is presumably a more severe error than a word order 
variation that does not violate projectivity. 
We designed thetree-based'-acettrucymetrics in 
order to account for this effect. Instead of compar- 
ing two strings directly, we relate the two strings 
to a dependency tree of the reference string. For 
each treelet (i.e., non-leaf node with all of its daugh- 
ters) of the reference dependency tree, we construct 
strings of the head and its dependents in the order 
they appear in the reference string, and in the order 
they appear in the result string. We then calculate 
the number of substitutions, deletions, and inser- 
tions as for the simple string accuracy, and the num- 
ber of substitutions, moves, and remaining deletions 
and insertions as for the generation string metrics, 
for all treelets that form the dependency tree. We 
sum these scores, and then use the values obtained 
in the formulas given above for the two string-based 
metrics, yielding the Simple Tree Accuracy and 
Generation Tree Accuracy. The scores for our 
example sentence are shown in the last two columns 
of Table 1. 
3.3 Evaluation Results 
The simple accuracy, generation accuracy, simple 
tree accuracy and generation tree accuracy for the 
two experiments are tabulated in Table 2. The test 
corpus is a randomly chosen subset of 100 sentences 
from the Section 20 of WSJ. The dependency struc- 
tures for the test sentences were obtained automat- 
ically from converting the Penn TreeBank phrase 
structure trees, in the same way as was done to 
Create the training corpus. The average length of 
the test sentences is 16.7 words with a longest sen- 
tence being 24 words in length. As can be seen, the 
supertag-based model improves over the baseline LR 
model on all four baseline quantitative metrics. 
4 Qualitative Evaluation of the 
Quantitative Metrics 
4.1 The Experiments 
We have presented four metrics which we can com- 
pute automatically. In order to determine whether 
the metrics correlate with independent notions un- 
derstandability or quality, we have performed eval- 
uation experiments with human subjects. 
In the web-based experiment, we ask human sub- 
jects to read a short paragraph from the WSJ. We 
present three or five variants of the last sentence of 
this paragraph on the same page, and ask the sub- 
ject to judge them along two dimensions: 
Here we summarize two experiments that we have 
performed that use different tree nmdels. (For a 
more detailed comparisons of different tree models, 
see (Bangalore and Rainbow, 2000).) 
o For the baseline experiment, we impose a ran- 
dom tree structure for each sentence of the cor- 
pus and build a Tree Model whose parameters 
consist of whether a lexeme ld precedes or fol- 
lows her mother lexeme \[ .... We call this the 
Baseline Left-Right (LR) Model. This model 
generates There was estimate for phase the sec- 
ond no cost. for our example input. 
o In the second experiment we use the-system 
as described in Section 2. We employ the 
supertag-based tree model whose parameters 
consist of whether a lexeme ld with supertag 
sd is a dependent of lexeme 1,,, with supertag 
s,,,. Furthermore we use the information pro- 
vided by the XTAG grammar to order the de- 
pendents. This model generates There was no 
cost estimate for" the second phase . for our ex- 
ample input, .which is indeed.the sentence found 
in the WS.I. 
o Understandability: How easy is this sentence 
to understand? Options range from "Extremely 
easy" (= 7) to "Just barely possible" (=4) to 
"Impossible" (=1). (Intermediate numeric val- 
ues can also be chosen but have no description 
associated with them.) 
o Quality: How well-written is this sentence? 
Options range from "Extremely well-written'" 
(= 7) to "Pretty bad" (=4) to "Horrible (=1). 
(Again. intermediate numeric values can also t)e 
chosen, but have no description associated with 
them.) 
The 3-5 variants of each of 6 base sentences are con- 
strutted by us (most of the variants lraxre not actu- 
ally been generated by FERGUS) to sample multiple 
values of each intrinsic metric as well as to contrast 
differences between the intrinsic measures. Thus for 
one sentence "tumble", two of the five variants have 
approximately identical values for each of the met- 
rics but with the absolute values being high (0.9) 
and medium (0.7) respectively. For two other sen- 
\[,('II('(}S ~ve have contrasting intrinsic values for tree 
trod string based measures. For .the final sentence 
we have contrasts between the string measures with 
Metric Simple Generation Simple Generation 
String Accuracy String Accuracy Tree Accuracy Tree Accuracy 
Total number of tokens 9 9 9 9 
Unchanged 
Substitutions 
Insertions 
Deletions 
Moves 
6 
1 
2 
2 
0 
6 
0 
3 
3 
O.. 
6 
0 
0 
0 
.3 
Total number of problems 5 4 " 6 3 
Score 0.44 0.56 0.33 0.67 
Table 1: Scores for the sample sentence according to the four metrics 
Tree Model Simple Generation Simple Generation 
String Accuracy String Accuracy Tree Accuracy Tree Accuracy i 
Baseline LR Model 0.41 0.56 0.41 0.63 
..... i Supertag-based Model 0.58 0.72 0.65 0.76 I 
Table 2: Performance results 
tree measures being approximately equal. Ten sub- 
jects who were researchers from AT&T carried out 
the experiment. Each subject made a total of 24 
judgments. 
Given the variance between subjects we first nor- 
malized the data. We subtracted the mean score 
for each subject from each observed score and then 
divided this by standard deviation of the scores for 
that subject. As expected our data showed strong 
correlations between normalized understanding and 
quality judgments for each sentence variant (r(22) = 
0.94, p < 0.0001). 
Our main hypothesis is that the two tree-based met- 
rics correlate better with both understandability and 
quality than the string-based metrics. This was con- 
firmed. Correlations of the two string metrics with 
normalized understanding for each sentence variant 
were not significant (r(22) = 0.08 and rl.2.21 = 0.23, for 
simple accuracy and generation accuracy: for both 
p > 0.05). In contrast both of the tree metrics were 
significant (r(2.2) = 0.51 and r(22) = 0.48: for tree 
accuracy and generation tree accuracy, for both p 
< 0.05). Similar results were achieved--for thegor- 
realized quality metric: (r(.2.21 = 0.16 and r(221 = 
0,33: for simple accuracy and generation accuracy, 
for both p > 0.05), (r(ee) = 0.45 and r(.2.2) = 0.42, 
for tree accuracy and generation tree accuracy, for 
both p < 0.05). 
A second aim of ()Lit" qualitative evaluation was to 
lest various models of the relationship between in- 
trinsic variables and qualitative user judgments. \Ve 
proposed a mmlber-of'models:in which various conL- 
from the two tree models 
binations of intrinsic metrics were used to predict 
user judgments of understanding and quality. .We 
conducted a series of linear regressions with nor- 
malized judgments of understanding and quality as 
the dependent measures and as independent mea- 
sures different combinations of one of our four met- 
rics with sentence length, and with the "problem" 
variables that we used to define the string metrics 
(S, I, D, M, I', D' - see Section 3 for definitions). 
One sentence variant was excluded from the data set, 
on the grounds that the severely "mangled" sentence 
happened to turn out well-formed and with nearly 
the same nleaning as the target sentence. The re- 
sults are shown in Table 3. 
We first tested models using one of our metrics as a 
single intrinsic factor to explain the dependent vari- 
able. We then added the "problem" variables. 6 and 
could boost tile explanatory power while maintain- 
ing significance. In Table 3, we show only some con> 
binations, which show that tile best results were ob- 
tained by combining the simple tree accuracy with 
the number of Substitutions (S) and the sentence 
length. As we can see, the number of substitutions 
..... has an.important effecVon explanatory.power,, while 
that of sentence length is much more modest (but 
more important for quality than for understanding). 
Furthermore, the number of substitutions has more 
explanatory power than the number of moves (and 
in fact. than any of the other "problem" variables). 
The two regressions for understanding and writing 
show very sinlilar results. Normalized understand- 
6None of tile "problem" variables have much explanatory 
power on their own (nor (lid they achieve significance). 
Model User Metric Explanatory Power Statistical Significance 
(R 2) (p value) 
Simple String Accuracy Understanding 0.02 0.571 
Simple String Accuracy Quality 0.00 0.953 
Generation String Accuracy 
Generation String Accuracy 
Simple Tree Accuracy ....... . - ~, 
Simple Tree Accuracy 
Generation Tree Accuracy 
Generation Tree Accuracy 
Simple Tree Accuracy + S 
Simple Tree Accuracy + S 
Simple Tree Accuracy + M 
Simple Tree Accuracy + M 
Simple Tree Accuracy + Length 
Simple Tree Accuracy + Length 
Simple Tree Accuracy + S + Length 
Simple Tree Accuracy + S + Length 
Understanding 
Quality 
::Unders~aatdiag 
Quality 
Understanding 
Quality 
0.02 
0.05 
:., .... 0.36 
0.34 
0.35 
0.35 
0.584 
0.327 
. . • .......... ".0;003.. - .: 
0.003 
0.003 
0.003 
Understanding 0.48 0.001 
Quality 0.47 0.002 
Understanding 0.38 0.008 
Quality 0.34 0.015 
Understanding 0.40 0.006 
Quality 0.42 0.006 
0.51 
0.53 
Understanding 
Quality 
0.003 
0.002 
Table 3: Testing different models of user judgments (S is number of substitutions, M number of moved 
elements) 
ing was best modeled as: 
Normalized understanding = 1.4728*sim- 
ple tree accuracy - 0.1015*substitutions- 
0.0228 * length - 0.2127. 
This model was significant: F(3,1.9 ) = 6.62, p < 0.005. 
Tile model is plotted in Figure 3. with the data point 
representing the removed outlier at the top of the 
diagram. 
This model is also intuitively plausible. The simple 
tree metric was designed to measure the quality of a 
sentence and it has a positive coefficient. A substitu- 
tion represents a case in the string metrics in which 
not only a word is in the wrong place, but the word 
that should have been in that place is somewhere 
else, Therefore, substitutions, more than moves or 
insertions or deletions, represent grave cases of word 
order anomalies. Thus, it is plausible to penalize 
them separately. (,Note that tile simple tree accuracy 
is bounded by 1, while the number of substitutions is 
l/ounded by the length of the sentence. In practice, 
in our sentences S ranges between 0 and 10 with 
a mean of 1,583.) Finally, it is also plausible that 
longer sentem:es are more difficult to understand, so 
that length has a (small) negative coefficient. 
We now turn to model for quality, 
Normalized quality = 1.2134*simple tree 
accuracy- 0.0839*substitutions - 0.0280 * 
length - 0.0689. 
This model was also significant: F(3A9) = 7.23, p < 
0.005. The model is plotted in Figure 4, with the 
data point representing the removed outlier at the 
top of the diagram. The quality model is plausible 
for the same reasons that the understanding model 
is. 
L2 
PP 
j,1" 
.i / 
. // 
/ 
// 
• .,j- 
-05 O0 05 
I a728"SLmo~eTteeMel~ - 0 I015"S - 0 0228"lerN~h 
Figure 3: Regression for Understanding 
6 
o 
Du~h~ 
-(0 -O5 0.0 05 I 0 
1 4728*S,mpleT~eeMetr ¢ - 0 I015"S - 0 0228"len~l~h 
Figure 4: Regression for Quality (Well-Formedness) 
4.2 Two New Metrics 
A further goal of these experiments was to obtain 
one or two metrics which can be automatically com- 
puted, and which have been shown to significantly 
correlate with relevant human judgments• We use as 
a starting point the two linear models for normalized 
understanding and quality given above, but we make 
two changes. First, we observe that while it is plau- 
sible to model human judgments by penalizing long 
sentences, this seems unmotivated in an accuracy 
metric: we do not want to give a perfectly generated 
longer sentence a lower score than a perfectly gener- 
ated shorter sentence. We therefore use models that 
just use the simple tree accuracy and the number 
of substitutions as independent variables• Second, 
we note that once we have done so, a perfect sen- 
tence gets a score of 0.8689 (for understandability) 
or 0.6639 (for quality). We therefore divide by this 
score to assure that a perfect sentence gets a score 
of 1. (As for the previously introduced metrics, the 
scores may be less than 0.) 
\Ve obtain the following new metrics: 
(4) Understandability • Accuracy = 
(1.3147*simple tree accuracy 0.1039*sub- 
stitutions - 0.4458) / 0.8689 - 
(5) Quality Accuracy = (1.0192*simple tree ac- 
curacy- 0.0869*substitutions - 0.3553) / 0.6639 
\\e reevahtated our system and the baseline model 
using the new metrics, in order to veri(v whether 
the nloro motivated metrics we have developed still 
show that FER(;I:S improves l)erforniance over the 
baseline. This is indeed the case: the resuhs are 
Slllnm.arized ill Tabh'-t. 
Tree Model Understandability Quality 
Accuracy Accuracy 
Baseline -0.08 -0.12 
Supertag-based 0.44 0.42 
. Table 4: Performance results from the .two tree mod- 
..... els:using the:new metrics ....... 
5 Discussion 
We have devised the baseline quantitative metrics 
presented in this paper for internal use during re- 
search and development, in order to evaluate dif- 
ferent versions of FERGUS. However, the question 
also arises whether they can be used to compare two 
completely different realization modules. In either 
case, there are two main issues facing the proposed 
corpus-based quantitative evaluation: does it gener- 
alize and is it fair? 
The problem in generalization is this: can we use 
this method to evaluate anything other than ver- 
sions of FERGUS which generate sentences from the 
WSJ? We claim that we can indeed use the quan- 
titative evaluation procedure to evaluate most real- 
ization modules generating sentences from any cor- 
pus of unannotated English text. The fact that the 
tree-based metrics require dependency parses of the 
corpus is not a major impediment. Using exist- 
ing syntactic parsers plus ad-hoc postprocessors as 
needed, one can create the input representations to 
the generator as well as the syntactic dependency 
trees needed for the tree-based metrics. The fact 
that the parsers introduce errors should not affect 
the way the scores are used, namely as relative scores 
(they have no real value absolutely). Which realiza- 
tion modules can be evaluated? First, it is clear 
that our approach can only evaluate single-sentence 
realization modules which may perform some sen- 
tence planning tasks, but cruciaUy not including sen- 
tence scoping/aggregation. Second, this approach 
:only works for generators whose input representa- 
tion is fairly "syntactic". For example, it may be 
difficult to evaluate in this manner a generator that 
-uses semanzic roles in-its inpntrepresent~ion, since 
we currently cannot map large corpora of syntac- 
tic parses onto such semantic representations, and 
therefore cannot create the input representation for 
the evaluation. 
The second question is that of fairness of the evalu- 
ation. FE\[,tGt.'S as described in this paper is of lim- 
ited use. since it only chooses word order (and, to a 
certain extent, syntactic structure). Other realiza- 
tion and sentence planning tin{ks-which are needed 
for most applications and which may profit from a 
stochastic model include lexical choice, introduction 
of function words and punctuation, and generation 
of morphology. (See (Langkilde and Knight, 1998a) 
for a relevant discussion. FERGUS currently can per- 
form punctuation and function word insertion, and 
morphology and lexical choice are under develop- 
ment.) The question arises whether our metrics will 
. fairly measure the:quality,~of,a, more comp!ete real~ .... 
ization module (with some sentence planning). Once 
the range of choices that the generation component 
makes expands, one quickly runs into the problem 
that, while the gold standard may be a good way of 
communicating the input structure, there are usu- 
ally other good ways of doing so as well (using other 
words, other syntactic constructions, and so on). 
Our metrics will penalize such variation. However, 
in using stochastic methods one is of course precisely 
interested in learning from a corpus, so that the fact 
that there may be other ways of expressing an input 
is less relevant: the whole point of the stochastic ap- 
proach is precisely to express the input in a manner 
that resembles as much as possible the realizations 
found in the corpus (given its genre, register, id- 
iosyncratic choices, and so on). Assuming the test 
corpus is representative of the training corpus, we 
can then use our metrics to measure deviance from 
the corpus, whether it be merely in word order or in 
terms of more complex tasks such as lexical choice 
as well. Thus, as long as the goal of the realizer 
is to enmlate as closely as possible a given corpus 
(rather than provide a maximal range of paraphras- 
tic capability), then our approach can be used for 
evaluation, r 
As in the case of machine translation, evaluation in 
generation is a complex issue. (For a discussion, see 
(Mellish and Dale, 1998).) Presumably, the qual- 
ity of most generation systems can only be assessed 
at a system level in a task-oriented setting (rather 
than by taking quantitative measures or by asking 
humans for quality assessments). Such evaluations 
are costly, and they cannot be the basis of work in 
stochastic generation, for which evaluation is a fre- 
quent step in research and development. An advan- 
tage of our approach is that our quantitative metrics 
allow us to evaluate without human intervention, au- 
tomatically and objectively (objectively with respect 
to the defined metric,-that is).- Independently, the 
use of the metrics has been validated using human 
subjects (as discussed in Section 4): once this has 
happened, the researcher can have increased confi- 
dence that choices nlade in research and develop- 
ment based on the quantitative metrics will in fact 
7We could also assume a set of acceptable paraphrases for 
each sentence in the test corpus. Our metrics are run on all 
paraphrases, and the best score chosen. However. for many 
applications it will not be emsy to construct such paraphrase 
sets, be it by hand or automatically. 
8 
correlate with relevant subjective qualitative mea- 
sures. 

References 
Hiyan Alshawi, Srinivas Bangalore, and Shona Dou- 
glas. 1998. Automatic acquisition of hierarchical 
~traalsduatian.:models :for ~machine. tr:anslation, tn 
Proceedings of the 36th Annual Meeting Association 
for Computational Linguistics, Montreal, Canada. 
Srinivas Bangalore and Aravind Joshi. 1999. Su- 
pertagging: An approach to almost parsing. Com- 
putational Linguistics, 25(2). 
Srinivas Bangalore and Owen Rambow. 2000. Ex- 
ploiting a probabilistic hierarchical model for gem 
eration. In Proceedings of the 18th International 
Conference on Computational Linguistics (COLING 
2000), Saarbriicken, Germany. 
Karen Kukich. 1983. Knowledge-Based Report Gen- 
eration: A Knowledge Engineering Approach to Nat- 
ural Language Report Generation. Ph.D. thesis, Uni- 
versity of Pittsuburgh. 
Irene Langkilde and Kevin Knight. 1998a. Gener- 
ation that exploits corpus-based statistical knowl- 
edge. In 36th Meeting of the Association for Com- 
putational Linguistics and 17th International Con- 
ference on Computational Linguistics (COLING- 
ACL'98), pages 704-710, Montreal, Canada. 
Irene Langkilde and Kevin Knight. 1998b. The 
practical value of n-grams in generation. In Proceed- 
ings of the Ninth International Natural Language 
Generation Workshop (INLG'98), Niagara-on-the- 
Lake, Ontario. 
Irene Langkilde. 2000. Forest-based statistical sen- 
tence generation. In 6th Applied Natural Language 
Processing Conference (ANLP'2000), pages 170- 
177, Seattle, WA. 
James C. Lester and Bruce W. Porter. 1997. De- 
veloping and empirically evaluating robust explana- 
tion generators: The KNIGHT experiments. Compu- 
tational Linguistics. 23(1):65-102. 
Igor A. Mel'~uk. 19S8. Dependency Syntax: Theory 
and Practice. State University of New ~%rk Press. 
New York. 
Chris Mellish and Robert Dale. 1998. Evahlation in 
the context of natural  generation. Corn= 
puter Speech and Language, 12:349-373. 
M. A. Walker, D. Litman, C. A. Kamm. and 
A. Abella. 1997. PARADISE: A general framework 
for evahlating spoken dialogue agents. In Proceed- 
ings of the 35th Annual Meeting of the Association 
of Computational Linguistics, A CL/EA CL 97. pages 
271-280. 
The XTAG-Group. 1999. A lexicalized Tree Adjoin- 
ing Gralnmar for English. Technical report, Insti- 
- tu;te for 1Research in Cognitive Science, University of 
Pennsylvania. 
