Proceedings of the Fourth International Natural Language Generation Conference, pages 130–132,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Building a semantically transparent corpus
for the generation of referring expressions
Kees van Deemter and Ielka van der Sluis and Albert Gatt
Department of Computing Science
University of Aberdeen
{kvdeemte,ivdsluis,agatt}@csd.abdn.ac.uk
Abstract
This paper discusses the construction of
a corpus for the evaluation of algorithms
that generate referring expressions. It is
argued that such an evaluation task re-
quires a semantically transparent corpus,
and controlled experiments are the best
way to create such a resource. We address
a number of issues that have arisen in an
ongoing evaluation study, among which is
the problem of judging the output of GRE
algorithms against a human gold standard.
1 Creating and using a corpus for GRE
A decade ago, Dale and Reiter (1995) published
a seminal paper in which they compared a num-
ber of GRE algorithms. These algorithms included
a Full Brevity (FB) algorithm which generates de-
scriptions of minimal length, a greedy algorithm
(GA), and an Incremental Algorithm (IA). The
authors argued that the latter was the best model
of human referential behaviour, and versions of
the IA have since come to represent the state
of the art in GRE. Dale and Reiter’s hypothe-
sis was motivated by psycholinguistic findings,
notably that speakers tend to initiate references
before they have completely scanned a domain.
However, this finding affords different algorithmic
interpretations. Similarly, the finding that basic-
level terms in referring expressions allow hearers
to form a psychological gestalt could be incorpo-
rated into practically any GRE algorithm.1
We decided to put Dale and Reiter’s hypothesis
to the test by an evaluation of the output of dif-
1A separate argument for IA involves tractability, but al-
though some alternatives (such as FB) are intractable, others
(such as GA) are only polynomial, and can therefore not eas-
ily be dismissed on purely computational grounds.
ferent GRE algorithms against human production.
However, it is notoriously difficult to obtain suit-
able corpora for a task that is as semantically in-
tensive as Content Determination (for GRE). Al-
though existing corpora are valuable resources,
NLG often requires information that is not avail-
able in text. Suppose, for example, that a corpus
contained articles about politics, how would the
output of a GRE algorithm be evaluated against the
corpus? It would be difficult to infer from an ar-
ticle exactly which representatives in the British
House of Commons are Liberal Democrats, or
Scottish. Combining multiple texts is hazardous,
since facts could alter across sources and time.
Moreover, the conditions under which such texts
were produced (e.g. fault-critical or not, as ex-
plained below) are hard to determine.
A recent GRE evaluation by Gupta and Stent
(2005) focused on dialogue corpora, using MAP-
TASK and COCONUT, both of which have an as-
sociated domain. Their results show that referent
identification in MAPTASK often requires no more
than a TYPE attribute, so that none of the algo-
rithms performed better than a baseline. In con-
trast to MAPTASK, COCONUT has a more elabo-
rate domain, but it is characterised by a collabora-
tive task, and references frequently go beyond the
identification criterion that is typically invoked in
GRE2. Mindful of the limitations of existing cor-
pora, and of the extent to which evaluation de-
pends on the corpus under study, we are using
controlled experiments to create a corpus whose
construction will ensure that existing algorithms
can be adequately differentiated on an identifica-
tion task.
2Jordan and Walker (2000) have demonstrated a signifi-
cantly better match to the human data when task-related con-
straints are taken into account.
130
2 Setup of the experiment
Like Dale and Reiter (1995), we focused on first-
mention descriptions. However, we decided to in-
clude simple ‘disjunctive’ references to sets (as
in ‘the red chair and the black table’), in addi-
tion to conjunctions of atomic properties, since
these can be handled by essentially the same al-
gorithms (van Deemter, 2002). For generality, we
lookedat twovery differentdomains. One ofthese
involved artificially constructed pictures of furni-
ture, where the available attributes and values are
relatively easy to determine. The other involved
real photographs of individuals, which provide a
richer range of options to subjects. To date, data
has been collected from 19 participants, and anal-
ysis is in progress.
Our first challenge was to make the experiment
naturalistic. Subjects were shown 38 randomised
trials, each depicting a set of objects, one or two
of which were the targets, surrounded by 6 dis-
tractors (Figure 1). In each case, a minimal distin-
guishing description of the targets was available.
Subjects were led to believe that they would be
describing the targets for an interlocutor. Once a
description was typed, the system removed from
the screen what it took to be the referents.
Figure 1: A stimulus example from the furniture domain.
Three groups performed the task in different
conditions, namely: 〈±FaultCritical〉, where
half the subjects in the 〈+FaultCritical〉 case
could use location (‘in the top left corner’). The
〈+FaultCritical〉 group was told: ‘Our program
will eventually be used in situations where it is
crucial that it understands descriptions accurately.
In these situations, there will often be no option to
correct mistakes. Therefore, (...) you will not get
the chance to revise (your description)’. By con-
trast, the 〈−FaultCritical〉 subjects were given
the opportunity to revise their description should
the system have got it wrong. Subjects in the
〈−Location〉 condition were told that their inter-
locutor could see exactly the same pictures as they
could, but these had been jumbled up; by con-
trast, 〈+Location〉 subjects were led to believe
that their addressee could see the pictures in ex-
actly the same position.
The second main challenge was to create tri-
als that would distinguish between all the algo-
rithms. For instance, if trials involved only one at-
tribute, say an object’s TYPE (e.g., chair or table),
they would not allow us to distinguish IA from
FB, as both would always generate the shortest de-
scription. Subtler issues arise with local brevity
(Reiter, 1990), an optimisation strategy which re-
quires sufficiently complex trials to make a differ-
ence.
3 How to analyse the data?
Our semantically transparent corpus can be
used for testing various hypotheses, for in-
stance about when an algorithm should
overspecify descriptions (e.g. more in
〈+FaultCritical,+Location〉 (Arts, 2004),
and/or when the target is a set). Here, we focus on
the issue raised in Section 1, namely, which of the
algorithms discussed in Dale and Reiter (1995)
matches human behaviour best.
The first problem is determining the relevant al-
gorithms. The IA comes in different flavours, be-
cause its output depends on the order in which
the different properties are attempted (commonly
called the preference order). It is possible to
consider all different IAs (trying every conceiv-
able preference order), but this would increase the
number of statistical hypotheses to be tested, im-
pacting the validity of the results and requiring a
Bonferroni correction. Instead, we are using a pre-
test to find the optimal version of IA, comparing
only that version to the other algorithms.
The second question is how to assess algorithm
performance. Since our production experiment
does not yield a single gold standard (GS), an al-
gorithm might match subjects better in one con-
dition (e.g. 〈+FaultCritical), or perform bet-
ter in one domain (e.g. furniture). Moreover, it
might match subjects poorly overall due to sam-
ple variation, while evincing a perfect match with
a single individual. Using both a by-subjects and a
by-items analysis will partially control for sample
131
dispersion.
How should we calculate the match between an
algorithm and a GS? Once again, there are two
facets to this problem. Since we are focusing on
Content Determination, each human description
could be viewed as associating, with the relevant
trial, a set of properties. Our approach will be to
annotate each human descriptionwith the set of at-
tributes it contains. However, the real data is often
messy. For example, when one subject called an
object ‘the non-coloured table’, and another called
it‘thegreydesk’, bothmaybeexpressingthesame
attributes (i.e. TYPE and COLOUR). Also, while it
is often assumed that the output of GRE is a def-
inite noun phrase, this is not always the case in
our corpus, which contains indefinite distinguish-
ing descriptions such as ‘a red chair, facing to
the right’, and telegraphic messages such as ‘red,
right-facing’.
The second aspect to the problem concerns the
actual human-algorithm comparison. Suppose the
GS equals the output of one subject, and we are
comparing two algorithms, x and y. Suppose our
subject produced ‘the two huge red sofas’, which
the GS associates with the set {sofa,red,large}.
Suppose our algorithms describe the target as:
Output from x : {sofa,red,top}
Outputfromy : {sofa,red,large,top}
Which of these algorithms matches the GS best?
Algorithm y adds a property (perhaps overspecify-
ing even more than the GS). Algorithm x has the
same length as the GS, but replaces one property
by another. Several reasonable ways of assess-
ing the differences can be devised, one of which is
Levenshtein distance (which suggests preferring y
over x, since the latter involves a deletion and an
addition) (Levenshtein, 1966). We also intend to
examine how often the GS over- or underspecifies
where the algorithm does not.
4 Conclusion
Corpora can be an invaluable resource for NLG
as long as the necessary contextual information
and the conditions under which the texts in a cor-
pus were produced are known. We believe that
controlled and balanced experiments are needed
for building semantically transparent resources,
whose construction we have discussed. As shown
in this paper, evaluation of algorithms against the
number of gold standards obtained with such a
corpus needs careful consideration.
Evaluation of GRE – and NLG systems more
generally – would benefit from more investiga-
tion of the differences between readers and pro-
ducers. In future work, we intend to follow up
with a reader-oriented experiment in which we test
the speed and/or accuracy with which the output
of different GRE algorithms is understood by sub-
jects. The dependent variables here will be non-
linguistic (perhaps involving subjects clicking on
pictures of presumed target referents). This illus-
trates a more general issue in this area, namely
that corpora should, in our view, only be a start-
ing point, with which data of different kinds can
be associated.
5 Acknowledgments
Thanks to Ehud Reiter, Richard Power
and Emiel Krahmer for useful comments.
This work is part of the TUNA project
(http://www.csd.abdn.ac.uk/
research/tuna/), funded by the EPSRC
in the UK (GR/S13330/01).

References
[Arts2004] A.Arts. 2004. Overspecification in Instruc-
tive Texts. Ph.D. thesis, Tilburg University.
[Dale and Reiter1995] R. Dale and E. Reiter. 1995.
Computational interpretations of the Gricean max-
ims in the generation of referring expressions. Cog-
nitive Science, 18:233–263.
[van Deemter2002] K. van Deemter. 2002. Generat-
ing referring expressions: Boolean extensions of the
incremental algorithm. Computational Linguistics,
28(1):37–52.
[Gupta and Stent2005] S. Gupta and A. J. Stent. 2005.
Automatic evaluation of referring expression gener-
ation using corpora. In Proceedings of the 1st Work-
shop on Using Corpora in NLG, Birmingham, UK.
[Jordan and Walker2000] P. Jordan and M. Walker.
2000. Learning attribute selections for non-
pronominal expressions. In Proceedings of the 38th
Annual Meeting of the Association for Computa-
tional Linguistics.
[Levenshtein1966] V. Levenshtein. 1966. Binary codes
capable of correcting deletions, insertions and rever-
sals. Soviet Physics Doklady, 10(8):707–710.
[Reiter1990] E. Reiter. 1990. The computational com-
plexity of avoiding conversational implicatures. In
Proceedings of the 28th ACL Meeting, pages 97–
104. MIT Press.
