Proceedings of the Fourth International Natural Language Generation Conference, pages 20–22,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Overgeneration and ranking for spoken dialogue systems
Sebastian Varges
Center for the Study of Language and Information
Stanford University
Stanford, CA 94305, USA
varges@stanford.edu
Abstract
We describe an implemented generator
for a spoken dialogue system that fol-
lows the ‘overgeneration and ranking’ ap-
proach. We find that overgeneration based
on bottom-up chart generation is well-
suited to a) model phenomena such as
alignment and variation in dialogue, and b)
address robustness issues in the face of im-
perfect generation input. We report evalu-
ation results of a first user study involving
20 subjects.
1 Introduction
Overgeneration and ranking approaches have be-
come increasingly popular in recent years (Langk-
ilde, 2002; Varges, 2002). However, most work on
generation for practical dialogue systems makes
use of generation components that work toward
a single output, often using simple templates. In
the following, we first describe our dialogue sys-
tem and then turn to the generator which is based
on the overgeneration and ranking paradigm. We
outline the results of a user study, followed by a
discussion section.
The dialogue system: Dialogue processing
starts with the output of a speech recognizer
(Nuance) which is analyzed by both a statistical
dependency parser and a topic classifier. Parse
trees and topic labels are matched by the ‘di-
alogue move scripts’ of the dialogue manager
(DM) (Mirkovic and Cavedon, 2005). The
dialogue system is fully implemented and has
been used in restaurant selection and MP3 player
tasks (Weng et al., 2004). There are 41 task-
independent, generic dialogue rules, 52 restaurant
selection rules and 89 MP3 player rules. Query
constraints are built by dialogue move scripts if
the parse tree matches input patterns specified
in the scripts. For example, a request “I want
to find an inexpensive Japanese restaurant that
takes reservations” results in constraints such as
restaurant:Cuisine = restaurant:japanese
and restaurant:PriceLevel = 0-10. If the
database query constructed from these constraints
returns no results, various constraint modification
strategies such as constraint relaxation or removal
can be employed. For example, ‘Japanese food’
can be relaxed to ‘Asian food’ since cuisine types
are hierarchically organized.
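The relaxation strategy described above can be sketched as follows. This is a minimal illustration only: the hierarchy `CUISINE_PARENT`, the helper names and the toy database rows are assumptions made for the example, not the system's actual ontology or constraint optimizer.

```python
# Hypothetical cuisine hierarchy (child -> parent); illustrative only.
CUISINE_PARENT = {
    "japanese": "asian",
    "thai": "asian",
    "asian": "any",
    "french": "european",
    "european": "any",
}

def is_a(cuisine, ancestor):
    """True if `cuisine` equals `ancestor` or lies below it in the hierarchy."""
    while cuisine is not None:
        if cuisine == ancestor:
            return True
        cuisine = CUISINE_PARENT.get(cuisine)
    return False

def query(db, constraints):
    """Toy database query over the two constraints used in this sketch."""
    return [row for row in db
            if is_a(row["Cuisine"], constraints["Cuisine"])
            and row["PriceLevel"] == constraints["PriceLevel"]]

def relax_and_query(db, constraints):
    """Relax the cuisine constraint one level at a time until results appear."""
    current = dict(constraints)
    results = query(db, current)
    while not results and current["Cuisine"] in CUISINE_PARENT:
        current["Cuisine"] = CUISINE_PARENT[current["Cuisine"]]
        results = query(db, current)
    return current, results

db = [{"name": "Thai Mee Choke", "Cuisine": "thai", "PriceLevel": "0-10"}]
original = {"Cuisine": "japanese", "PriceLevel": "0-10"}
optimized, results = relax_and_query(db, original)
```

Here the query for Japanese food returns nothing, so the constraint is relaxed to its parent category and the Thai restaurant is found. The original and the optimized constraint sets are both kept, since the generator needs to verbalize both.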
2 Overgeneration for spoken dialogue
Table 1 shows some example outputs of the sys-
tem. The wording of the realizations is informed
by a wizard-of-oz data collection. The task of
the generator is to produce these verbalizations
given dialogue strategy, constraints and further
discourse context, i.e. the input to the generator
is non-linguistic. We perform mild overgenera-
tion of candidate moves, followed by ranking. The
highest-ranked candidate is selected for output.
2.1 Chart generation
We follow a bottom-up chart generation approach
(Kay, 1996) for production systems similar to
(Varges, 2005). The rule-based core of the gen-
erator is a set of productions written in a produc-
tion system. Productions map individual database
constraints to phrases such as “open for lunch”,
“within 3 miles”, “a formal dress code”, and re-
cursively combine them into NPs. This includes
the use of coordination to produce “restaurants
with a 5-star rating and a formal dress code”,
for example. The NPs are integrated into sentence
templates, several of which can be combined
to form an output candidate turn. For example,
a constraint realizing template “I found no
[NP-original] but there are [NUM] [NP-optimized] in
my database” can be combined with a follow-up
sentence template such as “You could try to look
for [NP-constraint-suggestion]”. ‘NP-original’
realizes constraints directly constructed from the
user utterance; ‘NP-optimized’ realizes potentially
modified constraints used to obtain the actual
query result. To avoid generating separate sets of
NPs independently for these two – often largely
overlapping – constraint sets, we assign unique
indices to the input constraints, overgenerate NPs
and check their indices.
     |result|                  mod      example realization                                                  fexp
s1   0                         no       I’m sorry but I found no restaurants on Mayfield Road                0
                                        that serve Mediterranean food .
s2   small: > 0, < t1          no       There are 2 cheap Thai restaurants in Lincoln in my database :       61
                                        Thai Mee Choke and Noodle House .
s3   medium: >= t1, < t2       no       I found 9 restaurants with a two star rating and a formal dress      212
                                        code that are open for dinner and serve French food .
                                        Here are the first ones :
s4   large: >= t2              no       I found 258 restaurants on Page Mill Road, for example               300
                                        Maya Restaurant , Green Frog and Pho Hoa Restaurant .
                                        Would you like to try searching by cuisine ?
s5   large                     yes      I found no restaurants that ... However, there are NUM               16
                                        restaurants that ... Would you like to ...?
s6   (any)                     yes/no   I found 18 items .                                                   2
Table 1: Some system responses (‘|result|’: size of database result set, ‘mod’: performed modifications).
Last column: frequency in user study (180 tasks, 596 constraint inputs to generator)
The generator maintains state across dialogue
turns, allowing it to track its previous decisions
(see ‘variation’ below). Both input constraints and
chart edges are indexed by turn numbers to avoid
confusing edges of different turns.
We currently use 102 productions overall in the
restaurant and MP3 domains, 38 of them to gener-
ate NPs that realize 19 input constraints.
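The index bookkeeping of this section can be sketched as below. This is a simplified illustration under stated assumptions, not the production-system implementation: the phrase inventory `PHRASES`, the function names and the use of plain "and" coordination are all hypothetical. Each chart edge carries the set of constraint indices it realizes; NPs are overgenerated once and then selected per constraint set by an index check.

```python
# Hypothetical phrase alternatives, keyed by the unique index of the
# constraint each one realizes (indices and wordings are illustrative).
PHRASES = {
    1: ["serving Japanese food"],                        # original cuisine
    2: ["with low prices", "in the inexpensive price range"],
    3: ["taking reservations"],
    4: ["serving Asian food"],                           # relaxed cuisine
}

def combine(left, right):
    """Coordinate two edges if all indices of `left` precede those of `right`."""
    (idx_l, text_l), (idx_r, text_r) = left, right
    if max(idx_l) < min(idx_r):  # implies the two index sets are disjoint
        return (idx_l | idx_r, f"{text_l} and {text_r}")
    return None

def overgenerate(phrases):
    """Bottom-up, agenda-driven construction of all coordinated modifier edges."""
    agenda = [(frozenset([i]), text)
              for i, alts in phrases.items() for text in alts]
    chart = []
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        # Try the new edge against every edge already in the chart.
        for other in chart:
            for new in (combine(edge, other), combine(other, edge)):
                if new is not None and new not in chart and new not in agenda:
                    agenda.append(new)
        chart.append(edge)
    return chart

def nps_for(chart, wanted):
    """Index check: keep only NPs that realize exactly the wanted constraints."""
    wanted = frozenset(wanted)
    return sorted(f"restaurants {text}" for idx, text in chart if idx == wanted)

chart = overgenerate(PHRASES)
np_original = nps_for(chart, {1, 2, 3})   # constraints from the user utterance
np_optimized = nps_for(chart, {2, 3, 4})  # constraints after relaxation
```

Edges covering the shared constraints 2 and 3 are built once and reused for both NP sets. The chart also contains edges that mix indices 1 and 4; these are never selected because no requested index set contains both.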
2.2 Ranking: alignment & variation
Alignment Alignment is a key to successful nat-
ural language dialogue (Brockmann et al., 2005).
We perform alignment of system utterances with
user utterances by computing an ngram-based
overlap score. For example, a user utterance “I
want to find a Chinese restaurant” is represented by
the bag-of-words {‘I’, ‘want’, ‘to’, ‘find’, ...} and
the bag-of-bigrams {‘I want’, ‘want to’, ‘to find’,
...}. We compute the overlap with candidate sys-
tem utterances represented in the same way and
combine the unigram and bigram match scores.
Words are lemmatized and proper nouns of exam-
ple items removed from the utterances.
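A minimal sketch of such an overlap score is given below. Several details are assumptions, since the text does not specify them: n-grams are treated as sets rather than multisets, the overlap is normalized by the number of candidate n-grams, the weight `lam2` is an arbitrary illustrative value, and the input strings are taken to be already lowercased and lemmatized.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(user_tokens, cand_tokens, n):
    """Share of the candidate's n-grams that also occur in the user utterance."""
    cand = ngrams(cand_tokens, n)
    if not cand:
        return 0.0
    return len(ngrams(user_tokens, n) & cand) / len(cand)

def align(user, candidate, lam2=0.3):
    """Weighted combination of unigram and bigram overlap (lam2 is assumed)."""
    u, c = user.split(), candidate.split()
    return lam2 * overlap(u, c, 1) + (1 - lam2) * overlap(u, c, 2)

# Pre-lemmatized toy inputs; a real system would lemmatize and strip
# proper nouns of example items first.
user = "i want a restaurant that serve chinese food"
long_form = "i find 3 restaurant that serve chinese food"
short_form = "i find 3 chinese restaurant"

score_long = align(user, long_form)
score_short = align(user, short_form)
```

With these inputs the longer candidate shares four bigrams with the user utterance and the shorter one shares none, so the longer candidate receives the higher alignment score.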
Alignment allows us to prefer “restaurants that
serve Chinese food” over “Chinese restaurants”
if the user used a wording more similar to the
first. The Gricean Maxim of Brevity, applied to
NLG in (Dale and Reiter, 1995), suggests a prefer-
ence for the second, shorter realization. However,
if the user thought it necessary to use “serves”,
maybe to correct an earlier mislabeling by the
classifier/parse-matching patterns, then the system
should make it clear that it understood the user
correctly by using those same words. On the other
hand, a general preference for brevity is desirable
in spoken dialogue systems: users are generally
not willing to listen to lengthy synthesized speech.
Variation We use a variation score to ‘cycle’
over sentence-level paraphrases. Alternative can-
didates for realizing a certain input move are
given a unique alternation (‘alt’) number in in-
creasing order. For example, for the simple move
continuation query we may assign the follow-
ing alt values: “Do you want more?” (alt=1) and
“Do you want me to continue?” (alt=2). The sys-
tem cycles over these alternatives in turn. Once
we reach alt=2, it starts over from alt=1. The ac-
tual alt ‘score’ is inversely related to recency and
normalized to [0...1].
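One way to realize this cycling behaviour is sketched below. The class `VariationTracker`, the rank-based normalization and the tie-breaking rule for unused alternatives are assumptions chosen for illustration; the text only states that the score is inversely related to recency and lies in [0...1].

```python
class VariationTracker:
    """Tracks which paraphrase ('alt') of each move was used in which turn."""

    def __init__(self):
        self.last_used = {}  # (move, alt) -> turn number of the last use
        self.turn = 0

    def scores(self, move, alts):
        """Score each alt in [0, 1]: most recently used -> 0, oldest -> 1."""
        def recency_key(alt):
            last = self.last_used.get((move, alt))
            if last is not None:
                return (0, -last)   # used: more recent sorts earlier
            return (1, -alt)        # unused: counts as oldest, low alt last
        ordered = sorted(alts, key=recency_key)
        if len(ordered) == 1:
            return {ordered[0]: 1.0}
        return {alt: pos / (len(ordered) - 1) for pos, alt in enumerate(ordered)}

    def choose(self, move, alts):
        """Pick the highest-scoring alt and record it for the current turn."""
        scores = self.scores(move, alts)
        best = max(alts, key=lambda alt: scores[alt])
        self.turn += 1
        self.last_used[(move, best)] = self.turn
        return best

tracker = VariationTracker()
sequence = [tracker.choose("continuation_query", [1, 2]) for _ in range(4)]
```

In the actual generator the variation score is only one component of the final ranking; `choose` selects on variation alone here to make the cycling visible.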
Score combination The final candidate score is
a linear combination of alignment and variation
scores:
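\[ \mathit{score}_{\mathit{final}} = \lambda_1 \cdot \mathit{align}_{\mathit{uni,bi}} + (1 - \lambda_1) \cdot \mathit{variation} \tag{1} \]
\[ \mathit{align}_{\mathit{uni,bi}} = \lambda_2 \cdot \mathit{align}_{\mathit{uni}} + (1 - \lambda_2) \cdot \mathit{align}_{\mathit{bi}} \tag{2} \]
where λ1, λ2 ∈ [0, 1]. A high value of λ1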
places more emphasis on alignment, a low value
yields candidates that are more different from pre-
viously chosen ones. In our experience, align-
ment should be given a higher weight than vari-
ation, and, within alignment, bigrams should be
weighted higher than unigrams, i.e. λ1 > 0.5 and
λ2 < 0.5. Deriving weights empirically from cor-
pus data is an avenue for future research.
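The effect of the two weights can be illustrated with a short sketch of equations (1) and (2). The component scores of the two candidates and the weight values are made up for the example; only the form of the combination is taken from the text.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    align_uni: float   # unigram overlap with the user utterance
    align_bi: float    # bigram overlap with the user utterance
    variation: float   # 1.0 = least recently used paraphrase

def final_score(c, lam1=0.7, lam2=0.3):
    """Equations (1) and (2): alignment first, then alignment vs. variation."""
    align = lam2 * c.align_uni + (1 - lam2) * c.align_bi
    return lam1 * align + (1 - lam1) * c.variation

def best(candidates, lam1=0.7, lam2=0.3):
    """Return the highest-ranked candidate under the given weights."""
    return max(candidates, key=lambda c: final_score(c, lam1, lam2))

# Illustrative component scores (assumed, not measured).
aligned = Candidate("I found 3 restaurants that serve Chinese food.", 0.75, 4 / 7, 0.0)
varied = Candidate("I found 3 Chinese restaurants.", 0.6, 0.0, 1.0)

prefer_alignment = best([aligned, varied], lam1=0.7)
prefer_variation = best([aligned, varied], lam1=0.3)
```

With λ1 = 0.7 the well-aligned candidate wins even though it was used most recently; lowering λ1 to 0.3 lets the variation score dominate and selects the other paraphrase.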
3 User study
Each of 20 subjects in a restaurant selection task
was given 9 scenario descriptions involving 3 constraints.
We use a back-end database of 2500 restaurants,
with 13 attributes/constraints recorded for each
restaurant.
On average, the generator produced 16 output
candidates for inputs of two constraints, 160 can-
didates for typical inputs of 3 constraints and 320
candidates for 4 constraints. For larger constraint
sets, we currently reduce the level of overgenera-
tion but in the future intend to interleave overgen-
eration with ranking similar to (Varges, 2002).
Task completion in the experiments was high:
the subjects met all target constraints in 170 out of
180 tasks, i.e. completion rate was 94.44%. To
the question “The responses of the system were
appropriate, helpful, and clear.” (on a scale where
1 = ‘strongly agree’, 5 = ‘strongly disagree’), the
subjects gave the following ratings: 1: 7, 2: 9, 3:
2, 4: 2 and 5: 0, i.e. the mean user rating is 1.95.
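Both summary figures follow directly from the reported counts (20 subjects with 9 tasks each give the 180 tasks):

```latex
\[
\text{completion rate} = \frac{170}{180} \approx 94.44\%,
\qquad
\text{mean rating} = \frac{1 \cdot 7 + 2 \cdot 9 + 3 \cdot 2 + 4 \cdot 2 + 5 \cdot 0}{7 + 9 + 2 + 2 + 0}
= \frac{39}{20} = 1.95
\]
```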
4 Discussion & Conclusions
Where NLG affects the dialogue system: Dis-
course entities introduced by NLG add items to the
system’s salience list as an equal partner to NLU.
Robustness: due to imperfect ASR and NLU,
we relax completeness requirements when doing
overgeneration, and reason about the generation
input by adding defaults for missing constraints,
checking ranges of attribute values etc. Moreover,
we fall back to a template generator if NLG
fails, so that the user at least receives some
feedback (s6 in Table 1).
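The robustness measures just described might look roughly as follows. Everything specific here is an assumption for illustration: the default values, the attribute ranges, the choice to clamp out-of-range values, and the `chart_generate` callable standing in for the chart generator. Only the fallback wording is modelled on s6 in Table 1.

```python
# Hypothetical defaults and value ranges; the real ones are not given.
DEFAULTS = {"Domain": "restaurant"}
RANGES = {"Rating": (1, 5), "Distance": (0, 50)}

def repair_input(constraints):
    """Add defaults for missing constraints and clamp out-of-range values."""
    repaired = {**DEFAULTS, **constraints}
    for attr, (low, high) in RANGES.items():
        if attr in repaired:
            repaired[attr] = min(max(repaired[attr], low), high)
    return repaired

def generate_with_fallback(constraints, num_results, chart_generate):
    """Use the chart generator if it yields candidates, else a fixed template."""
    try:
        candidates = chart_generate(repair_input(constraints), num_results)
    except Exception:
        candidates = []
    if candidates:
        return candidates[0]
    # Template fall-back: minimal feedback so the user is never left without a reply.
    return f"I found {num_results} items."

def failing_generator(constraints, num_results):
    """Stand-in for a chart generator that cannot handle its input."""
    raise ValueError("no complete edge found")

reply = generate_with_fallback({"Rating": 7}, 18, failing_generator)
```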
What-to-say vs how-to-say-it: the classic sep-
aration of NLG into separate modules also holds
in our dialogue system, albeit with some mod-
ifications: ‘content determination’ is ultimately
performed by the user and the constraint opti-
mizer. The presentation dialogue moves do micro-
planning, for example by deciding to present re-
trieved database items either as examples (s4 in
Table 1) or as part of a larger answer list of items.
The chart generator performs realization.
In sum, flexible and expressive NLG is cru-
cial for the robustness of the entire speech-based
dialogue system by verbalizing what the system
understood and what actions it performed as a
consequence of this understanding. We find that
overgeneration and ranking techniques allow us to
model alignment and variation even in situations
where no corpus data is available by using the dis-
course history as a ‘corpus’.
Acknowledgments This work is supported by the
US government’s NIST Advanced Technology Program.
Collaborating partners are CSLI, Robert Bosch Corporation,
VW America, and SRI International. We thank the many
people involved in this project, in particular Fuliang Weng
and Heather Pon-Barry for developing the content optimiza-
tion module; Annie Lien, Badri Raghunathan, Brian Lathrop,
Fuliang Weng, Heather Pon-Barry, Jeff Russell, and Tobias
Scheideck for performing the evaluations and compiling the
results; Matthew Purver and Florin Ratiu for work on the
CSLI dialogue manager. The content optimizer, knowledge
manager, and the NLU module have been developed by the
Bosch Research and Technology Center.

References
Carsten Brockmann, Amy Isard, Jon Oberlander, and
Michael White. 2005. Modelling alignment for af-
fective dialogue. In Proc. of the UM’05 Workshop
on Adapting the Interaction Style to Affective Fac-
tors.
Robert Dale and Ehud Reiter. 1995. Computational
Interpretations of the Gricean Maxims in the Gener-
ation of Referring Expressions. Cognitive Science,
19:233–263.
Martin Kay. 1996. Chart Generation. In Proceedings
of ACL-96, pages 200–204.
Irene Langkilde. 2002. An Empirical Verification
of Coverage and Correctness for a General-Purpose
Sentence Generator. In Proc. of INLG-02.
Danilo Mirkovic and Lawrence Cavedon. 2005. Prac-
tical Plug-and-Play Dialogue Management. In Pro-
ceedings of the 6th Meeting of the Pacific Associa-
tion for Computational Linguistics (PACLING).
Sebastian Varges. 2002. Fluency and Completeness
in Instance-based Natural Language Generation. In
Proc. of COLING-02.
Sebastian Varges. 2005. Chart generation using pro-
duction systems (short paper). In Proc. of 10th Eu-
ropean Workshop On Natural Language Generation.
Fuliang Weng, L. Cavedon, B. Raghunathan,
D. Mirkovic, H. Cheng, H. Schmidt, H. Bratt,
R. Mishra, S. Peters, L. Zhao, S. Upson, E. Shriberg,
and C. Bergmann. 2004. Developing a conversa-
tional dialogue system for cognitively overloaded
users. In Proceedings of the International Congress
on Intelligent Transportation Systems (ICSLP).
