On the use of confidence for statistical decision in dialogue strategies
Christian Raymond1 Fr´ed´eric B´echet1 Renato de Mori1 G´eraldine Damnati2
1 LIA/CNRS, University of Avignon France 2 France Telecom R&D, Lannion, France
christian.raymond,frederic.bechet,renato.demori@lia.univ-avignon.fr
geraldine.damnati@rd.francetelecom.com
Abstract
This paper describes an interpretation and deci-
sion strategy that minimizes interpretation er-
rors and perform dialogue actions which may
not depend on the hypothesized concepts only,
but also on confidence of what has been rec-
ognized. The concepts introduced here are ap-
plied in a system which integrates language
and interpretation models into Stochastic Finite
State Transducers (SFST). Furthermore, acous-
tic, linguistic and semantic confidence mea-
sures on the hypothesized word sequences are
made available to the dialogue strategy. By
evaluating predicates related to these confi-
dence measures, a decision tree automatically
learn a decision strategy for rescoring a n-best
list of candidates representing a user’s utter-
ance. The different actions that can be then per-
formed are chosen according to the confidence
scores given by the tree.
1 Introduction
There is a wide consensus in the scientific community
that human-computer dialogue systems based on spoken
natural language make mistakes because the Automatic
Speech Recognition (ASR) component may not hypothe-
size some of the pronounced words and the various levels
of knowledge used for recognizing and reasoning about
conceptual entities are imprecise and incomplete. In spite
of these problems, it is possible to make useful applica-
tions with dialogue systems using spoken input if suit-
able interpretation and decision strategies are conceived
that minimize interpretation errors and perform dialogue
actions which may not depend on the hypothesized con-
cepts only, but also on confidence of what has been rec-
ognized.
This paper introduces some concepts developed for
telephone applications in the framework of stochastic
models for interpretation and dialogue strategies, a good
overview of which can be found in (Young, 2002).
The concepts introduced here are applied in a system
which integrates language and interpretation models into
Stochastic Finite State Transducers (SFST). Furthermore,
acoustic, linguistic and semantic confidence measures on
the hypothesized word sequences are made available to
the dialogue strategy. A new way of using them in the
dialogue decision process is proposed in this paper.
Most of the Spoken language Understanding Systems
(SLU) use semantic grammars with semantic tags as non-
terminals (He and Young, 2003) with rules for rewriting
them into strings of words.
The SFSTs of the system used for the experiments de-
scribed here, represent knowledge for the basic building
blocks of a frame-based semantic grammar. Each block
represents a property/value relation. Different SFSTs
may share words in the same sentence. Property/value
hypotheses are generated with an approach described in
(Raymond et al., 2003) and are combined into a sentence
interpretation hypothesis in which the same word may
contribute to more than one property/value pair. The di-
alogue strategy has to evaluate the probability that each
component of each pair has been correctly hypothesized
in order to decide to perform an action that minimizes the
risk of user dissatisfaction.
2 Overview of the decoding process
The starting point for decoding is a lattice of word
hypotheses generated with an n-gram language model
(LM). Decoding is a search process which detects com-
binations of specialized SFSTs and the n-gram LM. The
output of the decoding process consists of a n-best list
of conceptual interpretations  . An interpretation  is
a set of property/value pairs sj = (cj; vj) called con-
cepts. cj is the concept tag and vj is the concept value
of sj. Each concept tag cj is represented by a SFST and
can be related either to the dialogue application (phone
number, date, location expression, etc.) or to the dia-
logue management (confirmation, contestation, etc.). To
each string of words recognized by a given SFST cj is
associated a value vj representing a normalized value
for the concept detected. For example, to the word
phrase: on July the fourteenth, detected by a
SFST dedicated to process dates, is associated the value:
????/07/14.
The n-best list of interpretations output by the decod-
ing process is structured according to the different con-
cept tag strings that can be found in the word lattice. To
each concept tag string is attached another n-best list on
the concept values. This whole n-best is called a struc-
tured n-best. After presenting the statistical model used
in this study, we will describe the implementation of this
decoding process.
3 Statistical model
The contribution of a sequence of words W to a concep-
tual structure  is evaluated by the posterior probability
P( j Y ), where Y is the description of acoustic features.
Such a probability is computed as follows:
P( j Y ) =
P
W2SW P(Y j W)P( j W)
 P(W) 
P
W2SW P(Y j W)P(W)
 (1)
where P(Y j W) is provided by the acoustic models,
P(W) is computed with the LM. Exponents  and  are
respectively a semantic and a syntactic fudge factor. SW
corresponds to the set of word strings that can be found in
the word lattice. P( j W) is computed by considering
that thus:
P( j W) = P(s1 j W):
JY
j=2
P(sj j sj 11 W) (2)
P(sj j sj 11 W)  P(sj j W)
If the conceptual component sj is hypothesized with
a sentence pattern  j(W) recognized in W and  k(W)
triggers a pair sk and there is a training set with which
the probabilities P( k(W) j sk) 8k, can be estimated,
then the posterior probability can be obtained as follows:
P(sj j W) = P( j(W) j sj)P(sj)PK
k=1 P( k(W) j sk)P(sk)
(3)
where P(sk) is a unigram probability of conceptual
components.
4 Structured N-best list
N-best lists are generally produced by simply enumerat-
ing the n best paths in the word graphs produced by Au-
tomatic Speech Recognition (ASR) engines. The scores
used in such graphs are usually only a combination of
acoustic and language model scores, and no other linguis-
tic levels are involved. When an n-best word hypothesis
list is generated, the differences between the hypothesis
i and the hypothesis i+1 are often very small, made of
only one or a few words. This phenomenon is aggravated
when the ASR word graph contains a low confidence
area, due for example to an Out-Of-Vocabulary word, to
a noisy input or to a speech disfluency.
This is the main weakness of this approach in a Spoken
Dialogue context: not all words are important to the Di-
alogue Manager, and all the n-best word hypotheses that
differ only between each other because of some speech
disfluency effects can be considered as equals. That’s
why it is important to generate not only a n-best list of
word hypotheses but rather a n-best list of interpretations,
each of them corresponding to a different meaning from
the Dialogue Manager point of view.
We propose here a method for directly extracting such
a structured n-best from a word lattice output by an ASR
engine. This method relies on operations between Finite
State Machines and is implemented thanks to the AT&T
FSM toolkit (see (Mohri et al., 2002) for more details).
4.1 Word-to-Concept transducer
Each concept ck of the dialogue application is associated
with an FSM. These FMSs are called acceptors (Ak for
the concept ck). In order to process strings of words that
don’t belong to any concept, a filler model, called AF
is used. Because the same string of words can’t belong
to both a concept model and the background text, all the
paths contained in the acceptors Ak are removed from the
filler model AF in the following way:
AF =    
m[
k=1
Ak
where  is the word lexicon of the application and m
is the number of concepts used.
All these acceptors are now turned into transducers
that take words as input symbols and start or end con-
cept tags as output symbols. Indeed, all acceptors Ak be-
come transducers Tk where the first transition emits the
symbol <Ck> and the last transition the symbol </Ck>.
Similarly the filler model becomes the transducer Tbk
which emits the symbols <BCK> and </BCK>. Except
these start and end tags, no other symbols are emitted: all
words in the concept or background transducers emit an
empty symbol.
Finally all these transducers are linked together in a
single model called Tconcept as presented in figure 1.
FILLER
out=<BCK>
in=<start>
out=<>
in=<> in=<>out=</BCK>
in=<end>
out=<>
out=<>
in=<start>
out=<>in=w1
in=w2out=<>
in=<end>
out=<>
in=<>out=</C1>out=<C1>in=<>
out=<C2>
in=<>
out=</C2>
in=<>
in=<>
out=<Cn>
in=<>
out=</Cn>
in=<>
out=<>
in=<>
out=<>in=<>
out=<>
SFST C1
SFST Cn
SFST C2
Figure 1: Word-to-Concept Transducer
4.2 Processing the ASR word lattice
The ASR word lattice is coded by an FSM: an acceptor L
where each transition emits a word. The cost function for
a transition corresponds to the acoustic score of the word
emitted.
The first step in the word lattice processing consists
of rescoring each transition of L by means of a 3-gram
Language Model (LM) in order to obtain the probabili-
ties P(W) of equation 1. This is done by composing the
word lattice with a 3-gram LM also coded as an FSM (see
(Allauzen et al., 2003) for more details about statistical
LMs and FSMs).
The resulting FSM is then composed with the trans-
ducer TConcept in order to obtain the word-to-concept
transducer L0. A path in L0 corresponds to a word string
if only the input symbols of the transducer are considered
and its score is the one expressed by equation 1; simi-
larly by considering only the output symbols, a path in L0
corresponds to a concept tag string.
The structured n-best list is directly obtained from L0:
by extracting the n-best concept tag strings (output label
paths) we obtain an n-best list on the conceptual interpre-
tations. The score of each conceptual interpretation is the
sum of all the word strings (input label paths) in the word
lattice producing the same interpretation.
Finally, for every conceptual interpretations C kept at
the previous step, a local n-best list on the word strings is
calculated by selecting in L0 the best paths outputting the
string C.
The resulting structured n-best is illustrated by the fol-
lowing example. If we keep the 2 best conceptual in-
terpretations C1, C2 of a transducer L0 and, for each of
these, the 2 best word strings, we obtain:
1 : C1 = <c1_1,c1_2,..,c1_x>
1.1 : W1.1 = <v1.1_1,v1.1_2,..,v1.1_x>
1.2 : W1.2 = <v1.2_1,v1.2_2,..,v1.2_x>
2 : C2 = <c2_1,c2_2,..,c2_y>
2.1 : W2.1 = <v2.1_1,v2.1_2,..,v2.1_y>
2.2 : W2.2 = <v2.2_1,v2.2_2,..,v2.2_y>
where <ci_1,ci_2,..,ci_y> is the conceptual
interpretation at the rank i in the n-best list; Wi.j is the
word string ranked j of interpretation i; and vi.j_k
is the concept value of the kth concept ci_k of the jth
word string of interpretation i.
5 Use of correctness probabilities
In order to select a particular interpretation  (concep-
tual interpretation + concept values) from the structured
n-best list, we are now interested in computing the proba-
bility that  is correct, given a set of confidence measures
M: P( j M). The choice of the confidence measures
determines the quality of the decision strategy. Those
used in this study are briefly presented in the next sec-
tions.
5.1 Confidence measures
5.1.1 Acoustic confidence measure (AC)
This confidence measure relies on the comparison of
the acoustic likelihood provided by the speech recogni-
tion model for a given hypothesis to the one that would
be provided by a totally unconstrained phoneme loop
model. In order to be consistent with the general model,
the acoustic units are kept identical and the loop is over
context dependent phonemes. This confidence measure
is used at the utterance level and at the concept level (see
(Raymond et al., 2003) for more details).
5.1.2 Linguistic confidence measure (LC)
In order to assess the impact of the absence of ob-
served trigrams as a potential cause of recognition errors,
a Language Model consistency measure is introduced.
This measure is simply, for a given word string candi-
date, the ratio between the number of trigrams observed
in the training corpus of the Language Model vs. the total
number of trigrams in the same word string. Its computa-
tion is very fast and the confidence scores obtained from
it give interesting results as presented in (Est`eve et al.,
2003).
5.1.3 Semantic confidence measure (SC)
Several studies have shown that text classification tools
(like Support Vector Machines or Boosting algorithms)
can be an efficient way of labeling an utterance transcrip-
tion with a semantic label such as a call-type (Haffner et
al., 2003) in a Spoken Dialogue context. In our case, the
semantic labels attached to an utterance are the different
concepts handled by the Dialogue Manager. One classi-
fier is trained for each concept tag in the following way:
Each utterance of a training corpus is labeled with a
tag, manually checked, indicating if a given concept oc-
curs or not in the utterance. In order to let the classi-
fier model the context of occurrence of a concept rather
than its value we removed most of the concept headwords
from the list of criterion used by the classifier.
During the decision process, if the interpretation eval-
uated contains 2 concepts c1 and c2, then the classifiers
corresponding to c1 and c2 are used to give to the utter-
ance a confidence score of containing these two concepts.
The text classifier used in the experimental section
is a decision-tree classifier based on the Semantic-
Classification-Trees introduced for the ATIS task
by (Kuhn and Mori, 1995) and used for semantic disam-
biguation in (B´echet et al., 2000).
5.1.4 Rank confidence measure (R)
To the previous confidence measures we added the
rank of each candidate in its n-best. This rank contains
two numbers: the rank of the interpretation of the utter-
ance and the rank of the utterance among those having
the same interpretation.
5.2 Decision Tree based strategy
As the dependencies of these measures are difficult to es-
tablish, their values are transformed into symbols by vec-
tor quantization (VQ) and conjunctions of these symbols
expressing relevant statistical dependencies are obtained
by a decision tree which is trained with a development
set of examples. At the leaves probabilities P(Mj ) are
obtained when  represents any correct hypothesis, the
case in which only the properties have been correctly rec-
ognized or both properties and values have errors. With
these probabilities we are now able to estimate P( j M)
in the following way:
P( j M) = 11 + P(Mj 0)P( 0)
P(Mj )P( )
(4)
where  0 indicates that the interpretation in question is
incorrect and P(Mj 0) = 1  P(Mj ).
6 From hypotheses to actions
Once concepts have been hypothesized, a dialog system
has to decide what action to perform. Let A = aj be
the set of actions a system can perform. Some of them
can be requests for clarification or repetition. In partic-
ular, the system may request the repetition of the entire
utterance. Performing an action has a certain risk and the
decision about the action to perform has to be the one that
minimizes the risk of user dissatisfaction.
It is thus possible that some or all the hypothesized
components of a conceptual structure  do not corre-
spond to the user intention because the word sequence
W based on which the conceptual hypothesis has been
generated contains some errors. In particular, there are
requests for clarification or repetition which should be
performed right after the interpretation of an utterance in
order to reduce the stress of the user. It is important to
notice that actions consisting in requests for clarification
or repetition mostly depend on the probability that the in-
terpretation of an utterance is correct, rather than on the
utterance interpretation.
The decoding process described in section 2 provides
a number of hypotheses containing a variable number of
pairs sj = (cj; vj) based on the score expressed by equa-
tion 1.
P( j M) is then computed for these hypotheses. The
results can be used to decide to accept an interpretation
or to formulate a clarification question which may imply
more hypotheses.
For simplification purpose, we are going to consider
here only two actions: accepting the hypothesis with the
higher P( j M) or rejecting it. The risk associated to the
acceptation decision is called  fa and corresponds to the
cost of a false acceptation of an incorrect interpretation.
Similarly the risk associated to the rejection decision is
called  fr and corresponds to the cost of a false rejection
of a correct interpretation. In a spoken dialogue context,
 fa is supposed to be higher than  fr.
The choice of the action to perform is determined by
a threshold  on P( j M). This threshold is tuned on
a development corpus by minimizing the total risk R ex-
pressed as follows:
R =  fa  NfaN
total
+  fr  NfrN
total
(5)
Nfa and Nfr are the numbers of false acceptation and
false rejection decisions on the development corpus for a
given value of  . Ntotal is the total number of examples
available for tuning the strategy.
The final goal of the strategy is to make negligible Nfa
and the best set of confidence measures is the one that
minimizes Nfr. In fact, the cost of these cases is lower
because the corresponding action has to be a request for
repetition.
Instead of simply discarding an utterance if P( j M)
is below  , another strategy we are investigating consists
of estimating the probability that the conceptual interpre-
tation alone (without the concept values) is correct. This
probability can be estimated the same way as P( j M)
and can be used to choose a third kind of actions: accept-
ing the conceptual meaning of an utterance but asking for
clarifications about the values of the concepts.
A final decision about the strategy to be adopted should
be based on statistics on system performance to be col-
lected and updated after deploying the system on the tele-
phone network.
7 Experiments
7.1 Application domain
The application domain considered in this study is a
restaurant booking application developed at France Tele-
com R&D. At the moment, we only consider in our strat-
egy the most frequent concepts related to the application
domain: PLACE, PRICE and FOOD TYPE. They can be
described as follows:
 PLACE: an expression related to a restaurant loca-
tion (eg. a restaurant near Bastille);
 PRICE: the price range of a restaurant (eg. less than
a hundred euros);
 FOOD TYPE: the kind of food requested by the
caller (eg. an Indian restaurant).
These entities are expressed in the training corpus by
short sequences of words containing three kinds of to-
ken: head-words like Bastille, concept related words like
restaurant and modifier tokens like near.
A single value is associated to each concept entity
simply be adding together the head-words and some
modifier tokens. For example, the values associated to
the three contexts presented above are: Bastille ,
less+hundred+euros and indian.
In the results section a concept detected is considered a
success only if the tag exists in the reference corpus and if
both values are identical. It’s a binary decision process:
a concept can be considered as a false detection even if
the concept tag is correct and if the value is partially cor-
rect. The measure on the errors (insertion, substitution,
deletion) of these concept/value tokens is called in this
paper the Understanding Error Rate, by opposition to the
standard Word Error Rate measure where all words are
considered equals.
7.2 Experimental setup
Experiments were carried out on a dialogue corpus pro-
vided by France Telecom R&D. The task has a vocabu-
lary of 2200 words. The language model used is made
of 44K words. For this study we selected utterances cor-
responding to answers to a prompt asking for the kind
of restaurant the users were looking for. This corpus has
been cut in two: a development corpus containing 511
utterances and a test corpus containing 419 utterances.
This development corpus has been used to train the deci-
sion tree presented in section 5.2. The Word Error Rate
on the test corpus is 22.7%.
7.3 Evaluation of the rescoring strategy
Table 1 shows the results obtained with a rescoring strat-
egy that selects, from the structured n-best list, the hy-
pothesis with the highest P( j M). The baseline re-
sults are obtained with a standard maximum-likelihood
approach choosing the hypothesis maximizing the proba-
bility P( j Y ) of equation 1. No rejection is performed
in this experiment.
The size of the n-best lists was set to 12 items: the first
4 candidates of the first 3 interpretations in the structured
n-best list. The gain obtained after rescoring is very sig-
nificant and justify our 2-step approach that first extract
an n-best list of interpretations thanks to P( j Y ) and
then choose the one with the highest confidence accord-
ing to a large set of confidence measures M. This gain
can be compared to the one obtained on the Word Error
Rate measure: the WER drops from 21.6% to 20.7% af-
ter rescoring on the development corpus and from 22.7%
to 22.5% on the test corpus. It is clear here that the
WER measure is not an adequate measure in a Spoken
Dialogue context as a big reduction in the Understanding
Error Rate might have very little effect on the Word Error
Rate.
Corpus baseline rescoring UER reduction %
Devt. 15.0 12.4 17.3%
Test 17.7 14.5 18%
Table 1: Understanding Error Rate results with and with-
out rescoring on structured n-best lists (n=12) (no rejec-
tion)
7.4 Evaluation of the decision strategy
In this experiment we evaluate the decision strategy con-
sisting of accepting or rejecting an hypothesis  thanks to
a threshold on the probability P( j M). Figure 2 shows
the curve UER vs. utterance rejection on the development
and test corpora. As we can see very significant improve-
ments can be achieved with very little utterance rejection.
For example, at a 5% utterance rejection operating point,
the UER on the development corpus drops from 15.0% to
8.6% (42.6% relative improvement) and from 17.7% to
11.4% (35.6% relative improvement).
By using equation 5 for finding the operating point
minimizing the risk fonction (with a cost  fa = 1:5  
 fr) on the development corpus we obtain:
 on the development corpus: UER=6.5 utterance re-
jection=13.1
 on the test corpus: UER=9.6 utterance rejec-
tion=15.9
4
6
8
10
12
14
16
18
0 5 10 15 20
understanding error rate
utterance rejection (%)
devt
test
Figure 2: Understanding Error Rate vs. utterance rejec-
tion on the development and test corpora
8 Conclusion
This paper describes an interpretation and decision strat-
egy that minimizes interpretation errors and perform dia-
logue actions which may not depend on the hypothesized
concepts only, but also on confidence of what has been
recognized. The first step in the process consists of gen-
erating a structured n-best list of conceptual interpreta-
tions of an utterance. A set of confidence measures is
then used in order to rescore the n-best list thanks to a de-
cision tree approach. Significant gains in Understanding
Error Rate are achieved with this rescoring method (18%
relative improvement). The confidence score given by the
tree can also be used in a decision strategy about the ac-
tion to perform. By using this score, significant improve-
ments in UER can be achieved with very little utterance
rejection. For example, at a 5% utterance rejection op-
erating point, the UER on the development corpus drops
from 15.0% to 8.6% (42.6% relative improvement) and
from 17.7% to 11.4% (35.6% relative improvement). Fi-
nally the operating point for a deployed dialogue system
can be chosen by explicitly minimizing a risk function on
a development corpus.

References
Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003.
Generalized algorithms for constructing statistical lan-
guage models. In 41st Annual Meeting of the Associa-
tion for Computational Linguistics (ACL’03), Sapporo,
Japan.
Fr´ed´eric B´echet, Alexis Nasr, and Franck Genet. 2000.
Tagging unknown proper names using decision trees.
In 38th Annual Meeting of the Association for Compu-
tational Linguistics, Hong-Kong, China, pages 77–84.
Yannick Est`eve, Christian Raymond, Renato De Mori,
and David Janiszek. 2003. On the use of linguistic
consistency in systems for human-computer dialogs.
IEEE Transactions on Speech and Audio Processing,
(Accepted for publication, in press).
Patrick Haffner, Gokhan Tur, and Jerry Wright. 2003.
Optimizing SVMs for complex call classification. In
IEEE International Conference on Acoustics, Speech
and Signal Processing, ICASSP’03, Hong-Kong.
Y. He and S. Young. 2003. A data-driven spoken lan-
guage understanding system. In Automatic Speech
Recognition and Understanding workshop - ASRU’03,
St. Thomas, US-Virgin Islands.
R. Kuhn and R. De Mori. 1995. The application of se-
mantic classification trees to natural language under-
standing. IEEE Trans. on Pattern Analysis and Ma-
chine Intelligence, 17(449-460).
Mehryar Mohri, Fernando Pereira, and Michael Ri-
ley. 2002. Weighted finite-state transducers in
speech recognition. Computer, Speech and Language,
16(1):69–88.
Christian Raymond, Yannick Est`eve, Fr´ed´eric B´echet,
Renato De Mori, and G´eraldine Damnati. 2003. Belief
confirmation in spoken dialogue systems using confi-
dence measures. In Automatic Speech Recognition and
Understanding workshop - ASRU’03, St. Thomas, US-
Virgin Islands.
Steve Young. 2002. Talking to machines (statisti-
cally speaking). In International Conference on Spo-
ken Language Processing, ICSLP’02, pages 113–120,
Denver, CO.
