Mining Spoken Dialogue Corpora for System Evaluation and
Modeling
Frederic Bechet
LIA-CNRS
University of Avignon, France
frederic.bechet@
lia.univ-avignon.fr
Giuseppe Riccardi
AT&T Labs
Florham Park, NJ, USA
dsp3@
research.att.com
Dilek Hakkani-Tur
AT&T Labs
Florham Park, NJ, USA
dtur@
research.att.com
Abstract
We are interested in the problem of modeling
and evaluating spoken language systems in the
context of human-machine dialogs. Spoken di-
alog corpora allow for a multidimensional anal-
ysis of speech recognition and language under-
standing models of dialog systems. Therefore
language models can be directly trained based
either on the dialog history or its equivalence
class (or cluster). In this paper we propose an
algorithm to mine dialog traces which exhibit
similar patterns and are identi ed by the same
class. For this purpose we apply data clustering
methods to large human-machine spoken dia-
logue corpora. The resulting clusters can be
used for system evaluation and language mod-
eling. By clustering dialog traces we expect to
learn about the behavior of the system with re-
gards to not only the automation rate but the
nature of the interaction (e.g. easy vs di cult
dialogs). The equivalence classes can also be
used in order to automatically adapt the lan-
guage model, the understanding module and the
dialogue strategy to better  t the kind of in-
teraction detected. This paper investigates dif-
ferent ways for encoding dialogues into multi-
dimensional structures and di erent clustering
methods. Preliminary results are given for clus-
ter interpretation and dynamic model adapta-
tion using the clusters obtained.
1 Introduction
The deployment of large scale automatic spoken
dialog systems, like How May I Help You?SM
(HMIHY) (Gorin et al., 1997), makes avail-
able large corpora of real human-machine di-
alog interactions. Traditionally, this data is
used for supervised system evaluation. For in-
stance, in (Kamm et al., 1999) they propose a
static analysis aimed at measuring the perfor-
mance of a dialog system, especially in an at-
tempt to automatically estimate user satisfac-
tion. In (Hastie et al., 2002), a dynamic strat-
egy in the error handling process is proposed. In
all these studies, supervised learning techniques
are used in order to classify dialogs to predict
user satisfaction or dialog failures.
A novel approach to the exploitation of dialog
corpora is for speech recognition and language
understanding modeling. In fact, such corpora
allow for a multidimensional analysis of speech
and language models of dialog systems. Our
work di ers from previous studies in the algo-
rithmic approach and learning scenario.
First we use unsupervised speech mining
techniques. We apply data clustering methods
to large spoken dialog corpora. Two kinds of
clustering methods are used: a hierarchical one
based on decision trees and the optimization of
a statistical criterion; a partitional one based
on a k-means algorithm applied to vectors rep-
resenting the dialogs. We interpret the clusters
obtained and de ne a label for each of them.
Second we perform our analyses on large cor-
pora of real dialogs collected from deployed sys-
tems. These log  les contain a trace of the inter-
action between the users and a particular sys-
tem at a certain point in time. Our goal is to
highlight the structures behind these traces.
Lastly, we investigate several ways of encod-
ing a dialog into a multidimensional structure in
order to apply data clustering methods. Infor-
mation about the system channel and the user
channel are discussed and two ways of encoding
are proposed, one for hierarchical clustering and
the other for partitional clustering.
The clusters obtained can be used to learn
about the behavior of the system with regards
to the automation rate and the type of inter-
action (e.g. easy vs di cult dialog). More-
over, the clusters can be used on-the- y to auto-
matically adapt the language model, the under-
standing module and even the dialog strategy
to better  t the kind of interaction detected.
In this study, we present two levels of clus-
tering: clustering at the utterance level and the
 System: How may I help you?
 User: Hello?
 Call-type: Hello
 System: Hello, how may I help you?
 User: I have a question.
 Call-type: Ask(Info)
 System: OK, What is your question?
 User: I would like to know my account
balance.
 Call-type: Request(Account Balance)
 System: I can help you with that. What
is your account number?
 User: ...
Figure 1: An example dialog from our corpus.
dialog level. Clustering at the utterance is for
modeling the language people use in a speci c
dialog context; clustering at the dialog level al-
lows us to characterize the whole interaction be-
tween the users and a system. In the next sec-
tion we describe the corpora data structure. In
section 3 we describe the clustering algorithms.
In sections 4 and 5 we report on experiments
and results for utterance-based and dialog clus-
tering, respectively.
2 Dialog corpora
The corpora is collected from an automatic call
routing system where the aim is to transfer the
user to the right route in a large call center.
An example dialog from a customer care ap-
plication is given in Figure 1. After the greet-
ing prompt, the speaker’s response is recognized
using an automatic speech recognizer (ASR).
Then, the intent of the speaker is identi ed from
the recognized sequence, using a spoken lan-
guage understanding (SLU) component. This
step can be framed as a classi cation problem,
where the aim is to classify the intent of the user
into one of the prede ned call-types (Gorin et
al., 1997). Then, the user would be engaged in a
dialog via clari cation or con rmation prompts
until a  nal route is determined.
3 Hierarchical and Partitional
clustering
The goal of clustering is to reduce the amount
of data by categorizing or grouping similar data
items together. Clustering methods can be di-
vided into two basic types: hierarchical and par-
titional clustering. A lot of di erent algorithms
have been proposed for both types of clustering
methods in order to split a data set into clusters.
Hierarchical clustering proceeds successively
by either merging smaller clusters into larger
ones, or by splitting larger clusters. The result
of the algorithm is a tree of clusters. By cutting
the tree at a desired level a clustering of the
data items into disjoint groups is obtained. We
use in this study a decision-tree based clustering
method.
Partitional clustering, on the other hand, at-
tempts to directly decompose the data set into
a set of disjoint clusters. A criterion function is
used in order to estimate the distance between
the samples of the di erent clusters. By mini-
mizing this function between the samples of the
same clusters and maximizing it among the dif-
ferent clusters, the clustering algorithm itera-
tively  nds the best cluster distribution accord-
ing to the criteria used. We use in this study a
k-means algorithm applied to vectors encoding
the dialogs. The number of clusters is  xed in
advance.
4 Clustering at the utterance level
Performing clustering at the utterance level in a
dialog corpus aims to capture di erent kinds of
language that people would use in a speci c di-
alog context. This is a way of grouping together
turns of dialogs belonging to completely di er-
ent requests but sharing some common proper-
ties according to their dialog contexts.
The immediate application of such a study
can be the training of specialized Language
Models (LMs) that can be used in replacement
of a generic one once a speci c situation is de-
tected.
4.1 Decision-tree based clustering
In order to obtain utterance clusters from which
we can build LMs we use a decision tree method
based on an optimization criterion that has a
direct in uence on the recognition process: the
perplexity measure of the Language Model on
the manually transcribed training corpus. We
decide to use this criterion because even if there
is no evidence that a gain in perplexity results
in a Word Error Rate (WER) reduction, these
two quantities are generally related.
The clustering algorithm chosen is based on a
decision-tree approach inspired by the Semantic
Classi cation Tree method proposed in (Kuhn
and Mori, 1995) and already used for corpus
clustering in (Esteve et al., 2001) and (Bechet
et al., 2003). One originality of this kind of deci-
sion tree is the dynamic generation of questions
during the growing process of a tree.
4.2 Decision tree features
Each turn of the spoken dialog corpus used for
the clustering process is represented by a mul-
tidimensional structure. Two kinds of channel
can be considered in order to de ne the features:
the system channel (which contains all the infor-
mation managed by the system like the prompts
or the states of the dialog) and the user chan-
nel (which contain the utterances uttered by the
user with all their characteristics: length, vo-
cabulary, perplexity, semantic calltypes, etc.).
Because the clusters obtained are going to be
used dynamically by training speci c LMs on
each of them, we used mostly system channel
features in these experiments. On the HMIHY
corpus we used the following features:
 prompt text: this is the word string ut-
tered by the system before each user’s ut-
terance;
 prompt category: prompt category ac-
cording to the kind of information re-
quested (conf if the prompt asks for a con-
 rmation, numerical if the information re-
quested is a numerical value like a phone
number, other in all the other cases);
 dialog state: a label given by the Dialog
Manager characterizing the current state of
the dialog;
 dialog history: the string of dialog state
labels given by the Dialog Manager during
the previous turns of the same dialog
 utterance lengths: the utterance lengths
are estimated on the transcriptions (man-
ual or automatic) and represented by a set
of discrete symbols l0 for less than 5 words,
l1 between 5 and 10 words, l2 between 10
and 15 and l3 for more than 15 words);
The only feature that does not belong to the
system channel is the utterance lengths. This
feature is part of the user channel but it can
be estimated rather accurately from the word
graph produced during the speech recognition
process.
4.3 Results on the hierarchical
clustering
This experiment was made on the HMIHY cor-
pus. The training corpus used to grow the
clustering-tree comprises about 102k utterances
from live customer tra c. The test corpus was
made of 7k utterances.
After the clustering process we obtained the
6 clusters represented in table 1.
The size of each cluster is calculated accord-
ing to the number of words of all the utterances
belonging to it. This number is expressed as
a percentage of the total number of words of
the training corpus (column % words of table
1). One can notice that the cluster sizes are
not homogeneous. Indeed more than 70% of
the words of the training corpus are in the same
cluster. This result is not surprising: indeed the
open ended prompts like How may I help you ?
represent a very large chunk of all the possible
prompts and moreover most of the answers to
these prompts are quite long with more than 15
words. It is therefore very di cult to split a
chunk where all the utterances share the same
characteristics.
It is interesting to see the features considered
relevant in the tree splitting of the training cor-
pus. These 6 clusters contain the following type
of utterances:
 C1: answers to a prompt asking for a
phone number and containing between 10
and 15 words;
 C2: answers to the con rmation prompt
Are you phoning from your home phone ?
containing between 5 and 10 words;
 C3: answers to the same con rmation
prompt containing less than 5 words;
 C4: answers to other prompts and contain-
ing between 5 and 10 words;
 C5: answers to other prompts and contain-
ing between 10 and 15 words;
 C6: answers to other prompts and contain-
ing more than 15 words;
As we can see, 3 kinds of interaction are dis-
tinguished: request for a phone number, re-
quest for con rmation and other. These interac-
tions correspond to the di erent types of system
prompts de ned in section 4.2.
It is interesting to notice that  rst it is al-
ways a speci c prompt and not the prompt cat-
egory numeric, conf or other which is chosen by
the tree, and second that an utterance length is
systematically attached to each prompt. This
means that these prompts (which are very fre-
quent) have their own behaviors independently
from the other prompts part of the same prompt
category.
Utterance lengths are very strong and reli-
able indicators for characterizing an answer to a
given prompt. Cluster 1 contains mostly phone
numbers, cluster 2 contains con rmation an-
swers with explanation (mostly negative), and
cluster 3 contains con rmation answers without
explanation (mostly positive). We observe that
no dialog state label or dialog history label was
chosen as feature by the tree in the clustering
process. One possible explanation is the lim-
ited length of the dialogs in the HMIHY corpus
(4 turns on average). Therefore the dialog con-
text and history are negligible compared to the
system prompt alone.
Perplexity WER %
C % words 1-pass 2-pass 1-pass 2-pass
1 1.8 18.6 13.9 11.3 11.1
2 1.3 5.0 3.2 14.5 12.5
3 1.2 3.2 1.5 4.4 2.5
4 4.7 11 7.4 19.2 18
5 13.8 11.3 9.5 19.7 18.8
6 73.9 38.4 27.4 30.8 29.8
Table 1: Results for each cluster obtained with
the hierarchical clustering method, at the utter-
ance level, on the HMIHY corpus
By using these clusters for training speci c
LMs and by dynamically choosing a speci c
LM according to the dialog context for perform-
ing LM rescoring, we obtain the perplexity and
Word-Error-Rate (WER) results of table 1 on
the HMIHY test corpus. Signi cant perplex-
ity improvement can be seen for all the clus-
ters between the  rst and the second pass. On
the whole test corpus, the perplexity drops from
25.3 to 18.5, so a relative improvement of 26.8%.
On the speech recognition side, even if the de-
crease in WER is not as signi cant, we obtain
a gain for all of them.
4.4 K-means clustering
In order to split the utterances into clusters
more equally, we decided to use another cluster-
ing method based on a partitional k-means al-
gorithm. In these experiments we do not try to
explicitly optimize a speci c criterion, like the
perplexity, but we just want to group together
utterances sharing common properties and put
utterances that are very dissimilar into di er-
ent clusters. Perplexity measures obtained with
LMs trained on such clusters will tell us if this
method splits the utterances according to the
language used. The  rst step in this process is
to encode the dialogs into vectors, one vector
for each dialog.
4.5 Representing utterances as feature
vectors
The decision-tree method used symbolic fea-
tures in order to generate the questions that
split the corpus. The k-means clustering
method is purely numerical therefore we need
here to encode dialog turns into numerical fea-
ture vectors. According to what we learned
from the previous experiments we decide to put
in the vectors only the semantic calltype labels.
Indeed as 70% of the utterances share the same
prompt and the same utterance length category,
it did not seem relevant to us to put these fea-
tures in the vectors as we are looking for clusters
that are quite balanced in size. The calltype la-
bels are relevant as we are interested here in
clusters that can be semantically di erent.
Therefore, a feature vector representing an
utterance is a  rst order statistic on the call-
types. A component value is the number of oc-
currences of the corresponding calltype within
the utterance. We kept the 34 most frequent
calltype labels in the training corpus in order
to build the vectors. They cover over 96% of
the calltype occurrences on the HMIHY train-
ing corpus.
4.6 Results on the partitional
clustering
This experiment is made on the same applica-
tion than the last one (HMIHY) but on another
data set. The training corpus contains 10k di-
alogs and 35.5k utterances and the test corpus
contains 1.4k dialogs and 5k utterances. The
number of clusters is set to 5. This value has
the advantages of both providing a good con-
vergence to the k-means process and splitting
rather equally the training corpus.
Table 2 illustrates the partitional clustering
of utterances on this corpus. As we can see,
the distribution of the total amount of words
of the corpus among the clusters is much more
even than the one obtained with the hierarchi-
cal clustering. The largest cluster contains only
38.5% of the words compared to 73.9% previ-
ously. To check if these clusters can be useful
for training speci c LMs, we  rst split the test
corpus according to the clustering model esti-
mated on the training data. We use here, to
build the vectors encoding the test corpus, au-
tomatic calltypes calculated by the Spoken Lan-
guage Understanding Module. Then we com-
pare the perplexity measures obtained with a
general LM trained on the whole training cor-
pus and the one obtained with a speci c LM
adapted on the corresponding cluster. Table 2
shows these results: the gain in perplexity ob-
tained is smaller than the one obtained with the
other method. Indeed the total perplexity on
the test corpus is 21.3 and the one obtained with
the speci c LMs is 17.8, so a gain of 16% com-
pared to the 26.8% obtained previously. How-
ever this result is not surprising as the previous
method was designed for speci cally decreasing
the perplexity measure.
Perplexity
C % words utt. length 1-pass 2-pass
1 8.6 5.5 16.9 12.6
2 20.3 17.8 25.5 21.0
3 14.5 11.6 21.6 18.2
4 38.5 7.5 19.6 17.2
5 18.1 8.6 22.8 18.6
Table 2: Results for each cluster obtained with
the partitional clustering method, at the utter-
ance level, on the HMIHY corpus
5 Clustering at the dialog level
As we have seen in the previous section, speci c
dialog situations (like those obtained with the
hierarchical clustering) proved to be more e -
cient than the semantic channel (represented by
the calltype labels) for clustering utterances in
relation with the language used, at least from
the perplexity point of view.
However, for clustering dialogs rather than
utterances, the semantic channel is the main
channel that we are going to use because in this
case we want to characterize the whole interac-
tion between a system and a user rather than
just the language used.
For example, an utterance like I want to pay
my bill can be used in very di erent dialog con-
texts: in a standard interaction if this request
is expressed just after the opening prompt or at
a di erent stage of the dialog. To capture this
kind of phenomena, we have to cluster at the
dialog level rather than the utterance level.
After describing how we encode dialogs into
feature vectors in the next section, we present
in section 5.2 some preliminary work on the in-
terpretation of the clusters obtained.
5.1 Representing dialogs as feature
vectors
We use here the partitional clustering method.
Each dialog is represented by a vector contain-
ing three kinds of information representing the
interaction:
1. the number of turns in the dialog: associ-
ated to other features this parameter can
be a strong indicator that the dialog is go-
ing  ne (associated with a lot of con rma-
tions or item values) or that the interaction
is di cult (lot of repetitions for example).
2.  rst order statistics data on the calltype
labels: these are the 34 calltypes pre-
sented in section 2 and representing both
application-speci c requests (Pay Bill) and
dialog-based concepts like Yes, No, I want
to talk to somebody, Help, etc....
3. second order statistics data on the calltype
labels: we chose the bigrams of the previ-
ous calltypes that had the highest weighted
Mutual Information and we store in the
vectors their frequencies. These features al-
low us to observe certain patterns like rep-
etition of a request, request followed by a
con rmation, people asking twice for a rep-
resentative, etc.. ..
5.2 Analyzing the clusters obtained
The experiment reported here is made on the
HMIHY corpus. The vectors used contain 55
components: 1 for the number of turns, 34 for
the calltypes and 20 for the bigrams on the call-
types. The number of clusters to be found by
the k-means clustering algorithm was set to 5.
Firstly because as we want to give an interpreta-
tion to each cluster we need to keep a relatively
small number of them. Secondly because this
number leads to a fast convergence of the k-
mean clustering process. The clustering model
is obtained on the training corpus and applied
to the test corpus. Table 3 shows the distribu-
tion of the dialogs among the clusters, on the
training corpus, ranging from 35.4% for C1 to
7% for C2.
There are two ways of analyzing the clusters
obtained: from the language point of view and
from the semantic point of view. Table 3 illus-
trates the language channel features with the
average amount of turns (#turn), the average
utterance lengths (utt. length) and the per-
plexity measure (pplex). This perplexity mea-
sure is obtained with a 3-gram LM trained on
the whole training corpus and applied to the
manual transcriptions of each cluster, of the test
and the training corpus.
The di erences in utterance lengths are not
as big as those observed in section 4.6. This is
an indicator that, unlike the clustering at the
utterance level, the clusters obtained represent
similar dialog situation. It is the way the dia-
log progresses, the dialog pattern, rather than
the theme of the dialog which distinguishes the
clusters. The lengths of the dialogs and the per-
plexity measures are indicators of these di erent
dialog patterns.
The results on the test corpus presented in
Table 3 are obtained with automatic calltypes,
completely unsupervised. The F-measure score
on the detection of these calltypes on the test
corpus is 75%. As one can see, having errors in
the calltypes detected does not a ect too much
the characteristics of the clusters obtained.
Training corpus
C % dialogs #turn utt. length pplex
1 35.4% 3.9 7.3 7.32
2 7% 3.2 11.7 9.9
3 22.3% 3.5 9 7.3
4 10.4% 3.9 12.4 9.5
5 25% 2.9 9.3 7.9
Test corpus
C % dialogs #turn utt. length pplex
1 32.4% 4.0 8.5 17.2
2 7.4% 3.3 11 30.2
3 20.8% 3.5 7.5 13.2
4 12.0% 3.8 11.6 28.2
5 27.5% 2.8 8.3 17
Table 3: Language features attached to the
dialog clusters obtained with the partitional
method, at the dialog level, on the HMIHY cor-
pus
In order to analyze the clusters on the seman-
tic channel we plot the weighed Mutual Infor-
mation, wMI(ci;tj), between each cluster, ci,
and each vector component, tj. This measure is
estimated in the following way:
wMI(ci; tj) = P(ci; tj)log P(ci;tj)P(c
i)P(tj)
This plot is shown on Figure 2 for the HMIHY
training corpus. By grouping together com-
ponents corresponding to a phenomenon we
want to observe we are able to characterize
more closely each cluster. In the color-coded
graph, x-axis corresponds to call-type unigrams
or selected call-type pairs, y-axis corresponds
to each cluster. The color of each rectangle
shows the degree of correlation, determined by
the weighted mutual information between call-
type and the cluster. As also can be seen from
the color spectrum on the right hand side, dark
red means high correlation (top), and dark blue
means reverse correlation (bottom).
−0.01
−0.008
−0.006
−0.004
−0.002
0
0.002
0.004
0.006
0.008
0.01
Call−types
Clusters
Weighted MI(Call−type;Cluster)
wMI(Call−type;Cluster)
5 10 15 20 25 30 35 40 45 50
1
2
3
4
5
Figure 2: Weighted Mutual Information mea-
sures between and clusters on the HMIHY train-
ing corpus. The color of each rectangle shows
the degree of correlation, determined by the
weighted mutual information between the vec-
tor components and the cluster. As can be seen
from the color spectrum on the right, dark red
means high correlation, and dark blue means
reverse correlation.
We chose to analyze the clusters according to
3 dimensions:
1. Request = the kind of request expressed
by the user. We split all the request call-
types into two categories: the easy requests
that correspond to the calltypes well de-
tected by the SLU module and the di cult
requests that contain all the calltypes that
are often badly recognized.
2. Understanding = to try to character-
ize the understanding of a user by the
system we use two features: the bigrams
request + yes (conf) that can be indica-
tors that the request is understood because
the following concept expressed by the user
is yes; the bigrams request + request
(repeat) which indicate that the same re-
quest is repeated twice in a row, which can
indicate that the system misunderstood it.
3. Problems = we grouped in this category
the features that can be related to some
problems the user have during the interac-
tion. These features are: request for help
(help), two requests in a row for a rep-
resentative (rep) and a calltype indicat-
ing that no meaningful information is ex-
tracted from the user’s utterance (null).
Request Underst. Problems
C easy dif conf repeat help rep null
1 + { + = { { {
2 { + { + + { +
3 + { = = = + =
4 { + + + + + +
5 { + + + = = {
Table 4: Interaction features attached to the
dialog clusters obtained with the partitional
method, at the dialog level, on the HMIHY cor-
pus
Table 4 represents the 5 clusters from  gure 2,
ternary encoded according to these semantic di-
mensions. A 0+0 means a higher value in the
wMI scale, 0 0 means a lower value in wMI scale
and 0 =0 stands for 0 in the wMI scale. Accord-
ing to the results presented in Tables 3 and 4,
the following description can be given to the 5
clusters:
 Cluster C1 : easy requests, expressed clearly
(low perplexity, short utterances), under-
stood correctly (conf=’+’), and with no
evidence of a problem.
 Cluster C2 : di cult requests, with
confusing formulation (high perplexity,
long utterances), often misunderstood
(repeat=’+’), leading users to have prob-
lems (request for help, no info).
 Cluster C3 : rather similar to C1 except
that people ask more often for a representa-
tive, maybe because the requests are not al-
ways recognized immediately (conf=’=’).
 Cluster C4 : di cult requests, with confus-
ing formulation (high perplexity, long ut-
terances), leading users to have problems
(request for help, no info), but unlike C2
eventually leading to a success (high num-
ber of turns and conf=’+’).
 Cluster C5 : di cult requests, expressed
clearly (low perplexity, rather short ut-
terances), understood correctly (conf=’+’)
even after some repetition because these
are di cult requests, and with no evidence
of a problem.
Figure 3 gives portions of two example dialogs
from clusters C1 and C2. As can be seen, in the
Example dialog from C1:
 System: How may I help you?
 User: I want to pay a bill.
 System Call-type: Make(Payment)
 System: Do you want to arrange a bill
payment?
 User: Yes.
 System Call-type: Yes
Example dialog from C2:
 System: How may I help you?
 User: Yes somebody just called me from
I don’t know it’s collect or something like
that and it had to do with A T and T and
when...
 System Call-type: Explain(Bill)
 System: Do you have a question about a
speci c charge on your bill?
 User: No
 System Call-type: No
 System: I’m sorry . How may I help you?
 User: It just said call you guys for help
about trying to get through to a long dis-
tant number and it’s not getting through
when I called the long distant ...
 System Call-type: Other
Figure 3: Two example dialogs from our corpus.
 rst dialog, the user’s responses are very clear,
whereas in the second one they are very vague
and complicated.
6 Using clusters for Speech Mining
In the previous section we have interpreted the
clusters obtained with the respech to the dialog
contexts. Each cluster is identi ed by various
features computed, characterizing the language
and the interaction. These features can be seen
as a set of tools allowing the designer of a Spo-
ken Dialog System to monitor the behavior of
the system. Indeed, once a clustering model
has been build on a manually labeled training
corpus, this model can be applied in a fully un-
supervised way to non transcribed data and all
the features presented in Figure 2 and Tables 3
and 4 can be extracted. Even if mistakes in
the calltype detection occur, the general struc-
ture of the clusters is stable as shown on the
HMIHY test corpus in Table 3 and in Figure 4
that plot the same parameters for the test cor-
pus with automatic calltypes that Figure 2 has
for the training corpus with manual labels.
−0.01
−0.008
−0.006
−0.004
−0.002
0
0.002
0.004
0.006
0.008
0.01
Call−types
Clusters
Weighted MI(Call−type;Cluster)
wMI(Call−type;Cluster)
5 10 15 20 25 30 35 40 45 50
1
2
3
4
5
Figure 4: Weighted Mutual Information mea-
sures between calltype labels and clusters on the
HMIHY test corpus with automatic calltype la-
bels
7 Conclusions
We presented in this paper an application
of data clustering methods to large Human-
Computer spoken dialog corpora. Di erent
ways for encoding dialogs into multidimensional
structures (symbolic and numerical) and dif-
ferent clustering methods have been proposed.
Preliminary results are given for cluster inter-
pretation and dynamic model adaptation using
the clusters obtained.
Acknowledgements
The authors would like to thank Driss Matrouf
for providing the k-means clustering tools used
in this study.

References
Frederic Bechet, Giuseppe Riccardi, and Dilek Hakkani-Tur. 2003. Multi-channel sentence classi cation for spoken dialogue language modeling. In Proceedings of Eurospeech’03, Geneve, Switzerland.
Yannick Esteve, Frederic Bechet, Alexis Nasr, and Renato De Mori. 2001. Stochastic  - nite state automata language model triggered by dialogue states. In Proceedings of Eurospeech’01, volume 1, pages 725{728, Aalborg, Danemark.
A.L. Gorin, G. Riccardi, and J.H. Wright. 1997. How May I Help You? Speech Communication, 23:113{127.
Helen Wright Hastie, Rashmi Prasad, and Marilyn A. Walker. 2002. What’s the trouble: Automatically identifying problematic dialogs in darpa communicator dialog systems. In Proceedings of the Association of Computational Linguistics Meeting, Philadelphia, USA.
Candace Kamm, Marilyn A. Walker, and Diane Litman. 1999. Evaluating spoken language systems. In Proceedings of American Voice Input/Output Society AVIOS.
R. Kuhn and R. De Mori. 1995. The application of semantic classi cation trees to natural language understanding. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(449-460).
