Learning dialog act processing
Stefan Wermter and Matthias Löchel
Computer Science Department
University of Hamburg
22765 Hamburg
Germany
wermter@informatik.uni-hamburg.de
loechel@informatik.uni-hamburg.de
Abstract 
In this paper we describe a new approach for learning dialog act processing. In this approach we integrate a symbolic semantic segmentation parser with a learning dialog act network. In order to support the unforeseeable errors and variations of spoken language we have concentrated on robust data-driven learning. This approach already compares favorably with the statistical average plausibility method, produces a segmentation and dialog act assignment for all utterances in a robust manner, and reduces knowledge engineering since it can be bootstrapped from rather small corpora. Therefore, we consider this new approach as very promising for learning dialog act processing.
1 Introduction 
For several decades, pragmatic interpretation at the dialog act level has belonged to the most difficult and challenging tasks for natural language processing and computational linguistics (Austin, 1962; Searle, 1969; Wilks, 1985). Recently, we can see an important development in natural language processing and computational linguistics towards the use of empirical learning methods (for instance, (Charniak, 1993; Marcus et al., 1993; Wermter, 1995; Jones, 1995; Wermter et al., 1996)).
Primarily, new learning approaches have been successful for lexically or syntactically tagged text corpora. In this paper we want to examine the potential of learning techniques at higher pragmatic dialog levels of spoken language. Learning at least part of the dialog knowledge is desirable since it could reduce the knowledge engineering effort. Furthermore, inductive learning algorithms work in a data-driven mode and have the ability to extract gradual regularities in a robust manner. This robustness is particularly important for processing spoken language, since spoken language can contain constructions including interjections, pauses, corrections, repetitions, false starts, semantically or syntactically incorrect constructions, etc.
The use of learning is a new approach at the level of dialog acts, and only recently have there been some learning approaches for dialog knowledge (Mast et al., 1996; Alexanderson et al., 1995; Reithinger and Maier, 1995; Wang and Waibel, 1995). Different from these approaches, in this paper we examine the combination of learning techniques in simple recurrent networks with symbolic segmentation parsing at a dialog act level.
Input to our dialog component are utterances from a corpus of business meeting arrangements like: "Tuesday at 10 is for me now again bad because I there still train I think we should [delay] the whole then really to the next week is this for you possible" 1. For a flat level of dialog act processing, the incremental output is (1) utterance boundaries within a dialog turn and (2) the specific dialog act within an utterance. The paper is structured as follows: First we will outline the domain and task and illustrate the dialog act categories. Then we will describe the overall architecture of the dialog component in the SCREEN system (Symbolic Connectionist Robust Enterprise for Natural language), consisting of the segmentation parser and the dialog act network. We will describe the learning and generalization results for this dialog component and point out contributions and further work.
1 This is almost a literal translation of the German utterance: "Dienstags um zehn ist bei mir nun wiederum schlecht weil ich da noch trainieren bin ich denke wir sollten das Ganze dann doch auf die nächste Woche verschieben geht es bei ihnen da." We have chosen the literal word-by-word translation since our processing is incremental and knowledge about the order of the German words matters for processing.
2 The Task 
The main task is the examination of learning for dialog act processing, and the domain is the arrangement of business dates. For this domain we have developed a classification of dialog acts which is shown in table 1 together with examples. Our guideline for the choice of these dialog acts was based on (1) the particular domain and corpus and (2) our goal to learn rather few dialog categories but in a robust manner 2.
Dialog act (Abbreviation)    Example
acceptance (acc)             That would be fine
query (query)                Do you know Hamburg
rejection (rej)              This is too late for me
request comment (re-c)       Is that possible
request suggestion (re-s)    When would it be ok
statement (state)            Right, it's a Tuesday
date/loc. suggestion (sug)   I propose April 13th
miscellaneous (misc)         So long, bye

Table 1: Dialog acts and examples
For example, in our example turn below there are several utterances and each of them has a particular dialog act as shown below. The turn starts with a rejection, followed by an explaining statement. Then a suggestion is made and a request for commenting on this suggestion:

• Dienstags um zehn ist bei mir nun wiederum schlecht (Tuesday at 10 is for me now again bad) → rejection

• weil ich da noch trainieren bin (because I there still train) → statement

• ich denke (I think) → miscellaneous

• wir sollten das Ganze dann doch auf die naechste Woche verschieben (we should the whole then really to the next week delay; we should delay the whole then really to the next week) → suggestion

• geht es bei ihnen da (is that for you possible) → request comment
It is important to note that segmentation parsing and dialog act processing work incrementally and in parallel on the incoming stream of word hypotheses. After each incoming word the segmentation parsing and dialog act processing analyze the current input. For instance, dialog act hypotheses are available with the first input word, although good hypotheses may only be possible after most of an utterance has been seen. Our general goal here is to produce hypotheses about segmentation and dialog acts as early as possible in an incremental manner.

2 This is also motivated by our additional goal of receiving noisy input directly from a speech recognizer.
3 The Overall Approach 
The research presented here is embedded in a larger effort for examining hybrid connectionist learning capabilities for the analysis of spoken language at various acoustic, syntactic, semantic and pragmatic levels. To investigate hybrid connectionist architectures for speech/language analysis we developed the SCREEN system (Symbolic Connectionist Robust Enterprise for Natural language) (Wermter and Weber, 1996). For the task of analyzing spontaneous language we pursue a shallow screening analysis which uses primarily flat representations (like category sequences) wherever possible.
[Figure: two example inputs, "also Friday the nineteenth is not possible" (reject) and "but Thursday afternoon is ok for me" (accept), are segmented into frames with slots such as dialog act, type, verb-form, question, auxiliary, agent, object, recipient and time-at; dialog act processing draws on a knowledge base of syntactic, semantic and dialog knowledge.]

Figure 1: Architecture of dialog act component
Figure 1 gives an overview of our dialog component in SCREEN. The interpretation of utterances is based on syntactic, semantic and dialog knowledge for each word. The syntactic and semantic knowledge is provided by other SCREEN components and has been described elsewhere (Wermter and Weber, 1995). Each word of an utterance is processed incrementally and passed to the segmentation parser and to the dialog act network. The dialog act network provides the currently recognized dialog act for the current flat frame representation of the utterance part. The segmentation parser provides knowledge about utterance boundaries. This is important control knowledge for the dialog act network, since without knowing about utterance boundaries the dialog network may assign incorrect dialog acts.
4 The Segmentation Parser 
The segmentation parser receives one word at a time and builds up a flat frame structure in an incremental manner (see tables 2 and 3). Together with each word the segmentation parser receives syntactic and semantic knowledge about this word based on other syntactic and semantic modules in SCREEN. Each word is associated with 1. its most plausible basic syntactic category (e.g. noun, verb, adjective), 2. its most plausible abstract syntactic category (e.g. noun group, verb group, prepositional group), 3. its basic semantic category (e.g., animate, abstract), and 4. its abstract semantic category (e.g., agent, object, recipient).
Slots           3. Phrase              Final Phrase
dialog act      cat?                   reject
type            is                     is
verb-form       ((is))                 ((is))
question        nil                    nil
auxiliary       nil                    nil
agent           nil                    nil
object          nil                    ((bad))
recipient       nil                    ((for me))
time-at         ((Tuesday) (at 10))    ((Tuesday) (at 10))
time-from       nil                    nil
time-to         nil                    nil
location-at     nil                    nil
location-from   nil                    nil
location-to     nil                    nil
confirm         nil                    nil
negation        nil                    nil
miscellaneous   nil                    ((now again))
input           Tuesday at 10 is       Tuesday at 10 is for me now again bad

Table 2: Incremental slot filling in frame 1: literal incremental translation: Dienstags um zehn ist bei mir nun wiederum schlecht (Tuesday at 10 is for me now again bad)
This syntactic and semantic category knowledge is used by the segmentation parser for two main purposes. First, this category knowledge is needed for our segmentation heuristics. For our domain we have developed segmentation rules which allow the system to split turns into utterances. For instance, if we know that the basic syntactic category of the word "because" is conjunction and it is part of a conjunction group, then this is an indication to close the current frame and trigger a new frame for the next utterance. Second, the category knowledge, primarily the abstract semantic knowledge, is used for filling the frames, so that we get a symbolically accessible structure rather than a tagged word sequence.
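The conjunction-based segmentation heuristic described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the Frame class, the category tag names, and the tagged example turn are all assumptions made for the example.

```python
# Hypothetical sketch of the conjunction-based segmentation rule: close the
# current frame and open a new one when a word's basic syntactic category is
# "conjunction" and it belongs to a conjunction group.

class Frame:
    def __init__(self):
        self.slots = {}   # would be filled from abstract semantic categories
        self.words = []

def segment(tagged_words):
    """tagged_words: list of (word, basic_syntactic, abstract_syntactic) tuples.
    Returns a list of Frames, one per detected utterance."""
    frames = [Frame()]
    for word, basic, abstract in tagged_words:
        if basic == "conjunction" and abstract == "conjunction-group" and frames[-1].words:
            frames.append(Frame())        # trigger a new frame for the next utterance
        frames[-1].words.append(word)
    return frames

turn = [("Tuesday", "noun", "noun-group"), ("at", "preposition", "prep-group"),
        ("10", "numeral", "prep-group"), ("is", "verb", "verb-group"),
        ("bad", "adjective", "noun-group"),
        ("because", "conjunction", "conjunction-group"),
        ("I", "pronoun", "noun-group"), ("train", "verb", "verb-group")]
print([f.words for f in segment(turn)])
# → [['Tuesday', 'at', '10', 'is', 'bad'], ['because', 'I', 'train']]
```

Note that, as in the text, the rule fires only on conjunction groups, so a coordinating "and" inside a noun group would need a different abstract category to be left unsplit.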
Slots           1.-3. Phrase                Final Phrase
dialog act      cat?                        statement
type            move                        move
verb-form       nil                         ((train))
question        nil                         nil
auxiliary       nil                         am
agent           ((I))                       ((I))
object          nil                         nil
recipient       nil                         nil
time-at         nil                         nil
time-from       nil                         nil
time-to         nil                         nil
location-at     nil                         nil
location-from   nil                         nil
location-to     nil                         nil
confirm         nil                         nil
negation        nil                         nil
miscellaneous   ((because) (there still))   ((because) (there still))
input           because I there still       because I there still train am

Table 3: Incremental slot filling in frame 2: ...weil ich da noch trainieren bin (because I there still train [am])
The segmentation parser is able to segment 84% of the 184 turns with 314 utterances correctly. The remaining 16% are mostly difficult ambiguous cases, some of which could be resolved if more knowledge could be used. For instance, while many conjunctions like "because" are good indicators for utterance borders, some conjunctions like "and" and "or" may not start new coordinated subsentences but coordinate noun groups. Fundamental structural disambiguation could be used to deal with these cases. Since they occur relatively rarely in our spoken utterances we have chosen not to incorporate structural disambiguation. Furthermore, another class of errors is characterized by time and location specifiers which can occur at the end or start of an utterance. For instance, consider the example: "On Tuesday the sixth of April I still have a slot in the afternoon - is that possible" versus "On Tuesday the sixth of April I still have a slot - in the afternoon is that possible". Such decisions are difficult and additional knowledge like prosody might help here. Currently, there is a preference for filling the earlier frame.
5 The Dialog Act Network 
In table 1 we have described the dialog acts we use in our domain. Before we start to describe any experiments on learning dialog acts we show the distribution of dialog acts across our training and test sets. Table 4 shows the distribution for our set of 184 turns with 314 utterances. There were 100 utterances in the training set and 214 in the test set. As we can see, suggestions and explanatory statements often occur, but in general all dialog acts occur reasonably often. This distribution analysis is important for judging the learning and generalization behavior.
Category   Training   Test
sug        31%        26%
state      20%        21%
rej        12%        10%
misc       11%        18%
re-s       10%        8%
acc        9%         12%
query      5%         3%
re-c       2%         2%

Table 4: Distribution of the dialog acts in training and test set
After this initial distribution analysis we now describe our network architecture for learning dialog acts. Dialog acts depend a lot on significant words and word order. Certain key words are much more significant for a certain dialog act than others. For instance "propose" is highly significant for the dialog act suggest, while "in" is not. Therefore we computed a smoothed dialog act plausibility vector for each word w which reflects the plausibility of the categories for a particular word. The sum of all values is 1 and each value is at least 0.01. The plausibility value of a word w in a dialog category cat_i with the frequency f is computed as described in the formula below.
p_cat_i(w) = ( f_cat_i(w) + f_cat_{j≠i}(w) * 0.01 ) / f(w),   where f(w) is the total frequency of w in the corpus
Table 5 shows examples of plausibility vectors for some words. As we can see, "bad" has the highest plausibility for the reject dialog act, and "propose" for the suggest dialog act. On the other hand, the word "is" is not particularly significant for certain dialog acts and therefore has a plausibility vector with relatively evenly distributed values.
        bad    propose   is
acc     0.28   0.01      0.22
misc    0.01   0.38      0.02
query   0.01   0.01      0.07
rej     0.66   0.01      0.34
re-c    0.01   0.01      0.01
re-s    0.01   0.01      0.02
state   0.01   0.01      0.27
sug     0.01   0.56      0.05

Table 5: Three examples for plausibility vectors
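A smoothed plausibility vector with a 0.01 floor that still sums to 1 can be sketched, for example, by reserving the floor mass and scaling the relative frequencies into the rest. This is one plausible realization, not necessarily the paper's exact smoothing formula; the word counts below are invented for illustration.

```python
# Hedged sketch of a smoothed dialog act plausibility vector: relative
# frequency of word w per dialog act, interpolated so that every category
# gets at least 0.01 and the vector sums to 1.

ACTS = ["acc", "misc", "query", "rej", "re-c", "re-s", "state", "sug"]

def plausibility_vector(counts):
    """counts: dict mapping dialog act -> frequency of w in utterances of that act."""
    total = sum(counts.values())
    floor = 0.01
    scale = 1.0 - floor * len(ACTS)   # probability mass left after reserving the floor
    return {a: floor + scale * counts.get(a, 0) / total for a in ACTS}

# Invented counts for a word like "propose", occurring mostly in suggestions:
vec = plausibility_vector({"sug": 28, "misc": 19, "acc": 3})
print(max(vec, key=vec.get))   # → sug
```

With eight categories the reserved floor mass is 0.08, so a word seen only in suggestions would get 0.93 for sug and 0.01 everywhere else, matching the shape of the "propose" column in Table 5.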
We have experimented with different variations of simple recurrent networks (Elman, 1990) for learning dialog act assignment. We had chosen simple recurrent networks since these networks can represent the previous context in an utterance in their recurrent context layer. The best performing network is shown in figure 2.
[Figure: a simple recurrent network with an input layer, a hidden layer whose activation is copied 1:1 into a context layer, and an output layer.]

Figure 2: Dialog act network with dialog plausibility vectors as input
Input to this network is the current word represented by its dialog plausibility vector. The output is the dialog act of the whole utterance. Between input and output layer there are the hidden layer and the context layer. All the feedforward connections in the network are fully connected. Only the recurrent connections from the hidden layer to the context layer are 1:1 copy connections, which represent the internal learned context of the utterance before the current word. Training in these networks is performed by using gradient descent (Rumelhart et al., 1986) using up to 3000 cycles through the training set. By using the internal learned context it is possible to make dialog act assignments for a whole utterance. While processing a whole utterance, each word is presented with its plausibility vector, and at the output layer we can check the incrementally assigned dialog acts for each incoming word of the utterance.
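The forward dynamics of such an Elman-style network can be sketched as follows. This is only an untrained illustration of the architecture described above: the hidden layer size, the random weights, and the stand-in plausibility vectors are assumptions, and the gradient descent training step is omitted.

```python
# Minimal sketch of the Elman-network forward pass: the hidden activation is
# copied 1:1 into a context layer and fed back together with the next word's
# dialog plausibility vector, yielding an incremental dialog act hypothesis
# per word. Weights are random (untrained); sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_ACTS, N_HIDDEN = 8, 12                          # 8 dialog acts in and out

W_in  = rng.normal(0, 0.1, (N_HIDDEN, N_ACTS))    # input   -> hidden
W_ctx = rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN))  # context -> hidden
W_out = rng.normal(0, 0.1, (N_ACTS, N_HIDDEN))    # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_utterance(plausibility_vectors):
    """Feed one plausibility vector per word; return per-word output activations."""
    context = np.zeros(N_HIDDEN)
    outputs = []
    for vec in plausibility_vectors:
        hidden = sigmoid(W_in @ vec + W_ctx @ context)
        outputs.append(sigmoid(W_out @ hidden))
        context = hidden.copy()                   # 1:1 recurrent copy connections
    return outputs

words = [rng.random(N_ACTS) for _ in range(4)]    # stand-in plausibility vectors
outs = run_utterance(words)
print(len(outs), outs[0].shape)
```

After each word, the latest output vector is the network's current dialog act hypothesis for the utterance so far, which is what makes the incremental per-word check in the text possible.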
We have experimented with different input knowledge (only dialog act plausibility vectors, additional abstract semantic plausibility vectors, etc.) and different architectures (different numbers of context layers, different numbers of units in the hidden layer, etc.). Due to space restrictions it is not possible to describe all these comparisons. Therefore we just focus on the description of the network with the best generalization performance.
Dialog acts   Training   Test
acc           88.9       72.0
state         90.0       90.9
misc          54.5       73.7
query         40.0       0.0
rej           91.7       85.7
sug           90.3       92.9
re-c          0.0        0.0
re-s          90.0       82.4
Total         82.0       79.4

Table 6: Performance of simple recurrent network with dialog plausibility vectors in percent
Table 6 shows the results for our training and test utterances. The overall performance was 82.0% on the training set and 79.4% on the test set. An utterance was counted as classified in the correct dialog act class if the majority of the outputs of the dialog act network corresponded with the desired dialog act. This good performance is partly due to the distributed representation in the dialog plausibility vector at the input layer. Other second-best networks with additional local representations for abstract semantic category knowledge could perform better on the training set but failed to generalize on the test set and only reached 71%.
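The majority-vote evaluation criterion described above can be made concrete with a small sketch. The function names and the per-word hypothesis sequence are illustrative assumptions, not taken from the paper.

```python
# Sketch of the evaluation criterion: an utterance counts as correctly
# classified when the majority of the network's per-word outputs name the
# desired dialog act.
from collections import Counter

def utterance_act(per_word_predictions):
    """Majority vote over the dialog act predicted after each incoming word."""
    return Counter(per_word_predictions).most_common(1)[0][0]

def is_correct(per_word_predictions, desired_act):
    return utterance_act(per_word_predictions) == desired_act

preds = ["sug", "rej", "rej", "rej", "state"]   # made-up per-word hypotheses
print(utterance_act(preds), is_correct(preds, "rej"))   # → rej True
```

Early words may thus be misclassified, as in the example sequence, without costing the utterance as a whole.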
The remaining errors are partly due to rarely occurring dialog acts. For instance, only 2% of the training utterances and 2.8% of the test utterances belong to the request-comment dialog act. The network was not able to learn correct assignments due to the little training data. The drop in the performance for the query dialog act from training to test set can be explained by the higher variability of the queries compared to all other categories. Since queries differ much more from each other than all other dialog acts they could not be generalized. However, they do not occur very often. All other often-occurring dialog act categories performed very well, as the individual percentages and the overall percentage show.
6 Discussion and Conclusions 
What do we learn from this? When we started this work it was not clear to what extent a symbolic segmentation parser and a connectionist learning dialog act network could be integrated to perform an analysis at the semantics and dialog level. We have shown that a symbolic segmentation parser and a learning dialog network can be integrated to perform dialog act assignments for spoken utterances. While other related work has focused on statistical learning, we have explored the use of learning in simple recurrent networks. Our corpus of 2228 words is still medium size. Nevertheless, we consider the results as promising, given that it is - to the best of our knowledge - the first attempt to integrate symbolic segmentation parsing with dialog act learning in simple recurrent networks.
How well do we perform compared to related work? In spite of many projects in the ATIS and VERBMOBIL domains there is not a lot of work on learning for the dialog level. However, recently there have been some investigations of statistical techniques (Reithinger and Maier, 1995; Alexanderson et al., 1995; Mast et al., 1996). For instance, Mast and colleagues report 58% for learning dialog act assignment with semantic classification trees and 69% for learning with polygrams, but they also used more categories than in our approach, so that the approaches are not directly comparable.
For a further evaluation of our trained network architecture we compared our results with a statistical approach based on the same data. Plausibility vectors for dialog acts represent the distribution of dialog acts for each word for the current corpus. However, for assigning a dialog act to a whole utterance all the words of this utterance have to be considered. A simple but efficient approach would be to compute the average plausibility vector for each utterance which has been found. Then the dialog act with the highest averaged plausibility vector for a complete utterance would be taken as the computed dialog act. This statistical approach reached a performance of 62% correctness on the training and test set, compared to the 82% and 79% of our dialog network. So simple recurrent networks performed better than the statistical average plausibility method. In comparison to statistical techniques which have also been used successfully on large corpora, it is our understanding that simple recurrent networks may be particularly suitable for domains where only smaller corpora are available or where classification data is hard to get (as is the case for pragmatic dialog acts).
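The average plausibility baseline described above can be sketched as follows, reusing the vectors for "is" and "bad" from Table 5; the two-word "utterance" is illustrative only.

```python
# Sketch of the statistical average plausibility baseline: average the dialog
# act plausibility vectors of all words in an utterance and take the dialog
# act with the highest averaged value.

ACTS = ["acc", "misc", "query", "rej", "re-c", "re-s", "state", "sug"]

def average_plausibility_act(word_vectors):
    """word_vectors: one {dialog act: plausibility} dict per word of the utterance."""
    avg = {a: sum(v[a] for v in word_vectors) / len(word_vectors) for a in ACTS}
    return max(avg, key=avg.get)

# Plausibility vectors from Table 5:
vec_is  = {"acc": 0.22, "misc": 0.02, "query": 0.07, "rej": 0.34,
           "re-c": 0.01, "re-s": 0.02, "state": 0.27, "sug": 0.05}
vec_bad = {"acc": 0.28, "misc": 0.01, "query": 0.01, "rej": 0.66,
           "re-c": 0.01, "re-s": 0.01, "state": 0.01, "sug": 0.01}

print(average_plausibility_act([vec_is, vec_bad]))   # → rej
```

Unlike the recurrent network, this baseline ignores word order entirely, which is one reason it stays at 62% on the same data.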
What will be further work? So far we have concentrated on single utterances and do not account for the relationship between utterances in a dialog. While we could demonstrate that such a local strategy could assign correct dialog acts in many cases, it might be interesting to explore to what extent knowledge about previous dialog acts in previous utterances could even improve our results. Furthermore, we have developed the segmentation parser and dialog act network as very robust components. In fact, both are very robust in the sense that they will always produce the best possible segmentation and dialog act categorization. In the future we plan to explore how the output from a speech recognizer can be processed by our dialog component. Sentence and word hypotheses from a speech recognizer are still far from optimal for continuously spoken spontaneous speech. Therefore we have to account for highly ungrammatical constructions. The segmentation parser and the dialog network already contain the robustness which is a precondition for dealing with real-world speech input.
Acknowledgements 
This research was funded by the German Federal Ministry for Research and Technology (BMBF) under Grant #01IV101A0 and by the German Research Association (DFG) under contract DFG Ha 1026/6-2. We would like to thank S. Haack, M. Meurer, U. Sauerland, M. Schrattenholzer, and V. Weber for their work on SCREEN.
References 
J. Alexanderson, E. Meier, and N. Reithinger. 1995. A robust and efficient three-layered dialogue component for a speech-to-speech translation system. In Proceedings of the European Association for Computational Linguistics, Dublin.

J. Austin. 1962. How to do things with words. Clarendon Press, Oxford.

E. Charniak. 1993. Statistical Language Learning. MIT Press, Cambridge, MA.

J. L. Elman. 1990. Finding structure in time. Cognitive Science, 14:179-211.

D. Jones, editor. 1995. New Methods in Language Processing. University College London.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19(1).

M. Mast, E. Noeth, H. Niemann, and E. G. Schukat-Talamazzini. 1996. Automatic classification of dialog acts with semantic classification trees and polygrams. In S. Wermter, E. Riloff, and G. Scheler, editors, Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, pages 217-229. Springer, Heidelberg.

N. Reithinger and E. Maier. 1995. Utilizing statistical dialogue act processing in Verbmobil. In Computational Linguistics Archive.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, Cambridge, MA.

J. R. Searle. 1969. Speech Acts. Cambridge University Press, Cambridge.

Y. Wang and A. Waibel. 1995. Connectionist transfer in machine translation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 37-44, Tzigov Chark.

S. Wermter and V. Weber. 1995. Artificial neural networks for automatic knowledge acquisition in multiple real-world language domains. In Proceedings of the International Conference on Neural Networks and their Applications, Marseille.

S. Wermter and V. Weber. 1996. Interactive spoken language processing in the hybrid connectionist system SCREEN: learning robustness in the real world. IEEE Computer, 1996. In press.

S. Wermter, E. Riloff, and G. Scheler. 1996. Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Springer, Berlin.

S. Wermter. 1995. Hybrid Connectionist Natural Language Processing. Chapman and Hall, London, UK.

Y. Wilks. 1985. Relevance, points of view and speech acts: An artificial intelligence view. Technical Report MCCS-85-25, New Mexico State University.
