Statistical Method of Recognizing Local Cohesion 
in Spoken Dialogues 
Naoto Katoh and Tsuyoshi Morimoto 
ATR Interpreting Telecommunications Research Laboratories 
Seika-cho Soraku-gun Kyoto 619-02 Japan 
{katonao, morimoto}@itl.atr.co.jp
Abstract 
This paper presents a method for auto- 
matically recognizing local cohesion be- 
tween utterances, which is one of the dis- 
course structures in task-oriented spoken 
dialogues. More specifically we can auto- 
matically acquire discourse knowledge 
from an annotated corpus with local co- 
hesion. In this paper we focus on speech 
act type-based local cohesion. The presented
method consists of two steps: 1)
identifying the speech act expressions in 
an utterance and 2) calculating the plausi- 
bility of local cohesion between the 
speech act expressions by using the dia- 
logue corpus annotated with local cohe- 
sion. We present two methods of interpo- 
lating the plausibility of local cohesion 
based on surface information on utter- 
ances. The presented method has ob- 
tained a 93% accuracy for closed data and 
a 78% accuracy for open data in recogniz- 
ing a pair of utterances with local cohe- 
sion. 
1 Introduction 
For ambiguity resolution, processing of a dis- 
course structure is one of the important processes 
in Natural Language Processing (NLP). Indeed, 
discourse structures play a useful role in speech 
recognition, which is an application of NLP. In the 
case of Japanese, it is very difficult to recognize
the endings of utterances with current speech recognition
techniques because the sound power of
an ending tends to be small. For example, "desu", 
which represents the speech act type "response", 
is often misrecognized as "desu-ka (question)" or 
"desu-ne (confirmation)". On the other hand, 
Japanese speakers can easily select the appropriate expression
"desu" when the intention of the previous utterance
is a question. This is because
they use the coherence relation (local cohesion)
between the two utterances, question-response. 
In the conventional approach (i.e., the rule-based
approach) to processing the discourse structure
[Hauptmann 88][Kudo 90][Yamaoka 91][Young 91],
NLP engineers built discourse knowledge by hand-coding.
However, the rule-based approach has a bottleneck: adding
discourse knowledge becomes laborious when the NLP system
has to deal with a larger domain and more vocabulary.
Recently, statistical approaches have been 
attracting attention for their ability to acquire lin- 
guistic knowledge from a corpus. Compared with 
the above rule-based approach, a statistical ap- 
proach is easy to apply to larger domains since the 
linguistic knowledge can be automatically extracted
from corpora of the domain concerned.
However, little research has been reported
in discourse processing [Nagata 94][Reithinger 95],
while in the areas of morphological analysis
and syntactic analysis, many studies have
been reported in recent years.
This paper presents a method for automati- 
cally recognizing local cohesion between utter- 
ances, which is one of the discourse structures in 
task-oriented spoken dialogues. We can automati- 
cally acquire discourse knowledge from an anno- 
tated corpus with local cohesion. In this paper we 
focus on speech act type-based local cohesion. 
The presented method consists of two steps: 1)
identifying the speech act expressions in an utter- 
ance and 2) calculating the plausibility of local 
cohesion between the speech act expressions by 
using the dialogue corpus annotated with local co- 
hesion. We present two methods of interpolating 
the plausibility of local cohesion based on surface 
information on utterances. Our method has ob- 
tained a 93% accuracy for closed data and a 78% 
accuracy for open data in recognizing a pair of ut- 
terances with local cohesion. 
In Section 2, local cohesion in task-oriented 
dialogues is described. In Section 3, our statistical 
method is presented. In Section 4, the results of a 
series of experiments using our method are de- 
scribed. 
Topics:
(1) hotel reservation
(2) date
(3) the number of persons

(1) U1: Heya wo yoyaku shitai-no-desu-ga
        (I would like to make a room reservation.)
    U2: Kashikomari-mashita
        (OK.)
(2) U3: Go-kibou no hinichi wo onegai-itashimasu
        (Could you tell me when you would like to stay?)
    U4: Hachi-gatsu to-ka kara ju-ni-nichi desu
        (From August 10th to 12th.)
    U5: Hachi-gatsu to-ka kara ju-ni-nichi made ni-haku desu-ne
        (That's two nights from August 10th to 12th, is that right?)
    U6: Hai, sou-desu
        (Yes, that's correct.)
(3) U7: Nan-mei-sama de shou-ka
        (How many persons will be staying?)
    U8: Futari desu
        (Two persons.)
[In the original figure, brackets to the left of the utterances mark global cohesion (first column) and local cohesion (second column).]
Figure 1. An example of a task-oriented dialogue in Japanese
2 Local Cohesion between Utterances
The discourse structure in task-oriented dialogues 
has two types of cohesion: global cohesion and 
local cohesion. Global cohesion is a top-down 
structured context and is based on a hierarchy of 
topics led by domain (e.g., hotel reservation or 
flight cancellation). Using this cohesion, a task- 
oriented dialogue is segmented into several 
subdialogues according to the topic. On the other 
hand, local cohesion is a bottom-up structured 
context and a coherence relation between utterances,
such as question-response or response-confirmation.
Unlike global cohesion, local
cohesion does not have a hierarchy. This paper
focuses on local cohesion. 
Figure 1 shows a Japanese conversation be- 
tween a person and a hotel staff member, which is 
an example of a task-oriented dialogue: the person
is making a hotel reservation. The first column
represents global cohesion and the second
column represents local cohesion. For example, 
the pair of U3 and U4 has local cohesion, because
the words in the two utterances are linked by the
following coherence relations:
c1) speech act pattern between "onegai-itashimasu
(requirement)" in U3 and "desu
(response)" in U4
c2) semantic coherence between nouns,
"hinichi (date)" in U3, and "hachi-gatsu to-ka
(August 10th)" and "ju-ni-nichi (12th)" in U4
In the same way, (U4, U5) and (U5, U6) have lo- 
cal cohesion. Thus, U3 to U6 are built up as one 
structure and form a subdialogue with the topic 
"date". 
As observed from this example, whether 
two utterances have local cohesion with one an- 
other or not is determined by coherence relations 
between the speech act types in the utterances, co- 
herence relations between the verbs in them and 
coherence relations between the nouns in them. In 
recognizing local cohesion, our method uses these 
three coherence relations. 
3 Our Approach 
3.1 Utterance Model with Local Cohesion
In this paper, we approximate an utterance in a
dialogue as a three-tuple:
$$U = (\mathit{SPEECH\_ACT}, \mathit{VERB}, \mathit{NOUNS}) \tag{1}$$
where SPEECH_ACT is the speech act type,
VERB is the main verb in the utterance, and
NOUNS is a set of nouns in the utterance (e.g., a
subject noun and an object noun for the main
verb). Figure 2 shows a dialogue with our utterance
model.
U_1 = (SPEECH_ACT_1, VERB_1, NOUNS_1)
U_2 = (SPEECH_ACT_2, VERB_2, NOUNS_2)
...
U_i = (SPEECH_ACT_i, VERB_i, NOUNS_i)
...
U_j = (SPEECH_ACT_j, VERB_j, NOUNS_j)
[In the original figure, an arrow between U_i and U_j is labeled "Local cohesion?"]
Figure 2. A dialogue with our utterance model
As mentioned in Section 2, when the $i$-th utterance
$U_i$ ($1 \le i < j-1$) and the $j$-th utterance $U_j$ in a dialogue
have local cohesion with one another,
$(\mathit{SPEECH\_ACT}_i, \mathit{SPEECH\_ACT}_j)$, $(\mathit{VERB}_i, \mathit{VERB}_j)$
and $(\mathit{NOUNS}_i, \mathit{NOUNS}_j)$ have coherence
relations. Therefore, the plausibility of local cohesion
between $U_i$ and $U_j$ can be formally defined
as:
$$\begin{aligned}
\mathrm{cohesion\_local}(U_i, U_j) = {} & \lambda_1\, \mathrm{cohesion\_speechact}(\mathit{SPEECH\_ACT}_i, \mathit{SPEECH\_ACT}_j) \\
& + \lambda_2\, \mathrm{cohesion\_verb}(\mathit{VERB}_i, \mathit{VERB}_j) \\
& + \lambda_3\, \mathrm{cohesion\_noun}(\mathit{NOUNS}_i, \mathit{NOUNS}_j)
\end{aligned} \tag{2}$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ ($\lambda_1 + \lambda_2 + \lambda_3 = 1$) are
nonnegative weights contributing to local cohesion,
and cohesion_speechact, cohesion_verb and
cohesion_noun are functions giving the plausibility
of coherence relations between speech act
types, verbs and nouns respectively. The problem
of deciding an utterance that has local cohesion
with $U_j$ can then be formally defined as finding an
utterance with the highest plausibility of local cohesion
for $U_j$, which is the result of the following
function:
$$U_{opt} = \arg\max_{0 < i < j-1} \mathrm{cohesion\_local}(U_i, U_j) \tag{3}$$
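As an illustration of Equations (2) and (3), the following minimal Python sketch combines three component scores with nonnegative weights and selects the candidate utterance with the highest plausibility. The names `Utterance`, `cohesion_local` and `best_antecedent`, as well as the scoring callables passed in, are hypothetical and are not taken from the original system; the candidate utterances are passed in explicitly rather than derived from an index range.

```python
from typing import Callable, FrozenSet, NamedTuple, Sequence, Tuple


class Utterance(NamedTuple):
    """Three-tuple utterance model of Equation (1)."""
    speech_act: str
    verb: str
    nouns: FrozenSet[str]


# A component scorer maps two values (speech acts, verbs or noun sets) to a plausibility.
Score = Callable[[object, object], float]


def cohesion_local(u_i: Utterance, u_j: Utterance,
                   speechact: Score, verb: Score, noun: Score,
                   weights: Tuple[float, float, float] = (1.0, 0.0, 0.0)) -> float:
    """Weighted sum of the three component plausibilities (Equation (2))."""
    lam1, lam2, lam3 = weights
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    return (lam1 * speechact(u_i.speech_act, u_j.speech_act)
            + lam2 * verb(u_i.verb, u_j.verb)
            + lam3 * noun(u_i.nouns, u_j.nouns))


def best_antecedent(candidates: Sequence[Utterance], u_j: Utterance,
                    speechact: Score, verb: Score, noun: Score,
                    weights: Tuple[float, float, float] = (1.0, 0.0, 0.0)) -> Utterance:
    """Return the candidate with the highest plausibility for u_j (Equation (3))."""
    return max(candidates,
               key=lambda u_i: cohesion_local(u_i, u_j, speechact, verb, noun, weights))
```

The default weights (1, 0, 0) use the speech act score only; `best_antecedent` raises `ValueError` when the candidate list is empty.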
As a first step, this paper uses only speech act
types in the calculation (i.e., $\lambda_1 = 1$, $\lambda_2 = \lambda_3 = 0$).
This is because the speech act types are more powerful
in finding local cohesion than the verbs or
the nouns, for the following reasons:
rl) The speech act types are independent of 
domain. 
r2) The speech act types are stable, while the 
nouns and the verbs are sometimes omitted in 
utterances in spoken dialogues. 
Thus, Equation (2) is reduced to:
$$\mathrm{cohesion\_local}(U_i, U_j) = \mathrm{cohesion\_speechact}(\mathit{SPEECH\_ACT}_i, \mathit{SPEECH\_ACT}_j) \tag{4}$$
In order to calculate Equation (4), two kinds of information,
answering the following questions, are
required as discourse knowledge:
q1) What expressions in an utterance indicate
a speech act type?
q2) What speech act patterns have local
cohesion?
We automatically acquire this discourse knowledge
from an annotated corpus with local cohesion.
Accordingly, our method is composed
of two processes: 1) identifying the expressions
which indicate a speech act type (called
speech act expressions) in an utterance and 2) calculating
the plausibility of the speech act patterns
by using the dialogue corpus annotated with local
cohesion.
3.2 Identifying Speech Act Expressions in an Utterance
The first process in our method identifies the
speech act expression in each of the utterances by
finding the longest match against a
set of speech act expressions. These expressions can be
collected by automatically extracting fixed expressions
from the ends of utterances,
because the speech act expressions in Japanese
have the following two features:
f1) The speech act expressions form fixed
patterns.
f2) The speech act expressions lie
at the end of the utterance. For example,
"desu" in "Futari desu" (U8) in Figure 1
represents a speech act type "response". We
call these expressions ENDEXPR expressions.
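The longest-match identification described above can be sketched as follows. This is a minimal illustration, assuming romanized utterances whose morphemes are separated by hyphens or spaces; the helper names `morphemes` and `find_endexpr` and the toy expression set are hypothetical and are not taken from the original system.

```python
import re
from typing import List, Optional, Set


def morphemes(text: str) -> List[str]:
    """Split a romanized utterance into lower-cased morphemes (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def find_endexpr(utterance: str, set_endexpr: Set[str]) -> Optional[str]:
    """Return the longest expression of set_endexpr that ends the utterance, or None."""
    tokens = morphemes(utterance)
    patterns = {tuple(morphemes(expr)): expr for expr in set_endexpr}
    # Try the longest suffix first and shorten it one morpheme at a time.
    for start in range(len(tokens)):
        suffix = tuple(tokens[start:])
        if suffix in patterns:
            return patterns[suffix]
    return None


SET_ENDEXPR = {"desu", "desu-ka", "masu-ka", "ka", "itadake-masu-ka"}
print(find_endexpr("Futari desu", SET_ENDEXPR))
print(find_endexpr("Okutte itadake-masu-ka", SET_ENDEXPR))
```

Because suffixes are tried from longest to shortest, "itadake-masu-ka" is preferred over "masu-ka" or "ka" when all three are in the set.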
For the automatic extraction, we use a slight
modification of the cost criteria-based method
[Kita 94], which uses the product of frequency
and length of expressions, because it is easy
to apply to languages such as Japanese whose
writing does not use word delimiters. Kita et
al. extract expressions in descending order of cost-criteria
value. We instead extract them in descending order of length,
among the fixed expressions whose cost-criteria values are above
a certain threshold. For more details of our extraction
method, see [Katoh 95].
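To make the selection rule concrete, the following Python sketch scores utterance-final morpheme sequences by frequency times length, keeps those above a threshold, and returns them longest first. It is a simplified illustration only: the exact cost criterion of [Kita 94] and the modification in [Katoh 95] are not reproduced here, and the function name `extract_end_expressions`, the `max_len` limit and the pre-tokenized input are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def extract_end_expressions(utterances: Sequence[Sequence[str]],
                            threshold: float,
                            max_len: int = 4) -> List[Tuple[str, ...]]:
    """Collect utterance-final morpheme sequences whose cost exceeds the threshold.

    cost = frequency x length (simplified stand-in for the cost criterion).
    Results are ordered longest first, then by descending cost.
    """
    counts: Counter = Counter()
    for tokens in utterances:
        for n in range(1, min(max_len, len(tokens)) + 1):
            counts[tuple(tokens[-n:])] += 1
    cost = {expr: freq * len(expr) for expr, freq in counts.items()}
    kept = [expr for expr, value in cost.items() if value > threshold]
    return sorted(kept, key=lambda expr: (-len(expr), -cost[expr], expr))


toy_corpus = [
    ["futari", "desu"],
    ["sou", "desu"],
    ["nan", "desu", "ka"],
    ["ii", "desu", "ka"],
]
print(extract_end_expressions(toy_corpus, threshold=3))
```

With this toy corpus and a threshold of 3, only the sequence ("desu", "ka") survives, since it occurs twice and has length two.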
When the ENDEXPR expressions, which
are listed in the set of the speech act expressions
(represented as Set-ENDEXPR), are denoted by the
symbol ENDEXPR, we can approximate the
speech act types as SPEECH_ACT = ENDEXPR.
Thus, Equation (4) is transformed to:
$$\begin{aligned}
\mathrm{cohesion\_local}(U_i, U_j) &= \mathrm{cohesion\_speechact}(\mathit{SPEECH\_ACT}_i, \mathit{SPEECH\_ACT}_j) \\
&= \mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j)
\end{aligned} \tag{5}$$
where cohesion_endexpr is a function giving the
plausibility of coherence relations between the
ENDEXPR expressions.
3.3 Calculating the Plausibility of Local Cohesion
The second process is to calculate the plausibility
of local cohesion between utterances from the dialogue
corpus using a statistical method. In this paper,
we define the plausibility of local cohesion,
i.e., Equation (5), as follows:
$$\begin{aligned}
\mathrm{cohesion\_local}(U_i, U_j) &= \mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) \\
&= f(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) - \bar{f}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j)
\end{aligned} \tag{6}$$
where
$$f(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) = P(\mathit{ENDEXPR}_i \cap \mathit{ENDEXPR}_j) \times \log \frac{P(\mathit{ENDEXPR}_i \cap \mathit{ENDEXPR}_j)}{P(\mathit{ENDEXPR}_i) \times P(\mathit{ENDEXPR}_j)}$$
$$\bar{f}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) = \bar{P}(\mathit{ENDEXPR}_i \cap \mathit{ENDEXPR}_j) \times \log \frac{\bar{P}(\mathit{ENDEXPR}_i \cap \mathit{ENDEXPR}_j)}{\bar{P}(\mathit{ENDEXPR}_i) \times \bar{P}(\mathit{ENDEXPR}_j)}$$
The functions $f$ and $\bar{f}$ are modified forms of mutual
information in information theory. We call them
pseudo-mutual information. $P(\cdot)$ is the relative
frequency of two utterances with local cohesion
and $\bar{P}(\cdot)$ is that of two utterances without local
cohesion. $\mathit{ENDEXPR}_i \cap \mathit{ENDEXPR}_j$ means that
$\mathit{ENDEXPR}_j$ appears next to $\mathit{ENDEXPR}_i$. We call
a series of two ENDEXPR expressions (i.e.,
$\mathit{ENDEXPR}_i \cap \mathit{ENDEXPR}_{i+1}$) an ENDEXPR
bigram.
The larger the value of Equation (6) in two
utterances gets, the more plausible the local cohe- 
sion between them becomes. For example, in ap- 
plications in speech recognition, the optimal result 
has the largest plausibility value among several 
candidates obtained from a module of speech pat- 
tern recognition. 
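The pseudo-mutual information of Equation (6) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the original implementation: the marginal probabilities are taken as the relative frequencies of an expression in the first and second position of the bigrams in the same set, the natural logarithm is used, and an unseen bigram contributes 0. The function names `make_pseudo_mi` and `make_cohesion_endexpr` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple

Bigram = Tuple[str, str]


def make_pseudo_mi(bigrams: Sequence[Bigram]) -> Callable[[str, str], float]:
    """Build f(e_i, e_j) = P(e_i, e_j) * log(P(e_i, e_j) / (P(e_i) * P(e_j)))."""
    total = len(bigrams)
    joint = Counter(bigrams)
    first = Counter(a for a, _ in bigrams)    # e_i in first position (assumption)
    second = Counter(b for _, b in bigrams)   # e_j in second position (assumption)

    def f(e_i: str, e_j: str) -> float:
        if total == 0 or joint[(e_i, e_j)] == 0:
            return 0.0  # assumption: unseen bigrams contribute nothing
        p_ij = joint[(e_i, e_j)] / total
        p_i = first[e_i] / total
        p_j = second[e_j] / total
        return p_ij * math.log(p_ij / (p_i * p_j))

    return f


def make_cohesion_endexpr(cohesive: Sequence[Bigram],
                          non_cohesive: Sequence[Bigram]) -> Callable[[str, str], float]:
    """Equation (6): f from cohesive bigrams minus f-bar from non-cohesive bigrams."""
    f = make_pseudo_mi(cohesive)
    f_bar = make_pseudo_mi(non_cohesive)
    return lambda e_i, e_j: f(e_i, e_j) - f_bar(e_i, e_j)


cohesive = [("desu-ka", "desu")] * 3 + [("desu", "desu-ne")]
non_cohesive = [("desu", "desu-ka")] * 2 + [("ka", "desu")]
score = make_cohesion_endexpr(cohesive, non_cohesive)
print(score("desu-ka", "desu") > 0, score("desu", "desu-ka") < 0)
```

In this toy data the question-response pattern scores positive and the reversed pattern scores negative, which is the behavior a threshold on Equation (6) relies on.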
3.4 Smoothing Methods 
Although the statistical approach is easy to implement,
it has a major problem, namely the sparse data
problem. Indeed, Equation (6) gives very small
values in some cases. In order to overcome this
problem (to interpolate the plausibility), we propose
two smoothing techniques.
[Smoothing Method 1]
Interpolate the plausibility by using partial fixed
expressions in Set-ENDEXPR. For example, in
Japanese, "itadake-masu-ka" can be segmented
into the smaller morpheme "masu-ka" or "ka", and
the original morpheme "itadake-masu-ka" is interpolated
by these two morphemes as follows:
$$\begin{aligned}
\mathrm{cohesion\_endexpr}(\text{"itadake-masu-ka"}, \text{"desu"}) = {} & \mu_0\, \mathrm{cohesion\_endexpr}(\text{"itadake-masu-ka"}, \text{"desu"}) \\
& + \mu_{21}\, \mathrm{cohesion\_endexpr}(\text{"masu-ka"}, \text{"desu"}) \\
& + \mu_{31}\, \mathrm{cohesion\_endexpr}(\text{"ka"}, \text{"desu"})
\end{aligned}$$
where $\mu_0$, $\mu_{21}$ and $\mu_{31}$ ($\mu_0 + \mu_{21} + \mu_{31} = 1$)
are nonnegative parameters.
Formally, if we assume the partial fixed expressions
$\mathit{ENDEXPR}_i^p \in$ Set-ENDEXPR and
$\mathit{ENDEXPR}_j^q \in$ Set-ENDEXPR, and represent
"smaller" by the symbol "<" (e.g., "ka" < "masu-ka"
< "itadake-masu-ka"), we can interpolate the
original plausibility by using these smaller morphemes:
$$\begin{aligned}
\mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) = {} & \mu_0\, \mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) \\
& + \sum_{\substack{1 \le p \le m \\ 1 \le q \le n}} \mu_{pq}\, \mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i^p, \mathit{ENDEXPR}_j^q)
\end{aligned}$$
where
$$\mathit{ENDEXPR}_i^1 < \mathit{ENDEXPR}_i^2 < \dots < \mathit{ENDEXPR}_i^m$$
$$\mathit{ENDEXPR}_j^1 < \mathit{ENDEXPR}_j^2 < \dots < \mathit{ENDEXPR}_j^n$$
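Smoothing Method 1 can be illustrated with the following Python sketch. It assumes that partial expressions are the proper hyphen-delimited suffixes of an expression that also belong to Set-ENDEXPR, and that the weight mass left over after $\mu_0$ is shared equally among the partial pairs; both choices, the default `mu0` value, and the names `partial_expressions` and `smoothed_cohesion` are assumptions made for illustration only.

```python
from typing import Callable, List, Set


def partial_expressions(expr: str, set_endexpr: Set[str]) -> List[str]:
    """Proper suffixes of expr (in morphemes) that are listed in set_endexpr."""
    parts = expr.split("-")
    suffixes = ("-".join(parts[k:]) for k in range(1, len(parts)))
    return [s for s in suffixes if s in set_endexpr]


def smoothed_cohesion(e_i: str, e_j: str,
                      base: Callable[[str, str], float],
                      set_endexpr: Set[str],
                      mu0: float = 0.7) -> float:
    """Interpolate base(e_i, e_j) with the scores of smaller expression pairs."""
    chain_i = [e_i] + partial_expressions(e_i, set_endexpr)
    chain_j = [e_j] + partial_expressions(e_j, set_endexpr)
    partial_pairs = [(a, b) for a in chain_i for b in chain_j if (a, b) != (e_i, e_j)]
    if not partial_pairs:
        return base(e_i, e_j)  # nothing to interpolate with
    mu_pq = (1.0 - mu0) / len(partial_pairs)  # assumption: uniform remaining weights
    return mu0 * base(e_i, e_j) + sum(mu_pq * base(a, b) for a, b in partial_pairs)


SET_ENDEXPR_1 = {"itadake-masu-ka", "masu-ka", "ka", "desu"}
toy_scores = {("masu-ka", "desu"): 0.2, ("ka", "desu"): 0.4}
toy_base = lambda a, b: toy_scores.get((a, b), 0.0)
print(smoothed_cohesion("itadake-masu-ka", "desu", toy_base, SET_ENDEXPR_1))
```

Here the pair ("itadake-masu-ka", "desu") has no evidence of its own, so its smoothed score comes entirely from the "masu-ka" and "ka" pairs.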
[Smoothing Method 2]
Interpolate the plausibility by using the speech act
types themselves. For example, "itadake-masu-ka
(requirement)" and "desu (response)" are interpolated
by the relation of the speech act types (not
the speech act expressions), i.e., "requirement-response",
as follows:
$$\begin{aligned}
\mathrm{cohesion\_endexpr}(\text{"itadake-masu-ka"}, \text{"desu"}) = {} & \mu_0\, \mathrm{cohesion\_endexpr}(\text{"itadake-masu-ka"}, \text{"desu"}) \\
& + \mu_1\, \mathrm{cohesion\_speechact\_type}(\text{requirement}, \text{response})
\end{aligned}$$
where $\mu_0$ and $\mu_1$ ($\mu_0 + \mu_1 = 1$) are nonnegative
parameters.
These speech act types are automatically constructed
by clustering ENDEXPRs based on
ENDEXPR bigrams, and then the type bigrams are
re-calculated from the ENDEXPR bigrams.
Formally, when the speech act types are denoted
by SACT_TYPE, we can interpolate an
original plausibility by using these type patterns:
$$\begin{aligned}
\mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) = {} & \mu_0\, \mathrm{cohesion\_endexpr}(\mathit{ENDEXPR}_i, \mathit{ENDEXPR}_j) \\
& + \mu_1\, \mathrm{cohesion\_speechact\_type}(\mathit{SACT\_TYPE}_a, \mathit{SACT\_TYPE}_b)
\end{aligned}$$
where
$$\mathit{ENDEXPR}_i \in \mathit{SACT\_TYPE}_a, \quad \mathit{ENDEXPR}_j \in \mathit{SACT\_TYPE}_b$$
and cohesion_speechact_type is a function giving the
plausibility of coherence relations between the
speech act types.
The former method must use the $n \times m$ parameters
(i.e., $\mu_{pq}$) and the latter one must produce the
speech act types. We chose the former method for
our first experiments, because it was easier to
implement.
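For comparison, Smoothing Method 2 can be sketched in Python as follows. The clustering step that produces the speech act types is not shown; the sketch assumes a ready-made mapping from ENDEXPR expressions to types, and the names `to_type_bigrams`, `smoothed_by_type` and the default `mu0` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Bigram = Tuple[str, str]


def to_type_bigrams(bigrams: Sequence[Bigram], type_of: Dict[str, str]) -> List[Bigram]:
    """Re-calculate type bigrams from ENDEXPR bigrams, skipping unmapped expressions."""
    return [(type_of[a], type_of[b]) for a, b in bigrams
            if a in type_of and b in type_of]


def smoothed_by_type(e_i: str, e_j: str,
                     base: Callable[[str, str], float],
                     type_score: Callable[[str, str], float],
                     type_of: Dict[str, str],
                     mu0: float = 0.7) -> float:
    """Interpolate the expression-level score with the type-level score."""
    if e_i not in type_of or e_j not in type_of:
        return base(e_i, e_j)  # no type known: fall back to the original score
    mu1 = 1.0 - mu0
    return mu0 * base(e_i, e_j) + mu1 * type_score(type_of[e_i], type_of[e_j])


TYPE_OF = {"itadake-masu-ka": "requirement", "desu": "response", "desu-ka": "question"}
print(to_type_bigrams([("desu-ka", "desu"), ("hai", "desu")], TYPE_OF))
```

Compared with Smoothing Method 1, only two weights are needed, at the price of having to construct the speech act types first.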
4 Experiments 
A series of experiments was done to evaluate our
method. The experiments were carried out to decide
whether one utterance and the next one have
local cohesion with each other or not. The results
of this decision can be used to segment a dialogue
into several subdialogues.
First, Set-ENDEXPR was constructed automatically
by our extraction method for fixed expressions
from the ATR speech and language database
[Morimoto 94]. The database includes
about 600 task-oriented dialogues concerned with
several domains, such as hotel reservation and flight
cancellation, and each of the dialogues
includes about 50 utterances on average. We
extracted about one hundred fixed expressions
from the database by the extraction method. Table
1 shows examples of the fixed expressions.
Table 1. Examples of results by the extraction
method

the end of expressions    (speech act type)
arigatou-gozai-mashita    (thank-you)
itadake-masu-ka           (requirement)
onegai-shimasu            (requirement)
de-shou-ka                (question)
masu-ka                   (question)
desu-ka                   (question)
su-ka                     (question)
desu                      (response)
ka                        (question)

Secondly, we chose 60 dialogues from the ATR
database and annotated them with local cohesion
by hand, as shown in Figure 1 in
Section 2. Then, six dialogues were taken from the
60 dialogues to use in testing for the open data,
and the rest of the dialogues (54 dialogues) were
used to calculate ENDEXPR bigrams. Moreover,
six dialogues were taken from the 54 dialogues for
the closed data. Using the 54 dialogues, the
ENDEXPR bigrams were produced in the following
four parts:
An ENDEXPR bigram
part A : with local cohesion + turn-taking.
part B : with local cohesion + no turn-taking.
part C : without local cohesion + turn-taking.
part D : without local cohesion + no turn-taking.
where "turn-taking" means that an utterance and
the next one are produced by different persons and
"no turn-taking" means that they are produced by the
same person.
Table 2 shows examples of ENDEXPR bigrams
with local cohesion.

Table 2a. Examples of ENDEXPR bigrams with local cohesion + turn-taking (part A)

relative frequency    ENDEXPR bigrams
.01307                wo-onegai-shimasu  kashikomari-mashita
.01226                deshou-ka  desu
.01144                de-onegai-shimasu  kashikomari-mashita
.00735                desu-ka  desu
.00735                desu  de-gozai-masu-ne

Table 2b. Examples of ENDEXPR bigrams with local cohesion + no turn-taking (part B)

relative frequency    ENDEXPR bigrams
.03063                desu  desu
.01685                imasu  desu
.01225                shoushou-omachi-kudasai  omatase-itashimashita
.01225                hai  desu
.01072                desu-ne  desu-ne
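The division of ENDEXPR bigrams into parts A to D can be sketched as follows. The annotation format is an assumption made for illustration: each utterance carries its speaker, its ENDEXPR expression and a hand-assigned flag saying whether it has local cohesion with the next utterance. The names `AnnotatedUtterance` and `split_endexpr_bigrams` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class AnnotatedUtterance(NamedTuple):
    speaker: str
    endexpr: str
    coheres_with_next: bool  # hand annotation: local cohesion with the next utterance


def split_endexpr_bigrams(
        dialogue: Sequence[AnnotatedUtterance]) -> Dict[str, List[Tuple[str, str]]]:
    """Sort adjacent ENDEXPR bigrams into parts A-D by cohesion and turn-taking."""
    parts: Dict[str, List[Tuple[str, str]]] = {"A": [], "B": [], "C": [], "D": []}
    for cur, nxt in zip(dialogue, dialogue[1:]):
        turn_taking = cur.speaker != nxt.speaker
        if cur.coheres_with_next:
            key = "A" if turn_taking else "B"
        else:
            key = "C" if turn_taking else "D"
        parts[key].append((cur.endexpr, nxt.endexpr))
    return parts


toy_dialogue = [
    AnnotatedUtterance("clerk", "onegai-itashimasu", True),
    AnnotatedUtterance("guest", "desu", True),
    AnnotatedUtterance("clerk", "desu-ne", True),
    AnnotatedUtterance("guest", "desu", False),
    AnnotatedUtterance("clerk", "de-shou-ka", False),
]
print(split_endexpr_bigrams(toy_dialogue))
```

Relative frequencies such as those in Table 2 would then be obtained by counting the bigrams within each part.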
Using these ENDEXPR bigrams, a series of experiments
was done for the closed data and the
open data. In the experiments, we judged two
utterances to have local cohesion if their
plausibility was above a certain threshold, and we
chose Smoothing Method 1 under the condition
that $\mu_0 = 0.7$, $\sum_{1 \le p \le m,\, 1 \le q \le n} \mu_{pq} = 0.7$ and $\mu_{pq} = \frac{1}{m \times n}$.
Table 3 shows the accuracy of recognizing local 
cohesion in three cases: the turn-taking case, the 
no turn-taking case and the total case. 
Table 3. The accuracy of recognizing local
cohesion

data         method                          turn-taking   no turn-taking   total
closed data  Our method without smoothing    90.6%         93.9%            92.3%
closed data  Our method with smoothing       93.6%         93.9%            93.8%
closed data  Default method                  74.8%         54.2%            64.7%
open data    Our method without smoothing    77.4%         70.9%            73.9%
open data    Our method with smoothing       81.8%         75.5%            78.4%
open data    Default method                  76.6%         50.4%            65.3%
In Table 3, the "default method" assumes that all
of the pairs of utterances in a dialogue have local
cohesion, and its accuracy was calculated as:
$$\text{accuracy of the "default method"} = \frac{\text{number of pairs of utterances with local cohesion}}{\text{total number of pairs of utterances in a dialogue}}$$
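The evaluation can be illustrated with a short Python sketch that compares threshold-based decisions with the "default method". The function names, the toy scores and the threshold value are hypothetical; only the definition of the default accuracy follows the formula above.

```python
from typing import List, Sequence


def threshold_predictions(scores: Sequence[float], threshold: float) -> List[bool]:
    """Judge a pair to have local cohesion when its plausibility exceeds the threshold."""
    return [score > threshold for score in scores]


def accuracy(predictions: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of utterance pairs whose predicted label matches the annotation."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def default_accuracy(gold: Sequence[bool]) -> float:
    """Accuracy of assuming that every pair of utterances has local cohesion."""
    return accuracy([True] * len(gold), gold)


gold_labels = [True, True, False, True]   # toy annotation, not corpus data
toy_plausibility = [0.30, 0.10, -0.20, 0.05]
print(default_accuracy(gold_labels))
print(accuracy(threshold_predictions(toy_plausibility, 0.0), gold_labels))
```

The default accuracy equals the proportion of cohesive pairs in the data, which is why it serves as a baseline for the figures in Table 3.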
As shown in Table 3, the accuracy of our method 
was higher than that of the "default method". Using
the smoothing method, we obtained a 93.8%
accuracy for the closed data and a 78.4%
accuracy for the open data.
5 Conclusion 
We described our statistical method of recogniz- 
ing local cohesion between utterances, based on 
pseudo-mutual information. We focused on 
speech act expressions in calculating the plausibil- 
ity of local cohesion between two utterances. The 
results of the first experiments showed a 93% ac- 
curacy for closed data and a 78% accuracy for 
open data. We conclude that the presented method 
will be effective for recognizing local cohesion 
between two utterances. 
To improve our method, we will use the co- 
herence relations in the verbs and set of nouns and 
will use a larger corpus with local cohesion. We 
plan to apply the method to speech recognition in 
a speech-to-speech machine translation system. 
References 
[Hauptmann 88] Hauptmann, A.G. et al.: Using
Dialog-Level Knowledge Sources to Improve
Speech Recognition, Proceedings of AAAI-88,
pp. 729-733, 1988.
[Kudo 90] Kudo, I.: Local Cohesive Knowledge
for a Dialogue-machine Translation System,
Proceedings of COLING-90, Helsinki, Vol. 3, pp.
391-393, 1990.
[Katoh 95] Katoh, N. and Morimoto, T.: Statistical
Approach to Discourse Processing, Proceedings
of SIG-SLUD-9502-3 of JSAI, pp. 16-23,
1995. (in Japanese)
[Kita 94] Kita, K. et al.: Application of Corpora in
Second Language Learning - The Problem of
Collocational Knowledge Acquisition, Proceedings
of WVLC-94, pp. 43-56, 1994.
[Nagata 94] Nagata, M. and Morimoto, T.: An
Information-Theoretic Model of Discourse for Next
Utterance Type Prediction, Trans. of IPSJ, Vol.
35, No. 6, pp. 1050-1061, 1994.
[Reithinger 95] Reithinger, N. and Maier, E.: Utilizing
Statistical Dialogue Act Processing in
Verbmobil, ACL-95, pp. 116-121, 1995.
[Yamaoka 91] Yamaoka, T. and Iida, H.: Dialogue
Interpretation Model and its Application to
Next Utterance Prediction for Spoken Language
Processing, EUROSPEECH-91, pp. 849-852,
1991.
[Young 91] Young, S. and Matessa, Y.: Using
Pragmatic and Semantic Knowledge to Correct
Parsing of Spoken Language Utterances, Proceedings
of EUROSPEECH-91, pp. 223-227, 1991.
