An Efficient Statistical Speech Act Type Tagging System for 
Speech Translation Systems 
Hideki Tanaka and Akio Yokoo 
ATR Interpreting Telecommunications Research Laboratories 
2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0288, Japan 
{t anakah I ayokoo}©itl, atr. co. jp 
Abstract 
This paper describes a new efficient speech 
act type tagging system. This system cov- 
ers the tasks of (1) segmenting a turn into 
the optimal number of speech act units 
(SA units), and (2) assigning a speech act 
type tag (SA tag) to each SA unit. Our 
method is based on a theoretically clear 
statistical model that integrates linguistic, 
acoustic and situational information. We 
report tagging experiments on Japanese 
and English dialogue corpora manually la- 
beled with SA tags. We then discuss the 
performance difference between the two 
languages. We also report on some trans- 
lation experiments on positive response 
expressions using SA tags. 
1 Introduction 
This paper describes a statistical speech act type 
tagging system that utilizes linguistic, acoustic and 
situational features. This work can be viewed as a 
study on automatic "Discourse Tagging" whose ob- 
jective is to assign tags to discourse units in texts or 
dialogues. Discourse tagging is studied mainly from 
two different viewpoints, i.e., linguistic and engineer- 
ing viewpoints. The work described here belongs to 
the latter group. More specifically, we are interested 
in automatically recognizing the speech act types of 
utterances and in applying them to speech transla- 
tion systems. 
Several studies on discourse tagging to date have 
been motivated by engineering applications. The 
early studies by Nagata and Morimoto (1994) and 
Reithinger and Maier (1995) showed the possibility 
of predicting dialogue act tags for next utterances 
with statistical methods. These studies, however, 
presupposed properly segmented utterances, which 
is not a realistic assumption. In contrast to this 
assumption, automatic utterance segmentation (or 
discourse segmentation) is desired here. 
Discourse segmentation in linguistics, whether 
manual or automatic, has also received keen atten- 
tion because such segmentation provides the founda- 
tion of higher discourse structures (Grosz and Sid- 
net, 1986). 
Discourse segmentation has also received keen at- 
tention from the engineering side because the nat- 
ural language processing systems that follow the 
speech recognition system are designed to accept lin- 
guistically meaningful units (Stolcke and Shriberg, 
1996). There has been a lot of research following 
this line such as (Stolcke and Shriberg, 1996) (Cet- 
tolo and Falavigna, 1998), to only mention a few. 
We can take advantage of these studies as a pre- 
process for tagging. In this paper, however, we pro- 
pose a statistical tagging system that optimally per- 
forms segmentation and tagging at the same time. 
Previous studies like (Litman and Passonneau, 1995) 
have pointed out that the use of a multiple informa- 
tion source can contribute to better segmentation 
and tagging, and so our statistical model integrates 
linguistic, acoustic and situational information. 
The problem can be formalized as a search prob- 
lem on a word graph, which can be efficiently han- 
dled by an extended dynamic programming algo- 
rithm. Actually, we can efficiently find the optimal 
solution without limiting the search space at all. 
The results of our tagging experiments involving 
both Japanese and English corpora indicated a high 
performance for Japanese but a considerably lower 
performance for the English corpora. This work 
also reports on the use of speech act type tags for 
translating Japanese and English positive response 
expressions. Positive responses quite often appear 
in task-oriented dialogues like those in our tasks. 
They are often highly ambiguous and problematic 
in speech translation. We will show that these ex- 
pressions can be effectively translated with the help 
of dialogue information, which we call speech act 
type tags. 
2 The Problems 
In this section, we briefly explain our speech act type 
tags and the tagged data and then formally define 
the tagging problem. 
381 
2.1 Data and Tags 
The data used in this study is a collection of tran- 
scribed dialogues on a travel arrangement task be- 
tween Japanese and English speakers mediated by 
interpreters (Morimoto et al., 1994). The tran- 
scriptions were separated by language, i.e., En- 
glish and Japanese, and the resultant two corpora 
share the same content. Both transcriptions went 
through morphological analysis, which was manually 
checked. The transcriptions have clear turn bound- 
aries (TB's). 
Some of the Japanese and English dialogue files 
were manually segmented into speech act units (SA 
units) and assigned with speech act type tags (SA 
tags). The SA tags represent a speaker's intention 
in an utterance, and is more or less similar to the 
traditional illocutionary force type (Searle, 1969). 
The SA tags for the Japanese language were based 
on the set proposed by Seligman et al. (1994) and 
had 29 types. The English SA tags were based on 
the Japanese tags, but we redesigned and reduced 
the size to 17 types. We believed that an excessively 
detailed tag classification would decrease the inter- 
coder reliability and so pruned some detailed tags) 
The following lines show an example of the English 
tagged dialogues. Two turns uttered by a hotel clerk 
and a customer were Segmented into SA units and 
assigned with SA tags. 
<clerk's turn> 
Hello, (expressive) 
New York City Hotel, (inform) 
may I help you ? (offer) 
<customer(interpreter)'s turn> 
Hello, (expressive) 
my name is Hiroko Tanaka (inform) 
and I would like to make a reservation for 
a room at your hotel. (desire) 
The tagging work to the dialogue was conducted 
by experts who studied the tagging manual before- 
hand. The manual described the tag definitions 
and turn segmentation strategies and gave examples. 
The work involved three experts for the Japanese 
corpus and two experts for the English corpus. 2 
The result was checked and corrected by one ex- 
pert for each language. Therefore, since the work 
was done by one expert, the inter-coder tagging in- 
stability was suppressed to a minimum. As the re- 
sult of the tagging, we obtained 95 common dialogue 
files with SA tags for Japanese and English and used 
them in our experiments. 
1Japanese tags, for example, had four tags mainly 
used for dialogue endings: thank, offer-follow-up, good- 
wishes, and farewell, most of which were reduced to ex- 
pressive in English. 
2They did not listen to the recorded sounds in either 
case. 
2.2 Problem Formulation 
Our tagging system assumes an input of a word se- 
quence for a dialogue produced by a speech recog- 
nition system. The word sequence is accompanied 
with clear turn boundaries. Here, the words do not 
contain any punctuation marks. The word sequence 
can be viewed as a sequence of quadruples: 
"'" (Wi-1, li-1, ai-1, si-1), (wi, li, ai, 8i)... 
where wi represents a surface wordform, and each 
vector represents the following additional informa- 
tion for wi. 
li: canonical form and part of speech of 
wi (linguistic feature) 
ai: pause duration measured milliseconds 
after wi (acoustic feature) 
si: speaker's identification for wi such as 
clerk or customer (situational feature) 
Therefore, an utterance like Hello I am John Phillips 
and ... uttered by a cuslomer is viewed as a sequence 
like 
(Hello, (hello, INTER), 100, customer), 
(I,(i, PRON),0, customer)), (am, (be, 
BE), 0, customer) .... 
From here, we will denote a word sequence as W = 
wl, w2, .. • wi, .. •, Wn for simplicity. However, note 
that W is a sequence of quadruples as described 
above. 
The task of speech act type tagging in this pa- 
per covers two tasks: (1) segmentation of a word 
sequence into the optimal number of SA units, and 
(2) assignment of an SA tag to each SA unit. Here, 
the input is a word sequence with clear TB's, and 
our tagger takes each turn as a process unit. 3 
In this paper, an SA unit is denoted as u and the 
sequence is denoted as U. An SA tag is denoted as 
e represents t and the sequence is denoted as T. x s 
a sequence of x starting from s to e. Therefore, 
represents a tag sequence from 1 to j. 
The task is now formally addressed as follows: 
find the best SA unit sequence U and tag sequence 
T for each turn when a word sequence W with clear 
TB's is given. We will treat this problem with the 
statistical model described in the next section. 
3 Statistical Model 
The problem addressed in Section 2 can be formal- 
ized as a search problem in a word graph that holds 
all possible combinations of SA units in a turn. We 
take a probabilistie approach to this problem, which 
formalizes it as finding a path (U,T) in the word 
graph that maximizes the probability P(U, T I W). 
3Although we do not explicitly represent TB's in a 
word sequence in the following discussions, one might 
assume virtual TB markers like @ in the word sequence. 
382 
This is formally represented in equation (1). This 
probability is naturally decomposed into the prod- 
uct of two terms as in equation (3). The first prob- 
ability in equation (3) represents an arbitrary word 
sequence constituting one SA unit ui, given hj (the 
history of SA units and tags from the beginning of 
a dialogue, hj = uJ-l,t j-l) and input W. The sec- 
ond probability represents the current SA unit u i 
bearing a particular SA tag tj, given uj, hi, and 
W. 
(U,T) = argmaxP(U,T I w), (1) 
U,T 
k 
P(uj,tj I hi, W), = argmax H (2) 
U,T j=l 
k 
_- argm x l\] P(ui I hi, W) 
U,T j=l 
x P(tj I uj, hi, W). (3) 
We call the first term "unit existence probability" 
Ps and the second term "tagging probability" PT. 
Figure 1 shows a simplified image of the probability 
calculation in a word graph, where we have finished 
processing the word sequence of w~ -1 
Now, we estimate the probability for the word se- 
quence w~ +p-1 constituting an SA unit uj and hav- 
ing a particular SA tag tj. Because of the problem of 
sparse data, these probabilities are hard to directly 
estimate from the training corpus. We will use the 
following approximation techniques. 
3.1 Unit Existence Probability 
The probability of unit existence PE is actually 
equivalent to the probability that the word sequence 
w~,..., w,+p-1 exists as one SA unit given h i and 
W (Fig. 1). 
We then approximate PE by 
PE ~-- P(B~,_I,w, = l l hj, W) 
xP(B~.+,,_,,w,.,, = 1 I hi, W) 
s+p--2 
x H P(Bw,-,~+I = 0 I hi,W), (4) 
ITl:$ 
where the random variable Bw=,,~=+l takes the bi- 
nary values 1 and 0. A value of 1 corresponds to the 
existence of an SA unit boundary between wx and 
w=+l, and a value of 0 to the non-existence of an SA 
unit boundary. PE is approximated by the product 
of two types of probabilities: for a word sequence 
break at both ends of an SA unit and for a non- 
break inside the unit. Notice that the probabilities 
of the former type adjust an unfairly high probabil- 
ity estimation for an SA unit that is made from a 
short word sequence. 
The estimation of PE is now reduced to that of 
P(Bw=,w~+l I hi, W). This probability is estimated 
by a probabilistic decision tree and we have 
P(Bw=,Wx+, I hi, W) ~- P(Bw .... +1 I eE(hj, W)), 
where riPE is a decision tree that categorizes hj, W 
into equivalent classes (Jelinek, 1997). We modi- 
fied C4.5 (Quinlan, 1993) style algorithm to produce 
probability and used it for this purpose. The deci- 
sion tree is known to be effective for the data sparse- 
ness problem and can take different types of parame- 
ters such as discrete and continuous values, which is 
useful since our word sequence contains both types 
of features. 
Through preliminary experiments, we found that 
hj (the past history of tagging results) was not useful 
and discarded it. We also found that the probability 
was well estimated by the information available in a 
short range of r around w=, which is stored in W. 
Actually, the attributes used to develop the tree were 
at~X-\]-7* in W' = ~-r+l" *+r • surface wordforms for ~=-~+1, 
z+r and the pause duration parts of speech for wx_r+l, 
between wx and w=+l. The word range r was set 
from 1 to 3 as we will report in sub-section 5.3. 
As a result, we obtained the final form of PE as 
PE ~-- P(Bw .... ~, = 1 \[~s(W')) 
x P(B~,+p_,,~,+p = 1 \[ ~s(W')) 
s+p-2 
× H P(S~,,.w~,+ 1 = 01~E(W'))(5) 
m:$ 
3.2 Tagging Probability 
The tagging probability PT was estimated by the 
following formula utilizing a decision tree eT- Two 
functions named f and g were also utilized to extract 
information from the word sequence in uj. 
PT ~-- P(tj J ff2T(f(uj),g(uj),tj_l,...,tj_m)) (6) 
As this formula indicates, we only used information 
available with the uj and m histories of SA tags in 
hi. The function f(uj) outputs the speaker's identi- 
fication of uj. The function g(uj) extracts cue words 
for the SA tags from uj using a cue word list. The 
cue word list was extracted from a training corpus 
that was manually labeled with the SA tags. For 
each SA tag, the 10 most dependent words were ex- 
tracted with a x2-test. After converting these into 
canonical forms, they were conjoined. 
To develop a statistical decision tree, we used an 
input table whose attributes consisted of a cue word 
list, a speaker's identification, and m previous tags. 
The value for each cue word was a binary value, 
where 1 was set when the utterance uj contained 
the word, or otherwise 0. The effect of f(uj), g(uj), 
and length m for the tagging performance will be 
reported in sub-section 5.3. 
4 Search Method 
A search in a word graph was conducted using the 
extended dynamic programming technique proposed 
383 
hj history turn boundary current process front 
o-----o o \] ~.~ Uj-l' (i-1 ~ uj, (\] - - - O<::>IO .... C:) - C:> 0 .... CD 
Wl Ws-1 | Ws Ws+l Ws+p-1 |Ws+p Wn 
W word sequence for a dialogue 
Figure 1: Probability calculation. 
by Nagata (1994). This algorithm was originally de- 
veloped for a statistical Japanese morphological an- 
alyzer whose tasks are to determine boundaries in an 
input character sequence having no separators and 
to give an appropriate part of speech tag to each 
word, i.e., a character sequence unit. This algorithm 
can handle arbitrary lengths of histories of pos tags 
and words and efficiently produce n-best results. 
We can see a high similarity between our task and 
Japanese morphological analysis. Our task requires 
the segmentation of a word sequence instead of a 
character sequence and the assignment of an SA tag 
instead of a pos tag. 
The main difference is that a word dictionary is 
available with a morphological analyzer. Thanks to 
its dictionary, a morphological analyzer can assume 
possible morpheme boundaries. 4 Our tagger, on 
the other hand, has to assume that any word se- 
quence in a turn can constitute an SA unit in the 
search. This difference, however, does not require 
any essential change in the search algorithm. 
5 Tagging Experiments 
5.1 Data Profile 
We have conducted several tagging experiments on 
both the Japanese and English corpora described in 
sub-section 2.1. Table 1 shows a summary of the 
95 files used in the experiments. In the experiments 
described below, we used morpheme sequences for 
input instead of word sequences and showed the cor- 
responding counts. 
The average number of SA units per turn was 
2.68 for Japanese and 2.31 for English. The aver- 
age number of boundary candidates per turn was 
18 for Japanese and 12.7 for English. The number 
of tag types, the average number of SA units, and 
the average number of SA boundary candidates in- 
dicated that the Japanese data were more difficult 
to process. 
4Als0, the probability for the existence of a word can 
be directly estimated from the corpus. 
Table 1: Counts in both corpora. 
Counts Japanese English 
Turn 2,020 2,020 
SA unit 5,416 4,675 
Morpheme 38,418 27,639 
POS types 30 33 
SA tag type 29 17 
5.2 Evaluation Methods 
We used "labeled bracket matching" for evalua- 
tion (Nagata, 1994). The result of tagging can be 
viewed as a set of labeled brackets, where brack- 
ets correspond to turn segmentation and their labels 
correspond to SA tags. With this in mind, the eval- 
uation was done in the following way. We counted 
the number of brackets in the correct answer, de- 
noted as R (reference). We also counted the num- 
ber of brackets in the tagger's output, denoted as 
S (system). Then the number of matching brackets 
was counted and denoted as M (match). Thus, we 
could define the precision rate with M/S and the 
recall rate with M/R. 
The matching was judged in two ways. One was 
"segmentation match": the positions of both start- 
ing and ending brackets (boundaries) were equal. 
The other was "segmentation+tagging match": the 
tags of both brackets were equal in addition to the 
segmentation match. 
The proposed evaluation simultaneously con- 
firmed both the starting and ending positions of an 
SA unit and was more severe than methods that only 
evaluate one side of the boundary of an SA unit. 
Notice that the precision and recall for the segmen- 
tation+tagging match is bounded by those of the 
segmentation match. 
5.3 Tagging Results 
The total tagging performance is affected by the two 
probability terms PE and PT, both of which contain 
the parameters in Table 2. To find the best param- 
384 
Table 2: Parameters in probability terms. 
PE PT x+r 
Wx-r+l 
r: word range 
f(uj): speaker of uj 
g(uj): cue words in uj 
tj-1 ... tj_,~ : previous SA tags 
Table 4: T-scores for segmentation accuracies. 
Recall Precision 
A B C A B C 
B 2.84 - - B 1.25 - - 
C 2.71 0.12 - C 0.83 0.44 - 
D 2.57 0.28 0.17 D 0.74 0.39 0.01 
Table 3: Average accuracy for segmentation match. 
Parameter Recall rate % Precision rate % 
A 89.50 91.99 
B 91.89 92.92 
C 92.00 92.57 
D 92.20 92.58 
Table 5: Average accuracy for seg.+tag, match. 
Parameter Recall rate % Precision rate % 
E 72.25 72.70 
F 74.91 75.35 
G 74.83 75.29 
H 74.50 74.96 
eter set and see the effect of each parameter, we 
conducted the following two types of experiments. 
I Change the parameters for PE with fixed pa- 
rameters for PT 
The effect of the parameters in PE was mea- 
sured by the segmentation match. 
II Change the parameters for PT with fixed pa- 
rameters for PE 
The effect of the parameters in PT was mea- 
sured by the segmentation+tagging match. 
Now, we report the details with the Japanese set. 
5.3.1 Effects of DE with Japanese Data 
We fixed the parameters for PT as f(uj), g(uj), 
tj-1, i.e., a speaker's identification, cue words in the 
current SA unit, and the SA tag of the previous SA 
unit. The unit existence probability was estimated 
using the following parameters. 
(A): Surface wordforms and pos's ofw~ +1, i.e., word 
range r = 1 
(B): Surface wordforms and pos's of w x+2 i.e., word x-i, 
range r ---- 2 
(C): (h) with a pause duration between wx, Wx+l 
(D): (U) with a pause duration between wx, wx+l 
Under the above conditions, we conducted 10-fold 
cross-validation tests and measured the average re- 
call and precision rates in the segmentation match, 
which are listed in Table 3. 
We then conducted l-tests among these average 
scores. Table 4 shows the l-scores between different 
parameter conditions. In the following discussions, 
we will use the following l-scores: t~=0.0~5(18) -- 
2.10 and t~=0.05(18) = 1.73. 
We can note the following features from Tables 3 
and 4. 
• recall rate 
(B), (C), and (D) showed statistically signif- 
icant (two-sided significance level of 5%, i.e., 
t > 2.10) improvement from (A). (D) did not 
show significant improvement from either (B) 
nor (C). 
• precision rate 
Although (n) and (C) did not improve from 
(A) with a high statistical significance, we can 
observe the tendency of improvement. (D) did 
not show a significant difference from (B) or 
(C). 
We can, therefore, say that (B) and (C) showed 
equally significant improvement from (A): expansion 
of the word range r from I to 2 and using pause infor- 
mation with word range 1. The combination of word 
range 2 and pause (D), however, did not show any 
significant differences from (B) or (C). We believe 
that the combination resulted in data sparseness. 
5.3.2 Effects of PT with Japanese Data 
For the Type II experiments, we set the parame- 
ters for PE as condition (C): surface wordforms and 
pos's of wx TM and a pause duration between w~ and 
w~+l. Then, PT was estimated using the following 
parameters. 
(E): Cue words in utterance uj, i.e., g(uj) 
(F): (S) with tj_ 1 
(G): (E) with tj_l and tj_2 
(H): (E) with tj-1 and a speaker's identification 
f(uj) 
The recall and precision rates for the segmenta- 
tion÷tagging match were evaluated in the same way 
as in the previous experiments. The results are 
shown in Table 5. The l-scores among these param- 
eter setting are shown in Table 6. We can observe 
the following features. 
• recall rate 
(F) and (G) showed an improvement from (E) 
with a two-sided significance level of 10% (1 > 
385 
Table 6: T-scores for seg.+tag, accuracies. 
Recall Precision 
E F G E F G 
F 1.87 - - F 1.97 - - 
G 1.78 0.05 - G 1.90 0.04 - 
H 1.50 0.26 0.21 H 1.60 0.28 0.24 
1.73). However, (G) and (H) did not show sig- 
nificant improvements from (F). 
• precision rate 
Same as recall rate. 
Here, we can say that tj-1 together with the cue 
words (F) played the dominant role in the SA tag 
assignment, and the further addition of history tj-2 
(G) or the speaker's identification f(uj) (H) did not 
result in significant improvements. 
5.3.3 Summary of Japanese Tagging 
Experiments 
As a concise summary, the best recall and preci- 
sion rates for the segmentation match were obtained 
with conditions (n) and (C): approximately 92% 
and 93%, respectively. The best recall and preci- 
sion rates for the segmentation+tagging match were 
74.91% and 75.35 %, respectively (Table 5 (F)). We 
consider these figures quite satisfactory considering 
the severeness of our evaluation scheme. 
5.3.4 English Tagging Experiment 
We will briefly discuss the experiments with En- 
glish data. The English corpus experiments were 
similar to the Japanese ones. For the SA unit seg- 
mentation, we changed the word range r from 1 to 
3 while fixing the parameters for PT to (H), where 
we obtained the best results with word range r --- 2, 
i.e., (B). The recall rate was 71.92% and the preci- 
sion rate was 78.10%. 5 
We conducted the exact same tagging experi- 
ments as the Japanese ones by fixing the parame- 
ter for PE to (B). Experiments with condition (H) 
showed the best score: the recall rate was 53.17% 
and the precision rate was 57.75%. We obtained 
lower performance than that for Japanese. This was 
somewhat surprising since we thought English would 
be easier to process. The lower performance in seg- 
mentation affected the total tagging performance. 
We will further discuss the difference in section 7. 
6 Application of SA tags to speech 
translation 
In this section, we will briefly discuss an application 
of SA tags to a machine translation task. This is one 
~Experiments with pause information were not 
conducted. 
of the motivations of the automatic tagging research 
described in the previous sections. We actually dealt 
with the translation problem of positive responses 
appearing in both Japanese and English dialogues. 
Japanese positive responses like Hat 
and Soudesuka, and the English ones like Yes and 
I see appear quite often in our corpus. Since our di- 
alogues were collected from the travel arrangement 
domain, which can basically be viewed as a sequence 
of a pair of questions and answers, they naturally 
contain many of these expressions. 
These expressions are highly ambiguous in word- 
sense. For example, Hai can mean Yes (accept), Uh 
huh (acknowledgment), hello (greeting) and so on. 
Incorrect translation of the expression could confuse 
the dialogue participants. These expressions, how- 
ever, are short and do not contain enough clues for 
proper translation in themselves, so some other con- 
textual information is inevitably required. 
We assume that SA tags can provide such neces- 
sary information since we can distinguish the trans- 
lations by the SA tags in the parentheses in the 
above examples. 
We conducted a series of experiments to verify 
if positive responses can be properly translated us- 
ing SA tags with other situational information. We 
assumed that SA tags are properly given to these ex- 
pressions and used the manually tagged corpus de- 
scribed in Table 1 for the experiments. 
We collected Japanese positive responses from the 
SA units in the corpus. After assigning an En- 
glish translation to each expression, we categorized 
these expressions into several representative forms. 
For example, the surface Japanese expression Ee, 
Kekkou desu was categorized under the representa- 
tive form Kekkou. 
We also made such data for English positive re- 
sponses. The size of the Japanese and English data 
in representative forms (equivalent to SA unit) is 
shown in Table 7. Notice that 1,968 out of 5,416 
Japanese SA units are positive responses and 1,037 
out of 4,675 English SA units are positive responses. 
The Japanese data contained 16 types of English 
translations and the English data contained 12 types 
of Japanese translations in total. 
We examined the effects of all possible combi- 
nations of the following four features on transla- 
tion accuracy. We trained decision trees with the 
C4.5 (Quinlan, 1993) type algorithm while using 
these features (in all possible combinations) as at- 
tributes. 
(I) Representative form of the positive response 
(J) SA tag for the positive response 
(K) SA tag for the SA unit previous to the positive 
response 
(L) Speaker (Hotel/Clerk) 
386 
Table 7: Representation forms and the counts. 
Japanese freq. 
Kekkou 69 
Soudesu ka 192 
Hal 930 
Soudesu 120 
Moehiron 7 
Soudesu ne 16 
Shouchi 30 
Wakari- 
mashita 304 
Kashikomari- 
mashita 300 
English freq. 
I understand 6 
Great 5 
Okay 240 
I see 136 
All right 136 
Very well 13 
Certainly 27 
Yes 359 
Fine 52 
Right 10 
Sure 44 
Very good 9 
Total 1,968 Total 1,037 
Table 8: Accuracies with one feature. 
Feature J toE(%) EtoJ (%) 
I 54.83 46.96 
J 51.73 34.33 
K 73.02 55.35 
L 40.09 37.80 
We will show some of the results. Table 8 shows 
the accuracy when using one feature as the attribute. 
We can naturally assume that the use of feature (I) 
gives the baseline accuracy. 
The result gives us a strange impression in that 
the SA tags for the previous SA units (K) were far 
more effective than the SA tags for the positive re- 
sponses themselves (J). This phenomenon can be 
explained by the variety of tag types given to the 
utterances. A positive response expressions of the 
same representative form have at most a few SA tag 
types, say two, whereas the previous SA units can 
have many SA tag types. If a positive response ex- 
pression possesses five translations, they cannot be 
translated with two SA tags. 
Table 9 shows the best feature combinations at 
each number of features from 1 to 4. The best fea- 
ture combinations were exactly the same for both 
translation directions, Japanese to English and vice 
versa. The percentages are the average accuracy ob- 
tained by the 10-fold cross-validation, and the t- 
score in each row indicates the effect of adding one 
feature from the upper row. We again admit a t- 
score that is greater than 2.01 as significant (two- 
sided significance level of 5 %). 
The accuracy for Japanese translation was sat- 
urated with the two features (K) and (I). Further 
addition of any feature did not show any significant 
improvement. The SA tag for the positive responses 
did not work. 
The accuracy for English translation was satu- 
Table 9: Best performance for each number of fea- 
tures. 
Features J toE(%) t EtoJ (%) t 
K 73.02 - 55.35 - 
K,I 88.51 15.42 60.66 3.10 
K,I,L 88.92 0.51 65.58 2.49 
K,I,L,J 88.21 0.75 66.74 0.55 
rated with the three features (K), (I), and (L). The 
speaker's identification proved to be effective, unlike 
Japanese. This is due to the necessity of controlling 
politeness in Japanese translations according to the 
speaker. The SA tag for the positive responses did 
not work either. 
These results suggest that the SA tag informa- 
tion for the previous SA unit and the speaker's in- 
formation should be kept in addition to representa- 
tive forms when we implement the positive response 
translation system together with the SA tagging sys- 
tem. 
7 Related Works and Discussions 
We discuss the tagging work in this section. In sub- 
section 5.3, we showed that Japanese segmentation 
into SA units was quite successful only with lexical 
information, but English segmentation was not that 
successful. 
Although we do not know of any experiments di- 
rectly comparable to ours, a recent work reported 
by Cettolo and Falavigna (1998) seems to be sim- 
ilar. In that paper, they worked on finding se- 
mantic boundaries in Italian dialogues with the 
"appointment scheduling task." Their semantic 
boundary nearly corresponds to our SA unit bound- 
ary. Cettolo and Falavigna (1998) reported recall 
and precision rates of 62.8% and 71.8%, respec- 
tively, which were obtained with insertion and dele- 
tion of boundary markers. These scores are clearly 
lower than our results with a Japanese segmentation 
match. 
Although we should not jump to a generalization, 
we are tempted to say the Japanese dialogues are 
easier to segment than western languages. With this 
in mind, we would like to discuss our study. 
First of all, was the manual segmentation quality 
the same for both corpora? As we explained in sub- 
section 2.1, both corpora were tagged by experts, 
and the entire result was checked by one of them 
for each language. Therefore, we believe that there 
was not such a significant gap in quality that could 
explain the segmentation performance. 
Secondly, which lexical information yielded such 
a performance gap? We investigated the effects of 
part-of-speech and morphemes in the segmentation 
387 
of both languages. We conducted the same 10-fold 
cross-validation tests as in sub-section 5.3 and ob- 
tained 82.29% (recall) and 86.16% (precision) for 
Japanese under condition (B'), which used only pos's 
in " x+~ for the PE calculation. English, in con- Wx-1 
trast, marked rates of 65.63% (recall) and 73.35% 
(precision) under the same condition. These results 
indicated the outstanding effectiveness of Japanese 
pos's in segmentation. Actually, we could see some 
pos's such as "ending particle (shu-jyoshi)" which 
clearly indicate sentence endings and we considered 
that they played important roles in the segmenta- 
tion. English, on the other hand, did not seem to 
have such strong segment indicating pos's. Although 
lexical information is important in English segmen- 
tation (Stoleke and Shriberg, 1996), what other in- 
formation can help improve such segmentation? 
Hirschberg and Nakatani (1996) showed that 
prosodic information helps human discourse segmen- 
tation. Litman and Passonneau (1995) addressed 
the usefulness of a "multiple knowledge source" 
in human and automatic discourse segmentation. 
Vendittiand Swerts (1996) stated that the into- 
national features for many Indo-European lan- 
guages help cue the structure of spoken dis- 
course. Cettolo and Falavigna (1998) reported im- 
provements in Italian semantic boundary detection 
with acoustic information. All of these works indi- 
cate that the use of acoustic or prosodic information 
is useful, so this is surely one of our future directions. 
The use of higher syntacticM information is also 
one of our directions. The SA unit should be a mean- 
ingful syntactic unit, although its degree of meaning- 
fulness may be less than that in written texts. The 
goodness of this aspect can be easily incorporated in 
our probability term PE. 
8 Conclusions 
We have described a new efficient statistical speech 
act type tagging system based on a statistical model 
used in Japanese morphological analyzers. This sys- 
tem integrates linguistic, acoustic, and situational 
features and efficiently performs optimal segmenta- 
tion of a turn and tagging. From several tagging 
experiments, we showed that the system segmented 
turns and assigned speech act type tags at high ac- 
curacy rates when using Japanese data. Compara- 
tively lower performance was obtained using English 
data, and we discussed the performance difference. 
We Mso examined the effect of parameters in the sta- 
tistical models on tagging performance. We finally 
showed that the SA tags in this paper are useful in 
translating positive responses that often appear in 
task-oriented dialogues such as those in ours. 
Acknowledgment 
The authors would like to thank Mr. Yasuo 
Tanida for the excellent programming works and Dr. 
Seiichi Yamamoto for stimulus discussions. 

References 
M. Cettolo and D. Falavigna. 1998. Automatic de- 
tection of semantic boundaries based on acoustic 
and lexical knowledge. In ICSLP '98, volume 4, 
pages 1551-1554. 
B. J. Grosz and C. L. Sidner. 1986. Atten- 
tion, intentions and the structure of discourse. 
Computational Linguistics, 12(3):175-204, July- 
September. 
J. Hirschberg and C. H. Nakatani. 1996. A prosodic 
analysis of discourse segments in direction-giving 
monologues. In 34th Annual Meeting of the Asso- 
ciation for the Computational Linguistics, pages 
286-293. 
F. Jelinek, 1997. Statistical Methods for Speech 
Recognition, chapter 10. The MIT Press. 
D. J. Litman and R. J. Passonneau. 1995. Com- 
bining multiple knowledge sourses for discourse 
segmentation. In 33rd Annual Meeting of the As- 
sociation for the Computational Linguistics, pages 
108-115. 
T. Morimoto, N. Uratani, T. Takezawa, O. Furuse, 
Y. Sobashima, H. Iida, A. Nakamura, Y. Sagisaka, 
N. Higuchi, and Y. Yamazaki. 1994. A speech and 
language database for speech translation research. 
In ICSLP '94, pages 1791-1794. 
M. Nagata and T. Morimoto. 1994. An information- 
theoretic model of discourse for next utterance 
type prediction. Transactions of Information 
Processing Society of Japan, 35(6):1050-1061. 
M. Nagata. 1994. A stochastic Japanese morpholog- 
ical analyzer using a forward-DP and backward- 
A* N-best search algorithm. In Proceedings of 
Coling94, pages 201-207. 
J. R. Quinlan. 1993. C~.5: Programs for Machine 
Learning. Morgan Kaufmann. 
N. Reithinger and E. Maier. 1995. Utilizing statisti- 
cal dialogue act processing in verbmobil. In 33rd 
Annual Meeting of the Associations for Computa- 
tional Linguistics, pages 116-121. 
J. R. Searle. 1969. Speech Acts. Cambridge Univer- 
sity Press. 
M. Seligman, L. Fais, and M. Tomokiyo. 1994. 
A bilingual set of communicative act labels for 
spontaneous dialogues. Technical Report TR-IT- 
0081, ATR-ITL. 
A. Stolcke and E. Shriberg. 1996. Automatic lin- 
guistic segmentation of conversational speech. In 
ICSLP '96, volume 2, pages 1005-1008. 
J. Venditti and M. Swerts. 1996. Intonational cues 
to discourse structure in Japanese. In ICSLP '96, 
volume 2, pages 725-728. 
