Japanese Dialogue Corpus of Multi-Level Annotation 
The Japanese Discourse Research Initiative 
http://www.slp.cs.ritsumei.ac.jp/dtag/
Abstract 
This paper describes a Japanese dialogue corpus annotated with multi-level information, built by the Japanese Discourse Research Initiative of the Japanese Society for Artificial Intelligence. The annotation information consists of speech, transcription delimited by slash units, prosodic information, part-of-speech tags, dialogue acts and dialogue segmentation. In the project, we used the corpus to obtain new findings by examining the relationship between linguistic information and dialogue acts, the relationship between prosodic information and dialogue segments, and the characteristics of agreement/disagreement expressions and non-sentence elements.
1 Introduction 
This paper describes a Japanese dialogue corpus annotated with multi-level information such as speech, linguistic and discourse information, built by the Japanese Discourse Research Initiative, supported by the Japanese Society for Artificial Intelligence.
Dialogue corpora are now indispensable to the speech and discourse research communities. The corpora have been used not only for examining the relationship between speech and linguistic phenomena, but also for building speech and dialogue understanding systems.
Sharing corpora among researchers is highly desirable, since creating corpora incurs considerable cost: writing and revising annotation manuals, annotating the data, and checking the consistency and reliability of the annotated data. The Discourse Research Initiative was set up in March 1996 by US, European, and Japanese researchers to develop standardized discourse annotation schemes (Carletta et al., 1997; Core et al., 1998).
The efforts of the initiative have been called 'standardization', but this naming is somewhat misleading. In typical standardization efforts, as in audio-visual and telecommunication technologies, commercial companies try to expand the market for their products or interfaces through the standard. The objective of the standardization efforts in discourse, by contrast, is to promote interaction among discourse researchers and thereby provide a solid foundation for corpus-based discourse research, eliminating duplicated resource-building efforts and increasing sharable resources.
In cooperation with this initiative, the Japanese Discourse Research Initiative was started in Japan in May 1996, supported by the Japanese Society for Artificial Intelligence (JDRI, 1996; Ichikawa et al., 1999). The activities of the initiative involve:
• creating and revising annotation schemes based on a survey of the existing schemes and on annotation experiments,
• annotating corpora based on the proposed annotation schemes, and
• doing research using the corpora, not only to examine the utility of the schemes and corpora but also to obtain new findings.
(Figure 1 diagram: ESPS/waves+, ChaSen; prosodic, part-of-speech, slash unit; word alignment, dialogue acts)
Figure 1: The relations among the annotation 
information 
In the following, a Japanese dialogue 
corpus of multi-level annotation is demon- 
strated. The annotation schemes deal with the information for speech, transcription segmented into utterance units called 'slash units,' prosody, part of speech, dialogue acts and dialogue segments. Figure 1 shows the relations among the annotation information.
2 Speech Sound and Transcription 
The corpus consists of a collection of 14 task- 
oriented dialogues, each performed by two na- 
tive speakers of Japanese. The total time of 
the 14 dialogues is 53 minutes. The tasks in- 
clude scheduling, route guidance, telephone 
shopping, and so on. We set the roles of the 
two speakers and the goal of the task but no 
pre-defined scenarios. For example, in the 
scheduling task, the speakers were given the 
roles of a private secretary and a client, and 
asked to arrange a meting appointment. 
The speech sound of the two speakers partici- 
pating in a dialogue was recorded on separate 
channels, which enables us to perform accu- 
rate acoustic/prosodic analysis even for overlapping speech. The transcription contains or-
thographic representations in Kanji and the 
starting and ending time of each utterance, 
where an utterance is defined as a continuous 
speech region delimited by pauses of 400 msec 
or longer. 
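To make the 400 msec criterion concrete, the following is a minimal sketch of pause-based utterance segmentation in Python. It is not one of the tools used to build the corpus: the per-frame voice-activity input, the 10 msec frame shift and the helper name are assumptions made purely for illustration.

# Minimal sketch of pause-based utterance segmentation (assumed inputs,
# not the tool actually used for the corpus).
from typing import List, Tuple

def segment_utterances(voiced: List[bool],
                       frame_shift_ms: float = 10.0,
                       min_pause_ms: float = 400.0) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) regions separated by pauses >= min_pause_ms.

    `voiced` is a per-frame voice-activity decision for one speaker channel.
    """
    min_pause_frames = int(round(min_pause_ms / frame_shift_ms))
    regions: List[Tuple[int, int]] = []
    start = None
    silence_run = 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            silence_run = 0
        else:
            if start is not None:
                silence_run += 1
                # Close the utterance once the pause reaches 400 ms.
                if silence_run >= min_pause_frames:
                    regions.append((start, i - silence_run + 1))
                    start = None
                    silence_run = 0
    if start is not None:
        regions.append((start, len(voiced)))
    to_sec = frame_shift_ms / 1000.0
    return [(s * to_sec, e * to_sec) for s, e in regions]

# Example: 300 ms of speech, a 500 ms pause, then 200 ms of speech.
frames = [True] * 30 + [False] * 50 + [True] * 20
print(segment_utterances(frames))   # -> [(0.0, 0.3), (0.8, 1.0)]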
3 Prosodic Information and 
Part-of-speech 
The prosodic information and the part- 
of-speech tags were assigned (semi- 
)automatically using the speech sound 
and the transcription. 
3.1 Prosodic information 
Prosody has been widely recognized as one 
of the important factors which relate to dis- 
course structures, dialogue acts, informa- 
tion status, and so on. Informative corpora 
should, in the first place, contain some form 
of prosodic information. 
At this stage, our corpus merely includes, 
as prosodic information, raw values of fun- 
damental frequency, voicing probability, and 
rms energy, which were obtained from the 
speech sound using speech analysis software 
ESPS/waves+ (Entropic, 1996) and simple
post-processing for smoothing. The future 
corpus will contain more abstract descriptions 
of prosodic events such as accents and bound- 
ary tones. 
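For readers without access to ESPS/waves+, a rough open-source substitute can produce comparable raw tracks. The sketch below uses librosa and a simple median smoother; the file name, analysis parameters and smoothing width are illustrative assumptions, not the settings used for the corpus.

import numpy as np
import librosa

# Rough open-source substitute for the ESPS/waves+ analysis actually used;
# the file name and all analysis parameters below are illustrative only.
y, sr = librosa.load("speaker_A.wav", sr=None)        # one speaker channel

hop = int(0.010 * sr)                                  # 10 ms frame shift (assumed)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, hop_length=hop)
rms = librosa.feature.rms(y=y, hop_length=hop)[0]      # frame-level rms energy

def median_smooth(track, width=5):
    """Very simple post-processing: median smoothing that ignores NaN frames."""
    out = track.copy()
    half = width // 2
    for i in range(len(track)):
        window = track[max(0, i - half): i + half + 1]
        window = window[~np.isnan(window)]
        if window.size:
            out[i] = np.median(window)
    return out

f0_smooth = median_smooth(f0)
n = min(len(f0_smooth), len(voiced_prob), len(rms))
for i in range(0, n, 100):                             # print one frame per second
    t = i * hop / sr
    print(f"{t:6.2f}s  F0={f0_smooth[i]:7.1f} Hz  voicing={voiced_prob[i]:.2f}  rms={rms[i]:.4f}")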
3.2 Part-of-speech 
Part-of-speech information is another basic resource for speech recognition, syntactic/semantic parsing, and dialogue processing, as well as for linguistic and psycholinguistic analysis of spoken discourse.
Part-of-speech tags were, first, obtained au- 
tomatically from the transcription using the 
morphological analysis system ChaSen (Mat- 
sumoto et al., 1999), and, then, corrected 
manually. The tag set was extended to cover 
filled pauses and contracted forms peculiar to 
spontaneous speech, and some dialects. The 
tagged corpus will be used as a part of the 
training data for the statistical learning mod- 
ule of ChaSen to improve its performance for 
spontaneous speech, which can be used for fu- 
ture applications. 
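A hedged sketch of this pipeline is given below: the transcription is run through a morphological analyzer and the resulting (surface, part-of-speech) pairs are post-edited, for example for filled pauses. The chasen command name, its tab-separated EOS-terminated output format, the filler word list and the tag name are all assumptions made for illustration, not details taken from the corpus manual.

import subprocess

FILLERS = {"e", "eto", "ano", "uun"}          # hypothetical filler list

def analyze(text: str):
    """Run the transcription through ChaSen and return (surface, POS) pairs.

    The 'chasen' command name and its tab-separated, EOS-terminated output
    are assumptions made for this sketch.
    """
    proc = subprocess.run(["chasen"], input=text, capture_output=True,
                          text=True, check=True)
    tokens = []
    for line in proc.stdout.splitlines():
        if line == "EOS" or not line.strip():
            continue
        fields = line.split("\t")
        surface = fields[0]
        pos = fields[3] if len(fields) > 3 else "UNK"
        tokens.append((surface, pos))
    return tokens

def correct_fillers(tokens):
    """Toy post-edit: re-label assumed filled pauses with a (hypothetical) filler tag."""
    return [(w, "filler" if w in FILLERS else pos) for w, pos in tokens]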
3.3 Word alignment 
In some applications such as co-reference res- 
olution utilizing prosodic correlates of given- 
new status of words, it is useful to know the 
prosodic information of particular words or 
phrases. In order to obtain such informa-
tion, the correspondence between the word se- 
quence and the speech sound must be given. 
Our corpus contains the information for the 
starting and the ending time of every word. 
The time-stamp of each word in an ut- 
terance was obtained automatically from the 
speech sound and the part-of-speech using the 
forced alignment function of speech recogni- 
tion software HTK (Odell et al., 1997) with 
the tri-phone model for Japanese speech de- 
veloped by the IPA dictation software project 
(Shikano et al., 1998). Apparent errors were 
corrected manually with reference to sound 
waveforms and spectrograms obtained and 
displayed on a screen by ESPS/waves+ (En- 
tropic, 1996). 
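One common way to store forced-alignment output is an HTK-style label file, in which each line holds a start time, an end time and a label, with times in 100-nanosecond units. The corpus file format itself is not specified in this paper, so the sketch below, with an illustrative file name, simply assumes that HTK convention and reads the file into (word, start, end) triples in seconds.

def read_htk_labels(path):
    """Read word time stamps from an HTK-style label file (times in 100 ns units)."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip MLF headers ('#!MLF!#'), quoted per-file name lines and '.' terminators.
            if not line or line.startswith("#!MLF!#") or line.startswith('"') or line == ".":
                continue
            start, end, word = line.split()[:3]
            words.append((word, int(start) / 1e7, int(end) / 1e7))   # convert to seconds
    return words

# Illustrative file name, not from the corpus distribution.
for word, start, end in read_htk_labels("dialogue01_A.lab"):
    print(f"{start:7.3f} {end:7.3f}  {word}")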
4 Utterance Units 
4.1 Slash units 
In the transcription, an utterance is defined 
as a continuous speech region delimited by 
pauses of 400 msec or longer. However, this 
definition of the utterances does not corre- 
spond to the units for discourse annotation. 
For example, the utterances are sometimes 
interrupted by the partner. For reliable dis- 
course annotation, analysis units must be con- 
structed from the utterances defined above. 
Following Meteer and Taylor (1995), we call 
such a unit 'slash unit.' 
4.2 Criteria for determining slash 
units 
The criteria for determining slash units in 
Japanese were defined with reference to those 
for English (Meteer and Taylor, 1995). The 
slash units were annotated manually with ref-
erence to the speech sound and transcription 
of dialogues. 
Single utterances as slash unit Single 
utterances which can be thought to represent 
sentences conceptually are qualified as a slash 
unit. Figure 2 shows examples of slash units 
by single utterances (slash units are delimited 
by the symbol '/'). 
In the cases where the word order is inverted, the utterances are regarded as a slash unit only if the utterances with normalized word order are qualified as a slash unit.
A sequence of one speaker's speech that terminates with a hesitation, an interruption or a slip of the tongue, but does not continue in the speaker's next utterance, is also qualified as a slash unit.

A: hai /                          ;{response}
   (Yes.)
A: kochira kankou annai shisutemu desu /
                                  ;{a single sentence}
   (This is the sightseeing guide system.)
A: ryoukin niha fukumarete orimasen ga betto 1200 en de goyoui sasete itadakimasu /
                                  ;{a complex sentence}
   (This is not included in the charge. We offer the service for the separate charge of 1200 yen.)

Figure 2: Examples of single utterances as slash unit

shuppatsu chiten kara --
   (From the starting point)
-- nishi gawa ni --
   (to the west)
-- sukoshi dake ikimasu /
   (move a little)

Figure 3: An example of multiple utterances as slash unit
Multiple utterances as slash unit When a collection of multiple utterances forms a sentence, as in Figure 3, they are qualified as one
slash unit. In slash units spanning multiple 
utterances, the symbol '--' is marked both at 
the end of the first utterance and at the start 
of the last utterance. 
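As a minimal illustration of how the '--' continuation marks can be resolved when reading the transcription, the sketch below joins pause-delimited utterances into slash units. The list-of-strings input is an assumed representation, not the actual corpus file format.

# Hedged sketch: joining pause-delimited utterances into slash units using
# the '--' continuation marks.  The list-of-strings input is assumed.
def join_slash_units(utterances):
    units, buffer = [], []
    for utt in utterances:
        text = utt.strip()
        continues = text.endswith("--")
        buffer.append(text.lstrip("- ").rstrip("- ").strip())
        if not continues:
            units.append(" ".join(buffer))
            buffer = []
    if buffer:                       # e.g. an interrupted, abandoned utterance
        units.append(" ".join(buffer))
    return units

utts = ["shuppatsu chiten kara --", "-- nishi gawa ni --", "-- sukoshi dake ikimasu /"]
print(join_slash_units(utts))
# -> ['shuppatsu chiten kara nishi gawa ni sukoshi dake ikimasu /']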
4.3 Non-sentence elements
Non-sentence elements consist of 'aiduti', conjunction markers, discourse markers, fillers,
and non speech elements, which are enclosed 
by {S ...}, {C ...}, {D ...}, {F ...}, and 
{N ... }, respectively. These elements can be 
used to define a slash unit. For example, when 
'aiduti' is expressed by the words such as "hai 
(yes, yeah, right)", "un (yes, yeah, right)" and 
"ee (mmm, yeah)" or by word repetition, it is 
regarded as an utterance. Otherwise, 'aiduti' 
is not qualified as an independent slash unit. 
The main function of discourse markers is 
to show the relations between utterances, like 
starting a new topic, changing topics, and 
restarting an interrupted conversation. The 
words such as "mazu (first, firstly)", "dewa 
(then, ok)", "tsumari (I mean, that means 
that)" and "sorede (and so)" may become dis- 
course markers when they appear at the head 
of the utterances. An utterance just before 
the one with discourse markers is qualified as 
a slash unit (Figure 4). 
In the Switchboard project (Meteer and Taylor, 1995), our {S ...} (aiduti) category is not regarded as a separate category. However, in Japanese dialogue, signals that indicate a hearer's attention to the speaker's utterances are expressed frequently. For this reason, we created 'aiduti' as a separate category. On the other hand, {A ...} (aside), {E ...} (explicit editing term), restarts and repairs are not annotated in our scheme at the present stage.
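Since the non-sentence elements are enclosed in curly braces with a one-letter category, they are easy to pull out of a slash unit mechanically. The following sketch does this with a regular expression; the regex and the example string are ours, only the category letters come from the scheme.

# Sketch of pulling the bracketed non-sentence elements out of a slash unit.
import re

ELEMENT = re.compile(r"\{([SCDFN])\s+([^}]*)\}")
LABELS = {"S": "aiduti", "C": "conjunction marker", "D": "discourse marker",
          "F": "filler", "N": "non-speech element"}

def non_sentence_elements(slash_unit: str):
    return [(LABELS[m.group(1)], m.group(2).strip())
            for m in ELEMENT.finditer(slash_unit)]

print(non_sentence_elements("{D de} hidari naname shitani"))
# -> [('discourse marker', 'de')]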
5 Dialogue Acts 
Identifying the dialogue act of a slash unit is a difficult task because the mapping between surface form and dialogue act is not obvious. In addition, some slash units have more than one function, e.g. answering a question while stating additional information. Considering these problems, the DAMSL architecture codes various functions of one utterance, such as a forward-looking function, a backward-looking function, etc.
However, it is difficult to determine the function of an isolated utterance. We have shown that assumptions of dialogue structure and exchange structure improve the agreement score among coders (Ichikawa et al., 1999). Therefore, we define our dialogue act tagging scheme as a hierarchical refinement of the exchange structure.
The annotation scheme for dialogue acts 
includes a set of rules to identify the func- 
tion of each slash unit based on the theory of 
speech acts (Searle, 1969) and discourse analysis (Coulthard, 1992; Stenström, 1994).
This scheme provides a basis for examining 
the local structure of dialogues. 
• Task-oriented dialogue
    (Opening) Problem solving (Closing)
• Problem solving
    Exchange+
• Exchange
    Initiation
    (Response)/Initiation*
    (Response)*
    (Follow-up)
    (Follow-up)

Figure 5: Model for task-oriented dialogues
In general, a dialogue(1) is modeled with problem solving subdialogues, sometimes preceded by an opening subdialogue (e.g., greeting) and followed by a closing subdialogue (e.g., expressing gratitude). A problem solving subdialogue consists of initiating and responding utterances, sometimes followed by follow-up utterances (Figure 5).

(1) In this paper, we limit our attention to task-oriented dialogues, which are the main target of study in computational linguistics and spoken dialogue research.

Figure 6 shows an example problem solving subdialogue with the exchange structure.

(Initiation)
41 A: chikatetsu no ekimei ha?
      (What's the name of the subway station?)
(Response)
42 B: chikatetsu no teramachi eki ni narimasu
      (The name of the subway station is Teramachi.)
(Follow-up)
43 A: hai
      (Ok.)

Figure 6: An example problem solving subdialogue with the exchange structure
In this scheme, dialogue acts, the elements 
of the exchange structure, are classified into 
the tags shown in Figure 7. 
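For processing purposes, the tag set of Figure 7 can be written down as a small lookup table. The sketch below groups the tags by their position in the exchange structure and maps a tag (including the combined Response type / Initiation type form) back to its category; the dictionary and function names are ours, only the tags come from Figure 7.

# The dialogue act tags of Figure 7, grouped by exchange-structure category
# (the dictionary itself is ours, only for illustration).
DIALOGUE_ACTS = {
    "Dialogue management": ["Open", "Close"],
    "Initiation": ["Request", "Suggest", "Persuade", "Propose", "Confirm",
                   "Yes-no question", "Wh-question", "Promise", "Demand",
                   "Inform", "Other assert", "Other initiate"],
    "Response": ["Positive", "Negative", "Answer", "Other response"],
    "Follow-up": ["Understand"],
}

def act_category(tag: str) -> str:
    """Map a tag such as 'Wh-question' (or 'Answer / Confirm' for a
    Response-with-Initiation unit) to its exchange-structure category."""
    if "/" in tag:                       # Response type / Initiation type
        return "Response with Initiation"
    for category, tags in DIALOGUE_ACTS.items():
        if tag in tags:
            return category
    return "Unknown"

print(act_category("Yes-no question"))   # -> Initiation
print(act_category("Negative / Inform")) # -> Response with Initiation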
6 Dialogue Structure and 
Constraints on Multiple 
Exchanges 
6.1 Dialogue Segment 
In the earlier discourse model (Grosz and Sidner, 1986), a discourse segment has a beginning and an ending utterance and may have smaller discourse segments inside it. It is not
an easy task to identify such segments with 
the nesting structure for spoken dialogues, 
because the structure of a dialogue is often 
very complicated due to the interaction of two 
speakers. In a preliminary experiment of cod- 
ing segments in spoken dialogues, there were a 
lot of disagreements on the granularity or the 
relation of the segments and on identifying 
ending utterances of the segment. An alterna- 
tive scheme of coding the dialogue structure 
(DS) is necessary to build dialogue corpora 
annotated with the discourse level structure. 
Our scheme annotates spoken dialogues with boundary marking of the DS, instead of identifying a beginning and an ending utterance of each DS. A building block of dialogue segments is identified based on the exchanges explained in Section 5. A dialogue segment (DS) tag is inserted before initiating utterances, because the initiating utterances can be thought of as the start of new discourse segments.

• Dialogue management
    Open, Close
• Initiation
    Request, Suggest, Persuade, Propose, Confirm, Yes-no question, Wh-question, Promise, Demand, Inform, Other assert, Other initiate
• Response
    Positive, Negative, Answer, Other response
• Follow-up
    Understand
• Response with Initiation
    The element of this category is represented as Response type / Initiation type.

Figure 7: The classification of dialogue acts
The DS tag consists of a topic break index 
(TBI), a topic name and a segment relation. 
TBI signifies the degree of topic dissimilarity 
between the DSs. TBI takes the value of 1 or 
2: the boundary with TBI 2 is less continuous 
than the one with TBI 1 with regard to the 
topic. The topic name is labeled by coders' 
subjective judgment. The segment relation 
indicates the relation between the preceding and the following segments, and is classified into the following categories.
• clarification
suspends the exchange and makes a clar- 
ification in order to obtain information 
necessary to answer the partner's utter- 
ance; 
• interruption
    starts a different topic from the previous one during or after the partner's explanatory utterances; and
• return
    goes back to the previous topic after the clarification or the interruption.

[ : room for a lecture: ]
38 A: {F e} heya wa dou simashou ka?
      (How about the meeting room?)
[1: small-sized meeting room: clarification]
39 B: heya wa shou-kaigishitsu wa aite masu ka?
      (Can I use the small-sized meeting room?)
40 A: {F to} kayoubi no {F e} 14 ji han kara wa {F e} shou-kaigishitsu wa aite imasen
      (The small meeting room is not available from 14:30 on Tuesday.)
[1: the large-sized meeting room: ]
41 A: dai-kaigishitsu ga tukae masu
      (You can use the large meeting room.)
[1: room for a lecture: return]
42 B: {D soreja} dai-kaigishitsu de onegaishimasu
      (Ok. Please book the large meeting room.)

Figure 8: An example dialogue with the dialogue segment tags
Figure 8 shows an example dialogue anno- 
tated with the DS tags. 
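A DS tag line such as "[1: room for a lecture: return]" can be read into its three components mechanically. The following sketch is a hedged illustration: the exact file syntax is assumed from the examples in Figure 8, and the class and function names are ours.

# Hedged sketch of parsing a DS tag line; the syntax is assumed from Figure 8.
import re
from dataclasses import dataclass
from typing import Optional

DS_TAG = re.compile(r"\[\s*([12])\s*:\s*([^:\]]*?)\s*:\s*([^\]]*?)\s*\]")

@dataclass
class DialogueSegmentTag:
    topic_break_index: int            # 1 or 2: degree of topic dissimilarity
    topic: str                        # coder's subjective topic label
    relation: Optional[str]           # clarification / interruption / return

def parse_ds_tag(line: str) -> Optional[DialogueSegmentTag]:
    m = DS_TAG.search(line)
    if not m:
        return None
    tbi, topic, relation = m.groups()
    return DialogueSegmentTag(int(tbi), topic, relation or None)

print(parse_ds_tag("[1: small-sized meeting room: clarification]"))
print(parse_ds_tag("[1: room for a lecture: ]"))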
6.2 Constraints on multiple 
exchanges 
Annotation of dialogue segments mostly de- 
pends on the coders' intuitive judgment on 
topic dissimilarity between the segments. In 
order to lighten the burden of the coders' 
judgment, the structural constraints on multi- 
ple exchanges are experimentally introduced. 
The constraints can be classified into two 
types: one concerns embedding exchanges 
(relevance type 1) and the other neighboring exchanges (relevance type 2).
In relevance type 1, the relation of an initi- 
ating utterance and its responding utterance 
is shown by attaching the id number of the ini- 
tiating utterance to the responding utterance. 
This id number can indicate non-adjacent initiation-response pairs that include embedded exchanges inside.
In relevance type 2, the structures of neighboring exchanges, such as chaining, coupling and elliptical coupling (Stenström, 1994), are introduced. Chaining takes the pattern [A:I B:R] [A:I B:R] (in both exchanges, speaker A initiates an utterance and speaker B responds to A). Coupling is the pattern [A:I B:R] [B:I A:R] (speaker A initiates, speaker B both responds and initiates, and speaker A responds to B). Elliptical coupling is the pattern [A:I] [B:I A:R], equivalent to coupling with B's response in the first exchange omitted. Relevance type 2 shows whether
the above structures of neighboring exchanges 
can be observed or not. Figure 9 shows an ex- 
ample of annotation of relevance types 1 and 
2. 
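The three patterns of relevance type 2 can be checked mechanically once each exchange is reduced to the speakers of its initiation and response. The sketch below does this for two adjacent exchanges; the tuple representation and the function name are ours, only the pattern definitions come from the text above.

# Hedged sketch of checking relevance type 2 patterns over adjacent exchanges.
from typing import Optional, Tuple

Exchange = Tuple[str, Optional[str]]     # (initiating speaker, responding speaker)

def relevance_type2(first: Exchange, second: Exchange) -> Optional[str]:
    (i1, r1), (i2, r2) = first, second
    if r1 is not None and r2 is not None:
        if (i1, r1) == (i2, r2):
            return "chaining"            # [A:I B:R] [A:I B:R]
        if (i2, r2) == (r1, i1):
            return "coupling"            # [A:I B:R] [B:I A:R]
    if r1 is None and r2 == i1 and i2 != i1:
        return "elliptical coupling"     # [A:I] [B:I A:R]
    return None

print(relevance_type2(("A", "B"), ("A", "B")))   # -> chaining
print(relevance_type2(("A", "B"), ("B", "A")))   # -> coupling
print(relevance_type2(("A", None), ("B", "A")))  # -> elliptical coupling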
7 Corpus Building Tools 
In the experiments, various tools for tran- 
scription and annotation were used. For tran- 
scription, the automatic segmentizer (TIME)
and the online transcriber (PV) were used 
(Horiuchi et al., 1999). The former lists up 
candidates for unit utterances according to the parameter for the length of silences. The latter displays an energy measurement of each speaker's utterance in two windows using a speech data file. Users can see any part of a dialogue using the scroll bar, and can hear the speech of both speakers or of each speaker by selecting any region of the windows with the mouse.

[<Yes-no question> <relevance no>]
27 A: hatsukano jyuuji kara ha aite irun de syou ka?
      (Is the room available from 10am on the 20th?)
[<Yes-no question> <relevance yes>]
28 B: kousyuu shitsu desu ka?
      (Are you mentioning the seminar room?)
[<Positive> <0028>]
29 A: hai
      (Yes.)
[<Negative> <0027>]
30 B: hatsuka ha aite orimasen
      (It is not available on the 20th.)
[<Understand>]
31 A: soudesu ka
      (Ok.)

Figure 9: An example dialogue with relevance types 1 and 2
For prosodic and part of speech annotation, 
the speech analysis software ESPS/waves+ 
(Entropic, 1996), speech recognition software 
HTK (Odell et al., 1997) with the tri-phone 
model for Japanese speech developed by the 
IPA dictation software project (Shikano et al., 
1998) and the morphological analysis system 
ChaSen (Matsumoto et al., 1999) were used.
For discourse annotation, the Dialogue Annotation Tool (DAT) had been used in our previous experiments (Core and Allen, 1997). Although DAT provides a consistency check among some functions within one sentence, we need a more wide-ranging consistency check, because our scheme assumes dialogue structure and exchange structure. DAT is therefore not fully satisfactory, but modifying the tool to our needs is not easy. Thus, for the moment, we decided to use just a simple transcription viewer and sound player (TV) (Horiuchi et al., 1999), which enables us to hear the sound of the utterances in the transcription.
Our project has no intention of creating new tools. Rather, we want to use any existing tools if they suit our needs or can easily be modified to do so. The tools of the MATE project (Carletta and Isard, 1999), which also aims at multi-level annotation, are a good candidate for our project. In the near future, we will examine whether we can effectively use their tools in the project.
8 Conclusion 
This paper described a Japanese dialogue cor- 
pus annotated with multi-level information 
built by the Japanese Discourse Research Ini- 
tiative, supported by the Japanese Society for Artificial Intelligence. The annotation information includes speech, transcription delimited by slash units, prosodic information, part-of-speech tags, dialogue acts and dialogue segmentation. In the
project (JSAI, 2000), we used the corpus for 
obtaining new findings by examining: 
• the relationship between linguistic infor- 
mation and dialogue acts 
• the relationship between prosodic information and dialogue segments, and
• the characteristics of agree- 
ment/disagreement expressions and 
non-sentence elements. 
This year we plan to quadruple the size of the 
corpus and make it publicly available as soon 
as we finish the annotation and its verifica- 
tion. 

References 
J. Carletta and A. Isard. 1999. The MATE Anno- 
tation Workbench: User Requirements. In The 
Proceedings of the ACL'99 Workshop on To-
wards Standards and Tools for Discourse Tag- 
ging, pages 11-17. 
J. Carletta, N. Dahlback, N. Reithinger, 
and M. A. Walker. 1997. Standards 
for Dialogue Coding in Natural Language 
Processing. ftp://ftp.cs.uni-sb.de/pub/dagstuhl/reporte/97/9706.ps.gz.
M. Core and J. Allen. 1997. Coding Dialogues 
with the DAMSL Annotation Scheme. In The 
Proceedings of the AAAI Fall Symposium on Communicative Action in Humans and Machines,
pages 28-35. 
M. Core, M. Ishizaki, J. Moore, C. Nakatani, 
N. Reithinger, D. Traum, and S. Tutiya. 1998. 
The Report of the Third Workshop of the 
Discourse Research Initiative, Chiba Corpus 
Project. Technical Report 3, Chiba University. 
M. Coulthard, editor. 1992. Advances in Spoken
Discourse Analysis. Routledge. 
Entropic Research Laboratory, Inc. 1996. 
ESPS/waves+ 5.1.1 Reference Guide. 
B. J. Grosz and C. L. Sidner. 1986. Attention, Intentions, and the Structure of Discourse. Computational Linguistics, 12(3), pages 175-204.
Y. Horiuchi, Y. Nakano, H. Koiso, M. Ishizaki, 
H. Suzuki, M. Okada, M. Makiko, S. Tutiya, 
and A. Ichikawa. 1999. The Design and Sta- 
tistical Characterization of the Japanese Map 
Task Dialogue Corpus. Japanese Society of Ar- 
tificial Intelligence, 14(2). 
A. Ichikawa, M. Araki, Y. Horiuchi., M. Ishizaki, 
S. Itabashi, T. Itoh, H. Kashioka, K. Kato, 
H. Kikuchi, H. Koiso, T. Kumagai, A. Kure- 
matsu, K. Maekawa, S. Nakazato, M. Tamoto, 
S. Tutiya, Y. Yamashita, and T. Yoshimura. 
1999. Evaluation of Annotation Schemes for 
Japanese Discourse. In Proceedings of ACL'99 
Workshop on Towards Standards and Tools for 
Discourse Tagging, pages 26-34. 
Japanese Discourse Research Initiative. http://www.slp.cs.ritsumei.ac.jp/dtag/.
Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda, and M. Asahara. 1999. Japanese morphological analysis system ChaSen version 2.0 manual (2nd edition). Technical Report NAIST-IS-TR99012, Graduate School of Information Science, Nara Institute of Science and Technology. http://cl.aist-nara.ac.jp/lab/nlt/chasen/manual2/manual.pdf.
M. Meteer and A. Taylor. 1995. Dysfluency Annotation Stylebook for the Switchboard Corpus. ftp://ftp.cis.upenn.edu/pub/treebank/swbd/doc/DFL-book.ps.gz.
Japanese Society for Artificial Intelligence. 2000. 
Technical Report of SIG on Spoken Language 
Understanding and Dialogue Processing. SIG- 
SLUD-9903. 
J. Odell, D. Ollason, V. Valtchev, and P. Wood- 
land. 1997. The HTK Book (for HTK Ver- 
sion 2.1). Cambridge University 
J. R. Searle. 1969. Speech Acts: An Essay in the 
Philosophy of Language. Cambridge University
Press. 
K. Shikano, T. Kawahara, K. Ito, K. Takeda, 
A. Yamada, T. Utsuro, T. Kobayashi, N. Mine- 
matsu, and M. Yamamoto. 1998. The Devel- 
opment of Basic Softwares for the Dictation of 
Japanese Speech: Research Report 1998. 
A. B. Stenström. 1994. An Introduction to Spoken
Interaction. Longman. 
