<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1001">
<Title>Japanese Dialogue Corpus of Multi-Level Annotation The Japanese Discourse Research Initiative</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Prosodic Information and Part-of-speech </SectionTitle>
<Paragraph position="0"> The prosodic information and the part-of-speech tags were assigned (semi)automatically using the speech sound and the transcription.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Prosodic information </SectionTitle>
<Paragraph position="0"> Prosody has been widely recognized as one of the important factors that relate to discourse structure, dialogue acts, information status, and so on. An informative corpus should therefore contain some form of prosodic information.</Paragraph>
<Paragraph position="1"> At this stage, our corpus includes as prosodic information only the raw values of fundamental frequency, voicing probability, and RMS energy, which were obtained from the speech sound using the speech analysis software ESPS/waves+ (Entropic, 1996) together with simple post-processing for smoothing. Future versions of the corpus will contain more abstract descriptions of prosodic events such as accents and boundary tones.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Part-of-speech </SectionTitle>
<Paragraph position="0"> Part-of-speech is another kind of basic information for speech recognition, syntactic/semantic parsing, and dialogue processing, as well as for linguistic and psycholinguistic analysis of spoken discourse.</Paragraph>
<Paragraph position="1"> Part-of-speech tags were first obtained automatically from the transcription using the morphological analysis system ChaSen (Matsumoto et al., 1999), and then corrected manually. The tag set was extended to cover filled pauses, contracted forms peculiar to spontaneous speech, and some dialects. The tagged corpus will be used as part of the training data for the statistical learning module of ChaSen to improve its performance on spontaneous speech, which can then be used for future applications.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Word alignment </SectionTitle>
<Paragraph position="0"> In some applications, such as co-reference resolution utilizing prosodic correlates of the given-new status of words, it is useful to know the prosodic information of particular words or phrases. In order to obtain such information, the correspondence between the word sequence and the speech sound must be given.</Paragraph>
<Paragraph position="1"> Our corpus contains the starting and ending time of every word.</Paragraph>
<Paragraph position="2"> The time-stamp of each word in an utterance was obtained automatically from the speech sound and the part-of-speech tags using the forced alignment function of the speech recognition software HTK (Odell et al., 1997) with the tri-phone model for Japanese speech developed by the IPA dictation software project (Shikano et al., 1998). Apparent errors were corrected manually with reference to sound waveforms and spectrograms displayed on screen by ESPS/waves+ (Entropic, 1996).</Paragraph>
</Section>
</Section>
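To make the combination of Sections 3.1 and 3.3 concrete, the following minimal sketch computes per-word prosodic summaries from the raw tracks and the word time-stamps. It uses librosa and SciPy as freely available stand-ins for ESPS/waves+ (which the corpus actually used), and the (word, start, end) tuple format for the alignment is an assumption for illustration, not the corpus file format.

```python
# Sketch (assumed tooling): per-word mean F0 and RMS energy from word alignments.
import numpy as np
import librosa
from scipy.signal import medfilt

def word_prosody(wav_path, words):
    """words: list of (word, start_sec, end_sec) from a forced alignment (assumed format)."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = 256

    # Raw prosodic tracks: fundamental frequency, voicing probability, RMS energy.
    f0, _, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]

    # Align track lengths, then apply simple smoothing (5-frame median filter).
    n = min(len(f0), len(rms))
    f0, voiced_prob, rms = f0[:n], voiced_prob[:n], rms[:n]
    times = librosa.times_like(f0, sr=sr, hop_length=hop)
    f0_smooth = medfilt(np.nan_to_num(f0), kernel_size=5)

    # Summarize each aligned word by its mean F0 (voiced frames only) and mean energy.
    summaries = []
    for word, start, end in words:
        in_word = (times >= start) & (times <= end)
        voiced = in_word & (voiced_prob > 0.5) & (f0_smooth > 0)
        summaries.append({
            "word": word,
            "mean_f0": float(f0_smooth[voiced].mean()) if voiced.any() else None,
            "mean_rms": float(rms[in_word].mean()) if in_word.any() else None,
        })
    return summaries
```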
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Utterance Units </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Slash units </SectionTitle>
<Paragraph position="0"> In the transcription, an utterance is defined as a continuous speech region delimited by pauses of 400 msec or longer. However, this definition of the utterance does not correspond to the units needed for discourse annotation; for example, utterances are sometimes interrupted by the partner. For reliable discourse annotation, analysis units must be constructed from the utterances defined above.</Paragraph>
<Paragraph position="1"> Following Meteer and Taylor (1995), we call such a unit a 'slash unit.'</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Criteria for determining slash units </SectionTitle>
<Paragraph position="0"> The criteria for determining slash units in Japanese were defined with reference to those for English (Meteer and Taylor, 1995). The slash units were annotated manually with reference to the speech sound and the transcription of the dialogues.</Paragraph>
<Paragraph position="1"> Single utterances as slash unit: Single utterances which can be thought to represent sentences conceptually qualify as slash units. Figure 2 shows examples of slash units formed by single utterances (slash units are delimited by the symbol '/').</Paragraph>
<Paragraph position="2">
A: hai / ;{response} (Yes.)
A: kochira chi~ annai shisutemu desu / ;{a single sentence} (This is the sightseeing guide system.)
A: ryoukin niha fukumarete orimasen ga betto 1200 en de goyoui sasete itadakimasu / ;{a complex sentence} (This is not included in the charge. We offer the service for the separate charge of 1200 yen.)
Figure 2: Examples of single utterances as slash units
</Paragraph>
<Paragraph position="4"> In cases where the word order is inverted, the utterances are regarded as a slash unit only if the utterances with normalized word order would qualify as a slash unit. A sequence of one speaker's speech that terminates with a hesitation, an interruption, or a slip of the tongue, but does not continue in the speaker's next utterance, also qualifies as a slash unit.</Paragraph>
<Paragraph position="5"> Multiple utterances as slash unit: When a collection of multiple utterances forms a sentence, as in Figure 3, they qualify as one slash unit. In slash units spanning multiple utterances, the symbol '--' is marked both at the end of the first utterance and at the start of the last utterance.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.3 Non sentence elements </SectionTitle>
<Paragraph position="0"> Non sentence elements consist of 'aiduti', conjunction markers, discourse markers, fillers, and non-speech elements, which are enclosed by {S ...}, {C ...}, {D ...}, {F ...}, and {N ...}, respectively. These elements can be used to define a slash unit. For example, when 'aiduti' is expressed by words such as "hai (yes, yeah, right)", "un (yes, yeah, right)" and "ee (mmm, yeah)" or by word repetition, it is regarded as an utterance. Otherwise, 'aiduti' does not qualify as an independent slash unit.</Paragraph>
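As a rough illustration of how this markup can be consumed, the sketch below splits one speaker turn into slash units and pulls out the non-sentence elements. The sample line, helper names, and output structure are invented for illustration; the regular expression only covers the simple, non-nested cases described above.

```python
# Sketch (assumed format): reading slash-unit markup ('/', '--', {S/C/D/F/N ...}).
import re

ELEMENT = re.compile(r"\{([SCDFN])\s+([^{}]*)\}")   # e.g. "{F ano}" -> ('F', 'ano')

def parse_turn(text):
    """Split one speaker turn into slash units and list its non-sentence elements."""
    units = [u.strip() for u in text.split("/") if u.strip()]
    parsed = []
    for unit in units:
        elements = [{"type": t, "text": w.strip()} for t, w in ELEMENT.findall(unit)]
        plain = ELEMENT.sub(lambda m: m.group(2), unit)   # drop the braces, keep the words
        parsed.append({
            "continued": unit.startswith("--") or unit.endswith("--"),
            "text": " ".join(plain.replace("--", " ").split()),
            "non_sentence_elements": elements,
        })
    return parsed

# Invented example: a filler {F ano} and an aiduti {S hai} in one turn.
print(parse_turn("{F ano} ryoukin wo oshiete kudasai / {S hai} /"))
```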
<Paragraph position="1"> The main function of discourse markers is to show relations between utterances, such as starting a new topic, changing topics, or restarting an interrupted conversation. Words such as "mazu (first, firstly)", "dewa (then, ok)", "tsumari (I mean, that means that)" and "sorede (and so)" may become discourse markers when they appear at the head of an utterance. An utterance just before one that begins with a discourse marker qualifies as a slash unit (Figure 4).</Paragraph>
<Paragraph position="2"> In the Switchboard project (Meteer and Taylor, 1995), our {S ...} (aiduti) category is not regarded as a separate category. In Japanese dialogue, however, signals that indicate a hearer's attention to the speaker's utterances are expressed frequently. For this reason, we created 'aiduti' as a separate category. On the other hand, {A ...} (aside), {E ...} (explicit editing term), restarts and repairs are not annotated in our scheme at the present stage.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="0" end_page="0" type="metho">
<SectionTitle> 5 Dialogue Acts </SectionTitle>
<Paragraph position="0"> Identifying the dialogue act of a slash unit is a difficult task because the mapping between surface form and dialogue act is not obvious.</Paragraph>
<Paragraph position="1"> In addition, some slash units have more than one function, e.g. answering a question while stating additional information. Considering these problems, the DAMSL architecture codes various functions of a single utterance, such as the forward looking function, the backward looking function, etc.</Paragraph>
<Paragraph position="2"> However, it is difficult to determine the function of an isolated utterance. We have shown that assumptions about dialogue structure and exchange structure improve the agreement score among coders (Ichikawa et al., 1999).</Paragraph>
<Paragraph position="3"> Therefore, we define our dialogue act tagging scheme as a hierarchical refinement of the exchange structure.</Paragraph>
<Paragraph position="4"> The annotation scheme for dialogue acts includes a set of rules to identify the function of each slash unit based on the theory of speech acts (Searle, 1969) and discourse analysis (Coulthard, 1992; Stenström, 1994).</Paragraph>
<Paragraph position="5"> This scheme provides a basis for examining the local structure of dialogues.</Paragraph>
<Paragraph position="6"> In general, a dialogue[1] is modeled as problem solving subdialogues, sometimes preceded by an opening subdialogue (e.g., greeting) and followed by a closing subdialogue (e.g., expressing gratitude). A problem solving subdialogue consists of initiating and responding utterances, sometimes followed by following-up utterances (Figure 5). [1] In this paper, we limit our attention to task-oriented dialogues, which are the main target of study in computational linguistics and spoken dialogue research.</Paragraph>
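To make the exchange structure concrete, here is a minimal sketch of how a problem solving exchange could be represented in code. The class layout and the act labels (Initiate/Respond/Follow-up with subtypes) are assumptions for illustration only, since the actual tag set of Figure 7 is not reproduced in this text.

```python
# Sketch (assumed representation) of the exchange structure: an exchange holds an
# initiating slash unit, an optional response, and an optional follow-up.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SlashUnit:
    uid: int          # id later reused for relevance type 1 links (Section 6.2)
    speaker: str      # "A" or "B"
    text: str
    act: str          # illustrative stand-ins for the Figure 7 tags

@dataclass
class Exchange:
    initiation: SlashUnit
    response: Optional[SlashUnit] = None
    follow_up: Optional[SlashUnit] = None

@dataclass
class Dialogue:
    opening: List[SlashUnit] = field(default_factory=list)    # e.g. greetings
    problem_solving: List[Exchange] = field(default_factory=list)
    closing: List[SlashUnit] = field(default_factory=list)    # e.g. expressing gratitude

# Invented example of one problem solving exchange.
ex = Exchange(
    initiation=SlashUnit(1, "A", "ryoukin wo oshiete kudasai /", "Initiate:request"),
    response=SlashUnit(2, "B", "1200 en desu /", "Respond:answer"),
    follow_up=SlashUnit(3, "A", "hai /", "Follow-up:acknowledge"),
)
```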
<Paragraph position="7"> Figure 6 shows an example problem solving subdialogue with the exchange structure.</Paragraph>
<Paragraph position="9"> In this scheme, dialogue acts, the elements of the exchange structure, are classified into the tags shown in Figure 7.</Paragraph>
</Section>
<Section position="7" start_page="0" end_page="5" type="metho">
<SectionTitle> 6 Dialogue Structure and Constraints on Multiple Exchanges </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="5" type="sub_section">
<SectionTitle> 6.1 Dialogue Segment </SectionTitle>
<Paragraph position="0"> In the previous discourse model (Grosz and Sidner, 1986), a discourse segment has a beginning and an ending utterance and may have smaller discourse segments within it. It is not an easy task to identify such segments with a nesting structure in spoken dialogues, because the structure of a dialogue is often very complicated due to the interaction of the two speakers. In a preliminary experiment on coding segments in spoken dialogues, there were many disagreements on the granularity and the relation of the segments and on identifying the ending utterances of segments. An alternative scheme for coding the dialogue structure (DS) is necessary to build dialogue corpora annotated with discourse level structure.</Paragraph>
<Paragraph position="1"> In our scheme, dialogue segments are identified based on the exchanges explained in Section 5. A dialogue segment (DS) tag is inserted before initiating utterances, because initiating utterances can be thought of as the start of new discourse segments.</Paragraph>
<Paragraph position="2"> The DS tag consists of a topic break index (TBI), a topic name and a segment relation. TBI signifies the degree of topic dissimilarity between the DSs. TBI takes the value of 1 or 2: the boundary with TBI 2 is less continuous than the one with TBI 1 with regard to the topic. The topic name is labeled by the coders' subjective judgment. The segment relation indicates the relation between the preceding and the following segments, which is classified into categories including the following.</Paragraph>
<Paragraph position="3"> clarification: suspends the exchange and makes a clarification in order to obtain information necessary to answer the partner's utterance.</Paragraph>
</Section>
<Section position="2" start_page="5" end_page="5" type="sub_section">
<SectionTitle> 6.2 Constraints on multiple exchanges </SectionTitle>
<Paragraph position="0"> Annotation of dialogue segments mostly depends on the coders' intuitive judgment of the topic dissimilarity between segments. In order to lighten the burden of the coders' judgment, structural constraints on multiple exchanges have been introduced experimentally. The constraints can be classified into two types: one concerns embedded exchanges (relevance type 1) and the other neighboring exchanges (relevance type 2).</Paragraph>
<Paragraph position="1"> In relevance type 1, the relation between an initiating utterance and its responding utterance is shown by attaching the id number of the initiating utterance to the responding utterance. This id number can indicate non-adjacent initiation-response pairs with embedded exchanges inside.</Paragraph>
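A minimal sketch of what these annotations could look like in code follows. The field layout, the tag strings, the utterance ids, and the example dialogue are assumptions made for illustration; they do not reflect the corpus file format.

```python
# Sketch (assumed representation): a DS tag (TBI, topic name, segment relation)
# and relevance type 1 links, which attach the id of the initiating utterance
# to its (possibly non-adjacent) responding utterance.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DSTag:
    tbi: int                        # topic break index: 1 or 2
    topic: str                      # topic name, given by the coder
    relation: Optional[str] = None  # e.g. "clarification"

@dataclass
class Utterance:
    uid: int
    speaker: str
    text: str
    responds_to: Optional[int] = None   # relevance type 1: id of the initiating utterance

def response_pairs(utterances):
    """Recover initiation-response pairs, including ones separated by embedded exchanges."""
    by_id = {u.uid: u for u in utterances}
    return [(by_id[u.responds_to], u) for u in utterances
            if u.responds_to is not None and u.responds_to in by_id]

# Invented example: utterance 4 responds to 1 across an embedded clarification exchange (2, 3).
dialogue = [
    Utterance(1, "A", "basu wa nanji ni demasu ka /"),
    Utterance(2, "B", "dochira yuki desu ka /"),
    Utterance(3, "A", "eki yuki desu /", responds_to=2),
    Utterance(4, "B", "10 ji ni demasu /", responds_to=1),
]
print(response_pairs(dialogue))
```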
<Paragraph position="2"> In relevance type 2, structures of neighboring exchanges such as chaining, coupling and elliptical coupling (Stenström, 1994) are introduced. Chaining takes the pattern [A:I B:R] [A:I B:R] (in both exchanges, speaker A initiates and speaker B responds to A). Coupling is the pattern [A:I B:R] [B:I A:R] (speaker A initiates, speaker B both responds and initiates, and speaker A responds to B). Elliptical coupling is the pattern [A:I] [B:I A:R], equivalent to coupling with B's response to the first initiation omitted. Relevance type 2 shows whether these structures of neighboring exchanges can be observed or not. Figure 9 shows an example of the annotation of relevance types 1 and 2.</Paragraph>
</Section>
</Section>
<Section position="8" start_page="5" end_page="5" type="metho">
<SectionTitle> 7 Corpus Building Tools </SectionTitle>
<Paragraph position="0"> In the experiments, various tools for transcription and annotation were used. For transcription, the automatic segmentizer (TIME) and the online transcriber (PV) were used (Horiuchi et al., 1999). The former lists candidates for utterance units according to a parameter for the length of silences. The latter displays the energy measurement of each speaker's utterance in two windows using a speech data file. Users can see any part of a dialogue using the scroll bar, and can hear the speech of both speakers or of each speaker by selecting any region of the windows with a mouse.</Paragraph>
<Paragraph position="1"> For prosodic and part-of-speech annotation, the speech analysis software ESPS/waves+ (Entropic, 1996), the speech recognition software HTK (Odell et al., 1997) with the tri-phone model for Japanese speech developed by the IPA dictation software project (Shikano et al., 1998), and the morphological analysis system ChaSen (Matsumoto et al., 1999) were used.</Paragraph>
<Paragraph position="2"> For discourse annotation, the Dialogue Annotation Tool (DAT) had been used in the previous experiments (Core and Allen, 1997). Although DAT provides a consistency check between some functions within one sentence, we need a more wide-ranging consistency check because our scheme assumes a dialogue structure and an exchange structure. DAT is therefore unsatisfactory for our purposes, but modifying the tool to meet our needs is not easy. Thus, for the moment, we decided to use just a simple transcription viewer and sound player (TV) (Horiuchi et al., 1999), which enables us to hear the sound of utterances in the transcription.</Paragraph>
<Paragraph position="3"> Our project does not intend to create new tools. Rather, we want to use existing tools if they suit our needs or can easily be modified to satisfy them. The tools of the MATE project (Carletta and Isard, 1999), which also aims at multi-level annotation, can be a good candidate for our project. In the near future, we will examine whether we can effectively use their tools in the project.</Paragraph>
</Section>
</Paper>