File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-2073_metho.xml
Size: 7,860 bytes
Last Modified: 2025-10-06 14:10:31
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2073"> <Title>Segmented and unsegmented dialogue-act annotation with statistical dialogue models</Title>
<Section position="4" start_page="563" end_page="564" type="metho"> <SectionTitle> 2 Annotation models </SectionTitle>
<Paragraph position="0"> The statistical annotation model that we used initially was inspired by the one presented in (Stolcke et al., 2000). Under a maximum likelihood framework, they developed a formulation that assigns DAs depending on the conversation evidence (transcribed words, recognised words from a speech recogniser, phonetic and prosodic features, etc.). Stolcke's model uses simple and popular statistical models: N-grams and Hidden Markov Models. The N-grams model the probability of the DA sequence, while the HMMs model the likelihood of the evidence given the DA. The results presented in (Stolcke et al., 2000) are very promising.</Paragraph>
<Paragraph position="1"> However, the model makes some assumptions that are unrealistic when it is used as a strategy model. One of them is that the complete dialogue is available when the DA assignment is performed. In a real dialogue system, the only information available is that which precedes the current user input. Although this alternative is proposed in (Stolcke et al., 2000), no experimental results are given.</Paragraph>
<Paragraph position="2"> Another unrealistic assumption is the availability of the segmentation of the turns into utterances, where an utterance is defined as a dialogue-relevant subsequence of words in the current turn (Stolcke et al., 2000). In practice, the only information given in a turn is the usual evidence: transcribed words (for text systems) or recognised words and phonetic/prosodic features (for speech systems). Therefore, it is necessary to develop a model that copes with both the segmentation and the assignment problems.</Paragraph>
<Paragraph position="3"> Let $U_1^d = U_1 U_2 \cdots U_d$ be the sequence of DAs assigned up to the current turn, corresponding to the first $d$ segments of the current dialogue. Let $W = w_1 w_2 \ldots w_l$ be the sequence of words of the current turn, where subsequences $W_i^j = w_i w_{i+1} \ldots w_j$ can be defined ($1 \le i \le j \le l$).</Paragraph>
<Paragraph position="4"> For the sequence of words $W$, a segmentation is defined as $s_1^r = s_0 s_1 \ldots s_r$, where $s_0 = 0$, $s_r = l$, and $W = W_{s_0+1}^{s_1} W_{s_1+1}^{s_2} \cdots W_{s_{r-1}+1}^{s_r}$.</Paragraph>
<Paragraph position="6"> The optimal sequence of DAs for the current turn will be given by: $$\hat{U}_{d+1}^{d+r} = \operatorname*{argmax}_{U_{d+1}^{d+r},\, r} \Pr(U_{d+1}^{d+r} \mid U_1^d, W)$$ </Paragraph>
<Paragraph position="8"> After developing this formula and making several assumptions and simplifications, the final model, called the unsegmented model, is: $$\hat{U}_{d+1}^{d+r} = \operatorname*{argmax}_{U_{d+1}^{d+r},\, r} \max_{s_1^r} \prod_{k=1}^{r} \Pr(U_{d+k} \mid U_{d+k-n+1}^{d+k-1}) \Pr(W_{s_{k-1}+1}^{s_k} \mid U_{d+k})$$ </Paragraph>
<Paragraph position="10"> This model can be easily implemented using simple statistical models (N-grams and Hidden Markov Models). The decoding (segmentation and DA assignment) was implemented using the Viterbi algorithm. A Word Insertion Penalty (WIP) factor, similar to the one used in speech recognition, can be incorporated into the model to control the number of utterances and avoid excessive segmentation.</Paragraph>
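To make the decoding concrete, here is a minimal sketch (ours, not the authors' implementation) of a Viterbi search for the unsegmented model, assuming a bigram DA model. The functions da_bigram(u, u_prev) and utt_likelihood(segment, u) are hypothetical names for log-probability estimates of $\Pr(U_{d+k} \mid U_{d+k-1})$ and $\Pr(W_{s_{k-1}+1}^{s_k} \mid U_{d+k})$, and the wip argument applies the Word Insertion Penalty once per hypothesised segment.

    import math

    def viterbi_unsegmented(words, das, da_bigram, utt_likelihood, u_prev, wip=0.0):
        """Jointly segment a turn and assign one DA per segment (sketch).

        words          -- words w_1..w_l of the current turn
        das            -- set of DA labels
        da_bigram      -- log P(u | u_prev), assumed given
        utt_likelihood -- log P(W_i^j | u), assumed given
        u_prev         -- last DA of the dialogue history U_1^d
        wip            -- word insertion penalty per segment (log-space)
        """
        l = len(words)
        # best[j][u]: best log-score of a segmentation of w_1..w_j whose last
        # segment has DA u; back[j][u] stores (i, previous DA) for traceback.
        best = [{} for _ in range(l + 1)]
        back = [{} for _ in range(l + 1)]
        best[0][u_prev] = 0.0
        for j in range(1, l + 1):
            for i in range(j):  # candidate last segment words[i:j]
                for up, score in best[i].items():
                    for u in das:
                        s = (score + da_bigram(u, up)
                             + utt_likelihood(words[i:j], u) - wip)
                        if s > best[j].get(u, -math.inf):
                            best[j][u] = s
                            back[j][u] = (i, up)
        # Recover the best segmentation and DA sequence for the whole turn.
        u = max(best[l], key=best[l].get)
        j, result = l, []
        while j > 0:
            i, up = back[j][u]
            result.append((words[i:j], u))
            j, u = i, up
        return list(reversed(result))

Raising wip makes every additional segment more expensive, which discourages the over-segmentation mentioned above; with wip = 0 the search is driven purely by the DA and evidence probabilities.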
<Paragraph position="11"> When the segmentation into utterances is provided, the model can be simplified into the segmented model: $$\hat{U}_{d+1}^{d+r} = \operatorname*{argmax}_{U_{d+1}^{d+r}} \prod_{k=1}^{r} \Pr(U_{d+k} \mid U_{d+k-n+1}^{d+k-1}) \Pr(W_{s_{k-1}+1}^{s_k} \mid U_{d+k})$$ where the segmentation $s_1^r$ is known in advance.</Paragraph>
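Under the same assumptions (and with the same hypothetical da_bigram and utt_likelihood functions as in the previous sketch), the segmented case drops the search over segment boundaries and reduces to a Viterbi search over DA sequences only:

    import math

    def viterbi_segmented(utterances, das, da_bigram, utt_likelihood, u_prev):
        """Assign one DA to each already-segmented utterance (sketch)."""
        # best[u]: best log-score of a DA assignment so far ending in DA u;
        # paths[u]: the corresponding DA sequence.
        best, paths = {u_prev: 0.0}, {u_prev: []}
        for seg in utterances:
            new_best, new_paths = {}, {}
            for up, score in best.items():
                for u in das:
                    s = score + da_bigram(u, up) + utt_likelihood(seg, u)
                    if s > new_best.get(u, -math.inf):
                        new_best[u] = s
                        new_paths[u] = paths[up] + [u]
            best, paths = new_best, new_paths
        return paths[max(best, key=best.get)]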
<Paragraph position="13"> All the presented models take into account only word transcriptions and dialogue acts, although they could be extended to deal with other features (prosodic, syntactic, and semantic information, etc.).</Paragraph> </Section>
<Section position="5" start_page="564" end_page="565" type="metho"> <SectionTitle> 3 Experimental data </SectionTitle>
<Paragraph position="0"> Two corpora with very different features were used in the experiments with the models proposed in Section 2. The SwitchBoard corpus is composed of human-human, non-task-oriented dialogues with a large vocabulary. The Dihana corpus is composed of human-computer, task-oriented dialogues with a small vocabulary.</Paragraph>
<Paragraph position="1"> Although two corpora are not enough to draw general conclusions, they give more reliable results than a single corpus. Moreover, the very different nature of the two corpora makes our conclusions more independent of the corpus type, the annotation scheme, the vocabulary size, etc.</Paragraph>
<Section position="1" start_page="564" end_page="564" type="sub_section"> <SectionTitle> 3.1 The SwitchBoard corpus </SectionTitle>
<Paragraph position="0"> The first corpus used in the experiments was the well-known SwitchBoard corpus (Godfrey et al., 1992). The SwitchBoard database consists of human-human telephone conversations with no directed task: both speakers discuss general-interest topics, but without a clear task to accomplish. The corpus is formed by 1,155 conversations, which comprise 126,754 different turns of spontaneous and sometimes overlapped speech, using a vocabulary of 21,797 different words. The corpus was segmented into utterances, each of which was annotated with a DA following the simplified DAMSL annotation scheme (Jurafsky et al., 1997). The label set of the simplified DAMSL scheme is composed of 42 different labels, which define categories such as statement, backchannel, opinion, etc. An example of annotation is presented in Figure 1.</Paragraph> </Section>
<Section position="2" start_page="564" end_page="565" type="sub_section"> <SectionTitle> 3.2 The Dihana corpus </SectionTitle>
<Paragraph position="0"> The second corpus used was a task-oriented corpus called Dihana (Benedí et al., 2004). It is composed of human-computer dialogues, and the main aim of the task is to answer telephone queries about train timetables, fares, and services for long-distance trains in Spanish. A total of 900 dialogues were acquired using the Wizard of Oz technique and semicontrolled scenarios. Therefore, the voluntary caller was always free to express him/herself (there were no syntactic or vocabulary restrictions); however, in some dialogues, s/he had to achieve some goals using a set of restrictions that had been given previously (e.g. departure/arrival times, origin/destination, travelling on a train with some services, etc.).</Paragraph>
<Paragraph position="1"> These 900 dialogues comprise 6,280 user turns and 9,133 system turns. As a task-oriented, medium-sized corpus, its vocabulary of 812 different words is not as large as that of the SwitchBoard database.</Paragraph>
<Paragraph position="2"> The turns were segmented into utterances, and more than one utterance (each with its own label) could appear in a turn (on average, there were 1.5 utterances per user/system turn). A three-level annotation scheme for the utterances was defined (Alcácer et al., 2005). These labels represent the general purpose of the utterance (first level) as well as more specific semantic information: the second level represents the data focus of the utterance, and the third level represents the specific data present in the utterance. An example of three-level annotated user turns is given in Figure 2. The corpus was annotated by means of a semiautomatic procedure, and all the dialogues were manually corrected by human experts using a very specific set of defined rules.</Paragraph>
<Paragraph position="3"> After this process, there were 248 different labels (153 for user turns, 95 for system turns) using the three-level scheme. When the detail was reduced to the first and second levels, there were 72 labels (45 for user turns, 27 for system turns). When the detail was limited to the first level, there were only 16 labels (7 for user turns, 9 for system turns). The differences from the SwitchBoard corpus in the number of labels and in the number of examples per label are significant.</Paragraph> </Section> </Section> </Paper>