<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2073">
  <Title>Segmented and unsegmented dialogue-act annotation with statistical dialogue models</Title>
  <Section position="6" start_page="565" end_page="568" type="evalu">
    <SectionTitle>
4 Experiments and results
</SectionTitle>
    <Paragraph position="0"> The SwitchBoard database was processed to remove certain particularities. The main adaptations performed were: * The interrupted utterances (which were labelled with '+') were joined to the correct previous utterance, thereby avoiding interruptions (i.e., all the words of the interrupted utterance were annotated with the same DA).</Paragraph>
    <Paragraph position="1">  words.</Paragraph>
    <Paragraph position="2"> The experiments were performed using a cross-validation approach to avoid the statistical bias that can be introduced by the election of fixed training and test partitions. This cross-validation approach has also been adopted in other recent works on this corpus (Webb et al., 2005). In our case, we performed 10 different experiments. In each experiment, the training partition was composed of 1,136 dialogues, and the test partition was composed of 19 dialogues. This proportion was adopted so that our results could be compared with the results in (Stolcke et al., 2000), where similar training and test sizes were used. The mean figures for the training and test partitions are shown in Table 1.</Paragraph>
    <Paragraph position="3"> With respect to the Dihana database, the preprocessing included the following points:  A cross-validation approach was adopted in Dihana as well. In this case, only 5 different partitions were used. Each of them had 720 dialogues for training and 180 for testing. The statistics on the Dihana corpus are presented in Table 2.</Paragraph>
    <Paragraph position="4"> For both corpora, different N-gram models, with N = 2,3,4, and HMM of one state were trained from the training database. In the case of the SwitchBoard database, all the turns in the test set were used to compute the labelling accuracy. However, for the Dihana database, only the user turns were taken into account (because system turns follow a regular, template-based scheme, which presents artificially high labelling accuracies). Furthermore, in order to use a really significant set of labels in the Dihana corpus, we performed the experiments using only two-level labels instead of the complete three-level labels. This restriction allowed us to be more independent from the understanding issues, which are strongly related to the third level. It also allowed us to concentrate on the dialogue issues, which relate more  to the first and second levels.</Paragraph>
    <Paragraph position="5"> The results in the case of the segmented approach described in Section 2 for SwitchBoard are presented in Table 3. Two different definitions of accuracy were used to assess the results: * Utterance accuracy: computes the proportion of well-labelled utterances.</Paragraph>
    <Paragraph position="6"> * Turn accuracy: computes the proportion of totally well-labelled turns (i.e.: if the labelling has the same labels in the same order as in the reference, it is taken as a well-labelled turn).</Paragraph>
    <Paragraph position="7"> As expected, the utterance accuracy results are a bit worse than those presented in (Stolcke et al., 2000). This may be due to the use of only the past history and possibly to the cross-validation approach used in the experiments. The turn accuracy was calculated to compare the segmented and the unsegmented models. This was necessary because the utterance accuracy does not make sense for the unsegmented model.</Paragraph>
    <Paragraph position="8"> The results for the unsegmented approach for SwitchBoard are presented in Table 4. In this case, three different definitions of accuracy were used to assess the results: * Accuracy at DA level: the edit distance between the reference and the labelling of the turn was computed; then, the number of correct substitutions (c), wrong substitutions (s), deletions (d) and insertions (i) was com- null puted, and the accuracy was calculated as 100 * c(c+s+i+d).</Paragraph>
    <Paragraph position="9"> * Accuracy at turn level: this provides the proportion of well-labelled turns, without taking into account the segmentation (i.e., if the labelling has the same labels in the same order as in the reference, it is taken as a well-labelled turn).</Paragraph>
    <Paragraph position="10"> * Accuracy at segmentation level: this provides the proportion of well-labelled and segmented turns (i.e., the labels are the same as in the reference and they affect the same utterances). null The WIP parameter used in Table 4 was 50, which is the one that offered the best results. The segmentation accuracy in Table 4 must be comparedwiththeturnaccuracyinTable3. AsTable4 shows, theaccuracyofthelabellingdecreaseddramatically. This reveals the strong influence of the availability of the real segmentation of the turns. To confirm this hypothesis, similar experiments were performed with the Dihana database. Table 5 presents the results with the segmented corpus, and Table 6 presents the results with the unsegmented corpus (with WIP=50, which gave the best results). In this case, only user turns were taken into account to compute the accuracy, although the model was applied to all the turns (both user and system turns). For the Dihana corpus, the degradation of the results of the unsegmented approach with respect to the segmented approach was not as high as in the SwitchBoard corpus, due to the smaller vocabulary and complexity of the dialogues.</Paragraph>
    <Paragraph position="11"> These results led us to the same conclusion, even for such a different corpus (much more labels, task-oriented, etc.). In any case, these accuracy figures must be taken as a lower bound on the model performance because sometimes an incorrect recognition of segment boundaries or dialogueactsdoesnotcauseaninappropriatereaction null of the dialogue strategy.</Paragraph>
    <Paragraph position="12">  An illustrative example of annotation errors in the SwitchBoard database, is presented in Figure 3 for the same turns as in Figure 1. An error analysis of the segmented model was performed. The results reveals that, in the case of most of the errors were produced by the confusion of the 'sv' and 'sd' classes (about 50% of the times 'sv' was badly labelled, the wrong label was 'sd') The second turn in Figure 3 is an example of this type of error. The confusions between the 'aa' and 'b' classes were also significant (about 27% of the times 'aa' was badly labelled, the wrong label was 'b'). This was reasonable due to the similar definitions of these classes (which makes the annotation difficult, even for human experts). These errors were similar for all the N-grams used. In the case of the unsegmented model, most of the errors were produced by deletions of the 'sd' and 'sv' classes, as in the first turn in Figure 3 (about 50% of the errors). This can be explained by the presence of very short and very long utterances in both classes (i.e., utterances for 'sd' and 'sv' did not present a regular length).</Paragraph>
    <Paragraph position="13"> Some examples of errors in the Dihana corpus are shown in Figure 4 (in this case, for the same turns as those presented in Figure 2). In the segmented model, most of the errors were substitutions between labels with the same first level (especially questions and answers) where the second level was difficult to recognise. The first and third turn in Figure 4 are examples of this type of error. This was because sometimes the expressions only differed with each other by one word, or  the previous segment influence (i.e., the language model weight) was not enough to get the appropriate label. This was true for all the N-grams tested. Inthecaseoftheunsegmentedmodel,most of the errors were caused by similar misrecognitions in the second level (which are more frequent due to the absence of utterance boundaries); however, deletion and insertion errors were also significant. The deletion errors corresponded to acceptance utterances, which were too short (most of them were &amp;quot;Yes&amp;quot;). The insertion errors corresponded to &amp;quot;Yes&amp;quot; words that were placed after a new-consult system utterance, which is the case of the second turn presented in Figure 4. These words should not have been labelled as a separate utterance. In both cases, these errors were very dependant on the WIP factor, and we had to get an adequate WIP value which did not increase the insertions and did not cause too many deletions.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML