<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1026">
  <Title>Learning the Structure of Task-driven Human-Human Dialogs</Title>
  <Section position="6" start_page="201" end_page="202" type="metho">
    <SectionTitle>
4 Structural Analysis of a Dialog
</SectionTitle>
    <Paragraph position="0"> We consider a task-oriented dialog to be the result of incremental creation of a shared plan by the participants (Lochbaum, 1998). The shared plan is represented as a single tree that encapsulates the task structure (dominance and precedence relations among tasks), dialog act structure (sequences of dialog acts), and linguistic structure of utterances (inter-clausal relations and predicate-argument relations within a clause), as illustrated in Figure 1. As the dialog proceeds, an utterance from a participant is accommodated into the tree in an incremental manner, much like an incremental syntactic parser accommodates the next word into a partial parse tree (Alexandersson and Reithinger, 1997). With this model, we can tightly couple language understanding and dialog management using a shared representation, which leads to improved accuracy (Taylor et al., 1998).</Paragraph>
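To make the shared-plan representation concrete, below is a minimal sketch (not the authors' implementation) of a single tree in which subtask nodes dominate dialog-act nodes, which in turn dominate clauses, and each new utterance is accommodated on the right frontier. The class names, label strings, and accommodation policy are illustrative assumptions.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        label: str                      # e.g. "order-item", "Request(Info)", or a clause
        children: List["Node"] = field(default_factory=list)
        parent: Optional["Node"] = None

        def add(self, child: "Node") -> "Node":
            child.parent = self
            self.children.append(child)
            return child

    class PlanTree:
        """Accommodates each new utterance on the right frontier of the tree,
        loosely analogous to an incremental parser extending a partial parse."""

        def __init__(self) -> None:
            self.root = Node("dialog")
            self.current_subtask = self.root

        def accommodate(self, clause: str, dialog_act: str, subtask: str) -> None:
            # Open a new subtask node when the label changes; otherwise keep
            # attaching to the subtask currently on the right frontier.
            if self.current_subtask.label != subtask:
                self.current_subtask = self.root.add(Node(subtask))
            da_node = self.current_subtask.add(Node(dialog_act))
            da_node.add(Node(clause))

    tree = PlanTree()
    tree.accommodate("thank you for calling", "Hello(Greeting)", "opening")
    tree.accommodate("may I have your phone number", "Request(Info)", "contact-information")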
    <Paragraph position="1"> In order to infer models for predicting the structure of task-oriented dialogs, we label human-human dialogs with the hierarchical information shown in Figure 1 in several stages: utterance segmentation (Section 4.1), syntactic annotation (Section 4.2), dialog act tagging (Section 4.3) and  subtask labeling (Section 5).</Paragraph>
    <Section position="1" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
4.1 Utterance Segmentation
</SectionTitle>
      <Paragraph position="0"> The task of cleaning up spoken language utterances by detecting and removing speech repairs and disfluencies and identifying sentence boundaries has been a focus of spoken language parsing research for several years (e.g., (Bear et al., 1992; Seneff, 1992; Shriberg et al., 2000; Charniak and Johnson, 2001)). We use a system that segments the ASR output of a user's utterance into clauses.</Paragraph>
      <Paragraph position="1"> The system annotates an utterance for sentence boundaries, restarts and repairs, and identifies coordinating conjunctions, filled pauses and discourse markers. These annotations are done using a cascade of classifiers, details of which are described in (Bangalore and Gupta, 2004).</Paragraph>
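As an illustration of the cascade idea only, the sketch below chains simple rule-based stand-ins (filled-pause removal, discourse-marker tagging, clause splitting at coordinating conjunctions); the actual system uses trained classifiers at each stage (Bangalore and Gupta, 2004), and the word lists here are assumptions.

    FILLED_PAUSES = {"uh", "um", "uhhuh"}
    DISCOURSE_MARKERS = {"well", "so", "okay"}

    def remove_filled_pauses(tokens):
        return [t for t in tokens if t.lower() not in FILLED_PAUSES]

    def mark_discourse_markers(tokens):
        # Keep the tokens, but tag them so later stages can ignore them.
        return [(t, "DM" if t.lower() in DISCOURSE_MARKERS else "TOK") for t in tokens]

    def split_clauses(tagged, conjunctions=("and", "but")):
        clauses, current = [], []
        for tok, tag in tagged:
            if tag == "DM":
                continue                      # drop discourse markers from the clause text
            if tok.lower() in conjunctions and current:
                clauses.append(current)
                current = []
            else:
                current.append(tok)
        if current:
            clauses.append(current)
        return clauses

    asr_output = "um well I would like to order a sweater and uh I need it by friday".split()
    print(split_clauses(mark_discourse_markers(remove_filled_pauses(asr_output))))
    # [['I', 'would', 'like', 'to', 'order', 'a', 'sweater'],
    #  ['I', 'need', 'it', 'by', 'friday']]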
    </Section>
    <Section position="2" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
4.2 Syntactic Annotation
</SectionTitle>
      <Paragraph position="0"> We automatically annotate a user's utterance with supertags (Bangalore and Joshi, 1999). Supertags encapsulate predicate-argument information in a local structure. They are composed with each other using the substitution and adjunction operations to derive a dependency analysis of an utterance and its predicate-argument structure.</Paragraph>
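The toy sketch below is an assumption-laden illustration of how local predicate-argument frames attached to words can be read off as dependencies; real supertags are elementary LTAG trees composed by substitution and adjunction, which this simplification does not model.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Supertag:
        word: str
        head_of: List[str] = field(default_factory=list)   # argument slots, e.g. ["subj", "obj"]

    def compose(tags: List[Supertag]) -> List[Tuple[str, str, str]]:
        """Greedily fill each predicate's slots with adjacent words (left for subj,
        right for obj) and return (head, relation, dependent) triples."""
        deps = []
        for i, t in enumerate(tags):
            for slot in t.head_of:
                if slot == "subj" and i > 0:
                    deps.append((t.word, "subj", tags[i - 1].word))
                elif slot == "obj" and i + 1 < len(tags):
                    deps.append((t.word, "obj", tags[i + 1].word))
        return deps

    utterance = [Supertag("I"), Supertag("need", ["subj", "obj"]), Supertag("it")]
    print(compose(utterance))   # [('need', 'subj', 'I'), ('need', 'obj', 'it')]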
    </Section>
    <Section position="3" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
4.3 Dialog Act Tagging
</SectionTitle>
      <Paragraph position="0"> We use a domain-specific dialog act tagging scheme based on an adapted version of DAMSL (Core, 1998). The DAMSL scheme is quite comprehensive, but as others have also found (Jurafsky et al., 1998), the multi-dimensionality of the scheme makes the building of models from DAMSL-tagged data complex. Furthermore, the generality of the DAMSL tags reduces their utility for natural language generation. Other tagging schemes, such as the Maptask scheme (Carletta et al., 1997), are also too general for our purposes.</Paragraph>
      <Paragraph position="1"> We were particularly concerned with obtaining sufficient discriminatory power between different types of statement (for generation) and with including an out-of-domain tag (for interpretation). We provide a sample list of our dialog act tags in Table 2. Our experiments in automatic dialog act tagging are described in Section 6.3.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="202" end_page="203" type="metho">
    <SectionTitle>
5 Modeling Subtask Structure
</SectionTitle>
    <Paragraph position="0"> Figure 2 shows the task structure for a sample dialog in our domain (catalog ordering). An order placement task is typically composed of the sequence of subtasks opening, contact-information, order-item, related-offers, summary. Subtasks can be nested; the nesting structure can be as deep as five levels. Most often the nesting is at the left or right frontier of the subtask tree.</Paragraph>
    <Paragraph position="1"> The goal of subtask segmentation is to predict whether the current utterance in the dialog is part of the current subtask or starts a new subtask. We compare two models for recovering the subtask structure: a chunk-based model and a parse-based model.</Paragraph>
    <Paragraph position="2"> In the chunk-based model, we recover the precedence relations (sequence) of the subtasks but not dominance relations (subtask structure) among the subtasks. Figure 3 shows a sample output from the chunk model. In the parse model, we recover the complete task structure from the sequence of utterances as shown in Figure 2. Here, we describe our two models. We present our experiments on subtask segmentation and labeling in Section 6.4.</Paragraph>
    <Section position="1" start_page="202" end_page="203" type="sub_section">
      <SectionTitle>
5.1 Chunk-based model
</SectionTitle>
      <Paragraph position="0"> This model is similar to the second one described in (Poesio and Mikheev, 1998), except that we use tasks and subtasks rather than dialog games.</Paragraph>
      <Paragraph position="1"> We model the prediction problem as a classification task as follows: given a sequence of utterances $u_i$ in a dialog $U = u_1, u_2, \ldots, u_n$ and a subtask label vocabulary ($st_i \in ST$), we need to predict the best subtask label sequence $ST^* = st_1, st_2, \ldots, st_n$:
$$ST^* = \arg\max_{ST} P(ST \mid U) \quad (1)$$</Paragraph>
      <Paragraph position="3"> Each subtask has begin, middle (possibly absent) and end utterances. If we incorporate this information, the refined vocabulary of subtask labels is $ST^R = \{st^b, st^m, st^e \mid st \in ST\}$.</Paragraph>
      <Paragraph position="5"> In our experiments, we use a classifier to assign to each utterance a refined subtask label conditioned on a vector of local contextual features ($\Phi$). In the interest of using an incremental left-to-right decoder, we restrict the contextual features to be from the preceding context only. Furthermore, the search is limited to the label sequences that respect precedence among the refined labels (begin $<$ middle $<$ end). This constraint is expressed in a grammar $G$ encoded as a regular expression, $\sigma(G) = (\bigcup_i st_i^b \, (st_i^m)^* \, st_i^e)^*$. However, in order to cope with the prediction errors of the classifier, we approximate $\sigma(G)$ with an $n$-gram language model on sequences of the refined tag labels:
$$ST^{R*} = \arg\max_{ST^R \in \sigma(G)} P(ST^R \mid U) \quad (2)$$
$$\approx \arg\max_{ST^R} \prod_i P(st_i^r \mid \Phi) \quad (3)$$</Paragraph>
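A minimal sketch of the decoding constraint described above, assuming per-utterance classifier scores are already available; it enforces the begin &lt; middle &lt; end precedence directly with a transition check and a greedy left-to-right search, rather than the n-gram approximation of the grammar used in the paper. The subtask names are illustrative.

    from math import log

    SUBTASKS = ["opening", "contact-information", "order-item"]
    REFINED = [f"{st}.{pos}" for st in SUBTASKS for pos in ("b", "m", "e")]

    def allowed(prev, nxt):
        """Well-formedness: after st.b or st.m we may continue the same subtask
        (st.m or st.e); after st.e, or at the start, we must begin some st'.b."""
        if prev is None or prev.endswith(".e"):
            return nxt.endswith(".b")
        p_st, _ = prev.rsplit(".", 1)
        n_st, n_pos = nxt.rsplit(".", 1)
        return n_st == p_st and n_pos in ("m", "e")

    def decode(scores):
        """scores: one dict per utterance mapping refined labels to P(label | Phi).
        Greedy left-to-right search restricted to well-formed transitions."""
        prev, output = None, []
        for dist in scores:
            candidates = [(log(p), lab) for lab, p in dist.items()
                          if allowed(prev, lab) and p > 0]
            _, best = max(candidates)
            output.append(best)
            prev = best
        return output

    uniform = {lab: 1.0 / len(REFINED) for lab in REFINED}
    print(decode([uniform, uniform, uniform]))  # a well-formed b/m/e label sequence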
      <Paragraph position="7"> In order to estimate the conditional distribution $P(st_i^r \mid \Phi)$, we use the general technique of choosing the maximum entropy (maxent) distribution that properly estimates the average of each feature over the training data (Berger et al., 1996). This can be written as a Gibbs distribution parameterized with weights $\lambda$, where $V$ is the size of the label set. Thus,
$$P(st_i^r \mid \Phi) = \frac{e^{\lambda_{st_i^r} \cdot \Phi}}{\sum_{st=1}^{V} e^{\lambda_{st} \cdot \Phi}} \quad (4)$$</Paragraph>
      <Paragraph position="9"> We use the machine learning toolkit LLAMA (Haffner, 2006) to estimate the conditional distribution using maxent. LLAMA encodes multiclass maxent as binary maxent, in order to increase the speed of training and to scale this method to large data sets. Each of the $V$ classes in the set $ST^R$ is encoded as a bit vector such that, in the vector for class $y$, the $y$-th bit is one and all other bits are zero. Then, $V$ one-vs-other binary classifiers are used as follows:
$$P(y \mid \Phi) = 1 - P(\bar{y} \mid \Phi) = \frac{e^{\lambda_y \cdot \Phi}}{e^{\lambda_y \cdot \Phi} + e^{\lambda_{\bar{y}} \cdot \Phi}} = \frac{1}{1 + e^{-\lambda'_y \cdot \Phi}} \quad (5)$$</Paragraph>
      <Paragraph position="11"> where $\lambda_{\bar{y}}$ is the parameter vector for the anti-label $\bar{y}$ and $\lambda'_y = \lambda_y - \lambda_{\bar{y}}$. In order to compute</Paragraph>
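The sketch below illustrates the one-vs-other reduction in plain NumPy under stated assumptions (random weights and an arbitrary label-set size); it is not LLAMA, only the scoring scheme of Equation 5 followed by a simple renormalization over the binary scores.

    import numpy as np

    rng = np.random.default_rng(0)
    V, D = 9, 50                        # label-set size, feature dimension
    Lambda = rng.normal(size=(V, D))    # stands in for lambda'_y = lambda_y - lambda_ybar

    def binary_scores(phi):
        """One score per class: sigmoid(lambda'_y . Phi), as in Equation 5."""
        return 1.0 / (1.0 + np.exp(-Lambda @ phi))

    def predict(phi):
        """Renormalize the V one-vs-other scores into a distribution and pick
        the most probable refined label index."""
        s = binary_scores(phi)
        return int(np.argmax(s / s.sum()))

    phi = rng.normal(size=D)            # a feature vector Phi for one utterance
    print(predict(phi), binary_scores(phi).round(3))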
    </Section>
    <Section position="2" start_page="203" end_page="203" type="sub_section">
      <SectionTitle>
5.2 Parse-based Model
</SectionTitle>
      <Paragraph position="0"> As seen in Figure 3, the chunk model does not capture dominance relations among subtasks, which are important for resolving anaphoric references (Grosz and Sidner, 1986). Also, the chunk model is representationally inadequate for center-embedded nestings of subtasks, which do occur in our domain, although less frequently than the more prevalent tail-recursive structures.</Paragraph>
      <Paragraph position="1"> In this model, we are interested in finding the most likely plan tree ($P^*$) given the sequence of utterances:
$$P^* = \arg\max_{P} \Pr(P \mid u_1, u_2, \ldots, u_n) \quad (6)$$</Paragraph>
      <Paragraph position="3"> For real-time dialog management we use a top-down incremental parser that incorporates bottom-up information (Roark, 2001).</Paragraph>
      <Paragraph position="4"> We rewrite equation (6) to exploit the subtask sequence provided by the chunk model, as shown in Equation 7:
$$P^* = \arg\max_{P} \sum_{ST^R} \Pr(P \mid ST^R, U) \, \Pr(ST^R \mid U) \quad (7)$$
For the purpose of this paper, we approximate Equation 7 using the one-best (or k-best) subtask label sequences produced by the chunk model. (Footnote 1: However, it is conceivable to parse the multiple hypotheses of chunks (encoded as a weighted lattice) produced by the chunk model.)</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="203" end_page="206" type="metho">
    <SectionTitle>
6 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> In this section, we present the results of our experiments for modeling subtask structure.</Paragraph>
    <Section position="1" start_page="203" end_page="204" type="sub_section">
      <SectionTitle>
6.1 Data
</SectionTitle>
      <Paragraph position="0"> As our primary data set, we used 915 telephone-based customer-agent dialogs related to the task of ordering products from a catalog. Each dialog was transcribed by hand; all numbers (telephone, credit card, etc.) were removed for privacy reasons. The average dialog lasted 3.71 minutes and included 61.45 changes of speaker.</Paragraph>
      <Paragraph position="1"> A single customer-service representative might participate in several dialogs, but customers are represented by only one dialog each. Although the majority of the dialogs were on-topic, some were idiosyncratic, including requests for order corrections, transfers to customer service, incorrectly dialed numbers, and long friendly out-of-domain asides. The annotations applied to these dialogs include utterance segmentation (Section 4.1), syntactic annotation (Section 4.2), dialog act tagging (Section 4.3) and subtask segmentation (Section 5). The first two annotations are domain-independent, while the last two are domain-specific.</Paragraph>
    </Section>
    <Section position="2" start_page="204" end_page="204" type="sub_section">
      <SectionTitle>
6.2 Features
</SectionTitle>
      <Paragraph position="0"> Offline natural language processing systems, such as part-of-speech taggers and chunkers, rely on both static and dynamic features. Static features are derived from the local context of the text being tagged; dynamic features are computed based on previous predictions. The use of dynamic features usually requires a search for the globally optimal sequence, which is not possible when doing incremental processing. For dialog act tagging and subtask segmentation during dialog management, we need to predict incrementally, since it would be unrealistic to wait for the entire dialog before decoding. Thus, in order to train the dialog act (DA) and subtask segmentation classifiers, we use only static features from the current and left context, as shown in Table 1. (Footnote 2: We could use dynamic contexts as well and adopt a greedy decoding algorithm instead of a Viterbi search; we have not explored this approach in this paper.) This obviates the need for constructing a search network and performing a dynamic programming search during decoding.</Paragraph>
      <Paragraph position="1"> In lieu of the dynamic context, we use a larger static context to compute features: word trigrams and trigrams of words annotated with supertags, computed from up to three previous utterances.</Paragraph>
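A small sketch of this feature set, assuming each utterance arrives as a list of (word, supertag) pairs; the offset-prefixed feature-name scheme is an assumption, not the paper's.

    def ngrams(items, n=3):
        return ["_".join(items[i:i + n]) for i in range(len(items) - n + 1)]

    def static_features(dialog, i, context=3):
        """dialog: list of utterances; each utterance is a list of (word, supertag).
        Returns word-trigram and word+supertag-trigram features from utterances
        i-context .. i, each prefixed with its relative offset."""
        feats = []
        for offset in range(-context, 1):
            j = i + offset
            if j < 0:
                continue
            words = [w for w, _ in dialog[j]]
            tagged = [f"{w}/{t}" for w, t in dialog[j]]
            feats += [f"{offset}:w3:{g}" for g in ngrams(words)]
            feats += [f"{offset}:wt3:{g}" for g in ngrams(tagged)]
        return feats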
    </Section>
    <Section position="3" start_page="204" end_page="205" type="sub_section">
      <SectionTitle>
6.3 Dialog Act Labeling
</SectionTitle>
      <Paragraph position="0"> For dialog act labeling, we built models from our corpus and from the Maptask (Carletta et al., 1997) and Switchboard-DAMSL (Jurafsky et al., 1998) corpora. From the files for the Maptask corpus, we extracted the moves, words and speaker information (follower/giver). Instead of using the raw move information, we augmented each move with speaker information, so that, for example, the instruct move was split into instruct-giver and instruct-follower. For the Switchboard corpus, we clustered the original labels, removing most of the multidimensional tags and combining tags with minimal training data, as described in (Jurafsky et al., 1998). For all three corpora, non-sentence elements (e.g., disfluencies, discourse markers, etc.) and restarts (with and without repairs) were kept; non-verbal content (e.g., laughs, background noise, etc.) was removed.</Paragraph>
      <Paragraph position="2"> As mentioned in Section 4, we use a domain-specific tag set containing 67 dialog act tags for the catalog corpus. In Table 2, we give examples of our tags. We manually annotated 1,864 clauses from 20 dialogs selected at random from our corpus and used a ten-fold cross-validation scheme for testing. In our annotation, a single utterance may have multiple dialog act labels. For our experiments with the Switchboard-DAMSL corpus, we used 42 dialog act tags obtained by clustering over the 375 unique tags in the data. This corpus has 1,155 dialogs and 218,898 utterances; 173 dialogs, selected at random, were used for testing.</Paragraph>
      <Paragraph position="3"> The Maptask tagging scheme has 12 unique dialog act tags; augmented with speaker information, we get 24 tags. This corpus has 128 dialogs and 26,181 utterances; ten-fold cross-validation was used for testing.</Paragraph>
      <Paragraph position="4">  Table 3 shows the error rates for automatic dialog act labeling using word trigram features from the current and previous utterance. We compare error rates for our tag set to those of Switchboard-DAMSL and Maptask using the same features and the same classi er learner. The error rates for the catalog and the Maptask corpus are an average of ten-fold cross-validation. We suspect that the larger error rate for our domain compared to Maptask and Switchboard might be due to the small size of our annotated corpus (about 2K utterances for our domain as against about 20K utterances for  Maptask and 200K utterances for DAMSL).</Paragraph>
      <Paragraph position="5"> The error rates for the Switchboard-DAMSL data are significantly better than previously published results (28% error rate) (Jurafsky et al., 1998) with the same tag set. This improvement is attributable to the richer feature set we use and to a discriminative modeling framework that supports a large number of features, in contrast to the generative model used in (Jurafsky et al., 1998). A similar observation applies to the results on Maptask dialog act tagging: our model outperforms previously published results (42.8% error rate) (Poesio and Mikheev, 1998).</Paragraph>
      <Paragraph position="6"> In labeling the Switchboard data, long utterances were split into slash units (Meteer et al., 1995). A speaker's turn can be divided into one or more slash units, and a slash unit can extend over multiple turns, for example:
sv  B.64 utt3: C but, F uh
b   A.65 utt1: Uh-huh. /
+   B.66 utt1: people want all of that /
sv  B.66 utt2: C and not all of those are necessities. /
b   A.67 utt1: Right. /
The labelers were instructed to label on the basis of the whole slash unit. This makes, for example, the disfluent turn B.64 a Statement-opinion (sv) rather than a non-verbal. For the purpose of discriminative learning, this can introduce noisy data, since the context associated with the labeling decision appears later in the dialog. To address this issue, we compare two classifiers: the first (non-merged) simply propagates the same label to each cross-turn continuation of a slash unit; the second (merged) combines the units into one single utterance. Although the merged classifier breaks the regular structure of the dialog, the results in Table</Paragraph>
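The sketch below shows the two label-preparation strategies compared here, assuming each slash-unit piece has been extracted into a dictionary with a unit id shared across turns; the field names and the choice to take the first piece's label are assumptions.

    def non_merged(pieces):
        """Propagate the slash unit's label to every cross-turn continuation piece."""
        by_unit = {}
        for p in pieces:
            by_unit.setdefault(p["slash_unit"], p["label"])   # label of the first piece
        return [(p["words"], by_unit[p["slash_unit"]]) for p in pieces]

    def merged(pieces):
        """Combine all pieces of a slash unit into one single training utterance."""
        units = {}
        for p in pieces:
            words, label = units.setdefault(p["slash_unit"], ([], p["label"]))
            words.append(p["words"])
        return [(" ".join(ws), lab) for ws, lab in units.values()]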
    </Section>
    <Section position="4" start_page="205" end_page="205" type="sub_section">
      <SectionTitle>
6.4 Subtask Segmentation and Labeling
</SectionTitle>
      <Paragraph position="0"> For subtask labeling, we used a random partition of 864 dialogs from our catalog domain as the training set and 51 dialogs as the test set. All the dialogs were annotated with subtask labels by hand. We used a set of 18 labels grouped as shown in Figure 4.</Paragraph>
    </Section>
    <Section position="5" start_page="205" end_page="206" type="sub_section">
      <SectionTitle>
Figure 4: Subtask labels, grouped by type
</SectionTitle>
      <Paragraph position="0"> Table 5 shows error rates on the test set when predicting refined subtask labels using word n-gram features computed on different dialog contexts. The well-formedness constraint on the refined subtask labels significantly improves prediction accuracy. Utterance context is also very helpful; just one utterance of left-hand context leads to a 10% absolute reduction in error rate, with further reductions for additional context. While the use of trigram features helps, it is not as helpful as other contextual information. We used the dialog act tagger trained on the Switchboard-DAMSL corpus to automatically annotate the catalog domain utterances. We included these tags as features for the classifier; however, we did not see an improvement in the error rates, probably due to the high error rate of the dialog act tagger.</Paragraph>
      <Paragraph position="1"> (Table 5 caption) Error rates for predicting refined subtask labels. The error rates without the well-formedness constraint are shown in parentheses; the error rates with dialog acts as features are separated by a slash.</Paragraph>
      <Paragraph position="3"> We retrained a top-down incremental parser (Roark, 2001) on the plan trees in the training dialogs. For the test dialogs, we used the k-best (k=50) refined subtask labels for each utterance, as predicted by the chunk-based classifier, to create a lattice of subtask label sequences. For each dialog we then created n-best sequences (100-best for these experiments) of subtask labels; these were parsed and (re-)ranked by the parser. (Footnote 3: Ideally, we would have parsed the subtask label lattice directly; however, the parser has to be reimplemented to parse such lattice inputs.) We combine the weights of the subtask label sequences assigned by the classifier with the parse score assigned by the parser and select the top-scoring sequence from the list for each dialog.</Paragraph>
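A hedged sketch of this final combination step, assuming the chunk classifier's n-best sequences come with probabilities and the parser exposes a scoring function; the log-linear interpolation with weight alpha is an assumption, since the combination formula is not given here.

    from math import log

    def rerank(nbest, parser_score, alpha=0.5):
        """nbest: list of (label_sequence, chunk_score) pairs, scores in (0, 1].
        parser_score: function mapping a label sequence to a probability.
        Returns the sequence maximizing the interpolated log score."""
        def combined(item):
            labels, chunk_score = item
            return alpha * log(chunk_score) + (1 - alpha) * log(parser_score(labels))
        return max(nbest, key=combined)[0]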
      <Paragraph position="5"> The results are shown in Table 6. It can be seen that using the parsing constraint does not help the subtask label sequence prediction significantly.</Paragraph>
      <Paragraph position="6"> The chunk-based model gives almost the same accuracy, and is incremental and more efficient.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="206" end_page="206" type="metho">
    <SectionTitle>
7 Discussion
</SectionTitle>
    <Paragraph position="0"> The experiments reported in this section have been performed on transcribed speech. The audio for these dialogs, collected at a call center, was stored in a compressed format, so the speech recognition error rate is high. In future work, we will assess the performance of dialog structure prediction on recognized speech.</Paragraph>
    <Paragraph position="1"> The research presented in this paper is but one step, albeit a crucial one, towards achieving the goal of inducing human-machine dialog systems using human-human dialogs. Dialog structure information is necessary for language generation (predicting the agents' response) and dialog-state-specific text-to-speech synthesis. However, there are several challenging problems that remain to be addressed.</Paragraph>
    <Paragraph position="2"> The structuring of dialogs has another application in call center analytics. It is routine practice to monitor, analyze and mine call center data based on indicators such as the average length of dialogs and the task completion rate, in order to estimate the efficiency of a call center. By incorporating structure into the dialogs, as presented in this paper, the analysis of dialogs can be performed at a more fine-grained (task and subtask) level.</Paragraph>
  </Section>
</Paper>