<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1030"> <Title>Implications for Generating Clarification Requests in Task-oriented Dialogues</Title> <Section position="4" start_page="239" end_page="242" type="metho"> <SectionTitle> 2 CR Classification Schemes </SectionTitle> <Paragraph position="0"> We now discuss two recently proposed classification schemes for CRs, and assess their usefulness for generating CRs in a spoken dialogue system (SDS).</Paragraph> <Section position="1" start_page="239" end_page="239" type="sub_section"> <SectionTitle> 2.1 Purver, Ginzburg and Healey (PGH) </SectionTitle> <Paragraph position="0"> Purver, Ginzburg and Healey (2003) investigated CRs in the British National Corpus (BNC) (Burnard, 2000). In their annotation scheme, a CR can take seven distinct surface forms and four readings, as shown in Table 2. The examples for the form feature are possible CRs following the statement &quot;I want a flight to Edinburgh&quot;. The focus of this classification scheme is to map semantic readings to syntactic surface forms. The form feature is defined by its relation to the problematic utterance, i.e., whether a CR reprises the antecedent utterance and to what extent.</Paragraph> <Paragraph position="1"> CRs may take the three different readings as defined by Ginzburg and Cooper (2001), as well as a fourth reading which indicates a correction.</Paragraph> <Paragraph position="2"> Although PGH report good coverage of the scheme on their subcorpus of the BNC (99%), we found their classification scheme to be too coarse-grained to prescribe the form that a CR should take. As shown in example 1, Reprise Fragments (RFs), which make up one third of the CRs in the BNC, are ambiguous in their readings and may also take several surface forms.</Paragraph> <Paragraph position="3"> (1) I would like to book a flight on Monday.</Paragraph> <Paragraph position="4"> (a) Monday? frg, con/cla (b) Which Monday? frg, con (c) Monday the first? frg, con (d) The first of May? frg, con (e) Monday the first or Monday the eighth? frg, (exclusive) con RFs cover literal repetitions of part of the problematic utterance (1.a); repetitions with an additional question word (1.b); repetition with further specification (1.c); reformulations (1.d); and alternative questions (1.e)1. [Footnote 1: Alternative questions would be interpreted as asking a polar question with an exclusive reading.]</Paragraph> <Paragraph position="5"> In addition to being too general to describe such differences, the classification scheme also fails to describe similarities. As noted by Rodriguez and Schlangen (2004), PGH provide no feature to describe the extent to which an RF repeats the problematic utterance.</Paragraph> <Paragraph position="6"> Finally, some phenomena cannot be described at all by the four readings. For example, the readings do not account for non-understanding on the pragmatic level. Furthermore, the readings may have several problem sources: the clausal reading may be appropriate where the CR initiator failed to recognise the word acoustically as well as when he failed to resolve the reference. 
Since we are interested in generating CRs that indicate the source of the error, we need a classification scheme that represents such information.</Paragraph> </Section> <Section position="2" start_page="239" end_page="240" type="sub_section"> <SectionTitle> 2.2 Rodriguez and Schlangen (R&S) </SectionTitle> <Paragraph position="0"> Rodriguez and Schlangen (2004) devised a multi-dimensional classification scheme where form and function are meta-features taking sub-features as attributes.</Paragraph> <Paragraph position="1"> The function feature breaks down into the sub-features source, severity, extent, reply and satisfaction. The sources that might have caused the problem map to the levels as defined by Clark (1996). These sources can also be of different severity. The severity can be interpreted as describing the set of possible referents: asking for repetition indicates that no interpretation is available (cont-rep); asking for confirmation means that the CR initiator has some kind of hypothesis (cont-conf). The extent of a problem describes whether the CR points out a problematic element in the problem utterance. The reply represents the answer the addressee gives to the CR. The satisfaction of the CR-initiator is indicated by whether he renews the request for clarification or not.</Paragraph> <Paragraph position="2"> The meta-feature form describes how the CR is linguistically realised. It describes the sentence's mood, whether it is grammatically complete, the relation to the antecedent, and the boundary tone. According to R&S's classification scheme our illustrative example would be annotated as follows2: [Footnote 2: The function features answer and satisfaction are ignored as they depend on how the dialogue continues. The interpretation of the source is dependent on the reply to the CR. Therefore all possible interpretations are listed.] (2) I would like to book a flight on Monday.</Paragraph> <Paragraph position="3"> (a) Monday? mood: decl completeness: partial rel-antecedent: repet source: acous/np-ref severity: cont-repet extent: yes (b) Which Monday? mood: wh-question completeness: partial rel-antecedent: addition source: np-ref severity: cont-repet extent: yes (c) Monday the first? mood: decl completeness: partial rel-antecedent: addition source: np-ref severity: cont-conf extent: yes (d) The first of May? mood: decl completeness: partial rel-antecedent: reformul source: np-ref severity: cont-conf extent: yes (e) Monday the first or Monday the eighth? mood: alt-q completeness: partial rel-antecedent: addition source: np-ref severity: cont-repet extent: yes</Paragraph> <Paragraph position="4"> In R&S's classification scheme, ambiguities about CRs having different sources cannot be resolved entirely, as example (2.a) shows. However, in contrast to PGH, the overall approach is a different one: instead of explaining causes of CRs within a theoretic-semantic model (as the three different readings of Ginzburg and Cooper (2001) do), they infer the interpretation of the CR from the context. Ambiguities get resolved by the reply of the addressee, and the satisfaction of the CR initiator indicates the &quot;mutually agreed interpretation&quot;. R&S's multi-dimensional CR description allows the fine-grained distinctions needed to generate natural CRs to be made. For example, PGH's general category of RFs can be made more specific via the values for the feature relation to antecedent. 
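To make the feature geometry of the R&S scheme concrete, the sketch below shows one possible machine-readable encoding of such an annotation and instantiates it for example (2.a) under one of its candidate readings. It is an illustration only, written in Python with class and attribute names of our own choosing; R&S define the scheme as an annotation standard on paper, not as software.

```python
# Illustrative only: a minimal encoding of an R&S-style CR annotation.
# Class and attribute names are our own, not prescribed by R&S.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CRAnnotation:
    # form meta-feature
    mood: str                            # e.g. "decl", "wh-question", "alt-q"
    completeness: str                    # "partial" or "complete"
    rel_antecedent: str                  # e.g. "repet", "addition", "reformul"
    boundary_tone: Optional[str] = None  # e.g. "rising"
    # function meta-feature
    source: Optional[str] = None         # e.g. "acous", "np-ref"
    severity: Optional[str] = None       # e.g. "cont-repet", "cont-conf"
    extent: Optional[bool] = None        # does the CR point out a specific element?
    reply: Optional[str] = None          # filled in once the addressee answers
    satisfaction: Optional[bool] = None  # filled in once the dialogue continues

# Example (2.a), "Monday?", under one of its possible readings:
cr_2a = CRAnnotation(
    mood="decl", completeness="partial", rel_antecedent="repet",
    source="acous", severity="cont-repet", extent=True,
)
print(cr_2a)
```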
In addition, the form feature is not restricted to syntax; it includes features such as intonation and coherence, which are useful for generating the surface form of CRs. Furthermore, the multi-dimensional function feature allows us to describe information relevant to generating CRs that is typically available in dialogue systems, such as the level of confidence in the hypothesis and the problem source.</Paragraph> </Section> <Section position="3" start_page="240" end_page="241" type="sub_section"> <SectionTitle> 3.1 Material and Method </SectionTitle> <Paragraph position="0"> Material: We annotated the human-human travel reservation dialogues available as part of the Carnegie Mellon Communicator Corpus (Bennett and Rudnicky, 2002) because we were interested in studying naturally occurring CRs in task-oriented dialogue. In these dialogues, an experienced travel agent is making reservations for trips that people in the Carnegie Mellon Speech Group were taking in the upcoming months. The corpus comprises 31 dialogues of transcribed telephone speech, with 2098 dialogue turns and 19395 words.</Paragraph> <Paragraph position="1"> Annotation Scheme: Our annotation scheme, shown in Figure 1, is an extension of the R&S scheme described in the previous section. R&S's scheme was devised for and tested on the Bielefeld Corpus of German task-oriented dialogues about joint problem solving.3 To annotate the Communicator Corpus we extended the scheme in the following ways. First, we found the need to distinguish CRs that consist only of newly added information, as in example 3, from those that add information while also repeating part of the utterance to be clarified, as in 4. We augmented the scheme to allow two distinct values for the form feature relation-antecedent, add for cases like 3 and repet-add for cases like 4.</Paragraph> <Paragraph position="2"> (3) Cust: What is the last flight I could come back on? Agent: On the 29th of March? (4) Cust: I'll be returning on Thursday the fifth. Agent: The fifth of February? To the function feature source we added the values belief to cover CRs like 5 and ambiguity refinement to cover CRs like 6.</Paragraph> <Paragraph position="3"> (5) Agent: You need a visa.</Paragraph> <Paragraph position="4"> Cust: I do need one? Agent: Yes you do.</Paragraph> <Paragraph position="5"> (6) Agent: Okay I have two options . . . with Hertz . . . if not they do have a lower rate with Budget and that is fifty one dollars.</Paragraph> <Paragraph position="6"> Cust: Per day? Agent: Per day um mm.</Paragraph> <Paragraph position="7"> Finally, following Gabsdil (2003) we introduced an additional value for severity, cont-disamb, to cover CRs that request disambiguation when more than one interpretation is available.</Paragraph> <Paragraph position="8"> Method: We first identified turns containing CRs, and then annotated them with form and function features. It is not always possible to identify CRs from the utterance alone. Frequently, context (e.g., the reaction of the addressee) or intonation is required to distinguish a CR from other feedback strategies, such as positive feedback. See (Rieser, 2004) for a detailed discussion. The annotation was only performed once. The coding scheme is a slight variation of R&S, which has been shown to be reliable, with a Kappa of 0.7 for identifying the source.</Paragraph> </Section>
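The Kappa figure above is the reliability reported for the R&S scheme. As a purely illustrative aside, the snippet below sketches how Cohen's Kappa on the source feature could be computed from two annotators' label sequences; the label lists are invented for the example and are not taken from either corpus.

```python
# A minimal sketch of inter-annotator agreement (Cohen's Kappa) on the
# "source" feature for two hypothetical annotators. Labels are invented.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled at random according to
    # their own marginal label distributions.
    expected = sum(freq_a[l] * freq_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["np-ref", "acoustic", "np-ref", "intention", "acoustic", "np-ref"]
annotator_2 = ["np-ref", "acoustic", "np-ref", "acoustic", "acoustic", "np-ref"]
print(round(cohen_kappa(annotator_1, annotator_2), 2))
```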
<Section position="4" start_page="241" end_page="242" type="sub_section"> <SectionTitle> 3.2 Forms and Functions of CRs in the Communicator Corpus </SectionTitle> <Paragraph position="0"> The human-human dialogues in the Communicator Corpus contain 98 CRs in 2098 dialogue turns (4.6%).</Paragraph> <Paragraph position="1"> Forms: The frequencies for the values of the individual form features are shown in Table 3. The most frequent type of CRs were partial declarative questions, which combine the mood value declarative and the completeness value partial.4 [Footnote 4: Declarative questions cover &quot;all cases of non-interrogative word-order, i.e., both declarative sentences and fragments&quot; (Rodriguez and Schlangen, 2004).] These account for 53.1% of the CRs in the corpus. Moreover, four of the five most frequent surface forms of CRs in the Communicator Corpus differ only in the value for the feature relation-antecedent. They are partial declaratives with rising boundary tone that either reformulate (7.1%) the problematic utterance, repeat the problematic constituent (11.2%), add only new information (7.1%), or repeat the problematic constituent and add new information (10.2%). The fifth most frequent type is conventional CRs (10.2%).5</Paragraph> <Paragraph position="2"> Functions: The distributions of the function features are given in Figure 4. The most frequent source of problems was np-reference. Next most frequent were acoustic problems, possibly due to the poor channel quality. Third were CRs that enquire about intention. As indicated by the feature extent, almost 80% of CRs point out a specific element of the problematic utterance. The features severity and answer illustrate that most of the time CRs request confirmation of an hypothesis (73.5%) with a yes/no answer (64.3%). The majority of the provided answers were satisfying, which means that the addressee tends to interpret the CR correctly and answers collaboratively. Only 6.1% of CRs failed to elicit a response.</Paragraph> </Section> </Section> <Section position="5" start_page="242" end_page="244" type="metho"> <SectionTitle> 4 CRs in Task-oriented Dialogue </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="242" end_page="244" type="sub_section"> <SectionTitle> 4.1 Comparison </SectionTitle> <Paragraph position="0"> In order to determine whether there are differences as regards CRs between task-oriented dialogues and everyday conversations, we compared our results to those of PGH's study on the BNC and those of R&S on the Bielefeld Corpus. The BNC contains a 10 million word sub-corpus of English dialogue transcriptions about topics of general interest. PGH analysed a portion consisting of ca. 10,600 turns, ca. 150,000 words. R&S annotated 22 dialogues from the Bielefeld Corpus, consisting of ca. 3962 turns, ca. 36,000 words.</Paragraph> <Paragraph position="1"> The major differences in the feature distributions are listed in Table 5. We found that there are no significant differences between the feature distributions for the Communicator and Bielefeld corpora, but that the differences between Communicator and BNC, and Bielefeld and BNC, are significant at the levels indicated in Table 5 using Pearson's chi-square test. The differences between dialogues of different types suggest that there is a different grounding strategy. 
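To illustrate the kind of test behind these comparisons, the sketch below runs a Pearson chi-square test on a small 2x2 contingency table (feature present vs. absent in two corpora). The counts are placeholders chosen for the example only; they are not the figures from Table 5.

```python
# Sketch of the kind of comparison reported above: a Pearson chi-square test
# on a 2x2 contingency table. Counts are PLACEHOLDERS, not the Table 5 data.
from scipy.stats import chi2_contingency

#                 feature present, feature absent
table = [[80, 18],   # task-oriented corpus (hypothetical counts)
         [45, 55]]   # everyday-conversation corpus (hypothetical counts)

# scipy applies the Yates continuity correction by default for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```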
In task-oriented dialogues we see a trade-off between avoiding misunderstanding and keeping the conversation as efficient as possible. The hypothesis that grounding in task-oriented dialogues is more cautious is supported by the following facts (as shown by the figures in Table 5): * CRs are more frequent in task-oriented dialogues.</Paragraph> <Paragraph position="2"> * The overwhelming majority of CRs directly follow the problematic utterance (p < .005).</Paragraph> <Paragraph position="3"> * CRs in everyday conversation fail to elicit a response nearly three times as often.6 [Footnote 6: Another factor that might account for these differences is that the BNC contains multi-party conversations, and questions in multi-party conversations may be less likely to receive responses. Furthermore, due to the poor recording quality of the BNC, many utterances are marked as &quot;not interpretable&quot;, which could also lower the response rate.] * Even though dialogue participants seem to have strong hypotheses, they frequently confirm them.</Paragraph> <Paragraph position="4"> Although grounding is more cautious in task-oriented dialogues, the dialogue participants try to keep the dialogue as efficient as possible: * Most CRs are partial in form.</Paragraph> <Paragraph position="5"> * Most of the CRs point out one specific element (with only a minority being independent, as shown in Table 5). Therefore, in task-oriented dialogues, CRs locate the understanding problem directly and give partial credit for what was understood.</Paragraph> <Paragraph position="6"> * In task-oriented dialogues, the CR-initiator asks to confirm an hypothesis about what he understood rather than asking the other dialogue participant to repeat her utterance.</Paragraph> <Paragraph position="7"> * The addressee prefers to give a short y/n answer in most cases.</Paragraph> <Paragraph position="8"> Comparing error sources in the two task-oriented corpora, we found a number of differences, as shown in Table 6. In particular: [Table 6 (excerpt), problem source in the Communicator vs. the Bielefeld Corpus: ambiguity 4.1% vs. not eval. (n/a); belief 6.1% vs. not eval. (n/a); relevance 2.1% vs. not eval. (n/a); intention 8.2% vs. 22.2% (**); several 2.0% vs. 14.3% (***)] * Dialogue type: Belief and ambiguity refinement do not seem to be a source of problems in joint problem solving dialogues, as R&S did not include them in their annotation scheme.</Paragraph> <Paragraph position="9"> For CRs in information seeking these features need to be added to explain quite frequent phenomena. As shown in Table 6, 10.2% of CRs were in one of these two classes.</Paragraph> <Paragraph position="10"> * Modality: Deictic reference resolution causes many more understanding difficulties in dialogues where people have a shared point of view than in telephone communication (Bielefeld: most frequent problem source; Communicator: one instance detected). Furthermore, in the Bielefeld Corpus, people tend to formulate more fragmentary sentences. In environments where people have a shared point of view, complete sentences can be avoided by using non-verbal communication channels. Finally, we see that establishing contact is more of a problem when speech is the only modality available. * Channel quality: Acoustic problems are much more likely in the Communicator Corpus.</Paragraph> <Paragraph position="11"> These results indicate that the decision process for grounding needs to consider the modality, the domain, and the communication channel. 
Similar extensions to the grounding model are suggested by Traum (1999).</Paragraph> </Section> <Section position="2" start_page="244" end_page="244" type="sub_section"> <SectionTitle> 4.2 Consequences for Generation </SectionTitle> <Paragraph position="0"> The similarities and differences detected can be used to give recommendations for generating CRs.</Paragraph> <Paragraph position="1"> In terms of when to initiate a CR, we can state that clarification should not be postponed, and immediate, local management of uncertainty is critical. This view is also supported by observations of how non-native speakers handle non-understanding (Paek, 2003).</Paragraph> <Paragraph position="2"> Furthermore, for task-oriented dialogues the system should present an hypothesis to be confirmed, rather than ask for repetition. Our data suggests that, when they are confronted with uncertainty, humans tend to build up hypotheses from the dialogue history and from their world knowledge. For example, when the customer specified a date without a month, the travel agent would propose the most reasonable hypothesis instead of asking a wh-question. It is interesting to note that Skantze (2003) found that users are more satisfied if the system &quot;hides&quot; its recognition problem by asking a task-related question to help to confirm the hypothesis, rather than explicitly indicating non-understanding.</Paragraph> </Section> </Section> <Section position="6" start_page="244" end_page="244" type="metho"> <SectionTitle> 5 Correlations between Function and Form: How to say it? </SectionTitle> <Paragraph position="0"> Once the dialogue system has decided on the function features, it must find a corresponding surface form to be generated. Many forms are indeed related to the function, as shown in Table 7, where we present a significance analysis using Pearson's chi-square (with Yates correction).</Paragraph> <Paragraph position="1"> Source: We found that the relation to the antecedent seems to distinguish fairly reliably between CRs clarifying reference and those clarifying acoustic understanding. In the Communicator Corpus, for acoustic problems the CR-initiator tends to repeat the problematic part literally, while reference problems trigger a reformulation or a repetition with addition. For both problem sources, partial declarative questions are preferred. These findings are also supported by R&S. For the first level of non-understanding, the inability to establish contact, complete polar questions with no relation to the antecedent are formulated, e.g., &quot;Are you there?&quot;.</Paragraph> <Paragraph position="2"> Severity: The severity indicates how much was understood, i.e., whether the CR initiator asks to confirm an hypothesis or to repeat the antecedent utterance. The severity of an error strongly correlates with the sentence mood. Declarative and polar questions, which take up material from the problematic utterance, ask to confirm an hypothesis. Wh-questions, which are independent, reformulations or repetitions with additions (e.g., wh-substituted reprises) of the problematic utterance usually prompt for repetition, as do imperatives. Alternative questions prompt the addressee to disambiguate the hypothesis.</Paragraph> <Paragraph position="3"> Answer: By definition, certain types of question prompt for certain answers. Therefore, the feature answer is closely linked to the sentence mood of the CR. As polar questions and declarative questions generally enquire about a proposition, i.e., an hypothesis or belief, they tend to receive yes/no answers, but repetitions are also possible. Wh-questions, alternative questions and imperatives tend to get answers providing additional information (i.e., reformulations and elaborations).</Paragraph>
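To show how such correlations could be operationalised, the sketch below maps a few function-feature values onto a rough surface-form specification, loosely following the tendencies described above for source and severity and the recommendation in Section 4.2 to confirm a hypothesis rather than ask for repetition. It is our own illustrative rule set, not the authors' generation component; the value names are only meant to mirror those used in the annotation scheme.

```python
# Illustrative decision rules for choosing a CR surface form from function
# features. A sketch of our own, not the paper's generation module.

def choose_cr_form(source: str, severity: str) -> dict:
    """Map (source, severity) to a rough surface-form specification."""
    if source == "contact":
        # Failure to establish contact: complete polar question, no reprise.
        return {"mood": "polar", "completeness": "complete", "rel_antecedent": "none"}
    if severity == "cont-conf":
        # A hypothesis is available: confirm it with a partial declarative,
        # repeating for acoustic problems, reformulating for reference problems.
        rel = "repet" if source == "acous" else "reformul"
        return {"mood": "decl", "completeness": "partial",
                "rel_antecedent": rel, "boundary_tone": "rising"}
    if severity == "cont-disamb":
        # Several competing hypotheses: ask an alternative question.
        return {"mood": "alt-q", "completeness": "partial", "rel_antecedent": "addition"}
    # No usable hypothesis: prompt for repetition, e.g. with a wh-question.
    return {"mood": "wh-question", "completeness": "partial", "rel_antecedent": "addition"}

print(choose_cr_form("np-ref", "cont-conf"))
print(choose_cr_form("acous", "cont-repet"))
```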
<Paragraph position="4"> Extent: The function feature extent is logically independent from the form feature completeness, although they are strongly correlated. Extent is a binary feature indicating whether the CR points out a specific element or concerns the whole utterance.</Paragraph> <Paragraph position="5"> Most fragmentary declarative questions and fragmentary polar questions point out a specific element, especially when they are not independent but stand in some relation to the antecedent utterance. Independent complete imperatives address the whole previous utterance.</Paragraph> <Paragraph position="6"> The correlations found in the Communicator Corpus are fairly consistent with those found in the Bielefeld Corpus, and thus we believe that the guidelines for generating CRs in task-oriented dialogues may be language independent, at least for German and English.</Paragraph> </Section> </Paper>