<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1056">
  <Title>Evaluating a Trainable Sentence Planner for a Spoken Dialogue System</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experimental Context and Design
</SectionTitle>
    <Paragraph position="0"> Our research concerns developing and evaluating a portable generation component for a mixed-initiative travel planning system, AMELIA, developed at AT&amp;T Labs as part of DARPA Communicator. Consider the required generation capabilities of AMELIA, as illustrated in Figure 1.</Paragraph>
    <Paragraph position="1"> Utterance System1 requests information about the caller's departure airport, but in User2, the caller takes the initiative to provide information about her destination. In System3, the system's goal is to implicitly confirm the destination (because of the possibility of error in the speech recognition component), and request information (for the second time) of the caller's departure airport. This combination of communicative goals arises dynamically in the dialog because the system supports user initiative, and requires different capabilities for generation than if the system could only understand the direct answer to the question that it asked in System1.</Paragraph>
    <Paragraph position="2"> In User4, the caller provides this information but takes the initiative to provide the month and day of travel. Given the system's dialog strategy, the communicative goals for its next turn are to implicitly confirm all the information that the user has provided so far, i.e. the departure and destination cities and the month and day information, as well as to request information about the time of travel. The system's representation of its communicative goals for System5 is in Figure 2. As before, this combination of communicative goals arises in response to the user's initiative.</Paragraph>
    <Paragraph position="4"> System5 in Figure 1 Like most working research spoken dialog systems, AMELIA uses hand-crafted, template-based generation. Its output is created by choosing string templates for each elementary speech act, using a large choice function which depends on the type of speech act and various context conditions. Values of template variables (such as origin and destination cities) are instantiated by the dialog manager. The string templates for all the speech acts of a turn are heuristically ordered and then appended to produce the output. In order to produce output that is not highly redundant, string templates must be written for every possible combination of speech acts in a text plan. We refer to the output generated by AMELIA using this approach as the TEMPLATE output.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
System Realization
</SectionTitle>
      <Paragraph position="0"> TEMPLATE Flying from Newark to Dallas, Leaving on the 1st of September, And what time did you want to leave? SPoT What time would you like to travel on September the 1st to Dallas from Newark?  September the 1st to Dallas from Newark? RANDOM Leaving in September. Leaving on the 1st. What time would you, traveling from Newark to Dallas, like to leave? NOAGG Leaving on the 1. Leaving in September. Going to Dallas. Leaving from Newark.</Paragraph>
      <Paragraph position="1"> What time would you like to leave?  for each type of generation system used in the evaluation experiment.</Paragraph>
      <Paragraph position="2"> We perform an evaluation using human subjects who judged the TEMPLATE output of AMELIA against five NLG-based approaches: SPOT, two rule-based approaches, and two baselines. We describe them in Section 3. An example output for the text plan in Figure 2 for each system is in Figure 3. The experiment required human subjects to read 5 dialogs of real interactions with AMELIA. At 20 points over the 5 dialogs, AMELIA's actual utterance (TEMPLATE) is augmented with a set of variants; each set of variants included a representative generated by SPOT, and representatives of the four comparison sentence planners. At times two or more of these variants coincided, in which case sentences were not repeated and fewer than six sentences were presented to the subjects. The subjects rated each variation on a 5-point Likert scale, by stating the degree to which they agreed with the statement The system's utterance is easy to understand, well-formed, and appropriate to the dialog context. Sixty colleagues not involved in this research completed the experiment.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Sentence Planning Systems
</SectionTitle>
    <Paragraph position="0"> This section describes the five sentence planners that we compare. SPOT, the two rule-based systems, and the two baseline sentence planners are all NLG based sentence planners. In Section 3.1, we describe the shared representations of the NLG based sentence planners. Section 3.2 describes the baselines, RANDOM and NOAGG.</Paragraph>
    <Paragraph position="1"> Section 3.3 describes SPOT. Section 3.4 describes the rule-based sentence planners, RBS and ICF.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Aggregation in Sentence Planning
</SectionTitle>
      <Paragraph position="0"> In all of the NLG sentence planners, each speech act is assigned a canonical lexico-structural representation (called a DSyntS - Deep Syntactic Structure (MelVcuk, 1988)). We exclude issues of lexical choice from this study, and restrict our attention to the question of how elementary structures for separate elementary speech acts are assembled into extended discourse. The basis of all the NLG systems is a set of clause-combining operations that incrementally transform a list of elementary predicate-argument representations (the DSyntSs corresponding to the elementary speech acts of a single text plan) into a list of lexico-structural representations of one or more sentences, that are sent to a surface realizer. We uti- null DSyntSs are combined using the operations exemplified in Figure 4. The result of applying the operations is a sentence plan tree (or sp-tree for short), which is a binary tree with leaves labeled by all the elementary speech acts from the input text plan, and with its interior nodes labeled with clause-combining operations. As an example, Figure 5 shows the sp-tree for utterance System5 in Figure 1. Node soft-merge-general a0 merges an implicit-confirmation of the destination city and the origin city. The row labelled SOFT-MERGE in Figure 4 shows the result when Args 1 and 2 are implicit confirmations of the origin and destination. See (Walker et al., 2001) for more detail on the sp-tree. The experimental sentence planners described below vary how the sp-tree is constructed.</Paragraph>
      <Paragraph position="1">  tem 5 in Dialog D1</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Baseline Sentence Planners
</SectionTitle>
      <Paragraph position="0"> In one obvious baseline system the sp-tree is constructed by applying only the PERIOD operation: each elementary speech act is realized as its own sentence. This baseline, NOAGG, was suggested by Hovy and Wanner (1996). For NOAGG, we order the communicative acts from the text plan as follows: implicit confirms precede explicit confirms precede requests. Figure 3 includes a NOAGG output for the text plan in Figure 2.</Paragraph>
      <Paragraph position="1"> A second possible baseline sentence planner simply applies combination rules randomly according to a hand-crafted probability distribution based on preferences for operations such as the MERGE family over CONJUNCTION and PERIOD.</Paragraph>
      <Paragraph position="2"> In order to be able to generate the resulting sentence plan tree, we exclude certain combinations, such as generating anything other than a PERIOD above a node labeled PERIOD in a sentence plan.</Paragraph>
      <Paragraph position="3"> The resulting sentence planner we refer to as RANDOM. Figure 3 includes a RANDOM output for the text plan in Figure 2.</Paragraph>
      <Paragraph position="4"> In order to construct a more complex, and hopefully better, sentence planner, we need to encode constraints on the application of, and ordering of, the operations. It is here that the remaining approaches differ. In the first approach, SPOT, we learn constraints from training material; in the second approach, rule-based, we construct constraints by hand.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 SPoT: A Trainable Sentence Planner
</SectionTitle>
      <Paragraph position="0"> For the sentence planner SPOT, we reconceptualize sentence planning as consisting of two distinct phases as in Figure 6. In the first phase, the sentence-plan-generator (SPG) randomly generates up to twenty possible sentence plans for a given text-plan input. For this phase we use the RANDOM sentence-planner. In the second phase,  by using RANDOM to randomly generate up to 20 realizations for 100 turns; two human judges then ranked each of these realizations (using the setup described in Section 2). Over 3,000 features were discovered from the generated trees by routines that encode structural and lexical aspects of the sp-trees and the DSyntS. RankBoost identified the features that contribute most to a realization's ranking. The SPR uses these rules to rank alternative sp-trees, and then selects the top-ranked output as input to the surface realizer.</Paragraph>
      <Paragraph position="1"> Walker et al. (2001) describe SPOT in detail.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Two Rule-Based Sentence Planners
</SectionTitle>
      <Paragraph position="0"> It has not been the object of our research to construct a rule-based sentence planner by hand, be it domain-independent or optimized for our domain. Our goal was to compare the SPOT sentence planner with a representative rule-based system. We decided against using an existing off-the-shelf rule-based system, since it would be too complex a task to port it to our application. Instead, we constructed two reasonably representative rule-based sentence planners. This task was made easier by the fact that we could reuse much of the work done for SPOT, in particular the data structure of the sp-tree and the implementation of the clause-combining operations. We developed the two systems by applying heuristics for producing good output, such as preferences for aggregation. They differ only in the initial ordering of the communicative acts in the input text plan.</Paragraph>
      <Paragraph position="1"> In the first rule-based system, RBS (for &amp;quot;Rule-Based System&amp;quot;), we order the speech acts with explicit confirms first, then requests, then implicit confirms. Note that explicit confirms and requests do not co-occur in our data set. The second rule-based system is identical, except that implicit confirms come first rather than last. This system we call ICF (for &amp;quot;Rule-based System with Implicit Confirms First&amp;quot;).</Paragraph>
      <Paragraph position="2"> In the initial step of both RBS and ICF, we take the two leftmost members of the text plan and try to combine them using the following preference ranking of the combination operations: ADJECTIVE, the MERGEs, CONJUNC-TION, RELATIVE-CLAUSE, PERIOD. The first operation to succeed is chosen. This yields a binary sp-tree with three nodes, which becomes the current sp-tree. As long as the root node of the current sp-tree is not a PERIOD, we iterate through the list of remaining speech acts on the ordered text plan, combining each one with the current sp-tree using the preference-ranked operations as just described. The result of each iteration step is a binary, left-branching sp-tree. However, if the root node of the current sp-tree is a PERIOD, we start a new current sp-tree, as in the initial step described above. When the text plan has been exhausted, all partial sp-trees (all of which except for the last one are rooted in PERIOD) are combined in a left-branching tree using PERIOD. Cue words are added as follows: (1) The cue word now is attached to utterances beginning a new subtask; (2) The cue word and is attached to utterances continuing a subtask; (3) The cue words alright or okay are attached to utterances containing implicit confirmations. Figure 3 includes an RBS and an ICF output for the text plan in Figure 2. In this case ICF and RBS differ only in the verb chosen as a more general verb during the SOFT-MERGE operation.</Paragraph>
      <Paragraph position="3"> We illustrate the RBS procedure with an example for which ICF works similarly. For RBS, the text plan in Figure 2 is ordered so that the request is first. For the request, a DSyntS is chosen that can be paraphrased as What time would you like to leave?. Then, the first implicit-confirm is translated by lookup into a DSyntS which on its own could generate Leaving in September.</Paragraph>
      <Paragraph position="4"> We first try the ADJECTIVE aggregation operation, but since neither tree is a predicative adjective, this fails. We then try the MERGE family. MERGE-GENERAL succeeds, since the tree for the request has an embedded node labeled leave. The resulting DSyntS can be paraphrased as What time would you like to leave in September?, and is attached to the new root node of the resulting sp-tree. The root node is labeled MERGE-GENERAL, and its two daughters are the two speech acts. The implicit-confirm of the day is added in a similar manner (adding another left-branching node to the sp-tree), yielding a DSyntS that can be paraphrased as What time would you like to leave on September the 1st? (using some special-case attachment for dates within MERGE). We now try and add the DSyntS for the implicit-confirm, whose DSyntS might generate Going to Dallas. Here, we again cannot use ADJECTIVE, nor can we use MERGE or MERGE-GENERAL, since the verbs are not identical. Instead, we use SOFT-MERGE-GENERAL, which identifies the leave node with the go root node of the DSyntS of the implicit-confirm. When softmerging leave with go, fly is chosen as a generalization, resulting in a DSyntS that can be generated as What time would you like to fly on September the 1st to Dallas?. The sp-tree has added a layer but is still left-branching. Finally, the last implicit-confirm is added to yield a DSyntS that is realized as What time would you like to fly on September the 1st to Dallas from Newark?.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> All 60 subjects completed the experiment in a half hour or less. The experiment resulted in a total of 1200 judgements for each of the systems being compared, since each subject judged 20 utterances by each system. We first discuss overall differences among the different systems and then make comparisons among the four different types of systems: (1) TEMPLATE, (2) SPOT, (3) two rule-based systems, and (4) two baseline systems.</Paragraph>
    <Paragraph position="1"> All statistically significant results discussed here had p values of less than .01.</Paragraph>
    <Paragraph position="2"> We first examined whether differences in human ratings (score) were predictable from the  independent variable and score as the dependent variable showed that there were significant differences in score as a function of system. The overall differences are summarized in Figure 7.</Paragraph>
    <Paragraph position="3"> As Figure 7 indicates, some system outputs received more consistent scores than others, e.g. the standard deviation for TEMPLATE was much smaller than RANDOM. The ranking of the systems by average score is TEMPLATE, SPOT, ICF, RBS, NOAGG, and RANDOM. Posthoc comparisons of the scores of individual pairs of systems using the adjusted Bonferroni statistic revealed several different groupings.2 The highest ranking systems were TEMPLATE and SPOT, whose ratings were not statistically significantly different from one another. This shows that it is possible to match the quality of a hand-crafted system with a trainable one, which should be more portable, more general and require less overall engineering effort.</Paragraph>
    <Paragraph position="4"> The next group of systems were the two rule-based systems, ICF and RBS, which were not statistically different from one another. However SPOT was statistically better than both of these systems (p a0 .01). Figure 8 shows that SPOT got more high rankings than either of the rule-based systems. In a sense this may not be that surprising, because as Hovy and Wanner (1996) point out, it is difficult to construct a rule-based sentence planner that handles all the rule interactions in a reasonable way. Features that SPoT's SPR uses allow SPOT to be sensitive to particular discourse configurations or lexical collocations.</Paragraph>
    <Paragraph position="5"> In order to encode these in a rule-based sentence  dentally finding differences between systems when making multiple comparisons among systems.</Paragraph>
    <Paragraph position="6"> planner, one would first have to discover these constraints and then determine a way of enforcing them. However the SPR simply learns that a particular configuration is less preferred, resulting in a small decrement in ranking for the corresponding sp-tree. This flexibility of incrementing or decrementing a particular sp-tree by a small amount may in the end allow it to be more sensitive to small distinctions than a rule-based system. Along with the TEMPLATE and RULE-BASED systems, SPOT also scored better than the base-line systems NOAGG and RANDOM. This is also somewhat to be expected, since the baseline systems were intended to be the simplest systems constructable. However it would have been a possible outcome for SPOT to not be different than either system, e.g. if the sp-trees produced by RANDOM were all equally good, or if the aggregation rules that SPOT learned produced output less readable than NOAGG. Figure 8 shows that the distributions of scores for SPOT vs. the baseline systems are very different, with SPOT skewed towards higher scores.</Paragraph>
    <Paragraph position="7"> Interestingly NOAGG also scored better than RANDOM (p a0 .01), and the standard deviation of its scores was smaller (see Figure 7). Remember that RANDOM's sp-trees often resulted in arbitrarily ordering the speech acts in the output. While NOAGG produced redundant utterances, it placed the initiative taking speech act at the end of the utterance in its most natural position, possibly resulting in a preference for NOAGG over RAN-DOM. Another reason to prefer NOAGG could be its predictability.</Paragraph>
  </Section>
class="xml-element"></Paper>