<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0317">
  <Title>Cue Phrase Selection in Instruction Dialogue Using Machine Learning</Title>
  <Section position="4" start_page="100" end_page="100" type="metho">
    <SectionTitle>
3 Annotation of dialogue corpus
</SectionTitle>
    <Paragraph position="0"> In this section, we mention the way of the annotation in our corpus. Then, the inter-coder agreement for the annotations is discussed.</Paragraph>
    <Section position="1" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
3.1 Class of cue phrases
</SectionTitle>
      <Paragraph position="0"> The domain of our dialogue corpus in Japanese is to instruct the initial setting of an answering machine.</Paragraph>
      <Paragraph position="1"> The corpus consists of nine dialogues with 5,855 utterances. There are, 1,117 cue phrases in 96 distinct</Paragraph>
      <Paragraph position="3"> D T: And, there is a time-punch button under the panel,{P: Yes. } push it.</Paragraph>
      <Paragraph position="4"> P: Yes.</Paragraph>
      <Paragraph position="5"> I&amp;quot;&amp;quot; T: An.._.d month and day are input as L integers.</Paragraph>
      <Paragraph position="6"> ds3.4.1 P: Yes.</Paragraph>
      <Paragraph position="7"> T: Input by dial button.</Paragraph>
      <Paragraph position="8"> P: Yes f-- T: First, It is January 27th. P: Yes.</Paragraph>
      <Paragraph position="10"> T: Yes, ~ today is Thursday, {P: Yes } the days of the week are numbered from one to seven ds3.4.3 starting with Sunday, {P: Yes } since today is Thursday, input number is 5.</Paragraph>
      <Paragraph position="11"> t__p: Yes, I've input it.</Paragraph>
      <Paragraph position="12"> --T: An._..d, it is two thirty now, {P: Yes } using the 24 hour time ds3.4.4 system, {P: yes } input 1, 4, 3, O.  m a__p: Yes. I've input it.</Paragraph>
      <Paragraph position="13"> B T: Finally, push the registration button again.</Paragraph>
      <Paragraph position="14"> -- P: Yes.</Paragraph>
      <Paragraph position="15">  than five times.</Paragraph>
      <Paragraph position="16"> As the result of classifying these 31 cue phrases based on the classification of Japanese connectives (Ichikawa, 1978; Moriyama, 1997) and cue phrase classification in Enghsh (Grosz and Sidner, 1986; Cohen, 1984; Knott and Dale, 1994; Moser and Moore, 1995b), 20 cue phrases, which occurred total of 848 times, were classified into three classes: changeover, such as soredeha, deha (&amp;quot;now&amp;quot;, &amp;quot;now then&amp;quot; in English), conjunctive, such as sore.de, de (&amp;quot;and&amp;quot;, &amp;quot;and then&amp;quot;), and ordinal, such as mazu, tsugini (&amp;quot;first&amp;quot;, &amp;quot;next&amp;quot;). Besides these simple cue phrases, there are composite cue phrases such as soredeha-tsugini (&amp;quot;now first&amp;quot;). Note that meaning and the usage of each of these Japanese cue phrases does not completely correspond to those of the English words and phrases in parentheses. For example. the meaning of the Japanese cue phrase soredeha is close to the English word now in its discourse sense. However, soredeha does not have a sentential sense though now does.</Paragraph>
      <Paragraph position="17"> The purpose of this study is to decide which of these three classes of simple cue phrases should be selected as the cue phrase at the beginning of a dis-ICue phrases which occur in the middle of the segment and in the segment other than action direction such as clarification segment are included.</Paragraph>
      <Paragraph position="18">  course segment. We do not deal with composite types of cue phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
3.2 Annotation of discourse structure
</SectionTitle>
      <Paragraph position="0"> As the basis for examining the relationship between cue phrase and dialogue structure, discourse segment boundary and the level of embedding of the segments were annotated in each dialogue. We define discourse segment (or simply segment) as chunks of utterances that have a coherent goal (Grosz and Sidner, 1986; Nakatani et al., 1995; Passonneau and Litman, 1997). The annotation of hierarchical relations among segments was based on (Nakatani et al., 1995).</Paragraph>
      <Paragraph position="1"> Figure 1 shows an example from the annotated dialogue corpus. This dialogue was translated from the original Japanese. This example provides instruction on setting the calendar and clock of the answering machine. The purpose of ds3.4 is to input numbers by dial buttons and each input action is directed in ds3.4.2, ds3.4.3, and ds3.4.4, for inputting the date, the day of the week, and the time, respectively. Subdialogues such as confirmation and pupil initiative clarification are treated as one segment as in ds3.4.2.1. The organization cue phrases are underlined in the sample dialogue. For example, the cue phrase for ds3.3 is &amp;quot;And&amp;quot;, and that for ds3.5 is &amp;quot;Finally&amp;quot; 2</Paragraph>
    </Section>
    <Section position="3" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
3.3 Annotation of discourse purpose and pre-exchange
</SectionTitle>
      <Paragraph position="0"> pre-exchange As the information about task structure and dialogue context, we annotated the discourse purpose of each segment and the dialogue exchange at the end of the immediately preceding segment.</Paragraph>
      <Paragraph position="1"> In annotating the discourse purpose, the coders selected the purpose of each segment from a topic list. The topic list consists of 127 topics. It has a hierarchical structure and represents the task structure of the domain of our corpus. When the discourse purpose cannot be selected from the topic list, the segment was annotated as &amp;quot;others&amp;quot;. In such segments, the information about task structure cannot be obtained.</Paragraph>
      <Paragraph position="2"> The pre-exchange is annotated as a kind of dialogue context and used as one of the learning features itself. The coders annotated the kind of pre-exchange by selecting one of nine categories of exchanges which are defined in section 4.1 in detail.</Paragraph>
    </Section>
    <Section position="4" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
3.4 Inter-coder agreement for the annotation
</SectionTitle>
      <Paragraph position="0"> As mentioned in the previous sections, we annotated our corpus with regard to the following characteristics: the class of cue phrases (ordinal, changeover, conjunctive), segment boundary, and hierarchical structure of the segment, the purpose of the segment, and the dialogue exchange at the end of the immediately preceding segment.</Paragraph>
      <Paragraph position="1"> The extent of inter-coder agreement between two coders in these annotation are calculated by using :When a cue phrase follows acknowledgement (Yes) or a stammer, these speech fragments that do not have ~ftropositional content axe ignored and the cue phrases er the fragments axe annotated as the beginning of the segment.</Paragraph>
      <Paragraph position="2"> Cohen's Kappa ~ (Bakeman and Gottman, 1986; Carletta, 1996). The inter-coder agreement (to) about the class of cue phrase is 0.68, about the purpose of the segment is 0.79, and about the type of pre-exchange is 0.67. The extent of agreement about the segment boundary and the hierarchical structure is calculated using modified Cohen's Kappa presented by (Flammia and Zue, 1995). This Cohen's Kappa is 0.66.</Paragraph>
      <Paragraph position="3"> Fleiss et al. (1981) characterizes kappas of .40 to .60 as fair, .60 to .75 as good, and over .75 as excellent. According to this categorization of levels of inter-coder agreement, the inter-coder agreement for cue phrase, pre-exchange, and discourse boundary and structure is good. The agreement on segment purpose is excellent. Thus, these results indicate that our corpus coding is adequately rehable and objective.</Paragraph>
      <Paragraph position="4"> When the two coders' analyses did not agree, the third coder judged this point; only those parts whose analysis is output by more than two coders was used as learning data.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="100" end_page="104" type="metho">
    <SectionTitle>
4 Learning experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="100" end_page="102" type="sub_section">
      <SectionTitle>
4.1 Learning features
</SectionTitle>
      <Paragraph position="0"> This section describes a learning experiment using C4.5 (Quinlan, 1993). First, we define 10 learning features concerned with three factors.</Paragraph>
      <Paragraph position="1"> (1)Discourse structure: Structural information about the preceding dialogue.</Paragraph>
      <Paragraph position="2"> Embedding The depth of embedding from the top level.</Paragraph>
      <Paragraph position="3"> Place The number of elder sister segments.</Paragraph>
      <Paragraph position="4"> Place2 The number of elder sister segments except pupil initiative segments.</Paragraph>
      <Paragraph position="5"> Recent elder sister's cue (Res-cue) The cue phrase that occurs at the beginning of the most recent elder sister segment. They axe classified into three kinds of simple cue phrases: ord (ordinal), ch (changeover), con (conjunctive) or a kind of composite cue phrase such as ch+ord (changeover + ordinal).</Paragraph>
      <Paragraph position="6"> Res-cue2 The cue phrase that occurs at the beginning of the most recent elder sister segment except pupil initiative segments.</Paragraph>
      <Paragraph position="7"> Discourse transition (D-trans) Types of change in attentional state accompanied by topic change 3 such as push and pop. Pop from the pupil initiative subdialogue is categorized as &amp;quot;ui-pop&amp;quot;.</Paragraph>
      <Paragraph position="8"> (2)Task structure: Information that estimates the complexity of succeeding dialogue.</Paragraph>
      <Paragraph position="9"> 3Clark (1997) presents a term &amp;quot;discourse topic&amp;quot; as concept equivalent to focus space in (Grosz and Sidner, 1986), and call their transition &amp;quot;discourse transition&amp;quot;. For example, &amp;quot;push&amp;quot; is defied as the transition to the sub topic, and &amp;quot;next&amp;quot; is defined as the transition to the same level proceeding topic.</Paragraph>
      <Paragraph position="10">  nil, ord, Oh, con, ch+ord, con/ord, con+ch, other nil, ord, C\[1, Cou, cn/ord~ con+ord, con+ch, other pop, push, next, m-pop, ~A integer mteger conf, req, inf, quest, ui-conf, ui-req, tti-inf, ui-quest, NA nil, oral, ch, con, ch+ord, con+ord, con+ch, other Task-hierarchy (T-hierarchy) The number  of goal-subgoal relations from the current goal to primitive actions. This estimates the depth of embedding in the succeeding dialogue.</Paragraph>
      <Paragraph position="11"> Subgoal The number of direct subgoals of the current goal. If zero, then it is a primitive action.</Paragraph>
      <Paragraph position="12"> (3)Dialogue context Information about the preceding segment.</Paragraph>
      <Paragraph position="13"> Pre-exchange Type of exchange that occurs at the end of the immediately preceding segment, or type of exchange immediately preceding the cue phrase. There are four categories, conf (confirmation-answer), req (request-answer), inf (information-reply). ques (question-answer). They are also distinguished by the initiator of the exchange; explainer initiative or pupil initiative. When the category of the exchange is not clear, it is classified as not applicable (NA). Therefore, there are nine values for this feature.</Paragraph>
      <Paragraph position="14"> Preceding segment's cue (Ps-cue) The cue phrase that occurs at the beginning of the immediately preceding segment.</Paragraph>
      <Paragraph position="15"> The values of these features are shown in Table 1. Among the above learning features, Embed- ding, Place, Place$, Res-cue, Res-cue~, Ps-cue, and D-trans are derived automatically from the information about segment boundary and the segment hierarchy annotated in the corpus (an example is shown in Figure 1). The depth of task hierarchy (T- hierarchy) and the number of direct subgoais (Sub- goal) are determined by finding the annotated segment purpose in the given task structure.</Paragraph>
    </Section>
    <Section position="2" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
4.2 Learning algorithm
</SectionTitle>
      <Paragraph position="0"> In this study, C4.5 (Quinlan, 1993) is used as learning program. This program takes two inputs, (1)the definition of classes that should be learned, and the names and the values of a set of features, and (2) the data which is a set of instances whose class and feature values are specified. As a result of machine learning, the program outputs a decision tree fe~ judgement.</Paragraph>
      <Paragraph position="1"> We use cross-validation for estimating the accuracy of the model because this method avoids the disadvantages common with small data sets whose number of cases is less than 1000. In this study, 10-fold cross-validation is applied, so that in each run 90% of the cases are used for training and the remaining 10% are used for testing. The C4.5 program also has an option that causes the values of discrete attribute to be grouped. We selected this, option because there are many values in some features and the decision tree becomes very complex if each value has one branch.</Paragraph>
    </Section>
    <Section position="3" start_page="102" end_page="104" type="sub_section">
      <SectionTitle>
4.3 Results and discussion
</SectionTitle>
      <Paragraph position="0"> Decision trees for distinguishing the usage of three kinds of cue phrases (changeover, ordinal, and conjunctive) were computed by the machine learning al: gorithm C4.5. As learning features, the 10 features mentioned in section 4.1 are used. From nine dialogues; 545 instances were derived as training data.</Paragraph>
      <Paragraph position="1"> In 545 instances, 300 were conjunctive, 168 were changeover, and 77 were ordinal. The most frequent category, conjunctive, accounts for 557o of all cases.</Paragraph>
      <Paragraph position="2"> Thus, the baseline error rate is 4570. This means that one would be wrong 45~0 of the time if this category was always chosen.</Paragraph>
      <Paragraph position="3"> First, the prediction power of each learning feature is examined. The results of learning experiments using single features are shown in Table 2. I.~ pruning the initial tree, C4.5 calculates actual and estimated error rates for the pruned tree. The error rate shown in this table is the mean of estimated error rates for the pruned trees under 10-fold crossvalidation. The 95% confidence intervals are shown after &amp;quot;'+-&amp;quot;. Those are calculated using Student's t distribution. The error rate el is significantly better than e2 if the upper bound of the 95% confidence interval for e~ is lower than the lower bound of the 95% confidence interval for e2. As shown in Table 2, the decision tree obtained with the Pre-exchange fen- null ture performs best, and its error rate is 41.5%. In all experiments, the error rates are more than 40% and none are considerably better than the baseline.</Paragraph>
      <Paragraph position="4"> These results suggest that using only a single learning feature is not sufficient for selecting cue phrases correctly.</Paragraph>
      <Paragraph position="5"> As the single feature models are not sufficient, it is necessary to find the best set of learning features for selecting cue phrases. We call a set of features a model and the best model (the best set of features) is obtained using the following procedure. First, we set some multiple features models and carry out learning experiments using these models in order to find the best performing model and the best error rate.</Paragraph>
      <Paragraph position="6"> We then eliminate the features from the best performance model in order to make the model simpler.</Paragraph>
      <Paragraph position="7"> Thus, the best model we try to find is the one that uses the smallest number of learning features but whose performance equals the best error rate.</Paragraph>
      <Paragraph position="8"> We construct four multiple feature models. The name of the model and the combination of features in the model are shown in Table 3. The discourse structure model (the DS model) used learning features concerned with discourse structure. The Task model used those concerned with task structure, and the dialogue context (the DC model) used those concerned with dialogue context. The All .feature model uses all learning features. The best error rate among these models is 29.9% in All .feature model as shown in Table 2. The error rate is reduced about 15% from the baseline.</Paragraph>
      <Paragraph position="9"> Therefore, the best model is the one that uses fewer learning features than the All .feature model and that equals the performance of that model. In order to reduce the number of features considered, we examined which features have redundant information, and omitted these features from the All \]eature model. The overlapping features were found by examining the correlation between the features. As for numerical features that take number values, the correlation coefficient between Place and Place~, and between T-hierarchy and Subgoal are high (p=0.694, 0.784, respectively). As for categorical features, agreement between Res-cue and Res-cue2 is 95%.</Paragraph>
      <Paragraph position="10"> These highly correlated features can be represented by just one of them. As the result of many experiments varying the combination of features used, we determined the Simplest model which uses six features: Embedding, Place, D-trans, Subgoal, Preezchange, and Ps-cue as shown at the bottom line in Table 3. The error rate of the Simplest model is 30.6% as shown in Table 2. It is very close to that of the All \]eature model though the difference is statistically significant.</Paragraph>
      <Paragraph position="11"> In addition to comparing only the overall error rates, in order to compare the performance of these two models in more detail, we calculated the information retrieval metrics for each category, changeover, ordinal, and conjunctive. Figure 2 shows the equations used to calculate the metrics.</Paragraph>
      <Paragraph position="12"> For example, recall rate is the ratio of the cue phrases correctly predicted by the model as class X to the cue phrases of class X in the corpus. Precision rate is the ratio of cue phrases correctly predicted to be class X to all cue phrases predicted to be class X.</Paragraph>
      <Paragraph position="13"> In addition, in order to get an intuitive feel of over-all performance, we also calculated the sum of the deviation from ideal values in each metric as in (Passonneau and Litman, 1997). The summed deviation is calculated by the following numerical formula:</Paragraph>
      <Paragraph position="15"> Table 4 shows the results of these metrics for the two models. Standard deviation is shown in parentheses. The value of each metric is the average of  the metrics on the test set in each run of 10-fold cross-validation. Comparing the summed deviation. the performance of the Simplest model is better than that of the All feature model in all categories of cue phrases. The summed deviations of the Simplest model, 1.01 for ordinal, 1.27 for changeover, and 1.09 for conjunctive, are lower than those of the All feature model. Thus, as a result of evaluating the models in detail using the information retrieval metrics, it is concluded that the Simplest model is the best performing model. In addition, the Simplest model is the most elegant model because it uses fewer learning features than the All feature model. Just six features, Embedding, Place, D-trans, Subgoal, Preexchange, and Ps-cue, are enough for selecting organization cue phrases.</Paragraph>
      <Paragraph position="16"> Classifying the six features in the Simplest model, it is found that these features come from all factors, discourse structure, task structure, and dialogue context. Embedding, Place, D-trans are the features of discourse structure, Subgoal is about task structure, and Pre-exchange and Ps-cue are about dialogue context. This result indicates that all the factors are necessary to predict cue phrases. The important factors for cue phrase selection are task structure and dialogue context as well as discourse structure, the focus of many earlier studies.</Paragraph>
      <Paragraph position="17"> While we identified the six features from the three kinds of factors, by looking at the decision trees created in the learning experiment, we found which features were more important than others in selecting cue phr~es. The features appearing near the root node are more important. Figure 3 shows the top part of a decision tree obtained from the Simplest model. In all 10 decision trees resulting from the cross-validation experiment in the Simplest model, Place feature appears at the root node. In 7 of 10 trees, Embedding and Pre-exchange appeared just below the root node. In these trees, if the Place of the segment is the first at that level (i.e. there is no elder sister.), then Embedding appears at the next node, otherwise if the segment is not the first one at that level, then Pre-exchange appears atthe next node. Thus, if there are some elder sister segments, information about dialogue context is used for selecting cue phrases. On the other hand, if there is no elder sister segment, information about discourse structure is used for the judgement. These results suggest that the information about discourse structure, especially place of segments and the depth of embedding, and the dialogue context, especially the kind of immediately preceding dialogue exchange, play important roles in cue phrase selection.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>