<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1026"> <Title>Understanding Unsegmented User Utterances in Real-Time Spoken Dialogue Systems</Title> <Section position="4" start_page="200" end_page="200" type="metho"> <SectionTitle> 15 people </SectionTitle> <Paragraph position="0"> As far as Japanese is concerned, several studies have pointed out that speech intervals in dialogues are not always well-formed substrings (Seligman et al., 1997; Takezawa and Morimoto, 1997).</Paragraph> <Paragraph position="1"> On the other hand, since parsing results cannot be obtained unless the end of the utterance is identified, making real-time responses is impossible without boundary information. For example, consider the utterance &quot;I'd like to book Meeting Room 1 on Wednesday&quot;. It is expected that the system should infer the user wants to reserve the room on 'Wednesday this week' if this utterance was made on Monday. In real conversations, however, there is no guarantee that 'Wednesday' is the final word of the utterance. It might be followed by the phrase 'next week', in which case the system made a mistake in inferring the user's intention and must backtrack and re-understand. Thus, it is not possible to determine the interpretation unless the utterance boundary is identified. This problem is more serious in head-final languages such as Japanese because function words that represent negation come after content words. Since there is no explicit clue indicating an utterance boundary in unrestricted user utterances, the system cannot make an interpretation and thus cannot respond appropriately. Waiting for a long pause enables an interpretation, but prevents response in real time. We therefore need a way to reconcile real-time understanding and analysis without boundary clues.</Paragraph> </Section> <Section position="5" start_page="200" end_page="201" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"> Several techniques have been proposed to segment user utterances prior to parsing. They use intonation (Wang and Hirschberg, 1992; Traum and Heeman, 1997; Heeman and Allen, 1997) and probabilistic language models (Stolcke et al., 1998; Ramaswamy and Kleindienst, 1998; Cettolo and Falavigna, 1998). Since these methods are not perfect, the resulting segments do not always correspond to utterances and might not be parsable because of speech recognition errors. In addition, since the algorithms of the probabilistic methods are not designed to work in an incremental way, they cannot be used in real-time analysis in a straightforward way.</Paragraph> <Paragraph position="1"> Some methods use keyword detection (Rose, 1995; Hatazaki et al., 1994; Seto et al., 1994) and key-phrase detection (Aust et al., 1995; Kawahara et al., 1996) to understand speech mainly because the speech recognition score is not high enough.</Paragraph> <Paragraph position="2"> The lack of the full use of syntax in these approaches, however, means user utterances might be misunderstood even if the speech recognition gave the correct answer. Zechner and Waibel (1998) and Worm (1998) proposed understanding utterances by combining partial parses. Their methods, however, cannot syntactically analyze phrases across pauses since they use speech intervals as input units. Although Lavie et al. 
(1997) proposed a segmentation method that combines segmentation prior to parsing and segmentation during parsing, it suffers from the same problem.</Paragraph> <Paragraph position="3"> In the parser proposed by Core and Schubert (1997), utterances interrupted by the other dialogue participant are analyzed based on meta-rules. It is unclear, however, how this parser can be incorporated into a real-time dialogue system; it seems that it cannot output analysis results without boundary clues.</Paragraph> </Section> <Section position="6" start_page="201" end_page="203" type="metho"> <SectionTitle> 4 Incremental Significant-Utterance-Sequence Search Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> The above problem can be solved by incremental understanding, which means obtaining the most plausible interpretation of user utterances every time a word hypothesis is inputted from the speech recognizer. For incremental understanding, we propose incremental significant-utterance-sequence search (ISSS), which is an integrated parsing and discourse processing method. ISSS holds multiple possible belief states and updates those belief states when a word hypothesis is inputted. The response generation module produces responses based on the most likely belief state. The timing of responses is determined according to the content of the belief states and acoustic clues such as pauses.</Paragraph> <Paragraph position="1"> In this paper, to simplify the discussion, we assume the speech recognizer incrementally outputs elements of the recognized word sequence. Needless to say, this is impossible because the most likely word sequence cannot be found in the midst of the recognition; only networks of word hypotheses can be outputted. Our method for incremental processing, however, can be easily generalized to deal with incremental network input, and our experimental system utilizes the generalized method.</Paragraph> </Section> <Section position="2" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 4.2 Significant-Utterance Sequence </SectionTitle> <Paragraph position="0"> A significant utterance (SU) in the user's speech is a phrase that plays a crucial role in performing the task in the dialogue. An SU may be a full sentence or a subsentential phrase such as a noun phrase or a verb phrase. Each SU has a speech act that can be considered a command to update the belief state. SU is defined as a syntactic category by the grammar for linguistic processing, which includes semantic inference rules.</Paragraph> <Paragraph position="1"> Any phrase that can change the belief state should be defined as an SU. Two kinds of SUs can be considered: domain-related ones that express the user's intention about the task of the dialogue, and dialogue-related ones that express the user's attitude with respect to the progress of the dialogue, such as confirmation and denial. In a meeting room reservation system, examples of domain-related SUs are &quot;I need to book Room 2 on Wednesday&quot;, &quot;I need to book Room 2&quot;, and &quot;Room 2&quot;, and dialogue-related ones are &quot;yes&quot;, &quot;no&quot;, and &quot;Okay&quot;.</Paragraph>
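To make the notion of a speech act as a belief-state update concrete, here is a minimal Python sketch. It is not the paper's implementation: the dict-based frame, the slot names, and the act types are assumptions chosen to mirror the meeting room reservation examples above.

```python
# Hypothetical sketch: SU speech acts as commands that update a frame-style belief state.
# The slot names and act types below are illustrative assumptions, not the paper's inventory.

def apply_speech_act(belief_state, act):
    """Apply one SU's speech act to the belief state (a dict of task slots)."""
    updated = dict(belief_state)
    if act["type"] == "refer":                 # domain-related SU, e.g. "I need to book Room 2"
        updated.update(act["slots"])           # fill or overwrite the mentioned slots
    elif act["type"] == "deny":                # dialogue-related SU, e.g. "no"
        for slot in act.get("slots", list(updated)):
            updated.pop(slot, None)            # retract the slots under discussion
    elif act["type"] == "affirm":              # dialogue-related SU, e.g. "yes", "Okay"
        updated["confirmed"] = True
    return updated

# Two domain-related SUs applied in sequence:
state = {}
state = apply_speech_act(state, {"type": "refer", "slots": {"room": "Room 2"}})
state = apply_speech_act(state, {"type": "refer", "slots": {"day": "Wednesday"}})
print(state)  # {'room': 'Room 2', 'day': 'Wednesday'}
```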
<Paragraph position="2"> User utterances are understood by finding a sequence of SUs and updating the belief state based on the sequence. The utterances in the sequence do not overlap. In addition, they do not have to be adjacent to each other, which leads to robustness against speech recognition errors as in fragment-based understanding (Zechner and Waibel, 1998; Worm, 1998).</Paragraph> <Paragraph position="3"> The belief state can be computed at any point in time if a significant-utterance sequence for user utterances up to that point in time is given. The belief state holds not only the user's intention but also the history of system utterances, so that all discourse information is stored in it.</Paragraph> <Paragraph position="4"> Consider, for example, the following user speech in a meeting room reservation dialogue.</Paragraph> <Paragraph position="5"> I need to, uh, book Room 2, and it's on Wednesday.</Paragraph> <Paragraph position="6"> The most likely significant-utterance sequence consists of &quot;I need to, uh, book Room 2&quot; and &quot;it's on Wednesday&quot;. From the speech act representations of these utterances, the system can infer that the user wants to book Room 2 on Wednesday.</Paragraph> </Section> <Section position="3" start_page="201" end_page="202" type="sub_section"> <SectionTitle> 4.3 Finding Significant-Utterance Sequences </SectionTitle> <Paragraph position="0"> SUs are identified in the process of understanding.</Paragraph> <Paragraph position="1"> Unlike ordinary parsers, the understanding module does not try to determine whether the whole input forms an SU or not, but instead determines where SUs are. Although this can be considered a kind of partial parsing technique (McDonald, 1992; Lavie, 1996; Abney, 1996), the SUs obtained by ISSS are not always subsentential phrases; they are sometimes full sentences.</Paragraph> <Paragraph position="2"> For one discourse, multiple significant-utterance sequences can be considered. &quot;Wednesday next week&quot; above illustrates this well. Let us assume that the parser finds two SUs, &quot;Wednesday&quot; and &quot;Wednesday next week&quot;. Then three significant-utterance sequences are possible: one consisting of &quot;Wednesday&quot;, one consisting of &quot;Wednesday next week&quot;, and one consisting of no SUs. The second sequence is obviously the most likely at this point, but it is not possible to choose only one sequence and discard the others in the midst of a dialogue.</Paragraph> <Paragraph position="3"> We therefore adopt beam search. Priorities are assigned to the possible sequences, and those with low priorities are discarded during the search.</Paragraph> </Section> <Section position="4" start_page="202" end_page="202" type="sub_section"> <SectionTitle> 4.4 ISSS Algorithm </SectionTitle> <Paragraph position="0"> The ISSS algorithm is based on shift-reduce parsing.</Paragraph> <Paragraph position="1"> The basic data structure is the context, which represents search information and is a triplet of the following data.</Paragraph> <Paragraph position="2"> stack: A push-down stack used in a shift-reduce parser.</Paragraph> <Paragraph position="3"> belief state: A set of the system's beliefs about the user's intention with respect to the task of the dialogue and the dialogue history.</Paragraph> <Paragraph position="4"> priority: A number assigned to the context.</Paragraph>
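As a concrete picture of this triplet, here is a minimal sketch; the field types and the dict-based frame are assumptions, since the paper does not specify a concrete encoding.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Context:
    """One hypothesis kept during the beam search: stack, belief state, and priority."""
    stack: List[Any] = field(default_factory=list)               # shift-reduce stack of feature structures
    belief_state: Dict[str, Any] = field(default_factory=dict)   # user's intention and dialogue history (frame-style)
    priority: int = 0                                            # score used to rank and prune contexts

# Step (I) of the algorithm below starts from a single context with an empty
# stack, an empty belief state, and priority zero.
initial_context = Context()
```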
<Paragraph position="5"> Accordingly, the algorithm is as follows. (I) Create a context in which the stack and the belief state are empty and the priority is zero. (II) For each input word, perform the following process.</Paragraph> <Paragraph position="6"> 1. Obtain the lexical feature structure for the word and push it to the stacks of all existing contexts.</Paragraph> <Paragraph position="7"> 2. For each context, apply rules as in a shift-reduce parser. When a shift-reduce conflict or a reduce-reduce conflict occurs, the context is duplicated and each operation is performed on its own copy. When a reduce operation is performed, increase the priority of the context by the priority assigned to the rule used for the reduce operation.</Paragraph> <Paragraph position="8"> 3. For each context, if the top of the stack is an SU, empty the stack and update the belief state according to the content of the SU. Increase the priority by the square of the length (i.e., the number of words) of this SU.</Paragraph> <Paragraph position="9"> 4. Discard contexts with low priority so that the number of remaining contexts will be the beam width or less.</Paragraph> <Paragraph position="10"> Since this algorithm is based on beam search, it works in real time if Step (II) is completed quickly enough, which is the case in our experimental system. The priorities for contexts are determined using a general heuristic based on the length of SUs and the kind of rules used. Contexts with longer SUs are preferred. The reason we use the square of the length of an SU rather than the length itself is that the system should avoid regarding one SU as consisting of several short SUs. Although this heuristic seems rather simple, we have found that it works well in our experimental systems.</Paragraph> <Paragraph position="11"> Some additional techniques, such as discarding redundant contexts and multiplying the priority of each context by a weight w (w > 1) after Step 4, are also effective, but details are not discussed here for lack of space.</Paragraph> </Section> <Section position="5" start_page="202" end_page="202" type="sub_section"> <SectionTitle> 4.5 Response Generation </SectionTitle> <Paragraph position="0"> The contexts created by the utterance understanding module can also be accessed by the response generation module so that it can produce responses based on the belief state in the context with the highest priority at a given point in time. We do not discuss the timing of the responses here, but, generally speaking, a reasonable strategy is to respond when the user pauses. In Japanese dialogue systems, producing a backchannel is effective when the user's intention is not clear at that point in time, but determining the content of responses in a real-time spoken dialogue system is also beyond the scope of this paper.</Paragraph> </Section> <Section position="6" start_page="202" end_page="203" type="sub_section"> <SectionTitle> 4.6 A Simple Example </SectionTitle> <Paragraph position="0"> Here we explain ISSS using a simple example.</Paragraph> <Paragraph position="1"> Consider again &quot;Wednesday next week&quot;. To simplify the explanation, we assume the noun phrase 'next week' is one word. The speech recognizer incrementally sends to the understanding module the word hypotheses 'Wednesday' and 'next week'.</Paragraph> <Paragraph position="2"> The rules used in this example are shown in Figure 1. They are unification-based rules. Not all features and semantic constraints are shown. In this example, nouns and noun phrases are not distinguished.</Paragraph> <Paragraph position="3"> [Figure 1: Rules used in the example. (I) SU[day: ?x] → NP[sort: day, sem: ?x] (priority: 1); (II) NP[sort: day] → NP[sort: day] NP[sort: week]]</Paragraph>
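To make the four steps above concrete, the following runnable Python sketch replays them on exactly this two-word input. It is a deliberate simplification, not the paper's parser: plain dicts stand in for unification-based feature structures (and for the Context record sketched earlier), the lexicon and the 'day' slot are invented, and Rule (II) is given priority 0 because the paper states a priority only for Rule (I). Its output shows the best interpretation switching from 'Wednesday this week' to 'Wednesday next week' once the second word arrives, which is the behaviour traced in Figure 2 below.

```python
# Simplified, runnable sketch of the ISSS update (Steps 1-4 above) on the
# "Wednesday" / "next week" input. Assumptions not taken from the paper:
# flat dicts instead of unification-based feature structures, an invented
# lexicon and 'day' slot, and priority 0 for Rule (II).

from copy import deepcopy

BEAM_WIDTH = 5

LEXICON = {
    "Wednesday": {"cat": "NP", "sort": "day", "sem": "Wednesday", "len": 1},
    "next week": {"cat": "NP", "sort": "week", "sem": "next week", "len": 1},
}

def try_reduces(stack):
    """Return (reduced_stack, rule_priority) for every reduce step applicable to the stack top."""
    results = []
    # Rule (II): NP[sort: day] -> NP[sort: day] NP[sort: week]
    if (len(stack) >= 2 and stack[-2]["cat"] == "NP" and stack[-1]["cat"] == "NP"
            and stack[-2]["sort"] == "day" and stack[-1]["sort"] == "week"):
        merged = {"cat": "NP", "sort": "day",
                  "sem": stack[-2]["sem"] + " " + stack[-1]["sem"],
                  "len": stack[-2]["len"] + stack[-1]["len"]}
        results.append((stack[:-2] + [merged], 0))       # assumed rule priority 0
    # Rule (I): SU[day: ?x] -> NP[sort: day, sem: ?x]    (priority: 1)
    if stack and stack[-1]["cat"] == "NP" and stack[-1]["sort"] == "day":
        su = dict(stack[-1], cat="SU")
        if "week" not in su["sem"]:                      # the "this week" default described for Rule (I)
            su["sem"] += " this week"
        results.append((stack[:-1] + [su], 1))
    return results

def isss_step(contexts, word):
    new_contexts = []
    for ctx in contexts:
        ctx = deepcopy(ctx)
        ctx["stack"].append(dict(LEXICON[word]))         # 1. shift the lexical item onto every stack
        frontier = [ctx]
        while frontier:                                  # 2. apply rules, duplicating contexts on conflicts
            c = frontier.pop()
            new_contexts.append(c)
            for reduced, rule_priority in try_reduces(c["stack"]):
                frontier.append({"stack": reduced,
                                 "belief": dict(c["belief"]),
                                 "priority": c["priority"] + rule_priority})
    for c in new_contexts:                               # 3. an SU on top empties the stack, updates the belief
        if c["stack"] and c["stack"][-1]["cat"] == "SU":
            su = c["stack"].pop()
            c["belief"]["day"] = su["sem"]
            c["priority"] += su["len"] ** 2              # plus the square of the SU length in words
    new_contexts.sort(key=lambda c: c["priority"], reverse=True)
    return new_contexts[:BEAM_WIDTH]                     # 4. keep at most BEAM_WIDTH contexts

contexts = [{"stack": [], "belief": {}, "priority": 0}]
for word in ["Wednesday", "next week"]:
    contexts = isss_step(contexts, word)
    print(word, "->", contexts[0]["belief"])
# Wednesday -> {'day': 'Wednesday this week'}
# next week -> {'day': 'Wednesday next week'}
```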
<Paragraph position="4"> The ISSS execution is shown in Figure 2.</Paragraph> <Paragraph position="5"> When 'Wednesday' is inputted, its lexical feature structure is created and pushed to the stack. Since Rule (I) can be applied to this stack, (2b) in Figure 2 is created. The top of the stack in (2b) is an SU, thus (2c) is created, whose belief state contains the user's intention of reserving a meeting room on Wednesday this week. We assume that 'Wednesday' means Wednesday this week by default if the utterance is made on a Monday; this is described in the additional conditions of Rule (I). After 'next week' is inputted, an NP is pushed to the stacks of all contexts, resulting in (3a) and (3b). Then Rule (II) is applied to (3a), making (4b). Rule (I) can be applied to (4b), and then (4c) is created and is turned into (4d), which has the highest priority.</Paragraph> <Paragraph position="6"> Before 'next week' is inputted, the interpretation that the user wants to book a room on Wednesday this week has the highest priority; after it is inputted, the interpretation that the user wants to book a room on Wednesday next week has the highest priority. Thus, by this method, the most plausible interpretation can be obtained in an incremental way.</Paragraph> </Section> </Section> <Section position="7" start_page="203" end_page="205" type="metho"> <SectionTitle> 5 Implementation </SectionTitle> <Paragraph position="0"> Using ISSS, we have developed several experimental Japanese spoken dialogue systems, including a meeting room reservation system.</Paragraph> <Paragraph position="1"> The architecture of the systems is shown in Figure 3. The speech recognizer uses HMM-based continuous speech recognition directed by a regular grammar (Noda et al., 1998). This grammar is weak enough to capture spontaneously spoken utterances, which sometimes include fillers and self-repairs, and allows each speech interval to be an arbitrary number of arbitrary bunsetsu phrases.1 The grammar contains less than one hundred words for each task; we reduced the vocabulary size so that the speech recognizer could output results in real time. The speech recognizer incrementally outputs word hypotheses as soon as they are found in the best-scored path in the forward search (Hirasawa et al., 1998; Görz et al., 1996). Since each word hypothesis is accompanied by a pointer to its preceding word, the understanding module can reconstruct word sequences. The newest word hypothesis determines the word sequence that is acoustically most likely at a point in time.2 The utterance understanding module works based on ISSS and uses a domain-dependent unification grammar with a context-free backbone that is based on bunsetsu phrases. This grammar is more restrictive than the grammar for speech recognition, but covers phenomena peculiar to spoken language such as particle omission and self-repairs. A belief state is represented by a frame (Bobrow et al., 1977); thus, a speech act representation is a command for changing the slot value of a frame.</Paragraph> <Paragraph position="2"> Although a more sophisticated model would be required for the system to engage in a complicated dialogue, frame representations are sufficient for our tasks. The response generation module is invoked when the user pauses, and plans responses based on the belief state of the context with the highest priority. The response strategy is similar to that of previous frame-based dialogue systems (Bobrow et al., 1977). The speech production module outputs speech according to orders from the response generation module.</Paragraph>
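As an illustration of the frame-and-pause strategy just described, here is a small sketch of response planning; the slot names, question templates, and confirmation wording are assumptions for the meeting room task, not the experimental system's actual strategy.

```python
# Illustrative sketch of response planning at a pause: take the belief state of the
# highest-priority context and either ask for a missing slot or confirm the request.
# Slot names and templates are assumptions, not the system's actual strategy.

QUESTIONS = {
    "room": "Which meeting room would you like?",
    "day": "On which day?",
    "time": "From what time to what time?",
}

def plan_response(contexts):
    best = max(contexts, key=lambda c: c["priority"])   # context with the highest priority
    frame = best["belief"]
    for slot, question in QUESTIONS.items():
        if slot not in frame:
            return question                             # ask about the first unfilled slot
    return "Shall I book {room} on {day}, {time}?".format(**frame)

# Invoked when the speech recognizer reports a pause:
contexts = [
    {"belief": {"room": "Room 2", "day": "Wednesday next week"}, "priority": 5},
    {"belief": {"day": "Wednesday this week"}, "priority": 2},
]
print(plan_response(contexts))   # -> "From what time to what time?"
```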
<Paragraph position="3"> Figure 4 shows the transcription of an example dialogue with a reservation system that was recorded in the experiment explained below.</Paragraph> <Paragraph position="4"> Figure 4 (excerpt):
S1: donoyōna goyōken de shōka (May I help you?) 5.69-7.19
U2: kaigishitsu no yoyaku o onegaishimasu (I'd like to book a meeting room.) [hai sōdesu gogoyoji made (That's right, to 4 p.m.)] 7.79-9.66
S3: hai (uh-huh) 10.06-10.32
U4: e konshū no suiyōbi (Well, Wednesday this week) [iie konshū no suiyōbi (No, Wednesday this week)] 11.75-13.40
Recognition results are enclosed in square brackets. The figures in the rightmost column are the start and end times (in seconds) of the utterances.</Paragraph> <Paragraph position="5"> As an example of an SU across pauses, &quot;gozen-jūji kara gozen-jūichiji made (from 10 a.m. to 11 a.m.)&quot; in U5 and U7 was recognized. Although the SU &quot;jūniji yoyaku shitekudasai (12 o'clock, please book it)&quot; in U13 and U15 was syntactically recognized, the system could not interpret it well enough to change the frame because of grammar limitations. The reason the user hesitated to utter U15 is that S14 was not what the user had expected.</Paragraph> <Paragraph position="6"> 1 A bunsetsu phrase is a phrase that consists of one content word and a number (possibly zero) of function words. 2 A method for utilizing word sequences other than the most likely one and for integrating acoustic scores and ISSS priorities remains as future work.</Paragraph> <Paragraph position="7"> We conducted a preliminary experiment to investigate how ISSS improves the performance of spoken dialogue systems. Two systems were compared: one that uses ISSS (system A) and one that requires each speech interval to be an SU (an interval-based system, system B). In system B, when a speech interval was not an SU, the frame was not changed. The dialogue task was a meeting room reservation. Both systems used the same speech recognizer and the same grammar. There were ten subjects, and each carried out a task on the two systems, resulting in twenty dialogues. The subjects were using the systems for the first time. They carried out one practice task with system B beforehand. This experiment was conducted in a computer terminal room where the machine noise was somewhat adverse to speech recognition. A meaningful discussion of the success rate of utterance segmentation is not possible because of the recognition errors due to the small coverage of the recognition grammar.3 All subjects successfully completed the task with system A in an average of 42.5 seconds, and six subjects did so with system B in an average of 55.0 seconds. Four subjects could not complete the task in 90 seconds with system B. Five subjects completed the task with system A 1.4 to 2.2 times quicker than with system B, and one subject completed it with system B one second quicker than with system A. A statistical hypothesis test showed that the times taken to carry out the task with system A are significantly shorter than those with system B (Z = 3.77, p < .0001).4 The order in which the subjects used the systems had no significant effect. In addition, user impressions of system A were generally better than those of system B. Although there were some utterances that the system misunderstood because of grammar limitations, excluding the data for the three subjects who had made those utterances did not change the statistical results. The reason it took longer to carry out the tasks with system B is that, compared to system A, the probability that it understood user utterances was much lower. This is because the recognition results of speech intervals do not always form one SU.</Paragraph> <Paragraph position="8"> 3 About 50% of user speech intervals were not covered by the recognition grammar due to its small vocabulary size. For the remaining 50% of the intervals, the word error rate of recognition was about 20%. The word error rate is defined as 100 * (substitutions + deletions + insertions) / (correct + substitutions + deletions) (Zechner and Waibel, 1998). 4 In this test, we used a kind of censored mean, computed by taking the mean of the logarithms of the ratios of the times only for the subjects who completed the tasks with both systems. The population distribution was estimated by the bootstrap method (Cohen, 1995).</Paragraph>
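As a quick arithmetic check of the word error rate definition quoted in footnote 3, here is a tiny sketch; the counts are invented placeholders, not figures from the experiment.

```python
# Word error rate as defined in footnote 3 (Zechner and Waibel, 1998).
# The counts below are invented placeholders for illustration only.

def word_error_rate(correct, substitutions, deletions, insertions):
    return 100.0 * (substitutions + deletions + insertions) / (correct + substitutions + deletions)

# e.g. a 100-word reference with 85 correct words, 10 substitutions, 5 deletions, 5 insertions:
print(word_error_rate(correct=85, substitutions=10, deletions=5, insertions=5))  # 20.0
```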
<Paragraph position="9"> About 67% of all recognition results of user speech intervals were SUs or fillers.5 Needless to say, these results depend on the recognition grammar, the grammar for understanding, the response strategy, and other factors. These results suggest, however, that assuming each speech interval to be an utterance unit could reduce system performance and that ISSS is effective.</Paragraph> </Section> </Paper>