<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1026">
  <Title>A Robust and Efficient Three-Layered Dialogue Component for a Speech-to-Speech Translation System*</Title>
  <Section position="4" start_page="0" end_page="188" type="metho">
    <SectionTitle>
2 Tasks of the Dialogue
Component
</SectionTitle>
    <Paragraph position="0"> The dialogue component within VERBMOBIL has four mQor tasks:  (1) to support speech recognition and linguis null tic analysis when processing the speech signal. Top-down predictions can be made to restrict the search space of other analysis components to get better results in shorter time (Young et al., 1989; Andry, 1992)). For instance, predictions about a speech act can be used to narrow down the set of words which are likely to occur in the following utterance - a fact exploited by the speech recognition component which uses adaptive language models (Jellinek, 1990). Top-down predictions are also used to limit the set of applicable grammar rules to a specific subgrammar. They are of particular importance since the system has to work under real-time constraints.</Paragraph>
    <Paragraph position="1"> (2) to provide contextual information for other VERBMOBIL components. In order to get good translations, context plays an important role. One example is the translation of the German &amp;quot;Geht es bei Ihnen?&amp;quot; which can be translated as &amp;quot;Does it suit you?&amp;quot; or &amp;quot;How about your place?&amp;quot;, depending on whether the dialogue partners discussed a time or a place before. A discourse history is constructed which can be accessed by other VP.RB-MOBIL components(Ripplinger and Caroli, 1994; LuperFoy and Rich, 1992).</Paragraph>
    <Paragraph position="2"> (3) to follow the dialogue when V~.RBMOBIL is off-line. When both dialogue participants speak  English (and no automatic translation is necessary) VERBMOBIL is &amp;quot;passive&amp;quot;, i.e. no syntactic or semantic analyses are performed. In such cases, the dialogue component tries to follow the dialogue by using a keyword spotter. This device scans the input for a small set of predetermined words which are characteristic for certain stages of the dialogue. The dialogue component computes the most probable speech act type of the next utterance in order to selects its typical key words. (4) to control clarification dialogues between VERBMOBIL and its users. If processing breaks down VERBMOBIL has to initiate a clarification dialogue in order to recover.</Paragraph>
  </Section>
  <Section position="5" start_page="188" end_page="188" type="metho">
    <SectionTitle>
3 The Architecture
</SectionTitle>
    <Paragraph position="0"> The abovementioned requirements cannot be met when using a single method of processing: if we use structural knowledge sources like plans or dialogue-grammars, top-down predictions are difficult make, because usually one can infer many possible follow-up speech acts from such knowledge sources that are not scored (Nagata and Morimoto, 1993). Also, a planning-only approach is inappropriate when the dialogue is processed only partially. Therefore we chose a hybrid 3-layered approach (see fig. 1) where the layers differ with respect to the type of knowledge they use and the task they are responsible for. The components are A Statistic Module The task of the statistic module is the prediction of the following speech act, using knowledge about speech act frequencies in our training corpus.</Paragraph>
    <Paragraph position="1"> A Finite State Machine (FSM) The finite state machine describes the sequence of speech acts that are admissible in a standard appointment scheduling dialogue and checks the ongoing dialogue whether it follows these expectations (see fig. 2).</Paragraph>
    <Paragraph position="2"> A Planner The hierarchical planner constructs a description of the dialogue's underlying dialogue and thematic structures, making extensive use of contextual knowledge. This module is sensitive to inconsistencies and therefore robustness and backup-strategies are the most important features of this component.</Paragraph>
    <Paragraph position="3"> While the statistical component completely relies on numerical information and is able to provide scored predictions in a fast and efficient way, the planner handles time-intensive tasks exploiting various knowledge sources, in particular linguistic information. The FSM can be located in between these two components: it works like an efficient parser for the detection of inconsistent dialogue states. The three modules interact in cases of repair, e.g. when the planner needs statistical information to resume an incongruent dialogue.</Paragraph>
    <Paragraph position="4"> On the input side the dialogue component is interfaced with the output from the semantic construction/evaluation module, which is a Drts-like feature-value structure (Bos et al., 1994) containing syntactic, semantic, and occasionally pragmatic information. The input also includes information from the generation component about the utterance produced in the target language and a word lattice from the keyword spotter.</Paragraph>
    <Paragraph position="5"> The output of the dialogue module is delivered to any module that needs information about the-dialogue pursued so far, as for example the transfer module and the semantic construction/evaluation module. Additionally, the key-word spotter is provided with words expected in the next utterance.</Paragraph>
  </Section>
  <Section position="6" start_page="188" end_page="188" type="metho">
    <SectionTitle>
4 Layered Dialogue Processing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="188" end_page="188" type="sub_section">
      <SectionTitle>
4.1 Knowledge-Based Layers
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
  <Section position="7" start_page="188" end_page="191" type="metho">
    <SectionTitle>
The Dialogue Model
</SectionTitle>
    <Paragraph position="0"> Like previous approaches for modeling task-oriented dialogues we base our ideas on the assumption that a dialogue can be described by means of a limited but open set of speech acts (e.g. (Bilange, 1991), (Mast, 1993)). As point of departure we take speech acts as proposed by (Austin, 1962) and (Searle, 1969) and also a number of so-called illocutionary acts as employed in a model of information-seeking dialogues (Sitter and Stein, 1992). We examined the VERBMOBIL corpus of appointment, scheduling dialogues for their occurrence and for the necessity to introduce new speech acts 1 .</Paragraph>
    <Paragraph position="1"> At present,, our model contains 17 speech acts (see (Maier, 1994) for more details on the characterization of the various speech acts; the dialogue model describing admissible sequences of z The acts we introduce below are mostly of illocutionary nature. Nevertheless we will refer to them as speech acts throughout this paper.</Paragraph>
    <Paragraph position="2">  speech acts is given in fig. 2). Among the domain-dependent speech acts there are low-level (primitive) speech acts like BEC~RUESSUNG for initiating and VERABSCHIEDUNG for concluding a dialogue.</Paragraph>
    <Paragraph position="3"> Among the domain-independent speech acts we use acts as e.g. AKZEPTANZ and ABLEHNUNG.</Paragraph>
    <Paragraph position="4"> Additionally, we introduced two speech acts necessary for modeling our appointment scheduling dialogues: INIT_TERMINABSPRACHE and BESTAE-TIGUNG. While the first is used to describe utterances which state date s or places to be negotiated, the latter corresponds to contributions that contain a mutual agreement concerning a given topic.  appointment scheduling dialogues The dialogue consists Of three phases (Maier, 1994). First, an introductory phase, where the discourse participants greet I each other, introduce themselves and provide information e.g. about their professional status. After * this, the topic of the conversation is introduced, usually the fact that one or more appointments have to be scheduled. Then negotiation *begins where the discourse participants repeatedly offer possible time frames, make counter offers, refine the time frames, reject offers and request other * possibilities. Once an item is accepted ~nd mutual agreement exists either the dialogue can be terminated, or another appointment is negotiated.</Paragraph>
    <Paragraph position="5"> A dialogue model based on speech acts seems to be an appropriate approach also from the point of view of machine translation and of transfer in particular: While in written discourse sentences can be considered the basic units of transfer, this assumption is not valid for spoken dialogues. In many cases only sentence fragments are uttered, which often are grammatically incomplete or even incorrect. Therefore different descriptive units have to be chosen. In the case Of VERBMOBIL these units are speech acts.</Paragraph>
    <Paragraph position="6"> The speech acts which in our approach are embedded in a sequential model of interaction can be additionally classified using the taxonomy of dialogue control functions as proposed in e.g. (Bunt, 1989). Speech acts like BEGRUESSUNG and VE-RABSCHIEDUNG, for example, can be classified as dialogue flmctions controlling interaction management. More fine-grained taxonomical distinctions like CONFIRM and CONFIRM/WEAK as proposed in (Bunt, 1994) are captured in our approach by pragmatic features like suitability and possibility specified in the DRS-description of an utterance, which serves as input for the dialogue component.</Paragraph>
    <Paragraph position="7">  The finite state machine provides an efficient and robust implementation of the dialogue model.</Paragraph>
    <Paragraph position="8"> It parses the speech acts encountered so far, tests their consistency with the dialogue model and saves the current state. When an inconsistency occurs fall back strategies (using for instance the statistical layer) are used to select the most probable state. The state machine is extended to allow for phenomena that might appear anywhere in a dialogue, e.g. human-human clarification dialogues and deliberation. It can also handle recursively embedded clarification dialogues.</Paragraph>
    <Paragraph position="9"> An important task of this layer is to signal to the planner when an inconsistency has occurred, i.e. when a speech act is not within the standard model so that it can activate repair techniques.</Paragraph>
    <Paragraph position="10">  To incorporate constraints in dialogue processing and to allow decisions to trigger follow-up actions a plan-based approach has been chosen.</Paragraph>
    <Paragraph position="11"> This approach is adopted from text generation where plan-operators are responsible for choosing linguistic means in order to create coherent stretches of text (see, for instance, (Moore and Paris, 1989) and (Hovy, 1988)). The application of plan operators depends on the validity of constraints. Planning proceeds in a top-down fashion, i.e. high-level goals are decomposed into subgoals, each of which has to be achieved individually in order to be fulfilled. Our top-level goal SCHEDULE-MEETING (see below) is decomposed into three subgoals each of which is responsible for the treatment of one dialogue segment: the in- null troductory phase (GREET-INTRODUCE-TOPIC), the negotiation phase (NEGOTIATE) and the closing phase (FINISH). These goals have to be fulfilled in the specified order. The keyword iterate specifies that negotiation phases can occur repeatedly.  In our hierarchy of plan operators the leaves, i.e. the most specific operators, correspond to the individual speech acts of the model as given in fig. 2. Their application is mainly controlled by pragmatic and contextual constraints. Among these constraints are, for example, features related to the discourse participants (acquaintance, level of expertise) and features related to the dialogue history (e.g. the occurrence of a certain speech act in the preceding context).</Paragraph>
    <Paragraph position="12"> Additionally, our plan operators contain an actions slot, where operations which are triggered after a successful fulfillment of the subgoals are specified. Actions, therefore, are employed to interact with other system components. In the subplan 0FFER-0PERATOR, for example, which is responsible for planning a speech act of the type VORSCHLAG, the action (retrieve-theme) filters the information relevant for the progress of the negotiation (e.g. information related to dates, like months, weeks, days) and updates the thematic structure of the dialogue history. During the planning process tree-like structures are built which mirror the structure of the dialogue.</Paragraph>
    <Paragraph position="13"> The dialogue memory consists of three layers of dialog structure: (1) an intentional structure representing dialogue phases and speech acts as occurring in the dialogue, (2) a thematic structure representing the dates being negotiated, and (3) a referential structure keeping track oflexical realizations. The planner also augments the input sign by pragmatic information, i.e. by information concerning its speech act.</Paragraph>
    <Paragraph position="14"> The plan-based and the other two layers statistics and finite state machine - interact in a number of ways&amp;quot; in cases where gaps occur in the dialogue statistical rating can help to determine the speech acts which are most likely to miss. Also, when the finite state machine detectSS an error, the planner must activate plan operators which are specialized for recovering the dialogue state in order not to fail. For this purpose specialized repair-operators have been implemented which determine both the type of error occurred and the most likely and plausible way to continue the dialogue. It is an intrinsic feature of the dialogue planner that it is able to process any input - even dialogues which do not the least coincide with our expectations of a valid dialogue - and that it proceeds properly if the parts processed by VERBMOBIL contain gaps.</Paragraph>
    <Section position="1" start_page="190" end_page="191" type="sub_section">
      <SectionTitle>
4.2 The Statistical Layer - Statistical
Modeling and Prediction
</SectionTitle>
      <Paragraph position="0"> Another level of processing is an implementation of an information-theoretic model. In speech recognition language models are commonly used to reduce the search space when determining a word that can match a given part of the indeg put. This approach is also used in the domain of discourse modeling to support the recognition process in speech-processing systems (Niedermair, 1992; Nagata and Morimoto, 1993). The units to be processed are not words, but the speech acts of a text or a dialogue. The basis oLprocessing is a training corpus annotated with the speech acts of the utterances. This corpus is used to gain statistical information about the dialogue structure, namely unigram, bigram and trigram frequencies of speech acts. They can be used for e.g. the prediction of following speech acts to support the speech processing components (e.g. dialogue dependent language models), for the disambiguation of diflhrent readings of a sentence, or for guiding the dialogue planner. Since the statistical model always delivers a result and since it can adapt itself to unknown structures, it is very robust. Also, if the statistic is updated during normal operation, it can adapt itself to the dialogue patterns of the VERBMOBIL user, leading to a higher prediction accuracy.</Paragraph>
      <Paragraph position="1"> Considering a dialogue to be a source that has speech acts as output, we can predict the nth speech act s,~ using the maximal conditional probability null s,, := max.. P(sls,,,1, s,,-2, s,_a, ...) We approximate P with the standard smoothing technique known as deleted interpolation (Jellinek, 1990), using unigram, bigram and tri-gram relative frequencies, where f are relative frequencies and qi are weights whose sum is 1:</Paragraph>
      <Paragraph position="3"> Given tl/is formula and the required N-grams we can determine the k best predictions for the next speech acts.</Paragraph>
      <Paragraph position="4"> In order to evaluate the statistical model, we made various experiments. In the table below the results for two experiments are shown. Experiment TS1 uses 52 hand-annotated dialogues with  2340 speech acts as training corpus, and 41 dialogues with 2472 speech acts as test data. TS2 uses another 81 dialogues with 2995 speech acts as test data.</Paragraph>
      <Paragraph position="5">  Compared to the data from (Nagata and Morimoto, 1993) who report prediction accuracies of 61.7 %, 77.5 % and 85.1% for one, two or three predictions respectively, our predictions are less reliable. The main reason is, that the dialogues in our corpus frequently do not follow conventional dialogue behavior, i.e. the dialogue structure differs remarkably from dialogue to dialogue.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="191" end_page="191" type="metho">
    <SectionTitle>
5 An Annotated Example
</SectionTitle>
    <Paragraph position="0"> To get an impression of the flmctionality of the dialogue module, we will show the processing of three sentences which are part of an example dialogue which has a total length of 25 turns. This dialogue is part of a corpus of 200 dialogues which are all fully processed by our dialogue component.</Paragraph>
    <Paragraph position="1"> Prior to sentence DEO04 given below ~.L initialized the dialogue requesting a date for a trip s.</Paragraph>
    <Paragraph position="2"> DEO04: #oh ja, gut, nach meinem Terminkalender &lt;Pause&gt;, wie w&amp;quot;ars im Oktober?# (VORSCHLAG) VMO05: just lookin at my diary, I would suggest October. (VORSCHLAG) DEO06/I: &lt;Pause&gt; I propose from Tuesday the fifth/-DEO06/2: &lt;Pause&gt; no, Tuesday the fourth to Saturday the eighth &lt;Pause&gt;, those five days? (VORSCHLAG) ELO07: oh, that's too bad, I'm not free right then. (ABLEHNUlVG) &lt;Pause&gt; I could fit it into my schedule &lt;Smack&gt; the week after, from Saturday to Thursday, the thirteenth. (VORSCHLAG) If we trace the processing with the finite state machine and the statistics component, allowing two predictions, we get the following results:  lation provided by VERBMOBIL and EL the English speaker. # indicates pressing or release of the button that activates VERBMOBIL.</Paragraph>
    <Paragraph position="3"> Prediction: (AKZEPTANZ ABLEHNUNG) While the finite state machine accepts the sequence of speech acts without failure the predictions made by the statistical module are not correct for DE006/2. The four best predictions and their scores are AKZEPTANZ (28.09~,), VORSCHLAG (26.93~,), ABLEHNUNG (21.67~,) and AUFFORDERUNG_STELLUNG (9.7~,). In comparison with the fourth prediction, the first three predictions have a very similar ranking, so that the failure can only be considered a near miss. The overall prediction rates for the whole dialogue are 56.52 %, 82,60%, and 95.65% for one, two, and three predictions, respectively.</Paragraph>
    <Paragraph position="4"> Since the dialogue can be processed properly by the finite state machine no repair is necessary. The only task of the planner therefore is the construction of the dialogue memory. It adds the incoming speech acts to the intentional structure, keeps track of the dates being negotiated, stores the various linguistic realizations of objects (e.g. lexical variations, referring expressions) and builds and administrates the links to the instantiated representation of these objects in the knowledge representation language BACK (Hoppe et al., 1993). In fig. 3 we give two snapshots showing how the dialogue memory looks like after processing the turns DE006/2 and EL007.</Paragraph>
  </Section>
class="xml-element"></Paper>