<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1018"> <Title>Some Notes on the Complexity of Dialogues *</Title> <Section position="3" start_page="161" end_page="161" type="metho"> <SectionTitle> 2 Architecture </SectionTitle> <Paragraph position="0"> The dialogue system to which we first applied VALDIA (Heisterkamp and McGlashan, 1996; Ehrlich et al., 1997) was designed for answering questions about and/or selling insurance in the domain of car insurance. In case of failure or problems with the dialogue, the system passes the customer to a human operator. The architecture of the system includes an HMM-based speaker-independent speech recognizer, an island parser, DM, generator and synthesizer, as depicted in figure 1. The system also includes a data base which is accessed for the retrieval of domain-specific information. It is important for this paper that the speech recognizer is not limited to &quot;allowed user contributions&quot; but outputs a word hypotheses lattice or the best chain, which is processed by an island parser. Thus, the input to the DM might, depending on recognition quality, consist of arbitrary sequences of semantic expressions. A basic requirement is that the DM is not allowed to fail on any of these inputs.</Paragraph> <Paragraph position="1"> For testing, we peel the interfacing components away from the DM and regard the DM as a black box. It is assumed that we send a piece of input to the DM, which then reacts in a way we can observe (for instance by returning/generating some output). We assume that the DM has no notion of time. This means that to test the DM, we simply have to feed it with input and wait for it to acknowledge this by sending output in response. In looking at the response, however, we have to be sensitive to effects like timeouts (e.g., the DM is &quot;thinking&quot; too long) and/or loops (e.g., the DM outputs the same item all the time). 
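This black-box test cycle - feed an input, wait for output, and watch for timeouts, loops, and crashes - can be sketched as follows. The sketch is illustrative only: the DM interface, the timeout value, and the loop-detection window are assumptions, not part of the system described here.

```python
from collections import deque

class BlackBoxTester:
    """Feed utterances to a DM under test; flag timeouts, loops, and crashes."""

    def __init__(self, dm, timeout=5.0, loop_window=3):
        self.dm = dm                    # assumed: object with send(utterance, timeout=...) -> response
        self.timeout = timeout          # seconds before we declare the DM "thinking" too long
        self.recent = deque(maxlen=loop_window)  # last responses, for loop detection

    def step(self, utterance):
        try:
            response = self.dm.send(utterance, timeout=self.timeout)
        except TimeoutError:
            return ("timeout", None)
        except Exception as exc:        # DM process failed or broke: restart and continue
            return ("crash", exc)
        self.recent.append(response)
        # Loop heuristic: the DM emitted the identical output loop_window times in a row.
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            return ("loop", response)
        return ("ok", response)
```

On a `crash` or `timeout` verdict, the driver would restart the DM and resume testing, exactly as the text describes.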
Although in (Levin and Pieraccini, 1997) the utterances triggering the actions are not mentioned at all, this is very important.</Paragraph> <Paragraph position="2"> In general we do not know which utterance will trigger a certain action when the DM is in a certain state, or whether the DM needs an utterance at all to perform another action. As the exhaustive validation criteria for the DM do not allow us to assume any insight into the DM itself, we simply have to feed it with all possible sequences of utterances.</Paragraph> <Paragraph position="3"> Our test architecture is shown in figure 2.</Paragraph> <Paragraph position="4"> We connect to the DM at the same place as the analysis. We also watch the output sent to the generator. Additionally, we watch the process status of the DM, that is, we notice if the DM fails or breaks. In that case we can restart the DM and continue the testing.</Paragraph> </Section> <Section position="4" start_page="161" end_page="162" type="metho"> <SectionTitle> 3 Complexity </SectionTitle> <Paragraph position="0"> This section puts forward some notes on the complexity of dialogue. We are aware that the discussion and the results are not necessarily</Paragraph> <Paragraph position="2"> generalizable because they depend on the representation of the input formalism to the DM. However, we were certainly surprised by the results ourselves, and they have consequences for the degree of coverage and testing one can achieve. For our dialogue system the semantic representation formalism is simple. 
It consists of propositional content represented as sequences of semantic objects in the SIL 1 representation language (McGlashan et al., 1994).</Paragraph> <Paragraph position="3"> Here is one example: &quot;Ein Audi 80 Avant value : quattro\], def : indef\], \[type: power, themeasuretype : ps, thevalue: \[type: number, cvalue: 125, modus: \[rel : above\] \], modus : \[rel : with\] \] \] This representation is motivated by the fact that the analysis component is an island</Paragraph> </Section> <Section position="5" start_page="162" end_page="163" type="metho"> <SectionTitle> 1 Semantic Interface Language </SectionTitle> <Paragraph position="0"> parser (Hanrieder, 1996), and can thus find islands or sequences of semantic objects.</Paragraph> <Section position="1" start_page="162" end_page="163" type="sub_section"> <SectionTitle> 3.1 The complexity of an utterance </SectionTitle> <Paragraph position="0"> The basic entity is a semantic object (S), which is an atomic item treated by the DM.</Paragraph> <Paragraph position="1"> The DM knows about (and thus can treat or react on) M different semantic objects. Examples of semantic objects are cmc_type, power, greeting, bye, integer, and year.</Paragraph> <Paragraph position="2"> We will not pay attention to the fact that a semantic item could be instantiated with, e.g., a street name - in the navigation domain there exist about 42,000 different names of cities in Germany, and Berlin has 11,500 different street names - but we could of course extend the discussion below (at the cost of complexity).</Paragraph> <Paragraph position="3"> We call a user contribution an utterance.</Paragraph> <Paragraph position="4"> We assume that an utterance U is a (possibly empty) sequence of semantic objects. This can of course be relaxed to sequences or trees in some algebra, but for this discussion it suffices to deal with sequences - as we will see, the complexity is &quot;complex enough&quot; with this assumption. 
An utterance can consist of at most O semantic objects. In the real system an utterance is a multi-set, but for this discussion we assume it is not: each semantic object can therefore appear at most once. Given the definitions above we can now compute the number of possible utterances | U |: the number of sequences of a certain length l is M!/(M-l)!, and we therefore have | U | = sum over l from 0 to O of M!/(M-l)!. For one of our dialogue models, concerned with car insurance, we have M = 15 and O = 9. That is, 15 different semantic objects, and we allow for a maximum of 9 semantic items (arbitrarily chosen by an estimate of breath length) in one utterance, which gives | U | of about 2.1 * 10^9.</Paragraph> <Paragraph position="5"> Now, if we would like to test whether our DM can treat all utterances or not, we will have to wait quite a while: Suppose our DM can process 10 utterances per second; then we can process 10 * 60 * 60 = 36,000 utterances per hour, 36,000 * 24 = 864,000 utterances per day, 7 * 864,000 = 6,048,000 per week, or 864,000 * 365 = 315,360,000 utterances per year. To process all possible utterances we would need more than six years! Obviously, the current parameters of the system make the complexity of the number of utterances intractable in realistic settings. Figure 3 shows how different parameter settings affect the cardinality of utterances for different values of M. The (logarithmic) y-axis represents the cardinality of utterances, and the (linear) x-axis the maximal number of semantic items in one utterance. 
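The counting argument and the resulting testing time can be checked with a few lines of code; the parameter values (M = 15 semantic objects, at most O = 9 per utterance, 10 utterances per second) are those assumed for the running example here.

```python
from math import factorial

def count_utterances(M, O):
    # Ordered sequences without repetition, of length 0..O, drawn from
    # M distinct semantic objects: |U| = sum over l of M!/(M-l)!.
    return sum(factorial(M) // factorial(M - l) for l in range(O + 1))

def count_dialogues(M, O, L):
    # A dialogue of length L is any sequence of L utterances: |D| = |U|**L.
    return count_utterances(M, O) ** L

U = count_utterances(15, 9)            # about 2.1e9 possible utterances
years = U / 10 / (60 * 60 * 24 * 365)  # at 10 utterances per second: > 6 years
```

As a toy check, count_utterances(3, 2) = 1 + 3 + 6 = 10: the empty utterance, three single objects, and six ordered pairs.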
As can be seen, for our DM we will have to limit, e.g., the number of semantic items to 6 per utterance if we want to test all utterances in one week.</Paragraph> </Section> <Section position="2" start_page="163" end_page="163" type="sub_section"> <SectionTitle> 3.2 The complexity of dialogue </SectionTitle> <Paragraph position="0"> A dialogue can - at least theoretically - consist of a sequence of the same utterance.</Paragraph> <Paragraph position="1"> Many of the dialogues will of course be non-cooperative and very unnatural or, put in other words, not legal. But, as indicated above, it is important to us that the DM does not fail on any input. To generate all possible dialogues | D | of a certain length L, we therefore have | D | = | U |^L.</Paragraph> <Paragraph position="3"> For our scenario 15 user contributions are not unnatural, so for L = 15 and the figures above, we have | D | of about 10^140, which will take quite a while to process. Even if we restrict the length of the dialogues to 2, we get</Paragraph> <Paragraph position="5"> | D | = | U |^2, or about 4.5 * 10^18, possible dialogues, and can thus process just a vanishingly small part of them.</Paragraph> </Section> <Section position="3" start_page="163" end_page="163" type="sub_section"> <SectionTitle> 3.3 Consequences </SectionTitle> <Paragraph position="0"> Now, suppose we randomly select some dialogues out of the set of possible ones. Testing the DM with them, we will encounter a certain number of (possibly zero) errors; it is then interesting to be able to say something about how error-free the DM is. For this discussion, it is important that by viewing the DM as a black box, we cannot do anything more than assume the errors to be distributed according to the normal distribution. Moreover, we can only apply this reasoning if we make a large number of observations. The figures below may - depending on the theoretical number of dialogues - not be valid. 
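The normal-approximation reasoning can be written out as follows; the 95% z-value and the "pretend we saw one error" fallback for zero observed errors are standard choices, filled in here as assumptions.

```python
from math import sqrt

def error_rate_upper_bound(errors, N, z=1.96):
    # Upper confidence bound on the DM's true error rate after observing
    # `errors` failing dialogues out of N tested, using the normal
    # approximation to the binomial distribution.  With zero observed
    # errors we pretend we saw one, to still obtain a usable bound.
    f = max(errors, 1) / N
    return f + z * sqrt(f * (1 - f) / N)

bound = error_rate_upper_bound(250, 10000)  # f = 0.025 -> bound just under 0.029
```

The bound only makes sense for a large number of observations, as the text notes: for small N the normal approximation to the binomial breaks down.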
By using the approximation of the normal distribution we know that if we tested N = 10,000 dialogues and received errors from the DM in, say, 250 of the dialogues (f = 250/10000 = 0.025), we can say with a certain degree of confidence that the DM's true error rate lies in a corresponding interval around f. If no errors are observed at all, we have to use a trick: instead we suppose we found one error, and thus</Paragraph> <Paragraph position="2"> obtain an upper bound on the fraction of cases that raise an error.</Paragraph> </Section> </Section> <Section position="6" start_page="163" end_page="167" type="metho"> <SectionTitle> 4 VALDIA - The Implementation </SectionTitle> <Paragraph position="0"> To allow for intelligent testing, we decided to implement our test tool using the following three parts: * the core test engine, * the interface to the DM (implemented in OZ/MOZART; the reasons for using OZ are manifold: it features threads, multiple platforms (UNIX/LINUX and Windows), unification, a Tcl/Tk library, and finally it comes for free - see http://www.mozart-oz.org), and * a graphical editor for the definition of stochastic automata (implemented in Tcl/Tk). The core test engine uses the definition of stochastic automata to create sequences of semantic expressions to be sent to the DM. It records both the input and the output to and from the DM and checks for special messages (e.g., end of dialogue), crashes, the DM emitting the same response all the time, or other events that indicate erroneous behaviour of the DM. It also creates test profiles and checkpoint files to enable interruption and restart of test runs.</Paragraph> <Paragraph position="1"> The interface handles the connection between VALDIA and the DM. It realizes a TCP/IP connection to and from the DM. 
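A minimal version of such a TCP exchange can be sketched as below. The newline-delimited framing and the single request/response turn are assumptions for illustration; the actual wire protocol between VALDIA and the DM is not specified here.

```python
import socket

def exchange(host, port, sil_expr, timeout=5.0):
    # Send one newline-terminated SIL expression to the DM over TCP and
    # return its newline-terminated reply.  A connection closed by the
    # peer is treated as a DM crash.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(sil_expr.encode("utf-8") + b"\n")
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = conn.recv(4096)
            if not chunk:
                raise ConnectionError("DM closed the connection")
            reply += chunk
        return reply.decode("utf-8").rstrip("\n")
```

The socket timeout doubles as the "thinking too long" detector: a silent DM raises `socket.timeout`, which the test driver records and then restarts the DM.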
In case parallel test runs are made, it can also handle several processes.</Paragraph> <Paragraph position="2"> The motivation for the stochastic automaton editor and, at the same time, the main feature of VALDIA (see Figure 4) is that it allows for the design of utterances or even dialogues or utterance sequences, and thus for testing specific areas in the space of theoretically possible dialogues. The dialogue system developer can interactively define the automata, using the pointing device to draw the states and the transitions. In each state, it is possible to change the constraints for the definition of a SIL expression. More precisely, we change the probability of the alternatives of (a part of) an expression. The arcs between the states are augmented with probabilities which guide state transitions in a stochastic manner, thus creating certain sequences by preference without completely excluding others. In Figure 5 the left column contains the basic semantic entities, the middle one the probability, and the right one the number of occurrences for that particular semantic item in each utterance. For the semantic items, the variable parts are linked to another window where their instantiations are described. The constraints are semi-automatically derived from the definition of the interface specification for the DM. The reason for &quot;semi-automatically&quot; and not automatically is that we have had no time to write a generic function for this.</Paragraph> <Paragraph position="3"> But basically the derivation is straightforward. Consequently, we can design interesting utterance sequences according to, e.g., experiences gained during WOZ experiments. Finally, by using just one state and no constraints, we can, of course, produce completely arbitrary utterance sequences.</Paragraph> <Paragraph position="4"> During the testing of the dialogue manager we can run the system in two modes. 
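A random walk over such a stochastic automaton can be sketched like this; the states, semantic items, and probabilities are invented for illustration and do not come from the system described above.

```python
import random

# Each state maps to weighted successor states and weighted utterances
# (tuples of semantic objects).  All names and weights are illustrative.
AUTOMATON = {
    "start":  {"next": [("insure", 0.7), ("ask", 0.3)],
               "utterances": [(("greeting",), 0.9), ((), 0.1)]},
    "insure": {"next": [("end", 0.5), ("ask", 0.5)],
               "utterances": [(("car_type", "power"), 0.6), (("power",), 0.4)]},
    "ask":    {"next": [("end", 1.0)],
               "utterances": [(("request_repetition",), 1.0)]},
    "end":    {"next": [],
               "utterances": [(("bye",), 1.0)]},
}

def weighted_choice(pairs, rng):
    items, weights = zip(*pairs)
    return rng.choices(items, weights=weights, k=1)[0]

def generate_dialogue(automaton, rng, start="start"):
    # Walk the automaton at random, emitting one utterance per state,
    # until a state without successors is reached.
    state, dialogue = start, []
    while True:
        dialogue.append(weighted_choice(automaton[state]["utterances"], rng))
        if not automaton[state]["next"]:
            return dialogue
        state = weighted_choice(automaton[state]["next"], rng)
```

Preferred sequences are generated more often without completely excluding the others, which is exactly the guided-but-not-restricted behaviour the editor is meant to provide.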
The first mode - exhaustive mode - generates all dialogues by enumerating them, based on the enumeration of all possible utterances in each state. The exhaustive mode can be used when we know that the complexity of the automaton (and utterances) is testable - VALDIA can compute the number of dialogues and compute an upper time limit based on the computational power of the DM. In the second mode - Monte Carlo mode - the utterance generation in each state, as well as the change of state, is random. In this way we randomly walk the automaton and randomly generate utterance profiles. This has proven useful in the cases where the number of possible dialogues is too large for exhaustive testing.</Paragraph> <Paragraph position="5"> Notice that we cannot pay any attention to legal moves. VALDIA has (i) no knowledge about what a legal move is, and (ii) no possibility to react on the response from the DM. Therefore &quot;legal moves&quot; and &quot;cooperativeness&quot; are nonexistent concepts here. But this is what we want: People behave weirdly! Our speech recognizer produces errors! And most important: we have to live with this, and must not fail on any input!</Paragraph> </Section> <Section position="7" start_page="167" end_page="167" type="metho"> <SectionTitle> 5 First Results </SectionTitle> <Paragraph position="0"> During the development of VALDIA we have detected several errors in the implementation of our DM. Most of the errors were logical errors of the kind &quot;Now that's a combination of things we didn't cover.&quot; For example, the co-occurrence of good_bye and request_repetition in a user utterance led to a goal conflict in the DM that caused it to hang, as did the non-exclusive handling of disjunction in &quot;It's older (or) younger than 5 years&quot;, etc.</Paragraph> <Paragraph position="1"> Additionally, we discovered that the DM in some of the test runs crashed after about 500 (!) dialogues due to erroneous memory handling. 
This is something one would never have detected during normal testing with the full system - but it would have shown up immediately after delivering the system.</Paragraph> <Paragraph position="2"> VALDIA produces huge amounts of (huge) trace files. Analyzing these is at present as painful as testing the complete dialogue system. Consequently, we will have to develop functionality for condensing the trace information.</Paragraph> </Section> </Paper>