<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1602"> <Title>Variant Transduction: A Method for Rapid Development of Interactive Spoken Interfaces</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Characteristics of our approach </SectionTitle> <Paragraph position="0"> The goal of the approach discussed in this paper (which we refer to as &quot;variant transduction&quot;) is to avoid the effort and specialized expertise used to build current research prototypes, while allowing more natural spoken input than is handled by spoken dialog systems built using current commercial practice.</Paragraph> <Paragraph position="1"> This led us to adopt the following constraints: Applications are constructed using a relatively small number of example inputs (no grammar development or extensive data collection).</Paragraph> <Paragraph position="2"> No intermediate semantic representations are needed. Instead, manipulations are performed on word strings and on action strings that are final (back-end) application calls.</Paragraph> <Paragraph position="3"> Confirmation queries posed by the system to the user are constructed automatically from the examples, without the use of a separate generation component.</Paragraph> <Paragraph position="4"> Dialog control should be simple to specify for simple applications, while allowing the flexibility of delegating this control to another module (e.g. an &quot;intelligent&quot; back-end agent) for more complex applications. We have constructed two telephone-based applications using this method, an application to access e-mail and a call-routing application. These two applications were chosen to gain experience with the method because they have different usage characteristics and back-end complexity. For the e-mail access system, usage is typically habitual, and the system's mapping of user utterances to back-end actions needs to take into account dynamic aspects of the current e-mail session. 
For the call-routing application, the back-end calls executed by the system are relatively simple, but users may only encounter the system once, and the system's initial prompt is not intended to constrain the first input spoken by the user.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Constructing an application with example-action contexts </SectionTitle> <Paragraph position="0"> An interactive spoken language application constructed with the variant transduction method consists of a set of contexts. Each context provides the mapping between user inputs and application actions that are meaningful in a particular stage of interaction between the user and system. For example, the e-mail reader application includes contexts for logging in and for navigating a mail folder.</Paragraph> <Paragraph position="1"> The actual contexts that are used at run-time are created through a four-step process: 1. The application developer specifies (a small number of) triples ⟨e, a, c⟩ where e is a natural language string (a typical user input), a is an application action (back-end application API call). For instance, the string read the message from John might be paired with the API call mailAgent.getWithSender(&quot;jsmith@att.com&quot;). The third element of a triple, c, is an expression identifying another (or the same) context, specifically, the context the system will transition to if e is the closest match to the user's input.</Paragraph> <Paragraph position="2"> 2. The set of triples for each context is expanded by the system into a larger set of triples. The additional triples are of the form ⟨v, a′, c⟩ where v is a &quot;variant&quot; of example e (as explained in section 4 below), and a′ is an &quot;adapted&quot; version of the action a.</Paragraph> <Paragraph position="3"> 3. 
During an actual user session, the set of triples for a context may optionally be expanded further to take into account the dynamic aspects of a particular session. For example, in the mail access application, the set of names available for recognition is increased to include those present as senders in the user's current mail folder.</Paragraph> <Paragraph position="4"> 4. A speech recognition language model is compiled from the expanded set of examples. We currently use a language model that accepts any sequence of sub-strings of the examples, optionally separated by filler words, as well as sequences of digits. (For a small number of examples, a statistical N-gram model is ineffective because of low N-gram counts.) A detailed account of the recognition language model techniques used in the system is beyond the scope of this paper.</Paragraph> <Paragraph position="5"> In the current implementation, actions are sequences of statements in the Java language. Constructors can be called to create new objects (e.g. a mail session object) which can be assigned to variables and referenced in other actions. The context interpreter loads the required classes and evaluates methods dynamically as needed. It is thus possible for an application developer to build a spoken interface to their target API without introducing any new Java classes. The system could easily be adapted to use action strings from other interpreted languages.</Paragraph> <Paragraph position="6"> A key property of the process described above is that the application developer needs to know only the back-end API and English (or some other natural language).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Variant compilation </SectionTitle> <Paragraph position="0"> Different expansion methods can be used in the second step to produce variants v of an example e. In the simplest case, v may be a paraphrase of e. 
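As an informal illustration of this kind of paraphrase expansion, the following Python sketch substitutes domain-independent carrier phrases to generate variants; the phrase table and function name here are our own illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of carrier-phrase paraphrase expansion (illustrative only).
# Each entry maps a carrier phrase to alternatives with the same meaning.
CARRIER_PHRASES = {
    "i want to": ["i'd like to", "i need to", "could i"],
    "show me": ["let me see", "display"],
}

def expand_triple(example, action, context):
    """Expand one (e, a, c) triple into paraphrase-variant triples (v, a, c).

    For pure paraphrases the action string is left unchanged."""
    triples = [(example, action, context)]
    for phrase, alternatives in CARRIER_PHRASES.items():
        if phrase in example:
            for alt in alternatives:
                variant = example.replace(phrase, alt)
                triples.append((variant, action, context))
    return triples

triples = expand_triple(
    "i want to read the message from john",
    'mailAgent.getWithSender("jsmith@att.com")',
    "mailContext",
)
```

For paraphrase variants, as the text notes, the action string carries over unchanged; only the more general adaptation step (described next in section 4) rewrites the action.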
Such paraphrase variants are used in the experiments in section 7, where domain-independent &quot;carrier&quot; phrases are used to create variants. For example, the phrase I'd like to (among others) is used as a possible alternative for the phrase I want to.</Paragraph> <Paragraph position="1"> The context compiler includes an English-to-English paraphrase generator, so the application developer is not involved in the expansion process, relieving her of the burden of handling this type of language variation. We are also experimenting with other forms of variation, including those arising from lexical-semantic relations, user-specific customization, and those variants uttered by users during field trials of a system.</Paragraph> <Paragraph position="2"> When v is a paraphrase of e, the adapted action a′ is the same string as a. In the more general case, the meaning of variant v is different from that of e, and the system attempts (not always correctly) to construct a′ so that it reflects this difference in meaning. For example, including the variant show the message from Bill Wilson of the example read the message from John involves modifying the action mailAgent.getWithSender(&quot;jsmith@att.com&quot;) to mailAgent.getWithSender(&quot;wwilson@att.com&quot;). We currently adopt a simple approach to the process of mapping language string variants to their corresponding target action string variants. The process requires the availability of a &quot;token mapping&quot; t between these two string domains, or data or heuristics from which such a mapping can be learned automatically. Examples of the token mapping are names to e-mail addresses as illustrated in the example above, name to identifier pairs in a database system, &quot;soundex&quot; phonetic string spellings in directory applications, and a bilingual dictionary in a translation application. The process proceeds as follows: 1. Compute a set of lexical mappings between the variant v and example e. 
This is currently performed by aligning the two strings in such a way that the alignment minimizes the (weighted) edit distance between them (Wagner and Fischer, 1974).</Paragraph> <Paragraph position="3"> 2. The token mapping t is used to map substitution pairs identified by the alignment (⟨read, show⟩ and ⟨John, Bill Wilson⟩ in the example above) to corresponding substitution pairs in the action string. In general this will result in a smaller set of substitution strings since not all word strings will be present in the domain of t. (In the example, this results in the single pair ⟨jsmith@att.com, wwilson@att.com⟩.) 3. The action substitution pairs are applied to a to produce a′.</Paragraph> <Paragraph position="4"> 4. The resulting action a′ is checked for (syntactic) well-formedness in the action string domain; the variant v is rejected if a′ is ill-formed.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Input interpretation </SectionTitle> <Paragraph position="0"> When an example-action context is active during an interaction with a user, two components (in addition to the speech recognition language model) are compiled from the context in order to map the user inputs into the appropriate (possibly adapted) action: Classifier A classifier is built with training pairs ⟨v, a⟩ where v is a variant of an example e for which the example-action pair ⟨e, a⟩ is a member of the unexpanded pairs in the context. Note that the classifier is not trained on pairs with adapted actions a′ since the set of adapted actions may be too large for accurate classification (with standard classification techniques). The classifiers typically use text features such as N-grams appearing in the training data. 
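As a hedged sketch of this component, the following Python stand-in trains per-action profiles of unigram and bigram features and classifies by feature overlap; it is a simplification for illustration, not the classifiers actually used in the system (those are named next).

```python
# Illustrative n-gram feature classifier trained on (variant, action) pairs.
# A simplified stand-in for the classifiers described in the paper.
from collections import Counter

def ngrams(text, n=2):
    """Bag of unigram and n-gram features for a word string."""
    words = text.split()
    feats = Counter(words)  # unigrams
    for i in range(len(words) - n + 1):
        feats[" ".join(words[i:i + n])] += 1  # n-grams (bigrams by default)
    return feats

class NGramClassifier:
    def __init__(self):
        self.profiles = {}  # action string -> aggregated feature counts

    def train(self, pairs):
        for variant, action in pairs:
            profile = self.profiles.setdefault(action, Counter())
            profile.update(ngrams(variant))

    def classify(self, utterance):
        """Return the action whose profile best overlaps the utterance."""
        feats = ngrams(utterance)
        def overlap(action):
            profile = self.profiles[action]
            return sum(min(feats[f], profile[f]) for f in feats)
        return max(self.profiles, key=overlap)

clf = NGramClassifier()
clf.train([
    ("read the message from john", 'mailAgent.getWithSender("jsmith@att.com")'),
    ("delete this message", "mailAgent.deleteCurrent()"),
])
predicted = clf.classify("please read the message")
```

Note that, as in the paper, training pairs use the unadapted action a rather than the adapted actions a′, keeping the number of target classes small.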
In our experiments, we have used different classifiers, including BoosTexter (Schapire and Singer, 2000), and a classifier based on Phi-correlation statistics for the text features (see Alshawi and Douglas (2000) for our earlier application of Phi statistics in learning machine translation models from examples). Other classifiers such as decision trees (Quinlan, 1993) or support vector machines (Vapnik, 1995) could be used instead.</Paragraph> <Paragraph position="1"> Matcher The matcher can compute a distortion mapping and associated distance between the output s of the speech recognizer and a variant v. Various matchers can be used such as those suggested in example-based approaches to machine translation (Sumita and Iida, 1995). So far we have used a weighted string edit distance matcher and experimented with different substitution weights, including ones based on measures of statistical similarity between words such as the one described by Pereira et al. (1993). The output of the matcher is a real number (the distance) and a distortion mapping represented as a sequence of edit operations (Wagner and Fischer, 1974).</Paragraph> <Paragraph position="2"> Using these two components, the user's utterance is mapped to an executable action: the triple ⟨v, a′, c⟩ for which v produces the smallest distance is selected and passed along with e to the dialog controller. The relationship between the input s, variant v, example e, and actions a and a′ is depicted in Figure 1.</Paragraph> <Paragraph position="4"> In the figure, f is the mapping between examples and actions in the unexpanded context; r is the relation between examples and variants; and g is the search mapping implemented by the classifier-matcher. 
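For concreteness, the word-level edit distance underlying such a matcher can be sketched as below; this uses uniform costs, whereas the paper's matcher uses learned substitution weights and also returns the edit-operation sequence, which this sketch omits.

```python
# Classic Wagner-Fischer edit distance over word sequences (uniform costs).
# A simplified sketch of the matcher's distance computation.
def edit_distance(source, target):
    s, t = source.split(), target.split()
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[rows - 1][cols - 1]
```

On the running example, the recognizer output "now show me messages from bill" aligns with the variant "show the message from bill" by one deletion and two substitutions, and the triple whose variant yields the smallest such distance is the one selected.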
The role of e is related to confirmations as explained in the following section.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Confirmation and dialog control </SectionTitle> <Paragraph position="0"> Dialog control is straightforward, as the reader might expect, except for two aspects described in this section: (i) evaluation of next-context expressions, and (ii) generation of confirmation requests based on the examples in the context and the user's input. (Figure 2 example: p (prompt): say a mailreader command; s (words spoken): now show me messages from Bill; v (variant): show the message from Bill Wilson; e (example): read the message from John; a (associated action): mailAgent.getWithSender(&quot;jsmith@att.com&quot;).)</Paragraph> <Paragraph position="1"> As noted in section 3, the third element c of each triple ⟨e, a, c⟩ in a context is an expression that evaluates to the name of the next context (dialog state) that the system will transition to if the triple is selected. For simple applications, c can simply always be an identifier for a context, i.e. the dialog state transition network is specified explicitly in advance in the triples by the application developer. For more complex applications, next context expressions c may be calls that evaluate to context identifiers. In our implementation, these calls can be Java methods executed on objects known to the action interpreter. They may thus be calls on the back-end application system, which is appropriate for cases when the back-end has state information relevant to what should happen next (e.g. if it is an &quot;intelligent agent&quot;). It might also be a call to a component that implements a dialog strategy learning method (e.g. Levin and Pieraccini (1997)), though we have not yet tried such methods in conjunction with the present system.</Paragraph> <Paragraph position="2"> Confirmation requests are generated using a modified version h′ of the mapping h representing the distortion between e and v. 
h′ is derived from h by removing those edit operations which were not involved in mapping the action a to the adapted action a′, in order to avoid misleading the user about the extent to which the application action is being adapted. Thus if h includes a substitution w → w′ that was not involved in adapting the action, that substitution is omitted from the confirmation of the action to be executed by the system. For instance, in the example in Figure 2, the word &quot;now&quot; in the user's input does not correspond to any part of the adapted action, and is not included in the confirmation string. In practice, the confirmation string e′ is derived from the original example pair ⟨e, a⟩.</Paragraph> <Paragraph position="3"> The dialog flow of control proceeds as follows: 1. The active context c is set to a distinguished initial context c0 indicated by the application developer.</Paragraph> <Paragraph position="4"> 2. A prompt associated with the current active context c is played to the user using a speech synthesizer or by playing an audio file. For this purpose the application developer provides a text string (or audio file) for each context in the application. 3. The user's utterance is interpreted as explained in the previous section to produce the pair ⟨v, a′⟩.</Paragraph> <Paragraph position="5"> 4. A match distance d is computed as the sum of the distance computed by the matcher between s and v and the distance computed by the matcher between v and e (where e is the example from which v was derived).</Paragraph> <Paragraph position="6"> 5. If d is smaller than a preset threshold, it is assumed that no confirmation is necessary and the next step is skipped. 6. The system asks the user do you mean: e′. If the user responds positively then proceed to the next step, otherwise return to step 2.</Paragraph> <Paragraph position="7"> 7. The action a′ is executed, and any string output it produces is read to the user with the speech synthesizer.</Paragraph> <Paragraph position="8"> 8. 
The active context is set to the result of evaluating the expression c. Figure 2 shows the strings involved in a dialog turn. Handling the user's verbal response to the confirmation is done with a built-in yes-no context. The generation of confirmation requests requires no work by the application developer. Our approach thus provides an even more extreme version of automatic confirmation generation than that used by Chu-Carroll and Carpenter (1999), where only a small effort is required by the developer. In both cases, the benefits of carefully crafted confirmation requests are being traded for rapid application development.</Paragraph> </Section> </Paper>