<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1715"> <Title>SALT: An XML Application for Web-based Multimodal Dialog Management</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Dialog Architecture Overview </SectionTitle> <Paragraph position="0"> With the advent of XML Web services, the Web has quickly evolved into a gigantic distributed computer where Web services, communicating in XML, play the role of reusable software components. Using the universal description, discovery, and integration (UDDI) standard, Web services can be discovered and linked up dynamically to collaborate on a task. In other words, Web services can be regarded as the software agents envisioned in the open agent architecture (Bradshaw 1996). Conceptually, the Web infrastructure provides a straightforward means to realize the agent-based approach suitable for modeling highly sophisticated dialog (Sadek et al 1997). This distributed model shares the same basis as the SALT dialog management architecture.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Page-based Dialog Management </SectionTitle> <Paragraph position="0"> An examination on human to human conversation on trip planning shows that experienced agents often guide the customers in dividing the trip planning into a series of more manageable and relatively untangled subtasks (Rudnicky et al 1999). Not only the observation contributes to the formation of the plan-based dialog theory, but the same principle is also widely adopted in designing GUI-based transactions where the subtasks are usually encapsulated in visual pages. Take a travel planning Web site for example. The first page usually gathers some basic information of the trip, such as the traveling dates and the originating and destination cities, etc. All the possible travel plans are typically shown in another page, in which the user can negotiate on items such as the price, departure and arrival times, etc. To some extent, the user can alter the flow of interaction. If the user is more flexible for the flight than the hotel reservation, a well designed site will allow the user to digress and settle the hotel reservation before booking the flight. Necessary confirmations are usually conducted in separate pages before the transaction is executed.</Paragraph> <Paragraph position="1"> The designers of SALT believe that spoken dialog can be modeled by the page-based interaction as well, with each page designed to achieve a sub-goal of the task. There seems no reason why the planning of the dialog cannot utilize the same mechanism that dynamically synthesizes the Web pages today.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Separation of Data and Presentation </SectionTitle> <Paragraph position="0"> SALT preserves the tremendous amount of flexibility of a page-based dialog system in dynamically adapting the style and presentation of a dialog (Wang 2000). A SALT page is composed of three portions: (1) a data section corresponding to the information the system needs to acquire from the user in order to achieve the sub-goal of the page; (2) a presentation section that, in addition to GUI objects, contains the templates to generate speech prompts and the rules to recognize and parse user's utterances; (3) a script section that includes inference logic for deriving the dialog flow in achieving the goal of the page. 
<Paragraph position="1"> This document structure is motivated by the following considerations. First, the separation of the presentation from the rest localizes the natural language dependencies. An application can be ported to another language by changing only the presentation section, without affecting the other sections. Also, a good dialog must dynamically strike a balance between system-initiative and user-initiative styles. However, the need to switch interaction styles does not necessitate changes in the dialog planning. The SALT document structure maintains this type of independence by separating the data section from the rest of the document, so that when the interaction style needs to change, the script and presentation sections can be modified without affecting the data section. The same mechanism also enables the application to switch among various UI modes, for example in mobile environments where the interaction must switch seamlessly between GUI and speech-only modes in hands-busy, eyes-busy situations.</Paragraph> <Paragraph position="2"> The presentation section may vary significantly among the UI modes, but the rest of the document can remain largely intact.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.3 Semantic Driven Multimodal Integration </SectionTitle> <Paragraph position="0"> SALT follows the common GUI practice and employs an object-oriented, event-driven model to integrate multiple input methods. The technique tracks the user's actions and reports them as events. An object is instantiated for each event to describe its cause. For example, when a user clicks on a graphical icon, a mouse-click event is fired. The mouse-click event object contains information such as the coordinates at which the click takes place. SALT extends this mechanism to speech input, in which the notion of semantic objects (Wang 2000, Wang 1998) is introduced to capture the meaning of spoken language.</Paragraph> <Paragraph position="1"> When the user says something, speech events, furnished with the corresponding semantic objects, are reported. The semantic objects are structured and categorized. For example, the utterance &quot;Send mail to John&quot; is composed of two nested semantic objects: &quot;John,&quot; representing the semantic type &quot;Person,&quot; and the whole utterance, representing the semantic type &quot;Email command.&quot; SALT therefore enables a multimodal integration algorithm based on semantic type compatibility (Wang 2001). The same command can be manifested in a multimodal expression, as in &quot;Send email to him [click],&quot; where the email recipient is given by a point-and-click gesture. Here the semantic type provides a straightforward way to resolve the cross-modality reference: the handler for the GUI mouse-click event can be programmed to produce a semantic object of type &quot;Person,&quot; which can subsequently be identified as a constituent of the &quot;Email command&quot; semantic object.
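As a hedged illustration (the XML rendering and element names below are assumptions; only the semantic types are taken from the description above), the recognizer might return a nested semantic object such as <EmailCommand text=&quot;send email to him&quot;> <recipient> <Person text=&quot;him&quot;/> </recipient> </EmailCommand>, while the mouse-click handler produces a semantic object of type &quot;Person&quot; (e.g., <Person>John</Person>) that fills the unresolved recipient constituent.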
Because the notion of semantic objects is quite generic, dialog designers should find little difficulty employing other multimodal integration algorithms in SALT, such as the unification-based approach described in (Johnston et al. 1997).</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Basic Speech Elements in SALT </SectionTitle> <Paragraph position="0"> SALT speech objects encapsulate speech functionality. They resemble GUI objects in many ways. Because they share the same high-level abstraction, SALT speech objects interoperate with GUI objects in a seamless and consistent manner. Multimodal dialog designers can elect to ignore the modality of communication, much the same way as they are insulated from having to distinguish whether a text string is entered into a field through a keyboard or cut and pasted with a pointing device.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The Listen Object </SectionTitle> <Paragraph position="0"> The &quot;listen&quot; object in SALT is the speech input object. The object must be initialized with a speech grammar that defines the language model and the lexicon relevant to the recognition task.</Paragraph> <Paragraph position="1"> The object has a start method that, upon invocation, collects the acoustic samples and performs speech recognition. If the language model is a probabilistic context free grammar (PCFG), the object can return the parse tree of the recognized outcome. Optionally, dialog designers can embed XSLT templates or scripts in the grammar to shape the parse tree into any desired format. The most common usage is to transform the parse tree into a semantic tree composed of semantic objects.</Paragraph> <Paragraph position="2"> A SALT object is instantiated in an XML document whenever a tag bearing the object name is encountered. For example, a listen object can be instantiated as in the sketch below.</Paragraph> <Paragraph position="4"> The object, named &quot;foo,&quot; is given a speech grammar whose uniform resource identifier (URI) is specified via a <grammar> constituent.</Paragraph> <Paragraph position="5"> As in the case of HTML, methods of an object are invoked via the object name. For example, the command to start the recognition is foo.start() in the ECMAScript syntax. Upon a successful recognition and parsing, the listen object raises the event &quot;onreco.&quot; The event handler, f(), is associated in the HTML syntax, as in the sketch below.</Paragraph> <Paragraph position="6"> If the recognition result is rejected, the listen object raises the &quot;onnoreco&quot; event, which, in the sketch below, invokes function g(). As mentioned in Sec. 1.2, these event handlers reside in the script section of a SALT page that manages the within-page dialog flow. Note that SALT is designed to be agnostic to the syntax of the eventing mechanism. Although the examples throughout this article use HTML syntax, SALT can operate with other eventing standards, such as World Wide Web Consortium (W3C) XML Document Object Model (DOM) Level 2, ECMA Common Language Infrastructure (CLI), or the upcoming W3C proposal called XML Events.</Paragraph> <Paragraph position="7"> The SALT listen object can operate in one of three modes designed to meet different UI requirements. The automatic mode detects the end of the utterance and cuts off the audio stream automatically. This mode is most suitable for push-to-talk UIs or telephony-based systems.</Paragraph>
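<Paragraph> The following is a hedged sketch of such a declaration (the grammar URI and the handler bodies are illustrative; the object name, events, and mode follow the description above):
<listen id=&quot;foo&quot; mode=&quot;automatic&quot; onreco=&quot;f()&quot; onnoreco=&quot;g()&quot;>
  <grammar src=&quot;http://example.org/flight.grxml&quot; />
</listen>
Invoking foo.start() begins recognition; a successful parse triggers f(), and a rejection triggers g(). </Paragraph>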
<Paragraph position="8"> Reciprocal to the start method, the listen object also has a stop method for forcing the recognizer to stop listening. The designer can explicitly invoke the stop method rather than rely on the recognizer's default behavior. Invoking the stop method becomes necessary when the listen object operates under the single mode, where the recognizer is mandated to continue listening until the stop method is called. Under the single mode, the recognizer is required to evaluate and return hypotheses based on the full length of the audio, even though some search paths may have reached a legitimate end-of-sentence token in the middle of the audio stream. In contrast, the third, multiple mode allows the listen object to report hypotheses as soon as it sees fit. The single mode is designed for push-hold-and-talk UIs, while the multiple mode is intended for real-time or dictation applications.</Paragraph> <Paragraph position="9"> The listen object also has methods to modify the PCFG it contains. Rules can be dynamically activated and deactivated to control the perplexity of the language model. The semantic parsing templates in the grammar can be manipulated to perform simple reference resolution. For example, the grammar shown below (in SAPI format) demonstrates how a deictic reference can be resolved inside the SALT listen object. In this example, the propname and propvalue attributes are used to generate the semantic objects. If the user says &quot;the left one,&quot; the grammar directs the listen object to return the semantic object <drink text=&quot;the left one&quot;>coffee</drink>. This mechanism for composing semantic objects is particularly useful for processing expressions closely tied to how data are presented. Such a grammar may be used when the computer asks the user to choose a drink by displaying pictures of the choices side by side. However, if the display is tiny, the choices may be rendered as a list, to which a user may say &quot;the first one&quot; or &quot;the bottom one.&quot; SALT allows dialog designers to approach this problem by dynamically adjusting the speech grammar.</Paragraph>
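<Paragraph> A hedged sketch of such a grammar follows (the rule and element names are simplified approximations of the SAPI grammar format; only the propname and propvalue attributes and the resulting semantic object follow the description above):
<listen id=&quot;drink_choice&quot;>
  <grammar>
    <rule name=&quot;drink&quot; toplevel=&quot;active&quot;>
      <list propname=&quot;drink&quot;>
        <phrase propvalue=&quot;coffee&quot;>the left one</phrase>
        <phrase propvalue=&quot;tea&quot;>the right one</phrase>
      </list>
    </rule>
  </grammar>
</listen>
Re-mapping the phrases to &quot;the first one&quot; and &quot;the second one&quot; when the choices are rendered as a list requires changing only these grammar entries. </Paragraph>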
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The prompt object </SectionTitle> <Paragraph position="0"> The SALT &quot;prompt&quot; object is the speech output object. Like the listen object, the prompt object has a start method to begin the audio playback.</Paragraph> <Paragraph position="1"> The prompt object can perform text-to-speech synthesis (TTS) or play pre-recorded audio. For TTS, the prosody and other dialog effects can be controlled by marking up the text with synthesis directives.</Paragraph> <Paragraph position="2"> Barge-in and bookmark are two events of the prompt object that are particularly useful for dialog design. The prompt object raises a barge-in event when the computer detects a user utterance during prompt playback. SALT provides a rich programming interface for dialog designers to specify the appropriate behavior when barge-in occurs. Designers can choose whether to delegate SALT to cut off the outgoing audio stream as soon as speech is detected. Delegated cut-off minimizes the barge-in response time and is close to the behavior expected by users who wish to expedite the progress of the dialog without waiting for the prompt to end. Similarly, non-delegated barge-in lets the user change playback parameters without interrupting the output. For example, the user can adjust the speed and volume using speech commands while the audio playback is in progress. SALT will automatically turn on echo cancellation in this case so that the playback has minimal impact on recognition.</Paragraph> <Paragraph position="3"> The timing of certain user actions, or the lack thereof, often bears semantic implications.</Paragraph> <Paragraph position="4"> Implicit confirmation is a good example, where the absence of an explicit correction from the user is considered a confirmation. The prompt object introduces an event for reporting the landmarks of the playback. The typical way of catching playback landmarks in SALT is to embed bookmarks in the prompt text (see the sketch at the end of this subsection): when the synthesizer reaches the TTS markup <bookmark>, the onbookmark event is raised and the event handler f() is invoked. When a barge-in is detected, the dialog designer can determine whether the barge-in occurred before or after the bookmark by inspecting whether the function f() has been called.</Paragraph> <Paragraph position="5"> Multimedia synchronization is another main use of TTS bookmarks. When the speech output is accompanied by, for example, graphical animations, TTS bookmarks are an effective mechanism to synchronize these parallel outputs.</Paragraph> <Paragraph position="6"> To include dynamic content in the prompt, SALT adopts a simple template-based approach for prompt generation. In other words, the carrier phrases can be either pre-recorded or hard-coded, while the key phrases can be inserted and synthesized dynamically. The prompt object that confirms a travel plan may appear as the following in HTML: <input name=&quot;origin&quot; type=&quot;text&quot; /> <input name=&quot;destination&quot; type=&quot;text&quot; /> <input name=&quot;date&quot; type=&quot;text&quot; /> ...</Paragraph> <Paragraph position="7"> <prompt ...> Do you want to fly from <value targetElement=&quot;origin&quot;/> to <value targetElement=&quot;destination&quot;/> on <value targetElement=&quot;date&quot;/>? </prompt> As shown above, SALT uses a <value> tag inside a prompt object to refer to data contained in other parts of the SALT page. In this example, the prompt object will insert the values of the HTML input objects when synthesizing the prompt.</Paragraph>
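<Paragraph> A hedged sketch combining the bookmark and dynamic-content mechanisms above (the bookmark attribute and the handler wiring are assumptions; the onbookmark event, the <bookmark> markup, and the <value> usage follow the text):
<prompt id=&quot;confirm&quot; onbookmark=&quot;f()&quot;>
  Do you want to fly from <value targetElement=&quot;origin&quot;/>
  to <value targetElement=&quot;destination&quot;/>
  <bookmark mark=&quot;cities_read&quot;/>
  on <value targetElement=&quot;date&quot;/>?
</prompt>
A barge-in detected before f() has fired can then be attributed to the city portion of the prompt rather than to the date. </Paragraph>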
</Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Declarative Rule-based Programming </SectionTitle> <Paragraph position="0"> Although the examples above use procedural programming to manage the dialog flow, SALT designers can also program inference in a declarative, rule-based fashion, in which rules are attached to the SALT objects that capture the user's actions, e.g., the listen object. Instead of authoring procedural event handlers, designers can declare, inside the listen object, rules that are evaluated and invoked when the semantic objects are returned. This is achieved through a SALT <bind> element, as demonstrated below: <listen ...> <grammar .../> <bind test=&quot;/@confidence &lt; 50&quot; targetElement=&quot;prompt_confirm&quot; targetMethod=&quot;start&quot; /> <bind test=&quot;/@confidence &lt; 50&quot; targetElement=&quot;listen_confirm&quot; targetMethod=&quot;start&quot; /> <bind test=&quot;/@confidence &gt;= 50&quot; targetElement=&quot;origin&quot; value=&quot;/city/origin&quot; /> <bind test=&quot;/@confidence &gt;= 50&quot; targetElement=&quot;destination&quot; value=&quot;/city/destination&quot; /> <bind test=&quot;/@confidence &gt;= 50&quot; targetElement=&quot;date&quot; value=&quot;/date&quot; /> ... </listen> </Paragraph> <Paragraph position="1"> The predicate of each rule is applied in turn against the result of the listen object. The predicates are expressed in the standard XML Pattern language in the &quot;test&quot; clause of the <bind> element. In this example, the rules guarded by the first test check whether the confidence level falls below the threshold. If it does, they activate a prompt object (prompt_confirm) for explicit confirmation, followed by a listen object (listen_confirm) to capture the user's response. The speech objects are activated via the start method of the respective object. Object activations are specified in the targetElement and targetMethod clauses of the <bind> element.</Paragraph> <Paragraph position="2"> Similarly, the rules guarded by the second test apply when the confidence score reaches the prescribed level. These rules extract the relevant semantic objects from the parsed outcome and assign them to the respective elements in the SALT page. As shown above, SALT reuses the W3C XPath language for extracting partial semantic objects from the parsed outcome.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 SALT Extensibilities </SectionTitle> <Paragraph position="0"> Naturally spoken language is a modality that can be used in widely diverse environments where user interface constraints and capabilities vary significantly. As a result, it is only practical to define in SALT the speech functions that are universally applicable and implementable. For example, the basic speech input function in SALT only deals with speaker-independent recognition and understanding, even though speaker-dependent recognition and speaker verification are in many cases very useful. Extensibility is therefore crucial to a natural language interface like SALT.</Paragraph> <Paragraph position="1"> SALT follows the XML standards that allow extensions to be introduced on demand without sacrificing document portability. Functions that are not already defined in SALT can be introduced at the component level, or as a new feature of the markup language. In addition, SALT requires that the XML standard be followed so that extensions can be identified and the methods to process them can be discovered and integrated.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Component extensibility </SectionTitle> <Paragraph position="0"> SALT components can be extended with new functions individually through the component configuration mechanism of the <param> element. For example, the <listen> element has an event to signal when speech is detected in the incoming audio stream. However, the standard does not specify an algorithm for detecting speech. A SALT document author can nevertheless declare reference cues so that the document is rendered in a similar way by different processors. The <param> element can be used to set the reference algorithm and threshold for detecting speech in the <listen> object.
Here the parameters are set using an algorithm whose uniform resource name (URN), xyz.edu/algo-1, is declared in an XML namespace attribute of <param>. The parameters for configuring this specific speech detection method are further specified in the child elements. A document processor can perform a schema translation from the URN namespace into any schema the processor understands. For example, if the processor implements a speech detection algorithm whose detection threshold has a different range, the adjustment can easily be made when the document is parsed.</Paragraph> <Paragraph position="1"> The same mechanism is used to extend functionality. For instance, the <listen> object can be used for speaker verification, because the verification algorithm and its programmatic interfaces have much in common with recognition. In SALT, a <listen> object can be extended through configuration parameters to perform speaker verification that, for example, compares a user's voice against a cohort set. The events &quot;onreco&quot; and &quot;onnoreco&quot; are raised when the voice passes or fails the test, respectively. As in the previous case, the extension must be decorated with a URN that specifies the behavior intended by the document author. Because this is an extension, the document processor might not be able to natively discern the semantics implied by the URN.</Paragraph> <Paragraph position="2"> However, XML-based protocols allow the processor to query and employ Web services that can either (1) transform the extended document segment into an XML schema the processor understands, or (2) perform the function described by the URN. By closely following XML standards, SALT documents fully enjoy the extensibility and portability benefits of XML.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Language extensibility </SectionTitle> <Paragraph position="0"> In addition to component extensibility, the whole SALT language can be enhanced with new functionality using XML. Communicating with other modalities and input devices, as well as advanced discourse and context management, are just a few potential uses for language-level extension.</Paragraph> <Paragraph position="1"> The SALT message extension, or <smex> element, is the standard conduit between a SALT document and the outside world. The message element takes <param> as its child element to forge a connection to an external component.</Paragraph> <Paragraph position="2"> Once a link is established, the SALT document can communicate with the external component by exchanging text messages. The <smex> element has a &quot;sent&quot; attribute to which a text message can be assigned. When its value is changed, the new value is regarded as a message intended for the external component and is immediately dispatched. When an incoming message is received, the object places the message in a property called &quot;received&quot; and raises the &quot;onreceive&quot; event.</Paragraph>
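<Paragraph> A hedged sketch of such a conduit (the param name, the endpoint URL, and the handler names are illustrative; the sent attribute, the received property, and the onreceive event follow the text):
<smex id=&quot;remote&quot; onreceive=&quot;handleReply()&quot;>
  <param name=&quot;server&quot;>http://example.org/service</param>
</smex>
<script>
  function send(msg) { remote.sent = msg; /* assigning to sent dispatches the message */ }
  function handleReply() { var msg = remote.received; /* act on the incoming message */ }
</script>
</Paragraph>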
<Paragraph position="3"> Telephones are one of the most important access devices for spoken-language-enabled Web applications. Call control functions play a central role in a telephony SALT application. The <smex> element in SALT is a perfect match for the telephony call control standard known as ECMA 323.</Paragraph> <Paragraph position="4"> ECMA 323 defines standard XML schemas for telephony messages, ranging from simple call answering, disconnecting, and transferring to the switching functionality suitable for large call centers. ECMA 323 employs the same message exchange model as the design of <smex>. This allows a SALT application to tap into rich telephony call control without needing a complicated SALT processor. As shown in (SALT 2002), sophisticated telephony applications can be authored in SALT in a very straightforward manner.</Paragraph> <Paragraph position="5"> In addition to devices such as telephones, the SALT <smex> object may also be employed to connect to Web services or other software components that facilitate advanced discourse semantic analysis and context management. Such capabilities, as described in (Wang 2001), empower the user to realize the full potential of interacting with computers in natural language. For example, the user can simply say &quot;Show driving directions to my next meeting&quot; without having to explicitly and tediously instruct the computer to first look up the next meeting, obtain its location, and copy the location to another program that maps out a driving route.</Paragraph> <Paragraph position="6"> This customized semantic analysis can be achieved in SALT by cascading a <listen> element to such a <smex> element: the <listen> element includes a rudimentary grammar to analyze the basic sentence structure of the user's utterance; instead of resolving every reference (e.g., &quot;my next meeting&quot;) itself, it cascades the result to the <smex> element, which is linked to a Web service specializing in resolving personal references.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> Summary </SectionTitle> <Paragraph position="0"> In this paper, we describe the design philosophy of SALT in using the existing Web architecture for distributed multimodal dialog. SALT allows flexible and powerful dialog management by taking full advantage of the well-publicized benefits of XML, such as the separation of data from presentation. In addition, XML extensibility allows new functions to be introduced as needed without sacrificing document portability. SALT uses this mechanism to accommodate diverse Web access devices and advanced dialog management.</Paragraph> </Section> </Paper>