<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0612"> <Title>A Robust Dialogue System with Spontaneous Speech Understanding and Cooperative Response</Title> <Section position="2" start_page="0" end_page="57" type="metho"> <SectionTitle> 2 A Multi-Modal Dialogue System </SectionTitle> <Paragraph position="0"> The domain of our dialogue system is &quot;Mt. Fuji sightseeing guidance&quot; (the vocabulary size is 292 words for the recognizer and 948 words for the interpreter, and the test-set word perplexity is 103). The dialogue system is composed of four parts: input by speech recognizer and touch screen, a graphical user interface, an interpreter, and a response generator. The latter two parts are described in Sections 3 and 4.</Paragraph> <Section position="1" start_page="0" end_page="57" type="sub_section"> <SectionTitle> 2.1 Spontaneous Speech Recognizer </SectionTitle> <Paragraph position="0"> The speech recognizer uses a frame-synchronous one-pass Viterbi algorithm and an Earley-like parser for context-free grammar, with HMMs as syllable units. Input speech is analyzed acoustically into cepstrum coefficients and their regression coefficients (ΔCEP). The acoustic models consist of 113 syllable-based HMMs, each of which has 5 states, 4 Gaussian densities and 4 discrete duration distributions. The speaker-independent HMMs were adapted to the test speaker using 20 utterances for the adaptation. The grammar used in our speech recognizer is a context-free grammar which describes both syntactic and semantic information.</Paragraph> <Paragraph position="1"> Our recognizer integrates the acoustic process with the linguistic process directly, without an intermediate phrase or word lattice. This architecture is better suited not only to read speech but also to spontaneous speech than hierarchical architectures interleaved with a phrase lattice (Kai and Nakagawa, 95). Furthermore, the recognizer processes interjections and restarts based on an unknown-word processing technique. The unknown-word processing part uses HMM likelihood scores for arbitrary syllable sequences.</Paragraph> <Paragraph position="2"> The context-free grammar is designed to accept sentences with omitted post-positions and word-order inversion so that spontaneous speech can be recognized. We assume that interjections and restarts occur at phrase boundaries. Thus, our speech recognizer for read speech was extended to deal with spontaneous speech.</Paragraph> </Section> <Section position="2" start_page="57" end_page="57" type="sub_section"> <SectionTitle> 2.2 Touch screen (pointing device) </SectionTitle> <Paragraph position="0"> The touch panel used here is an electrostatic type produced by Nissya International System Inc., with a resolution of 1024 x 1024 points. The panel is attached to the 21-inch display of a SPARC-10, which has coordinate axes of 1152 x 900 and a transmission speed of 180 points/sec.</Paragraph> <Paragraph position="1"> The touch-screen input is used to designate a location on the displayed map around Mt. Fuji (the main area related to our task) or to select the desired item from the menu, which consists of the set of items spoken by the speech synthesizer.</Paragraph> <Paragraph position="2"> A response through the speech synthesizer alone is convenient; however, the user cannot remember the content when it includes many items. Therefore, we use display output (map and menu) as well as speech synthesis for the response. The mapping from panel coordinates to display coordinates is sketched below.</Paragraph>
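As an illustration, the panel-to-display scaling implied by the figures above (1024 x 1024 panel points onto 1152 x 900 display coordinates) might look like the following. This is a minimal sketch under an assumed axis orientation, not the system's actual code.

```python
# Minimal sketch (not the system's actual code) of scaling raw touch-panel
# points (0..1023 on each axis, per the text) to SPARC-10 display
# coordinates (1152 x 900, per the text). Axis orientation is assumed.

PANEL_RES = 1024                    # panel resolution per axis
DISPLAY_W, DISPLAY_H = 1152, 900    # display coordinate axes

def panel_to_display(px: int, py: int) -> tuple[int, int]:
    """Scale a raw panel point to display coordinates."""
    return px * DISPLAY_W // PANEL_RES, py * DISPLAY_H // PANEL_RES

# A touch at the panel center maps to roughly the display center.
print(panel_to_display(512, 512))   # -> (576, 450)
```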
<Paragraph position="3"> The user can use the positioning/selecting input and speech input at the same time. For example, the user can utter &quot;Is here ...?&quot; while pointing at a location or a menu item. In this case, the system regards the demonstrative &quot;here&quot; as a keyword referring to what the user has positioned/selected.</Paragraph> </Section> <Section position="3" start_page="57" end_page="57" type="sub_section"> <SectionTitle> 2.3 Graphical User Interface </SectionTitle> <Paragraph position="0"> In man-machine communication, the user wants to know the machine's situation, i.e., what information it has obtained from the dialogue and how it interprets/understands his utterances, as well as the speech recognition result. Therefore our system displays the history of the dialogue. This function helps to eliminate the user's uneasiness. Figure 1 illustrates an example of the map, menu and history. The multi-modal response algorithm is very simple, because the system always responds to the user through the speech synthesizer and, whenever it is possible to respond through the display, it does so as well. The whole analysis process of the interpreter (Section 3) is carried out as below: 1. The steps in the following process are carried out one by one. When one of the steps succeeds, go to process 2. If all of the steps fail, go to process 4.</Paragraph> <Paragraph position="1"> (a) syntax and semantics analysis for a legal sentence without omission of post-positions or inversion of word order.</Paragraph> <Paragraph position="2"> (b) syntax and semantics analysis for a sentence including omission of post-positions.</Paragraph> <Paragraph position="3"> (c) syntax and semantics analysis for a sentence including omission of post-positions and inversion of word order.</Paragraph> <Paragraph position="4"> (d) syntax and semantics analysis for a sentence including invalid (misrecognized) post-positions and inversion of word order.</Paragraph> <Paragraph position="5"> 2. Fundamental contextual processing is performed. (a) Replace demonstrative words with adequate words registered in a demonstrative-word database. (b) Unify different semantic networks which are considered to be semantically equivalent to each other, using default knowledge (processing for semantic omissions).</Paragraph> <Paragraph position="6"> 3. The semantic representation of the sentence is checked using contextual knowledge (we call this filtering hereafter).</Paragraph> <Paragraph position="7"> (a) correct case: Output the semantic representation as the analysis result (end of analysis). (b) incorrect case: If there are some heuristics for correction, apply them to the semantic representation; the corrected semantic representation is the result of the analysis (end of analysis). If there are no applicable heuristics, go to process 4.</Paragraph> <Paragraph position="8"> 4. Keyword analysis (mentioned later) is performed using a partial result of the analysis. First, the interpreter assumes that there are no omissions and inversions in the sentence (1-a). Second, when the analysis fails, the interpreter uses heuristics which enable it to recover about 90% of inversions and post-position omissions (Yamamoto et al., 92) (1-b, c). Furthermore, when the analysis with these heuristics also fails, the interpreter assumes that a post-position is wrong: post-positions assumed to be wrong are ignored, and the correct post-position is guessed using the above heuristics (1-d). The interpreter gives priority to the interpretation in which the number of post-positions assumed to be wrong is as small as possible. This staged analysis is sketched below.</Paragraph>
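The following sketch makes the staged analysis and its fallbacks concrete. The helpers (parse, filter_accepts, keyword_analysis) are toy stand-ins, not the authors' implementation; only the control flow follows processes 1-4 above, and process 2 (contextual processing) is omitted for brevity.

```python
# Minimal sketch of the interpreter's staged analysis (processes 1-4 above).
# parse(), filter_accepts() and keyword_analysis() are toy stand-ins for the
# real syntactic/semantic analyzer, contextual filter and template matcher.

POSTPOSITIONS = ("wa", "ga", "wo", "ni")  # illustrative Japanese particles

def parse(words, allow_omission=False, allow_inversion=False,
          allow_bad_postposition=False):
    """Toy parser: only models permissiveness toward omitted post-positions;
    the inversion/bad-post-position flags mirror steps 1-c and 1-d."""
    if not any(w in POSTPOSITIONS for w in words) and not allow_omission:
        return None                                  # strict mode (1-a) fails
    return {"pred": words[-1], "args": words[:-1]}   # pseudo semantic network

def filter_accepts(network):
    """Toy contextual filter (process 3): reject registered wrong patterns."""
    return network["pred"] != "unknown"

def keyword_analysis(words):
    """Fallback (process 4): keyword matching against template networks."""
    return {"pred": "template",
            "args": [w for w in words if w not in POSTPOSITIONS]}

def interpret(words):
    # Process 1: progressively more permissive analyses (1-a .. 1-d).
    for mode in ({},
                 {"allow_omission": True},
                 {"allow_omission": True, "allow_inversion": True},
                 {"allow_omission": True, "allow_inversion": True,
                  "allow_bad_postposition": True}):
        network = parse(words, **mode)
        if network:
            break
    else:
        return keyword_analysis(words)  # all parses failed -> process 4

    if filter_accepts(network):         # 3-a: contextually correct
        return network
    return keyword_analysis(words)      # 3-b fallback when no heuristic applies

print(interpret(["Fuji-san", "mietai"]))  # post-position omitted -> mode 1-b
```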
<Paragraph position="9"> Human agents can recover illegal sentences by using general syntactic and/or contextual knowledge. To perform this process by computer, we realized the filtering process (3-b). Contextually disallowed semantic representations are registered as filters. This process has two functions. One is to block semantic networks that are the same as the registered networks for wrong patterns. The other is to modify networks so that they can be accepted as semantically correct. If the input pattern matches one of the registered patterns, its semantic representation is rejected, and the correction procedure is applied if possible. The patterns are specified as semantic representations including variables, and the matching algorithm works in a unification-like manner. When no network has been generated at this stage, the interpreter checks the sentence using the keyword-based method (4). The interpreter has several dozen template networks which have semantic conditions on some nodes. If one of them is satisfied by some words in the sentence, it is accepted as the corresponding semantic network.</Paragraph> </Section> </Section> <Section position="3" start_page="57" end_page="57" type="metho"> <SectionTitle> 4 The Cooperative Response Generator </SectionTitle> <Paragraph position="0"> A dialogue system using natural language must be designed so that it can respond to users cooperatively. For example, if a user's query does not provide enough conditions/information for the system to answer the question, or if too much information is retrieved from the knowledge database for the user's question, the dialogue manager queries the user to obtain the necessary conditions or to narrow down the candidates, respectively. Further, if the system cannot retrieve any information related to the user's question, the generator proposes an alternative plan. Based on these considerations, we developed a cooperative response generator in the dialogue system.</Paragraph> <Paragraph position="1"> The response generator is composed of a dialogue manager, an intention (focus) analyzer, a problem solver, knowledge databases, and a response sentence generator, as shown in Figure 2 (lower part). The control flow of this pipeline is sketched below.</Paragraph>
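The following is a minimal sketch of the pipeline's control flow under our own assumptions. All function names, the MAX_ITEMS threshold, and the sample database entries are invented for illustration; the paper does not specify them.

```python
# Minimal sketch of the cooperative response generator's control flow
# (intention analysis -> retrieval -> relaxation -> response). All names,
# the MAX_ITEMS threshold and the sample data are hypothetical.

MAX_ITEMS = 5  # assumed limit before the manager asks the user to narrow down

def extract_intention(network):
    """Stand-in intention (focus) analyzer."""
    return network.get("intention", "search"), network.get("conditions", {})

def retrieve(conditions, knowledge_db, relax=False):
    """Stand-in problem solver: match items; optionally relax a condition."""
    keys = list(conditions)[:-1] if relax and conditions else list(conditions)
    return [item for item in knowledge_db
            if all(item.get(k) == conditions[k] for k in keys)]

def respond(network, knowledge_db):
    intention, conditions = extract_intention(network)
    items = retrieve(conditions, knowledge_db)
    if not items:                # no hit: propose an alternative plan
        items = retrieve(conditions, knowledge_db, relax=True)
    if len(items) > MAX_ITEMS:   # too many hits: ask for further conditions
        kinds = ", ".join(sorted({i["type"] for i in items}))
        return "Which kind do you prefer? " + kinds
    return "Found: " + ", ".join(i["name"] for i in items)

db = [{"name": "Lake Kawaguchi", "type": "lake", "area": "north"},
      {"name": "Lake Yamanaka", "type": "lake", "area": "east"}]
print(respond({"intention": "search", "conditions": {"type": "lake"}}, db))
```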
<Paragraph position="2"> Firstly, the dialogue manager receives a semantic representation (that is, a semantic network) from the semantic interpreter for the user's utterance.</Paragraph> <Paragraph position="3"> The dialogue manager is the component which carries out operations such as dialogue management, control of contextual information, and queries to the user.</Paragraph> <Paragraph position="4"> Secondly, to obtain the intention needed for managing the dialogue, the dialogue manager passes the semantic network to the intention (focus) analyzer, which extracts the dialogue intention and the conditions/information of the user's query. Then the dialogue manager decides the flow of the dialogue using the intention sent back from the intention analyzer, and acquires available information from the dialogue history as contextual information. Thirdly, the dialogue manager passes the semantic network and the contextual information to the problem solver, which retrieves information from the knowledge database. Further, if the problem solver cannot retrieve any information related to the user's question, it proposes an alternative plan (information) by changing a part of the conditions of the user's query and sends it back to the dialogue manager.</Paragraph> <Paragraph position="5"> Then the dialogue manager counts the number of retrieved items. If too much information has been retrieved from the knowledge database for the user's question, the dialogue manager asks the user for further conditions in order to select among the information. If the number of items is adequate, the dialogue manager gives the semantic network and the retrieved information to the response sentence generator.</Paragraph> <Paragraph position="6"> Finally, the response sentence generator decides a response form from the received inputs and then forms response sentence networks according to this form. After this process is finished, the response sentence generator converts these networks into response sentences.</Paragraph> </Section> </Paper>