<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3002"> <Title>Hybrid Statistical and Structural Semantic Modeling for Thai Multi- Stage Spoken Language Understanding</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Multi-Stage SLU </SectionTitle> <Paragraph position="0"> In the design of our spoken dialogue system, the dialogue manager decides to respond to the user after perceiving the user goal. In some types of goal, information items contained in the utterance are required for communication. For example the goal &quot;request for facilities&quot; must come with the facilities the user is asking for, and the goal &quot;request for prerequisite keys&quot; aims to have the user state the reserved date and the number of participants. Hence, the SLU module must be able to identify the goal and extract the required information items.</Paragraph> <Paragraph position="1"> We proposed a novel SLU model (Wutiwiwatchai and Furui, 2003b) that processes an input utterance in three stages, concept extraction, goal identification, and concept-value recognition. Figure 1 illustrates the overall architecture of the SLU model, in which its components are described in detail as follows: Figure 1. Overall architecture of the multi-stage SLU.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Concept extraction </SectionTitle> <Paragraph position="0"> The function of concept extraction is similar to that of other works, aiming to extract a set of concepts from an input utterance. However, our way to define a concept is rather different.</Paragraph> <Paragraph position="1"> preted from a sequence of words arbitrarily placed in the utterance (the sequence can overlap or cross each other).</Paragraph> <Paragraph position="2"> Examples of utterances and concepts contained in the utterances are shown in Table 1. A word sequence or substring corresponding to the concept is presented in the form of a label sequence. The 'e' and two-alphabet symbols such as 'fd' denote the words required to indicate the concept. The two-alphabet symbols additionally specify keywords used for concept-value recognition. The '-' is for other words not related to the concept. As defined above, a concept such as 'reqprovide' (asking whether something is provided) is expressed by the substring &quot;there is ... right&quot;, which contains two separated strings, &quot;there is&quot; and &quot;right&quot;. In the same utterance, another concept 'yesnoq' (asking by a yes-no question) also possesses the word 'right'. We considered this method of definition to have more impact for presenting the meaning of concepts, compared to what has been defined in other works. It must be noted that some concepts contain values such as the concept 'numperson' (the number of people), whereas some do not, such as the concept We implemented the concept extraction component by using weighted finite-state transducers (WFSTs).</Paragraph> <Paragraph position="3"> Similar to the implementation of salient grammar fragments in Gorin et al. (1997), the possible word sequences expressed for a concept are encoded in a WFST, one for each type of concept. Figure 2 demonstrates a portion of WFST for the concept 'numperson'. Each arc or transition of the WFST is labeled with an input word (or word class) followed after a colon by an output semantic label, and enclosed after a slash by a weight. A special symbol 'NIL' represents any word not included in the concept. 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Goal identification </SectionTitle>
<Paragraph position="0"> Having extracted the concepts, the goal of the utterance can be identified. The goal in our case can be considered a derivative of the dialogue act coupled with additional information. As the examples in Table 1 show, the goal 'request_facility' means a request (dialogue act) for some facilities (additional information). Since we observed in our largest corpus that only 1.1% of utterances contained multiple goals, an utterance is assumed to have only one goal.</Paragraph>
<Paragraph position="1"> The goal identification task can be viewed as a simple pattern classification problem, where a goal is identified given an input vector of binary values indicating the presence of the predefined concepts. Our previous work (Wutiwiwatchai and Furui, 2003b) showed that this task can be achieved efficiently by a simple multi-layer perceptron type of artificial neural network (ANN).</Paragraph>
</Section>
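The sketch below illustrates this formulation with scikit-learn (hypothetical code, not the authors' network or data; the concept inventory, the goal label 'state_prerequisite', and the training pairs are toy assumptions): extracted concepts are mapped to a binary vector and a small multi-layer perceptron predicts the goal.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Toy concept inventory; the real system uses the full set of predefined concepts.
    CONCEPTS = ["reqprovide", "yesnoq", "numperson", "reservedate", "facility"]

    def concept_vector(extracted):
        """Binary vector indicating which concepts were found in the utterance."""
        return [1.0 if c in extracted else 0.0 for c in CONCEPTS]

    # Toy training pairs: extracted-concept sets and their goal labels.
    X = np.array([concept_vector({"reqprovide", "yesnoq", "facility"}),
                  concept_vector({"numperson", "reservedate"}),
                  concept_vector({"yesnoq", "facility"})])
    y = ["request_facility", "state_prerequisite", "request_facility"]

    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(clf.predict([concept_vector({"reqprovide", "facility"})]))   # predicted goal for a new vector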
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Concept-value recognition </SectionTitle>
<Paragraph position="0"> Recall again that some concepts contain values, such as the concept 'numperson', whose value is the number of people, whereas some concepts do not, such as the concept 'yesnoq'. Given an input utterance, the SLU module must be able to identify the goal and extract information items such as the reserved date, the number of people, the name of the facility, etc. The concepts extracted in the first stage are not only used to identify the goal, but are also strongly related to these information items; that is, the values of the concepts are actually the required information items. Hence, extracting the information items amounts to recognizing the concept values.</Paragraph>
<Paragraph position="1"> Since the keywords within a concept have already been labeled by WFST composition in the concept extraction step, recognizing the concept value is just a matter of converting the labeled keywords into a certain format. For the sake of explanation, consider the utterance &quot;two nights from the sixth of July&quot; in Table 1. After parsing by the 'reservedate' (reserved date) concept WFST, the substring &quot;from the sixth of July&quot; is accepted, with the words &quot;sixth&quot; and &quot;July&quot; labeled by the symbols 'fd' and 'fm' respectively. These label symbols are specifically defined for each type of concept and have unique meanings, e.g. 'fd' for the check-in date, 'fm' for the check-in month, etc. The labeled keywords are then converted into a predefined format for the concept value. The value of the 'reservedate' concept is of the form <fy-fm-fd_ty-tm-td>, so the labeled keywords &quot;sixth(fd) July(fm)&quot; are converted to <04-07-06_ty-tm-td>. It must be noted that although the check-in year is not stated in the utterance, the concept-value recognition process, using its knowledge base, inherently assigns the value '04' (the year 2004) to 'fy'. This process can greatly help in resolving anaphoric expressions in natural conversation. Table 2 gives more examples of substrings accepted and labeled by the 'reservedate' WFST, together with their corresponding values. Currently, this conversion task is performed by simple rules.</Paragraph>
<Paragraph position="2"> Although the Reg model described in Sect. 3.1 is able to capture long-distance dependencies for seen grammar, it certainly fails to parse an utterance with unseen grammar, especially when the utterance is distorted by speech recognition errors. This article thus presents an effort to improve concept extraction and concept-value recognition by incorporating a statistical approach.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 N-gram modeling </SectionTitle>
<Paragraph position="0"> We can view the concept extraction process as a sequence labeling task, where a label sequence L = (l_1 ... l_T), as shown in the &quot;Label sequence&quot; lines of Table 1, is determined given a word string W = (w_1 ... w_T). Each label, in the form {c:l}, refers to the c-th concept with keyword label l. A word is allowed to belong to multiple concepts, and hence may carry multiple keyword labels, such as {1:e, 3:e} in the last line of Table 1.</Paragraph>
<Paragraph position="2"> Finding the most probable sequence L is equivalent to maximizing the joint probability P(W,L), which can be simplified using n-gram modeling (n = 2 for bigram) as follows:</Paragraph>
<Paragraph position="3"> P(W,L) ≈ ∏_{t=1}^{T} P(w_t, l_t | w_{t-1}, l_{t-1})   (1) </Paragraph>
<Paragraph position="4"> The described n-gram model, called 'Ngram' hereafter, can also be implemented by a WFST, whose weights are the smoothed n-gram probabilities. Parsing an utterance by the Ngram WFST is performed simply by applying the WFST composition in the same way as with the Reg model.</Paragraph>
</Section>
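To illustrate Eq. (1), the following rough Python sketch (illustrative only; the label strings and the crude probability floor standing in for real smoothing are assumptions, not the authors' implementation) estimates joint word/label bigram statistics from labelled utterances and scores a candidate label sequence.

    import math
    from collections import defaultdict

    bigram_counts = defaultdict(int)    # counts of ((w_{t-1}, l_{t-1}), (w_t, l_t))
    context_counts = defaultdict(int)   # counts of (w_{t-1}, l_{t-1})

    def train(labelled_utterances):
        """labelled_utterances: lists of (word, label) pairs, e.g. [('sixth', '4:fd'), ...]."""
        for pairs in labelled_utterances:
            prev = ("<s>", "<s>")
            for pair in pairs:
                bigram_counts[(prev, pair)] += 1
                context_counts[prev] += 1
                prev = pair

    def log_joint(pairs):
        """log P(W, L) under the bigram approximation of Eq. (1)."""
        logp, prev = 0.0, ("<s>", "<s>")
        for pair in pairs:
            num, den = bigram_counts[(prev, pair)], context_counts[prev]
            # A real system would use smoothed n-gram probabilities; a tiny floor stands in here.
            logp += math.log(num / den) if num else math.log(1e-6)
            prev = pair
        return logp

    example = [("from", "4:e"), ("the", "4:e"), ("sixth", "4:fd"), ("of", "4:e"), ("July", "4:fm")]
    train([example])
    print(log_joint(example))   # 0.0 here, since every bigram was seen exactly once

In the actual model these smoothed probabilities become the weights of the Ngram WFST, so decoding is again performed by WFST composition.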
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Logical n-gram modeling </SectionTitle>
<Paragraph position="0"> Although the n-gram model can assign a likelihood score to any input utterance, it cannot distinguish between valid and invalid grammatical structure. On the other hand, the regular grammar model can give semantic tags to an utterance that is permitted by the grammar, but it always rejects an ungrammatical utterance. Thus, a probabilistic approach that integrates the advantages of both models is desirable.</Paragraph>
<Paragraph position="1"> Our proposed model, motivated mainly by Bechet et al. (2002), combines the statistical and structural models in two-pass processing. First, the conventional n-gram model is used to generate the M-best hypotheses of label sequences given an input word string.</Paragraph>
<Paragraph position="2"> The likelihood score of each hypothesis is then boosted when its word-and-label syntax is permitted by the regular grammar model. By rescoring the M-best list using the modified scores, the syntactically valid sequence with the highest n-gram probability is reordered to the top. Even if no label sequence is permitted by the regular grammar, the hybrid model is still able to output the best sequence based on the original n-gram scores. Since the proposed model aims to enhance the logic of the n-gram outputs, it is named the logical n-gram model.</Paragraph>
<Paragraph position="3"> This idea can be implemented efficiently in the WFST framework as depicted in Fig. 3. First, the concept-specific Reg WFST is modified from the one shown in Fig. 2 by replacing the weight -1 with a variable -λ, which can be adjusted empirically to obtain the best result. An unknown word string, in the form of a finite-state machine, is parsed by the Ngram WFST, producing a WFST of M-best label-sequence hypotheses. Concepts are detected in the top hypothesis. Then, the concept-value recognition process is applied to each detected concept separately. In this process, the M-best WFST is intersected with the concept-specific Reg WFST. Rescoring the result yields a new WFST of P-best (P < M) hypotheses, with a score in the logarithmic domain assigned to each hypothesis by</Paragraph>
<Paragraph position="4"> score(W,L) = log P(W,L) + Σ_{t=1}^{T} λ_t   (2) </Paragraph>
<Paragraph position="5"> where λ_t ∈ {0, λ}. If λ is set to 0, the intersection operation simply filters out the hypotheses that violate the regular grammar, while the original scores from the n-gram model are left unaltered. If a larger λ is used, a hypothesis that contains a longer stretch of valid syntax is given a higher score. When no hypothesis in the M-best list is permitted by the grammar (P = 0), the top hypothesis of the M-best list is output. It is noted that this strategy of eliminating n-gram paths that violate the syntax has also been used successfully in a WFST-based speech recognition system (Szarvas and Furui, 2003). Hereafter, we will refer to the logical n-gram model as 'LNgram'.</Paragraph>
</Section>
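A schematic sketch of this rescoring step is shown below (Python with assumed interfaces rather than the authors' WFST intersection; the helper words_covered_by_grammar is hypothetical): each of the M-best label sequences from the n-gram model is checked against the concept regular grammar, a reward of λ is added for every word covered by a valid match as in Eq. (2), and the list is reordered by the modified score, falling back to the n-gram 1-best when no hypothesis is accepted.

    def logical_ngram_rescore(m_best, words_covered_by_grammar, lam=1.0):
        """
        m_best: list of (label_sequence, ngram_log_score) pairs, best first.
        words_covered_by_grammar: function returning how many words of a hypothesis are
            covered by a match of the concept regular grammar (0 if the hypothesis is rejected).
        lam: reward per grammatical word (the -lambda arc weight of the modified Reg WFST).
        """
        rescored = []
        for labels, ngram_score in m_best:
            covered = words_covered_by_grammar(labels)
            if covered:                        # hypothesis permitted by the grammar
                rescored.append((labels, ngram_score + lam * covered))
        if not rescored:                       # P = 0: no hypothesis survives the grammar
            return [m_best[0]]                 # fall back to the original n-gram 1-best
        return sorted(rescored, key=lambda h: h[1], reverse=True)

Setting lam to 0 reduces this to pure filtering of ungrammatical hypotheses; larger values increasingly favour hypotheses with longer stretches of valid syntax.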
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 The use of ASR N-best hypotheses </SectionTitle>
<Paragraph position="0"> The probabilistic model allows the use of N-best hypotheses from the automatic speech recognition (ASR) engine. As described in Sect. 4.1, our Ngram semantic model produces a joint probability P(W,L), which indicates the chance that the semantic-label sequence L occurs with the word hypothesis W. When the N-best word hypotheses generated by the ASR are fed into the Ngram semantic parser, the parsed scores are combined with the ASR likelihood scores in a log-linear interpolation fashion (Klakow, 1998), as shown in Eq. 3:</Paragraph>
<Paragraph position="1"> (W*, L*) = argmax_{W ∈ Φ_N, L} P(A,W)^{1-θ} P(W,L)^{θ}   (3) </Paragraph>
<Paragraph position="2"> where A is the acoustic speech signal and P(A,W) is the product of an acoustic score P(A|W) and a language score P(W). Φ_N denotes the N-best list and θ is an interpolation weight, which can be adjusted experimentally to give the best result. Compared with normal linear interpolation, this log-linear interpolation can be implemented easily in a WFST framework.</Paragraph>
<Paragraph position="3"> An N-best list can be used in the LNgram with the same criterion as well. The only necessary precaution is choosing an appropriate size M for the M-best semantic-label list, which is rescored in the second pass to improve the concept-value result.</Paragraph>
</Section> </Section> </Paper>