Robustness Issues in a Data-Driven Spoken Language Understanding System

2 System Overview

Spoken language understanding (SLU) aims to interpret the meanings of users' utterances and respond reasonably to what users have said. A typical architecture of an SLU system is given in Figure 1: it consists of a speech recognizer, a semantic parser, and a dialogue act decoder.

[Figure 1: Typical architecture of a spoken language understanding system: a speech recognizer followed by a semantic parser and a dialogue act decoder.]

Within a statistical framework, the SLU problem can be factored into three stages. First, the speech recognizer recognizes the underlying word string W from each input acoustic signal A, i.e.

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W),$$

then the semantic parser maps the recognized word string ^W into a set of semantic concepts C,

$$\hat{C} = \arg\max_{C} P(C \mid \hat{W}),$$

and finally the dialogue act decoder infers the user's dialogue acts or goals by solving

$$\hat{G}_u = \arg\max_{G_u} P(G_u \mid \hat{C}).$$

The sequential decoding described above is suboptimal since the solution at each stage depends on an exact solution to the previous stage. To reduce the effect of this approximation, a word lattice or N-best word hypotheses can be retained instead of the single best string ^W as the output of the speech recognizer. The semantic parse results may then be combined with the output of the speech recognizer to rescore the N-best list:

$$\hat{W} = \arg\max_{W \in \mathcal{L}_N} \left\{ P(A \mid W)\, P(W)^{\gamma}\, P(C \mid W)^{\alpha} \right\},$$

where P(A|W) is the acoustic probability from the first pass, P(W) is the language modelling likelihood, P(C|W) is the semantic parse score, L_N denotes the N-best list, α is a semantic parse scale factor, and γ is a grammar scale factor.

In the system described in this paper, each of these stages is modelled separately. We use a standard HTK-based (HTK, 2004) Hidden Markov Model (HMM) recognizer for recognition, the Hidden Vector State (HVS) model for semantic parsing (He and Young, 2003b), and Tree-Augmented Naive Bayes networks (TAN) (Friedman et al., 1997) for dialogue act decoding.

The speech recognizer comprises 14-mixture Gaussian HMM state-clustered cross-word triphones augmented by heteroscedastic linear discriminant analysis (HLDA) (Kumar, 1997). Incremental speaker adaptation based on the maximum likelihood linear regression (MLLR) method (Gales and Woodland, 1996) was performed during the test, with updating being performed in batches of five utterances per speaker.

The Hidden Vector State (HVS) model (He and Young, 2003b) is a hierarchical semantic parser which associates each state of a push-down automaton with the state of an HMM. State transitions are factored into separate stack pop and push operations and then constrained to give a tractable search space. The result is a model which is complex enough to capture hierarchical structure but which can be trained automatically from unannotated data.

[Figure 2: An example HVS parse tree, in which semantic tags such as CITY and DATE sit on stacks beneath the root concept SS.]

Let each state at time t be denoted by a vector of D_t semantic concept labels (tags), c_t = [c_t[1], c_t[2], ..., c_t[D_t]], where c_t[1] is the preterminal concept and c_t[D_t] is the root concept (SS in Figure 2).
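As a concrete illustration of these stack operations, the following is a minimal sketch of a single HVS transition, in which n_t tags are popped off the stack and one new preterminal tag c_t[1] is pushed before the next word is generated. The class and tag names are hypothetical, not taken from the original system.

```python
# Illustrative sketch of one HVS transition: pop n_t tags off the stack,
# then push the new preterminal tag c_t[1].

class VectorState:
    def __init__(self, tags):
        # tags[0] is the preterminal c_t[1]; tags[-1] is the root SS.
        self.tags = list(tags)

    def transition(self, n_pop, new_preterminal):
        # Popping is constrained so that the root SS is never removed.
        assert 0 <= n_pop <= len(self.tags) - 1
        remaining = self.tags[n_pop:]          # stack after the pop operation
        return VectorState([new_preterminal] + remaining)

# Example: after parsing a CITY under TOLOC, shift to a DATE concept.
state = VectorState(["CITY", "TOLOC", "RETURN", "SS"])
state = state.transition(n_pop=2, new_preterminal="DATE")
print(state.tags)  # ['DATE', 'RETURN', 'SS']
```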
Given a word sequence W, a concept vector sequence C, and a sequence of stack pop operations N, the joint probability P(W, C, N) can be decomposed as

$$P(W, C, N) = \prod_{t=1}^{T} P(n_t \mid c_{t-1})\; P\!\left(c_t[1] \mid c_t[2 \cdots D_t]\right)\; P(w_t \mid c_t),$$

where c_t at word position t is a vector of D_t semantic concept labels (tags), n_t is the vector stack shift operation taking values in the range 0, ..., D_{t-1} where D_{t-1} is the stack size at word position t-1, and c_t[1] = c_{w_t} is the new preterminal semantic tag assigned to word w_t at word position t.

Thus, the HVS model consists of three types of probabilistic move:

1. popping semantic tags off the stack;
2. pushing a preterminal semantic tag onto the stack;
3. generating the next word.

The dialogue act decoder was implemented using the Tree-Augmented Naive Bayes (TAN) algorithm (Friedman et al., 1997), which is an extension of Naive Bayes networks. One TAN was used for each dialogue act or goal G_u; the semantic concepts C_i which serve as input to its corresponding TAN were selected based on the mutual information (MI) between the goal and the concept. Naive Bayes networks assume that all the concepts are conditionally independent given the value of the goal. TAN networks relax this independence assumption by adding dependencies between concepts based on the conditional mutual information (CMI) between concepts given the goal. The goal prior probability P(G_u) and the conditional probability of each semantic concept C_i given the goal G_u, P(C_i|G_u), are learned from the training data. Dialogue act detection is performed by picking the goal with the highest posterior probability P(G_u|C_1, ..., C_n) given the particular instance of concepts C_1, ..., C_n.
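To illustrate this decision rule, the sketch below implements a plain Naive Bayes simplification of the goal decoder over binary concept indicators; a full TAN would additionally condition each concept on one parent concept chosen by CMI. All names and probabilities are invented for the example.

```python
import math

# Minimal Naive-Bayes-style goal decoder over binary concept indicators.

def decode_goal(priors, cond, concepts):
    """priors: {goal: P(G)}; cond: {goal: {concept: P(C=1|G)}};
    concepts: set of concepts observed in the semantic parse."""
    best_goal, best_score = None, -math.inf
    for goal, prior in priors.items():
        score = math.log(prior)
        for concept, p in cond[goal].items():
            # Only the presence or absence of the concept matters,
            # not the value assigned to it.
            score += math.log(p if concept in concepts else 1.0 - p)
        if score > best_score:
            best_goal, best_score = goal, score
    return best_goal

priors = {"FLIGHT": 0.7, "FARE": 0.3}
cond = {"FLIGHT": {"TOLOC": 0.9, "DATE": 0.6},
        "FARE":   {"TOLOC": 0.4, "DATE": 0.3}}
print(decode_goal(priors, cond, {"TOLOC", "DATE"}))  # FLIGHT
```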
3 Noise Robustness

The ATIS corpus, which contains air travel information data (Dahl et al., 1994), has been chosen for the SLU system development and evaluation. ATIS was developed in the DARPA-sponsored spoken language understanding programme conducted from 1990 to 1995, and it provides a convenient and well-documented standard for measuring the end-to-end performance of an SLU system. However, since the ATIS corpus contains only clean speech, corrupted test data has been generated by adding samples of background noise to the clean test data at the waveform level.

3.1 Experimental Setup

The experimental setup used to evaluate the SLU system was similar to that described in (He and Young, 2003a). As mentioned in section 2, the SLU system consists of three main components: a standard HTK-based HMM recognizer, the HVS semantic parser, and the TAN dialogue act (DA) decoder. Each of the three major components is trained separately. The acoustic speech signal in the ATIS training data is modelled by extracting 39 features every 10ms: 12 cepstra, energy, and their first and second derivatives. This data is then used to train the speaker-independent, continuous speech recognizer. The HVS semantic parser is trained on the unannotated utterances using EM, constrained by the domain-specific lexical class information and the dominance relations built into the abstract annotations (He and Young, 2003b). In the case of ATIS, the lexical classes can be extracted automatically from the relational database, whilst the abstract semantic annotation for each utterance is automatically derived from its accompanying SQL query. The dialogue act decoder is trained using the main topics or goals and the key semantic concepts extracted automatically from the reference SQL queries.

Performance is measured at both the component and the system level. For the former, the recognizer is evaluated by word error rate (WER), the parser by concept slot retrieval rate using an F-measure metric (Goel and Byrne, 1999), and the dialogue act decoder by detection rate. The overall system performance is measured using the standard NIST "query answer" rate.

In the experiments reported here, car noise from the NOISEX-92 database (Varga et al., 1992) was added to the ATIS-3 NOV93 and DEC94 test sets. In order to obtain different SNRs, the noise was scaled accordingly before being added to the speech signal.
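The corruption procedure can be sketched as follows; this is a generic recipe consistent with the setup described above, not the exact tooling used with NOISEX-92, and the signals are stand-ins.

```python
import numpy as np

# Sketch of adding noise at a target SNR: the noise sample is tiled to
# the utterance length and scaled before being added to the waveform.

def add_noise(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio is snr_db."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power for the target SNR: P_s / P_n' = 10^(snr/10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a clean utterance
noise = rng.standard_normal(4000)     # stand-in for a car-noise sample
noisy = add_noise(speech, noise, snr_db=15)
```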
3.2 Experimental Results

Robust spoken language understanding components should be able to compensate for the weaknesses of the speech recognizer. That is, ideally they should be capable of recovering the correct meaning of an utterance even if it is recognized wrongly by the speech recognizer. At a minimum, the performance of the understanding components should degrade gracefully as recognition accuracy degrades.

Figure 3 gives the system performance on the corrupted test data with additive noise ranging from 25dB to 10dB SNR. The label "clean" on the X-axis denotes the original clean speech data without additive noise. Note that the recognition results on the corrupted test data were obtained directly using the original clean-speech HMM models, without retraining for the noisy conditions. The upper portion of Figure 3 shows the end-to-end performance in terms of query answer error rate for the NOV93 and DEC94 test sets. For easy reference, WER is also shown. The individual component performance, F-measure for the HVS semantic parser and dialogue act (DA) detection accuracy for the DA decoder, is illustrated in the lower portion of Figure 3. For each test set, the performance on the rescored word hypotheses is given as well. This incorporates the semantic parse scores into the acoustic and language modelling likelihoods to rescore the 25-best word lists from the speech recognizer.

It can be observed that the system gives fairly stable performance at high SNRs, after which recognition accuracy degrades rapidly with increasing noise. At 20dB SNR, the WER for the NOV93 test set increases by a factor of 1.6 relative to clean, whilst the query answer error rate increases by a factor of only 1.3. On decreasing the SNR to 15dB, the system performance degrades significantly: the WER increases by a factor of 3.1 relative to clean, but the query answer error rate increases by a factor of only 1.7. Similar figures were obtained for the DEC94 test set.

The above suggests that the end-to-end performance, measured in terms of answer error rate, degrades more slowly than the recognizer WER as the noise level increases. This demonstrates that the statistically-based understanding components of the SLU system, the semantic parser and the dialogue act decoder, are relatively robust to degrading recognition performance.

Regarding the individual component performance, the dialogue act detection accuracy appears to be less sensitive to decreasing SNR. This is probably a consequence of the fact that the Bayesian networks are set up to respond only to the presence or absence of semantic concepts or slots, regardless of the actual values assigned to them. In other words, the performance of the dialogue act decoder is not affected by the mis-recognition of individual words, but only by a failure to detect the presence of a semantic concept. It can also be observed from Figure 3 that the F-measure needs to be better than 85% in order to achieve acceptable end-to-end performance.
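For reference, the slot-level F-measure used above is the harmonic mean of precision and recall over retrieved concept slots. The following is a generic sketch of such a computation, not the scoring implementation of (Goel and Byrne, 1999); the slot names are invented.

```python
# Sketch of slot-level F-measure: compare hypothesised concept slots
# against the reference slots for one utterance.

def slot_f_measure(hyp_slots, ref_slots):
    """hyp_slots, ref_slots: lists of (concept, value) pairs."""
    matched = len(set(hyp_slots) & set(ref_slots))
    precision = matched / len(hyp_slots) if hyp_slots else 0.0
    recall = matched / len(ref_slots) if ref_slots else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

hyp = [("TOLOC.CITY", "boston"), ("DATE.DAY", "monday")]
ref = [("TOLOC.CITY", "boston"), ("DATE.DAY", "tuesday")]
print(slot_f_measure(hyp, ref))  # 0.5
```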
4 Adaptation to New Applications

Statistical model adaptation techniques are widely used to reduce the mismatch between training and test conditions or to adapt a well-trained model to a novel domain. Commonly used techniques can be classified into two categories: Bayesian adaptation, which uses a maximum a posteriori (MAP) probability criterion (Gauvain and Lee, 1994), and transformation-based approaches such as maximum likelihood linear regression (MLLR) (Gales and Woodland, 1996), which use a maximum likelihood (ML) criterion. In recent years, MAP adaptation has been successfully applied to n-gram language models (Bacchiani and Roark, 2003) and lexicalized PCFG models (Roark and Bacchiani, 2003). Luo et al. have proposed transformation-based approaches based on the Markov transform (Luo et al., 1999) and the Householder transform (Luo, 2000) to adapt statistical parsers. However, the optimisation processes for the latter are complex and it is not clear how general they are.

Since MAP adaptation is straightforward and has been applied successfully to PCFG parsers, it has been selected for investigation in this paper. Since one special form of MAP adaptation is interpolation between the in-domain and out-of-domain models, it is natural to also consider the use of non-linear interpolation, and hence this has been studied as well.[1]

[1] Experiments using linear interpolation have also been conducted, but it was found that the results are worse than those obtained using MAP adaptation or log-linear interpolation.

4.1 MAP Adaptation

Bayesian adaptation reestimates model parameters directly using adaptation data. It can be implemented via maximum a posteriori (MAP) estimation. Assuming that the model parameters are denoted by θ, then given observation samples Y, the MAP estimate is obtained as

$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid Y) = \arg\max_{\theta} P(Y \mid \theta)\, P(\theta),$$

where P(Y|θ) is the likelihood of the adaptation data Y, and the model parameters θ are random vectors described by their probability mass function P(θ), also called the prior distribution.

In the case of HVS model adaptation, the objective is to estimate the probabilities of discrete distributions over vector state stack shift operations and output word generation. Assuming that these can be modelled under the multinomial distribution, the conjugate prior, the Dirichlet density, is normally used for mathematical tractability. Assume a parser model P(W, C) exists for a word sequence W and semantic concept sequence C, with J component distributions P_j, each of dimension K. Then, given some adaptation data W_l, the MAP estimate of the kth component of P_j is

$$\hat{P}_j(k) = \frac{\tau\, P_j(k) + \nu_j(k)}{\tau + \nu_j} = \frac{\tau}{\tau + \nu_j}\, P_j(k) + \frac{\nu_j}{\tau + \nu_j}\, \tilde{P}_j(k),$$

where ν_j = Σ_{k=1}^{K} ν_j(k), in which ν_j(k) is defined as the total count of the events associated with the kth component of P_j summed across the decoding of all adaptation utterances W_l, τ is the prior weighting parameter, P_j(k) is the probability under the original unadapted model, and ~P_j(k) is the empirical distribution of the adaptation data, which is defined as

$$\tilde{P}_j(k) = \frac{\nu_j(k)}{\nu_j}.$$

As discussed in section 2, the HVS model consists of three types of probabilistic move. The MAP adaptation technique can be applied to the HVS model by adapting each of these three component distributions individually.
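A small numerical sketch of this update follows; the helper name and all values are invented for illustration.

```python
import numpy as np

# Sketch of the MAP update for one K-dimensional component distribution
# P_j under a Dirichlet prior, as in the equation above: the adapted
# probability interpolates the unadapted model P_j(k) and the empirical
# distribution of the adaptation data, weighted by the counts.

def map_adapt(p_unadapted, counts, tau):
    """p_unadapted: unadapted probabilities P_j(k); counts: event counts
    nu_j(k) from decoding the adaptation utterances; tau: prior weight."""
    p_unadapted = np.asarray(p_unadapted, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (tau * p_unadapted + counts) / (tau + counts.sum())

p_old = [0.5, 0.3, 0.2]
nu = [8, 1, 1]                        # counts observed in adaptation data
print(map_adapt(p_old, nu, tau=10.0))  # [0.65, 0.2, 0.15]
```

With τ large the adapted model stays close to the unadapted prior; with τ small it follows the adaptation counts.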
4.2 Log-Linear Interpolation

Log-linear interpolation has been applied to language model adaptation and has been shown to be equivalent to a constrained minimum Kullback-Leibler distance optimisation problem (Klakow, 1998).

Following the notation introduced in section 4.1, where P_j(k) is the probability under the original unadapted model and ~P_j(k) is the empirical distribution of the adaptation data, denote the final adapted model probability by ^P_j(k). It is assumed that the Kullback-Leibler distance of the adapted model to the unadapted and empirically determined models is

$$D = \lambda_1 \sum_{k} \hat{P}_j(k) \log \frac{\hat{P}_j(k)}{P_j(k)} + \lambda_2 \sum_{k} \hat{P}_j(k) \log \frac{\hat{P}_j(k)}{\tilde{P}_j(k)}.$$

Given an additional model probability ¯P_j(k) whose distance to ^P_j(k) should also be kept small, and introducing Lagrange multipliers λ'_1 and λ'_2 to ensure that the normalization constraint Σ_k ^P_j(k) = 1 holds, minimizing D with respect to ^P_j(k) yields the required distribution. With some manipulation and redefinition of the Lagrange multipliers, it can be shown that

$$\hat{P}_j(k) = \frac{1}{Z_\lambda}\, P_j(k)^{\lambda_1}\, \tilde{P}_j(k)^{\lambda_2},$$

where ¯P_j(k) has been assumed to be a uniform distribution which is then absorbed into the normalization term Z_λ. The computation of Z_λ is very expensive, and it can usually be dropped without significant loss in performance (Martin et al., 2000). For the remaining parameters, λ_1 and λ_2, the generalized iterative scaling algorithm or the simplex method can be employed to estimate their optimal settings.
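A small sketch of this adaptation follows, with λ_1 and λ_2 fitted by the simplex (Nelder-Mead) method; the distributions and held-out counts are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of log-linear interpolation between the unadapted model P_j and
# the empirical adaptation distribution ~P_j, with the exponents fitted
# to maximise the likelihood of held-out adaptation events.

def log_linear(p_old, p_emp, lam1, lam2):
    p = (p_old ** lam1) * (p_emp ** lam2)
    return p / p.sum()   # Z_lambda; often droppable in practice

p_old = np.array([0.5, 0.3, 0.2])
p_emp = np.array([0.7, 0.2, 0.1])
heldout_counts = np.array([6.0, 3.0, 1.0])

def neg_log_lik(lams):
    p = log_linear(p_old, p_emp, lams[0], lams[1])
    return -np.sum(heldout_counts * np.log(p + 1e-12))

res = minimize(neg_log_lik, x0=[0.5, 0.5], method="Nelder-Mead")
print(res.x, log_linear(p_old, p_emp, *res.x))
```

Generalized iterative scaling would serve the same purpose; Nelder-Mead is shown here simply because it requires no gradients.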
4.3 Experiments

To test the portability of the statistical parser, the initial experiments reported here are focussed on assessing the adaptability of the HVS model when it is tested in a domain which covers broadly similar concepts but comprises rather different speaking styles. To this end, the flight information subset of the DARPA Communicator Travel task has been used as the target domain (CUData, 2004). By limiting the test in this way, we ensure that the dimensionalities of the HVS model parameters remain the same and that no new semantic concepts are introduced by the adaptation training data.

The baseline HVS parser was trained on the ATIS corpus using 4978 utterances selected from the context-independent (Class A) training data in the ATIS-2 and ATIS-3 corpora. The vocabulary size of the ATIS training corpus is 611, and altogether 110 semantic concepts are defined. The parser model was then adapted using utterances relating to flight reservation from the DARPA Communicator data. Although the latter bears similarities to the ATIS data, it contains utterances of a different style and is often more complex. For example, Communicator contains utterances on multiple flight legs, information which is not available in ATIS.

To compare the adapted ATIS parser with an in-domain Communicator parser, an HVS model was trained from scratch using 10682 Communicator training utterances. The vocabulary size of the in-domain Communicator training data is 505, and a total of 99 semantic concepts have been defined. For all tests, a set of 1017 Communicator test utterances was used.

Table 1 lists the recall, precision, and F-measure results obtained on the 1017-utterance DARPA Communicator test set. The baseline is the unadapted HVS parser trained on the ATIS corpus only. The in-domain results are obtained using the HVS parser trained solely on the 10682 Communicator training utterances. The other rows of the table give the parser performance after MAP and log-linear interpolation based adaptation of the baseline model using 50 randomly selected adaptation utterances.

[Table 1: Recall, precision, and F-measure of the baseline, in-domain, and adapted parsers on the Communicator test set, with adaptation performed using MAP or log-linear interpolation.]

Since we do not yet have a reference database for the DARPA Communicator task, it is not possible to conduct an end-to-end performance evaluation as in section 3. However, the experimental results in section 3.2 indicate that the F-measure needs to exceed 85% to give acceptable end-to-end performance (see Figure 3). Therefore, it can be inferred from Table 1 that the unadapted ATIS parser model would perform very badly in the new Communicator application, whereas the adapted models would give performance close to that of a fully trained in-domain model.

Figure 4 shows the parser performance versus the number of adaptation utterances used. It can be observed that when there are only a few adaptation utterances, MAP adaptation performs significantly better than log-linear interpolation. However, above 25 adaptation utterances the converse is true. The parser performance saturates for both techniques when the number of adaptation utterances reaches 50, and the best overall performance is given by the parser adapted using log-linear interpolation. The performance of both models degrades, however, when the number of adaptation utterances exceeds 100, possibly due to model overtraining. For this particular application, we conclude that just 50 adaptation utterances are sufficient to adapt the baseline model to give results comparable to the in-domain Communicator model.