<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1008"> <Title>Generating statistical language models from interpretation grammars in dialogue systems</Title> <Section position="6" start_page="59" end_page="62" type="evalu"> <SectionTitle> 4 Evaluation and Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="59" end_page="60" type="sub_section"> <SectionTitle> 4.1 Perplexity measures </SectionTitle> <Paragraph position="0"> The 8 SLMs (all using the vocabulary of 1153 words) were evaluated by measuring perplexity with the SRI tools on the evaluation test set of 1700 utterances.</Paragraph> <Paragraph position="1"> In Table 1 we can see a dramatic perplexity reduction with the mixed models compared to the simplest of our models, the MP3GFLM. Surprisingly, the GSLCLM models the test set better than</Paragraph> <Paragraph position="3"> mar is too restricted and differs considerably from the students' grammars.</Paragraph> <Paragraph position="4"> Lower perplexity does not necessarily mean lower word error rates, and the relation between these two measures is not very clear. One of the reasons that language model perplexity does not measure recognition task complexity is that language models do not take acoustic confusability into account (Huang et al., 2001; Jelinek, 1997). According to Rosenfeld (Rosenfeld, 2000a), a perplexity reduction of 5% is usually not practically significant, 10-20% is noteworthy, and 30% or more is quite significant. The above results for the mixed models could therefore mean an improvement in word error rate over the baseline model MP3GFLM. This has been tested and is reported in the next section. In addition, we want to test whether we can reduce word error rate using our simple SLM as opposed to the Nuance grammar (MP3NuanceGr), which is our recognition baseline. 
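As a rough illustration of the measure used above: perplexity can be recovered from per-word log probabilities, which is the quantity tools such as SRILM's `ngram -ppl` report. This is our own minimal sketch, not part of the paper's toolchain:

```python
import math

def perplexity(logprobs):
    """Perplexity from a list of per-word log10 probabilities:
    10 raised to the negative average log probability."""
    avg = sum(logprobs) / len(logprobs)
    return 10 ** (-avg)

# Toy example: a model assigning probability 0.1 to each of 4 words
# behaves like a uniform 10-way choice per word.
lp = [math.log10(0.1)] * 4
print(round(perplexity(lp), 2))  # → 10.0
```

A 30% drop in this number between two models on the same test set would, by Rosenfeld's rule of thumb cited above, count as quite significant.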
</Paragraph> </Section> <Section position="2" start_page="60" end_page="60" type="sub_section"> <SectionTitle> 4.2 Recognition rates </SectionTitle> <Paragraph position="0"> The 8 SLMs under consideration were converted with the SRILM toolkit into a format that Nuance accepts and then compiled into recognition packages. These were evaluated with Nuance's batch recognition program on the recorded evaluation test set of 500 utterances (26 speakers). Table 2 presents word error rates (WER) and, in parentheses, N-Best (N=10) WER for the models under consideration and for the Nuance Grammar.</Paragraph> <Paragraph position="1"> As seen, our simple SLM, MP3GFLM, improves recognition performance considerably compared with the Nuance grammar baseline (MP3NuanceGr), showing much more robust behaviour on the data. Remember that these two models have the same vocabulary and are both derived from the same GF interpretation grammar.</Paragraph> <Paragraph position="2"> However, the flexibility of the SLM gives a relative improvement of 37% over the Nuance grammar.</Paragraph> <Paragraph position="3"> The models giving the best results are those interpolated with the GSLC corpus and the domain news corpus in different ways, which at best give a relative reduction in WER of 8% in comparison with MP3GFLM and 43% compared with the baseline. It is interesting to see that the simple way we used to create a domain-specific newspaper corpus gives a model that fits our data better than the original, much larger newspaper corpus.</Paragraph> </Section> <Section position="3" start_page="60" end_page="61" type="sub_section"> <SectionTitle> 4.3 In-grammar recognition rates </SectionTitle> <Paragraph position="0"> To contrast the word error rate performance with in-grammar utterances, i.e. utterances that the original GF interpretation grammar covers, we carried out a second evaluation with the in-grammar recordings. 
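Word error rate, the metric reported here, is the word-level edit distance between the reference transcription and the recognition hypothesis, divided by the reference length. A minimal sketch (our illustration, not the Nuance batch recognition tooling; the example utterances are invented):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, via Levenshtein
    distance over word tokens."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("play the next song", "play next song"))  # → 0.25 (one deletion over four words)
```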
We also used Nuance's parsing tool to extract the utterances that were in-grammar from the recorded evaluation test set. These few recordings (5%) were added to the in-grammar test set.</Paragraph> <Paragraph position="1"> The results of the second recognition experiment are reported in Table 3.</Paragraph> <Paragraph position="2"> The in-grammar results reveal an increase in WER for all the SLMs in comparison to the baseline MP3NuanceGr. However, the simplest model (MP3GFLM), modelling the language of the grammar, does not show any great reduction in recognition performance.</Paragraph> </Section> <Section position="4" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 4.4 Discussion of results </SectionTitle> <Paragraph position="0"> The word error rates obtained for the best models show a relative improvement over the Nuance grammar of 40%. The most interesting result is that the simplest of our models, modelling the same language as the Nuance grammar, gives such an important gain in performance that it lowers the WER by 22%. We used the Chi-square test of significance to statistically compare the results with those of the Nuance grammar, showing that the differences in WER between the models and the baseline are all significant at the p=0.05 level. However, the Chi-square test shows that the difference in WER for in-grammar utterances between the Nuance model and the MP3GFLM is significant at the p=0.05 level. This means that all the statistical language models significantly outperform the baseline, i.e. 
the Nuance Grammar MP3NuanceGr, on the evaluation test set (being mostly out-of-coverage), and that the MP3GFLM outperforms the baseline overall, as the difference in WER on the in-grammar test is significant but very small.</Paragraph> <Paragraph position="1"> However, as the reader may have noticed, the word error rates are quite high, which is partly due to a totally independent test set with out-of-vocabulary words (9% OOV for the MP3GFLM), indicating that domain language grammar writing is very subjective. The students have captured a quite different language for the same domain and functionality. This shows the risk of a hand-tailored domain grammar and the difficulty of predicting what users may say. In addition, a fair test of the model would be to measure concept error rate or, more specifically, dialogue move error rate (i.e. both 'yes' and 'yeah' correspond to the same dialogue move answer(yes)). A closer look at the MP3GFLM results gives a hint that in many cases the transcription reference and the recognition hypothesis hold the same semantic content in the domain (e.g. confusing the Swedish prepositions 'i' (into) and 'till' (to), which are both used when referring to the playlist). It was manually estimated that 53% of the recognition hypotheses could be considered correct in this way, as opposed to the 65% Sentence Error Rate (SER) that the automatic evaluation gave. This implies that the evaluation carried out is not strictly fair considering the possible task improvement. However, a fair automatic evaluation of dialogue move error rate will be possible only when we have a way to do semantic decoding that is not entirely dependent on the GF grammar rules.</Paragraph> <Paragraph position="2"> The N-Best results indicate that it could be worth putting effort into re-ranking the N-Best lists, as both WER and SER of the N-Best candidates are considerably lower. 
This could ideally give us a reduction in SER of 10% and, considering dialogue move error rate, perhaps even more. More or less advanced post-processing methods have been used to analyze and decide on the best choice from the N-Best list. Several different re-ranking methods have been proposed that show how recognition rates can be improved by letting external processes do the top-N ranking rather than the recognizer (Chotimongkol & Rudnicky, 2001; van Noord et al., 1997). However, the approach that seems most appealing is how (Gabsdil & Lemon, 2004) and (Hacioglu & Ward, 2001) re-rank N-Best lists based on dialogue context, achieving a considerable improvement in recognition performance. We are considering basing our re-ranking on the information held in the dialogue information state, on knowledge of what is going on in the graphical interface, and on dialogue moves in the list that seem appropriate to the context. In this way we can take advantage of what the dialogue system knows about the current situation.</Paragraph> <Paragraph position="3"> 5 Concluding remarks and future work A first observation is that the SLMs give us much more robust recognition, as expected. Our best SLMs, i.e. the mixed models, give a 43% relative improvement over the baseline, i.e. the Nuance grammar compiled from the GF interpretation grammar. However, this also implies a falling off in in-grammar performance. It is interesting that the SLM that only models the grammar (MP3GFLM), although being more robust and giving a significant reduction in WER, does not degrade its in-grammar performance to a great extent. This simple model seems promising for use in a first version of the system, with the possibility of improving it when logs from system interactions have been collected. In addition, the vocabulary of this model is in sync with our GF interpretation grammar. 
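The mixed models mentioned above combine the grammar-derived model with corpus-based models by interpolation. As a minimal sketch of linear interpolation of two word probabilities (illustrative only: the actual mixed models are n-gram models built with the SRILM toolkit, and the weight used here is invented):

```python
def interpolate(p_grammar, p_corpus, lam=0.6):
    """Linear interpolation of two language-model probabilities
    with mixture weight lam (a hypothetical value for illustration)."""
    return lam * p_grammar + (1 - lam) * p_corpus

# A word unseen by the grammar-derived model still receives
# probability mass from the corpus model, which is what makes
# the mixed models more robust to out-of-grammar input.
print(round(interpolate(0.0, 0.02), 3))  # → 0.008
```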
The results seem comparable with those obtained by (Bangalore & Johnston, 2004) using random generation to produce an SLM from an interpretation grammar.</Paragraph> <Paragraph position="4"> Although interpolating our MP3 model with the GSLC corpus and the newspaper corpora gave a large perplexity reduction, it did not have as much impact on WER as expected, even though it gave a significant improvement. It seems from the tests that the quality of the data is more important than the quantity. This makes extraction of domain data from larger corpora an important issue and increases the interest of generating artificial corpora. As the approach of using SLMs in our dialogue systems seems promising and could improve recognition performance considerably, we are planning to apply the experiment to other applications that are under development in TALK when the corresponding GF application grammars are finished. In this way we hope to find out whether there is a tendency in the performance gain of a statistical language model vs. its corresponding speech recognition grammar. If so, we have found a good way of compromising between the ease of grammar writing and the robustness of SLMs in the first stage of dialogue system development. In this way we can use the knowledge and intuition we have about the domain, include it in our first SLM, and get more robust behaviour than with a grammar. From this starting point we can then collect more data with our first prototype of the system to improve our SLM.</Paragraph> <Paragraph position="5"> We have also started to look at dialogue-move-specific statistical language models (DM-SLMs) by using GF to generate all utterances that are specific to certain dialogue moves from our interpretation grammar. 
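The DM-SLM idea can be sketched as grouping grammar-generated utterances by their dialogue move and training a restricted model on each group. The move labels and utterances below are invented for illustration; the paper's generation is done by GF, not by this code:

```python
# Hypothetical (move, utterance) pairs as a grammar generator
# might tag its output; purely illustrative data.
generated = [
    ("answer(yes)", "yes"),
    ("answer(yes)", "yeah"),
    ("request(play)", "play the next song"),
    ("request(play)", "play this song"),
]

def corpus_for_move(pairs, move):
    """Collect the utterances belonging to one dialogue move;
    a restricted, context-specific SLM would then be trained
    on this subset and interpolated with the general model."""
    return [utt for m, utt in pairs if m == move]

print(corpus_for_move(generated, "answer(yes)"))  # → ['yes', 'yeah']
```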
In this way we can produce models that are sensitive to the context but that also, by interpolating these more restricted models with the general GF SLM, do not restrict what the users can say, while taking into account that certain utterances should be more probable in a specific dialogue context. Context-sensitive models, and specifically grammars for different contexts, have been explored earlier (Baggia et al., 1997; Wright et al., 1999; Lemon, 2004), but generating such language models artificially from an interpretation grammar by choosing which moves to combine seems to be a new direction. Our first experiments seem promising, but the dialogue-move-specific test sets are too small to draw any conclusions. We hope to report more on this in the near future.</Paragraph> </Section> </Section> </Paper>