<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1058">
<Title>On Combining Language Models: Oracle Approach</Title>
<Section position="7" start_page="0" end_page="0" type="evalu">
<SectionTitle>5. EXPERIMENTAL RESULTS</SectionTitle>
<Paragraph position="0"> The models were developed and tested in the context of the CU Communicator dialog system, which is used for telephone-based flight, hotel and rental car reservations [11]. The text corpus was divided into a training set of 15220 sentences and a test set of 1220 sentences. The test set was further divided into two parts; each part, in turn, was used to optimize the language model and interpolation weights applied to the other part in a &quot;jackknife paradigm&quot;, and the results were reported as the average of the two. The average sentence length of the corpus was 4 words (the end-of-sentence marker was treated as a word). We identified 20 dialog contexts and labeled each sentence with the associated dialog context. We trained a dialog-independent (DI) class-based LM and a dialog-dependent (DD) grammar-based LM. In all LMs the n-gram order is set to 3.</Paragraph>
<Paragraph position="1"> It must be noted that the DI class-based LM served as the LM of the baseline system, with 921 unigrams including 19 classes. The total number of distinct words in the lexicon was 1681. The grammar-based LM had 199 concept and filler classes that completely cover the lexicon. In rescoring experiments we set the N-best list size to 10; we consider N = 10 a reasonable tradeoff between performance and complexity.</Paragraph>
<Paragraph position="2"> The perplexity results are presented in Table 1. The perplexity of the grammar-based LM is 36.8% better than that of the baseline class-based LM.</Paragraph>
<Paragraph position="3"> We did experiments using 10-best lists from the baseline recognizer. We first determined the best possible performance in WER offered by the 10-best lists. This was done by picking the hypothesis with the lowest WER from each list, which gives an upper bound on the performance gain possible from rescoring the 10-best lists. The rescoring results in terms of absolute and relative improvements in WER and semantic error rate (SER), along with the best possible improvement, are reported in Table 2 for the different LMs (the baseline WER is 25.9%). It should be noted that the optimizations are made using WER; the slight drop in SER with interpolation might be due to that. This is acceptable for text transcription but not for a dialog system. We believe that the results would reverse if we replaced the optimization using WER with one using SER. The performance gap between the oracle and interpolation methods motivates the system in Figure 4. We expect that, based on the universal approximation theorem, a neural network with consistent features, sufficiently large training data and proper training would approximate the behavior of the oracle fairly well. On the other hand, the performance gap between the oracle and the best possible performance from 10-best lists suggests the use of more than two language models and dynamic combination with the acoustic model.</Paragraph>
</Section>
</Paper>
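
Note: the two-way jackknife used above to tune the interpolation weight can be sketched as follows in Python. The linear-interpolation form and the grid search over a single weight are assumptions made for illustration (a simple stand-in for whatever optimizer the authors actually used), and the data layout is hypothetical.

    import math

    # Illustrative sketch: linear interpolation of two LMs with one weight,
    # tuned by grid search in a two-way jackknife (assumed procedure).

    def perplexity(pairs, lam):
        """pairs: (P_class(w|h), P_grammar(w|h)) for every word in the data."""
        logprob = sum(math.log(lam * p1 + (1.0 - lam) * p2) for p1, p2 in pairs)
        return math.exp(-logprob / len(pairs))

    def best_weight(pairs):
        """Grid-search the interpolation weight that minimizes perplexity."""
        grid = [i / 100.0 for i in range(1, 100)]   # avoid 0 and 1
        return min(grid, key=lambda lam: perplexity(pairs, lam))

    def jackknife_perplexity(half_a, half_b):
        """Tune the weight on one half, evaluate on the other, then average."""
        ppl_b = perplexity(half_b, best_weight(half_a))
        ppl_a = perplexity(half_a, best_weight(half_b))
        return 0.5 * (ppl_a + ppl_b)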
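
Note: the oracle selection described in Paragraph 3 (pick, for each utterance, the N-best hypothesis with the lowest WER against the reference, giving an upper bound for any rescoring of those lists) can be sketched in Python as below. The word-level edit-distance routine, data layout, and toy data are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch: oracle word error rate over N-best lists.

    def edit_distance(hyp, ref):
        """Word-level Levenshtein distance between two token lists."""
        d = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, r in enumerate(ref, 1):
                cur = d[j]
                d[j] = min(d[j] + 1,            # delete hypothesis word
                           d[j - 1] + 1,        # insert reference word
                           prev + (h != r))     # substitute (or match)
                prev = cur
        return d[-1]

    def oracle_wer(nbest_lists, references):
        """nbest_lists[k] holds the N hypothesis strings for utterance k."""
        errors = words = 0
        for hyps, ref in zip(nbest_lists, references):
            ref_toks = ref.split()
            # Oracle choice: the hypothesis with the fewest word errors.
            errors += min(edit_distance(h.split(), ref_toks) for h in hyps)
            words += len(ref_toks)
        return errors / words

    # Toy usage with N = 3 hypotheses per utterance (made-up data).
    nbest = [["show flights to denver", "show flight to denver", "show me flights"],
             ["i want a rental car", "i want rental car", "i want a rental cars"]]
    refs = ["show flights to denver", "i want a rental car"]
    print(oracle_wer(nbest, refs))   # 0.0 here, since the reference is in each list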