<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1058"> <Title>On Combining Language Models : Oracle Approach</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. LANGUAGE MODELS </SectionTitle> <Paragraph position="0"> In language modeling, the goal is to find the probability distribution of word sequences, i.e. P(W), where W = w_1 w_2 ... w_N.</Paragraph> <Paragraph position="2"> We first describe a model for sentence generation in a dialog [5] on which our grammar LM is based. The model is illustrated in Figure 1. Here, the user has a specific goal that does not change throughout the dialog. According to the goal and the dialog context, the user first picks a set of concepts with respective values and then uses phrase generators associated with the concepts to generate the word sequence. The word sequence is next mapped into a sequence of phones and converted into a speech signal by the user's vocal apparatus, which we finally observe as a sequence of acoustic feature vectors.</Paragraph> <Paragraph position="3"> Assuming that (i) the dialog context S is given, (ii) W is conditionally independent of S given the concept sequence C, i.e. P(W | C, S) = P(W | C), and (iii) the (W, C) pair is unique (possible with either a Viterbi approximation or an unambiguous association between C and W), one can easily show that P(W) is given by P(W) = P(W | C) P(C | S), the product of two components:</Paragraph> <Paragraph position="5"> the concept model P(C | S) and the syntactic model P(W | C). [Figure 2 (examples): "I WANT TO FLY FROM MIAMI FLORIDA TO SYDNEY AUSTRALIA ON OCTOBER FIFTH" tagged as "[i want] [depart loc] [arrive loc] [date]"; "I DON'T TO FLY FROM MIAMI FLORIDA TO SYDNEY AFTER AREA ON OCTOBER FIFTH" tagged as "[Pronoun] [Contraction] [depart loc] [arrive loc] [after] [Noun] [date]".] The concept model is conditioned on the dialog context. Although there are several ways to define a dialog context, we select the last question prompted by the system as the dialog context. It is simple and yet strongly predictive and constraining. The concepts are classes of phrases with the same meaning. Put differently, a concept class is the set of all phrases that may be used to express that concept (e.g. [i want], [arrive loc]). These concept classes are augmented with single-word, multiple-word and a small number of broad (and unambiguous) part-of-speech (POS) classes. In cases where the parser fails, we break the phrase into a sequence of words and tag them using this set of "filler" classes. The two examples in Figure 2 illustrate the scheme.</Paragraph> <Paragraph position="6"> The structure of the concept sequences is captured by an n-gram LM. We train a separate language model for each dialog context. Given the context S and a concept sequence C = c_1 ... c_M, the concept model is P(C | S) = ∏_{i=1}^{M+1} P(c_i | c_{i-n+1}, ..., c_{i-1}, S), where c_0 and c_{M+1} are the sentence-begin and sentence-end symbols, respectively.</Paragraph> <Paragraph position="7"> Each concept class is written as a CFG and compiled into a stochastic recursive transition network (SRTN). The production rules define complete paths from the start-node to the end-node of these nets. The probability of a complete path traversed through one or more SRTNs, initiated by the top-level SRTN associated with the concept, is the probability of the phrase given that concept. This probability is calculated as the product of all arc probabilities that define the path. Concept and rule sequences are assumed to be unique in the above equations; the parser uses heuristics to comply with this assumption.</Paragraph>
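To make the two-level decomposition concrete, here is a minimal Python sketch of how P(W) = P(W | C) P(C | S) could be computed from a context-dependent concept bigram and SRTN-style arc probabilities. All names, probabilities and the dialog-context key ("where_to") are invented for illustration and are not taken from the paper; the actual models are estimated from a corpus by counting and smoothing.

```python
from math import prod

# Toy concept bigram table, one per dialog context (here a single hard-coded
# context).  Keys: (previous concept, concept) -> probability.
CONCEPT_BIGRAMS = {
    "where_to": {
        ("<s>", "[i want]"): 0.6,
        ("[i want]", "[depart loc]"): 0.7,
        ("[depart loc]", "[arrive loc]"): 0.8,
        ("[arrive loc]", "[date]"): 0.5,
        ("[date]", "</s>"): 0.9,
    }
}

# Toy "SRTN" paths: concept -> phrase -> arc probabilities along the accepting
# path.  The phrase probability is the product of these arc probabilities.
SRTN_ARCS = {
    "[i want]": {"i want to fly": [0.5, 1.0, 0.8]},
    "[depart loc]": {"from miami florida": [0.4, 0.6]},
    "[arrive loc]": {"to sydney australia": [0.3, 0.7]},
    "[date]": {"on october fifth": [0.2, 0.5]},
}

def concept_prob(concepts, context):
    """P(C | S): bigram probability of the concept sequence given the context."""
    bigrams = CONCEPT_BIGRAMS[context]
    padded = ["<s>"] + concepts + ["</s>"]
    return prod(bigrams.get(pair, 1e-6) for pair in zip(padded, padded[1:]))

def phrase_prob(phrase, concept):
    """P(phrase | concept): product of the arc probabilities along the SRTN path."""
    return prod(SRTN_ARCS[concept].get(phrase, [1e-6]))

def sentence_prob(phrases, concepts, context):
    """P(W) = P(W | C) * P(C | S) under the unique (W, C) assumption."""
    p_w_given_c = prod(phrase_prob(ph, c) for ph, c in zip(phrases, concepts))
    return p_w_given_c * concept_prob(concepts, context)

# Example usage with the toy tables above:
p = sentence_prob(
    ["i want to fly", "from miami florida", "to sydney australia", "on october fifth"],
    ["[i want]", "[depart loc]", "[arrive loc]", "[date]"],
    "where_to",
)
```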
<Paragraph position="8"> SCFG and n-gram probabilities are learned from a text corpus by simple counting and smoothing. Our semantic grammars have a low degree of ambiguity and therefore do not require computationally intensive stochastic training and parsing techniques. The class-based LM can be considered a very special case of our grammar-based model. Concepts (or classes) are restricted to those that represent a list of semantically similar words, like [city name], [day of week], [month day] and so forth. So, instead of rule probabilities we have word probabilities given the class. The component LMs are combined log-linearly, P(W) = Z(λ)^{-1} ∏_i P_i(W)^{λ_i}, where Z(λ) is the normalization factor and is a function of the interpolation weights λ_i. The linearity in the logarithmic domain is obvious if we take the logarithm of both sides. In the sequel, we omit the normalization term, as its computation is very expensive. We hope that its impact on the performance is not significant. Yet, it prevents us from reporting perplexity results.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. THE ORACLE APPROACH </SectionTitle> <Paragraph position="0"> The set-up for the oracle experiments is illustrated in Figure 3. The purpose of this set-up is twofold. First, we use it to evaluate the oracle performance. Second, we use it to prepare data for the training of a stochastic decision model. For the sake of simplicity, we show the set-up for two LMs and do experiments accordingly. Nonetheless, the set-up can be extended to an arbitrary number of LMs.</Paragraph> <Paragraph position="1"> The language models are used for N-best list rescoring. The N-best list is generated by a speech recognizer using a relatively simpler LM (here, a class-based trigram LM). The framework for N-best list rescoring is the MAP decision Ŵ = argmax_W P(A | W) P(W), taken over the hypotheses W in the N-best list, where A is the acoustic observation and Ŵ is the best hypothesis after rescoring. The oracle compares each hypothesis to the reference and picks the one with the best word (or semantic) accuracy.</Paragraph> <Paragraph position="2"> For training purposes, we create the input feature vector by augmenting features from each rescoring module and the dialog context S. The output vector is the LM indicator I from the oracle. The element that corresponds to the LM with the best final hypothesis is unity and the rest are zeros. After training the oracle combiner (here, we assume a neural network), we set up our system as shown in Figure 4. The input to the neural network (NN) is the augmented feature vector. The output of the NN is the LM indicator, possibly with fuzzy values. So, we first pick the maximum output, and then we select and output the respective word string.</Paragraph> </Section> </Paper>
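As a rough illustration of the log-linear combination used for rescoring (with the normalization term Z(λ) omitted, as in the text), the following Python sketch weights the log-probabilities of two component LMs and picks the best hypothesis. The hypotheses, log-probabilities and interpolation weights are made-up numbers, not results from the paper.

```python
def log_linear_score(lm_log_probs, weights):
    """Unnormalized log-linear LM combination: sum_i lambda_i * log P_i(W).
    The normalization term Z(lambda) is omitted, so the scores can rank
    hypotheses but cannot be turned into perplexities."""
    return sum(lam * lp for lam, lp in zip(weights, lm_log_probs))

# Rescore two hypotheses with two component LMs (e.g. grammar-based and
# class-based); the numbers below are purely illustrative.
hypotheses = {
    "i want to fly from miami florida": [-12.3, -11.8],   # [log P_1, log P_2]
    "i don't to fly from miami florida": [-15.1, -13.0],
}
weights = [0.6, 0.4]
best = max(hypotheses, key=lambda h: log_linear_score(hypotheses[h], weights))
```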
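The oracle targets and the run-time decision rule described in Section 4 can be sketched as follows. This is an assumed illustration rather than the authors' code; the accuracies, NN outputs and hypothesis strings are invented.

```python
import numpy as np

def oracle_indicator(accuracies):
    """One-hot LM indicator produced by the oracle: 1 for the LM whose rescored
    hypothesis has the best word (or semantic) accuracy, 0 for the others."""
    target = np.zeros(len(accuracies))
    target[int(np.argmax(accuracies))] = 1.0
    return target

def combine(nn_outputs, lm_hypotheses):
    """Run-time decision: the trained NN emits fuzzy indicator values; pick the
    maximum and output the word string proposed by the corresponding LM."""
    return lm_hypotheses[int(np.argmax(nn_outputs))]

# Hypothetical example with two LMs:
target = oracle_indicator([0.92, 0.88])        # training target: array([1., 0.])
best_string = combine([0.7, 0.3],
                      ["hypothesis from LM 1", "hypothesis from LM 2"])
```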