<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1016"> <Title>The MIT ATIS System: February 1992 Progress Report</Title> <Section position="3" start_page="0" end_page="86" type="metho"> <SectionTitle> SPEECH RECOGNITION </SectionTitle> <Paragraph position="0"> In this section we describe the changes we have made over the past year to the speech recognition component (SUMMIT) of our ATIS system. These include improvements to both the phonetic and language models, and refinements to the lexicon. We have also implemented the acoustic models on a set of DSP boards to allow near real-time evaluation and demonstration.</Paragraph> <Paragraph position="1"> The baseline SUMMIT system uses a mixture of up to 16 diagonal Gaussian models for each lexical unit. In recent months, we have been able to simplify the input representation of the models significantly with no loss in performance. The current representation consists of 39 segmental measurements for each hypothesized segment.</Paragraph> <Paragraph position="2"> This vector is rotated via principal component analysis prior to mixture Gaussian modelling. Segment duration is modelled separately, in the log domain, using a mixture of Gaussians. At the moment, spontaneous disfluencies are represented by one model, and are required to be one segment long.</Paragraph>
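As a rough, illustrative sketch of the segment-scoring pipeline described above — a 39-dimensional measurement vector per hypothesized segment, rotated by principal components and scored with a mixture of up to 16 diagonal Gaussians, with duration modelled separately in the log domain — one might write something like the following. This is not the SUMMIT implementation; the class, parameter names, and shapes are assumptions for exposition only.

```python
import numpy as np

class SegmentScorer:
    """Illustrative per-lexical-unit scorer: PCA rotation + diagonal-Gaussian
    mixture over 39 segmental measurements, plus a separate log-domain
    duration mixture.  A hypothetical sketch, not the actual SUMMIT code."""

    def __init__(self, rotation, means, variances, log_weights,
                 dur_means, dur_vars, dur_log_weights):
        self.rotation = rotation          # (39, 39) principal-component rotation
        self.means = means                # (n_mix, 39) mixture means, n_mix <= 16
        self.variances = variances        # (n_mix, 39) diagonal variances
        self.log_weights = log_weights    # (n_mix,) log mixture weights
        self.dur_means = dur_means        # (n_dur, 1) log-duration means
        self.dur_vars = dur_vars          # (n_dur, 1) log-duration variances
        self.dur_log_weights = dur_log_weights  # (n_dur,) log mixture weights

    @staticmethod
    def _log_gauss_mix(x, means, variances, log_weights):
        # log sum_k w_k * N(x; mu_k, diag(var_k)), computed stably.
        diff = x - means
        log_comp = (log_weights
                    - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=-1)
                    - 0.5 * np.sum(diff * diff / variances, axis=-1))
        m = np.max(log_comp)
        return m + np.log(np.sum(np.exp(log_comp - m)))

    def log_score(self, measurements, duration_sec):
        # Rotate the 39 segmental measurements before mixture-Gaussian scoring.
        x = self.rotation @ measurements
        acoustic = self._log_gauss_mix(x, self.means, self.variances,
                                       self.log_weights)
        # Duration is modelled separately, in the log domain.
        duration = self._log_gauss_mix(np.array([np.log(duration_sec)]),
                                       self.dur_means, self.dur_vars,
                                       self.dur_log_weights)
        return acoustic + duration
```

In such a setup, the rotation matrix would come from a principal component analysis of the training measurements, and one scorer would be trained per lexical unit.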
<Section position="1" start_page="84" end_page="85" type="sub_section"> <SectionTitle> Training and Testing Corpora </SectionTitle> <Paragraph position="0"> The multi-site ATIS data collection effort has resulted in a significant increase in the amount of speech data available to the community [6]. For speech recognition system development, we started with all the MADCOW data released by NIST, and augmented them with ATIS data collected earlier at MIT. Some 9,711 utterances in this pool were designated as training material, and an additional 1,595 utterances were set aside as a development set for independent evaluation.</Paragraph> <Paragraph position="1"> To facilitate a meaningful comparison, all the experiments described in this section were performed on the October '91 "dry-run" test set, containing some 362 utterances collected at BBN, CMU, MIT, and SRI. The experiments that we conducted are summarized in Table 1 and described in this section.</Paragraph> <Paragraph position="2"> In order to monitor progress internally, we also ran the same test set through our system as reported a year ago [8]. Our February '91 system had a vocabulary of 577 words. That system constrained the N-best search with a word-pair grammar with a perplexity of 92. The N-best outputs were subsequently resorted using our natural language component TINA, which was trained on some 2,400 utterances collected at TI and MIT. The recognition performance of that system on the October '91 "dry-run" test set, with and without the word-pair language model, is shown in the first two rows of Table 1 (labelled AW and WP, respectively).</Paragraph> <SectionTitle> Lexicon </SectionTitle> <Paragraph position="3"> With the availability of a larger amount of training data, we enlarged our vocabulary to contain 841 words.</Paragraph> <Paragraph position="4"> This was done by examining word frequency counts in the training data and adding all reasonable words that occurred more than once. Examples of words that were not added include misspellings and people's names.</Paragraph> <Paragraph position="5"> Other improvements to the lexicon included refinement of the pronunciation baseforms and of the phonological rules used to generate the pronunciation networks. In part, this involved improving pre-existing rules such as the flapping rule. We also introduced a number of specific allophones for certain phonemes in certain contexts, such as a retroflexed /f/ or a stop closure following a fricative, and a number of new diphone units, allowing a sequence of two phonemes to be treated as a diphthong, such as /el/ or /at/. The inventory of phonetic units in the expanded lexicon contained 115 distinct labels.</Paragraph> <Paragraph position="6"> As shown in the third row of Table 1 (labelled AW, Small Training), these changes combined to reduce the word error rate from 62.5% to 55.4% for the system of a year ago using an all-word language model. The next row in the same table (labelled AW, Full Training) shows that the word error rate is further reduced to 51% by using the full training set described earlier (due to computational limitations, we did not use the entire designated training set; instead, a subset of about 7,500 utterances was used). This result is identical to the result of the February '91 system using a word-pair language model, although the latter achieved better sentence recognition accuracy. Unless otherwise specified, the remaining experiments described in this section all use the full training set.</Paragraph> <SectionTitle> Bigram Language Model </SectionTitle> <Paragraph position="7"> The current SUMMIT system uses significantly more language constraints than were used by its predecessor [8]. With the help of the available large training set, we constructed a smoothed bigram grammar. As has been done elsewhere, the bigram was smoothed by interpolating the bigram estimates with the prior probabilities of each word [2,4]: P'(wi | wi-1) = lambda(wi-1) P(wi | wi-1) + (1 - lambda(wi-1)) P(wi).</Paragraph> <Paragraph position="9"> The interpolation weights were set to vary with the number of times we had observed the conditioning context: lambda(wi-1) = c(wi-1) / (c(wi-1) + K), where K is a single constant that was optimized so as to minimize the measured perplexity on the development data set. For the ATIS training data, we found that the perplexity had a broad minimum when K was around 20. On our development data set this smoothed bigram had a perplexity of 20.1. The perplexity measures did not include out-of-vocabulary words, since our recognition system does not currently have the capability of detecting these words; including out-of-vocabulary words in the perplexity measure increased the value slightly to 20.8. Recognition results using the bigram language model are shown in row 5 of Table 1 (labelled BG). The bigram language model is the single most effective change we made to our system, reducing the word error rate by more than twofold from the best results obtained previously.</Paragraph>
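For concreteness, an interpolated bigram of this kind can be written out in a few lines. This is an illustrative sketch, not the system's implementation; the function names and the sentence-boundary tokens are assumptions, and, as in the text, perplexity is computed over in-vocabulary words only.

```python
import math
from collections import Counter

def train_smoothed_bigram(sentences, K=20.0):
    """Interpolated bigram: P'(w|v) = lam(v) * Pml(w|v) + (1 - lam(v)) * Pml(w),
    with lam(v) = c(v) / (c(v) + K).  Illustrative sketch only."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    total = sum(unigrams.values())

    def prob(w, prev):
        c_prev = unigrams[prev]
        lam = c_prev / (c_prev + K) if c_prev > 0 else 0.0
        p_bg = bigrams[(prev, w)] / c_prev if c_prev > 0 else 0.0
        p_ug = unigrams[w] / total
        return lam * p_bg + (1.0 - lam) * p_ug

    return prob

def perplexity(prob, sentences):
    """Per-word perplexity on held-out, in-vocabulary word sequences."""
    log_sum, n = 0.0, 0
    for words in sentences:
        prev = "<s>"
        for w in list(words) + ["</s>"]:
            log_sum += math.log(prob(w, prev))
            prev, n = w, n + 1
    return math.exp(-log_sum / n)
```

Sweeping K on the development set and taking the broad minimum of this perplexity (around 20 for the ATIS data, according to the text) mirrors the tuning procedure described above.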
<Paragraph position="13"> A probabilistic LR parser was used in addition to a bigram model to provide language constraints. The LR algorithm is a deterministic, table-driven, left-to-right parsing algorithm for a subset of context-free grammars [1]. The probabilistic LR (PLR) model extends this algorithm to assign a probability P(w0 w1 ... wn-1) = Prod_i P(wi | w0...wi-1) to each word string, rather than a binary value.</Paragraph> <Paragraph position="14"> In the PLR model the conditional word probabilities are approximated using the parser state. If P(Qj | w0...wi-1) is the probability that the parser is in state Qj having just parsed the substring w0...wi-1 (without making any moves based on the value of wi), then the conditional word probability can be rewritten as P(wi | w0...wi-1) = Sum_j P(wi | Qj, w0...wi-1) P(Qj | w0...wi-1). Making the assumption that the parser state captures much of the information in the substring w0...wi-1 relevant to the conditional probabilities, this can be approximated by P(wi | w0...wi-1) ~ Sum_j P(wi | Qj) P(Qj | w0...wi-1). The set of Qj for which P(Qj | w0...wi-1) is non-zero is determined by the grammar. In particular, if the grammar is deterministic, then P(Qj | w0...wi-1) = 1 for a single state Qj (and 0 for all others), so that the conditional word probability reduces to P(wi | Qj).</Paragraph> <Paragraph position="21"> The probabilities P(wi | Qj) can be estimated from a corpus of training utterances as the ratio of the number of times wi is the next word when the parser is in state Qj to the number of times the parser is in state Qj.</Paragraph> <Paragraph position="22"> [Table 1 caption] In addition to the word and sentence error rates, errors due to substitution, insertion, and deletion are also provided. Performance of the systems from a year ago on the same data set is included for reference. The symbols are: AW = all-word language model, WP = word-pair language model, BG = bigram language model, CD = context-dependent modelling, PLR = probabilistic LR parser, NL = NL filtering using TINA.</Paragraph> <Paragraph position="23"> In previous work using the PLR model for the VOYAGER task [3], the language model implemented was strict; that is, it assigned probability 0 to word strings not generated by the input grammar. In order to apply this model to speech recognition (i.e., optimizing word accuracy), the parse table was extended to "accept" all word strings. This was accomplished by adding explicit error states to the parse table, and computing recovery actions to allow normal parsing to resume in an appropriate state after an error. Other extensions to the model described previously [3] include various mechanisms for smoothing the probabilities by changing the conditioning state.</Paragraph> <Paragraph position="24"> The ATIS grammar contains 971 rules, the vast majority of which introduce lexical items, and the resulting parse table contains about 1,600 states. The lexicon of the parser is the same as that used by the recognizer. The probabilities were trained on all 9,711 utterances in the training set. The perplexity measured on the October '91 test set was 17.6.</Paragraph> <Paragraph position="25"> Row 6 of Table 1 (labelled BG+PLR) shows that further reduction in error rate is possible by incorporating the PLR. The PLR is incorporated by using the parse score in place of the bigram score to reorder the 50 N-best outputs produced by the recognizer. The sentence error rate is reduced more than the word error rate, presumably because the PLR can handle some of the long-distance constraints better than the bigram.</Paragraph>
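The count-based estimate of P(wi | Qj) and its use in place of the bigram score to reorder the N-best outputs might look roughly as follows. The parser itself is abstracted away here: states are opaque identifiers, and the parse() helper (with error recovery, so that every word string receives a score) is assumed rather than implemented. This is an illustrative sketch, not the actual PLR system.

```python
import math
from collections import Counter, defaultdict

def train_plr_probs(parsed_corpus, smooth=1e-3, vocab_size=841):
    """Estimate P(word | parser state) as (# times `word` follows state Q) /
    (# times Q is visited).  `parsed_corpus` holds (states, words) pairs where
    states[i] is the parser state reached just before word i.  Hypothetical."""
    state_counts = Counter()
    state_word_counts = defaultdict(Counter)
    for states, words in parsed_corpus:
        for q, w in zip(states, words):
            state_counts[q] += 1
            state_word_counts[q][w] += 1

    def log_prob(word, state):
        c_q = state_counts[state]
        c_qw = state_word_counts[state][word]
        # Crude additive smoothing so unseen (state, word) pairs keep a floor.
        return math.log((c_qw + smooth) / (c_q + smooth * vocab_size))

    return log_prob

def rescore_nbest(nbest, acoustic_scores, parse, log_prob):
    """Reorder N-best word strings by acoustic score plus PLR parse score,
    used here in place of the bigram score.  `parse(words)` returns the
    parser-state sequence for the string (assumed helper)."""
    def total(words, acoustic):
        states = parse(words)
        return acoustic + sum(log_prob(w, q) for q, w in zip(states, words))
    ranked = sorted(zip(nbest, acoustic_scores),
                    key=lambda pair: total(*pair), reverse=True)
    return [words for words, _ in ranked]
```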
</Section> <Section position="2" start_page="85" end_page="86" type="sub_section"> <SectionTitle> Context-Dependent Modelling </SectionTitle> <Paragraph position="0"> At the last DARPA meeting we first described our work towards accounting for contextual effects on the phonetic modelling component of SUMMIT [5]. We proposed using regression tree analysis to find the contextual factors that provided the greatest reduction in the distortion of our phonetic models. In an initial experiment, regression tree analysis was used to form a set of context-specific models for each phonetic unit. However, we found that we were able to obtain the best performance by using the regression trees to independently learn a context-normalization factor for each of the input dimensions of the model. The model for each phonetic unit is then trained using these context-normalized inputs for all of the training samples in that class.</Paragraph> <Paragraph position="1"> We have extended this work by considering more contextual effects, including the phonetic labels two phones away and whether or not the current segment is in a syllable before a pause or at a sentence boundary. The new effects were simply added to the list of questions that could be asked at each node in the tree-splitting algorithm.</Paragraph> <Paragraph position="2"> When we applied this context normalization to the ATIS domain, we found that the word error rate dropped from 24.1% to 20.6%, as shown in rows 5 and 7 of Table 1 (labelled BG and BG+CD, respectively). This represents a 15% reduction in error rate. In the Resource Management domain, we found a decrease in word error rate from 10.3% to 7%, or 32% [5]. We believe that we are achieving a smaller reduction in error rate in the ATIS domain because a greater number of errors can be attributed to problems other than phonetic modelling (e.g., out-of-vocabulary words, mismatch of the language model, spontaneous speech effects, etc.). In fact, if we look at the performance of the phonetic models in terms of their ability to match the "forced-recognition" phonetic string (the string obtained during recognition when only the correct word string is allowed), we see a much larger reduction in error rate in the ATIS domain (37.5%) than in the Resource Management domain (18.8%). This may not be surprising, since we are now considering more contextual effects. In addition, it is likely that there are stronger contextual effects in a spontaneous speech corpus such as ATIS than in a more carefully spoken "read" corpus such as Resource Management.</Paragraph> <Paragraph position="4"> The combined effect of our improved phonetic and language modelling is shown in row 8 of Table 1 (labelled BG+CD+PLR). In this case, the PLR score is used in conjunction with the acoustic score to resort the N-best outputs. As expected, there is again a more significant improvement in the sentence error rate.</Paragraph> <Paragraph position="5"> Finally, we incorporated our natural language system TINA as a filter on the N-best outputs produced by the recognizer (with N = 40); the results are shown in the last row of Table 1 (labelled BG+CD+PLR+NL).</Paragraph> <Paragraph position="6"> Not surprisingly, the natural language component is able to reduce the sentence error rate much more than the word error rate.</Paragraph>
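One way to realize the per-dimension context normalization described in this subsection is to fit, for each input dimension, a small regression tree over encoded context features (neighbouring phone classes, whether the segment is pre-pausal or at a sentence boundary, and so on) that predicts the deviation of that dimension from its phone-class mean, and to subtract that prediction before training the mixture models. The sketch below uses scikit-learn's DecisionTreeRegressor as a stand-in for the regression-tree analysis; the feature encoding, function names, and tree settings are assumptions, not the SUMMIT recipe.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_context_normalizers(X, phone_ids, contexts, max_leaf_nodes=16):
    """X: (n_segments, n_dims) segmental measurements; phone_ids: (n_segments,)
    phonetic labels; contexts: (n_segments, n_context_features) numerically
    encoded contextual factors.  Returns one tree per input dimension that
    predicts the contextual offset of that dimension.  Illustrative only."""
    phone_means = {p: X[phone_ids == p].mean(axis=0) for p in np.unique(phone_ids)}
    # Target is the deviation from the phone-class mean, so the trees model
    # contextual effects rather than phone identity itself.
    residual = X - np.stack([phone_means[p] for p in phone_ids])
    trees = []
    for d in range(X.shape[1]):
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(contexts, residual[:, d])
        trees.append(tree)
    return trees

def context_normalize(X, contexts, trees):
    """Subtract the predicted context effect from every dimension; the
    mixture-Gaussian phone models are then trained on these normalized inputs."""
    X_norm = X.astype(float).copy()
    for d, tree in enumerate(trees):
        X_norm[:, d] -= tree.predict(contexts)
    return X_norm
```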
</Section> </Section> <Section position="4" start_page="86" end_page="86" type="metho"> <SectionTitle> OTHER IMPROVEMENTS </SectionTitle> <Paragraph position="0"> The most significant improvement in the back-end, the augmentation of the system with a robust parsing capability, is described separately. In addition, however, we have continued to expand the capabilities of the back-end at all levels (syntactic coverage, concepts understood, discourse modelling, dialogue aspects, etc.). We continue to improve the level of sophistication of the booking dialogue, towards the goal of a natural and effective mixed-initiative dialogue that achieves a successful booking.</Paragraph> <Paragraph position="1"> The performance of our current spoken language system on the October '91 test set is summarized in Table 2. The significant improvement in our NL result can be attributed to the robust parsing strategy that we have adopted. Discussion of these results can be found in a companion paper [9].</Paragraph> </Section> <Section position="5" start_page="86" end_page="87" type="metho"> <SectionTitle> FEBRUARY BENCHMARK </SectionTitle> <Paragraph position="0"> The February '92 benchmark results were obtained by running the official test set released by NIST through our system once. This test set contains 971 utterances collected at AT&T, BBN, CMU, MIT, and SRI. The speech recognition results are shown in Table 3. Comparing Table 3 with the last row of Table 1, we see that the performance of our system on the two test sets is quite similar.</Paragraph> <Paragraph position="1"> The performance of our current spoken language system on the February '92 test set is summarized in Table 4. Although the system's performance for speech input is similar to that on the October '91 test set, the NL results are not as good. This is a direct reflection of our research priorities since October 1991: we have focused our group's attention almost entirely on improving the speech recognition component, to the neglect of expanding our NL system's capabilities to adequately conform to the principles of interpretation. Again, discussion of these results can be found elsewhere in these proceedings [9].</Paragraph> </Section> </Paper>