<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1013">
  <Title>Progress Report on the Chronus System: ATIS Benchmark Results</Title>
  <Section position="4" start_page="67" end_page="68" type="metho">
    <SectionTitle>
3. THE NL COMPONENT
</SectionTitle>
    <Paragraph position="0"> 3.1. Training the conceptual model The conceptual model, as explained in the introduction of this paper, consists of concept transition probabilities P(cg~ I cg,_l) and concept conditional bigram language models P(wi \[ w~-l, cg~), where cg~ is the concept expressed by the phrase in which word wi is included. These probabilities were initially trained using a set of 532 sentences whose conceptual segmentation was provided by hand. This initial model was used in the experiments described in \[1, 5\] and gave satisfactory performance as far as the conceptual segmentation of test sentences was concerned. Hand labeling train- null ing sentences is of course a rather expensive procedure whose consistence is rather doubtful. As of today, most of the training sentences available are annotated with a reference file that includes the right answer. However, for taking advantage of the annotated sentences we must use the whole understanding system in the training phase, generate the answer, and compare the answer with the reference file (see Fig. 1). Therefore the comparator \[7\] provides the training procedure with a feedback signal that can be used to partially automatize the training procedure. As a first attempt to develop a completely automatic training procedure, we designed a training loop based on the following steps:  1. Start with a reasonable model.</Paragraph>
    <Paragraph position="1"> 2. Generate an answer for each sentence in the training set.</Paragraph>
    <Paragraph position="2"> 3. Compare each answer with the corresponding reference answer.</Paragraph>
    <Paragraph position="3"> 4. Use the conceptual segmentation of the sentences  that were given a correct answer to reestimate the model parameters.</Paragraph>
    <Paragraph position="4"> 5. Update the model and go to step 2 A certain number of sentences will still produce a wrong answer after several iterations of the training loop. The conceptual segmentation of these sentences may be then corrected by hand and included in the training set for a final reestimation of the model parameters. Table 1 shows the sets of data used for testing the effectiveness of the training loop. All sentences are class A (context independent) sentences and belong to the MADCOW database. The conceptual segmentation of the sentences in set A was done by hand, set B and C were annotated  with reference files (set C corresponds to the official October 91 test set). The comparison with reference files was done using only the minimal answer. The results of this experiment are reported in Table 2. The first line in the table shows the results (as the percentage of correctly answered sentences) both on set B and on the October 91 test set when the initial model, trained on the 532 hand labeled sentences, was used. The second line shows the results on October 91 when the initial model is smoothed using the supervised smoothing described in \[5\]. The third line.reports the accuracy (on both set B and October 91) when the sentences that were correctly answered out of set B were added to the training set (this set is called T(B)) and their conceptual labeling was used along with set A for reestimating the model.</Paragraph>
    <Paragraph position="5"> It is interesting to notice that the performance on the October 91 test set is higher than that obtained with supervised smoothing. The last line of Table 2 shows that supervised smoothing increases the performance by a very small percentage. The results of this experiment show that the use of automatically produced conceptual segmentation along with the feedback introduced by the comparator improves the performance of the system of an amount that is comparable with that obtained by a supervised procedure, like the supervised smoothing.</Paragraph>
    <Paragraph position="6"> 3.2. The dialog manager For dealing with class D sentences we developed a module, within the template generator, called the dialog manager. The function of this module is to keep the history of the dialog. In this version of the dialog man- null text. T(B) is the subset of B that was correctly answered by the system.</Paragraph>
  </Section>
  <Section position="5" start_page="68" end_page="68" type="metho">
    <SectionTitle>
92 test
</SectionTitle>
    <Paragraph position="0"> ager the history is kept by saving the template from the previous sentence in the same session and merging it with the newly formed template, according to a set of application specific rules.</Paragraph>
    <Paragraph position="1"> 3.3. NL results on February 1992 test The February 1992 test set includes 402 class A sentences and 285 class D sentences. This set of 687 sentences, used for scoring the NL performance, is part of a larger set that originally included 283 class X (unanswerable) sentences. The test was carried out for the overall set of 970 sentence, without knowing which class they belong to. The official score given from NIST is summarized in Table 3. After the test we found an inaccuracy in the module of the SQL translator that is responsible for the CAS formatting. We fixed the bug and rescored the whole set of sentences, obtaining the results reported in Table 4. In Table 5 we report a detailed analysis of the results. In this analysis we included only the sentences that generated a false response. Conceptual decoding and template generator errors are generally due to the lack of training data. SQL translator and dialog manager errors are generally due to the limited power of the representation we are currently using. Finally for the errors attributed to the CAS format or labeled as ambiguos we generated a correct internal meaning representation but the format of the answer did not comply with the principles of interpretation, or our interpretation did not agree with the one given by the annotators.</Paragraph>
  </Section>
  <Section position="6" start_page="68" end_page="69" type="metho">
    <SectionTitle>
4. THE SPEECH RECOGNIZER
</SectionTitle>
    <Paragraph position="0"> In this section we give a description of the speech recognition system that was used in conjunction with the natural language understanding system for the February 92 ATIS test. Other details can be found in \[8, 9\]</Paragraph>
  </Section>
  <Section position="7" start_page="69" end_page="70" type="metho">
    <SectionTitle>
92 test
</SectionTitle>
    <Paragraph position="0"> The Speech signal was first filtered from 100 Hz to 3.8 KHz and down-sampled to an 8 kHz sampling rate. 10th order LPC analysis was then performed every 10 msec on consecutive 30 msec windows with a 20 msec frame overlap. Based on the short-time LPC features, 12 LPCderived cepstral coefficients and their first and second derivatives, plus normalized log energy and its first and second derivatives were computed and concatenated to form a single 39-dimension feature vector.</Paragraph>
    <Paragraph position="1"> 6259 spontaneous utterances from the MADCOW data were used for training the acoustic models. Context-dependent phone-like units \[10\], including doublecontext phones, left-context phones, right-context phones, context-independent phones, word-juncture context dependent phones and position dependent phones, were modeled using continuous density hidden Markov models (HMM) with mixture Gaussian state observation densities. The inventory of acoustic units was determined through an occurrency selection rule. Only units that appear in the training database more than 20 times were selected, resulting in a set of 2330 context-dependent phones. A maximum of 16 mixture components was used for each acoustic HMM state. The HMM parameters were estimated by means of the segmental k-means training procedure \[11\].</Paragraph>
    <Paragraph position="2"> The recognition lexicon consisted of 1153 lexical entries including 1060 words appearing in the Feb91 benchmark evaluation and 93 compound words which were mostly concatenation of letters to form acronyms. Each entry had a single pronunciation. In addition, two nonphonetic units, one for modeling weak extraneous (out of vocabulary) speech events and the other for modeling strong extraneous speech events, were included, like in \[12\].</Paragraph>
    <Paragraph position="3"> Word bigrams were used in the test. They were estimated using the same set of 6259 annotated sentences, and smoothed with backoff probabilities. The perplexity of the language defined by the bigram probabilities, computed on the training set, was found to be 17.</Paragraph>
    <Paragraph position="4">  The speech recognition results are summaried in Table 6 Overall we observed 17.5% word error and 64.6% string error.</Paragraph>
    <Paragraph position="5"> In the current system configuration, only 6259 utterances (about 12 hours of speech) were used to create the acoustic HMM models. Out of the 218 speakers, 15 of them were from the ATT training set and 17 of them were from the CMU training set, which amounts to about 90 minutes of training data from each of them. We can see from Table 6 that there is a problem due to an insufficient training for ATT and CMU test data. On the other hand, since most of the training data we used were collected at BBN and MIT, the performance is better for BBN and MIT test speakers.</Paragraph>
    <Paragraph position="6"> 94 out of the 427 deleted words were A and THE. Short function words amounted to over 90% of the deletion errors. As for the 328 insertion errors, 46 of them were insertion of words A and THE. Again, short function words contributed to over 90% of the insertion errors.</Paragraph>
    <Paragraph position="7"> Since function words, in most cases, did not affect the meaning of a recognized sentence, we expect that such errors did not degrade the performance of the NL module. null Substitution errors had a greater impact on the SLS system performance than insertion and deletion errors.</Paragraph>
    <Paragraph position="8"> Most of the substitution errors can be categorized into three types:  1. Out-of-vocabulary words; 2. Morphological inflections of words, which are difficult to discriminate acoustically for band-limited data; 3. short function words.</Paragraph>
    <Paragraph position="9">  Out of the 1153 substitution error, 66 were caused by out-of-vocabulary words, and 127 were caused by morphological inflections. For the remaining 85% of the er- null cies found in the answer formatter, that we don't believe affects the real performance of the CHRONUS system.</Paragraph>
    <Paragraph position="10"> Nevertheless, this suggests the importance of investigating a more meaningful and more rubust scoring criterion.</Paragraph>
  </Section>
  <Section position="8" start_page="70" end_page="70" type="metho">
    <SectionTitle>
92 test
</SectionTitle>
    <Paragraph position="0"> rors, about half involved short function words.</Paragraph>
  </Section>
class="xml-element"></Paper>