<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1021">
  <Title>Augmented Role Filling Capabilities for Semantic Interpretation of Spoken Language</Title>
  <Section position="7" start_page="127" end_page="130" type="evalu">
    <SectionTitle>
BENCHMARK RESULTS
</SectionTitle>
    <Paragraph position="0"> Natural Language Common Task Evaluatlon Unisys attempted all four of the nature,\] language tests; both the required and the optional class A and class D1 tests. Our scores as released by NIST are as shown in table 1. The overall level of success is unimpressive. For the class A test, which corresponds most closely to the test last June, our performance is not much better, in spite of eight more months of work on our system. (If the scoring algorithm in effect now had been in effect in June, our score then would have been 42.2Ye) As this paper is being written, we have not had the time to examine our performance on a sentence by sentence basis. It appears likely, however, that the amount of training data has not yet adequately covered the full range of the various ways that people can formulate queries to the ATIS database.</Paragraph>
    <Paragraph position="1"> We are fairly pleased that our &amp;quot;false alarm&amp;quot; rate has not gone up since June. It was 11% then; if we take the 196 sentences involved in the latest 4 tests as a single group, we find our rate of F's to be less than 8%. When we discuss our spoken language results in a subsequent section, we will see that although the rate of correct answers drops noticeably when a speech recognizer is added to the system, the rate of incorrect answers does not appear to increase. The importance of a low  &amp;quot;false alarm&amp;quot; rate is well appreciated by spoken language understanding researchers; from a user's point of view nothing could be worse than an answer which is wrong although the user may have no way of telling it is wrong. It will be important to lower the rate of such errors to a level well below S%.</Paragraph>
    <Paragraph position="2"> Our best performance came on the D1 pairs test. One would have expected a lower score on any test that requires two consecutive sentences to be understood than on a test of self-contained sentences. While we wish we could claim that our work discussed in the earlier section on pragmatics was instrumental in achieving our score, it appears that much of what we added to our system did not come into play in this test.</Paragraph>
    <Paragraph position="3"> A more likely explanation of the unexpectedly high score is that when a user queries the system in a mode which utilizes follow-up queries, he or she tends to use simpler individual queries. Perhaps a user who does not use follow-up queries is trying to put more into each individual query. Some evidence for this is that our score for just the 20 distinct class A antecedent sentences for the D1 pairs test was 75%, well above our 48.3% score for all the class A sentences. Even more striking is the fact that of the 9 speakers represented in this round of tests, only two contributed more than 3 pairs to the class D1 test-speakers CK and CO contributed 13 pairs each. Our scores restricted to just those two speakers were 93~ for the class A test and 65% for the class D1 test (100% for speaker CO in the class D1 test!).</Paragraph>
    <Paragraph position="4"> The optional tests clearly were too small to have much significance. It is not surprising that our system proved to be incapable at this point of dealing with extraneous words in the input queries, for we have made no efforts as yet to compensate for such inputs. These tests will be useful as a benchxnark for comparison after we have addressed such issues.</Paragraph>
    <Paragraph position="5"> Semantics Extensions and the Common Task Tests In the section on semantics we reported the results of two experiments that we ran to assess the effects of extensions to our system. We performed the same tests using the data of the latest class A test of 145 queries. When the extensions to our semantic interpreter were removed, our performance dropped to 72 T, 19 F, or 36.6%, a decrease of 24% from our score of 48.3%. This reinforces our belief that these extensions are very important and useful. When we ran the test without the rules relating multiple decompositions, our performance was 83 T, 14 F, or 47.6%, a decrease of less than 2%. This latter finding was most surprising-basically it implies that in the 1991 test data there were virtually no constructions of the kind which those rules enable us to process, because the absence of the rules relating the decompositions corresponding to those constructions resulted in almost no reduction in our score. In particular, there must have been no nouns modified by relative clauses (&amp;quot;flights that arrive before noon&amp;quot;) or participial modifiers (&amp;quot;flights serving dinner&amp;quot;). This has some implication regarding the distribution of various forms of syntactic expression across speakers, for phenomena which were dearly significant in our training data apparently were absent from 9 speakers' worth of test data.</Paragraph>
    <Paragraph position="6"> The above experiments imply that our system as of last June would have gotten a score of less than 35% on the current class A test, for the extensions discussed in the section on semantics were not the only improvements we have made to our system. This is another indication of variability among speakers; for our system the 5 speakers of last June's test were easier to process. It appears to us that larger test sets are necessary to make a broad evaluation of natural language understanding capabilities. (We do not extend this suggestion to tests involving speech input because of the level of effort that would consume.) We have already noted the absence of relative clauses and participial modifiers in the recent class A test. We also noticed that 23 of 145, or 16%, of the sentences used the word &amp;quot;available&amp;quot;, usually in constructions like &amp;quot;what X is available&amp;quot;, while this word only appeared in 4% of the pilot training data. In the class D1 test, there were few discourse phenomena represented, and we noted in an earlier section that over 70% (27 of 38) of the D1 pairs involved just the phenomenon of flight leg disambiguation. Tests of such size, then, are not broadly representative of the range of query formulations in the ATIS domain.</Paragraph>
    <Paragraph position="7"> Related to the last point is the suspicion that the few thousand sentences of training data are themselves too few to represent the range of user queries for this domain. We have noticed that fewer new words are appearing in the more recent sets of training data, so vocabulary closure is probably occurring. Even so, in the 145 class A queries of the recent test, our system found 12 with unknown words, or 8% of the queries. This was actually higher than the 5.5% we experienced with the test last June, but that is more a comment on the variability due to small test size. It is an open question whether more and more training data is the answer to making our systems more complete, however. After all, larger volumes of data are both expensive to collect and expensive to train from. The lack of closure for the syntactic and semantic variation in user queries presents a challenge for further research in spoken language understanding. It may we\]\] be that we will have to begin studying reasonable ways in which the variation in the range of user expression can be limited, without unduly contrainlng the user in the natural performance of the task.</Paragraph>
    <Section position="1" start_page="128" end_page="130" type="sub_section">
      <SectionTitle>
Spoken Language Evaluations
</SectionTitle>
      <Paragraph position="0"> Unisys-MIT The spoken language results for this system were 29 T, 15 F, and 101 NA, for a weighted score of 9.7%. The system examined an average of 6.5 candidates in the N-best before finding an acceptable one. Of all candidates considered by the system, we found that 85~ were rejected by the syntax component and 3% by the semantics/praKmatics component, and 11% were accepted by both components. It should be pointed out that the syntax component uses a form of compiled semantics constraints during its search for parses (\[5\]), thus the resnlts for purely syntactic rejection are not as high as appears in this comparison, because some semantic constraints are applied during parsing. After a candidate is accepted by both syntax and semantics, the search in the N-best is terminated. However, the application component, which contains a great deal of information about domain-specific pragmatics, can also reject syntactically and semantically acceptable inputs for which it cannot construct a sensible database query.</Paragraph>
      <Paragraph position="1"> In fact, a syntactically and semanticul\]y acceptable candidate was found in 75% of the N-best candidate lists, but a call was</Paragraph>
      <Paragraph position="3"> made for only 30% of inputs. The application component was not able to rnal~e a sensible call for the remaining inputs.</Paragraph>
      <Paragraph position="4"> The false alarm (or F) rate we observed in this test was around 10~, which is consistent with our previous spoken language results (\[1\]) and with our natural language results, as discussed above.</Paragraph>
      <Paragraph position="5"> Unisys-BBN This system received a score of 77 T, 20 F and 48 NA for a weighted score of 39.3%. In this system 74% of all inputs were rejected by syntax, 11% of inputs were accepted by syntax but rejected by semantics and 15~ were accepted by both syntax and semantics. The false alarm rate is 13.8~, which is slightly higher but in the same range as previous false alarm rates.</Paragraph>
      <Paragraph position="6"> As can be seen in Figure 2, in general the system found an acceptable candidate earlier in the N-best with the BBN N-best than with the MIT N-best. The average location of the selected candidate in the N-best with the BBN data was 3.8 compared to 6.5 with the MIT N-best.</Paragraph>
      <Paragraph position="7"> Unisys-LL Using the top-one candidate from the Lincoln Labs speech recognizer the spoken language results for this system were 32 T, 5 F and 108 NA for a weighted score of 18.6~. The false alarm rate for this system was only 3.4~, which is lower than that for the other spoken language and natural language systems on which we report in this paper.</Paragraph>
      <Paragraph position="8"> There is no obvious explanation for this. The simple hypothesis of better speech recognition in the Unisys-LL system will not suffice, because the BBN system has better speech recognition but the false alarm rate is higher than the Unisys-LL rate. In addition, the Unisys system's performance on the NL test tells us how the system would do given perfect speech recognition, and the false alarm rate there is around 8~. One possible hypothesis is that the bigram language model used in the Lincoln Labs system is in some sense more conservative than the language models used in the BBN and MIT systems and consequently prevents some of the inputs which might have led to an F in the natural language system from being recognized well enough for the natural language system to generate an F.</Paragraph>
      <Paragraph position="9"> In this system, based on one input per utterance, we found that 59% of the inputs failed to receive a syntactic analysis (including compiled semantics, as discussed above) and 2% failed to receive a semantic analysis. No database call could be generated for 13% of the inputs and a call was made for the remaining 25~ of the inputs.</Paragraph>
      <Paragraph position="10"> Evaluation of the Natural Language System In \[1\] we reported on a technique for evaluation of the natural language component of our spoken language system, based on the question of how often did the natural language system do the right thing. If the reference answer for an utterance is found in the N-best, the right thing for the natural language system is to find the reference answer (or a semantic equivalent) in the N-best and give the right answer. The operational definition of doing the right thing, then, is for the system to receive a &amp;quot;T&amp;quot; on such inputs. On the other hand if the reference answer is not in the N-best the right thing for the system to do is to either find a semantic equivalent to the reference answer or to reject all inputs. Thus, doing the right thing in the case of no reference answer can be operationally defined as &amp;quot;T&amp;quot; + &amp;quot;NA'.  queries), depending on whether or not reference query occurred in N-best (N=16) from BBN SPREC.</Paragraph>
      <Paragraph position="11"> Several interesting comparisons can be made based on tables 2 and 3. To begin with, it seems clear that the BBN N-best is better than the MIT N-best based on three quite distinct measures - first of all the speech recognition score is better (16.1% word error rate for BBN vs. 43.6~ word error rate for MIT), secondly, the spoken language score (with the natural language system held constant) for Unisys-BBN is better than Unlsys-MIT (39.3% for Unisys-BBN vs. 9.7% for Unlsys-MIT) and thirdly, the reference answer occurred in MIT's top 16 candidates only 15~ of the time vs. 65% of the time for the BBN N-best. Thus this experiment allows us  to ask the question of what effect does better speech recognition have on the interaction between speech recognition and natural language? In the case where the reference answer is in the N-best, PUNDIT does much better with the BBN N-best. Since less search in the N-best is required with BBN data the reference answer or equivalent is likely to be found sooner, and consequently there will be fewer chances for PUNDIT to find a syntactically and semantically acceptable sentence in the N-best which differs crucially from what was uttered. On the other hand, PUNDIT actually does better with the poorer speech recogniser output from MIT when the reference answer is not in the N-best. We suspect that the poorer speech recognizer output is in some sense easier to reject; that is, it is more likely to seriously violate the syntactic and semantic constraints of English. If this is so then it is possible that a relatively accepting natural language system might work wen with worse speech recognition outputs (because even a relatively accepting natural language system can reject very had inputs), but with better speech recognizer output one might get good performance with a stricter natural language system. We plan to test this hypothesis in future research.</Paragraph>
      <Paragraph position="12"> It is natural to ask why we should care about what to do with poorer speech recognizer output; one would tlllnlc that we should use the best recognizer output possible. The answer is that many potential applications have requirements such as large vocabulary size which are somewhat at odds with high accuracy, consequently the best recognizer output available may nevertheless be relatively inaccurate. Thus it is important to have speech/natu.~al language integration strategies which allow us to fine tune the interaction to compensate for less accurate speech recognition.</Paragraph>
      <Paragraph position="13"> Optional Class A We used both the Unisys-MIT system and the UUIsys-BBN system for this test. For both speech recognizers in this test of eleven utterances with verbal deletions we received two T's and sero F's for a weighted score of 18.2%. There is too little test data in this condition to draw reliable conclusions from the results.</Paragraph>
      <Paragraph position="14"> Comparison of Spoken Language Systems We believe coupling of a single natural language system with multiple speech recognition systems has the potential for being a very useful technique for comparing speech recognizers in a spoken language context. Of course speech recognizers can he compared on the basis of word and sentence accuracy, but we do not know how direct the mapping is between these measures of performance and spoken language performance. The most direct comparision for spoken language evaluation, then, is to define an experimental condition in which the systems to be compared differ only in the speech recognition component.</Paragraph>
      <Paragraph position="15"> Not only is this strategy useful for comparing system level measurements of performance of speech recognizers, hut it is also useful for more fine grained analyses of the interaction between the speech recognition component and the natural language system.</Paragraph>
      <Paragraph position="16"> Figure 3 shows the distribution of T's, F's and NA's for specific queries across the three systems.</Paragraph>
      <Paragraph position="17"> Note that for 52 queries, or 36~ of the total, the systems received the same score, although in no case did all three systents receive an &amp;quot;F'. The largest difference among the three systems was in the number of cases where Unisys-BBN received a &amp;quot;T&amp;quot; but the other two systems received an &amp;quot;NA'. This occurred for 31 queries.</Paragraph>
      <Paragraph position="18"> Another interesting comparison is to look at the cases where Unisys-MIT and Unisys-BBN issued a call based on the ffi~st candidate in the N-best, since this corresponds to the one-best interface used in Unisys-LL. In Unisys-MIT twenty-seven calls were issued based on the first candidate, out of a total of 45 cans. Of the calls issued on the first candidate, 15 received a score of T and 12 received a score of F, for a weighted score of 2%. In Unisys-BBN the first candidate was selected from the N-best 70% of the time. 26 of these candidates resulted in scores of &amp;quot;F&amp;quot; and 42 resulted in a &amp;quot;T&amp;quot; for a weighted score of 11%.</Paragraph>
      <Paragraph position="19"> Overall, the number of calls made was quite similar for the Unisys-LL and Unisys-MIT systems (25% of utterances for Unisys-LL and 30% for Unisys-MIT), but it was much higher for Unisys-BBN (67%). In all three systems most of the inputs were rejected by the syntax component (59% of all inputs for Unisys-LL, 74% of all inputs for Unisys-BBN and 85% of all inputs for Unisys-MIT). We can compare this to a base-line syntactic falluxe of 14% of inputs on the Unisys natural language test. (Note that since multiple inputs per utterance are possible with the N-best systems, the N-best vs. one-best systems are not strictly comparable.)</Paragraph>
    </Section>
    <Section position="2" start_page="130" end_page="130" type="sub_section">
      <SectionTitle>
Speech Recognition Evaluations
</SectionTitle>
      <Paragraph position="0"> Using speech recognition data from MIT, we submitted resuits for the Class A, Class D1, Class AO and Class D10 speech recognition tests, shown in tables 4, 5, 6, and 7.</Paragraph>
      <Paragraph position="1"> As expected, we observed a higher error rate for the optional tests, which contained verbal deletions, and we also observed a wide range of performance across speakers. The comparison of D1 pales and Class A speech recognition showed poorer word recognition in the D1 pairs than in the Class A test. An average 45.8% word error rate was observed for the Class A utterances compared to a 54.6% error rate for the D1 utterances. As tables 4 and 6 show, this was fairly consistent across speakers, except for speaker CJ. There are at least two hypotheses which may explain this higher error in context dependent spontaneous utterances. One hypothesis suggests that the higher error rate may be due in part to the presence of prosodic phenomena common in dialog such as destressing of &amp;quot;old&amp;quot; information. Because the specific dialog context affects the pronunciation of words corresponding to old and new information, the training data used so far may not provide a complete sample of how words are pronounced in a wide range of dialog contexts, consequently leading to poorer word recognition. Another hypothesis is based on the fact that the context-dependent sentences contain many references to flight numbers. Flight numbers may be difficult to recognize hecause there is very little opportunity for syntactic or semantic information to constrain which number was uttered.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>