<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2022"> <Title>PRELIMINARY EVALUATION OF THE VOYAGER SPOKEN LANGUAGE SYSTEM*</Title> <Section position="4" start_page="0" end_page="160" type="metho"> <SectionTitle> EVALUATION ISSUES </SectionTitle> <Paragraph position="0"> We believe that spoken language systems should be evaluated along several dimensions. First, the accuracy of the system and its various modules should be documented. Thus, for example, one can measure a given system's phonetic, word, and sentence accuracy, as well as linguistic and task completion accuracy.</Paragraph> <Paragraph position="1"> Second, one must measure the coverage and habitability of the system. This can be applied to the lexicon, the language model, and the application back-end. Third, the system's flexibility must be established. For example, how easy is it to add new knowledge to the system? How difficult is it to port the system to a different application? Finally, the efficiency of the system should be evaluated. One such measure may be the task completion time. (*This research was supported by DARPA under Contract N00014-89-J-1332, monitored through the Office of Naval Research.)</Paragraph> <Paragraph position="2"> Whether we want to evaluate the accuracy of a spoken language system in part or as a whole, we must first establish what the reference should be. For example, determining word accuracy for speech recognizers requires that the reference string of words first be transcribed. Similarly, assessing the appropriateness of a syntactic parse presupposes that we know what the correct parse is. In some cases, establishing the reference is relatively straightforward and can be done almost objectively. In other cases, such as specifying the correct system response, the process can be highly subjective. For example, should the correct answer to the query &quot;Do you know of any Chinese restaurants?&quot; be simply &quot;Yes,&quot; or a list of the restaurants that the system knows? It is important to point out that at no time is a human totally out of the evaluation loop. Even for something as innocent as word accuracy, we rely on the judgement of the transcriber for ambiguous events such as &quot;where is&quot; versus &quot;where's&quot; or &quot;I am&quot; versus &quot;I'm.&quot; Therefore, the issue is not whether the reference is obtained objectively, but the degree to which the reference is tainted by subjectivity.</Paragraph> <Paragraph position="3"> The outputs of the system modules naturally become more general at the higher levels of the system, since these outputs represent more abstract information. Unfortunately, this makes an automatic comparison with a reference output more difficult, both because the correct response may become more ambiguous and because the output representation must become more flexible. The added flexibility that is necessary to express more general concepts also allows a given concept to be expressed in many ways, making the comparison with a reference more difficult.</Paragraph> <Paragraph position="4"> To evaluate these higher levels of the system, we will either have to restrict the representation and answers to ones that are unambiguous enough to evaluate automatically, or adopt less objective evaluation criteria. We feel it is important not to restrict the representations and capabilities of the system on account of an inflexible evaluation process.
Therefore, we have begun to explore the use of subjective evaluations of the system where we feel they are appropriate. For these evaluations, rather than automatically comparing the system response to a reference output, we present the input and output to human subjects and give them a set of categories for evaluating the response. At some levels of the system (for example, evaluating the appropriateness of the response of the overall system) we have used subjects who were not previously familiar with the system, since we are interested in a user's evaluation of the system. For other components of the system, such as the translation from parse to action, we are interested in whether they performed as expected by their developers, so we have evaluated the output of these parts using people familiar with their function.</Paragraph> <Paragraph position="5"> In the following section, we present the results of applying various evaluation procedures to the VOYAGER system. We don't profess to know the answers regarding how performance evaluation should be achieved.</Paragraph> <Paragraph position="6"> By simply plunging in, we hope to learn something from this exercise.</Paragraph> </Section> <Section position="5" start_page="160" end_page="161" type="metho"> <SectionTitle> PERFORMANCE EVALUATION </SectionTitle> <Paragraph position="0"> Our evaluation of the VOYAGER system is divided into four parts. The SUMMIT speech recognition system is independently evaluated for its word and sentence accuracy. The TINA natural language system is evaluated in terms of its coverage and perplexity. The accuracy of the commands generated by the back end is determined. Finally, the appropriateness of the overall system response is assessed by a panel of naive subjects. Unless otherwise specified, all evaluations were done on the designated test set \[3\], consisting of 485 spontaneous and 501 read sentences, respectively, spoken by 5 male and 5 female subjects. The average number of words per sentence is 7.7 and 7.6 for the spontaneous and read speech test sets, respectively.</Paragraph> </Section> <Section position="6" start_page="161" end_page="161" type="metho"> <SectionTitle> SPEECH RECOGNITION PERFORMANCE </SectionTitle> <Paragraph position="0"> The SUMMIT speech recognition system that we evaluated is essentially the same as the one we described during the last workshop \[4\], with the exception of a new training procedure as described elsewhere \[2\].</Paragraph> <Paragraph position="1"> Since the speech recognition and natural language components are not as yet fully integrated, we currently use a word-pair grammar to constrain the search space. The vocabulary size is 570 words, and the test set perplexity and coverage are 22 and 65%, respectively. Figure 1 displays the word and sentence accuracy for SUMMIT on both the spontaneous and read speech test sets. For word accuracy, substitutions, insertions, and deletions are all included. For sentence accuracy, we count as correct those sentences in which all the words were recognized correctly. We have included only those sentences that pass the word-pair grammar, following the practice of past Resource Management evaluations. However, overall system results are reported on all the sentences.
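As an illustration of the scoring conventions just described, the sketch below (our own Python illustration, not the code actually used to score SUMMIT) aligns each hypothesis with its reference transcription by minimum edit distance, charges substitutions, insertions, and deletions against word accuracy, counts a sentence as correct only when every word is recognized correctly, and filters sentences through a word-pair table before scoring. The `word_pairs` table and the helper names are assumptions for the example only.

```python
def align_errors(ref, hyp):
    """Minimum-edit-distance alignment of reference and hypothesis word lists.
    Returns (substitutions, insertions, deletions)."""
    # dp[i][j] = (total errors, subs, ins, dels) for ref[:i] vs hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        e, s, ins, dels = dp[i - 1][0]
        dp[i][0] = (e + 1, s, ins, dels + 1)              # all deletions
    for j in range(1, len(hyp) + 1):
        e, s, ins, dels = dp[0][j - 1]
        dp[0][j] = (e + 1, s, ins + 1, dels)              # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            moves = [
                (dp[i - 1][j - 1][0] + sub, dp[i - 1][j - 1], (sub, 0, 0)),  # match/substitution
                (dp[i][j - 1][0] + 1,       dp[i][j - 1],     (0, 1, 0)),    # insertion
                (dp[i - 1][j][0] + 1,       dp[i - 1][j],     (0, 0, 1)),    # deletion
            ]
            total, prev, (ds, di, dd) = min(moves, key=lambda m: m[0])
            dp[i][j] = (total, prev[1] + ds, prev[2] + di, prev[3] + dd)
    return dp[len(ref)][len(hyp)][1:]

def passes_word_pair_grammar(words, word_pairs):
    """word_pairs: hypothetical dict mapping a word to the set of words allowed
    to follow it (with '<s>' standing in for the sentence-initial position)."""
    prev = "<s>"
    for w in words:
        if w not in word_pairs.get(prev, set()):
            return False
        prev = w
    return True

def score(pairs):
    """pairs: list of (reference word list, hypothesis word list).
    Word accuracy charges substitutions, insertions, and deletions;
    a sentence is correct only if all of its words are correct."""
    subs = ins = dels = n_ref = correct = 0
    for ref, hyp in pairs:
        s, i, d = align_errors(ref, hyp)
        subs, ins, dels, n_ref = subs + s, ins + i, dels + d, n_ref + len(ref)
        correct += int(ref == hyp)
    word_acc = 100.0 * (1.0 - (subs + ins + dels) / n_ref)
    sent_acc = 100.0 * correct / len(pairs)
    return word_acc, sent_acc
```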
For spontaneous speech, we broke down the results into three categories: sentences that contain partial words, sentences that contain filled pauses, and uncontaminated sentences. These results are shown in Figure 2. Since we do not explicitly model these spontaneous speech events, we expected the performance of the system to degrade. However, we were somewhat surprised to find that the read speech results were very similar to the spontaneous speech results (Figure 1). One possible reason is that the speaking rate for the read speech test set is very high, about 295 words/min, compared to 180 words/min for the spontaneous speech and 210 words/min for the Resource Management February-89 test set. The read speech sentences were collected during the last five minutes of the recording session. Apparently, the subjects were anxious to complete the task, and we did not explicitly ask them to slow down.</Paragraph> </Section> <Section position="7" start_page="161" end_page="163" type="metho"> <SectionTitle> NATURAL LANGUAGE PERFORMANCE </SectionTitle> <Paragraph position="0"> Following data collection, TINA's arc probabilities were trained using the 3,312 sentences from the designated training set \[5\]. The resulting coverage and perplexity for the designated development set are shown in the top row of Table 1. The left column gives the perplexity when all words that could follow a given word are considered equally likely. The middle column takes into account the probabilities on arcs as established from the training sentences. The right column gives overall coverage in terms of the percentage of sentences that parsed.</Paragraph> <Paragraph position="2"> Examination of the training sentences led to some expansions of the grammar and the vocabulary to include some of the more commonly occurring patterns/words that had originally been left out due to oversight. These additions led to an improvement in coverage from 69% to 76%, as shown in Table 1, but with a corresponding increase in perplexity. This table also shows the performance of the expanded system on the training set. The fact that there is little difference between this result and the result on the development set suggests that the training process is capturing appropriate generalities. The final row gives perplexity and coverage for the test set. The coverage for this set was somewhat lower, but the perplexities were comparable.</Paragraph>
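The perplexity figures reported here can be viewed as 2 raised to the average negative log (base 2) probability assigned to each test word. The sketch below is our own illustration of the two variants, not TINA's code; `successors` and `arc_prob` are hypothetical stand-ins for the set of words the grammar allows after a given history and for the trained arc probabilities.

```python
import math

def perplexity(sentences, successors, arc_prob=None):
    """Test-set perplexity over a list of sentences (each a list of words).
    successors(history) -> set of words the grammar allows next (assumed hook).
    arc_prob(history, word) -> trained arc probability of the word (assumed hook).
    With arc_prob=None every allowed successor is treated as equally likely
    (the uniform case); passing arc probabilities gives the trained case."""
    log_prob_sum, n_words = 0.0, 0
    for words in sentences:
        history = []
        for w in words:
            allowed = successors(history)
            p = 1.0 / len(allowed) if arc_prob is None else arc_prob(history, w)
            log_prob_sum += math.log2(p)
            n_words += 1
            history.append(w)
    return 2.0 ** (-log_prob_sum / n_words)
```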
<Paragraph position="3"> Note also that perplexity as computed here is an upper bound on the actual constraint provided. In a parser, many long-distance constraints are not detected until long after the word has been incorporated into the perplexity count. For instance, the sentence &quot;What does the nearest restaurant serve?&quot; would license the existence of &quot;does&quot; as a competitor for &quot;is&quot; following the word &quot;what.&quot; However, if &quot;does&quot; is incorrectly substituted for &quot;is&quot; in the sentence &quot;What is the nearest restaurant?&quot; the parse would fail at the end due to the absence of a predicate. It is difficult to devise a scheme that could accurately measure the gain a parser realizes from long-distance memory that is not present in a word-pair grammar. The above results were all obtained directly from the log file, as typed in by the experimenter. We also have available the orthographic transcriptions of the utterances, which included false starts explicitly. We ran a separate experiment on the test set in which we used the orthographic transcriptions, after stripping away all partial words and non-words. We found a 2.5% reduction in coverage in this case, presumably due to back-ups after false starts.</Paragraph> <Paragraph position="4"> Of course, we have not yet taken advantage of the constraint provided by TINA, except in an accept/reject mode for recognizer output. We expect TINA's low perplexity to become an important factor in search space reduction and performance improvement once the system is fully integrated.</Paragraph> <Section position="1" start_page="163" end_page="163" type="sub_section"> <SectionTitle> [Table 1 fragment, flattened in extraction: perplexity / coverage pairs 8.3 / 76%, 8.1 / 78%, 8.2 / 72.5%] </SectionTitle> <Paragraph position="0"/> </Section> </Section> </Paper>