<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2022">
<Title>PRELIMINARY EVALUATION OF THE VOYAGER SPOKEN LANGUAGE SYSTEM*</Title>
<Section position="8" start_page="163" end_page="164" type="evalu">
<SectionTitle> SYSTEM PERFORMANCE </SectionTitle>
<Paragraph position="0"> VOYAGER's overall performance was evaluated in several ways. In some cases, we used automatic means to measure performance. In others, we used the expert opinion of system developers to judge the correctness of intermediate representations. Finally, we used a panel of naive users to judge the appropriateness of the system's responses as well as of the queries made by the subjects.</Paragraph>
<Section position="1" start_page="163" end_page="163" type="sub_section">
<SectionTitle> Automated Evaluation </SectionTitle>
<Paragraph position="0"> VOYAGER's responses to sentences can be divided into three categories. For some sentences, no parse is produced, due to recognizer errors, unknown words, or unseen linguistic structures. For others, no action is generated because of inadequacies of the back end. Some action is generated for the remainder of the sentences. Figure 3 shows the results on the spontaneous speech test set. The system failed to generate a parse, for one reason or another, on two-thirds of the sentences. Of those, 26% were found to contain unknown words. VOYAGER almost never failed to provide a response once a parse had been generated. This is a direct result of our conscious decision to constrain TINA according to the capabilities of the back end.</Paragraph>
<Paragraph position="1"> For diagnostic purposes, we also examined VOYAGER's responses when orthography, rather than speech, was presented to the system, after partial words and non-words had been removed. The results are also shown in Figure 3. Comparing the two sets of numbers, we can conclude that 30% of the sentences would have failed to parse even if recognized correctly, and that an additional 36% of the sentences failed to generate an action due to recognition errors or the system's inability to deal with spontaneous speech phenomena.</Paragraph>
<Paragraph position="2"> Even when a response was generated, it may not have been the correct one. It is difficult to know how to diagnose the quality of the responses, but we felt it was possible to break the analysis into two parts: one measuring the performance of the portion of the system that translates the sentence into functions and arguments, and the other assessing the capabilities of the back end. For the first part, we had two experts, who were well informed about the functionality of the back end, assess whether the function calls generated by the interface were complete and appropriate. The experts worked as a committee and examined all the sentences in the test set for which an action had been generated. They agreed that 97% of the functions generated were correct. Most of the failures were actually due to inadequacies in the back end. For example, the back end had no mechanism for handling the quantifier &quot;other&quot; as in &quot;any other restaurants,&quot; and therefore this word was ignored by the function generator, resulting in an incomplete command specification.</Paragraph>
</Section>
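To make the comparison behind Figure 3 concrete, the sketch below tabulates per-sentence outcomes (no parse, parse but no action, action generated) for the speech and orthographic conditions, and attributes failures by pairing the two conditions sentence by sentence. This is a minimal Python illustration with hypothetical labels and toy data, not the evaluation code used for VOYAGER.

from collections import Counter

# Hypothetical per-sentence outcomes under the two input conditions; the labels,
# variable names, and toy data here are illustrative only.
# Each outcome is one of: "no_parse", "no_action", "action".
speech_outcomes = ["no_parse", "action", "no_parse", "action", "no_action", "action"]
ortho_outcomes = ["no_parse", "action", "action", "action", "no_action", "action"]

def breakdown(outcomes):
    """Return the percentage of sentences in each response category."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {cat: 100.0 * counts[cat] / total
            for cat in ("no_parse", "no_action", "action")}

print("speech input:      ", breakdown(speech_outcomes))
print("orthographic input:", breakdown(ortho_outcomes))

# Attribute failures by pairing the conditions sentence by sentence: a sentence
# that fails even on orthographic input would fail regardless of recognition,
# while one that fails only on speech input is charged to recognition errors or
# spontaneous-speech phenomena.
total = len(speech_outcomes)
fails_regardless = sum(1 for o in ortho_outcomes if o != "action")
fails_from_recognition = sum(1 for s, o in zip(speech_outcomes, ortho_outcomes)
                             if s != "action" and o == "action")
print(f"would fail even if recognized correctly: {100.0 * fails_regardless / total:.0f}%")
print(f"additional failures under speech input:  {100.0 * fails_from_recognition / total:.0f}%")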
<Section position="2" start_page="163" end_page="164" type="sub_section">
<SectionTitle> Human Evaluation </SectionTitle>
<Paragraph position="0"> For the other half of the back end evaluation, we decided to solicit judgments from naive subjects who had had no previous experience with VOYAGER. We decided to have the subjects categorize both system responses and user queries according to their appropriateness. System responses came in two forms: a direct answer to the question when the system thought it had understood, or an admission of failure and an attempt to explain what went wrong. Subjects were asked to judge answers as either &quot;appropriate,&quot; &quot;verbose,&quot; or &quot;incorrect,&quot; and to judge error messages as either &quot;appropriate&quot; or &quot;ambiguous.&quot; In addition, they were asked to judge queries as &quot;reasonable,&quot; &quot;ambiguous,&quot; &quot;ill-formed,&quot; or &quot;out-of-domain.&quot; Statistics were collected separately for the two conditions, &quot;speech input&quot; and &quot;orthographic input.&quot; In both cases, we excluded sentences that contained out-of-vocabulary words or that failed to parse. We had three subjects judge each sentence in order to assess inter-subject agreement.</Paragraph>
<Paragraph position="1"> Table 2 shows a breakdown (in percent) of the results, averaged across the three subjects. The columns represent the judgment categories for the system's responses, whereas the rows represent the judgment categories for the user queries. A comparison of the last row for each of the two conditions reveals that the results are quite consistent, presumably because the majority of the incorrectly recognized sentences are rejected by the parser. About 80% of the sentences were judged to have an appropriate response, with an additional 5% being verbose but otherwise correct. Only about 4% of the sentences produced error messages, for which the system was judged to give an appropriate response about two-thirds of the time. The response was judged incorrect about 10% of the time. The table also shows that the subjects judged about 87% of the user queries to be reasonable.</Paragraph>
<Paragraph position="2"> In order to assess the reliability of the results, we examined the agreement among the judgments provided by the subjects. For this limited experiment, at least two out of three subjects agreed in their judgments about 95% of the time.</Paragraph>
</Section>
</Section>
</Paper>