<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1602"> <Title>Variant Transduction: A Method for Rapid Development of Interactive Spoken Interfaces</Title> <Section position="8" start_page="0" end_page="0" type="evalu"> <SectionTitle> 7 Experiments </SectionTitle> <Paragraph position="0"> An important question relating to our method is the effect of the number of examples on system interpretation accuracy. To measure this effect, we chose the operator services call routing task described by Gorin et al. (1997).</Paragraph> <Paragraph position="1"> We chose this task because a reasonably large data set was available in the form of actual recordings of thousands of real customers calling AT&amp;T's operators, together with transcriptions and manual labeling of the desired call destination. More specifically, we measure the call routing accuracy for unconstrained caller responses to the initial context prompt AT&amp;T. How may I help you?. Another advantage of this task was that benchmark call routing accuracy figures were available for systems built with the full data set (Gorin et al., 1997; Schapire and Singer, 2000). We have not yet measured interpretation accuracy for the structurally more complex e-mail access application.</Paragraph> <Paragraph position="2"> In this experiment, the responses to How may I help you? are &quot;routed&quot; to fifteen destinations, where routing means handing off the call to another system or human operator, or moving to another example-action context that will interact further with the user to elicit further information so that a subtask (such as making a collect call) can be completed. Thus the actions in the initial context are simply the destinations, i.e.
a = a, and the matcher is only used to compute e.</Paragraph> <Paragraph position="3"> The fifteen destinations include a destination &quot;other&quot; which is treated specially in that it is also taken to be the destination when the system rejects the user's input, for example because the confidence in the output of the speech recognizer is too low. Following previous work on this task, cited above, we present the results for each experimental condition as an ROC curve plotting the routing accuracy (on non-rejected utterances) as a function of the false rejection rate (the percentage of the samples incorrectly rejected); a classification by the system of &quot;other&quot; is considered equivalent to rejection.</Paragraph> <Paragraph position="4"> The dataset consists of 8,844 utterances, of which 1,000 were held out for testing. We refer to the remaining 7,844 utterances as the &quot;full training dataset&quot;.</Paragraph> <Paragraph position="5"> In the experiments, we vary two conditions: Input uncertainty: The input string to the interpretation component is either a human transcription of the spoken utterance or the output of a speech recognizer. The acoustic model used for automatic speech recognition was a general telephone speech HMM model in all cases. (For the full dataset, better results can be achieved by an application-specific acoustic model, as presented by Gorin et al. (1997) and confirmed by our results below.) Size of example set: We select progressively larger subsets of examples from the full training set, as well as showing results for the full training set itself. We wish to approximate the situation where an application developer uses typical examples for the initial context without knowing the distribution of call types.</Paragraph> <Paragraph position="6"> We therefore select k utterances for each destination, with k set to 3, 5, and 10, respectively.
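The per-destination selection of k example utterances can be sketched as follows. This is a minimal illustration only, not the paper's code: the function name select_examples, the (utterance, destination) pair representation, and the fixed random seed are our own assumptions; the preference for utterances that occur more than once reflects the selection provision described in the text.

```python
import random
from collections import Counter

def select_examples(utterances, k, seed=0):
    """Select up to k example utterances per destination.

    utterances: list of (text, destination) pairs.
    Utterances appearing more than once are preferred, as a proxy
    for choosing 'typical' utterances for a destination.
    """
    rng = random.Random(seed)
    by_dest = {}
    for text, dest in utterances:
        by_dest.setdefault(dest, []).append(text)
    selected = {}
    for dest, texts in by_dest.items():
        counts = Counter(texts)
        repeated = [t for t, c in counts.items() if c > 1]
        singles = [t for t, c in counts.items() if c == 1]
        rng.shuffle(repeated)  # otherwise random within each group
        rng.shuffle(singles)
        selected[dest] = (repeated + singles)[:k]
    return selected
```

In a real run the selected examples would then be expanded with automatically generated variants before being used as the initial context.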
This selection is random, except for the provision that utterances appearing more than once are preferred, to approximate the notion of a typical utterance. The selected examples are expanded by the addition of variants, as described earlier. For each value of k, the results shown are for the median of three runs.</Paragraph> <Paragraph position="7"> Figure 3 shows the routing accuracy ROC curves for transcribed input for k = 3, 5, 10 and for the full training dataset. These results for transcribed input were obtained with BoosTexter (Schapire and Singer, 2000) as the classifier module in our system because we have observed that BoosTexter generally outperforms our Phi classifier (mentioned earlier) for text input.</Paragraph> <Paragraph position="8"> Figure 4 shows the corresponding four ROC curves for recognition output, and an additional fifth graph (the top one) showing the improvement that is obtained with a domain-specific acoustic model coupled with a trigram language model. These results for recognition output were obtained with the Phi classifier module rather than BoosTexter; the Phi classifier performance is generally the same as, or slightly better than, BoosTexter when applied to recognition output. The language models used in the experiments for Figure 4 are derived from the example sets for k = 3, 5, 10 (lower three graphs) and for the full training set (upper two graphs), respectively. As described earlier, the language model for restricted numbers of examples is an unweighted one that recognizes sequences of substrings of the examples.
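An unweighted language model of this kind can be approximated at the word level as shown below. This is a hedged sketch under our own assumptions (word-level substrings, a simple dynamic-programming acceptance test, hypothetical function names); the paper's actual model operates inside the speech recognizer and is not specified in this section.

```python
def substrings_of(examples):
    """All contiguous word sequences (n-grams) of the example utterances."""
    subs = set()
    for ex in examples:
        words = ex.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                subs.add(tuple(words[i:j]))
    return subs

def accepts(subs, utterance):
    """True if the utterance can be segmented into a sequence of
    substrings of the examples (unweighted: no probabilities)."""
    words = utterance.split()
    n = len(words)
    ok = [False] * (n + 1)  # ok[i]: first i words are coverable
    ok[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            if ok[start] and tuple(words[start:end]) in subs:
                ok[end] = True
                break
    return ok[n]
```

For example, with the single example "i want to make a collect call", the model accepts "i want a collect call" (segmented as "i want" + "a collect call") but rejects any utterance containing a word outside the examples.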
For the full training set, statistical N-gram language models are used (N=3 for the top graph and N=2 for the second to top) since there is sufficient data in the full training set for such language models to be effective.</Paragraph> <Paragraph position="9"> Comparing the two figures, it can be seen that the performance shortfall from using small numbers of examples compared to the full training set is greater when speech recognition errors are included. This suggests that it might be advantageous to use the examples to adapt a general statistical language model. There also seem to be diminishing returns as k is increased from 3 to 5 to 10. A likely explanation is that expansion of examples by variants is progressively less effective as the size of the unexpanded set is increased. This is to be expected since additional real examples presumably are more faithful to the task than artificially generated variants.</Paragraph> </Section> </Paper>