<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1158">
  <Title>Efficient Confirmation Strategy for Large-scale Text Retrieval Systems with Spoken Dialogue Interface</Title>
  <Section position="5" start_page="3" end_page="5" type="evalu">
    <SectionTitle>
4 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> We implemented and evaluated our method as a front-end of Dialog Navigator. The front-end works on a Web browser, Internet Explorer 6.0.</Paragraph>
    <Paragraph position="1"> Julius (Lee et al., 2001) for SAPI  was used as a speech recognizer on PCs. The system presents a confirmation to users on the display. He or she replies to the confirmation by selecting choices with the mouse.</Paragraph>
    <Paragraph position="2">  Confirmation will be generated practically if one of the significance scores between the first candidate and others exceeds the threshold.</Paragraph>
    <Paragraph position="3">  1. Input the present date and time in Word 2. WORD: Add a space between Japanese and alphanumeric characters 3. WORD: Check the form of inputted characters null 4. WORD: Input a handwritten signature 5. WORD: Put watermark characters into the background of a character 6. ...</Paragraph>
    <Paragraph position="4"> [#2 candidate of ASR] &amp;quot;WORD2002 de suushiki wo nyuryoku suru houhou wo oshiete kudasai.&amp;quot;(Pleasetel me the way to input numerical expressions in WORD 2002.) Retrieval results (# of the results was 15.) 1. Insert numerical expressions in Word 2. Input the present date and time in Word 3. Input numerical expressions in Spreadsheet 4. Input numerical expressions in PowerPoint 5. Input numerical expressions in Excel 6. ...</Paragraph>
    <Paragraph position="5">  We collected the test data by 30 subjects who had not used our system. Each subject was requested to retrieve support information for 14 tasks, which consisted of 11 prepared scenarios (query sentences are not given) and 3 spontaneous queries. Subjects were allowed to utter the sentence again up to 3 times per task if a relevant retrieval result was not obtained. We obtained 651 utterances for 420 tasks in total. The average word accuracy of the ASR was 76.8%.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Evaluation of Success Rate of
Retrieval
</SectionTitle>
      <Paragraph position="0"> We calculated the success rates of retrieval for the collected speech data. We regarded the retrieval as having succeeded when the retrieval results contained an answer for the user's initial question. We set three experimental conditions:  1. Transcription: A correct transcription of user utterances, which was made manually, was used as an input to the Dialog Navigator. This condition corresponds to a case of 100% ASR accuracy, indicating an utmost performance obtained by improvements in the ASR and our dialogue strategy.</Paragraph>
      <Paragraph position="1"> 2. ASR results: The first candidate of the ASR was used as an input (baseline).</Paragraph>
      <Paragraph position="2"> 3. Our method: The N-best candidates of the  ASR were used as an input, and confirmation was generated based on our method using both the relevance and significance scores. It was assumed that the users responded appropriately to the generated confirmation.</Paragraph>
      <Paragraph position="3"> Table 2 lists the success rate. The rate when the transcription was used as the input was 79.9%. The remaining errors included those caused by irrelevant user utterances and those in the text retrieval system. Our method attained a better success rate than the condition where the first candidate of the ASR was used. Improvement of 36 cases (5.5%) was obtained by our method, including 30 by the confirmations and 14 by weighting during the matching using the relevance score, though the retrieval failed eight times as side effects of the weighting. We further investigated the results shown in Table 2. Table 3 lists the relations between the success rate of the retrieval and the accuracy of the ASR per utterance. The improvement rate out of the number of utterances was rather high between 40% and 60%. This means that our method was effective not only for utterances with high ASR accuracy but also for those with around 50% accuracy. That is, an appropriate confirmation was generated even for utterances whose ASR accuracy was not very high.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
4.2 Evaluation of Confirmation
Efficiency
</SectionTitle>
      <Paragraph position="0"> We also evaluated our method from the number of generated confirmations. Our method generated 221 confirmations. This means that confirmations were generated once every three utterances on the average. The 221 confirmations consisted of 66 prior to the retrieval using the relevance score and 155 posterior to the retrieval using the significance score.</Paragraph>
      <Paragraph position="1"> We compared our method with a conventional one, which used a confidence measure (CM) based on N-best candidates of the ASR (Komatani and Kawahara, 2000)  . In this method, the system generated confirmation only for content words with a confidence measure lower  ) were set to 0.4, 0.6, and 0.8. If a content word that was confirmed was rejected by the user, the retrieval was executed after removing a phrase that included it.</Paragraph>
      <Paragraph position="2"> The number of confirmations and retrieval successes are shown in Table 4. Our method achieved a higher success rate with a less number of confirmations (less than half) compared with the case of th  =0.8 in the conventional method. Thus, the generated confirmations based on the two scores were more efficient. The confidence measure used in the conventional method only reflects the acoustic and linguistic likelihood of the ASR results. Our method, however, reflects the domain knowledge because the two scores are derived by either a language model trained with the target knowledge base or by retrieval results for the N-best candidates. The domain knowledge can be introduced without any manual deliberation. The experimental results show that the scores are appropriate to determine whether a confirmation should be generated or not.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>