<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2029"> <Title>Predicting Automatic Speech Recognition Performance Using Prosodic Cues</Title> <Section position="2" start_page="0" end_page="218" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> One of the central tasks of the dialogue manager in most current spoken dialogue systems (SDSs) is error handling. The automatic speech recognition (ASR) component of such systems is prone to error, especially when the system must operate in noisy conditions or when the domain of the system is large.</Paragraph> <Paragraph position="1"> Given that ASR errors cannot be fully prevented, it is important for a system to know how likely a speech recognition hypothesis is to be correct, so that it can take appropriate action; this matters all the more because users have considerable difficulty correcting incorrect information that the system presents as true (Krahmer et al., 1999). Such action may include verifying the user's input, reprompting for fresh input, or, in cases where many errors have occurred, changing the interaction strategy or switching the caller to a human attendant (Smith, 1998; Litman et al., 1999; Langkilde et al., 1999). Traditionally, the decision to reject a recognition hypothesis is based on acoustic confidence score thresholds, which provide a reliability measure on the hypothesis and are set in the application (Zeljkovic, 1996). However, this process often fails: there is no simple one-to-one mapping between low confidence scores and incorrect recognitions, and setting a rejection threshold remains a matter of trial and error (Bouwman et al., 1999). Also, some incorrect recognitions do not necessarily lead to misunderstandings at a conceptual level (e.g., &quot;a.m.&quot; recognized as &quot;in the morning&quot;). The current paper looks at prosody as one possible predictor of ASR performance. 
ASR performance is known to vary with speaking style (Weintraub et al., 1996), speaker gender and age, native versus non-native speaker status, and, in general, the deviation of new speech from the training data. Some of this variation is linked to prosody, as prosodic differences have been found to characterize differences in speaking style (Blaauw, 1992) as well as idiosyncratic differences between speakers (Kraayeveld, 1997). Several other studies (Wade et al., 1992; Oviatt et al., 1996; Swerts and Ostendorf, 1997; Levow, 1998; Bell and Gustafson, 1999) report that hyperarticulated speech, characterized by careful enunciation, slowed speaking rate, and increased pitch and loudness, often occurs when users in human-machine interactions try to correct system errors. Still others have shown that such speech also decreases recognition performance (Soltau and Waibel, 1998). Prosodic features have also proven effective for ranking recognition hypotheses, serving as a post-processing filter to score ASR hypotheses (Hirschberg, 1991; Veilleux, 1994; Hirose, 1997).</Paragraph> <Paragraph position="4"> In this paper we present results of empirical studies testing the hypothesis that prosodic features provide an important clue to ASR performance. We first present results comparing prosodic analyses of correctly and incorrectly recognized speaker turns in TOOT, an experimental SDS for obtaining train information over the phone. We then describe machine learning experiments based on these results that explore the predictive power of prosodic features alone and in combination with other automatically available information, including ASR confidence scores and the recognized string. Our results indicate that there are significant prosodic differences between correctly and incorrectly recognized utterances, and that these differences can in fact be used to predict, with a high degree of accuracy, whether an utterance has been misrecognized.</Paragraph> </Section> </Paper>