<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2029">
  <Title>Predicting Automatic Speech Recognition Performance Using Prosodic Cues</Title>
  <Section position="6" start_page="222" end_page="223" type="concl">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> A statistical comparison of recognized versus misrecognized utterances indicates that F0 excursion, loudness, longer prior pause, and longer duration are significant prosodic characteristics of both WER-defined and CA-defined failed recognition attempts. Results from a set of machine learning experiments show that prosodic differences can in fact be used to improve the prediction of misrecognitions, with a high degree of accuracy (12.76% error) for WER-based misrecognitions and an even higher degree of accuracy (6.53% error) when combined with information currently available from ASR systems. ASR confidence scores alone gave a prediction error of 22.23% for WER-based misrecognitions, so the improvement over traditional methods is considerable. For CA-defined misrecognitions, the improvement provided by prosodic features is considerably smaller. One of our future research directions will be to understand this difference.</Paragraph>
    <Paragraph position="1"> Another future direction will be to address the issue of just why prosodic features provide such useful indicators of recognition failure. Do the features themselves make recognition difficult, or are they instead indirect correlates of other phenomena not captured in our study? While the negative influence of speaking rate variation on ASR has been reported before (e.g. (Ostendorf et al., 1996)), it is traditionally assumed that ASR is impervious to differences in F0 and RMS; yet it is known that F0 and RMS variations co-vary to some extent with spectral characteristics (e.g. (Swerts and Veldhuis, 1997; Fant et al., 1995)), so it is not unlikely that utterances with extreme values for these features differ critically from the training data. Other prosodic features may be more indirect indicators of errors. Longer utterances may simply provide more chance for error than shorter ones, while speakers who pause longer before utterances and take more time making them may also produce more disfluencies than others.</Paragraph>
    <Paragraph position="2"> We are currently replicating our experiment on a new domain with a new speech recognizer. We are examining the W99 corpus, which was collected in a  This system employed the AT&amp;T WATSON speech recognition technology (Sharp et al., 1997). Preliminary results indicate that our TOOT results do in fact hold up across recognizers. We are also extending our TOOT corpus analysis to include prosodic analyses of turns in which users become aware of misrecognitions and correct them. In addition, we are exploring whether prosodic differences can help explain the &quot;goat&quot; phenomenon -- the fact that some voices are recognized much more poorly than others (Doddington et al., 1998; Hirschberg et al., 1999). Our ultimate goal is to provide prosodically-based mechanisms for identifying and reacting to ASR failures in SDS systems.</Paragraph>
  </Section>
</Paper>