<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0714">
  <Title>IMPROVEMENTS IN NON-VERBAL CUE IDENTIFICATION USING MULTILINGUAL PHONE STRINGS</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Today's state-of-the-art front-ends for multilingual speech-to-speech translation systems apply monolingual speech recognizers trained for a single language and/or accent.</Paragraph>
    <Paragraph position="1"> The monolingual speech engine is usually adaptable to an unknown speaker over time using unsupervised training methods; however, if the speaker was seen during training, their specialized acoustic model will be applied, since it achieves better performance. In order to make full use of specialized acoustic models in this proposed scenario, it is necessary to automatically identify the speaker with high accuracy. Furthermore, monolingual speech recognizers currently rely on the fact that language and/or accent will be selected beforehand by the user. This requires the user's cooperation and an interface which easily allows for such selection. Both requirements are awkward and error-prone, especially when translation services are provided for many languages using small devices like PDAs or telephones. For these scenarios, front-ends are desired which automatically identify the spoken language or accent. We believe that the automatic identification of an utterance's non-verbal cues, such as language, accent and speaker, are necessary to the successful deployment of speech-to-speech translation systems.</Paragraph>
    <Paragraph position="2"> Currently, approaches based on Gaussian Mixture Models (GMMs) [1] are the most widely and successfully used methods for speaker identification. Although GMMs have been applied successfully to close-speaking microphone scenarios under matched training and testing conditions, their performance degrades dramatically under mismatched conditions. For language and accent identification, phone recognition together with phone N-gram modeling has been the most successful approach in the past [2]. More recently, Kohler introduced an approach for speaker recognition where a phonotactic N-gram model is used [3].</Paragraph>
    <Paragraph position="3"> In [4], we extended Kohler's approach to accent and language identification as well as to speaker identification under mismatched conditions. The term &amp;quot;mismatched condition&amp;quot; describes a situation in which the testing conditions, e.g. microphone distance, are quite different from what had been seen during training. In that work, we explored a common framework for the identification of language, accent and speaker using multilingual phone strings produced by phone recognizers trained on data from different languages.</Paragraph>
    <Paragraph position="4"> In this paper, we propose and evaluate some improvements, comparing classification accuracy as well as realtime performance in our framework. Furthermore, we investigate the benefits that are to be drawn from additional phone recognizers. null</Paragraph>
  </Section>
class="xml-element"></Paper>