<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1060">
  <Title>A New Paradigm for Speaker-Independent Training and Speaker Adaptation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> One important scenario for the use of spoken language systems (SLS) by new speakers is to start with a speaker-independent (SI) corpus or model and have the system adapt as the new users interact with the system. Once the interaction has begun, the system has the opportunity to collect speaker-dependent data of known orthographic transcription from the target speaker.</Paragraph>
    <Paragraph position="1"> After a small sample of speech has been collected, the system should be able to adapt so as to significantly increase performance compared to the original SI model. The success of this scenario depends on the adaptation being powerful enough to generalize from a small sample of speaker-specific speech in which most of the phonetic contexts of the language are not observed. Furthermore, it depends on having an SI speech corpus which is amenable to speaker adaptation.</Paragraph>
    <Paragraph position="2"> It is a widely held belief that speech used for training SI models must be collected from many speakers. It is also commonly accepted that collecting only a small sample of speech from each training speaker is a reasonable compromise to make in the effort to collect as many speakers as possible. While this compromise may be reasonable for SI recognition, several efforts to use such a corpus as a basis for speaker adaptation have failed to make significant improvements.</Paragraph>
    <Paragraph position="3"> Recently, we have discovered that adequate SI performance can be achieved with far less speaker coverage than conventionally thought necessary, but with much better sampling of each training speaker's speech. Specifically, we show that it is possible to achieve near state-of-the-art SI performance on a 1000-word continuous speech recognition task using only 12 training speakers. Furthermore, we will show that it is possible and advantageous to create the SI model from a set of independently trained speaker-dependent (SD) models, without retraining on the entire pooled dataset at one time. Most importantly, we show that such an SI corpus is an effective basis for speaker adaptation. By combining the adapted models of 11 reference speakers, we were able to reduce the error rate by 45% compared to the SI performance. This method succeeds because we are able to apply a robust probabilistic speaker transformation to well-trained and highly discriminating SD training models.</Paragraph>
    <Paragraph position="4"> In section 2, we describe the new SI training paradigm and present comparative results for SI recognition using only 12 training speakers. In section 3, we describe three previous attempts to adapt from a corpus of many training speakers.</Paragraph>
    <Paragraph position="5"> Then we describe our approach for adapting to new speakers from the 12 speaker SI corpus and discuss experimental results.</Paragraph>
  </Section>
</Paper>