<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1059">
  <Title>Portability Issues for Speech Recognition Technologies</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The last decade has seen impressive advances in the capability and performance of speech recognizers. Todays state-of-the-art systems are able to transcribe unrestricted continuous speech from broadcast data with acceptable performance. The advances arise from the increased accuracy and complexity of the models, which are closely related to the availability of large spoken and text corpora for training, and the wide availability of faster and cheaper computational means which have enabled the development and implementation of better training and decoding algorithms. Despite the extent of progress over the recent years, recognition accuracy is still extremely sensitive to the environmental conditions and speaking style: channel quality, speaker characteristics, and background This work was partially financed by the European Commission under the IST-1999 Human Language Technologies project 11876 Coretex.</Paragraph>
    <Paragraph position="1"> .</Paragraph>
    <Paragraph position="2"> noise have an important impact on the acoustic component of the speech recognizer, whereas the speaking style and the discourse domain have a large impact on the linguistic component.</Paragraph>
    <Paragraph position="3"> In the context of the EC IST-1999 11876 project CORETEX we are investigating methods for fast system development, as well as development of systems with high genericity and adaptability. By fast system development we refer to: language support, i.e., the capability of porting technology to different languages at a reasonable cost; and task portability, i.e. the capability to easily adapt a technology to a new task by exploiting limited amounts of domain-specific knowledge. Genericity and adaptability refer to the capacity of the technology to work properly on a wide range of tasks and to dynamically keep models up to date using contemporary data.</Paragraph>
    <Paragraph position="4"> The more robust the initial generic system is, the less there is a need for adaptation. Concerning the acoustic modeling component, genericity implies that it is robust to the type and bandwidth of the channel, the acoustic environment, the speaker type and the speaking style. Unsupervised normalization and adaptation techniques evidently should be used to enhance performance further when the system is exposed to data of a particular type.</Paragraph>
    <Paragraph position="5"> With today's technology, the adaptation of a recognition system to a new task or new language requires the availability of sufficient amount of transcribed training data. When changing to new domains, usually no exact transcriptions of acoustic data are available, and the generation of such transcribed data is an expensive process in terms of manpower and time. On the other hand, there often exist incomplete information such as approximate transcriptions, summaries or at least key words, which can be used to provide supervision in what can be referred to as &amp;quot;informed speech recognition&amp;quot;. Depending on the level of completeness, this information can be used to develop confidence measures with adapted or trigger language models or by approximate alignments to automatic transcriptions. Another approach is to use existing recognizer components (developed for other tasks or languages) to automatically transcribe task-specific training data. Although in the beginning the error rate on new data is likely to be rather high, this speech data can be used to re-train a recognition system. If carried out in an iterative manner, the speech data base for the new domain can be cumulatively extended over time without direct manual transcription. null The overall objective of the work presented here is to reduce the speech recognition development cost. One aspect is to develop &amp;quot;generic&amp;quot; core speech recognition technology, where by &amp;quot;generic&amp;quot; we mean a transcription engine that will work reasonably well on a wide range of speech transcription tasks, ranging from digit recognition to large vocabulary conversational telephony speech, without the need for costly task-specific training data. To start with we assess the genericity of wide domain models under cross-task con- null ditions, i.e., by recognizing task-specific data with a recognizer developed for a different task. We chose to evaluate the performance of broadcast news acoustic and language models, on three commonly used tasks: small vocabulary recognition (TI-digits), read and spontaneous text dictation (WSJ), and goal-oriented spoken dialog (ATIS). The broadcast news task is quite general, covering a wide variety of linguistic and acoustic events in the language, ensuring reasonable coverage of the target task. In addition, there are sufficient acoustic and linguistic training data available for this task that accurate models covering a wide range of speaker and language characteristics can be estimated.</Paragraph>
    <Paragraph position="6"> Another research area is the investigation of lightly supervised techniques for acoustic model training. The strategy taken is to use a speech recognizer to transcribe unannotated data, which are then used to estimate more accurate acoustic models. The light supervision is applied to the broadcast news task, where unlimited amounts of acoustic training data are potentially available. Finally we apply the lightly supervised training idea as a transparent method for adapting the generic models to a specific task, thus achieving a higher degree of genericity. In this work we focus on reducing training costs and task portability, and do not address language transfer.</Paragraph>
    <Paragraph position="7"> We selected the LIMSI broadcast news (BN) transcription system as the generic reference system. The BN task covers a large number of different acoustic and linguistic situations: planned to spontaneous speech; native and non-native speakers with different accents; close-talking microphones and telephone channels; quiet studio, on-site reports in noisy places to musical background; and a variety of topics. In addition, a lot of training resources are available including a large corpus of annotated audio data and a huge amount of raw audio data for the acoustic modeling; and large collections of closed-captions, commercial transcripts, newspapers and newswires texts for linguistic modeling. The next section provides an overview of the LIMSI broadcast news transcription system used as our generic system.</Paragraph>
  </Section>
class="xml-element"></Paper>