<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1059">
  <Title>Portability Issues for Speech Recognition Technologies</Title>
  <Section position="8" start_page="2" end_page="2" type="concl">
    <SectionTitle>
7. CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> This paper has explored methods to reduce the cost of developing models for speech recognizers. Two main axes have been explored: developing generic acoustic models and the use of low cost data for acoustic model training.</Paragraph>
    <Paragraph position="1"> We have explored the genericity of state-of-the-art speech recognition systems, by testing a relatively wide-domain system on data from three tasks ranging in complexity. The generic models were taken from the broadcast news task which covers a wide range of acoustic and linguistic conditions. These acoustic models are relatively task-independent as there is only a small increase in word error relative to the word error obtained with task-dependent acoustic models, when a task-dependent language model is used. There remains a large difference in performance on the digit recognition task which can be attributed to the limited phonetic coverage of this task. On a spontaneous WSJ dictation task, the broadcast news acoustic and language are more robust to deviations in speaking style than the read-speech WSJ models. We also have shown that unsupervised acoustic model adaptation can reduce the performance gap between task-independent and task-dependent acoustic models, and that supervised adaptation of generic models can lead to better performance than that achieved with task-specific models.</Paragraph>
    <Paragraph position="2"> Both supervised and unsupervised adaptation are less effective for the digits task indicating that these may be a special case.</Paragraph>
    <Paragraph position="3"> We have investigated the use of low cost data to train acoustic models for broadcast news transcription, with supervision provided the language models. Recognition results obtained with acoustic models trained on large quantities of automatically annotated data are comparable (under a 10% relative increase in word error) to results obtained with acoustic models trained on large quantities of manually annotated data. Given the significantly higher cost of detailed manual transcription (substantially more time consuming than producing commercial transcripts, and more expensive since closed captions and commercial transcripts are produced for other purposes), such approaches are very promising as they require substantial computation time, but little manual effort. Another advantage offered by this approach is that there is no need to extend the pronunciation lexicon to cover all words and word fragments occurring in the training data. By eliminating the need for manual transcription, automated training can be applied to essentially unlimited quantities of task-specific training data. While the focus of our work has been on reducing training costs and task portability, we have been exploring these in a multi-lingual context.</Paragraph>
  </Section>
class="xml-element"></Paper>