<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1002"> <Title>Processing Broadcast Audio for Information Access</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Progress in LVCSR </SectionTitle> <Paragraph position="0"> Substantial advances in speech recognition technology have been achieved during the last decade.</Paragraph> <Paragraph position="1"> Only a few years ago speech recognition was primarily associated with small-vocabulary isolated-word recognition and with speaker-dependent (often also domain-specific) dictation systems. Today the same core technology serves as the basis for a range of applications such as voice-interactive database access or limited-domain dictation, as well as more demanding tasks such as the transcription of broadcast data. With the exception of the inherent variability of telephone channels, for most applications it is reasonable to assume that the speech is produced in relatively stable environmental conditions and, in some cases, is spoken with the purpose of being recognized by a machine.</Paragraph> <Paragraph position="2"> The ability of systems to deal with non-homogeneous data such as broadcast audio (changing speakers, languages, backgrounds, and topics) has been enabled by advances in a variety of areas: techniques for robust signal processing and normalization; improved training techniques that can take advantage of very large audio and textual corpora; algorithms for audio segmentation; unsupervised acoustic model adaptation; efficient decoding with long-span language models; and the ability to use much larger vocabularies than in the past (64k words or more is now common) to reduce errors due to out-of-vocabulary words.</Paragraph> <Paragraph position="3"> With the rapid expansion of media sources for information dissemination, including the internet, there is a pressing need for automatic processing of the audio data stream. 
The vast majority of audio and video documents that are produced and broadcast have no associated annotations for indexation and retrieval purposes, and since most of today's annotation methods require substantial manual intervention, the cost of treating the ever-increasing volume of documents is prohibitive. Broadcast audio is challenging to process because it contains segments of varied acoustic and linguistic nature, each of which requires appropriate modeling. Transcribing such data requires significantly more processing power than transcribing read speech recorded in a controlled environment, as in speaker-adapted dictation. Although it is usually assumed that processing time is not a major issue because computer power keeps increasing, the amount of data appearing on information channels is increasing at a comparable rate. Processing time is therefore an important factor in making a speech transcription system viable for audio data mining and other related applications. Transcription word error rates of about 20% have been reported for unrestricted broadcast news data in several languages.</Paragraph> <Paragraph position="4"> As shown in Figure 1, the LIMSI broadcast news transcription system for automatic indexation consists of an audio partitioner and a speech recognizer.</Paragraph> </Section> </Paper>
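The partition-then-recognize design mentioned in the final paragraph can be sketched as follows. This is a minimal illustration, not the LIMSI system: a toy energy-based partitioner splits an audio stream into homogeneous speech/silence segments that would then be handed to a recognizer. The frame length, energy threshold, and synthetic signal are all illustrative assumptions.

```python
import math

def frame_energies(samples, frame_len=160):
    """Mean squared energy of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def partition(samples, frame_len=160, threshold=0.01):
    """Label contiguous runs of frames as 'speech' or 'silence'.

    Returns a list of (label, start_frame, end_frame) tuples; in a real
    system each 'speech' segment would be passed on to the recognizer.
    """
    segments = []
    for idx, e in enumerate(frame_energies(samples, frame_len)):
        label = "speech" if e > threshold else "silence"
        if segments and segments[-1][0] == label:
            # Extend the current segment.
            segments[-1] = (label, segments[-1][1], idx + 1)
        else:
            segments.append((label, idx, idx + 1))
    return segments

# Synthetic stream: silence, a sinusoidal "speech" burst, silence.
stream = ([0.0] * 1600
          + [0.5 * math.sin(0.3 * n) for n in range(1600)]
          + [0.0] * 1600)
print(partition(stream))
# → [('silence', 0, 10), ('speech', 10, 20), ('silence', 20, 30)]
```

A real partitioner would of course use richer acoustic features and also separate speakers, languages, and background conditions, but the segment-labeling structure is the same.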