<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1002">
  <Title>Processing Broadcast Audio for Information Access</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Audio partitioning
</SectionTitle>
    <Paragraph position="0"> The goal of audio partitioning is to divide the acoustic signal into homogeneous segments, labeling and structuring the acoustic content of the data, and identifying and removing non-speech segments. The LIMSI BN audio partitioner relies on an audio stream mixture model (Gauvain et al., 1998). While it is possible to transcribe the continuous stream of audio data without any prior segmentation, partitioning offers several advantages over this straight-forward solution. First, in addition to the transcription of what was said, other interesting information can be extracted such as the division into speaker turns and the speaker identities, and background acoustic conditions. This information can be used both directly and indirectly for indexation and retrieval purposes. Second, by clustering segments from the same speaker, acoustic model adaptation can be carried out on a per cluster basis, as opposed to on a single segment basis, thus providing more adaptation data. Third, prior segmentation can avoid problems caused by linguistic discontinuity at speaker changes. Fourth, by using acoustic models trained on particular acoustic conditions (such as wide-band or telephone band), overall performance can be significantly improved. Finally, eliminating non-speech segments substantially reduces the computation time. The result of the partitioning process is a set of speech segments usually corresponding to speaker turns with speaker, gender and telephone/wide-band labels (see Figure 2).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Transcription of Broadcast News
</SectionTitle>
    <Paragraph position="0"> For each speech segment, the word recognizer determines the sequence of words in the segment, associating start and end times and an optional confidence measure with each word. The LIMSI system, in common with most of today's state-of-the-art systems, makes use of statistical models of speech generation. From this point of view, message generation is represented by a language model which provides an estimate of the probability of any given word string, and the encoding of the message in the acoustic signal is represented by a probability density function. The speaker-independent 65k word, continuous speech recognizer makes use of 4-gram statistics for language modeling and of continuous density hidden Markov models (HMMs) with Gaussian mixtures for acoustic modeling. Each word is represented by one or more sequences of context-dependent phone models as determined by its pronunciation.</Paragraph>
    <Paragraph position="1"> The acoustic and language models are trained on large, representative corpora for each task and language.</Paragraph>
    <Paragraph position="2"> Processing time is an important factor in making a speech transcription system viable for automatic indexation of radio and television broadcasts. For many applications there are limitations on the response time and the available computational resources, which in turn can significantly affect the design of the acoustic and language models. Word recognition is carried out in one or more decoding passes with more accurate acoustic and language models used in successive passes. A 4-gram single pass dynamic network decoder has been developed (Gauvain and Lamel, 2000) which can achieve faster than real-time decoding with a word error under 30%, running in less than 100 Mb of memory on widely available platforms such Pentium III or Alpha machines.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Multilinguality
</SectionTitle>
    <Paragraph position="0"> A characteristic of the broadcast news domain is that, at least for what concerns major news events, similar topics are simultaneously covered in different emissions and in different countries and languages. Automatic processing carried out on contemporaneous data sources in different languages can serve for multi-lingual indexation and retrieval. Multilinguality is thus of particular interest for media watch applications, where news may first break in another country or language.</Paragraph>
    <Paragraph position="1"> At LIMSI broadcast news transcription systems have been developed for the American English, French, German, Mandarin and Portuguese languages. The Mandarin language was chosen because it is quite different from the other languages (tone and syllable-based), and Mandarin resources are available via the LDC as well as reference performance results.</Paragraph>
    <Paragraph position="2"> Our system and other state-of-the-art systems can transcribe unrestricted American English broadcast news data with word error rates under 20%. Our transcription systems for French and German have comparable error rates for news broadcasts (Adda-Decker et al., 2000). The character error rate for Mandarin is also about 20% (Chen et al., 2000). Based on our experience, it appears that with appropriately trained models, recognizer performance is more dependent upon the type and source of data, than on the language. For example, documentaries are particularly challenging to transcribe, as the audio quality is often not very high, and there is a large proportion of voice over.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Spoken Document Retrieval
</SectionTitle>
    <Paragraph position="0"> The automatically generated partition and word transcription can be used for indexation and information retrieval purposes. Techniques commonly applied to automatic text indexation can be applied to the automatic transcriptions of the broadcast news radio and TV documents. These techniques are based on document term frequencies, where the terms are obtained after standard text processing, such as text normalization, tokenization, stopping and stemming. Most of these preprocessing steps are the same as those used to prepare the texts for training the speech recognizer language models. While this offers advantages for speech recognition, it can lead to IR errors. For better IR results, some words sequences corresponding to acronymns, multiword named-entities (e.g. Los Angeles), and words preceded by some particular prefixes (anti, co, bi, counter) are rewritten as a single word. Stemming is used to reduce the number of lexical items for a given word sense. The stemming lexicon contains about 32000 entries and was constructed using Porter's algorithm (Porter80, 1980) on the most frequent words in the collection, and then manually corrected. null The information retrieval system relies on a un- null homogeneous acoustic segments, removing non-speech portions. The word recognizer identifies the words in each speech segment, associating time-markers with each word.  it is a day of final farewells in alabama the first funerals for victims of this week's tornadoes are being held today along with causing massive property damage the twisters killed thirty three people in alabama five in georgia and one each in mississippi and north carolina the national weather service says the tornado that hit jefferson county in alabama had winds of more than two hundred sixty miles per hour authorities speculated was the most powerful tornado ever to hit the southeast twisters destroyed two churches to fire stations and a school parishioners were in one church when the tornado  broadcasted on April 11, 1998 at 4pm. The output includes the partitioning and transcription results. To improve readability, word time stamps are given only for the first 6 words. Non speech segments have been removed and the following information is provided for each speech segment: signal bandwidth (telephone or wideband), speaker gender, and speaker identity (within the show).  mean average precision using using a 1-gram document model. The document collection contains 557 hours of broadcast news from the period of February through June 1998. (21750 stories, 50 queries with the associated relevance judgments.) igram model per story. The score of a story is obtained by summing the query term weights which are simply the log probabilities of the terms given the story model once interpolated with a general English model. This term weighting has been shown to perform as well as the popular TFa2 IDF weighting scheme (Hiemstra and Wessel, 1998; Miller et al., 1998; Ng, 1999; Sp&amp;quot;ark Jones et al., 1998).</Paragraph>
    <Paragraph position="1"> The text of the query may or may not include the index terms associated with relevant documents. One way to cope with this problem is to use query expansion (Blind Relevance Feedback, BRF (Walker and de Vere, 1990)) based on terms present in retrieved contemporary texts.</Paragraph>
    <Paragraph position="2"> The system was evaluated in the TREC SDR track, with known story boundaries. The SDR data collection contains 557 hours of broadcast news from the period of February through June 1998. This data includes 21750 stories and a set of 50 queries with the associated relevance judgments (Garofolo et al., 2000).</Paragraph>
    <Paragraph position="3"> In order to assess the effect of the recognition time on the information retrieval results we transcribed the 557 hours of broadcast news data using two decoder configurations: a single pass 1.4xRT system and a three pass 10xRT system.</Paragraph>
    <Paragraph position="4"> The word error rates are measured on a 10h test subset (Garofolo et al., 2000). The information retrieval results are given in terms of mean average precision (MAP), as is done for the TREC benchmarks in Table 1 with and without query expansion. For comparison, results are also given for manually produced closed captions. With query expansion comparable IR results are obtained using the closed captions and the 10xRT  radio and TV sources (NPR, ABC, CNN, CSPAN) from May-June 1996.</Paragraph>
    <Paragraph position="5"> transcriptions, and a moderate degradation (4% absolute) is observed using the 1.4xRT transcriptions. null 7 Locating Story Boundaries  The broadcast news transcription system also provides non-lexical information along with the word transcription. This information is available in the partition of the audio track, which identifies speaker turns. It is interesting to see whether or not such information can be used to help locate story boundaries, since in the general case these are not known. Statistics were made on 100 hours of radio and television broadcast news with manual transcriptions including the speaker identities. Of the 2096 sections manually marked as reports (considered stories), 40% start without a manually annotated speaker change. This means that using only speaker change information for detecting document boundaries would miss 40% of the boundaries. With automatically detected speaker changes, the number of missed boundaries would certainly increase. At the same time, 11,160 of the 12,439 speaker turns occur in the middle of a document, resulting in a false alarm rate of almost 90%. A more detailed analysis shows that about 50% of the sections involve a single speaker, but that the distribution of the number of speaker turns per section falls off very gradually (see Figure 3). False alarms are not as harmful as missed detections, since it may be possible to merge adjacent turns into a single document in subsequent processing. These results show that even perfect  100 hours of data from May-June 1996 (top) and for 557 hours from February-June 1998 (bottom).</Paragraph>
    <Paragraph position="6"> speaker turn boundaries cannot be used as the primary cue for locating document boundaries. They can, however, be used to refine the placement of a document boundary located near a speaker change.</Paragraph>
    <Paragraph position="7"> We also investigated using simple statistics on the durations of the documents. A histogram of the 2096 sections is shown in Figure 4. One third of the sections are shorter than 30 seconds.</Paragraph>
    <Paragraph position="8"> The histogram has a bimodal distribution with a sharp peak around 20 seconds, and a smaller, flat peak around 2 minutes. Very short documents are typical of headlines which are uttered by single speaker, whereas longer documents are more likely to contain data from multiple talkers. This distribution led us to consider using a multi-scale segmentation of the audio stream into documents.</Paragraph>
    <Paragraph position="9"> Similar statistics were measured on the larger corpus (Figure 4 bottom).</Paragraph>
    <Paragraph position="10"> As proposed in (Abberley et al., 1999; Johnson et al., 1999), we segment the audio stream into overlapping documents of a fixed duration.</Paragraph>
    <Paragraph position="11"> As a result of optimization, we chose a 30 second window duration with a 15 second overlap.</Paragraph>
    <Paragraph position="12"> Since there are many stories significantly shorter than 30s in broadcast shows (see Figure 4) we conjunctured that it may be of interest to use a double windowing system in order to better target short stories (Gauvain et al., 2000). The window size of the smaller window was selected to be 10 seconds. So for each query, we independently retrieved two sets of documents, one set for each window size. Then for each document set, document recombination is done by merging overlapping documents until no further merges are possible. The score of a combined document is set to maximum score of any one of the components. For each document derived from the 30s windows, we produce a time stamp located at the center point of the document. However, if any smaller documents are embedded in this document, we take the center of the best scoring document. This way we try to take advantage of both window sizes. The MAP using a single 30s window and the double windowing strategy are shown in Table 2. For comparison, the IR results using the manual story segmentation and the speaker turns located by the audio partitioner are also given. All conditions use the same word hypotheses obtained with a speech recognizer which had no knowledge about the story boundaries.</Paragraph>
    <Paragraph position="13"> manual segmentation (NIST) 59.6% audio partitioner 33.3%  automatically determined story boundaries. The document collection contains 557 hours of broadcast news from the period of February through June 1998. (21750 stories, 50 queries with the associated relevance judgments.) From these results we can clearly see the interest of using a search engine specifically designed to retrieve stories in the audio stream. Using an a priori acoustic segmentation, the mean average precision is significantly reduced compared to a &amp;quot;perfect&amp;quot; manual segmentation, whereas the window-based search engine results are much closer. Note that in the manual segmentation all non-story segments such as advertising have been removed. This reduces the risk of having out-oftopic hits and explains part of the difference between this condition and the other conditions.</Paragraph>
    <Paragraph position="14"> The problem of locating story boundaries is being further pursued in the context of the ALERT project, where one of the goals is to identify &amp;quot;documents&amp;quot; given topic profiles. This project is investigating the combined use of audio and video segmentation to more accurately locate document boundaries in the continuous data stream.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Recent Research Projects
</SectionTitle>
    <Paragraph position="0"> The work presented in this paper has benefited from a variety of research projects both at the European and National levels. These collaborative efforts have enabled access to real-world data allowing us to develop algorithms and models well-suited for near-term applications.</Paragraph>
    <Paragraph position="1"> The European project LE-4 OLIVE: A Multilingual Indexing Tool for Broadcast Material Based on Speech Recognition (http://twentyone.tpd.tno.nl/ olive/) addressed methods to automate the disclosure of the information content of broadcast data thus allowing content-based indexation. Speech recognition was used to produce a time-linked transcript of the audio channel of a broadcast, which was then used to produce a concept index for retrieval.</Paragraph>
    <Paragraph position="2"> Broadcast news transcription systems for French and German were developed. The French data come from a variety of television news shows and radio stations. The German data consist of TV news and documentaries from ARTE. OLIVE also developed tools for users to query the database, as well as cross-lingual access based on off-line machine translation of the archived documents, and online query translation.</Paragraph>
    <Paragraph position="3"> The European project IST ALERT: Alert system for selective dissemination (http://www.fb9ti.uni-duisburg.de/alert) aims to associate state-of-the-art speech recognition with audio and video segmentation and automatic topic indexing to develop an automatic media monitoring demonstrator and evaluate it in the context of real world applications. The targeted languages are French, German and Portuguese. Major mediamonitoring companies in Europe are participating in this project.</Paragraph>
    <Paragraph position="4"> Two other related FP5 IST projects are: CORE-TEX: Improving Core Speech Recognition Technology and ECHO: European CHronicles Online. CORETEX (http://coretex.itc.it/), aims at improving core speech recognition technologies, which are central to most applications involving voice technology. In particular the project addresses the development of generic speech recognition technology and methods to rapidly port technology to new domains and languages with limited supervision, and to produce enriched symbolic speech transcriptions. The ECHO project (http://pc-erato2.iei.pi.cnr.it/echo) aims to develop an infrastructure for access to historical films belonging to large national audiovisual archives. The project will integrate state-of-the-art language technologies for indexing, searching and retrieval, cross-language retrieval capabilities and automatic film summary creation.</Paragraph>
  </Section>
class="xml-element"></Paper>