<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1119"> <Title>SEARCHING THE AUDIO NOTEBOOK: KEYWORD SEARCH IN RECORDED CONVERSATIONS</Title> <Section position="9" start_page="951" end_page="952" type="evalu"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="951" end_page="952" type="sub_section"> <SectionTitle> 6.1 Setup </SectionTitle> <Paragraph position="0"> We evaluated our system on five corpora of recorded conversations: * one meeting corpus (NIST &quot;RT04S&quot; development data set, ICSI portion (NIST, 2000-2004)) * two evaluation sets from the Switchboard (SWBD) data collection (&quot;eval 2000&quot; and &quot;RT03S&quot; (NIST, 2000-2004)) * two in-house sets of interview recordings of about one hour each, one recorded over the telephone and one using a single microphone mounted on the interviewee's lapel.</Paragraph> <Paragraph position="1"> For each data set, a keyword list was selected by an automatic procedure (Seide and Yu, 2004): words and multi-word phrases were selected from the reference transcriptions if they occurred in at most two segments. Example keywords are overseas, olympics, and &quot;automated accounting system&quot;. For the purpose of evaluation, the data sets were cut into segments of about 15 seconds each. The size of each corpus, its number of segments, and the size of the selected keyword set are given in Table 1. The acoustic model we used was trained on 309 hours of the Switchboard corpus (SWBD-1). The LVCSR language model was trained on the transcriptions of the Switchboard training set, the ICSI-meeting training set, and the LDC Broadcast News 96 and 97 training sets. No dedicated training data was available for the in-house interview recordings. The recognition dictionary has 51,388 words. The phonetic language model was trained on the phonetic version of the transcriptions of SWBD-1 and Broadcast News 96 plus about 87,000 background-dictionary entries, a total of 11.8 million phoneme tokens. To measure search accuracy, we use the &quot;Figure Of Merit&quot; (FOM) metric defined by NIST for word-spotting evaluations. In its original form, it is the average of the detection/false-alarm curve taken over the range of [0..10] false alarms per hour per keyword. Because manual word-level alignments of our test sets were not available, we modified the FOM such that a correct hit is a 15-second segment that contains the key phrase.</Paragraph> <Paragraph position="2"> Besides FOM, we use a second metric, &quot;Top Hit Precision&quot; (THP), defined as the rate at which the top-ranked hit is correct. If no hit is returned for an existing query term, it is counted as an error. Both metrics are relevant measures for our known-item search scenario.</Paragraph> <Paragraph position="3"> Table 2: Word-error rates (WER) as well as precision (P), recall (R), FOM, and THP for searching the transcript. Column headings: test set, WER, P, R, FOM, THP.</Paragraph>
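<Paragraph position="4"> For concreteness, the following is a minimal sketch of the two segment-level metrics, not the system's implementation; all names are ours, it computes FOM for a single keyword (values are then averaged over the keyword set), and it assumes at most one putative hit per segment.

def figure_of_merit(hits, relevant, audio_hours):
    # hits:     list of (segment_id, score) putative hits for one keyword
    # relevant: set of the 15-second segments that contain the key phrase
    # Returns the mean detection rate sampled after each of the first
    # floor(10 * audio_hours) false alarms, i.e. averaged over the range
    # of 0..10 false alarms per hour for this keyword.
    max_fa = max(1, int(10 * audio_hours))
    n_true, n_correct, rates = len(relevant), 0, []
    for seg, _score in sorted(hits, key=lambda h: -h[1]):  # best first
        if seg in relevant:
            n_correct += 1
        elif len(rates) < max_fa:
            rates.append(n_correct / n_true)
    while len(rates) < max_fa:            # fewer false alarms than the
        rates.append(n_correct / n_true)  # cap: detection rate saturates
    return sum(rates) / max_fa

def top_hit_precision(top_hit, relevant_by_keyword):
    # top_hit: best-ranked segment per keyword; keywords for which no hit
    # was returned are absent from the dict and therefore count as errors.
    correct = sum(1 for kw, seg in top_hit.items()
                  if seg in relevant_by_keyword.get(kw, set()))
    return correct / len(relevant_by_keyword)
</Paragraph>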
</Section> <Section position="2" start_page="952" end_page="952" type="sub_section"> <SectionTitle> 6.2 Word/Phoneme Hybrid Search </SectionTitle> <Paragraph position="0"> Table 2 gives the LVCSR transcription word-error rates for each set. Almost all sets have word-error rates above 40%. Searching these speech-recognition transcriptions results in FOM and THP values below 40%.</Paragraph> <Paragraph position="1"> Table 3 gives results for searching in word, phoneme, and hybrid lattices. First, for all test sets, word-lattice search is drastically better than transcription-only search. Second, comparing word-lattice and phoneme-lattice search, phoneme lattices outperform word lattices on all test sets in terms of FOM, because phoneme lattices yield higher recall. For THP, word-lattice search is slightly better, except on the interview sets, for which the language model is not well matched. Hybrid search leads to a substantial improvement over either (a 27.6% average FOM improvement and a 16.2% average THP improvement over word-lattice search). This demonstrates the complementary nature of word and phoneme search.</Paragraph> <Paragraph position="2"> We also show results separately for known words (in-vocabulary, INV) and out-of-vocabulary words (OOV).</Paragraph> <Paragraph position="3"> Interestingly, even for known words, hybrid search leads to a significant improvement (22.0% for FOM and 16.7% for THP) compared to using word lattices only.</Paragraph> </Section> <Section position="3" start_page="952" end_page="952" type="sub_section"> <SectionTitle> 6.3 Effect of Node Posterior </SectionTitle> <Paragraph position="0"> In Section 4.2, we showed that phrase posteriors can be computed from posterior lattices if these include both arc and node posteriors (Eq. 4). However, the posterior representations of lattices found in the literature include only word (arc) posteriors, and some posterior-based systems simply ignore the node-posterior term, e.g. (Chelba and Acero, 2005). In Table 4, we evaluate the impact on accuracy when this term is ignored. (In this experiment, we bypassed the index-lookup step, so the numbers differ slightly from those in Table 3.) We find that for word-level search, the effect of node-posterior compensation is indeed negligible. For phonetic search, however, it is not: we observe a 4% relative FOM loss.</Paragraph>
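<Paragraph position="1"> As a minimal illustration of the term at issue, the sketch below assumes Eq. 4 takes the standard posterior-lattice form in which a phrase posterior is the product of the arc posteriors along the phrase divided by the product of the posteriors of the internal nodes connecting those arcs; the function name and toy numbers are ours.

def phrase_posterior(arc_post, node_post, compensate=True):
    # arc_post:  posteriors of the n arcs forming the phrase
    # node_post: posteriors of the n-1 internal connecting nodes
    p = 1.0
    for pa in arc_post:
        p *= pa
    if compensate:            # compensate=False reproduces the
        for pn in node_post:  # approximation that simply ignores
            p /= pn           # the node-posterior term
    return p

# Toy example: two arcs (0.6, 0.5) joined at a node with posterior 0.8.
print(phrase_posterior([0.6, 0.5], [0.8]))         # 0.375
print(phrase_posterior([0.6, 0.5], [0.8], False))  # 0.3 (underestimate)

As the toy numbers suggest, the compensation matters little where node posteriors stay near 1 and more where they deviate from it, which appears to be the case at phoneme granularity.</Paragraph>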
</Section> <Section position="4" start_page="952" end_page="952" type="sub_section"> <SectionTitle> 6.4 Index Lookup and Linear Search </SectionTitle> <Paragraph position="0"> Section 5 introduced a two-stage search approach using an M-gram based indexing scheme. How much accuracy is lost by incorrectly eliminating correct hits in the first (index-based) stage? Table 5 compares three setups. The first column shows results for linear search only: no index lookup is used at all, and a complete linear search is performed over all lattices. This search is optimal but does not scale up to large databases. The second column shows index lookup only: segments are ranked by the approximate M-gram based ETF score obtained from the index. The third column shows the two-stage results.</Paragraph> <Paragraph position="1"> The index-based two-stage search is indeed very close to a full linear search (an average FOM loss of 1.2 and a THP loss of 0.2 percentage points). A two-stage search takes under two seconds and is largely independent of the database size. In other work, we have applied this technique successfully to search a database of nearly 200 hours.</Paragraph> </Section> <Section position="5" start_page="952" end_page="952" type="sub_section"> <SectionTitle> 6.5 The System </SectionTitle> <Paragraph position="0"> Fig. 2 shows a screenshot of a research prototype of a search-enabled audio notebook. In addition to a note-taking area (bottom) and recording controls, it includes a rich audio browser showing speaker segmentation and automatically identified speaker labels (both outside the scope of this paper). Results of keyword searches are shown as color highlights, which can be clicked to start playback at that position.</Paragraph> </Section> </Section> </Paper>