<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4022"> <Title>Context-based Speech Recognition Error Detection and Correction</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Our Approach </SectionTitle> <Paragraph position="0"> There are several key components to our approach to detecting and correcting in-vocabulary speech recognition errors. First, we calculate co-occurrence statistics for all words in a large corpus of ASR output data; this is an offline processing step that we describe in Section 2.1. This co-occurrence information is used in an online error detection process based on word context analysis. The error detection process first requires the input of a query word that is to be sought in the test corpus of ASR output from the same engine; the goal of this step is to detect places in the corpus where the query word was spoken but misrecognized. We describe the contextual analysis in Section 2.2. From the set of candidate error regions, ASR errors are detected using phonetic comparison between the query word and words in the window; this phonetic analysis is described in Section 2.3.</Paragraph> <Paragraph position="1"> Our approach to ASR error detection and correction builds on recent work in statistical lexical and contextual modeling using co-occurrence analysis, such as (Roark and Charniak, 1998). We apply the contextual modeling to a speech retrieval task, as in (Kupiec et al., 1994). In the earlier work, general mathematical models were developed to measure lexical similarity between words in context. We seek to develop a simple contextual model based on word co-occurrences in order to facilitate the retrieval of spoken documents containing critical word errors. Our approach has a similar goal to that of Logan (2002); however, their work focuses primarily on out-of-vocabulary words while we focus on in-vocabulary words. 
Our work also builds on recent directions in language modeling for speech recognition, in which a broader context beyond n-grams is considered. For example, the dimensionality reduction modeling of Bellegarda (1998) seeks to model long-range contextual similarity among words in a training corpus. Rosenfeld (2000) has developed another language modeling approach that can model word occurrences beyond the common trigram approaches. While language modeling techniques seek to improve the ASR engine itself, we present an ASR post-processing correction model, in which we process and improve the output of an ASR system.</Paragraph> <Paragraph position="2"> The data used for our experiments consisted of a large corpus of English broadcast news transcripts produced by the broadcast news speech system described in (Makhoul et al., 2000). This real-time ASR system has a vocabulary size of about 64k words and a reported word error rate (WER) that normally ranges from 20% to 30% for English news broadcasts. Our training corpus consisted of 360 half-hour broadcast transcripts containing roughly 1.6 million words. The broadcasts were from three different English sources (CNN Headline News, BBC America, and News World International) from July 2003. We divided the data into a training set, from which all model parameters were trained, and a separate test set consisting of files that were randomly selected from the corpus. 
The evaluation corpus consisted of 39 half-hour broadcast transcripts containing about 180,000 words.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Word Co-occurrence Analysis </SectionTitle> <Paragraph position="0"> The goal of the first step in our approach, co-occurrence analysis, is to determine, for any given word in the ASR vocabulary, the other words that are very likely to occur near the given word and are not likely to occur elsewhere.</Paragraph> <Paragraph position="1"> We compile co-occurrence frequencies for a target word by counting all other words that co-occur in a document with the target word within a certain window size w (w/2 words to the left and w/2 words to the right). We investigated window sizes ranging from w=2 to w=40; as with all our system parameters, the optimal value for a given source is determined empirically through training.</Paragraph> <Paragraph position="2"> We calculate several maximum likelihood prior probabilities for use in the co-occurrence analysis. For a target word x and each context word y, we calculate p(x)=c(x)/n and p(y)=c(y)/n, where c(x) and c(y) are the total corpus counts for x and y and where n is the total number of words in the training corpus (1,638,224). We also calculate the joint probability p(x,y)=c(x,y)/n, the probability of co-occurrence for x and y in the training data. In addition, we calculate the pointwise mutual information I(x,y) for two words x and y, $I(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)}$.</Paragraph> <Paragraph position="4"> The value I(x,y) is highest for target words x and context words y that occur frequently together within a window w but rarely outside the window. 
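The counting and scoring steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name ranked_context, the input format (a list of tokenized documents), and the base-2 logarithm are all assumptions, and the score uses the standard definition of pointwise mutual information, log of p(x,y) divided by p(x)p(y).

```python
import math
from collections import Counter

def ranked_context(docs, target, w=20):
    """Rank context words for `target` by pointwise mutual information.

    docs: list of tokenized documents (each a list of words); hypothetical input format.
    w: window size, i.e. w // 2 words on each side of the target.
    """
    half = w // 2
    unigram = Counter()  # c(y): total corpus count of each word
    joint = Counter()    # c(x, y): count of y inside a window around the target x
    n = 0                # total number of words in the corpus
    for doc in docs:
        unigram.update(doc)
        n += len(doc)
        for i, word in enumerate(doc):
            if word != target:
                continue
            left = doc[max(0, i - half):i]
            right = doc[i + 1:i + 1 + half]
            # Count every other word in the window, skipping the target itself.
            joint.update(t for t in left + right if t != target)
    if n == 0 or unigram[target] == 0:
        return []
    p_x = unigram[target] / n
    scores = {}
    for y, c_xy in joint.items():
        p_y = unigram[y] / n
        p_xy = c_xy / n
        # Log base 2 is an assumption; the ranking is the same for any base.
        scores[y] = math.log2(p_xy / (p_x * p_y))
    # Highest mutual information first.
    return sorted(scores, key=scores.get, reverse=True)
```

A word that appears near the target but is also common elsewhere in the corpus receives a lower score than a word seen only near the target, which is the behavior the mutual information ranking is meant to capture.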
The context words are ranked by mutual information, and this ranked list of co-occurring context words for each target word is used in the context analysis step described in Section 2.2.</Paragraph> <Paragraph position="5"> The resulting ranked context lists demonstrate the different contexts in which words like &quot;Iraq&quot; and &quot;rock&quot; appear in the data. The top 5 context words for a window size of 20 for &quot;Iraq&quot; are inge, chirac's, refusal, reconstruction, and waging. The top 5 context words for &quot;rock&quot; are uplifting, kt, folk, lejeune, and assertion. Most of the words in the first list are relevant to the word &quot;Iraq,&quot; and the words in the second list are clearly not relevant to &quot;Iraq.&quot; The corresponding top 5 list for &quot;Abbas&quot; is mahmoud, ariel, prime, minister, and committed; the list for &quot;bus&quot; is michelle, blew, moscow, jerusalem, and responsible.</Paragraph> <Paragraph position="6"> These lists also demonstrate the value of modeling the patterns in the ASR output directly, rather than compiling co-occurrence frequencies from a clean data source without word errors: the output word inge occurs exclusively in the data as an ASR error for &quot;in going&quot; in the context &quot;in going to war with Iraq.&quot; Similarly, kt occurs frequently in the data as part of the call letters for a television station in Little Rock, Arkansas. Systematic and recurring errors such as these provide a great deal of information in the co-occurrence statistics. 
However, the use of ASR output without &quot;clean&quot; transcripts in training also introduces the possibility of modeling false positives in the output, such as &quot;Iraq&quot; being output as an error when &quot;a rock&quot; was spoken; this type of error can adversely affect the co-occurrence statistics we calculate.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Context Analysis </SectionTitle> <Paragraph position="0"> The context analysis component seeks to identify contextual regions in the test data that are likely to contain a given query word, and thus also likely to contain a misrecognition of the query word. This analysis uses the probabilities and mutual information output from the co-occurrence analysis described in Section 2.1.</Paragraph> <Paragraph position="1"> We slide a window of w words across a document in the test data, where w is the same window size used to train the word co-occurrence statistics. We also define a minimum number of context words c that must be contained within the window in order to mark the center word of the window as a possible ASR error. As an example of this context matching, consider the word sequence &quot;... the reconstruction of a rock proceeds despite Chirac's refusal...&quot; The word &quot;rock&quot; is at the center of an 8-word context window (4 on either side) containing 3 of the top-ranked context words for &quot;Iraq&quot; from the previous section. This instance of the word &quot;rock&quot; would thus be a candidate misrecognition of &quot;Iraq&quot; for w ≥ 8 and c ≤ 3. Table 1 shows the number of candidate words detected for &quot;Iraq&quot; in the evaluation data for different window sizes w and minimum context words c. 
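The sliding-window matching described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name candidate_errors and its parameters are hypothetical, tokens are assumed to be lowercased, and distinct context words are counted (the text does not say whether repeated occurrences count separately).

```python
def candidate_errors(tokens, context_words, w=8, c=3):
    """Return indices of center words whose window holds at least c context words.

    tokens: one tokenized document (list of lowercased words).
    context_words: top-ranked context words for the query word.
    w: window size, w // 2 words on each side of the center word.
    c: minimum number of distinct context words required in the window.
    """
    half = w // 2
    context = set(context_words)
    hits = []
    for i in range(len(tokens)):
        # The window excludes the center word itself.
        window = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
        matched = context.intersection(window)
        if len(matched) >= c:
            hits.append(i)
    return hits
```

Applied to the example word sequence with the five context words listed for Iraq, w=8 and c=3, the position of the word rock is among the flagged centers. Neighboring positions can be flagged as well, which is one reason later pruning and phonetic comparison are needed.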
As might be expected, the number of candidate words increases as the window size increases and decreases as the minimum number of context words increases.</Paragraph> <Paragraph position="2"> [Table 1 caption, truncated in extraction: ... range of window sizes w for minimum numbers of context words c.]</Paragraph> <Paragraph position="3"> Most combinations result in a large number of candidates, so we also apply candidate pruning based on probabilistic metrics. Given a candidate error and c context words contained in a window, we then compare the probabilities of observing the query word and the actual word in the data. This comparison is carried out using the Kullback-Leibler divergence for observation distributions containing the c context words, $D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$,</Paragraph> <Paragraph position="5"> where p(x) is the conditional probability of the co-occurrence of the query word with a context word x in the set of c context words X and q(x) is the corresponding probability for the candidate error with the context word.</Paragraph> <Paragraph position="6"> A larger Kullback-Leibler divergence value indicates a higher probability that the candidate word is actually a misrecognition of the query word.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Phonetic Comparison </SectionTitle> <Paragraph position="0"> Given the set of candidate errors for a query word, as detected using the context matching technique described in Section 2.2, we next apply a phonetic distance criterion to determine the similarity between each candidate error and the query word being sought, based on the pronunciations in the ASR system lexicon. 
We used the common minimum-distance weighted phonetic alignment technique described in detail in (Kondrak, 2003); in our experiments we used phonetic weights available through the alignment package altdistsm originally described in (Fisher and Fiscus, 1993).</Paragraph> <Paragraph position="1"> The final decision whether to correct the ASR error is made based on the phonetic distance between the candidate word and the query word. Since the candidate word is already known to have occurred in a lexical context that is likely to contain the query, a strong phonetic similarity between the words provides very strong evidence that the candidate word is, in fact, a misrecognition of the query word.</Paragraph> </Section> </Section> </Paper>