<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1023"> <Title>Topic and Speaker Identification via Large Vocabulary Continuous Speech Recognition</Title> <Section position="3" start_page="119" end_page="120" type="metho"> <SectionTitle> 2. THEORETICAL FRAMEWORK </SectionTitle> <Paragraph position="0"> Our approach to the topic and speaker identification tasks is based on modelling speech as a stochastic process. For each of the two problems, we assume that a given stream of speech is generated by one of several possible stochastic sources, one corresponding to each of the possible topics or to each of the possible speakers in question. We are required to judge from the acoustic data which topic (or speaker) is the most probable source.</Paragraph> <Paragraph position="1"> Standard statistical theory provides us with the optimal solution to such a classification problem. We denote the string of acoustic observations by A and introduce the random variable T to designate which stochastic model has produced the speech, where T may take on the values from 1 to n for the n possible sources. If we let Pi denote the prior probability of stochastic source i and assume that all classification errors have the same cost, then we should choose the source T = i for which</Paragraph> <Paragraph position="3"> We assume, for the purposes of this work, that all prior probabilities are equal, so that the classification problem reduces simply to choosing the source i for which the conditional probability of the acoustics given the source is maximized.</Paragraph> <Paragraph position="4"> In principle, to compute each of the probabilities</Paragraph> <Paragraph position="6"> In practice, such a collection of computations is unwieldy and so we make several simplifying approximations to limit the computational burden. First, we estimate the above sum only by its single largest term, i.e. we approximate the probability P(A I T = i) by the joint probabiltiy of A and the single most probable word sequence W = W~a x. Of course, generating such an optimal word sequence is exactly what speech recognition is designed to do. Thus, for the problem of topic identification, we could imagine running n different speech recognizers, each modelling a different topic, and then compare the resulting probabilities P(A, W~a x I T = i) corresponding to each of the n optimal transcriptions W~ x. Similarly, for speaker identification, we would run n different speaker-dependent recognizers, each trained on one of the possible speakers, and compare the resulting scores.</Paragraph> <Paragraph position="7"> This approach, though simpler, still requires us to make many complete recognition passes across the speech sample. We further reduce the computational burden by instead producing only a single transcription of the speech to be classified, by using a recognizer whose models are both topic-independent and speaker-independent. Once this single transcription W = Wm~ is obtained, we need only compute the probabilities P(A, Wmax \[ T = i) corresponding to each of the stochastic sources T = i.</Paragraph> <Paragraph position="8"> Rewriting P(A, Wmax I T = i) as P(A I Wmax, T = i) * P(Wmax I T = i), we see that the problem of computing the desired probability factors into two components. The first, P(A \[ W, T), we can think of as the contribution of the acoustic model, which assigns probabilities to acoustic observations generated from a given string of words. 
<Paragraph position="9"> Now for the problem of topic identification, we wish to determine which of several possible topics is most likely the subject of a given sample of speech. Nothing is known about the speaker. We therefore assume that the same speaker-independent acoustic model holds for all topics; i.e. for the topic identification task, we assume that P(A | W, T) does not depend on T. But we need n different language models P(W | T = i), i = 1,..., n.</Paragraph>
<Paragraph position="10"> From the above factorization, it is then clear that in comparing scores from the different sources, only this latter term matters.</Paragraph>
<Paragraph position="11"> Symmetrically, for the speaker identification problem, we must choose which of several possible speakers is most likely to have produced a given sample of speech. While in practice different speakers may well talk about different subjects and in different styles, we assume for the speaker identification task that the language model P(W | T) is independent of T. But n different acoustic models P(A | W, T = i) are required. Thus only the first factor matters for speaker identification.</Paragraph>
<Paragraph position="12"> As a result, once the speaker-independent, topic-independent recognizer has generated a transcript of the speech message, the task of the topic classifier is simply to score the transcription using each of n different language models. Similarly, for speaker identification the task reduces to computing the likelihood of the acoustic data given the transcription, using each of n different acoustic models.</Paragraph> </Section>
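As a concrete illustration of this division of labour, the following is a minimal sketch of the two classifiers built on top of a single recognition pass. The recognizer and model interfaces (recognize, align, log_prob, log_likelihood) are hypothetical stand-ins introduced for this example, not the actual Dragon system.

```python
# Minimal sketch of the two-stage classification scheme described above.
# All interfaces here (recognize, align, log_prob, log_likelihood) are
# hypothetical stand-ins, not the actual recognizer described in the paper.

def identify_topic(acoustics, recognizer, topic_lms):
    """Pick the topic whose language model gives the transcript the highest probability."""
    transcript = recognizer.recognize(acoustics)            # single SI/TI recognition pass
    scores = {t: lm.log_prob(transcript) for t, lm in topic_lms.items()}
    return max(scores, key=scores.get)                      # largest P(W | T = i)

def identify_speaker(acoustics, recognizer, speaker_ams):
    """Pick the speaker whose acoustic models best explain the data given the transcript."""
    transcript = recognizer.recognize(acoustics)
    alignment = recognizer.align(acoustics, transcript)     # fixed segmentation (Section 3.2)
    scores = {s: am.log_likelihood(acoustics, alignment) for s, am in speaker_ams.items()}
    return max(scores, key=scores.get)                      # largest P(A | W, T = i)
```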
<Section position="4" start_page="120" end_page="121" type="metho"> <SectionTitle> 3. THE MESSAGE IDENTIFICATION SYSTEM </SectionTitle>
<Paragraph position="0"> We now examine how this theory is implemented in each of the major components of Dragon's message identification system: the continuous speech recognizer, the speaker classifier, and the topic classifier.</Paragraph>
<Section position="1" start_page="120" end_page="120" type="sub_section"> <SectionTitle> 3.1. The Speech Recognizer </SectionTitle>
<Paragraph position="0"> In order to carry out topic and speaker identification as described above, it is necessary to have a large vocabulary continuous speech recognizer that can operate in either speaker-independent or speaker-dependent mode.</Paragraph>
<Paragraph position="1"> Dragon's speech recognizer has been described extensively elsewhere ([5], [6]). Briefly, the recognizer is a time-synchronous hidden Markov model (HMM) based system. It makes use of a set of 32 signal-processing parameters: 1 overall amplitude term, 7 spectral parameters, 12 mel-cepstral parameters, and 12 mel-cepstral differences. Each word pronunciation is represented as a sequence of phoneme models called PICs (phonemes-in-context) designed to capture coarticulatory effects due to the preceding and succeeding phonemes. Because it is impractical to model all the triphones that could in principle arise, we model only the most common ones and back off to more generic forms when a recognition hypothesis calls for a PIC which has not been built. The PICs themselves are modelled as linear HMMs with one or more nodes, each node being specified by an output distribution and a double exponential duration distribution. We are currently modelling the output distributions of the states as tied mixtures of double exponential distributions. The recognizer employs a rapid match module which returns a short list of words that might begin in a given frame whenever the recognizer hypothesizes that a word might be ending. During recognition, a digram language model with unigram backoff is used.</Paragraph>
<Paragraph position="2"> We have recently begun transforming our basic set of 32 signal-processing parameters using the IMELDA transform [7], a transformation constructed via linear discriminant analysis to select directions in parameter space that are most useful in distinguishing between designated classes while reducing variation within classes. For the speaker-independent recognizer, we sought directions which maximize average variation between phonemes while being relatively insensitive to differences within the phoneme class, such as might arise from different speakers, telephone channels, etc. Since the IMELDA transform generates a new set of parameters ordered with respect to their value in discriminating classes, directions with little discriminating power between phonemes can be dropped. We use only the top 16 IMELDA parameters for speaker-independent recognition. A different IMELDA transform, in many ways dual to this one, was employed by the speaker classifier, as described below.</Paragraph>
<Paragraph position="3"> For speaker-independent recognition, we also normalize the average speech spectra across conversations via blind deconvolution prior to performing the IMELDA transform, in order to further reduce channel differences. A fixed number of frames is removed from the beginning and end of each speech segment before computing the average to minimize the effect of silence on the long-term speech spectrum.</Paragraph>
<Paragraph position="4"> Finally, we are now building separate male and female acoustic models and using the result of whichever model scores better. While in principle one would have to perform a complete recognition pass with both sets of models and choose the better scoring, we have found that one can fairly reliably determine the model which better fits the data after recognizing only a few utterances. The remainder of the speech can then be recognized using only the better model.</Paragraph> </Section>
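The IMELDA transform itself is described in [7]; the sketch below is only a generic linear-discriminant construction of the same flavour (between-class versus within-class scatter, keeping the top discriminant directions), not Dragon's implementation. The class labels would be phonemes for the recognizer's transform, or speakers for the transform used by the speaker classifier described below.

```python
# Generic LDA-style transform sketch (an illustration, not the IMELDA code):
# find directions maximizing between-class scatter relative to within-class scatter,
# then keep only the most discriminative directions (e.g. the top 16).
import numpy as np
from scipy.linalg import eigh

def discriminant_transform(frames, labels, n_keep=16):
    """frames: (N, D) parameter vectors; labels: N class ids (phonemes or speakers)."""
    labels = np.asarray(labels)
    overall_mean = frames.mean(axis=0)
    d = frames.shape[1]
    s_within = np.zeros((d, d))
    s_between = np.zeros((d, d))
    for c in np.unique(labels):
        x = frames[labels == c]
        mu = x.mean(axis=0)
        s_within += (x - mu).T @ (x - mu)            # variation within a class
        diff = (mu - overall_mean)[:, None]
        s_between += len(x) * (diff @ diff.T)        # variation between class means
    # Generalized eigenproblem S_b v = lambda S_w v (assumes S_w is nonsingular);
    # eigh returns eigenvalues in ascending order, so take the largest ones.
    vals, vecs = eigh(s_between, s_within)
    order = np.argsort(vals)[::-1][:n_keep]
    return vecs[:, order]                            # (D, n_keep) projection matrix
```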
<Section position="2" start_page="120" end_page="121" type="sub_section"> <SectionTitle> 3.2. The Speaker Classifier </SectionTitle>
<Paragraph position="0"> Given the transcript generated by the speaker-independent recognizer, the job of the speaker classifier is to score the speech data using speaker-specific sets of acoustic models, assuming that the transcript provides the correct text; i.e. it must calculate the probabilities P(A | W, T = i) discussed above. Dragon's continuous speech recognizer is capable of running in such a &quot;scoring&quot; mode. This step is much faster than performing a full recognition, since the recognizer only has to hypothesize different ways of mapping the speech data to the required text - a frame-by-frame phonetic labelling we refer to as a &quot;segmentation&quot; of the script - and need not entertain hypotheses on alternate word sequences.</Paragraph>
<Paragraph position="1"> In principle, the value of P(A | W, T) should be computed as the sum over all possible segmentations of the acoustic data, but, as usual, we approximate this probability using only the largest term in the sum, corresponding to the maximum likelihood segmentation. While one could imagine letting each of the speaker-dependent models choose the segmentation that is best for it, in our current version of the speaker classifier we have chosen to compute this &quot;best&quot; segmentation once and for all using the same speaker-independent recognizer responsible for generating the initial transcription. This ensures that the comparison of different speakers is relative to the same alignment of the speech and may yield an actual advantage in performance, given the imprecision of our probability models.</Paragraph>
<Paragraph position="2"> Thus, the job of the speaker classifier reduces to scoring the speech data given both a fixed transcription and a specified mapping of individual speech frames to PICs. To perform this scoring, we use a &quot;matched set&quot; of tied mixture acoustic models - a collection of speaker-dependent models each trained on speech from one of the target speakers but constructed with exactly the same collection of PICs to keep the scoring directly comparable. Running in &quot;scoring&quot; mode, we then produce a set of scores corresponding to the negative log likelihood of generating the acoustics given the segmentation for each of the speaker-dependent acoustic models. The speech sample is assigned to the lowest-scoring model.</Paragraph>
<Paragraph position="3"> In constructing speaker scoring models, we derived a new &quot;speaker sensitive&quot; IMELDA transformation, designed to enhance differences between speakers. The transform was computed using only voiced speech segments of the test speakers (and, correspondingly, only voiced speech was used in the scoring). As is common in using the IMELDA strategy, we dropped parameters with the least discriminating power, reducing our original 32 signal-processing parameters to a new set of 24 IMELDA parameters. These were the parameters used to build the speaker scoring models. It is worth remarking that, because these parameters were constructed to emphasize differences between speakers rather than between phonemes, it was particularly important that the phoneme-level segmentation used in the scoring be set by the original recognition models.</Paragraph> </Section>
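Schematically, this scoring step might look as follows; the per-PIC model objects and their neg_log_likelihood method are illustrative assumptions rather than the actual scoring code.

```python
# Sketch of the speaker classifier's scoring step under a fixed segmentation.
# `segmentation` maps each frame to a PIC label; `speaker_models[spk][pic]` is
# assumed to expose a per-frame neg_log_likelihood; these are illustrative only.

def classify_speaker(frames, segmentation, speaker_models):
    """Return the speaker whose matched models give the lowest total -log likelihood."""
    totals = {}
    for spk, models in speaker_models.items():
        total = 0.0
        for frame, pic in zip(frames, segmentation):
            total += models[pic].neg_log_likelihood(frame)   # score frame against its PIC
        totals[spk] = total
    return min(totals, key=totals.get)                        # lowest score wins
```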
<Section position="3" start_page="121" end_page="121" type="sub_section"> <SectionTitle> 3.3. The Topic Classifier </SectionTitle>
<Paragraph position="0"> Once the speaker-independent recognizer has generated a transcription of the speech, the topic classifier need only score the transcript using language models trained on each of the possible topics. The current topic scoring algorithm uses a simple (unigram) multinomial probability model based on a collection of topic-dependent &quot;keywords&quot;. Thus digrams are not used for topic scoring although they are used during recognition. For each topic, the probability of occurrence of each keyword is estimated from training material on that topic. Non-keyword members of the vocabulary are assigned to a catch-all &quot;other&quot; category whose probability is also estimated. Transcripts are then scored by adding in a negative log probability for every recognized word, and running totals are kept for each of the topics. The speech sample is assigned to the topic with the lowest cumulative score.</Paragraph>
<Paragraph position="1"> We have experimented with two different methods of keyword selection. The first method is based on computing the chi-squared statistic for homogeneity based on the number of times a given word occurs in the training data for each of the target topics. This method assumes that the number of occurrences of the word within a topic follows a binomial distribution, i.e. that there is a &quot;natural frequency&quot; for each word within each topic class. The words of the vocabulary can then be ranked according to the P-value resulting from this chi-squared test. Presumably, the smaller the P-value, the more useful the word should be for topic identification. Keyword lists of different lengths are obtained by selecting all words whose P-value falls below a given threshold.</Paragraph>
<Paragraph position="2"> Unfortunately, this method does not do a good job of excluding function words and other high frequency words, such as &quot;uh&quot; or &quot;oh&quot;, which are of limited use for topic classification. Consequently, this method requires the use of a human-generated &quot;stop list&quot; to filter out these unwanted entries. The problem lies chiefly in the falsity of the binomial assumption: one expects a great deal of variability in the frequency of words, even among messages on the same topic, and natural variations in the occurrence rates of these very high frequency words can result in exceptionally small P-values.</Paragraph>
<Paragraph position="3"> The second method is designed to address this problem by explicitly modelling the variability in word frequency among conversations in the same topic instead of only variations between topics. It also uses a chi-squared test to sort the words in the vocabulary by P-value. But now for each word we construct a two-way table sorting training messages from each topic into classes based on whether the word in question occurs at a low, a moderate, or a high rate. (If the word occurs in only a small minority of messages, it becomes necessary to collapse the three categories to two.) Then we compute the P-value relative to the null hypothesis that the distribution of occurrence rates is the same for each of the topic classes.</Paragraph>
<Paragraph position="4"> Hence this method explicitly models the variability in occurrence rates among documents in a nonparametric way. This method does seem successful at automatically excluding most function words when stringent P-value thresholds are set, and as the threshold is relaxed and the keyword lists allowed to grow, function words are slowly introduced at levels more appropriate to their utility in topic identification. Hence, this method eliminates the need for human editing of the keyword lists.</Paragraph> </Section> </Section>
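A minimal sketch of this keyword-based scorer follows; the data structures are assumed for illustration, with keyword and catch-all probabilities taken to be pre-estimated (and nonzero) from the topic-dependent training counts described above.

```python
# Sketch of the unigram keyword topic scorer from Section 3.3 (illustrative only).
# topic_probs[topic] maps each keyword, plus the catch-all "OTHER" token, to a
# (smoothed, nonzero) probability estimated from that topic's training material.
import math

def classify_topic(recognized_words, topic_probs, keywords):
    """Assign the message to the topic with the lowest cumulative -log probability."""
    totals = {topic: 0.0 for topic in topic_probs}
    for word in recognized_words:
        token = word if word in keywords else "OTHER"       # non-keywords share one category
        for topic, probs in topic_probs.items():
            totals[topic] += -math.log(probs[token])        # running negative log score
    return min(totals, key=totals.get)
```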
<Section position="5" start_page="121" end_page="123" type="metho"> <SectionTitle> 4. TESTING ON SWITCHBOARD DATA </SectionTitle>
<Paragraph position="0"> To gauge the performance of our message classification system, we turned to the Switchboard corpus of recorded telephone conversations. The recognition task is particularly challenging for Switchboard messages, since they involve spontaneous conversational speech across noisy phone lines. This made the Switchboard corpus a particularly good platform for testing the message identification systems, allowing us to assess the ability of the continuous speech recognizer to extract information useful to the message classifiers even when the recognition itself was bound to be highly errorful.</Paragraph>
<Paragraph position="1"> To create our &quot;Switchboard&quot; recognizer, male and female speaker-independent acoustic models were trained using a total of about 9 hours of Switchboard messages (approximately 140 message halves) from 8 male and 8 female speakers not involved in the test sets. We found that it was necessary to hand edit the training messages in order to remove such extraneous noises as cross-talk, bursts of static, and laughter. We also corrected bad transcriptions and broke up long utterances into shorter, more manageable pieces.</Paragraph>
<Paragraph position="2"> Models for about 4800 PICs were constructed. We chose to construct only one-node models for the Switchboard task, both to reduce the number of parameters to be estimated given the limited training data and to minimize the penalty for reducing or skipping phonemes in the often rapid speech of many Switchboard speakers. A vocabulary of 8431 words (all words occurring at least 4 times in the training data) and a digram language model were derived from a set of 935 transcribed Switchboard messages involving roughly 1.4 million words of text and covering nearly 60 different topics. Roughly a third of the language model training messages were on one of the 10 topics used for the topic identification task.</Paragraph>
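As a loose illustration of this kind of preparation, the sketch below selects a vocabulary by a count threshold of 4 and collects digram and unigram counts from the training transcripts; the backoff scoring shown is a simplified stand-in, not the smoothing actually used in Dragon's recognizer.

```python
# Rough illustration of vocabulary selection and digram counting (not Dragon's code).
# The backoff rule below is a deliberately simple stand-in for "digram with unigram backoff".
import math
from collections import Counter

def build_digram_lm(transcripts, min_count=4, backoff_weight=0.4):
    """transcripts: list of word-token lists from the training messages."""
    raw = Counter(w for sent in transcripts for w in sent)
    vocab = {w for w, c in raw.items() if c >= min_count}          # e.g. all words seen >= 4 times
    mapped = [[w if w in vocab else "<unk>" for w in sent] for sent in transcripts]
    uni = Counter(w for sent in mapped for w in sent)
    bi = Counter(pair for sent in mapped for pair in zip(sent, sent[1:]))
    total = sum(uni.values())

    def log_prob(prev, word):
        prev = prev if prev in vocab else "<unk>"
        word = word if word in vocab else "<unk>"
        if bi[(prev, word)] > 0:
            return math.log(bi[(prev, word)] / uni[prev])           # observed digram
        return math.log(backoff_weight * max(uni[word], 1) / total) # back off to unigram
    return vocab, log_prob
```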
<Paragraph position="3"> For the speaker identification trials, we used a set of 24 test speakers, 12 male and 12 female. Speaker-dependent scoring models were constructed for each of the 24 speakers using the same PIC set as for the speaker-independent recognizer. PIC models were trained using 5 to 10 hand-edited message halves (about 16 minutes of speech) from each speaker.</Paragraph>
<Paragraph position="4"> The speaker identification test material involved 97 message halves and included from 1 to 6 messages for each test speaker. We tested on speech segments from these messages that contained 10, 30, and 60 seconds of speech.</Paragraph>
<Paragraph position="5"> The results of the speaker identification tests were surprisingly constant across the three duration lengths.</Paragraph>
<Paragraph position="6"> Even for segments containing as little as 10 seconds of speech, 86 of the 97 message halves, or 88.7%, were correctly classified. When averaged equally across speakers, this gave 90.3% accuracy. The results from the three trial runs are summarized in Table 1. It is worth remarking that even the few errors that were made tended to be concentrated in a few difficult speakers; for 17 of the 24 speakers, the performance was always perfect, and for only 2 speakers was more than one message ever misclassified. Given the insensitivity of these results to speech duration, we decided to further limit the amount of speech available to the speaker classifier. The test segments used in the speaker test were actually concatenations of smaller speech intervals, ranging in length from as little as 1.5 to as much as 50.2 seconds. We rescored using these individual fragments as the test pieces. (The initial speaker-independent recognition and segmentation were not re-run, so decisions such as gender determination were inherited from the larger test.) Results remained excellent. For example, when testing only the pieces of length under 3 seconds, 42 of the 46 pieces, or 91.3%, were correctly classified (90.9% when speakers were equally weighted). These pieces represented only 19 of the 24 speakers, but did include our most problematic speakers. For segments of length less than 5 seconds, 177 of the 201 pieces (88.1%, or 89.4% when the 24 speakers were equally weighted) were correctly classified.</Paragraph>
<Paragraph position="7"> [Table 1: speaker identification results from the Switchboard test.]</Paragraph>
<Paragraph position="8"> For the topic identification task, we used a test set of 120 messages, 12 conversations on each of 10 different topics. Topics included such subjects as &quot;air pollution&quot;, &quot;pets&quot;, and &quot;public education&quot;, and involved several topics (for example, &quot;gun control&quot; and &quot;crime&quot;) with significant common ground. For topic identification, we planned to use the entire speech message, but for uniformity all messages were truncated after 5 minutes and the first 30 seconds of each was removed because of concern that this initial segment might be artificially rich in keywords.</Paragraph>
<Paragraph position="9"> Keywords were selected from the same training messages used for constructing the recognizer's language model.</Paragraph>
<Paragraph position="10"> This collection yielded just over 30 messages on each of the ten topics, for a total of about 50,000 words of training text per topic. Because this is relatively little for estimating reliable word frequencies, word counts for each topic were heavily smoothed using counts from all other topics. We found that it was best to use a 5-to-1 smoothing ratio; i.e. data specific to the topic were counted five times as heavily as data from the other nine topics.</Paragraph>
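One plausible way to write this 5-to-1 smoothing out explicitly (an assumed form; the paper does not give the exact expression) is:

```latex
% Assumed form of the 5-to-1 smoothed keyword probability for topic t.
\hat{p}_t(w) \;=\; \frac{5\,c_t(w) \;+\; \sum_{t' \neq t} c_{t'}(w)}
                        {5\,N_t \;+\; \sum_{t' \neq t} N_{t'}},
\qquad N_t \;=\; \sum_{w} c_t(w)
```

where c_t(w) is the count of word w in the training messages for topic t.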
<Paragraph position="11"> Keyword lists of lengths ranging from about 200 words to nearly 5000 were generated using the second method of keyword selection. We also tried using the entire 8431-word recognition vocabulary as the &quot;keyword&quot; list. The results of the initial runs are given in the second column of Table 2.</Paragraph>
<Paragraph position="12"> [Table 2: topic identification results from the Switchboard test.]</Paragraph>
<Paragraph position="13"> It is worth noting that, as it was designed to, the new keyword selection routine succeeded in automatically excluding virtually all function words from the 203-word list. For comparison, we also ran some keyword lists selected using our original method and filtered through a human-generated &quot;stop list&quot;. The performance was similar: for example, a list of 211 keywords resulted in an accuracy of 67.5%.</Paragraph>
<Paragraph position="14"> The problem for the topic classifier was that scores for messages from different topics were not generally comparable due to differences in the acoustic confusability of the keywords. When tested on the true transcripts of the speech messages, the topic classifier did extremely well, missing only 2 or 3 messages out of the 120 with any of the keyword lists. Unfortunately, when run on the recognized transcriptions, some topics (most notably &quot;pets&quot;, with its preponderance of monosyllabic keywords) never received competitive scores.</Paragraph>
<Paragraph position="15"> In principle, this problem could be corrected by estimating keyword frequencies not from true transcriptions of training data but from their recognized counterparts.</Paragraph>
<Paragraph position="16"> Unfortunately, this is a fairly expensive approach, requiring that the full training corpus be run through the recognizer. Instead, we took a more expedient course. In the process of evaluating our Switchboard recognizer, we had run recognition on over a hundred messages on topics other than the ten used in the topic identification test. For each of these off-topic messages, we computed scores based on each of the test topic language models to estimate the (per word) handicap that each test topic should receive. When the 120 test messages were rescored using this adjustment, the results improved dramatically for all but the smallest list (where the keywords were too sparse for scores to be adequately estimated). The improved results are given in the last column of Table 2.</Paragraph> </Section> </Paper>