<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1035">
  <Title>Phoneme-in-Context Modeling for Dragon's Continuous Speech Recognizer</Title>
  <Section position="3" start_page="0" end_page="165" type="metho">
    <SectionTitle>
2. Phonemes in Context
</SectionTitle>
    <Paragraph position="0"> A speaker of English, given a phonemic spelling of an unfamiliar word from a dictionary, can pronounce the word recognizably or recognize the word when it is spoken. On the other hand, it is impossible to put together an &amp;quot;alphabet&amp;quot; of recorded phonemes which, when concatenated, will sound like natural English words. Speakers of English apply a host of duration and coarticulation rules when combining phonemes into words and sentences, and they employ the same rules in recognizing spoken language. It comes as a surprise to most speakers, for example, to discover that the vowels in &amp;quot;will&amp;quot; and &amp;quot;kick&amp;quot;, which are identical according to dictionary pronunciations, are as different in their spectral characteristics as the vowels in &amp;quot;not&amp;quot; and &amp;quot;nut&amp;quot;, or that the vowel in &amp;quot;size&amp;quot; has more than twice the duration of the same vowel in &amp;quot;seismograph&amp;quot;.</Paragraph>
    <Paragraph position="2"/>
    <Paragraph position="4"/>
    <Paragraph position="6"> In the Dragon Systems family of speech recognizers, the fundamental unit of speech to be trained is the &amp;quot;phoneme in context&amp;quot; (PIC)\[3\]. Ultimately the defining property of a PIC is that by concatenating a sequence of PICs for an utterance one can construct an accurate simulated spectrum for the utterance. In the present implementation, a PIC is taken as completely specified by a phoneme accompanied by a preceding phoneme (or silence), a succeeding phoneme (or silence), and a duration code that indicates the degree of prepausal lengthening. To restrict the proliferation of PICs, syllable boundaries, even word boundaries, are currently ignored The set of phonemes is taken from The Random House(r) Unabridged Dictionary. The stress of each syllable is regarded as a property of the vowel or syllabic consonant in that syllable* Excluding pronunciations which are explicitly marked as foreign, there are 17 vowels, each with three possible stress levels, plus 26 consonants and syllabic consonants.</Paragraph>
    <Paragraph position="7"> A duration code of 3 indicates absence of prepausal lengthening. This will always be the case except in the last two syllables of an utterance.</Paragraph>
    <Paragraph position="8"> A duration code of 6 indicates prepausal lengthening to approximately twice the normal duration. This occurs for the vowel in the final syllable of an utterance and for any consonant that follows that vowel, unless the vowel is followed by one of the unvoiced consonants k, p, t, th or ch. For example, in the word &amp;quot;harmed&amp;quot; every PIC except the one for the initial 'h' will have a duration code of 6.</Paragraph>
    <Paragraph position="9"> A duration code of 4 indicates prepausal lengthening by a factor of approximately 4/3. This occurs in two cases: * In the final syllable when the vowel is followed by k, p, t, ch, or th: for example, in both PICS of &amp;quot;at&amp;quot; and in the last three PICS of &amp;quot;bench&amp;quot;.</Paragraph>
    <Paragraph position="10"> * For consonants that precede the vowel in the final syllable: for example, the 's' in &amp;quot;beside&amp;quot;.</Paragraph>
    <Paragraph position="11"> PICs contain almost enough information to predict the acoustic realization of a phoneme. For example, the PIC for 't' is different in the word &amp;quot;mighty&amp;quot; (where the 't' is usually realized as a flap) and in the phrase &amp;quot;my tea&amp;quot; (where the 't' is clearly aspirated). This distinction is made, even though syllable and word boundaries, are ignored, because the stress of the following vowel is part of the context* Similarly, PICs capture the information that the final 't' in &amp;quot;create&amp;quot; (preceded by a stressed vowel) is more strongly released that in &amp;quot;probate&amp;quot; (preceded by an unstressed vowel), that the 's' in &amp;quot;horseshoe&amp;quot; is realized as an &amp;quot;sh&amp;quot;, that the 'n' in &amp;quot;San Francisco&amp;quot; or &amp;quot;NPR&amp;quot; is realized almost like an 'm', and that the 'n' in &amp;quot;month&amp;quot; or &amp;quot;in the&amp;quot; is the dental allophone of 'n'. 3. Selection of PICs for Training For isolated-word recognition, one could in principle enumerate all PICs by processing phonetic spellings for all the words in an unabridged dictionary. For the 25,000 words in the DragonDictate recognizer, there are approximately 30,000 PICs. A subset of 8,000 words can be chosen that includes all but about 1,000 of these PICs, most of them occurring in only a single word. Increasing the vocabulary size to 64,000 words would increase the number of PICS only slightly, to about 32,000.</Paragraph>
    <Paragraph position="12"> For connected speech the goal of including all possible PICs is unachievable because of the wide variety of PICs that can arise through coarticulation across word boundaries.</Paragraph>
    <Paragraph position="13"> For example, the sentence &amp;quot;Act proud when you're dubbed Gareth&amp;quot; contains the PICs &amp;quot;ktp&amp;quot; and &amp;quot;bdg', neither of which occurs in any common English word. A further complication is that each PIC in a final syllable can occur in a sentence either with or without prepausal lengthening.</Paragraph>
    <Paragraph position="14"> For the sort of connected-speech task which can be carried out in close to real time on today's microcomputers, the majority of PICs already arise only as a result of coarticulation across word boundaries. The 1023 pronunciations for the 842 words in the mammography vocabulary that is used for research at Dragon Systems include 2681 PICs. A set of 3000 sentences using this vocabulary includes only 1929 of these PICs, plus another 4610 that are not present in the isolated words* A different set of 3000 sentences, reserved for testing, includes yet another 1326 new PICs. Among the 121 PICs, not present  in isolated words, that occur 100 or more times in the sentences are the vowel in the suffix &amp;quot;ation&amp;quot; without prepausal lengthening, the dental &amp;quot;n&amp;quot; of &amp;quot;in the&amp;quot; and &amp;quot;on the&amp;quot;, and the &amp;quot;zs&amp;quot; combination of&amp;quot;is seen&amp;quot;. The Dragon Systems training set currently includes about 8000 isolated words and about 8000 short phrases, each limited in duration to about 2.4 seconds. Although the total number of words in the training set is no greater than in the 6000 mammography sentences, the training set includes 37,423 distinct PICs. It is still far from complete.</Paragraph>
    <Paragraph position="15"> For example, a a set of 800 phrases drawn from a Hemingway short story and a newspaper article on parallel processing includes slightly more than 1000 PICs that were not in the training set (most, however, occurred only once).</Paragraph>
    <Paragraph position="16"> The problem of finding the smallest training vocabulary that includes a given set of PICs is probably NPcomplete. Still, it is easy to find a reasonably good approximation to the solution of the problem. In 6000 isolated words one can include about 22,000 different PICs.</Paragraph>
    <Paragraph position="17"> Beyond this point it becomes difficult to find words that include more than one or two new PICs, but short phrases of diverse text which contain three or more new PICs are still easy to find. By using such phrases to enlarge the training vocabulary, we hope to acquire training data for 50,000 PICs within the next year.</Paragraph>
    <Paragraph position="18"> 4. Modeling PICs by Phonemic Segments A &amp;quot;vocabulary&amp;quot; of 50,000 independent PICs would be no more manageable than a vocabulary of 50,000 independent isolated words, but PICs are not independent.</Paragraph>
    <Paragraph position="19"> Most of the PICs for a stop consonant, for example, involve an identical segment of silence, for example, while all PICs for the sibilant &amp;quot;s&amp;quot; are characterized by the absence of low-frequency energy. One can hope, therefore, to represent the thousand or so PICs that represent the same phoneme in various contexts in terms of a much smaller number of &amp;quot;phonemic segments&amp;quot;. For phonemes that exhibit a great deal of allophonic variation, such as &amp;quot;t&amp;quot;, &amp;quot;k&amp;quot;, and schwa, as many as 64 different segment models may be required, while for phonemes like &amp;quot;s&amp;quot; and &amp;quot;sh&amp;quot; that are little influenced by context, as few as ten may suffice. For the complete set of 77 phonemes used in English, slightly more than 2000 segment models suffice. In \[4\], an approach to modeling allphonic models using a small number of distributions was described. Similarly, in \[5\], an alternate way of performing parameter tying across distinct triphones using a triphone clustering procedure was described.</Paragraph>
    <Paragraph position="20"> A phonemic segment can be characterized in two alternative ways. At the simpler level, it can be regarded as a fragment of the sort of acoustic data that would be generated by the &amp;quot;front end&amp;quot; of a speech-recognition system. In the case of the current Dragon recognizer, this is nothing more than a simulated spectrum based on an amplitude parameter and several spectral parameters. At a more sophisticated level, a phonemic segment includes enough information to generate a probability distribution for use in hidden Markov modeling. For the current Dragon recognizer, this requires calculation of the absolute deviation from the mean, as well as the mean for each acoustic parameter. The same distinction between what will be called a &amp;quot;spectral model&amp;quot; and what will be called a &amp;quot;Markov model&amp;quot; applies also to continuous parameters that have no direct spectral interpretation (cepstral parameters, for example), or to discrete parameters. In the following discussion, the term &amp;quot;spectrum&amp;quot; should be interpreted to mean any sequence of parameters that results from processing a speech waveform, while &amp;quot;Markov model&amp;quot; should be interpreted as a random process capable of generating such sequences.</Paragraph>
    <Paragraph position="21"> One may think of a PIC as a probabilistic model for a portion of a speech spectrogram corresponding to a single phoneme. The problem of representing this PIC as a sequence of phonemic segments is solved by hidden Markov modeling. The sequence may be from one to six segments in length, and the same segment may occur in more than one position in the sequence. There is no constraint on the order of segments within the sequence.Thus the model for a phoneme with n segments is represented by the diagram below.</Paragraph>
    <Paragraph position="22"> sta n d  The arcs labeled 1, 2 .... n correspond to one or more frames of acoustic data corresponding to the single segment 1, 2 .... n. The arcs labeled x permit a given phoneme to have a sequence of fewer than six phonemes associated with it. These null arcs are assigned slightly higher transition probabilities than the arcs associated with phonemic segments.</Paragraph>
    <Paragraph position="23">  Thus a PIC may be represented very compactly as a sequence of one to six pairs, each pair consisting of a phonemic segment and a duration.This sequence may be regarded as the best piecewise-constant approximation to the spectrogram.</Paragraph>
    <Paragraph position="24"> For speaker adaptation, the phonemic segment is the basic unit. It is assumed that the representation of a PIC in terms of segments is valid for all speakers, so that adapting the small number of segments for a phoneme will have the effect of adapting the much larger number of PICs. Segment durations within a PIC can also be adapted, but only by acoustic data involving that particular PIC.</Paragraph>
  </Section>
  <Section position="4" start_page="165" end_page="167" type="metho">
    <SectionTitle>
5. Labeling Training Data
</SectionTitle>
    <Paragraph position="0"> To build a spectral model for a PIC, one must find one or more spectrograms that involve that PIC, then extract from these spectrograms the data for the phoneme in the desired PIC. Thus phonemically labeled training data are required.</Paragraph>
    <Paragraph position="1"> Given a complete set of hidden Markov models representing PICs, the labeling problem could easily be solved by dynamic programming and traceback. This approach is the correct one to use for implementing adaptation, but it is inappropriate for training, since the labeled training data would be required in order to produce the PIC models in the first place. To do semiautomatic labeling with an incomplete set of phonemic segments and with no PIC models, a simpler scheme must be used, one which deals gracefully with the situation where PIC models have not yet been created and where some portions of spectrograms cannot yet be labeled.</Paragraph>
    <Paragraph position="2"> The full Markov model for a word is a sequence of models for the phonemes of the word, starting and ending with silence. Silence is modeled, like any other phoneme, by a set of segments. Between the phoneme models are &amp;quot;transition nodes&amp;quot; with fixed transition probabilities that are chosen to be slightly lower than the typical probability for the best phoneme segment. Thus the model for &amp;quot;at&amp;quot; might be represented as follows: I--- silence transition transition transition \/,... ' ~ / ':.':::::.~~~i ~kk .............................../~~::::~...&amp;quot;il~ / ='&amp;quot;: ............................. , v t ~,:-:~B.:~3 = = = =&amp;quot; ~&amp;quot; -- ='lii::::i~:::~::i-::,:~i~::~ii~i~ir ---r - n ~ ~.::-~:::::.~. :::~:': |v :::::::::::::::::::::::::::::::::::::::::::: v silence --O Figure 6.A Markov Model for &amp;quot;at&amp;quot;.</Paragraph>
    <Paragraph position="3"> Each box represents a phoneme model of one to six states, as described above.</Paragraph>
    <Paragraph position="4"> Once the best path has been found by dynamic programming, traceback at the phoneme level assigns a start time and end time to each phoneme. If a complete set of phonemic segments has been constructed, the start time for each phoneme coincides with the end time for its predecessor phoneme. To the extent that there are acoustic segments that are not yet well modeled by any phonemic segment, the data that correspond to this segment will be assigned to an interphoneme transition.</Paragraph>
    <Paragraph position="5"> The phoneme-level traceback is recorded within each training token. This makes it possible, without repeating the dynamic programming, to identify the portion of a given training token that correspond to a specified phoneme--an important step in locating training data for a specific PIC. Traceback can also be performed at a lower level in order to determine the sequence of phonemic segments that corresponds to an individual PIC. The data thus assigned to a segment may then be used as ~aining data for that segment to improve the estimates of the means and variances for the acoustic parameters of that segment.</Paragraph>
    <Paragraph position="6"> The net effect of dynamic programming followed by traceback at the word level and at the phoneme level is to assign to each &amp;quot;frame&amp;quot; of acoustic data of the word a phoneme segment label, subject to the following  constraints: * Phonemes appear in the order specified by the pronunciation of the word.</Paragraph>
    <Paragraph position="7"> * For each phoneme, there are no more than five transitions from one segment to another.</Paragraph>
    <Paragraph position="8"> * Transition frames with no segment assignment may  occur only between phonemes.</Paragraph>
    <Paragraph position="9"> The process of labeling the training data is not completely automatic, but it becomes more and more nearly so as the set of phonemic segments increases in size. In practice, phonemic segments are initialized &amp;quot;by hand&amp;quot;. On a spectral display of a training token, a sequence of frames is selected. The means and variances for the acoustic parameters of those frames provide the initial estimates for the segment parameters. Even in the absence of any previously labeled segments, it is a straightforward matter to initialize a set of segments that will provide a correct phonemic labeling of a single token, and these segments in turn prove useful in labeling other tokens. As more and more tokens are labeled in this manner, a set of segments develops that suffices to label a greater and greater fraction of new tokens, until eventually any new token can be labeled without the need for interphoneme transitions.</Paragraph>
    <Paragraph position="10"> As new segments are created during the labeling process, occasionally the limit of 64 segments for a phoneme is reached. Whenever this occurs, the two segments that are most similar are automatically combined into a single segment.</Paragraph>
    <Paragraph position="11"> Once a thousand or so training tokens have been labeled, transition segments that are more than about thirty milliseconds long become difficult to find. At this point the  best strategy is to label all the training tokens automatically, then to search for the longest transition segments and to use them to create new phonemic segments. This process can be iterated until no transition segments remain.</Paragraph>
    <Paragraph position="12"> To make use of duration constraints in labeling, an alternative version of the dynamic programming is used which closely resembles the one used by Dragon's smallvocabulary recognition and training algorithm. To each phoneme in the word, an expected duration in milliseconds is assigned. To the extent that the actual duration of the speech assigned to that phoneme is less than or greater than the expected duration, a duration penalty is added to the dynamic programming score. The traceback is then determined both by acoustic match and by duration constraints. While a clear-cut phoneme boundary such as one before or after an 's' will be little affected by duration constraints, a boundary that is associated with almost no acoustic feature (between two stops, for example) will be assigned primarily on the basis of durations.</Paragraph>
    <Paragraph position="13"> In order to estimate durations, the hypothesis is made that changing the left or right context of a phoneme has little effect on the duration of that phoneme except in the case where the context is silence. As stated above, the duration of the final T in &amp;quot;all&amp;quot; ought to be the same as the duration of the final T in &amp;quot;wheel&amp;quot;, &amp;quot;bell&amp;quot;, or other words where there is a clear formant transition into the 'T'. As another example, the 'p' and 't' in &amp;quot;opted&amp;quot; should each have a duration close to that of a single intervocalic stop.</Paragraph>
    <Paragraph position="14"> For each PIC, an expected duration is determined by averaging together four quantities: * the duration of the phoneme in the precise context specified by the PIC (which may occur only once in the training vocabulary).</Paragraph>
    <Paragraph position="15"> * the duration of the phoneme with the specified left context and an arbitrary right context.</Paragraph>
    <Paragraph position="16"> * the duration of the phoneme with the specified right context and an arbitrary left context.</Paragraph>
    <Paragraph position="17"> * the duration of the phoneme with both left and right context arbitrary.</Paragraph>
    <Paragraph position="18"> In no case, however, is a silence context substituted for a non-silence context or vice versa.</Paragraph>
    <Paragraph position="19"> The semiautomatic labeling process described above has been under development for more than a year, with results that appear more and more satisfactory as the new phonemic segments are identified and duration estimates are improved. By using a set of about 2000 segments and imposing duration constraints on the dynamic programming, it is possible to achieve automatic phonemic labeling that agrees with hand labeling in almost every case and that is probably more consistent than hand labeling with regard to such difficult, arbitrary decisions as placing boundaries between adjacent front vowels or between glides and vowels. Most labels that a human labeler might question can be located by looking just at the small fraction of words for which the actual and expected duration of a phoneme differ significantly.</Paragraph>
    <Paragraph position="20"> By exploring situations in which the expected durations of phonemes in correctly labeled words are systematically in error, it is possible to discover new duration rules which can be incorporated into more refined characterization of PICs. Each such rule, though, leads to an increase in the total number of PICs that must be trained. 6. Building Models for PICs Given a sufficiently large quantity of training data, one can create an excellent model for a PIC by averaging together all examples of that PIC in the training vocabulary. For example, a model can be built for the phoneme &amp;quot;sh&amp;quot; in the context &amp;quot;ation&amp;quot; by averaging together the data labeled as &amp;quot;sh&amp;quot; in words such as &amp;quot;nation&amp;quot;, &amp;quot;creation&amp;quot;, and &amp;quot;situation&amp;quot;. Unfortunately, the assumption of a large quantity of training data for each PIC is unrealistic. There are, for example, about 1500 contexts in the DragonDictate 25,000 word vocabulary, and many contexts in connected speech, for which even the current training set of 16,000 items provides no examples. For thousands of other PICs there is only a single example in the training set. Thus, in modeling a PIC, it is important to employ training data from closely related PICs.</Paragraph>
    <Paragraph position="21"> In most cases the left context of a phoneme influences primarily the first half of the phoneme, while the right context influences primarily the second half. Furthermore, there are groups of phonemes which give rise to almost identical coarticulation effects: different stress levels of the same vowel, for example.</Paragraph>
    <Paragraph position="22"> The general strategy for building a model for a phoneme in a given context is to compute a weighted average of all the data in the training vocabulary for the given phoneme in the desired context or any similar context. The weight assigned to a context depends upon how well it matches the desired context.</Paragraph>
    <Paragraph position="23"> Weights are assigned separately for the left context and the right context, and two models are constructed. The first of these, where a high weight implies that the left context is very close to the desired left context (although the right context may be wrong) is used for the first half of the model. The second model, where a high weight implies that the right context is correct, is used for the second half of the model.</Paragraph>
    <Paragraph position="24"> Each phoneme is assigned both to a &amp;quot;left context group&amp;quot; and to a &amp;quot;fight context group&amp;quot;. The phonemes in left context group should all produce similar coarticulation effects at the start of a phoneme, while those in the same right context group should produce similar effects at the end of a phoneme.</Paragraph>
    <Paragraph position="25"> To build a model for a PIC, all examples of contexts similar to the desired PIC are extracted from the training vocabulary. Each context is assigned a &amp;quot;left weight&amp;quot; and a &amp;quot;right weight&amp;quot; according to the degree of match between the desired context in the PIC and the actual context in the training item.</Paragraph>
    <Paragraph position="26"> From the data a weighted average of the durations is now computed. Tokens for which the duration is close to the average are doubled in weight, while those that are far from the average duration are halved in weight.</Paragraph>
    <Paragraph position="27"> Finally all the examples of the desired phoneme are averaged together using a linear alignment algorithm which normalizes all examples so that they have the same length, then averages together acoustic parameters at intervals of 10 milliseconds. This procedure is carried out twice, once with left weights, once with right weights. The first half of the  &amp;quot;left model&amp;quot; and the second half of the &amp;quot;right model&amp;quot; are concatenated to form the final spectral model for the PIC. Models for initial and final silence in each context are created by averaging the initial silence from training words that begin with the desired phoneme and by averaging the final silence from words that end with the desired phoneme. Consider, for example, the comparatively unusual PIC &amp;quot;lak&amp;quot; (secondary stress on vowel, no prepausal lengthening). No word in the training set contains this PIC, although &amp;quot;Cadillacs&amp;quot; has the same PIC with prepausal lengthening. The &amp;quot;left&amp;quot; model, built from &amp;quot;implants&amp;quot;, &amp;quot;overlap shadows&amp;quot;, &amp;quot;eggplant&amp;quot;, &amp;quot;Cadillacs&amp;quot;, and &amp;quot;mainland gale&amp;quot;, captures well the second formant transition between the 'T' and the vowel. The &amp;quot;fight&amp;quot; model captures the spectrum of the vowel before &amp;quot;k&amp;quot;. The concatenated model has both features well modeled.</Paragraph>
    <Paragraph position="28"> These spectral models for PICs are not yet hidden Markov models, since they include only the means of acoustic parameters, but not the variances. They also have no direct connection with phonemic segments. The final step in the training process is to convert them to adaptable Markov models that are based on phonemic segments.</Paragraph>
    <Paragraph position="29"> Converting a spectral model for a PIC to a Markov model for that PIC employs the same algorithm that is used for labeling training data. Dynamic programming is used to determine the sequence of phonemic segments that has the greatest likelihood of generating the spectral model for the PIC. These phonemic segments become the nodes of the Markov model for the PIC. Concatenating the parameter means for the nodes, with each node given the duration determined by the dynamic programming, produces the optimal piecewise-constant approximation to the spectral model for the PIC.</Paragraph>
    <Paragraph position="30"> The variances in the parameters for each phonemic segment correctly reflect the fact that each segment appears in many different PICs. Because training tokens are already averages of three utterances, the variances underestimate the variation in parameters from one utterance to another. To compensate for this, the variances in the phonemic segment models that are used for recognition are made somewhat larger than the estimates that arise from training.</Paragraph>
    <Paragraph position="31"> Because the large number of PIC models are all constructed from about 2000 phonemic segments, they adapt quickly to a new speaker. The strategy for adaptation is simply to treat each utterance as if it were new training data. By dynamic programming the utterance is segmented into PICs, which are in turn subdivided in phonemic segments.</Paragraph>
    <Paragraph position="32"> The acoustic data assigned to each segment are used to reesfimate the means and variance for that segment. For the mammography task, a set of 500 sentences to be used for adaptation has been developed that includes more than 90% of the PICs used by the recognizer. Since most phonemic segments occur in many different PICs, these 500 sentences provide diverse training data for almost all segments, sufficient to provide good estimates of their parameter means and variances for a new speaker. Estimates of segment durations for each PIC are also improved as a result of adaptation, although for this purpose the 500 sentences provide much less data.</Paragraph>
    <Paragraph position="33"> To achieve real-time recognition of connected speech, a rapid-match algofithm is used to reduce the number of words for which full dynamic programming is carried out\[l\]. This algorithm requires models which incorporate accurate duration information and which capture coarticulation effects averaged over all possible contexts for a word. The training for the rapid-match model for a word makes use of a concatenation of spectral models for the PICs of the word, with a &amp;quot;generic speech&amp;quot; left context used for the first phoneme and a &amp;quot;genetic speech&amp;quot; fight context used for the last phoneme of the word.</Paragraph>
  </Section>
</Paper>