<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1064">
  <Title>The LIMSI Continuous Speech Dictation System</Title>
  <Section position="5" start_page="319" end_page="319" type="metho">
    <SectionTitle>
LANGUAGE MODELING
</SectionTitle>
    <Paragraph position="0"> Language modeling entails incorporating constraints on the allowable sequences of words which form a sentence.</Paragraph>
    <Paragraph position="1"> Statistical n-gram models attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words. In this work bigram and trigram language models are estimated on the training text material for each corpus. This data consists of 37M words of the WSJ 1 and 38M words of Le Monde. A backoff mechanism\[10\] is used to smooth the estimates of the probabilities of rare n-grams by relying on a lower order n-gram when there is insufficient training data, and to provide a means of modeling unobserved n-grams. Another advantage of the backoff mechanism is that LM size can be arbitrarily reduced by relying more on the backoff, by increasing the minimum number of required n-gram observations needed to include the n-gram. This property can be used in the first bigram decod1While we have built n-gram-backoff LMs directly from the 37M-word standardized WSJ training text material, in these experiments all results are reported using the 5k or 20k, bigram and tfigram backoff LMs provided by Lincoln Labs\[ 19\] as required by ARPA so as to be compatible with the other sites participating in the tests.</Paragraph>
    <Paragraph position="2"> ing pass to reduce computational requirements. The trigram langage model is used in the second pass of the decoding process:.</Paragraph>
    <Paragraph position="3"> In order to be able to constnact LMs for BREF, it was necessary to normalize the text material of Le Monde newpaper, which entailed a pre-treatment rather different from that used to normalize the WSJ texts\[19\]. The main differences are in the treatment of compound words, abbreviations, and case. In BREF the distinction between the cases is kept if it designates a distinctive graphemic feature, but not when the upper case is simply due to the fact that the word occurs at the beginning of the sentence. Thus, the first word of each sentence was semi-automatically verified to determine if a transformation to lower case was needed. Special treatment is also needed for the symbols hyphen (-), quote ('), and period (.) which can lead to ambiguous separations.</Paragraph>
    <Paragraph position="4"> For example, the hyphen in compound words like beaux-arts and au-dessus is considered word-internal. Alternatively the hyphen may be associated with the first word as in ex-, or anti-, or with the second word as in -Id or -nL Finally, it may appear in the text even though it is not associated with any word. The quote can have two different separations: it can be word internal (aujourd' hui, o'Donnel, hors-d'oeuvre), or may be part of the first word (l'aml). Similarly the period may be part of a word, for instance, L.A., sec. (secondes), p.</Paragraph>
    <Paragraph position="5"> (page), or simply an end-of-sentence mark.</Paragraph>
    <Paragraph position="6"> Table 1 compares some characteristics of the WSJ and Le Monde text corpora. In the same size training texts, there are almost 60% more distinct words for Le Monde than for WSJ without taking case into account. 2 As a consequence, the lexical coverage for a given size lexicon is smaller for Le Monde than for WSJ. For example, the 20k WSJ lexicon accounts for 97.5% of word occurrences, but the 20k BREF lexicon only covers 94.9% of word occurrences in the training texts. For lexicons in the range of 5k to 40k words, the number of words must be doubled for Le Monde in order to obtain the same word coverage as for WSJ.</Paragraph>
    <Paragraph position="7"> The lexical ambiguity is also higher for French than for English. The homophone rate (the number of words which have a homophone divided by the total number of words) in the 20k BREF lexicon is 57% compared to 9% in 20k-open WSJ lexicon. This effect is even greater if the word frequencies are taken into account. Given a perfect phonemic transcription, 23% of words in the WSJ training texts is ambiguous, whereas 75% of the words in the Le Monde training texts have an ambiguous phonemic transcription. Not only does one phonemic form correspond to different orthographic forms, there can also be a relatively large number of possible pronunciations for a given word. In French, the alternate pronunciations arise mainly from optional word-final phones, due to liaison and optional word-final consonant cluster re- null ruction (see Figure 1). There are also a larger number of frequent, monophone words for Le Monde than for WSJ, accounting for about 17% and 3% of all word occurrences in the respective training texts.</Paragraph>
  </Section>
  <Section position="6" start_page="319" end_page="320" type="metho">
    <SectionTitle>
ACOUSTIC-PHONETIC MODELING
</SectionTitle>
    <Paragraph position="0"> The recognizer makes use of continuous density HMM (CDHMM) with Gaussian mixture for acoustic modeling.</Paragraph>
    <Paragraph position="1"> The main advantage continuous density modeling offers over discrete or semi-continuous (or tied-mixture) observation density modeling is that the number of parameters used to model an HMM observation distribution can easily be adapted to the amount of available training data associated to this state. As a consequence, high precision modeling can be achieved for highly frequented states without the explicit need of smoothing techniques for the densities of less frequented states. Discrete and semi-continuous modeling use a fixed number of parameters to represent a given observation density and therefore cannot achieve high precision without the use of smoothing techniques. This problem can be alleviated by tying some states of the Markov models in order to have more training data to estimate each state distribution.</Paragraph>
    <Paragraph position="2"> However, since this kind of tying requires careful design and some a priori assumptions, these techniques are primarily of interest when the training data is limited and cannot easily be increased. In the experimental section we demonstrate the improvement in performance obtained on the same test data by simply using additional training material.</Paragraph>
    <Paragraph position="3"> A 48-component feature vector is computed every 10 ms.</Paragraph>
    <Paragraph position="4"> This feature vector consists of 16 Bark-frequency scale cepstrum coefficients computed on the 8kHz bandwidth and their first and second order derivatives. For each frame (30 ms window), a 15 channel Bark power spectrum is obtained by applying triangular windows to the DFT output. The cepstrum coefficients are then computed using a cosinus transform \[2\].</Paragraph>
    <Paragraph position="5"> The acoustic models are sets of context-dependent(CD), position independent phone models, which include both intra-word and cross-word contexts. The contexts are automatically selected based on their frequencies in the training data. The models include tfiphone models, fight- and  left-context phone models, and context-independent phone models. Each phone model is a left-to-right CDHMM with Gaussian mixture observation densities (typically 32 components). The covariance matrices of all the Gaussians are diagonal. Duration is modeled with a gamma distribution per phone model. The HMM and duration parameters are estimated separately and combined in the recognition process for the Viterbi search. Maximum a postedori estimators are used for the HMM parameters\[8\] and moment estimators for the gamma distributions. Separate male and female models are used to more accurately model the speech data.</Paragraph>
    <Paragraph position="6"> Dunng system development phone recognition has been used to evaluate different acoustic model sets. It has been shown that improvements in phone accuracy are directly indicative of improvements in word accuracy when the same phone models are used for recognition\[12\]. Phone recognition provides the added benefit that the recognized phone string can be used to understand word recognition errors and problems in the lexical representation.</Paragraph>
  </Section>
  <Section position="7" start_page="320" end_page="320" type="metho">
    <SectionTitle>
LEXICAL REPRESENTATION
</SectionTitle>
    <Paragraph position="0"> Lexicons containing 5k, 20k, and 64k words have been used in these experiments. The lexicons are represented phonemically, using language-specific sets of phonemes.</Paragraph>
    <Paragraph position="1"> Each lexicon has alternate pronunciations for some of the words, and allows some of the phones to be optional) A pronunciation graph is generated for each word from the baseform transcription to which word internal phonological rules are optionally applied during training and recognition to account for some of the phonological variations observed in fluent speech. The WSJ lexicons are represented using a set of 46 phonemes, including 21 vowels, 24 consonants, and silence. Training and test lexicons were created at LIMSI and include some input from modified versions of the TIMIT, Pocket and Moby lexicons. Missing forms were generated by rule when possible, or added by hand. Some pronounciations for proper names were kindly provided by Murray Spiegel at Bellcore from the Orator system. The BREF lexicons, corresponding to the 5k and 20k most common words in the Le Monde texts are represented with 35 phonemes including 14 vowels, 20 consonants, and silence\[3\]. The base pronunciations, obtained using text-to-phoneme rules\[20\], were extended to annotate potential liaisons and pronunciation variants. Some example lexical entries are given in  Word boundary phonological rules are applied in building the phone graph used by the recognizer so as to allow for some of the phonological variations observed in fluent speech\[11\]. The principle behind the phonological rules is to modify the phone network to take into account such vari- null { } are optional, phones in \[ \] are alternates. 0 specify a context constraint and V stands for vowel, C for consonant and the period represents silence.</Paragraph>
    <Paragraph position="2"> ations. These rules are optionally applied during training and recognition. Using optional phonological rules during training results in better acoustic models, as they are less &amp;quot;polluted&amp;quot; by wrong transcriptions. Their use during recognition reduces the number of mismatches. For English, only * well known phonological rules, such as glide insertion, stop deletion, homorganic stop insertion, palatalization, and voicing assimilation have been incorporated in the system. The same mechanism has been used to handle liaisons, mute-e, and final consonant cluster reduction for French.</Paragraph>
  </Section>
  <Section position="8" start_page="320" end_page="321" type="metho">
    <SectionTitle>
SEARCH STRATEGY
</SectionTitle>
    <Paragraph position="0"> One of the most important problems in implementing a large vocabulary speech recognizer is the design of an efficient search algorithm to deal with the huge search space, especially when using language models with a longer span than two successive words, such as trigrams. The most commonly used approach for small and medium vocabulary sizes is the one-pass frame-synchronous beam search \[16\] which uses a dynamic programming procedure. This basic strategy has been recently extended by adding other features such as &amp;quot;fast match&amp;quot;\[9, 1\], N-best rescoring\[21\], and progressive search\[15\]. The two-pass approach used in our system is based on the idea of progressive search where the information between levels is transmitted via word graphs. Prior to word recognition, sex identification is performed for each sentence using phone-based ergodic HMMs\[13\]. The word recognizer is then run with a bigram LM using the acoustic model set corresponding to the identified sex.</Paragraph>
    <Paragraph position="1"> The first pass uses a bigram-backoff LM with a tree organization of the lexicon for the backoff component. This one-pass frame-synchronous beam search, which includes intra- and inter-word CD phone models, intra- and inter-word phonological rules, phone duration models, and gender-dependent models, generates a list of word hypotheses resulting in a word lattice. Two problems need to be considered  at this level. The first is whether or not the dynamic programming procedure used in the first pass, which guarantees the optimality of the search for the bigram, generates an &amp;quot;optimal&amp;quot; lattice to be used with a trigram LM. For example, any giwen word in the lattice will have many possible ending points, but only a few starting points. This problem was in fact less severe than expected since the time information is not critical to generate an &amp;quot;optimal&amp;quot; word graph from the lattice, i.e. the multiple word endings provide enough flexibility to compensate for single word beginnings. The second consideration is that the lattice generated in this way cannot be too large or there is no interest in a two pass approach. To solve this second problem, two pruning thresholds are used during r, he first pass, a beam search pruning threshold which is kept to a level insuring almost no search errors (from the bigram point of view) and a word lattice pruning threshold used to control the lattice size.</Paragraph>
    <Paragraph position="2"> A description of the exact procedure used to generate the word graph from the word lattice is beyond the scope of this paper. The following steps give the key elements behind the procedure. 4 First, a word graph is generated from the lattice by merging three consecutive frames (i.e. the minimum duration for a word in our system). Then, &amp;quot;similar&amp;quot; graph nodes are merged with the goal of reducing the overall graph size and generalizing the word lattice. This step is reiterated until no further reductions are possible. Finally, based on the trigram backoff language model a trigram word graph is then generated by duplicating the nodes having multiple language model contexts. Bigram backoff nodes are created when possible to limit the graph expansion.</Paragraph>
    <Paragraph position="3"> To fix these ideas, let us consider some numbers for the WSJ 5k-closed vocabulary. With the pruning threshold set at a level such that there are only a negligible number of search errors, the first pass generates a word lattice containing on average 10,000 word hypotheses per sentence. The generated word graph before trigram expansion contains on average 1400 arcs. After expansion with the trigram backoff LM, there are on average 3900 word instanciations including silences which are treated the same way as words.</Paragraph>
    <Paragraph position="4"> It should be noted that this decoding strategy based on two forward passes can in fact be implemented in a single forward pass using one or two processors. We are using a two pass solution because it is conceptually simpler, and also due to memory constraints.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML