<?xml version="1.0" standalone="yes"?> <Paper uid="H89-1027"> <Title>THE MIT SUMMIT SPEECH RECOGNITION SYSTEM: A PROGRESS REPORT*</Title> <Section position="3" start_page="179" end_page="179" type="metho"> <SectionTitle> SYSTEM DESCRIPTION </SectionTitle> <Paragraph position="0"> There are three major components in the SUMMIT system, as illustrated in Figure 1. The first component transforms the speech signal into an acoustic-phonetic description. The second expands a set of baseform pronunciations into a lexical network. The final component provides natural language constraints. Our preliminary efforts in natural language are described in a companion paper \[2\]. The acoustic-phonetic and lexical components will be discussed in more detail in the following sections.</Paragraph> </Section> <Section position="4" start_page="179" end_page="182" type="metho"> <SectionTitle> ACOUSTIC-PHONETIC REPRESENTATION </SectionTitle> <Paragraph position="0"> The phonetic recognition subsystem of SUMMIT takes as input the speech signal and produces as output a network of phonetic labels with scores indicating the system's confidence in the segments and in the accuracy of the labels. The subsystem contains three parts: signal representation, acoustic segmentation, and phonetic classification. In this section, we describe each of these three parts in some detail.</Paragraph> <Section position="1" start_page="179" end_page="180" type="sub_section"> <SectionTitle> Signal Representation </SectionTitle> <Paragraph position="0"> The phonetic recognition process starts by transforming the speech signal into a representation based on Seneff's auditory model \[3\]. The model has three stages. The first stage is a bank of linear filters, equally spaced on a critical-band scale. This is followed by a nonlinear stage that models the transduction process of the hair cells and the nerve synapses. The output of the second stage bifurcates, one branch corresponding to the mean firing rate of an auditory nerve fiber, and the other measuring the synchrony of the signal to the fiber's characteristic frequency.</Paragraph> <Paragraph position="1"> The outputs from various stages of this model are appropriate for different operations in our subsystem. The nonlinearities of the second stage produce sharper onsets and offsets than are achieved through simple linear filtering. In addition, irrelevant acoustic information is often masked or suppressed. These properties make such a representation well-suited for the detection of acoustic landmarks. The synchrony response, on the other hand, provides enhanced spectral peaks. Since these peaks often correspond to formant frequencies in vowel and sonorant consonant regions, we surmise that the synchrony representation may be particularly useful for performing fine phonetic distinctions. Advantages of using an auditory model for speech recognition have been demonstrated in many contexts, and can be found readily in the literature \[4,5,6\].</Paragraph> </Section>
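The auditory front end itself is specified in \[3\]; the fragment below is only a minimal, schematic sketch in the same spirit, not Seneff's model: a bank of band-pass filters spaced on a critical-band (Bark-like) scale, followed by a crude rectifying and compressive nonlinearity and frame-rate smoothing that yields a mean-rate-like envelope per channel. The sampling rate, channel count, frequency range, and constants are illustrative assumptions, and the synchrony branch is omitted entirely.

# Schematic critical-band front end (illustrative only; not Seneff's model [3]).
# Assumed parameters: 16 kHz audio, 40 channels, 5 ms frames.
import numpy as np
from scipy.signal import butter, sosfilt

def bark_to_hz(b):
    # Inverse of a common Bark approximation (Traunmuller); an assumption here.
    return 1960.0 * (b + 0.53) / (26.28 - b)

def mean_rate_envelopes(signal, fs=16000, n_channels=40, frame_ms=5.0):
    """Return an (n_frames, n_channels) array of smoothed channel envelopes."""
    edges_bark = np.linspace(1.0, 20.5, n_channels + 1)   # equally spaced on a critical-band scale
    edges_hz = bark_to_hz(edges_bark)                     # stays below the 8 kHz Nyquist limit
    frame_len = int(fs * frame_ms / 1000.0)
    outputs = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, signal)
        rectified = np.maximum(band, 0.0)                 # crude hair-cell-like nonlinearity
        compressed = np.log1p(20.0 * rectified)           # compressive stage (assumed constant)
        # Smooth and downsample to the frame rate (mean-rate-like output).
        n_frames = len(compressed) // frame_len
        env = compressed[: n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)
        outputs.append(env)
    return np.stack(outputs, axis=1)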
<Section position="2" start_page="180" end_page="180" type="sub_section"> <SectionTitle> Acoustic Segmentation </SectionTitle> <Paragraph position="0"> Outputs of the auditory model are used to perform acoustic segmentation. The objective of the segmentation procedure is to establish explicit acoustic landmarks that will facilitate subsequent feature extraction and phonetic classification. Since there exists no single level of segmental representation that can adequately describe all the acoustic events of interest, we adopted a multi-level representation that enables us to capture both gradual and abrupt changes in one uniform structure. Once such a structure has been determined, acoustic-phonetic analysis can then be formulated as a path-finding problem in a highly constrained search space.</Paragraph> <Paragraph position="1"> The construction of the multi-level representation has been described elsewhere \[7,8\]. Briefly, the algorithm delineates the speech signal into regions that are acoustically homogeneous by associating a given frame with one of its immediate neighbors. Acoustic boundaries are marked whenever the association direction switches from past to future. The procedure is then repeated by comparing a given acoustic region with its neighboring regions. When two adjacent regions associate with each other, they are merged together to form a single region. The process repeats until the entire utterance is described by a single acoustic event.</Paragraph> <Paragraph position="2"> By keeping track of the distance at which two regions merge into one, the multi-level description can be displayed in the form of a dendrogram, as is illustrated in Figure 2 for the utterance &quot;Call an ambulance for medical assistance.&quot; From the bottom towards the top of the dendrogram, the acoustic description varies from fine to coarse. The release of the /k/ in &quot;call,&quot; for example, may be considered to be a single acoustic event or a combination of two events (release plus aspiration) depending on the level of detail desired. By comparing the dendrogram with the time-aligned phonetic transcription shown below, we see that, for this example, most of the acoustic events of interest have been captured.</Paragraph> </Section> <Section position="3" start_page="180" end_page="182" type="sub_section"> <SectionTitle> Phonetic Recognition </SectionTitle> <Paragraph position="0"> The multi-level acoustic segmentation provides an acoustic description of the signal. Before lexical access can be performed, the acoustic regions must be converted into a form that reflects the way words are represented in the lexicon, which, in our case, is in terms of phonemes. Since some of the phonemes can have more than one stable acoustic region, the mapping between phonemes and acoustic regions cannot be one-to-one. Currently, we allow up to two acoustic regions to represent a single phoneme. This is implemented by creating an acoustic-phonetic (AP) network from the dendrogram that includes all single and paired regions.</Paragraph> <Paragraph position="1"> We have experimentally found this choice to be a reasonable compromise between a flexible representation and computational tractability. To account for the fact that certain paths through the AP network are more likely to occur than others, each segment is assigned a weight.</Paragraph>
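As a concrete illustration of the segmentation procedure described above, the sketch below groups frames into seed regions by associating each frame with its nearer neighbor and cutting wherever the association direction switches from past to future, and then merges adjacent regions bottom-up while recording the merge distances that define the dendrogram. This is not the SUMMIT implementation \[7,8\]: the Euclidean distance between auditory-model frame vectors, the use of region means, and the greedy closest-adjacent-pair merge order are simplifying assumptions.

# Minimal sketch of multi-level acoustic segmentation (illustrative, not the SUMMIT code).
import numpy as np

def frame_associations(frames):
    """For each interior frame, record whether its nearer neighbor is the past or future frame."""
    assoc = []
    for t in range(1, len(frames) - 1):
        d_past = np.linalg.norm(frames[t] - frames[t - 1])
        d_future = np.linalg.norm(frames[t] - frames[t + 1])
        assoc.append("past" if d_past <= d_future else "future")
    return assoc

def seed_regions(frames):
    """Cut the utterance wherever the association direction switches from past to future."""
    assoc = frame_associations(frames)
    cuts = [0]
    for i in range(len(assoc) - 1):
        if assoc[i] == "past" and assoc[i + 1] == "future":
            cuts.append(i + 2)          # boundary falls between frames i+1 and i+2
    cuts.append(len(frames))
    return [(cuts[k], cuts[k + 1]) for k in range(len(cuts) - 1)]

def dendrogram(frames):
    """Merge adjacent regions bottom-up, recording the distance at which each merge occurs."""
    regions = seed_regions(frames)
    means = [frames[a:b].mean(axis=0) for a, b in regions]
    merges = []                          # list of (left_region, right_region, merge_distance)
    while len(regions) > 1:
        dists = [np.linalg.norm(means[i] - means[i + 1]) for i in range(len(regions) - 1)]
        i = int(np.argmin(dists))        # greedily merge the closest adjacent pair
        merges.append((regions[i], regions[i + 1], dists[i]))
        a, _ = regions[i]
        _, b = regions[i + 1]
        regions[i:i + 2] = [(a, b)]
        means[i:i + 2] = [frames[a:b].mean(axis=0)]
    return merges                        # sorted by merge order; distances give the dendrogram heights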
<Paragraph position="2"> Next, each of the segments in the AP network is described in terms of a set of attributes, which are then transformed into a set of phoneme hypotheses. Rather than defining specific algorithms to measure the acoustic attributes, we define generic property detectors based on our knowledge of acoustic phonetics.</Paragraph> <Paragraph position="3"> These detectors have free parameters that control the details of the measurement. Their optimal settings are established by a search procedure using a large body of training data \[11\].</Paragraph> <Paragraph position="4"> This process is illustrated in Figure 3. In this example, we explore the use of the spectral center of gravity as a generic property detector for distinguishing front from back vowels. It has two free parameters, the lower and upper frequency edges. [Figure caption fragment: the panel on the right contains a) a spectrogram, b) a dendrogram, c) the time-aligned phonetic transcription, and d) an acoustic-phonetic network, for the utterance &quot;Call an ambulance for medical assistance.&quot;]</Paragraph> <Paragraph position="5"> An example of this measurement for a vowel token is superimposed on the spectral slice below the spectrogram, with the horizontal line indicating the frequency range. To determine the optimal settings for the free parameters, we first compute the classification performance on a large set of training data for all combinations of the parameter settings. We then search for the maximum on the surface defined by the classification performance. The parameter settings that correspond to the maximum are chosen to be the optimal settings. For this example, the classification performance of this attribute, using the automatically selected parameter settings, is shown at the top right corner. Note that an attribute can also be used in conjunction with other attributes, or to derive other attributes. We believe that the procedure described above is an example of successful knowledge engineering in which the human provides the knowledge and intuition, and the machine provides the computational power. Frequently, the settings result in a parameter that agrees with our phonetic intuitions. In this example, the optimal settings for this property detector result in an attribute that closely follows the second formant, which is known to be important for the front/back distinction. Our experience with this procedure suggests that it is able to discover important acoustic parameters that signify phonetic contrasts, without resorting to the use of heuristic rules.</Paragraph> <Paragraph position="6"> Once the attributes have been determined, they are selected through another optimization process. Classification is achieved using conventional pattern classification algorithms \[9\]. In our current scheme, we use a double-layered approach, with the first layer distinguishing among a small set of classes, and the second layer defining a mapping from these classes to the phone labels used to represent the lexicon. This approach enables us to build a small number of simple classifiers that distinguish the speech sounds along several phonetic dimensions. The aggregate of these dimensions describes the contextual variations, which can then be captured in the mapping between the classes and the lexicon. Our experience indicates that such an approach leads to rapid convergence in the models with only a small number of training tokens for each label.</Paragraph> <Paragraph position="7"> The current scheme for scoring the N classes begins with N(N-1)/2 pairwise Gaussian classifiers, each of which uses a subset of the acoustic attributes selected to optimize the discrimination of the pair. The probability of a given class is obtained by summing the probabilities from all the relevant pairwise results. The classes are then mapped to an orthogonal space using principal component analysis. Finally, the score for each phoneme label is obtained from a Gaussian model of the distributions of the scores for the transformed classes.</Paragraph>
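The first classification layer just described can be rendered as the following sketch. It is only an illustration of the idea, not the SUMMIT classifiers: each of the N(N-1)/2 class pairs gets its own pair of Gaussian models (the per-pair selection of attribute subsets is assumed to have been done beforehand), and a token's score for a class is the sum of its pairwise posteriors over all pairs containing that class. The principal-component mapping and the second-layer Gaussian modelling of the class scores are omitted here.

# Sketch of the first classification layer: pairwise Gaussian classifiers whose
# pairwise posteriors are summed into a per-class score (illustrative only).
from itertools import combinations
import numpy as np
from scipy.stats import multivariate_normal

def train_pairwise(features_by_class):
    """features_by_class: dict label -> (n_tokens, n_dims) array of acoustic attributes."""
    models = {}
    for a, b in combinations(sorted(features_by_class), 2):
        models[(a, b)] = {
            lab: multivariate_normal(mean=features_by_class[lab].mean(axis=0),
                                     cov=np.cov(features_by_class[lab], rowvar=False),
                                     allow_singular=True)
            for lab in (a, b)
        }
    return models

def class_scores(x, models):
    """Sum, over all pairs containing a class, that class's pairwise posterior for token x."""
    scores = {}
    for (a, b), pair in models.items():
        pa = pair[a].pdf(x)
        pb = pair[b].pdf(x)
        total = pa + pb + 1e-12          # guard against both densities underflowing to zero
        scores[a] = scores.get(a, 0.0) + pa / total   # P(a | x, {a, b})
        scores[b] = scores.get(b, 0.0) + pb / total
    return scores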
<Paragraph position="8"> Following phone classification, each segment in the AP network is represented by a list of phone candidates, with associated probabilities, as illustrated in Figure 2. The network is shown just below the transcription. In this display, only the AP segments surrounding the most probable path are displayed. The network displays only the top-choice label, although additional information can easily be accessed. For this example, the /k/ in &quot;call&quot; is correctly identified, and its score, in terms of probability, is displayed in the left-hand panel along with several near-miss candidates. On the other hand, the same panel shows that the correct label for the first schwa in &quot;assistance&quot; is the third most likely candidate, behind /n/ and /ŋ/.</Paragraph> </Section> </Section> <Section position="5" start_page="182" end_page="183" type="metho"> <SectionTitle> LEXICAL REPRESENTATION </SectionTitle> <Paragraph position="0"> We are adopting the point of view that it is preferable to offer several alternative pronunciations for each word in the lexicon, and then to build phoneme models that can be made more specific as a consequence. If accurate pronunciation probabilities can be acquired for the alternate forms, then this is a viable approach for capturing inherent variability in the acceptable pronunciations of words. For example, the last syllable in a word such as 'cushion' could be realized as a single syllabic nasal consonant or as a sequence of a vowel and a nasal consonant. The vowel could be realized as a short schwa, or as a normal lax vowel. For the system to be able to accept all of these alternatives, they must be entered into the lexicon in the form of a network. Currently, lexical pronunciations are expanded by rule to incorporate both within-word and across-word-boundary phonological effects. These rules describe common low-level phonological processes such as flapping, palatalization, and gemination.</Paragraph> <Paragraph position="1"> We have developed an automatic procedure for establishing probability weightings on all of the arcs in the word pronunciation networks. Currently the weights are entered into the total log probability score and are centered around a score of zero representing no influence. These weights were generated automatically by determining both the recognition path as well as the forced recognition path (i.e., the path obtained when the system is given the correct answer) for a large number of utterances. From this information, we computed: 1) the number of times an arc was used correctly, R, 2) the number of times an arc was missed, M, and 3) the number of times an arc was used incorrectly, W. Once these numbers were tabulated, we could assign a weight to each lexical arc. Currently, this weight corresponds to the log ratio of R + M, the total number of times an arc was used in the forced recognition path, to R + W, the total number of times an arc was used in the normal recognition path. Thus, if an arc was missed more often than it was used incorrectly, a positive weight is added to the lexical score, which will make the system prefer to use this arc. When the arc is more often incorrect, a negative weight is added, penalizing that arc. When there are the same number of misses as incorrect uses of the arc, or when they form a small fraction of the total number of times an arc was used correctly, the weight has little influence.</Paragraph>
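The arc weighting described above reduces to a simple count-based log ratio, sketched below. The additive smoothing constant is an assumption introduced here to avoid division by zero and is not part of the published procedure.

# Lexical-arc weight from forced vs. normal recognition counts (illustrative).
import math

def arc_weight(R, M, W, eps=0.5):
    """R: times used correctly, M: times missed, W: times used incorrectly.
    Weight = log of (uses in forced path) / (uses in normal recognition path)."""
    forced_uses = R + M          # the arc appears in the forced (correct-answer) path
    normal_uses = R + W          # the arc appears in the system's own best path
    return math.log((forced_uses + eps) / (normal_uses + eps))

# An arc missed more often than it is mis-used gets a positive weight, nudging the
# decoder toward that pronunciation; the reverse case gets a negative weight.
print(arc_weight(R=40, M=10, W=2))   # > 0
print(arc_weight(R=40, M=2, W=10))   # < 0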
</Section> <Section position="6" start_page="183" end_page="184" type="metho"> <SectionTitle> DECODER </SectionTitle> <Paragraph position="0"> The lexical representation described above consists of pronunciation networks for the words in the vocabulary. These networks may be combined into a single network that represents all possible sentences by connecting word end nodes with word start nodes that satisfy the inter-word pronunciation constraints.</Paragraph> <Paragraph position="1"> Local grammatical constraints may also be expressed in terms of allowable connections between words.</Paragraph> <Paragraph position="2"> The task of lexical decoding can be expressed as a search for the best match between a path in this lexical network and a path in the AP network. Currently, we use the Viterbi algorithm to search for this best scoring match. Since we cannot expect the phonetic network to always contain the appropriate phonetic sequence, the search algorithm allows for the insertion and deletion of phonetic segments with penalties that are based on the performance of the AP network on training data. The search algorithm is illustrated in Figure 4. The possible alignments of nodes in the lexical network to nodes in the phonetic network are represented by a matrix of node-pairs. A match between a path in the lexical network and a path in the phonetic network can be represented as a sequence of allowable links between these node-pairs. The allowable links fall into four categories: normal matches, insertions, deletions, and interword connections. Examples of each are shown in Figure 4. Link (a) is a normal match between an arc in the lexical network and an arc in the phonetic network. Link (b) is an example of an insertion of a phonetic segment (the path advances by a phonetic segment while staying at the same point in the lexical network). Link (c) is an example of an interword connection. Link (d) is an example of a deletion of a phonetic segment (the path contains a lexical arc without advancing in the phonetic network).</Paragraph> <Paragraph position="3"> The score for a match is the sum of the scores of the links in the match. This allows the search for the best path to proceed recursively, since the best score to arrive at a given node-pair is the maximum, over arriving links, of the link score plus the best score to arrive at the start of the link. Currently, the scores include a phonetic match component, an existence score based on the probability of the particular segmentation, a lexical weight associated with the likelihood of the pronunciation, and a duration score based on the phone duration statistics. The best match for the utterance is the best match that ends at terminal nodes of the lexical network and phonetic network.</Paragraph> </Section>
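The recursion described above can be sketched as a memoized search over node-pairs. This is an illustrative simplification, not the SUMMIT decoder: both networks are assumed to be acyclic with hashable node identifiers, the scoring functions (match_score, ins_penalty, del_penalty) stand in for the phonetic-match, existence, lexical-weight, and duration terms, and interword connections are assumed to have been folded into the combined lexical network as ordinary arcs.

# Minimal sketch of the node-pair best-match search (illustrative only).
from functools import lru_cache

def best_match(lex_arcs, ap_arcs, match_score, ins_penalty, del_penalty,
               lex_start, lex_end, ap_start, ap_end):
    """lex_arcs / ap_arcs: dict node -> list of (next_node, label) for acyclic networks.
    Returns the best total score for aligning a lexical path with an AP path."""
    NEG = float("-inf")

    @lru_cache(maxsize=None)
    def best(lex_node, ap_node):
        # Best score of any alignment that starts at this node-pair and reaches both end nodes.
        if lex_node == lex_end and ap_node == ap_end:
            return 0.0
        candidates = [NEG]
        # (a) normal match: advance one lexical arc and one phonetic segment together
        for lex_next, phoneme in lex_arcs.get(lex_node, []):
            for ap_next, segment in ap_arcs.get(ap_node, []):
                candidates.append(match_score(phoneme, segment) + best(lex_next, ap_next))
        # (b) insertion: consume a phonetic segment without moving in the lexical network
        for ap_next, segment in ap_arcs.get(ap_node, []):
            candidates.append(-ins_penalty(segment) + best(lex_node, ap_next))
        # (d) deletion: take a lexical arc without consuming a phonetic segment
        for lex_next, phoneme in lex_arcs.get(lex_node, []):
            candidates.append(-del_penalty(phoneme) + best(lex_next, ap_node))
        return max(candidates)

    return best(lex_start, ap_start)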
<Section position="7" start_page="184" end_page="185" type="metho"> <SectionTitle> PERFORMANCE EVALUATION PHONETIC RECOGNITION </SectionTitle> <Paragraph position="0"> The effectiveness of the acoustic-phonetic component has been reported elsewhere \[7,13\]. The performance of the segmentation algorithm was measured by first finding a path through the dendrogram that corresponds best to a time-aligned phonetic transcription, as illustrated by the path highlighted in white in Figure 2, and then tabulating the differences between these two descriptions. On 500 TIMIT \[10\] sentences spoken by 100 speakers, the algorithm deleted about 3.5% of the boundaries along the aligned path, while inserting an extra 5%. Analysis of the time difference between the boundaries found and those provided by the transcription shows that more than 70% of the boundaries were within 10 ms of each other, and more than 90% were within 20 ms.</Paragraph> <Paragraph position="1"> The phonetic classification results are evaluated by comparing the labels provided by the classifier to those in a time-aligned transcription. We have performed the evaluation on two separate databases, as summarized in Table 1. Performance was measured on a set of 38 context-independent phone labels. This particular set was selected because it has been used in other recent evaluations within the DARPA community. For a single speaker, the top-choice classification accuracy was 77%. The correct label is within the top three nearly 95% of the time. For multiple and unknown speakers, the top-choice accuracy is about 70%, and the correct choice is within the top three over 90% of the time. Figure 5 shows the rank order statistics for the speaker-independent case.</Paragraph> <Paragraph position="2"> The 38 context-independent phone labels comprise 14 vowels, 3 semivowels, 3 nasals, 8 fricatives, 2 affricates, 6 stops, 1 flap, and one label for silence.</Paragraph> </Section> <Section position="8" start_page="185" end_page="186" type="metho"> <SectionTitle> WORD RECOGNITION </SectionTitle> <Paragraph position="0"> The SUMMIT system was originally developed for the task of recognizing sentences from the TIMIT database. Over the past three months, we have ported the system to the DARPA 1000-word Resource Management (RM) task, and evaluated its recognition performance. The phoneme models were seeded from 1500 TIMIT sentences, and re-trained on the RM task using 40 sentences each from 72 designated training speakers \[1\]. The system was evaluated on two test sets, and under two conditions. The first test set, containing 10 sentences each from 15 speakers, is known as the '87 Test Set. The second test set, called the '89 Test Set, was recently released to the DARPA community, and it contains 30 sentences each from 10 speakers.</Paragraph> <Paragraph position="1"> Each test set was evaluated under both the all-word condition (i.e., no language model) and the word-pair condition, in which a designated language model with a perplexity of 60 is used.</Paragraph> <Paragraph position="2"> The results of our evaluation are summarized in Table 2. Note that this result is obtained by using 75 phoneme models, 32 of which are used to denote 16 stressed/unstressed vowel pairs. At the moment, our system does not explicitly make use of context-dependent models.</Paragraph> </Section> </Paper>