<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1011"> <Title>Modelling Context Dependency in Acoustic-Phonetic and Lexical Representations</Title> <Section position="3" start_page="0" end_page="71" type="metho"> <SectionTitle> SYSTEM OVERVIEW </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="71" type="sub_section"> <SectionTitle> Component Description </SectionTitle> <Paragraph position="0"> A block diagram of the SUMMIT system is shown in Figure 1. The acoustic processing consists of a model of the human peripheral auditory system as a front-end, a hierarchical segmentation algorithm that produces a network of possible acoustic segments, an automatically defined set of segmental measurements for each hypothesized segment, and finally, a statistical classifier that provides a probability of each label given a segment. The result of this analysis branch of the system is a network of possible phonetic interpretations of the speech signal. Each arc in the network has a list of probabilities of the labels used to represent the lexicon [13].</Paragraph> <Paragraph position="1"> The lexicon is also represented as a network, which is derived by applying a set of transformation rules to a set of baseform pronunciations of the words in the lexicon. These transformation rules are defined by hand and are intended to account for some known phonological effects such as flapping and gemination. The pronunciation networks for the individual words are combined into a single network allowing all possible word strings. Inter-word pronunciation rules and local grammatical constraints are taken into account when the words are combined into this network.</Paragraph> <Paragraph position="2"> Finding the highest scoring word sequence is accomplished by finding the best match between a path in the acoustic network and a path in the lexical network. The initial version of the system used a Viterbi search to find the single best match.</Paragraph> <Paragraph position="3"> More recently we have been using the A* N-best search described in [15] and [11] to find a list of top-scoring sentence hypotheses.</Paragraph> </Section> <Section position="2" start_page="71" end_page="71" type="sub_section"> <SectionTitle> Scoring Strategy </SectionTitle> <Paragraph position="0"> Since the overall score of a path consists of a number of components (acoustic model score, duration model score, segmentation score, and, in some cases, language model score), we must determine a way to combine them. If these were statistically independent probabilities of paths given the acoustics, we could simply combine them by multiplication. Unfortunately, it is unlikely that the component scores are statistically independent. Moreover, they are likely to be poor estimates of probabilities, both because of limited training data and because the models used by these components make incorrect assumptions about their probability distributions and about the statistical independence of the segments making up the path.</Paragraph> <Paragraph position="1"> In addition, we have the problem in a segment-based system that different paths contain different acoustic segments and therefore have different observation spaces [6]. We cannot simply compare probabilities of word sequences given acoustic observations, since the probabilities are computed using different observations. Normalizing the probabilities by the length of the segments helps to some degree (since all paths have the same duration), but then longer-duration segments have a greater influence on the path score than short segments.</Paragraph> <Paragraph position="2"> In the past, we have dealt with these problems by using a weighted linear combination of estimates of the log probabilities of the component scores, along with a segment-transition penalty and a word-transition penalty, as our overall path score.</Paragraph> <Paragraph position="3"> The component weights and transition penalties were obtained by optimizing performance on a portion of the training data.</Paragraph>
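To make the scoring scheme above concrete, the following is a minimal sketch (not the original SUMMIT code) of a weighted linear combination of component log scores with segment- and word-transition penalties; the component names, field names, and values are illustrative assumptions, and in the actual system the weights and penalties were tuned on held-out training data.

```python
# Minimal sketch of the path-scoring scheme described above.  Each
# hypothesized segment contributes a weighted sum of component log
# probabilities, and fixed penalties are added at segment and word
# transitions.  Component names and values are illustrative assumptions.

def path_score(path, weights, segment_penalty, word_penalty):
    """Overall score of one path through the acoustic/lexical networks.

    `path` is a list of segments, each a dict such as
        {"log_scores": {"acoustic": -41.2, "duration": -2.3}, "ends_word": True}
    """
    total = 0.0
    for segment in path:
        # Weighted linear combination of the component log probabilities.
        for component, log_prob in segment["log_scores"].items():
            total += weights[component] * log_prob
        total += segment_penalty          # charged at every segment transition
        if segment.get("ends_word"):
            total += word_penalty         # additional charge at word transitions
    return total
```

Higher (less negative) totals are better; under this scheme the highest-scoring path yields the recognizer's hypothesis.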
<Paragraph position="4"> Recently we have begun to use the N-best search mentioned previously to obtain the top N scoring paths. With these paths available, we can then use the individual component scores as input to a classifier that can be trained to discriminate between correct and incorrect paths.</Paragraph> <Paragraph position="5"> So far, we have been using a linear discriminant function as this classifier, but more complex classifiers can clearly be used. Treating this as a classification problem allows us to avoid making assumptions about the meaning of the component scores (other than the assumption that we would like them to help discriminate correct from incorrect paths).</Paragraph> <Paragraph position="6"> This new scoring strategy also permits us to apply, as a post-process, constraints that do not fit well into the initial search strategy. For example, we can make use of context-dependent models that consider the global utterance context in addition to the local context.</Paragraph> </Section> </Section> <Section position="4" start_page="71" end_page="74" type="metho"> <SectionTitle> RECOGNITION EXPERIMENTS </SectionTitle> <Paragraph position="0"> All the experiments described in this paper are performed on the 1,000-word Resource Management (RM) task [7]. In all cases, we have used the perplexity-60 word-pair language model. Except for the baseline system, we have used the now-standard 109-speaker training set. To facilitate a meaningful comparison, all the experiments were conducted using the February 1989 speaker-independent test set consisting of 300 utterances, 30 each from 10 different talkers. The experiments that we conducted are summarized in Figure 2 and will be described in this section. (Figure 2 caption: DG denotes a diagonal Gaussian classifier, whereas CD Tree and CN Tree denote context-dependent and context-normalized tree classifiers, respectively, as described in the text.)</Paragraph> <Section position="1" start_page="71" end_page="74" type="sub_section"> <SectionTitle> Lexical Models </SectionTitle> <Paragraph position="0"> In the initial version of SUMMIT reported in [13], each label used in the pronunciation of words in the lexicon is represented by a single diagonal Gaussian model. This procedure is illustrated by path (a) in Figure 2. The input to these models is a transformation of a set of segmental acoustic measurements, which were determined automatically using an optimization procedure in which the optimization criterion was a measure of phonetic discrimination performance [8]. These measurements are based on an entire segment and can therefore potentially take into account both the static and dynamic properties of the segment and its surroundings. The outputs of these measurements form a vector for each segment. This vector is transformed by a combination of linear discriminant functions and principal components analysis to allow for better modelling by the diagonal Gaussian models. The resulting vector has 52 dimensions.</Paragraph>
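As a concrete illustration of these context-independent acoustic models, the sketch below scores one 52-dimensional segment vector against per-label diagonal Gaussian models. It is a hedged sketch, not the paper's implementation: the label names and the way the model parameters are stored are assumptions.

```python
import numpy as np

# Sketch of per-label diagonal Gaussian scoring of one segment vector
# (assumed to be 52-dimensional after the linear discriminant / principal
# components transform).  Label names and storage layout are illustrative.

def log_gaussian_diag(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * float(np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def label_log_scores(x, models):
    """Return {label: log p(x | label)} for every lexical label.

    `models` maps each label to a (mean, variance) pair of 52-dim arrays,
    e.g. models["aa"] = (mean_vector, variance_vector).
    """
    return {label: log_gaussian_diag(x, mean, var)
            for label, (mean, var) in models.items()}
```

These per-label log scores correspond to the label probabilities attached to the arcs of the acoustic network, which are then combined by the path-scoring scheme sketched earlier.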
<Paragraph position="1"> This context-independent system achieved a word error rate of 13.6% on the RM task, as shown in the first row of Table 1. This baseline system was trained on the then-standard 72-speaker training set. By increasing the training data to 109 speakers and using the improved corrective training procedure described in [14] for training the pronunciation weights, we reduced the word error rate to 12.9%. This new context-independent result is shown in the second row, marked 109-TRAIN, of Table 1.</Paragraph> <Paragraph position="2"> The intention of using such simple models of the lexical labels was to serve both as a baseline for experiments with more complex models and to allow us to use a simple distortion measure as a criterion for selecting a set of context-dependent models. We have begun both sets of experiments and have been exploring the trade-offs between adding flexibility to our models (which generally requires more training data per model) and making use of more specific context-dependent models (which generally leaves less training data available per model).</Paragraph> <Paragraph position="3"> Our initial attempts at using more complex models have focused on mixtures of diagonal Gaussians, since this is a natural extension of our baseline system, and mixtures of Gaussians have been shown to be effective in other continuous-density speech-recognition systems [3]. This is illustrated by path (b) in Figure 2. Our mixtures are seeded with a VQ codebook generated with standard hierarchical procedures. A threshold is used to prune away mixture components with too few members. When we replaced the single Gaussian model for each label in system 109-TRAIN (cf. Table 1) by a mixture Gaussian model with a maximum of 16 mixture components per class, the error rate decreased from 12.9% to 10.3%. The detailed results can be seen in the row marked CI-MIXTURES (context-independent mixtures) of Table 1.</Paragraph> <Paragraph position="4"> Thus far we have kept the transformation of the original acoustic input dimensions intact when using these more flexible models. There are some indications that this transformation may not be necessary, and in fact its elimination may lead to better performance. In addition, we have been experimenting with the use of distinctive features as an intermediate representation [5]. Distinctive features may turn out to be a better representation in which to account for factors such as context, speaker, and dialect effects. Many researchers have found that the use of context-dependent models can lead to an increase in word recognition performance [10,4]. We have been concerned not only with context-dependent modelling but also with the more general problem of lexical representation. The choice of lexical representation involves not only the choice of an inventory of units (such as context-independent or context-dependent models) but also the structure of the pronunciation networks. Many systems currently make use of a rather complex set of units, but then rely on only a single pronunciation path for each word in the lexicon. Although context-dependent models can account for some of the variability due to context, altering the structure of the pronunciation networks may be a more natural way to account for phonological effects such as flapping and gemination, as well as certain types of inter-speaker variability due to dialect differences. Since we are interested in this more general problem of lexical representation, it has been our goal to find a mechanism to automatically define both an inventory of lexical units and a set of pronunciation networks for a given lexicon. We have been treating this as an optimization problem where the goal is to find a set of transformation rules that, when applied to a set of baseform pronunciations, results in a lexical network that optimizes some measure of recognizer performance.</Paragraph>
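To illustrate what a hand-written transformation rule might look like when applied to a baseform pronunciation, here is a hedged sketch of a flapping rule that adds an alternate arc to a small pronunciation network. The arc representation, the phone labels ("t", "dx"), and the vowel set are hypothetical choices made for the example, not the rule formalism actually used in SUMMIT.

```python
# Hypothetical sketch: applying a flapping rule to a baseform pronunciation
# to produce a pronunciation network with an alternate path.  Phone labels,
# the vowel set, and the arc representation are illustrative assumptions.

VOWELS = {"aa", "ae", "ah", "ax", "eh", "er", "ey", "ih", "iy"}

def baseform_to_arcs(phones):
    """Turn a linear baseform into arcs (from_state, to_state, label)."""
    return [(i, i + 1, p) for i, p in enumerate(phones)]

def apply_flapping(phones):
    """Add an alternate flap arc wherever /t/ appears between two vowels."""
    arcs = baseform_to_arcs(phones)
    for i, p in enumerate(phones):
        left_is_vowel = i > 0 and phones[i - 1] in VOWELS
        right_is_vowel = i + 1 < len(phones) and phones[i + 1] in VOWELS
        if p == "t" and left_is_vowel and right_is_vowel:
            arcs.append((i, i + 1, "dx"))   # parallel arc: flapped /t/
    return arcs

# "data" /d ey t ax/: both the /t/ arc and a parallel flap arc are kept,
# so the word network now allows two pronunciations.
print(apply_flapping(["d", "ey", "t", "ax"]))
```

A label-alteration rule of the kind used in the experiments below would instead rewrite the arc's label in context rather than add a parallel arc.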
<Paragraph position="5"> These transformation rules can alter the labels on the arcs in the network (resulting in context-dependent units) and can also alter the structure of the networks (resulting in networks of alternate pronunciations). The rules are able to take into account a variety of contextual factors, including local contexts (e.g., whether the left label is a stressed vowel or whether the right label is a /t/) as well as global contexts (e.g., whether the segment is in the last syllable of the sentence). For the experiments reported in this paper, we have limited the optimization to rules that alter only the labels on the arcs, in order to compare with the performance increases achieved by other researchers using only context-dependent modelling.</Paragraph> <Paragraph position="6"> When applying only label-alteration rules, the optimization procedure that we use is basically a top-down tree-growing procedure similar to that used by other researchers [1,2,9]. We start with all samples of a given class in the top node of the tree and then, in each iteration, try splitting each leaf node in the tree with each of the available contextual factors (such as whether the left label is a stressed vowel), keeping the split that maximizes the criterion over all leaf nodes of the tree. We only allow splits that result in nodes with at least some minimum number of samples in each node. The resulting leaf nodes define the set of context-dependent models. In our case, we would like to use a splitting criterion that is related to the overall recognition performance (since we are trying to obtain the set of context-dependent labels that maximizes recognizer performance). So far we have only experimented with the total squared distance from the mean for the resulting lexical models.</Paragraph> <Paragraph position="7"> Currently, we are using the following contextual functions for the splits in the context trees:</Paragraph> <Paragraph position="9"> where class refers to one of a number of categories that we have defined by hand. So far, we have defined 64 categories for the left and right labels. These categories include classes based on broad categories, stress, and distinctive features.</Paragraph> <Paragraph position="10"> Examples of categories include front-vowel, nasal, stressed vowel, etc. The LEFT-WB() and RIGHT-WB() functions return TRUE or FALSE depending on whether the segment in question is at a left or right word boundary.</Paragraph> <Paragraph position="11"> If we grow a tree using these contextual factors, with a minimum of 50 samples per leaf node as a stopping criterion, we are able to reduce the squared error in the resulting models by approximately 30%.</Paragraph>
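A minimal sketch of this top-down tree-growing procedure is given below, assuming the samples for one class are (feature vector, context) pairs and that the contextual factors are supplied as named predicates standing in for the class-membership and word-boundary functions above; the data layout and the stopping constant are illustrative.

```python
import numpy as np

# Sketch of the top-down context-tree growing procedure described above,
# for the samples of a single class.  Each sample is a (feature_vector,
# context) pair; `factors` is a list of (name, predicate) pairs standing in
# for the contextual functions (class membership of the left/right label,
# word-boundary flags, ...).  Names and layout are illustrative assumptions.

def total_squared_distance(samples):
    """Total squared distance of the samples' feature vectors from their mean."""
    X = np.array([x for x, _ in samples], dtype=float)
    return float(((X - X.mean(axis=0)) ** 2).sum())

def grow_tree(samples, factors, min_samples=50):
    """Greedily split leaves as long as some allowed split reduces the error.

    Returns a list of leaves; each leaf is (list_of_factor_names, samples)
    and defines one context-dependent model.
    """
    leaves = [([], samples)]
    while True:
        best = None   # (error_reduction, leaf_index, factor_name, yes, no)
        for i, (path, leaf) in enumerate(leaves):
            for name, predicate in factors:
                yes = [s for s in leaf if predicate(s[1])]
                no = [s for s in leaf if not predicate(s[1])]
                if len(yes) < min_samples or len(no) < min_samples:
                    continue   # disallow splits that leave a node too small
                gain = (total_squared_distance(leaf)
                        - total_squared_distance(yes)
                        - total_squared_distance(no))
                if best is None or gain > best[0]:
                    best = (gain, i, name, yes, no)
        if best is None or best[0] <= 0.0:
            return leaves
        gain, i, name, yes, no = best
        path, _ = leaves.pop(i)
        leaves.append((path + [name], yes))
        leaves.append((path + ["not " + name], no))
```

Because the squared-distance criterion is computed per class, one such tree is grown for each lexical label, and the leaves of all the trees together form the inventory of context-dependent models (about 1,300 in the CD-TREE experiment below).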
<Paragraph> Using single diagonal Gaussian models in each of the leaf nodes of the tree, we compute a context-dependent model score for each of the N-best paths obtained from the context-independent recognition system. This is illustrated by path (c) in Figure 2. Since we are currently using only local constraints in the context-dependent models, we could have incorporated the models into the initial search.</Paragraph> <Paragraph position="12"> Applying the context-dependent models to the N-best paths saves computation for the current experiments but, more importantly, allows us to begin to incorporate more global constraints without changing the experimental paradigm. Using these models as another input to the discrimination classifier discussed above to reorder the N-best paths, we obtain a word error rate of 10.1%. The detailed results are shown in Table 1 in the row marked CD-TREE. In this experiment, we are using a total of 1,300 context-dependent models (this number is obtained by counting the number of leaf nodes in all of the contextual trees). The average number of leaf nodes per contextual tree is approximately 17.</Paragraph> <SectionTitle> Context Normalized Inputs </SectionTitle> <Paragraph position="13"> We have also experimented with accounting for contextual effects separately for each of the model's input dimensions. That is, rather than growing a single contextual tree for each label, we grow a separate tree for each input dimension. This allows for a more detailed accounting of contextual effects, since different input dimensions are likely to be affected differently by the context. In addition, it also alleviates the dimension-scaling problem in the distance metric for the distortion criterion. When growing a single contextual tree for a label, our distortion measure must take into account the distortion in all the dimensions at once, so the scaling of the input dimensions will affect the results. This problem disappears if we consider the distortion one dimension at a time. On the other hand, if context somehow affects the relationship among the input dimensions, we could perhaps take that into account in the single contextual tree but not in the separate per-dimension trees.</Paragraph> <Paragraph position="14"> Since diagonal Gaussian models treat each input dimension separately, we can compute statistics for each dimension based on the contextual tree for that dimension. This is illustrated by path (d) in Figure 2. Using these scores as an additional component in the reordering of the N-best paths gives us a word error rate of 8.5%. The detailed results are shown in the row marked CN-TREE (context-normalized tree) of Table 1. Since we have a different contextual tree for each dimension, we can no longer come up with a meaningful count of the number of context-dependent models. However, if we count the leaf nodes of each contextual tree, we find that we are using an average of 6.8 contexts per input dimension for each class.</Paragraph>
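The reordering step that combines these component scores can be sketched as follows. This is a hedged illustration: the use of a linear discriminant over per-path component scores follows the description above, but the component names, field names, and the way the weights are obtained are assumptions.

```python
import numpy as np

# Sketch of reordering an N-best list with a linear discriminant over the
# per-path component scores (including the context-dependent or
# context-normalized tree scores used as additional inputs above).
# Component names, field names, and weights are illustrative assumptions.

COMPONENTS = ["acoustic", "duration", "segmentation", "cd_tree", "cn_tree"]

def rerank_nbest(nbest, weights, bias=0.0):
    """Return the hypotheses sorted by discriminant score, best first.

    `nbest` is a list of dicts such as
        {"words": "show all alerts", "scores": {"acoustic": -812.4, ...}}
    `weights` is a vector of per-component weights trained to separate
    correct from incorrect paths.
    """
    def discriminant(hypothesis):
        x = np.array([hypothesis["scores"][c] for c in COMPONENTS])
        return float(np.dot(weights, x)) + bias

    return sorted(nbest, key=discriminant, reverse=True)
```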
<Paragraph position="15"> Since we have found performance increases both by increasing the flexibility of the models (by using mixture Gaussian models) and by using more specific models (by having separate models depending on context), we wondered whether even better results could be obtained by combining the two. Unfortunately, this turns out not to be the case, because of the conflicting requirements of the two modelling procedures. More flexible models tend to require a larger number of training samples to obtain good performance, while using more specific models leaves a smaller portion of the training data for each model. For example, when we replaced the single diagonal Gaussian models with mixture Gaussian models in the leaf nodes of the CD-TREE experiment discussed above, we found no increase in performance. Even when we varied the stopping criterion of the tree-splitting procedure (thus controlling the number of training samples available to the mixture Gaussian models), we were not able to obtain any significant increase in performance.</Paragraph> <Paragraph position="16"> Rather than using the contextual trees to define more specific models, we can use this contextual information to adjust the input dimensions for the effects of the context. This procedure permits us to once again train the models using all of the available training data. Specifically, we grow separate contextual trees for each input dimension as discussed above.</Paragraph> <Paragraph position="17"> Then, rather than using the means and variances to train a Gaussian model for each leaf node, we use only the difference between the mean of the leaf node and the mean of the overall class as an adjustment to the input vector, to account for the contextual effects on samples falling into that leaf node. This of course assumes that we can treat the input dimensions separately when accounting for context (because we are using separate contextual trees for each dimension). It also assumes that contextual effects only cause a shift in the observed input dimensions (and no change in the shape of the distribution of each input dimension). Note that using single diagonal Gaussian models on the resulting context-normalized input vectors is equivalent to using single diagonal Gaussian models in the leaf nodes of the separate per-dimension contextual trees, with the variances tied across all of the leaf nodes for a given input dimension for a given label.</Paragraph> <Paragraph position="18"> Using context-normalized input dimensions (rather than context-specific models) allows us to use all of the training data for the models for each class. When we replaced the single diagonal Gaussian models with mixture Gaussian models, illustrated by path (e) in Figure 2, we obtained a word error rate of 7%. This represents the best performance that we have been able to achieve thus far, reducing the error rate of the baseline system by nearly one-half. The detailed scores for this experiment can be seen in the row labeled CN-MIXTURES (context-normalized mixtures) of Table 1.</Paragraph> <Paragraph position="19"> (Table 1 caption: Results of the experiments described in the paper. The columns indicate the percentage of words correct; the percentages of substitutions, deletions, and insertions; the percentage word error (Sub + Del + Ins); and the percentage sentence error. The systems include the baseline system, the baseline system trained on the 109-speaker training set, the context-independent mixture Gaussian system, the system using context-dependent trees, the system using context-normalization trees for each input dimension, and finally the system using context-normalization trees along with mixture Gaussian models.)</Paragraph> </Section> </Section> </Paper>
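As a final illustration of the CN-MIXTURES idea, the sketch below first removes the per-dimension contextual shift (the difference between the leaf-node mean and the overall class mean) from a segment's input vector and then scores the normalized vector with a mixture of diagonal Gaussians. The tree-lookup interface, data structures, and mixture layout are assumptions made for the example.

```python
import numpy as np

# Sketch of context normalization followed by mixture-Gaussian scoring, in
# the spirit of the CN-MIXTURES system described above.  `leaf_lookup[d]`
# is assumed to descend dimension d's contextual tree and return a leaf id;
# `leaf_offsets[d][leaf]` is assumed to hold (leaf mean - class mean) for
# that dimension.  All names and layouts are illustrative.

def context_normalize(x, context, leaf_lookup, leaf_offsets):
    """Remove the per-dimension contextual shift from the input vector x."""
    x = np.array(x, dtype=float)
    for d in range(len(x)):
        leaf = leaf_lookup[d](context)      # leaf reached in dimension d's tree
        x[d] -= leaf_offsets[d][leaf]       # subtract (leaf mean - class mean)
    return x

def log_mixture_diag(x, weights, means, variances):
    """Log likelihood of x under a mixture of diagonal Gaussians."""
    log_components = [
        np.log(w) - 0.5 * np.sum(np.log(2.0 * np.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    return float(np.logaddexp.reduce(log_components))
```

Because the normalization only shifts each dimension, all of a class's training data can be pooled to estimate its mixture parameters, which is the property that made the combination with mixture models effective here.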