<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1018">
  <Title>Search Algorithms for Software-Only Real-Time Recognition with Very Large Vocabularies</Title>
  <Section position="3" start_page="91" end_page="91" type="metho">
    <SectionTitle>
2. Previous Algorithms
</SectionTitle>
    <Paragraph position="0"> The two most commonly used algorithms for speech recognition search are the time-synchronous beam search [4] and the best-first stack search [5]. (We do not consider "island-driven" searches here, since they have not been shown to be effective.)</Paragraph>
  </Section>
  <Section position="4" start_page="91" end_page="92" type="metho">
    <SectionTitle>
2.1. Time-Synchronous Search
</SectionTitle>
    <Paragraph position="0"> In the time-synchronous Viterbi beam search, all the states of the model are updated in lock step, frame by frame, as the speech is processed. The computation required for this simple method is proportional to the product of the number of states in the model and the number of frames in the input. If we discard any state whose score is far below the highest score in that frame, we can reduce the computation by a large factor.</Paragraph>
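As a concrete illustration of the frame-by-frame update and the pruning step, the following Python sketch implements one version of a beam-pruned Viterbi update. The representation (a transitions dict, a log_emission callback, a log-domain beam width, and treating every state as a legal start) is our own choice for exposition, not the paper's implementation:

import math

def viterbi_beam_search(frames, states, transitions, log_emission, beam=10.0):
    """Time-synchronous Viterbi beam search (illustrative sketch).

    frames       -- iterable of acoustic observation vectors
    states       -- list of HMM state ids
    transitions  -- dict: state -> list of (next_state, log transition prob)
    log_emission -- function (state, frame) -> log emission probability
    beam         -- log-domain pruning width below the best score per frame
    """
    scores = {s: 0.0 for s in states}   # log score of the best path into each state
    for frame in frames:
        new_scores = {}
        for s, score in scores.items():
            for nxt, log_p in transitions[s]:
                cand = score + log_p + log_emission(nxt, frame)
                # Viterbi: keep only the maximum-scoring path into each state.
                if cand > new_scores.get(nxt, -math.inf):
                    new_scores[nxt] = cand
        # Beam pruning: discard any state far below the frame's best score.
        best = max(new_scores.values())
        scores = {s: v for s, v in new_scores.items() if v > best - beam}
    return scores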
    <Paragraph position="1"> There are two important advantages of a time-synchronous search. First, because the computation proceeds in step with the incoming speech, it can be finished at essentially the same time that the speech is finished, which is a requirement for real-time operation. Second, since all of the hypotheses span exactly the same portion of the input, their scores can be compared directly, and most hypotheses can be discarded. This technique is called the beam search. Even though the beam search is not theoretically admissible, it is very easy to make it arbitrarily close to optimal simply by increasing the size of the beam, and its computational cost is fairly well-behaved, varying only slightly with the quality of the speech.</Paragraph>
    <Paragraph position="2"> One minor disadvantage of the Viterbi search is that it finds the state sequence with the highest probability rather than the word sequence with the highest probability. This is only a minor disadvantage because the most likely state sequence has been shown empirically to be highly correlated with the most likely word sequence. (We have shown in [6] that a slight modification to the Viterbi computation removes this problem, albeit with a slight approximation: when two paths come to the same state at the same time, we add the probabilities instead of taking the maximum.) A much more serious problem with the time-synchronous search is that it must follow a very large number of theories in parallel even though only one of them will end up scoring best. This can be viewed as wasted computation.</Paragraph>
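In symbols (our notation, not the paper's), the standard Viterbi recursion and this modification differ only in the operator applied where paths merge at a state:

    \alpha_t(s) = b_s(x_t) \, \max_{s'} \alpha_{t-1}(s') \, a_{s's}    (Viterbi: most likely state sequence)
    \alpha_t(s) = b_s(x_t) \, \sum_{s'} \alpha_{t-1}(s') \, a_{s's}    (modified: merging paths are summed)

where a_{s's} is the transition probability from state s' to s and b_s(x_t) is the emission probability of frame x_t in state s.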
    <Paragraph position="3"> We get little benefit from using a fast match algorithm with the time-synchronous search, because we consider starting all possible words at each frame. The fast match would therefore have to run at every frame, which is affordable only for the very cheapest fast match algorithms.</Paragraph>
    <Section position="1" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
2.2. Best-First Stack Search
</SectionTitle>
      <Paragraph position="0"> The true best-first search keeps a sorted stack of the highest scoring hypotheses. At each iteration, the hypothesis with the highest score is advanced by all possible next words, which results in more hypotheses on the stack. The best-first search has the advantage that it can theoretically minimize the number of hypotheses considered if there is a good function to predict which theory to follow next. In addition, it can take very good advantage of a fast match algorithm at the point where it advances the best hypothesis.</Paragraph>
      <Paragraph position="1"> The main disadvantage is that there is no guarantee as to when the algorithm will finish, since it may keep backing up to shorter theories when it hits a part of the speech that doesn't match well. In addition, it is very hard to compare theories of different lengths.</Paragraph>
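A minimal sketch of the stack machinery, assuming hypothetical expand, fast_match, and is_complete callbacks (none of these names come from the paper):

import heapq
import itertools

def best_first_search(start_hyp, expand, fast_match, is_complete, max_pops=100000):
    """Best-first stack search (illustrative sketch); scores are log probabilities."""
    tick = itertools.count()   # tie-breaker so the heap never compares hypotheses
    # heapq is a min-heap, so scores are negated to pop the best hypothesis first.
    stack = [(0.0, next(tick), start_hyp)]
    for _ in range(max_pops):
        if not stack:
            break
        neg_score, _, hyp = heapq.heappop(stack)
        if is_complete(hyp):
            return -neg_score, hyp
        # The fast match proposes a short list of likely next words, so only
        # a few expensive detailed matches are computed per iteration.
        for word in fast_match(hyp):
            new_score, new_hyp = expand(hyp, word)
            heapq.heappush(stack, (-new_score, next(tick), new_hyp))
    return None   # no termination guarantee: the search may exhaust max_pops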
    </Section>
    <Section position="2" start_page="91" end_page="92" type="sub_section">
      <SectionTitle>
2.3. Pseudo Time-Synchronous Stack Search
</SectionTitle>
      <Paragraph position="0"> A compromise between the strict time-synchronous search and the best-first stack search can be called the Pseudo Time-Synchronous Stack Search. In this search, the shortest hypothesis (i.e., the one that ends earliest in the signal) is updated first. Thus, all of the active hypotheses are within a short time delay of the end of the speech signal. To keep the algorithm from requiring exponential time, beam-type pruning is applied to all of the hypotheses that end at the same time. Since this method advances one hypothesis at a time, it can take advantage of a powerful fast match algorithm. In addition, it is possible to use a higher-order language model without the computation growing with the number of states in the language model.</Paragraph>
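The compromise can be sketched as follows; the hypothesis representation, the expand and fast_match callbacks, and the beam width are all illustrative assumptions, not the paper's code:

import heapq
import itertools
import math
from collections import defaultdict

def pseudo_time_synchronous_search(n_frames, expand, fast_match, beam=10.0):
    """Pseudo time-synchronous stack search (illustrative sketch)."""
    tick = itertools.count()                   # heap tie-breaker
    heap = [(0, next(tick), 0.0, ())]          # (end frame, tick, log score, words)
    best_at = defaultdict(lambda: -math.inf)   # best score seen per end frame
    complete = []
    while heap:
        end, _, score, words = heapq.heappop(heap)   # shortest hypothesis first
        # Beam-type pruning among hypotheses that end at the same frame.
        if best_at[end] - score > beam:
            continue
        if end >= n_frames:
            complete.append((score, words))
            continue
        # One hypothesis is advanced at a time, so a powerful fast match
        # can be afforded to propose the next words.
        for word in fast_match(words, end):
            new_end, new_score = expand(words, word, end)
            best_at[new_end] = max(best_at[new_end], new_score)
            heapq.heappush(heap, (new_end, next(tick), new_score, words + (word,)))
    return max(complete) if complete else None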
    </Section>
    <Section position="3" start_page="92" end_page="92" type="sub_section">
      <SectionTitle>
2.4. N-best Paradigm
</SectionTitle>
      <Paragraph position="0"> The N-best Paradigm was introduced in 1989 as a way to integrate speech recognition with natural language processing. Since then, we have found it to be useful for applying the more expensive speech knowledge sources as well, such as cross-word models, tied-mixture densities, and trigram language models. We also use it for parameter and weight optimization. The N-best Paradigm is a type of fast match at the sentence level: it reduces the search space to a short list of likely whole-sentence hypotheses.</Paragraph>
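Viewed as code, the paradigm is just a two-stage pipeline: a cheaper first pass produces a short list, and arbitrarily expensive knowledge sources rescore it. The function names and the weighted-sum combination below are illustrative assumptions, not the paper's implementation:

def nbest_rescore(nbest, knowledge_sources, weights):
    """Rescore an N-best list with expensive knowledge sources (sketch).

    nbest             -- list of (first-pass log score, word sequence)
    knowledge_sources -- list of functions: word sequence -> log score
    weights           -- one combination weight per knowledge source
    """
    rescored = []
    for base, words in nbest:
        total = base + sum(w * ks(words)
                           for w, ks in zip(weights, knowledge_sources))
        rescored.append((total, words))
    # The hypothesis list is short, so even costly models (cross-word
    # acoustics, trigram LMs) are applied only N times per utterance.
    return sorted(rescored, reverse=True)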
      <Paragraph position="1"> The Exact N-best Algorithm [1] has the side benefit that it is also the only algorithm that guarantees finding the most likely sequence of words. Theoretically, the computation required for this algorithm cannot be proven to be less than exponential in the length of the utterance. However, this worst case arises only when all the models of all of the phonemes and words are identical (which would present a more serious problem than large computation). In practice, we find that the computation required can be made proportional to the number of hypotheses desired by the use of techniques similar to the beam search.</Paragraph>
      <Paragraph position="2"> Since the development of the exact algorithm, several approximations have been developed that are much faster, with varying degrees of accuracy [2, 3, 7, 8]. The most recent algorithm [9] empirically retains the accuracy of the exact algorithm, while requiring little more computation than a simple 1-best search.</Paragraph>
      <Paragraph position="3"> The N-best Paradigm has the potential problem that if a knowledge source is not used to find the N-best hypotheses, the answer that would ultimately have the highest score including this knowledge source may be missing from the top N hypotheses. This becomes more likely as the error rate becomes higher and the utterances become longer. We have found empirically that this problem does not occur for smaller vocabularies, but it does occur when we use vocabularies of 20,000 words and trigram language models in the rescoring pass.</Paragraph>
      <Paragraph position="4"> This problem can be avoided by keeping the lattice of all sentence hypotheses generated by the algorithm, rather than enumerating independent sentence hypotheses. The lattice is then treated as a grammar and used to rescore all the hypotheses with the more powerful knowledge sources [10].</Paragraph>
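A minimal sketch of such lattice rescoring, assuming a lattice stored as arcs carrying acoustic log scores and, for brevity, a bigram language model (the same dynamic program extends to trigrams by keeping two words of history; all names are ours):

import math

def rescore_lattice(lattice, start, finals, lm_logprob):
    """Treat a word lattice as a grammar and rescore it (illustrative sketch).

    lattice    -- dict: node -> list of (word, next_node, acoustic log score)
    lm_logprob -- function (prev_word, word) -> bigram log probability
    """
    # Dynamic programming over (node, previous word) pairs, so the LM is
    # applied to every path through the lattice, not just N of them.
    best = {(start, "BOS"): (0.0, ())}
    agenda = [(start, "BOS")]
    while agenda:
        node, prev = agenda.pop()
        score, words = best[(node, prev)]
        for word, nxt, acoustic in lattice.get(node, []):
            cand = score + acoustic + lm_logprob(prev, word)
            key = (nxt, word)
            if cand > best.get(key, (-math.inf,))[0]:
                best[key] = (cand, words + (word,))
                agenda.append(key)
    return max((v for (n, _), v in best.items() if n in finals), default=None)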
    </Section>
    <Section position="4" start_page="92" end_page="92" type="sub_section">
      <SectionTitle>
2.5. Forward-Backward Search Paradigm
</SectionTitle>
      <Paragraph position="0"> The Forward-Backward Search algorithm is a general paradigm in which we use some inexpensive approximate time-synchronous search in the forward direction to speed up a more complex search in the backward direction. This algorithm generally results in two orders of magnitude speedup for the backward pass. Since it was the key mechanism that made it possible to perform recognition with a 20,000-word vocabulary in real time, we discuss it in more detail in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="92" end_page="93" type="metho">
    <SectionTitle>
3. The Forward-Backward Search Algorithm
</SectionTitle>
    <Paragraph position="0"> We developed the Forward-Backward Search (FBS) algorithm in 1986 as a way to greatly reduce the computation needed to search a large language model. While many sites have adopted this paradigm for computing the N-best sentence hypotheses, we feel that its full potential may not be widely understood. Therefore, we will discuss the use of the FBS at some length in this section.</Paragraph>
    <Paragraph position="1"> The basic idea in the FBS is to perform a search in the forward direction to compute the probability of each word ending at each frame. Then, a second, more expensive search in the backward direction can use these word-ending scores to speed up the computation immensely. If we multiply the forward score for a path by the backward score of another path ending at the same frame, we have an estimate of the total score for the combined path, given the entire utterance. In a sense, the forward search provides the ideal fast match for the backward pass, in that it gives a good estimate of the score for each of the words that can follow in the backward direction, including the effect of all of the remaining speech. When we first introduced the FBS to speed up the N-best search algorithm, the models used in the forward and backward directions were identical, so the estimates of the backward scores provided by the forward pass were exact. This method has also been used in a best-first stack search [8], in which it is very effective, since the forward-backward score for any theory covers the whole utterance. The forward-backward score solves the primary problem with the best-first search, which is that different hypotheses don't span the same amount of speech.</Paragraph>
    <Paragraph position="2"> However, the true power of this algorithm is revealed when we use different models in the forward and backward directions. For example, in the forward direction we can use approximate acoustic models with a bigram language model.</Paragraph>
    <Paragraph position="3"> Then, in the backward pass we can use detailed HMM models with a trigram language model. In this case, the forward scores still provide an excellent (although not exact) estimate of the ranking of different word end scores. Because both searches are time-synchronous, it does not matter that the forward and backward passes do not get the same score.</Paragraph>
    <Paragraph position="4"> (This is in contrast to a backward best-first or A* search, which depends on the forward scores being an accurate prediction of the actual scores that will result in the backward pass.) In order to use these approximate scores, we need to modify the algorithm slightly. The forward scores are normalized relative to the highest forward score at that frame. (This happens automatically in the BYBLOS decoder, since we normalize the scores in each frame in order to prevent underflow.) We multiply the normalized forward score by the normalized backward score to produce a normalized forward-backward score. We can compare these normalized forward-backward scores to the normalized backward scores using the usual beam-type threshold. This causes us to consider more than one path in the backward direction. The best path (word sequence) associated with each word end may not turn out to be the highest, but this does not matter, because the backward search will rescore all the allowed paths anyway.</Paragraph>
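The pruning rule described above reduces, in the log domain, to adding normalized scores and applying a beam threshold. A sketch in our own formulation (the score layout is an assumption, not BYBLOS code):

def prune_word_ends(forward_scores, backward_score, beam=10.0):
    """Beam pruning with normalized forward-backward scores (sketch).

    forward_scores -- dict: word -> forward log score of paths ending in
                      that word at this frame, normalized to the frame's best
    backward_score -- normalized log score of the current backward theory
    beam           -- log-domain beam width

    Returns the word ends worth extending in the backward direction.
    """
    # Normalized forward-backward score = forward + backward (log domain);
    # it estimates the score of the combined path over the whole utterance.
    fb = {w: f + backward_score for w, f in forward_scores.items()}
    best = max(fb.values())
    return [w for w, s in fb.items() if s > best - beam]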
    <Paragraph position="5"> We find that the backward pass can run about 1,000 times faster than it would otherwise, with the same accuracy. For example, when using a vocabulary of 20,000 words, a typical beam search that allows only a small pruning error rate requires about 20 times real time. In contrast, we find that the backward pass runs at about 1/60 real time! This makes it fast enough that it can be performed at the end of the utterance with a delay that is barely noticeable.</Paragraph>
    <Paragraph position="6"> But the FBS also speeds up the forward pass indirectly! Since we know there will be a detailed backward search, we need not worry as much about the accuracy of the forward pass. This gives us the freedom to use powerful approximate methods to speed up the forward pass, even though they may not be as accurate as we would like for a final score.</Paragraph>
  </Section>
  <Section position="6" start_page="93" end_page="94" type="metho">
    <SectionTitle>
4. Sublinear Computation
</SectionTitle>
    <Paragraph position="0"> Fast match methods require much less computation for each word than a detailed match. But to reduce the computation for speech recognition significantly for very large vocabulary problems, we must change the computation from one that is linear with the vocabulary to one that is essentially independent of the vocabulary size.</Paragraph>
    <Section position="1" start_page="93" end_page="93" type="sub_section">
      <SectionTitle>
4.1. Memory vs. Speed Tradeoffs
</SectionTitle>
      <Paragraph position="0"> One of the classical methods for saving computation is to trade increased memory for reduced computation. Now that memory is becoming large and inexpensive, several such methods are open to us. The most obvious is some form of fast match. We propose one such memory-intensive fast match algorithm here; many others could be developed.</Paragraph>
      <Paragraph position="1"> Given an unknown word, we can make several orthogonal measurements on the word to represent its acoustic realization as a single point in a multi-dimensional space. If we quantize each dimension independently, we determine a single (quantized) cell in this space. We can associate with this cell a precomputed estimate of the HMM score of each word. The computation at recognition time is performed only once, and is therefore very small and independent of the size of the vocabulary. (Of course, precompiling the scores of each of the words for every cell in the space can be expensive.) The precision of the fast match score is limited only by the amount of memory that we have and our ability to represent the scores efficiently.</Paragraph>
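A sketch of the lookup under these assumptions (per-dimension quantization boundaries and a precomputed cell-to-scores table; all names are ours, not the paper's):

import bisect

def fast_match_lookup(features, boundaries, table):
    """Memory-intensive fast match by table lookup (illustrative sketch).

    features   -- tuple of scalar measurements made on the unknown word
    boundaries -- per-dimension sorted quantization boundaries
    table      -- dict: quantized cell -> {word: precomputed score estimate}
    """
    # One binary search per dimension identifies the quantized cell;
    # the lookup cost is independent of the vocabulary size.
    cell = tuple(bisect.bisect(b, x) for b, x in zip(boundaries, features))
    return table.get(cell, {})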
    </Section>
    <Section position="2" start_page="93" end_page="94" type="sub_section">
      <SectionTitle>
4.2. Computation vs. Vocabulary Size
</SectionTitle>
      <Paragraph position="0"> To learn how the computation of our real-time search algorithm grows with vocabulary size, we measured the computation required at three different vocabulary sizes: 1,500 words, 5,000 words, and 20,000 words. The time required, as a fraction of real time, is plotted against the vocabulary size in Figure 1.</Paragraph>
      <Paragraph position="1"> As can be seen, the computation increases very slowly with increased vocabulary. To understand the behavior better, we plotted the same numbers on a log-log scale. There the three points fall neatly on a straight line, leading us to the conclusion that the computation grows as a power of the vocabulary size V. Solving for the line gives us the formula time = 0.04 V^(1/3) (1). This is very encouraging, since it means that if we can decrease the computation needed by a small factor, it would be feasible to increase the vocabulary size by a much larger factor, making recognition with extremely large vocabularies possible.</Paragraph>
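The fit itself is elementary: a straight line in log-log space is a power law. The measurement values below are illustrative only (chosen to be consistent with Equation 1; the paper does not list the exact numbers):

import math

# Hypothetical (vocabulary size, fraction of real time) measurements.
points = [(1500, 0.46), (5000, 0.68), (20000, 1.09)]

# A straight line in log-log space, log t = log a + b log V, means
# t = a * V**b.  Fit the slope and intercept from the end points.
(v0, t0), (v1, t1) = points[0], points[-1]
b = (math.log(t1) - math.log(t0)) / (math.log(v1) - math.log(v0))
a = t0 / v0 ** b
print(f"time ~= {a:.3f} * V^{b:.2f}")   # roughly 0.04 * V^(1/3)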
    </Section>
  </Section>
  <Section position="7" start_page="94" end_page="94" type="metho">
    <SectionTitle>
5. Summary
</SectionTitle>
    <Paragraph position="0"> We have discussed the search problem in speech recognition and concluded that, in our opinion, it is no longer worth considering parallel or special-purpose hardware for the speech problem, because we have been able to make faster progress by modifying the basic search algorithm in software. At present, the fastest recognition systems are based entirely on software implementations. We reviewed several search algorithms briefly, and discussed the advantage of time-synchronous search algorithms over other basic strategies.</Paragraph>
    <Paragraph position="1"> The Forward-Backward Search algorithm has turned out to be an algorithm of major importance in that it has made possible the first real-time recognition of 20,000-word vocabularies in continuous speech. Finally, we demonstrated that the computation required by this algorithm grows as the cube root of the vocabulary size, which means that real-time recognition with extremely large vocabularies is feasible.</Paragraph>
  </Section>
</Paper>