<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1003">
  <Title>Efficient, High-Performance Algorithms for N-Best Search</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In a Spoken Language System (SLS) we must use all available knowledge sources (KSs) to decide on the spoken sentence. While there are many knowledge sources, they are often grouped together into speech models, statistical language model, and natural language understanding models.</Paragraph>
    <Paragraph position="1"> To optimize accuracy we must choose the sentence that has the highest score (probability) given all of the KSs. This potentially requires a very large search space. The N-Best paradigm for integrating several diverse KSs has been described previously \[2, 10\]. First, we use a subset of the KSs to choose a small number of likely sentences. Then these sentences are scored using the remainder of the KSs.</Paragraph>
    <Paragraph position="2"> In Chow et. al., we also presented an efficient speech recognition search algorithm that was capable of computing the N most likely sentence hypotheses for an utterance, given the speech models and statistical language models.</Paragraph>
    <Paragraph position="3"> However, this algorithm greatly increases the needed computation over that needed for finding the best single sentence. In this paper we introduce two techniques that dramatically decrease the computation needed for the N-Best search. These algorithms are being used in a real-time SLS \[1\]. In the remainder of the introduction we review the exact N-Best search briefly and describe its problems. In Section 2 we describe two approximations to the exact algorithm and compare their accuracy with that of the exact algorithm.</Paragraph>
    <Paragraph position="4"> The resulting algorithm is still not fast enough for real-time implementation. In Section 3 we present a new sentence-level fast match scheme for continuous speech recognition. The algorithm is motivated by the mathematics of the Baum-Welch Forward-Backward training algorithm.</Paragraph>
    <Paragraph position="5"> The N-Best Paradigm The basic notion of the n-best paradigm is that, while we must ultimately use all the available KSs to improve recognition accuracy, the sources vary greatly in terms of perplexity reduction and required complexity. For example, a first-order statistical language model can reduce perplexity by at least a factor of 10 with little computation, while applying complete natural language (NL) models of syntax and semantics to all partial hypotheses typically requires more computation for less perplexity reduction. (Murveit \[6\] has shown that the use of an efficiently implemented syntax within a recognition search actually slowed down the search unless it was used very sparingly.) Therefore it is advantageous to use a strategy in which we use the most powerful, efficient KSs first to produce a scored list of all the likely sentences. This list is then filtered and reordered using the remaining KSs to arrive at the best single sentence.</Paragraph>
    <Paragraph position="6"> Figure 1 contains a block diagram that illustrates this basic idea. In addition to reducing total computation the resulting systems would be more modular ff we could separate radically different KSs.</Paragraph>
    <Paragraph position="7"> The Exact Sentence-Dependent Algorithm We have previously presented an efficient time-synchronous algorithm for finding the N most likely sentence hypotheses.</Paragraph>
    <Paragraph position="8"> This algorithm was unique in that it computed the correct forward probability score for each hypothesis found. The way this is accomplished is that, at each state, we keep an independent score for each different preceding sequence of words. That is, the scores for two theories are added only if the preceding word sequences are identical. We preserve up to N different theories at each state, as long as they are above the pruning beamwidth. This algorithm guarantees finding the N best hypotheses within a threshold of the best hypothesis. The algorithm was optimized to avoid expensive sorting operations so that it required computation that was less than linear with the number of sentence hypotheses found. It is easy to show that the inaccuracy in the scores computed is bounded by the product of the sentence length  knowledge sources, KS1, are used to find the N Best sentences. Then the remaining knowledge sources, KS2 are used to reorder the sentences and pick the most likely one.</Paragraph>
    <Paragraph position="9"> and the pruning beamwidth. For example, if a sentence is 1000 frarms long and a relative pruning beamwidth of 10-15 is maintained throughout the sentence, then all scores are guaranteed to be accurate to within 10 -12 of the maximum score. The proof is not given here, since it is not the subject of this paper. In the remainder of the paper we will refer to this particular algorithm as the Exact algorithm or the Sentence-Dependent algorithm.</Paragraph>
    <Paragraph position="10"> There is a problem associated with the use of this exact algorithm. If we assume that the probability of a single word being misrecognized is roughly independent of the position within a sentence, then we would expect that alonger sentence will have more errors. Consequently the typical rank of the correct answer will be lower (further from the top) on longer sentences. Therefore if we wanted the algorithm to find the correct answer within the list of hypotheses some fixed percentage of the time, the value of N will have to increase significantly for longer sentences.</Paragraph>
    <Paragraph position="11"> When we examine the different answers found we notice that, many of the different answers are simple one-word variations of each other. This is likely to result in much duplicated computation. One might imagine that if the difference between two hypothesized word sequences were several words in the past then any difference in score due to that past word would remain constant. In the next section we present two algorithms that attempt to avoid these problems.</Paragraph>
    <Paragraph position="12"> 2. Two Approximate N-Best Algorithms While the exact N-Best algorithm is theoretically interesting, we can generate lists of sentences with much less computation if we are willing to allow for some approximations. As long as the correct sentence can be guaranteed to be within the list, the list can always be reordered by rescoring each hypothesis individually at the end. We present two such approximate algorithms with reduced computation.</Paragraph>
    <Paragraph position="13"> Lattice N-Best The first algorithm will derive an approximate list of the N Best sentences with no more computation than the usual 1-Best search. Figure 2 illustrates the algorithm. Within words we use the time-synchronous forward-pass search algorithm \[8\], with only one theory at each state. We add the probabilities of all paths that come to each state. At each grammar node (for each frame) we simply store all of the theories that arrive at that node along with their respective scores in a traceback list. This requires no extra computation above the 1-Best algorithm. The score for the best hypothesis at the grammar node is sent on as in the norrnal time-synchronous forward-pass search. A pointer to the saved list is also sent on. At the end of the sentence we simply search (recursively) through the saved Iraceback lists for all of the complete sentence hypotheses that are above some threshold below the best theory. This recursive Iraceback can be performed very quickly. (We typically extract the 100 best answers, which causes no noticeable delay.) We call this algorithm the Lattice N-Best algorithm since we essentially have a dense word lattice represented by the traceback information. Another advantage of this algorithm is that it naturally produces more answers for longer sentences. null  This algorithm is similar to the one suggested by Steinbiss \[9\], with a few differences. First, he uses the standard Viterbi algorithm rather than the time-synchronous algorithm within words. That is he takes the maximum of the path probabilities at a state rather than the sum. We have observed a 20% higher error raate when using the maximum rather than the sum. The second difference is that when several word hypotheses come together at a common grammar node at the same lime, he traces back each of the choices and keeps the N (typically 10) best sentence hypotheses up to that lime and node. This step unnecessarily limits the o,mher of sentence hypotheses that are produced to N. As above the score of the best hypothesis is sent on to all words following the grammar node. At the end of the sentence he then has an approximation to the 3r best sentences. He reports that one third of the errors made by the 1-Best search are corrected in this way. However, as with a word lattice, many of the words are constrained to end at the same time - which leads to the main problem with this algorithm.</Paragraph>
    <Paragraph position="14"> The Lattice N-Best algorithm, while very fast, underestimates or misses high scoring hypotheses. Figure 3 shows an example in which two different words (words 1 and 2) can each be followed by the same word (word 3). Since there is only one theory at each state within a word, there is only one best beginning time. This best beginning time is determined by the best boundary between the best previous word (word 2 in the example) and the current word. But, as shown in Figure 3, the second-best theory involving a different previous word (word 1 in the example), would naturally  path for words 2-3 overrides the best path for words 1-3.</Paragraph>
    <Paragraph position="15"> Word-Dependent N-Best As a compromise between the exact sentence-dependent algorithm and the lattice algorithm we devised a Word-Dependent N-Best algorithm_ We reason that while the best starting lime for a word does depend on the preceding word, it probably does not depend on any word before that. Therefore instead of separating theories based on the whole preceding sequence, we separate them only ff previous word is different. At each state within the word we preserve the total probability for each of n(&lt;&lt; N) different preceding words. At the end of each word we record the score for each hypothesis along with the name of the previous word.</Paragraph>
    <Paragraph position="16"> Then we proceed on with a single theory with the name of the word that just ended. At the end of the sentence we perform a recursive traceback to derive a large list of the most likely sentences. The resulting theory paths are illustrated schematically in Figure 4. Like the lattice algorithm the  Best path for words 1-3 is preserved along with path for words 2-3.</Paragraph>
    <Paragraph position="17"> word-dependent algorithm naturally produces more answers for longer sentences. However, since we keep multiple theories within the word, we correctly identify the second best path. While the computation needed is greater than for the lattice algorithm it is less than for the sentence-dependent algorithm, since the number of theories only needs to account for number of possible previous words - not all possible preceding sequences. Therefore the number n, of theories kept locally only needs to be 3 to 6 instead of 20 to 100.</Paragraph>
    <Paragraph position="18"> Comparison of N-Best Algorithms We performed experiments to compare the behavior of the three N-Best algorithms. In all three cases we used the Class Grammar \[3\], a first-order statistical grammar based on 100 word classes. All words within a class are assumed equally likely. The test set perplexity is approximately 100. The test set used was the June '88 speaker-dependent test set of 300 sentences. To enable direct comparison with previous results we did not use models of triphones across word boundaries, and the models were not smoothed. We expect all three algorithms to improve significantly when the latest modeling methods are used.</Paragraph>
    <Paragraph position="19">  Figure 5 shows the cumulative distribution of the rank of the correct answer for the three algorithms. As can be seen, all three algorithms get the sentence correct on the first choice about 62% of the time. All three cumulative distributions increase substantially with more choices. However, we observe that the Word-Dependent algorithm yields accuracies quite close to that of the Exact Sentence-Dependent algorithm, while the Lattice N-Best is substantially worse.</Paragraph>
    <Paragraph position="20"> In particular, the sentence error rate at rank 100 (8%) is double that of the Word-Dependent algorithm (4%). Therefore, ff we can afford the computation of the Word-Dependent algorithm it is clearly preferred.</Paragraph>
    <Paragraph position="21"> We also observe in Figure 5 that the Word-Dependent algorithm is actually better than the Sentence-Dependent algorithm for very high ranks. This is because the score of the correct word sequence fell outside the pruning beamwidth.</Paragraph>
    <Paragraph position="22"> However, in the Word-Dependent algorithm each hypothesis gets the benefit of the best theory two words back. Therefore the correct answer was preserved in the traceback. This is another advantage that both of the approximate algorithms have over the Sentence-Dependent algorithm.</Paragraph>
    <Paragraph position="23"> In the next section we describe a technique that can be used to speed up all of these time-synchronous search algorithms by a large factor.</Paragraph>
  </Section>
class="xml-element"></Paper>