<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1017"> <Title>Vassilios Digalakis</Title> <Section position="4" start_page="0" end_page="87" type="metho"> <SectionTitle> 2. PRIOR ART </SectionTitle> <Paragraph position="0"> There are three important categories of techniques that aim to solve problems similar to the ones the progressive search techniques target.</Paragraph> <Section position="1" start_page="0" end_page="87" type="sub_section"> <SectionTitle> 2.1. Fast-Match Techniques </SectionTitle> <Paragraph position="0"> Fast-match techniques [1] are similar to progressive search in that a coarse match is used to constrain a more advanced, computationally burdensome algorithm. The fast match, however, uses only the local speech signal to constrain the costly advanced technique. Since the advanced techniques may take advantage of non-local data, the accuracy of a fast match is limited and will ultimately limit the overall technique's performance. Techniques such as progressive search can bring more global knowledge to bear when generating constraints and, thus, more effectively speed up the costly techniques while retaining more of their accuracy.</Paragraph> <Paragraph position="1"> 2.2. N-Best Recognition Techniques N-best techniques [2] are also similar to progressive search in that a coarse match is used to constrain a more computationally costly technique. In this case, the coarse matcher is a complete (simple) speech recognition system. The output of the N-best system is a list of the top N most likely sentence hypotheses, which can then be evaluated with the slower but more accurate techniques.</Paragraph> <Paragraph position="2"> Progressive search is a generalization of N-best: the earlier-pass technique produces a graph instead of a list of N-best sentences. This generalization is crucial because N-best is only computationally effective for N on the order of tens or hundreds. A progressive search word graph can effectively account for orders of magnitude more sentence hypotheses. By limiting the advanced techniques to searching just the top N sentences, N-best is destined to limit the effectiveness of the advanced techniques and, consequently, the overall system's accuracy. Furthermore, it does not make much sense to use N-best in an iterative fashion, as one can with progressive searches.</Paragraph> </Section> <Section position="2" start_page="87" end_page="87" type="sub_section"> <SectionTitle> 2.3. Word Lattices </SectionTitle> <Paragraph position="0"> This technique is the most similar to progressive search.</Paragraph> <Paragraph position="1"> In both approaches, an initial-pass recognition system can generate a lattice of word hypotheses. Subsequent passes can search through the lattice to find the best recognition hypothesis. It should be noted that, although we refer to these lattices as word lattices, they could be used at other linguistic levels, such as the phoneme or syllable.</Paragraph> <Paragraph position="2"> In the traditional word-lattice approach, the word lattice is viewed as a scored graph of possible segmentations of the input speech. The lattice contains information such as the acoustic match between the input speech and the lattice word, as well as segmentation information.</Paragraph> <Paragraph position="3"> The progressive search lattice is not viewed as a scored graph of possible segmentations of the input speech. Rather, the lattice is simply viewed as a word-transition grammar which constrains subsequent recognition passes. Temporal and scoring information is intentionally left out of the progressive search lattice.</Paragraph>
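To make the distinction concrete, the following Python sketch (hypothetical words and values, not taken from the paper) contrasts a traditional scored, segmented word-lattice entry with the unscored word-transition grammar kept by progressive search:

    # Traditional word lattice: each entry fixes a segmentation and a score.
    scored_lattice = [
        {"word": "show",  "start_frame": 12, "end_frame": 41, "acoustic_score": -812.4},
        {"word": "shows", "start_frame": 12, "end_frame": 47, "acoustic_score": -905.1},
    ]

    # Progressive-search word graph: only words and allowed word-to-word
    # transitions are kept; times and scores are deliberately dropped, so
    # later passes are free to re-segment and re-score the speech.
    word_graph = {
        "nodes": ["<s>", "show", "shows", "me", "</s>"],
        "arcs": [("<s>", "show"), ("<s>", "shows"),
                 ("show", "me"), ("shows", "me"), ("me", "</s>")],
    }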
<Paragraph position="4"> This is a critical difference. In the traditional word-lattice approach, many segmentations of the input speech which could not be generated (or scored well) by the earlier-pass algorithms will be eliminated from consideration before the advanced algorithms are used. With progressive-search techniques, these segmentations are implicit in the grammar and can be recovered by the advanced techniques in subsequent recognition passes.</Paragraph> </Section> </Section> <Section position="5" start_page="87" end_page="88" type="metho"> <SectionTitle> 3. Building Progressive Search Lattices </SectionTitle> <Paragraph position="0"> The basic step of a progressive search system is using a speech recognition algorithm to make a lattice which will be used as a grammar for a more advanced speech recognition algorithm. This section discusses how these lattices may be generated. We focus on generating word lattices, though the same algorithms are easily extended to other levels.</Paragraph> <Section position="1" start_page="87" end_page="88" type="sub_section"> <SectionTitle> 3.1. The Word-Life Algorithm </SectionTitle> <Paragraph position="0"> We implemented the following algorithm to generate a word lattice as a by-product of the beam search used in recognizing a sentence with the DECIPHER TM system [4-7]; a sketch of the construction is given below.</Paragraph> <Paragraph position="1"> 1. For each frame t, insert into the table Active(W, t) all words W active at time t. Similarly, construct tables End(W, t) and Transitions(W1, W2, t) for all words ending at time t and for all word-to-word transitions at time t.</Paragraph> <Paragraph position="2"> 2. Create a table containing the word-lives used in the sentence, WordLives(W, Tstart, Tend). A word-life for word W is defined as a maximum-length interval (frame Tstart to Tend) during which some phone in word W is active; that is, Active(W, t) holds for Tstart <= t <= Tend. 3. Remove word-lives from the table if the word never ended between Tstart and Tend; that is, remove WordLives(W, Tstart, Tend) if there is no time t between Tstart and Tend where End(W, t) is true.</Paragraph> <Paragraph position="3"> 4. Create a finite-state graph whose nodes correspond to word-lives and whose arcs correspond to the word-life transitions stored in the Transitions table. This finite-state graph, augmented by language model probabilities, can be used as a grammar for a subsequent recognition pass in the progressive search.</Paragraph> <Paragraph position="4"> This algorithm can be efficiently implemented, even for large-vocabulary recognition systems. That is, the extra work required to build the &quot;word-life lattice&quot; is minimal compared to the work required to recognize the large vocabulary with an early-pass speech recognition algorithm.</Paragraph> <Paragraph position="5"> This algorithm develops a grammar which contains all whole-word hypotheses the early-pass speech recognition algorithm considered. If a word hypothesis was active and the word was processed by the recognition system until the word finished (was not pruned before transitioning to another word), then this word will be generated as a lattice node. Therefore, the size of the lattice is directly controlled by the recognition search's beam width.</Paragraph>
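The following Python sketch illustrates the word-life construction in steps 1-4 above. It is a rough illustration only: the per-frame bookkeeping (sets of active words, ending words, and word-to-word transitions) is assumed to be available from the beam search, the exact frame conventions depend on the decoder, and the function and variable names are invented rather than taken from the DECIPHER implementation.

    def build_word_life_lattice(active, ended, transitions, num_frames):
        """Sketch of the word-life lattice construction (Section 3.1).

        active[t]      -- set of words active at frame t         (table Active)
        ended[t]       -- set of words ending at frame t         (table End)
        transitions[t] -- set of (w_from, w_to) pairs at frame t (table Transitions)
        """
        # Steps 1-2: collect maximal intervals [t_start, t_end] during which
        # some phone of each word is active (the word-lives).
        word_lives = []
        open_lives = {}                  # word -> frame where its current life started
        for t in range(num_frames):
            for w in active[t]:
                open_lives.setdefault(w, t)
            for w in list(open_lives):
                if w not in active[t]:
                    word_lives.append((w, open_lives.pop(w), t - 1))
        for w, t_start in open_lives.items():
            word_lives.append((w, t_start, num_frames - 1))

        # Step 3: keep only word-lives whose word actually ended inside the
        # interval, i.e. was not pruned before transitioning to another word.
        word_lives = [(w, ts, te) for (w, ts, te) in word_lives
                      if any(w in ended[t] for t in range(ts, te + 1))]

        # Step 4: nodes are word-lives; arcs follow the recorded word-to-word
        # transitions that fall inside both word-lives.
        life_at = {}                     # (word, frame) -> word-life index
        for i, (w, ts, te) in enumerate(word_lives):
            for t in range(ts, te + 1):
                life_at[(w, t)] = i
        arcs = set()
        for t in range(num_frames):
            for (w_from, w_to) in transitions[t]:
                if (w_from, t) in life_at and (w_to, t) in life_at:
                    arcs.add((life_at[(w_from, t)], life_at[(w_to, t)]))
        return word_lives, arcs          # the graph becomes the grammar for the next pass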
<Paragraph position="6"> This algorithm, unfortunately, does not scale down well: it has the property that small lattices may not contain the best recognition hypotheses. This is because one must use small beam widths to generate small lattices, and a small beam width will likely generate pruning errors.</Paragraph> <Paragraph position="7"> Because of this deficiency, we have developed the forward-backward word-life algorithm. We wish to generate word lattices that scale down gracefully. That is, they should have the property that when a lattice is reduced in size, the most likely hypotheses remain and the less likely ones are removed. As was discussed, this is not the case if lattices are scaled down by reducing the beam search width.</Paragraph> <Paragraph position="8"> The forward-backward word-life algorithm achieves this scaling property. In this new scheme, described below, the size of the lattice is controlled by the LatticeThresh parameter. 1. A standard beam search recognition pass is done using the early-pass speech recognition algorithm. (None of the lattice-building steps from Section 3.1 are taken in this forward pass.)</Paragraph> <Paragraph position="9"> 2. During this forward pass, whenever a transition leaving word W is within the beam search, we record that probability in ForwardProbability(W, frame).</Paragraph> <Paragraph position="10"> 3. We store the probability of the best scoring hypothesis from the forward pass, Pbest, and compute a pruning value Pprune = Pbest * LatticeThresh.</Paragraph> <Paragraph position="11"> 4. A backward-pass recognition is then performed, applying the lattice-building steps of Section 3.1, with one exception.</Paragraph> <Paragraph position="12"> 5. During the backward pass, whenever there is a transition between words Wi and Wj at time t, we compute the overall hypothesis probability Phyp as the product of ForwardProbability(Wj, t-1), the language model probability P(Wi | Wj), and the backward-pass probability of word Wi starting at time t (i.e., the probability of starting word Wi at time t and finishing the sentence). If Phyp < Pprune, then the backward transition between Wi and Wj at time t is blocked.</Paragraph> <Paragraph position="13"> Step 5 above implements a backward-pass pruning algorithm. This both greatly reduces the time required by the backward pass and adjusts the size of the resultant lattice.</Paragraph> </Section> </Section> <Section position="6" start_page="88" end_page="89" type="metho"> <SectionTitle> 4. Progressive Search Lattices </SectionTitle> <Paragraph position="0"> We have experimented with generating word lattices where the early-pass recognition technique is a simple version of the DECIPHER TM speech recognition system: a 4-feature, discrete-density HMM trained to recognize a 5,000-word vocabulary taken from DARPA's WSJ speech corpus. The test set is a difficult 20-sentence subset of one of the development sets.</Paragraph> <Paragraph position="1"> We define the number of errors in a single path p in a lattice, Errors(p), to be the number of insertions, deletions, and substitutions found when comparing the words in p to a reference string. We define the number of errors in a word lattice to be the minimum of Errors(p) over all paths p in the word lattice.</Paragraph> <Paragraph position="2"> The following tables show the effect that adjusting the beam width and LatticeThresh has on the lattice error rate and on the lattice size (the number of nodes and arcs in the word lattice).</Paragraph> <Paragraph position="3"> The grammar used by the recognition system has approximately 10,000 nodes and 1,000,000 arcs. The simple recognition system had a 1-best word error rate ranging from 27% (beam width 1e-52) to 30% (beam width 1e-30).</Paragraph>
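The lattice error defined above can be computed with a dynamic program over the word graph: for each lattice node and each prefix of the reference string, keep the fewest errors of any partial path. The sketch below is our own illustration rather than the scoring code used in these experiments; it assumes words attached to lattice nodes, non-word start and end nodes, and nodes supplied in topological order.

    def lattice_errors(nodes, word, preds, start, end, reference):
        """Minimum Errors(p) over all paths p from start to end.

        nodes     -- node ids in topological order (start first, end last)
        word      -- dict node -> word, with word[start] = word[end] = None
        preds     -- dict node -> list of predecessor nodes
        reference -- list of reference words
        """
        INF = float("inf")
        n_ref = len(reference)
        cost = {v: [INF] * (n_ref + 1) for v in nodes}
        cost[start] = list(range(n_ref + 1))      # j skipped reference words = j deletions

        for v in nodes:
            if v == start:
                continue
            row = cost[v]
            for u in preds[v]:
                for j in range(n_ref + 1):
                    if cost[u][j] == INF:
                        continue
                    if word[v] is None:            # non-word end node: pass scores through
                        row[j] = min(row[j], cost[u][j])
                    else:
                        row[j] = min(row[j], cost[u][j] + 1)          # insertion
                        if j < n_ref:
                            sub = 0 if word[v] == reference[j] else 1  # match / substitution
                            row[j + 1] = min(row[j + 1], cost[u][j] + sub)
            for j in range(1, n_ref + 1):          # deletions of reference words
                row[j] = min(row[j], row[j - 1] + 1)

        return cost[end][n_ref]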
<Paragraph position="4"> The two-order-of-magnitude reduction in lattice size has a significant impact on HMM decoding time. Table 2 shows the per-sentence computation time required for the above test set when computed using a Sparc2 computer, for both the original grammar and word-lattice grammars generated using a LatticeThresh of 1e-23.</Paragraph> </Section> <Section position="7" start_page="89" end_page="89" type="metho"> <SectionTitle> 5. Applications of Progressive Search Schemes </SectionTitle> <Paragraph position="0"> Progressive search schemes can be used in the same way N-best schemes are currently used. The two primary applications we've had at SRI are: 5.1. Reducing the time required to perform speech recognition experiments At SRI, we've been experimenting with large-vocabulary tied-mixture speech recognition systems. Using a standard decoding approach, average decoding times for recognizing speech with a 5,000-word bigram language model were 46 times real time. Using lattices generated with beam widths of 1e-38 and a LatticeThresh of 1e-18, we were able to decode in 5.6 times real time. Further, there was no difference in recognition accuracy between the original and the lattice-based system.</Paragraph> <Paragraph position="1"> 5.2. Implementing recognition schemes that cannot be implemented with a standard approach.</Paragraph> <Paragraph position="2"> We have implemented a trigram language model on our 5,000-word recognition system. This would not be feasible using standard decoding techniques. Typically, continuous-speech trigram language models are implemented either with fast-match technology or, more recently, with N-best schemes. However, it has been observed at BBN that using an N-best scheme (N=100) to implement a trigram language model for a 20,000-word continuous speech recognition system may have significantly reduced the potential gain from the language model. That is, about half of the time, correct hypotheses that would have had better (trigram) recognition scores than the other top-100 sentences were not included in the top 100 sentences generated by a bigram-based recognition system [8].</Paragraph> <Paragraph position="3"> We have implemented trigram-based language models using word lattices, expanding the finite-state network as appropriate to unambiguously represent contexts for all trigrams. We observed that the number of lattice nodes increased by a factor of 2-3 and the number of lattice arcs increased by a factor of approximately 4 (using lattices generated with beam widths of 1e-38 and a LatticeThresh of 1e-18). The resulting decoding times increased by approximately 50% when using trigram lattices instead of bigram lattices.</Paragraph> </Section> </Paper>