<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2049"> <Title>SPEECH RECOGNITION IN PARALLEL</Title> <Section position="3" start_page="354" end_page="354" type="metho"> <SectionTitle> MULTIPLE SPEECH RECOGNIZERS COMBINED INTO ONE </SectionTitle> <Paragraph position="0"> One serious problem with most speech recognition systems is that the output from the acoustic preprocessor is a small set of numbers that carry too little information about the original speech signal. To address this problem, researchers have begun examining how additional extracted information improves the recognition process</Paragraph> </Section> <Section position="4" start_page="354" end_page="355" type="metho"> <SectionTitle> RELATED APPROACHES </SectionTitle> <Paragraph position="0"> For example, K.-F. Lee \[Lee 88\] has recently constructed at Carnegie-Mellon University the most accurate large-vocabulary continuous-speech recognizer so far. Starting with a baseline system based on hidden Markov models, he made a succession of improvements to its acoustic-modeling component that significantly increased its recognition accuracy. Simply adding another set of acoustic parameters provided the greatest improvement in recognition accuracy; using triphone models to model the coarticulatory effects of a sound's context on that sound provided the next greatest improvement. There is clearly room for further improvement in acoustic modeling.</Paragraph> <Paragraph position="1"> A number of inaccuracies in the assumptions underlying most Markov models of speech further distort and blur the information that remains. P. 
Brown \[Brown 87\] has recently conducted experiments that seem to indicate that continuous-parameter distributions can model speech more accurately than vector-quantized feature vectors.</Paragraph> <Paragraph position="2"> Continuous-parameter distributions require vast amounts of training data, however, and little is known about the shape of the actual distribution of speech sounds.</Paragraph> <Paragraph position="3"> Even though the acoustic analysis of speech has received a great deal of attention (see \[Makhoul 85\]), there is little agreement about the best set of acoustic parameters to use. Many sets have been proposed and tested, and include * Energy at different frequencies in the speech signal; * Time-domain characterizations of the waveform, such as zero crossings, or zero crossings in various frequency bands; * Productive models, such as models of the vocal tract as a set of resonant tubes; * Perceptual models, based on physical models of the cochlea and psychoacoustic experiments.</Paragraph> <Paragraph position="4"> In fact, few hard data are available about the relative strengths and weaknesses of the different parameter sets. Little published data exists on systems that have attempted recognition utilizing combinations of different acoustic parameter sets.</Paragraph> <Paragraph position="5"> Several systems have tried to combine a single set of parameters with information about their time derivatives; some of these are described below.</Paragraph> <Paragraph position="6"> One system using more than one parameter set is described by \[Tribolet 82\]. This isolated-word system based on dynamic time warping used long-term LPC coefficients in slowly-changing sections of speech and short-term LPC coefficients in unvoiced fast-changing segments; the categorization of sections of speech as slowly-changing or fast-changing depended on a comparison of the short-term and long-term LPC coefficients. 
Adding short-term coefficients to a baseline system based on long-term coefficients contributed very little to recognition accuracy. The authors suggest that short-term LPC coefficients were an inappropriate choice of parameters for the rapidly-changing unvoiced sections of speech, and that the large steady-state regions in vowels swamped the contributions to the distance computation of the few transient frames.</Paragraph> <Paragraph position="7"> More recently, K.-F. Lee \[Lee 88\] has used multiple sets of parameters (measurements of energy in 12 frequency bands, differenced measurements of energy, total power, and differenced power) quantized with three disjoint codebooks to increase recognition accuracy. In his system, each frame of parameters consisted of an entry from each codebook, which the system treated as uncorrelated. Thus the probability distribution on each transition consisted of the product of the distributions for each codebook. Using three separate codebooks led to much lower quantization error and higher recognition accuracy than concatenating the sets of parameters and using a single codebook. The three sets of acoustic parameters he combined were all computed in the same fixed-width time windows. He also attempted to combine parameters computed in variable-width windows with fixed-width parameters; this approach was ad hoc and led to little increase in recognition accuracy.</Paragraph> <Paragraph position="8"> P. Brown \[Brown 87\] has concatenated acoustic parameters from adjacent time frames in a recognition system using continuous distributions and full covariance matrices. He applied discriminant analysis to reduce the number of parameters to train. 
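Lee's multi-codebook scheme described above treats the codebooks as uncorrelated, so a frame's probability is the product of the per-codebook probabilities (a sum in the log domain). A minimal sketch of this scoring step, with hypothetical toy distributions (the codebook sizes and values are illustrative, not Lee's):

```python
import math

# Hypothetical per-codebook output distributions for one HMM transition:
# each maps a VQ code (int) to a probability. The three codebooks are
# treated as uncorrelated, so the joint frame probability is the product
# of the per-codebook probabilities -- a sum of logs.
def frame_log_prob(frame_codes, codebook_dists):
    """frame_codes: one VQ code per codebook; codebook_dists: one dict per codebook."""
    total = 0.0
    for code, dist in zip(frame_codes, codebook_dists):
        total += math.log(dist.get(code, 1e-10))  # floor unseen codes
    return total

# Toy example: three codebooks, one frame of three VQ codes.
dists = [{0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}, {0: 1.0}]
lp = frame_log_prob((0, 1, 0), dists)  # log(0.5 * 0.75 * 1.0)
```

Concatenating the parameter sets into one codebook would instead force a single distribution over the joint space, which is where the higher quantization error arises.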
His approach led to an improvement in recognition accuracy on the E-set (the easily confused &quot;e&quot;-sounding letters of the alphabet).</Paragraph> </Section> <Section position="5" start_page="355" end_page="357" type="metho"> <SectionTitle> COMBINING INDEPENDENT RECOGNIZERS </SectionTitle> <Paragraph position="0"> We are attacking the problems of low recognition accuracy and overwhelming computational demands by combining multiple independently-executing recognizers into one through trained, weighted voting schemes.</Paragraph> <Paragraph position="1"> Several sets of recognizers might be constructed from different sets of acoustic parameters, or from different procedures for constructing codebooks. In the ideal case, the recognizers so combined would have independent (uncorrelated) error patterns, so that a simple vote could reduce the overall error rate dramatically. In such a case, if p is the error rate of the worst of n recognizers, the &quot;combined&quot; recognizer would be no worse than:</Paragraph> <Paragraph position="3"> Since the errors made by the recognizers will undoubtedly be correlated, albeit imperfectly, a simple vote will not suffice. In essence, we seek statistically sound methods to build into our combined recognizer the independence of the component recognizers needed to approach the ideal case.</Paragraph> <Paragraph position="4"> Evidence of the soundness of this approach has been provided by \[Stephanou 88\]. In that paper, the authors studied the combination of two inherently incomplete knowledge sources that reach a consensus opinion using Dempster-Shafer theory in a problem-solving setting. They rigorously prove that the entropy of the combined consensus system is less than the entropy of each individual system. 
Thus, theoretically, errors made by the combined system should appear less frequently than in either constituent system.</Paragraph> <Paragraph position="5"> Initially, we are studying the correlations among a number of different acoustic parameter sets and different codebooks, both in a purely statistical fashion and in terms of the errors made by recognizers built on the parameter sets. A recent paper by Gillick and Cox \[Gillick 89\] provides a principled means of determining the statistical significance of the differences among alternative recognition systems. Then, using a reasonable set of maximally independent parameter sets and codebooks, we will construct combined recognizers that use the remaining correlations (inverted) as weights for voting.</Paragraph> <Paragraph position="6"> Voting among recognition systems based on hidden Markov models can be incorporated as part of the traceback procedure. In a single recognizer, the Viterbi dynamic-programming search algorithm constructs a matrix that for each state and input frame tells the most likely state at the previous input frame. The traceback procedure begins at the final state of the network and the final input frame and, proceeding backwards in time, constructs the most likely path through the network. In combining multiple recognizers, we propose to maintain for each recognizer several back pointers, together with a rating of each, at each state and input frame, and to trace back all the models together, pooling the information from all the recognizers.</Paragraph> <Paragraph position="7"> One major question about the procedure will be the granularity of the voting scheme. During the traceback, do the various recognizers vote on individual words, on phonemes within words, or on analysis frames within phonemes? 
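Whatever granularity is chosen, the idealized gain from voting among independent recognizers can be checked numerically. Under the ideal-case assumption above (n recognizers erring independently, each at rate at most p), a majority vote errs only when more than half err simultaneously; this sketch computes that binomial tail (illustrative values, not measured error rates):

```python
from math import comb

def majority_error_bound(p, n):
    """Probability that a strict majority of n independent recognizers,
    each erring with rate p, are simultaneously wrong."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Three recognizers at 10% error each: the ideal combined error
# is 3*(0.1^2)(0.9) + 0.1^3 = 0.028, well below any single recognizer.
bound = majority_error_bound(0.10, 3)
```

Correlated errors inflate this figure toward the single-recognizer rate, which is why the correlation study above matters.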
The frequency of voting may affect recognition accuracy as well as the decomposability of the recognition procedure.</Paragraph> <Paragraph position="8"> This is a key technical problem to be studied in this research. At present we have some idea of general approaches that may have utility. For example, symbolic AI machine learning techniques may be exploited to learn associational rules that infer likely recognized phones or words from contextual information provided by multiple recognizers trained for different units of speech. The output of a single recognizer of some interval of speech is typically a list of rank-ordered candidate phones, that is, a list of symbols representing the modelled phones believed to be present in the interval of speech. Two recognizers operating on the same interval of speech would produce two lists of symbols (presuming, of course, both recognizers can be time aligned with accuracy). These two lists of symbols may be viewed as two &quot;representations&quot; of the same speech source, or rather two sets of &quot;hypotheses&quot; of a particular speech utterance. The task is then to define a method of combining these two bodies of evidence to reach a consensus on what was uttered.</Paragraph> <Paragraph position="9"> Suppose we have two recognizers, one based on diphone models, the other on triphones. Rules may be learned during training that associate the appearance of some diphone in one list with the appearance of some triphone model in the other list. One idea is to use the list of triphone candidates generated during recognition along with the associational rules to reorder the list of candidate diphones from the other recognizer. During recognition, each applicable associational rule can be applied by scanning a triphone candidate list generated by one recognizer and adding to a &quot;frequency count&quot; for each associated diphone appearing in the candidate list of the other recognizer. 
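The rule-driven reordering just described can be sketched as follows; the rule table and the phone symbols are hypothetical stand-ins for rules that would be learned during training:

```python
from collections import defaultdict

# Hypothetical learned associational rules: each triphone symbol is
# associated with the diphones it lends evidence to.
rules = {"t-iy-n": ["t-iy", "iy-n"], "s-iy-n": ["s-iy", "iy-n"]}

def rerank(diphone_candidates, triphone_candidates, rules):
    """Reorder one recognizer's diphone candidate list by frequency counts
    accumulated from the other recognizer's triphone candidate list."""
    counts = defaultdict(int)
    for tri in triphone_candidates:
        for di in rules.get(tri, []):
            if di in diphone_candidates:
                counts[di] += 1
    # Stable sort: ties keep the original (distance-based) ranking.
    return sorted(diphone_candidates, key=lambda d: -counts[d])

reordered = rerank(["s-iy", "t-iy", "iy-n"], ["t-iy-n", "s-iy-n"], rules)
```

Here "iy-n" is supported by both triphone candidates and rises to the top, while the tied remainder keep their original order.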
The end result would be a reordered candidate list from the diphone-based recognizer, ranked according to maximum frequency count.</Paragraph> <Paragraph position="10"> It is difficult to propose a cogent argument for the utility of this approach without first performing a number of preliminary experiments. For example, it might be more useful to reorder the candidate list by using the normalized frequency counts as weights applied to the distances of the initial candidate diphones. Issues of time alignment, integration of multiple models, and possible exponential growth in the number of rules generated must be studied first with specific examples. The point is, however, that there is a distinct possibility that we may reduce the &quot;voting problem&quot; to a symbolic form suitable for applying various AI techniques. Alternatively, neural network techniques may have utility in solving this problem.</Paragraph> </Section> <Section position="6" start_page="357" end_page="358" type="metho"> <SectionTitle> INCORPORATING HIGHER LEVEL CONSTRAINTS </SectionTitle> <Paragraph position="0"> Following initial experimentation with isolated word recognition, we will turn our attention to problems involved in continuous speech. It is at that point that we plan to incorporate higher level constraints from syntax and semantics.</Paragraph> <Paragraph position="1"> These constraints can be extremely helpful in discriminating between several words when low level recognition techniques produce more than one highly ranked candidate. For example, sentential syntactic context can be used to discriminate two similar sounding words such as &quot;form&quot; and &quot;farm&quot; when appearing in a sentence such as &quot;The clouds form a pattern that continues for miles.&quot; Our approach will be to construct a separate natural language recognizer that will vote on candidate words based primarily on syntactic expectations. 
While many have recognized the value of including natural language approaches in speech recognition, few systems include more than a very simple grammar (see, for example, \[Lee 88\]). One of the problems in adapting natural language approaches for speech is that people rarely speak in complete, grammatical sentences. Instead, speech errors and colloquialisms abound. Previous approaches to dealing with ungrammatical input have focused either on systematically relaxing grammatical constraints \[Weischedel 83\], on developing grammars specifically for speech through analysis of transcribed speech \[Hindle 89\], or on the use of semantically based parsers \[Lebowitz 83, Lytinen 84\]. Of these, we favor relaxing constraints, primarily because large speech grammars are not currently available for general use and because we have found that syntax does provide very useful clues to interpretation.</Paragraph> <Paragraph position="2"> Relaxing grammatical constraints to find an acceptable parse is a combinatorially explosive computational task, however. We have selected a functional unification formalism, FUF, for the task for this reason. One of its primary advantages is its ability to deal with the complex interaction of multiple constraints. Furthermore, its functional aspects will allow us to encode more than purely syntactic constraints and these can be called into service in parallel with syntax. Finally, we plan to implement a parallel version of the unification algorithm and this should also address computational issues in relaxing constraints.</Paragraph> <Paragraph position="3"> We have already developed a large grammar in FUF that we are currently using for generation along with a well documented unifier that has been substantially tested, in part through class use \[Elhadad 89\]. To use FUF for speech recognition, our tasks for the early phase of our proposed research will include the conversion of our FUF unifier for interpretation of language. 
Note that this will result in a reversible syntactic processor so that we will be well situated to use components of our speech recognizer for synthesis as well. The second task will be the parallelization of the unification algorithm. Because the search through the grammar is a decomposable problem (each alternative in the set of grammar rules can be passed off to a different processor), this is clearly possible.</Paragraph> <Paragraph position="4"> The final step will be the incorporation of the natural language recognizer with the remaining lower level recognizers. This will involve the development of a suitable voting scheme that can adequately weigh advice from both components.</Paragraph> </Section> <Section position="7" start_page="358" end_page="359" type="metho"> <SectionTitle> PARALLELISM </SectionTitle> <Paragraph position="0"> Our study aims to balance generality with economy and performance. We expect to fall in the middle by uncovering minimal parallel processing requirements to meet the needs of the model evaluation portion of the speech recognition task; these requirements may be met by an architecture that is not fully general (implying low cost in hardware) but also not highly specialized (implying that it is not limited too severely in scope of use). We believe an architecture composed of a distributed set of processing elements (PE's), each containing local memory and high speed DSP processors, with a limited interconnection and communication capability may suit our needs. From our studies so far, we have found that a general high flux interconnect is simply not needed to speed up dynamic programming. 
Communication does not appear to dominate the computation involved in the various algorithms employed in speech recognition.</Paragraph> <Paragraph position="1"> In our earlier work, we invented and implemented on the DADO2 parallel computer one approach to parallelizing the match task of test patterns to references by parallelizing the dynamic programming algorithm. The approach is strictly a data parallel activity. This work has been previously reported in \[Stolfo 87\] where we carefully compare our approach to a similar yet different approach proposed by Bentley and Kung. In \[Bentley and Kung 79\], they propose a systolic parallel machine, based on a &quot;dual tree&quot; interconnection, to rapidly execute a stream of simple dictionary-type queries. (Their machine was never realized.) For the present paper, we briefly outline a class of searching problems that generalizes those investigated by Bentley and Kung and describe the method by which we parallelize them on the DADO parallel computer.</Paragraph> <Paragraph position="2"> A static searching problem is defined as follows: * Preprocess a set F of N objects into an internal data structure D.</Paragraph> <Paragraph position="3"> * Answer queries about the set F by analyzing the structure D.</Paragraph> <Paragraph position="4"> Following Bentley \[Bentley 78\], we note that in many cases, such problems are solvable by serial solutions with linear storage and logarithmic search time complexity.</Paragraph> <Paragraph position="5"> The membership problem provides an illustration of this kind of problem. Preprocess N elements of a totally ordered set F such that queries of the form &quot;is x in F&quot; can be answered quickly. The common solution for serial computers is to store F in a sorted binary tree structure D and perform binary search. (Of course, hashing is common for problems where hashing functions can be defined.) 
Thus, the membership problem can be computed on sequential computers with logarithmic time complexity.</Paragraph> <Paragraph position="6"> A decomposable searching problem is a searching problem in which a query asking the relationship of a new object x to a set of objects F can be written as Query(x, F) = B_{f in F} q(x, f),</Paragraph> <Paragraph position="8"> where B is the repeated application of a commutative, associative binary operator b that has an identity and q is a &quot;primitive query&quot; applied between the new object x and each element f of F. Hence, membership is a decomposable searching problem when cast as Member(x, F) = OR_{f in F} equal(x, f).</Paragraph> <Paragraph position="10"> The Nearest Neighbor problem, which determines for an arbitrary point x in the plane its nearest neighbor in a set F of N points, can be cast as NN(x, F) = MIN_{f in F} distance(x, f).</Paragraph> <Paragraph position="11"> Based on the work of Dobkin and Lipton, Bentley points out that Nearest Neighbor has log time serial complexity in searching a static data structure. The Nearest Neighbor problem is closest to problems in pattern recognition to be detailed shortly.</Paragraph> <Paragraph position="12"> Decomposable searching problems are well suited to direct parallel execution. The key idea about these kinds of problems is decomposability. To answer a query about F, we can combine the answers of the query applied to arbitrary subsets of F. This characteristic also guarantees quick execution in a parallel environment. The idea is simply to partition the set F into a number of subsets equal to N, the number of available PE's. (For pedagogical reasons, in what follows we assume a single set element f is stored at a PE.) Apply the query q in parallel at each PE between the unknown x that is communicated to all the PE's and the locally stored set element f. Finally, combine the answers in parallel by log N repetitions of b. This last step proceeds by applying N/2 b-computations simultaneously between &quot;adjacent&quot; pairs of PE's. 
The N/2 resultant values are processed again in the same fashion. Hence N/4 b-computations are applied in parallel, producing N/4 results. After log N steps the final single result is computed.</Paragraph> </Section> <Section position="8" start_page="359" end_page="360" type="metho"> <SectionTitle> ALMOST DECOMPOSABLE SEARCHING PROBLEMS </SectionTitle> <Paragraph position="0"> The approach we invented (and implemented on the DADO2 machine \[Stolfo 87\]) is quite different from Bentley and Kung's approach to solving decomposable searching problems as well as variations that we call almost decomposable searching problems. In a nutshell, queries are rapidly broadcast to all the PE's in a parallel processor.</Paragraph> <Paragraph position="1"> Primitive queries are executed in parallel by all PE's, and in several important cases, the combined result of applying operator b is obtained very quickly with parallel hardware support. (In the case of DADO2, this step takes one instruction cycle for up to 8,000 PE's.) We have called this mode of operation Broadcast/Match/Resolve/Report and it will be described with examples shortly.</Paragraph> <Paragraph position="2"> First, we note several variations of these kinds of problems to clarify the benefits of the approach that we have taken: 1. Consider searching problems where a static data structure, D, cannot be defined. In vector quantization of some random source, or template matching in isolated word recognition, finding the best match (closest centroid of the clustered space) requires the calculation of the distance of the new unknown sample to all members of the reference set. Binary searching, for example, of the set of centroids is not possible in general.</Paragraph> <Paragraph position="3"> 2. Consider problems where a single query in a series of queries cannot be computed without knowing the result of the previous query. 
In dynamic programming approaches to statistical pattern matching tasks, a single match of an unknown against the set of references cannot be computed without knowing the best match(es) of the previous unknown(s). Hence, for a series of unknowns x_i, i = 1, ..., M, Query(x_i, F) = B_{f in F} q(x_i, Query(x_{i-1}, F), f).</Paragraph> <Paragraph position="4"> In this case, a pipe flushing phenomenon appears, forcing systolic approaches to suffer computational losses.</Paragraph> <Paragraph position="5"> 3. Consider problems where the &quot;combining&quot; operator, b, is neither commutative nor associative, but otherwise the searching problem remains quite the same. Thus, a parallel algorithm of some sort would be applied to the results of the primitive queries, q, applied to the reference set. 4. Lastly, consider searching problems where we wish to compute a number of different queries about the same unknown x over a set, or possibly different sets. Hence</Paragraph> <Paragraph position="7"> Our approach to combining multiple speech recognizers provides an illustration of this type of problem.</Paragraph> </Section> <Section position="9" start_page="360" end_page="360" type="metho"> <SectionTitle> PARALLEL MACHINE REQUIREMENTS </SectionTitle> <Paragraph position="0"> How do we achieve high execution speed of almost decomposable searching problems on parallel hardware? We seek the most economical way of solving almost decomposable searching problems in parallel. This entails ascertaining the properties of the parallel computation in question and the parallel hardware required to perform it.</Paragraph> </Section> <Section position="10" start_page="360" end_page="362" type="metho"> <SectionTitle> MIMD vs SIMD </SectionTitle> <Paragraph position="0"> Case 1 above is clearly handled by any parallel machine simply by executing matching functions, or primitive queries, in parallel. 
The data structure common to serial implementations is replaced by the parallel machine programmed as an &quot;associative memory processor&quot;. However, each execution of the primitive query, q, may require different instruction streams to operate at each PE. (In our following discussion, q may be DTW matching or HMM evaluation, or both operating concurrently in different PE's.) Clearly, an SIMD parallel computer will not be effective in cases where MIMD parallelism is called for. This feature demands that each PE have significant program memory available to it rather than depending on a single broadcast stream of instructions executed in lock-step fashion. Distributed memory machines, as well as shared memory (omega network-based or bus-based) machines certainly provide this capability.</Paragraph> <Paragraph position="1"> Communication Model For case 2, the communication model of a parallel machine should support the rapid broadcast of data to all PE's in the machine. That is, communicating a single quantity of data should take O(log N) electronic gate delays in a parallel machine, not O(log N) instruction cycles via instruction pipeline communication. DMA speeds are clearly desirable. Conversely, the &quot;reporting&quot; of a single datum from one PE to another should also be performed in a small amount of time. This key architectural principle allows for the rapid communication of data in and out of the machine as well as from one distinguished, but arbitrarily chosen, PE to all others. Hence, communicating a single data item from one PE to all others is achieved in time proportional to the size of the data, not O(log N) instruction time. Internally generated queries, as in case 2, can be reported and broadcast in a constant number of instruction cycles. Pipe flushing problems as in a systolic machine need not concern us. 
(Below we detail the timing of broadcast rates in our experimental hardware system, currently under development, specifically in the case of broadcasting speech sample data.) Another key capability of a parallel machine to execute almost decomposable searching problems efficiently is to provide direct hardware support for quickly computing a range of commutative and associative binary operators B. In our membership problem defined above, the binary operator OR is repeatedly applied to all of the results of the primitive query &quot;equal (x, f)&quot;. In a sequential environment, this operation may require linear time to compute. In a parallel environment it can be computed in log time. On a parallel machine with some hardware support, it may be computed in constant instruction cycle time. The I/O circuit of DADO2, for example, provides a high speed function that we call min-resolve. The min-resolve circuitry calculates in one instruction cycle the minimum value of a set of values distributed one to a PE. Not only is the minimum value reported in a single instruction cycle, but the PE with the minimum value is set to what is called the &quot;winner state,&quot; providing an indication to the entire ensemble of loser PEs, as well as identifying the single winner PE in the computation. The membership problem can thus be solved by applying min-resolve to zeros and ones (distributed throughout the machine after complementing the result of the equality operator) to compute OR. Nearest Neighbor can be computed by applying min-resolve to the distances (one eight bit word at a time).</Paragraph> <Paragraph position="2"> Min-resolve has proven to be a very useful primitive in our studies of parallel algorithms. In certain cases it is not enough. When the binary operator, b, is neither commutative nor associative, more general parallel processing is of course needed to combine the results of applying primitive queries to distributed data. 
In an example below, calculating the mode of a set of distributed data provides a convenient illustration of case 3.</Paragraph> <Section position="1" start_page="361" end_page="362" type="sub_section"> <SectionTitle> Partitioning </SectionTitle> <Paragraph position="0"> Problems characterized by our fourth example can be efficiently supported by a parallel machine that can be logically partitioned into disjoint parallel computers. In this case each partition may execute a distinct task. This is a critical requirement for our proposed multiple recognizer paradigm described in detail below.</Paragraph> <Paragraph position="1"> SPMD mode and Data Parallelism Let us now review how a parallel machine can quickly execute almost decomposable searching problems. Nearest-neighbor will provide the vehicle for illustration.</Paragraph> <Paragraph position="2"> 1. Preprocess N objects of set F by distributing each element in turn to one PE of the machine. Repeat the following: 2. Broadcast the unknown object x to all PEs in time proportional to the size of the object. 3. Apply the query &quot;distance (x, f)&quot; in parallel at each PE. In parallel, each PE sets the min-resolve value to its locally calculated distance.</Paragraph> <Paragraph position="3"> 4. Min-resolve on the distributed set of distance scores in parallel very quickly.</Paragraph> <Paragraph position="4"> Figure 1 depicts this approach in terms of a tree-structured parallel architecture. The overall computation time for processing a query is therefore O(|x|) + O(q) + O(1), the sum of steps 2, 3, and 4. In the sequential case, the computation would be O(|x|) + O(log|F|)O(q), the time to read the unknown, plus the time to apply the primitive query to the elements of the reference set appearing along a path through the log|F|-deep data structure D, and combine the results. 
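The four Broadcast/Match/Resolve steps above can be sketched as a serial simulation, with each list entry standing in for the reference vector held by one PE (the vectors and query are illustrative values, not drawn from the paper):

```python
# Serial simulation of the data-parallel nearest-neighbor scheme:
# each "PE" holds one reference vector; the unknown x is broadcast,
# every PE computes its local distance, and min-resolve reports both
# the winning value and the winner PE.
def nearest_neighbor(x, references):
    # Steps 2-3: every PE computes its squared distance to the broadcast unknown.
    distances = [sum((a - b) ** 2 for a, b in zip(x, f)) for f in references]
    # Step 4: min-resolve returns the winner PE index and its distance.
    best = min(range(len(references)), key=distances.__getitem__)
    return best, distances[best]

refs = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
pe, dist = nearest_neighbor((0.9, 1.2), refs)  # PE 1 wins
```

On real hardware the distance computations run concurrently and the final `min` is the constant-time min-resolve circuit, which is what makes the running time independent of |F|.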
In cases where the data structure D cannot be defined, the serial complexity rises to O(|x|) + O(|F|)O(q), as in the case of pattern recognition tasks. Note, therefore, that the parallel running time is constant in the size of the reference set F. Scaling a recognizer to incorporate a larger reference set F implies that the parallel architecture should scale linearly to maintain constant time performance; that is, doubling the size of the reference set implies doubling the number of PE's at twice the cost in hardware.</Paragraph> <Paragraph position="5"> This mode of parallel operation clearly captures the familiar data parallel operations popularized by the Connection Machine and others. A pure SIMD-based machine, however, is not very well suited to executing the more general class of almost decomposable searching problems especially if the execution of the primitive query q requires alternative code sequences to be executed in each PE. In our earlier work we identified this more general parallel mode of operation as Single Program Multiple Data stream, or SPMD mode. SPMD mode is required for speech recognition tasks.</Paragraph> </Section> </Section> <Section position="11" start_page="362" end_page="364" type="metho"> <SectionTitle> PE Architecture </SectionTitle> <Paragraph position="0"> The running time of the primitive query distance, O(q), is clearly dependent on the particular distance metric computation chosen (and certainly on the size of the unknown). Generally, q can be the computation of the &quot;best&quot; matching word model, selected from a set of distributed (HMM) word models. The complexity is then dependent on the form of the models (the number of states in the longest left-right HMM, for example). In our membership example above, the primitive query &quot;equal&quot; takes constant time. Similarly, in Nearest-Neighbor, distance is assumed also to take constant time. 
In comparing analysis frames to acoustic models in speech recognition, however, the primitive query calculates a distance between two acoustic feature vectors in the case of codebook computations, or the best word model accounting for a sequence of observation vectors, requiring many floating point operations.</Paragraph> <Paragraph position="1"> Thus, our parallel machine's performance depends greatly on how fast a PE can calculate this function, and hence we explicitly show O(q) in our running time complexity.</Paragraph> <Paragraph position="2"> Simple one-bit PE's would struggle to execute these functions in realtime. This is an important consideration. A particular processor implementation may take 10 times as long to execute q as another. A parallel processor consisting of processors of the former type would need ten times as many PE's to compete with another parallel processor consisting of PE's of the latter. The choice of PE processor is driven by the frame broadcast rate. A frame may be broadcast every centisecond. Hence, a PE must calculate the distance function, q, communicate results and update paths in the search space within this time. This argues for very fast floating point hardware at each PE. In our prototype hardware described below, we describe the use of fast DSP chips as the main computational engine of a PE and the resultant computing cycles available for speech processing.</Paragraph> <Paragraph position="3"> The PE with value r is set (represented by a double circle).</Paragraph> </Section> <Section position="12" start_page="364" end_page="367" type="metho"> <SectionTitle> SPEECH RECOGNITION AS AN ALMOST DECOMPOSABLE SEARCH PROBLEM </SectionTitle> <Paragraph position="0"> Now we can clearly state the advantages to speech recognition; dynamic time warp serves as our example. 
During the course of executing the dynamic time warp algorithm, a set of quantized vector codes (F in our example above) distributed to each PE is matched in parallel against the current frame of speech (x_i in our examples above). Prior to the broadcast of the next query (frame x_i+1), some number of the best results (shortest-distance vector codes) are reported and broadcast to all PEs, which then update their current set of best paths, as maintained by the dynamic time warp algorithm. The next frame, x_i+1, is then broadcast and the cycle repeats itself until a final posited word is recognized, or the end of the utterance is reached.</Paragraph> <Paragraph position="1"> Note that the sequence of broadcast speech frames is continually punctuated by synchronous global communication in this approach. The time between broadcast frames is available for parallel processing and inter-PE communication. The unidirectional systolic pipelining approach is not appropriate in this context, as noted above, since this task requires bidirectional global communication that would otherwise flush the systolic pipe. Since a single &quot;winning&quot; PE must send its best matching phone to all other PE's, asynchronous models of inter-PE communication (message passing, for example) are not appropriate either: if a processor completes its task early and then communicates its result, it must sit idly waiting for the next frame of speech. There are no other speech processing tasks for it to perform.
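The synchronous per-frame cycle just described can be simulated serially as a sketch (the helper names are hypothetical; the inner comprehension stands in for the concurrent PE step, and every PE advances in lockstep with the frame broadcasts):

```python
# Serial simulation of the per-frame cycle: broadcast a frame, match it
# against the distributed codes, min-resolve the winner, broadcast the
# winner back so every PE can update its paths, then accept the next frame.

def recognize(frames, codes, distance, update_paths):
    paths = [[] for _ in codes]                      # per-PE path state
    for frame in frames:                             # one broadcast per frame
        # Each PE scores its own code against the broadcast frame (in parallel).
        scores = [(distance(frame, c), pe) for pe, c in enumerate(codes)]
        best_score, winner = min(scores)             # global min-resolve
        for pe in range(len(codes)):                 # winner broadcast to all PEs
            update_paths(paths[pe], winner, best_score)
    return paths

# Toy example: three one-dimensional codes, three frames.
paths = recognize(frames=[0.1, 1.9, 1.1],
                  codes=[0.0, 1.0, 2.0],
                  distance=lambda f, c: abs(f - c),
                  update_paths=lambda p, w, s: p.append((w, s)))
```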
Thus, it is best that all PE's complete their tasks at roughly the same time.</Paragraph> <Paragraph position="2"> Indeed, these tasks therefore require only simpler communication, since we do not need message-protocol processing.</Paragraph> <Section position="1" start_page="364" end_page="366" type="sub_section"> <SectionTitle> Voting </SectionTitle> <Paragraph position="0"> To execute a number of concurrent (and independent) tasks requires partitioning whole tasks both in software and hardware. Partitioning a parallel machine is generally straightforward, unless the computational model of the machine imposes a single locus of control for all PE's (say, for example, a single host processor broadcasting instructions to all subservient PE's). Each partition exercising its own control can thus execute its own recognizer according to the scheme outlined above. The set of reference acoustic models in each partition can be quantized vector codes of various phones (diphones, triphones, etc.), or word-based templates, or hidden Markov models of words or word sequences. The particular algorithm executed, whether dynamic time warp or Viterbi search (similar in structure, since both rely on dynamic programming), would operate completely independently within a single partition. In the simplest case, each partition may be a single PE executing a complete recognizer (serially, of course, but in parallel with the others). All partitions, however, would ultimately synchronize to vote on the utterance for final consensus recognition.</Paragraph> <Paragraph position="1"> The voting of a number of independent recognizers is a straightforward hierarchically parallel computation.
Indeed, we may cast our voting-based recognizer, composed of a number of dynamic time warp recognizers, as an almost decomposable query as follows:</Paragraph> <Paragraph position="3"> M is the number of component recognizers, x_i is a particular analysis frame extracted by the i-th recognizer from the speech data, F_i is the i-th set of distributed acoustic models, and distance_i is the particular distance function used for the i-th component recognizer.</Paragraph> <Paragraph position="4"> Figure 3 depicts the organization of the multiple recognition paradigm, while figure 4 illustrates this organization as an Almost Decomposable Search Problem. Figure 4a depicts the case where each recognizer is executed wholly within one PE, while figure 4b depicts the case where a single recognizer may utilize a number of adjacent PE's.</Paragraph> <Paragraph position="5"> Here we have chosen to use &quot;mode&quot; as our means of voting for pedagogical reasons. Choosing the most frequently posited recognized phone is only one method. This computation, unlike min-resolve, requires counting members of a distributed bag (or multiset) in parallel. High-speed communication within the machine, together with min-resolve to choose the highest-frequency datum, provides a fast way to calculate the mode. For example, enumerating and broadcasting each recognized phone, followed by a frequency-counting step, results in a set of distributed frequencies of occurrence of individual phones. Min-resolve can then select the most frequently occurring phone very quickly. This computation takes O(M) time in this scheme, where M is the number of component recognizers.</Paragraph> <Paragraph position="6"> We may choose to use a sorting network-based parallel machine to reduce this computation time.
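The mode computation itself is small; a serial sketch of the enumerate, count, and resolve steps (hypothetical names; the max over counts stands in for resolving on the highest-frequency datum) is:

```python
# Sketch of consensus by mode: each of M component recognizers posits a
# phone; frequencies are tallied, and the most frequently posited phone
# is taken as the consensus recognition.
from collections import Counter

def vote_by_mode(posited):
    """posited: one recognized phone per component recognizer (length M)."""
    counts = Counter(posited)                 # frequency-counting step, O(M)
    phone, _ = max(counts.items(), key=lambda kv: kv[1])   # resolve step
    return phone

# Five component recognizers posit phones; three agree on "ae".
consensus = vote_by_mode(["ae", "eh", "ae", "ae", "ih"])
```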
Notice, however, that the calculation of the mode from a relatively small number of component recognizers is supported by the fast global communication and min-resolve operations of our idealized parallel machine, and is dominated by the time to calculate distances of vector codes. It may be overkill to require a general interconnection topology for a potentially small part of the overall computation.</Paragraph> <Paragraph position="7"> Clearly, we may choose a number of other voting schemes, including majority voting or 2-of-N voting, for example. The precise voting scheme, as noted earlier, is one of the issues we are studying experimentally.</Paragraph> </Section> <Section position="2" start_page="366" end_page="367" type="sub_section"> <SectionTitle> Load Balancing </SectionTitle> <Paragraph position="0"> Another critical problem to study is load balancing of the parallel activities and allocating PE's to the component recognizers. Although a partitionable parallel machine may successfully implement our multiple recognition paradigm, care must be taken to ensure that the total utilization of all PEs in the system is as high as possible. Each recognizer will clearly require a different amount of computing time, dependent on the particular distance calculation and the number of models stored at each PE. Note that in our earlier discussion we presumed a single reference (template or model) is stored at a PE. Varying the number of models at a PE varies the amount of computing time required at that PE. The single synchronization point of voting undoubtedly must be carefully orchestrated so that no single recognizer, or partition of PE's, sits idly waiting until all others catch up for the final voting and subsequent broadcast of the next speech frame.
Thus, it is important to study and automate load balancing techniques to match the computation time of each recognizer as closely as possible.</Paragraph> <Paragraph position="1"> This may require automated systems to allocate different numbers of PE's to each partition. PE's in different partitions undoubtedly must store different numbers of acoustic models to balance the total computation time of each partition. Figure 4b depicts the case where each recognizer requires a different number of PE's. No data can be provided at this time; we must actually build a number of recognizers and perform detailed timing analyses to begin solving this problem.</Paragraph> </Section> </Section> <Section position="13" start_page="367" end_page="368" type="metho"> <SectionTitle> IMPROVING DYNAMIC PROGRAMMING </SectionTitle> <Paragraph position="0"> There is another possible approach to the problem of different recognizers taking differing amounts of time, which can be applied orthogonally to the above-mentioned technique of non-uniformly partitioning the PEs: developing new algorithmic techniques to speed up the slower recognition algorithms. This also has obvious benefits for sequential as well as parallel computation of speech recognition tasks.</Paragraph> <Paragraph position="1"> In particular, the time warp dynamic programming problem greatly resembles some problems of sequence alignment with convex gap cost functions \[Sankoff 83\], for which Zvi Galil and his students have found speed-ups of as much as an order of magnitude \[Galil 89, Eppstein 89a, Eppstein 89b, Eppstein \].</Paragraph> <Paragraph position="2"> More specifically, the sequence alignment problems, as with the time warp problems, can be represented as filling out entries in a dynamic programming matrix of quadratic size.
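For reference, the simple quadratic-time form of such a matrix fill, in which each entry depends only on its three adjacent neighbors, can be sketched as follows (a generic dynamic time warp illustration with an assumed per-frame distance function, not the authors' implementation):

```python
# Quadratic-time dynamic time warp: filling the (n+1)-by-(m+1) matrix takes
# O(nm) time because each entry depends only on three adjacent entries.

def dtw(a, b, distance):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = distance(a[i - 1], b[j - 1])
            # Entry (i, j) depends only on its three neighbors.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```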
However, for the harder sequence alignment problems, each entry of the matrix depends on all entries in the column above it or in the row to the left of it; thus a straightforward computation of each entry would lead to a cubic time algorithm. But by taking advantage of the convexity or concavity inherent in typical gap cost functions, and by using simple data structures such as stacks and queues, the computation of all entries in a given row or column can be performed in linear or close to linear time; thus the time for the entire algorithm is quadratic or close to quadratic.</Paragraph> <Paragraph position="3"> We can achieve even further speed-ups if we can determine that many entries of the dynamic programming matrix cannot contribute to an optimal solution to the problem. With the help of some further algorithmic techniques, including divide and conquer as well as the recently developed monotone matrix searching technique \[Aggarwal 87\], the time can be reduced to almost linear in the number of remaining sparse matrix entries \[Eppstein 89a, Eppstein 89b\]. The details of the computation become more complicated, but this is made up for by the reduction in the size of the problem.</Paragraph> <Paragraph position="4"> It seems likely that similar methods will provide practical speed-ups to the time warp algorithm, bringing its complexity closer to that of the other recognition algorithms. The sparse sequence alignment technique mentioned above is especially intriguing, because the complexity introduced in dealing with sparsity resembles in certain respects that of the time warp problem; further, the fact that the sparse alignment problem can be solved efficiently gives hope that the same techniques can be used to solve time warping. It is also pertinent to ask whether the time warp problem has similar sparseness properties, and whether we can take advantage of these in its solution.</Paragraph> </Section> </Paper>