<?xml version="1.0" standalone="yes"?> <Paper uid="C88-1082"> <Title>LINGUISTIC PROCESSING USING A DEPENDENCY STRUCTURE GRAMMAR FOR SPEECH RECOGNITION AND UNDERSTANDING</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> LINGUISTIC PROCESSING USING A DEPENDENCY STRUCTURE GRAMMAR FOR SPEECH RECOGNITION AND UNDERSTANDING Sho-ichi MATSUNAGA NTT Human Interface Laboratories </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper proposes an efficient linguistic processing strategy for speech recognition and understanding using a dependency structure grammar.</Paragraph> <Paragraph position="1"> The strategy includes parsing and phrase prediction algorithms. After speech processing and phrase recognition based on phoneme recognition, the parser extracts the sentence with the best likelihood, taking account of the phonetic likelihood of phrase candidates and the linguistic likelihood of the semantic inter-phrase dependency relationships. A fast parsing algorithm using breadth-first search is also proposed. The predictor pre-selects the phrase candidates using transition rules combined with a dependency structure to reduce the amount of phonetic processing. The proposed linguistic processor has been tested through speech recognition experiments.</Paragraph> <Paragraph position="2"> The experimental results show that it greatly increases the accuracy of speech recognition, and that the breadth-first parsing algorithm and the predictor increase processing speed.</Paragraph> </Section> <Section position="3" start_page="0" end_page="403" type="metho"> <SectionTitle> 1. 
Introduction </SectionTitle> <Paragraph position="0"> In conventional continuous speech recognition and understanding systems [1-4], linguistic rules for sentences composed of phrase sequences are usually expressed by a phrase structure grammar such as a transition network or context-free grammar. In such methods, however, phoneme recognition errors and rejections result in incorrect transition states because of the strong syntactic constraints.</Paragraph> <Paragraph position="1"> These erroneous transitions cause subsequent candidates to be chosen incorrectly or the processing system to halt. Therefore, these errors and rejections can be fatal to speech understanding.</Paragraph> <Paragraph position="2"> Furthermore, a complete set of these grammatical rules for speech understanding is very difficult to provide.</Paragraph> <Paragraph position="3"> To address these problems, this paper proposes a new linguistic processor based on a dependency structure grammar, which integrates a bottom-up sentence parser and a top-down phrase predictor. This grammar is more semantic and less syntactic than phrase structure grammar, and, therefore, syntactic positional constraints within a sentence rarely arise with this parser. This effectively prevents extreme degradation in speech recognition caused by errors and rejections in phoneme recognition and greatly increases the accuracy of speech processing. This grammar has only two syntactic rules, so this parser is free of the many cumbersome grammatical rules that are indispensable to other grammars. This grammar particularly suits phrase-order-free languages such as Japanese.</Paragraph> <Paragraph position="4"> For the parser of this grammar, a depth-first parsing algorithm with backtracking, which guarantees the optimal solution, was devised [5]. 
However, parsing long sentences composed of many phrases with this algorithm can be time-consuming because of combinatorial explosion: the amount of computation grows exponentially with the number of phrases. Therefore, a fast parsing algorithm using breadth-first search and beam search was developed. This algorithm is based on fundamental algorithms [6,7], which only take account of the dependency relationships between modifier and modificant phrases, and it additionally handles higher-level linguistic or semantic processing such as case structure. The processing ability of this breadth-first algorithm is equivalent to that of the depth-first algorithm.</Paragraph> <Paragraph position="5"> To recognize speech effectively, the amount of phonetic processing must be reduced through top-down prediction of hypotheses. However, top-down control using the principal dependency structure is impossible. To solve this problem, a novel phrase predictor was devised. This predictor pre-selects hypotheses for phoneme recognition using prediction rules, and thereby reduces the amount of phonetic processing. Prediction rules are created by integrating connection rules and phrase dependency structures.</Paragraph> <Paragraph position="6"> The effectiveness of this linguistic processing was ascertained through speech recognition experiments.</Paragraph> <Paragraph position="7"> 2. Linguistic processor 2.1 Dependency structure grammar This grammar is based on semantic dependency relationships between phrases. The syntactic rules satisfy the following two constraints. First, any phrase, except the last phrase of a sentence, can modify only one later phrase. Each modification, called a dependency relationship or dependency structure, can be represented by one arc. Second, modification arcs between phrases must not cross. These rules are illustrated in Fig. 1. 
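The two constraints can be checked mechanically. The following is a minimal Python sketch, not part of the original system: the function name is_valid_dependency_structure and the encoding of a structure as a list heads (where heads[i] is the index of the phrase that phrase i modifies, and the last phrase has None) are assumptions made for illustration.

```python
from itertools import combinations


def is_valid_dependency_structure(heads):
    """Check the two syntactic rules of the dependency structure grammar.

    heads[i] is the index of the phrase modified by phrase i (hypothetical
    encoding); the last phrase of the sentence modifies nothing (None).
    """
    n = len(heads)
    if n == 0 or heads[-1] is not None:
        return False
    arcs = []
    for i in range(n - 1):
        h = heads[i]
        # Rule 1: every non-final phrase modifies exactly one later phrase.
        if h is None or not (h > i and n > h):
            return False
        arcs.append((i, h))
    # Rule 2: modification arcs must not cross.
    # arcs are listed in increasing modifier order, so c > a in every pair.
    for (a, b), (c, d) in combinations(arcs, 2):
        if b > c and d > b:
            return False
    return True
```

With phrases A, B, C and D numbered 0 to 3, is_valid_dependency_structure([1, 3, 3, None]) accepts the structure in which A modifies B and both B and C modify D, whereas [1, 0, 3, None] is rejected by the first constraint and [2, 3, 3, None] by the second. 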
Of the two unacceptable sentences, one is unacceptable because a phrase modifies an earlier phrase, and</Paragraph> <Paragraph position="8"> the other is unacceptable because arcs cross in its dependency structure. In the figure, A, B, C and D are sentence phrases.</Paragraph> <Paragraph position="9"> 2.2 Parser After phonetic phrase recognition, the recognition results are represented in the form of a phonetic score matrix, as shown in Fig. 2. When analyzing dependency relationships, the parser extracts the most likely sentence in this matrix by taking into account the phonetic likelihood of phrase candidates and the linguistic likelihood of semantic inter-phrase dependency relationships. The parser also obtains the dependency structure that corresponds to the semantic structure of the extracted sentence.</Paragraph> <Paragraph position="10"> 2.2.1 Objective function This parsing is equivalent to maximizing the following objective function under the constraints of dependency structure grammar. For simplicity, the following linguistic formulation is described for speech uttered phrase by phrase. The process for sentence speech is described in section 4.</Paragraph> <Paragraph position="11"> $$T = \max_{p}\left[\sum_{j=1}^{N} c(x_{j,p}) + \max_{Y}\sum_{j=1}^{N} \mathrm{dep}(w_{1,j-1}, x_{j,p} \mid Y_{1,j,p})\right] \qquad (1)$$ where $1 \le j \le N$, $1 \le p \le M$, $N$ is the number of input phrases, $M$ is the maximum number of phonetic recognition candidates for each phrase, $x_{j,p}$ is the candidate of the j-th input phrase with the p-th best phonetic likelihood, and $c(x_{j,p})$ is its phonetic likelihood (positive value). Also, $X_{i,j,p}$ is a phrase sequence with one phrase candidate for each of the i-th to j-th input phrases and whose last phrase is $x_{j,p}$. $Y_{i,j,p}$ is one of the dependency structures of $X_{i,j,p}$, and $w_{i,j-1}$ is the set of phrases that modify $x_{j,p}$ in the sequence $X_{i,j,p}$. Here, $\mathrm{dep}(w, x \mid Y)$ is the linguistic likelihood (negative value) of the dependency relationships between $w$ and $x$ taking $Y$ into account. Namely, the first term on the right-hand side of Eq. 
(1) is the sum of the phonetic likelihoods of the phrase sequence making up the hypothesized sentence, and the second term is the sum of the linguistic likelihoods. Maximizing Eq. (1) gives the sentence and its dependency structure as the speech recognition and understanding results.</Paragraph> <Paragraph position="12"> Because dependency structure grammar is compatible with case grammar [8], the linguistic semantic likelihood (dep) of the dependency structure is easily provided using case structure. The following are examples of items for evaluating dependency relationships: disagreement between the semantic primitives of the modifier and those requested by the modificant, the lack of an obligatory case phrase requested by the modificant, and the existence of different phrases with the same case modifying the same phrase. The likelihood values for these items are given heuristically.</Paragraph> <Paragraph position="13"> To solve Eq. (1), a fast parsing algorithm using breadth-first search and beam search was developed. This algorithm deals with higher-level linguistic or semantic processing such as case structure. Although this algorithm offers sub-optimal solutions, it is practical because it requires less processing than depth-first search.</Paragraph> <Paragraph position="14"> 2.2.2 Breadth-first parsing algorithm The breadth-first algorithm is formulated as</Paragraph> <Paragraph position="16"> follows.</Paragraph> <Paragraph position="17"> First, $\mathrm{dep}(w, x \mid Y)$ can obviously be divided into two terms.</Paragraph> <Paragraph position="19"> where dep1 is the likelihood associated with the dependency relationships between only the modifier and modificant phrases, and dep2 is the likelihood associated with $Y_{1,j,p}$. An example of dependency relationships is shown in Fig. 3.</Paragraph> <Paragraph position="20"> Eqs. 
(1) and (2) give the value $S(1, x_{j,p})$ of the objective function for a phrase sequence that runs from the top phrase of the sentence to $x_{j,p}$ as:</Paragraph> <Paragraph position="22"> On the other hand, the value of a phrase sequence not including the top phrase ($i \neq 1$) is defined as:</Paragraph> <Paragraph position="24"> [Fig. 3: example of dependency relationships, in which $\mathrm{dep}(w_{A,F}, G \mid Y_{A,G}) = \mathrm{dep2}(Y_{A,G}, G) + \mathrm{dep1}(F, G) + \mathrm{dep1}(E, G) + \mathrm{dep1}(A, G)$.] $\mathrm{dep2}(Y_{i,j,p}, x_{j,p})$ is not evaluated in Eq. (4). Using the notation S and D, the recurrence relation among the objective functions is derived. This is shown in Fig. 4. The recurrence relation is transformed into the following equations using beam search.</Paragraph> <Paragraph position="26"> where $i \le k \le j-1$, $1 \le q \le M$, and $1 \le r, r_1, r_2 \le L$. Here, $r$, $r_1$ and $r_2$ indicate the beam rank, $L$ is the maximum number of beams, and $S(1, x_{j,p}, r)$ and $D(i, x_{j,p}, r)$ are the r-th best values for the element whose phrase sequence is $X_{i,j,p}$ and whose dependency structure is $Y_{i,j,p}$.</Paragraph> <Paragraph position="27"> Here, rth-max[ ] is a function for deriving the r-th best value. When Eq. (5') or (6') is calculated, $Y_{i,j,p}$ is stored for use in the later stage of evaluating dep2.</Paragraph> <Paragraph position="28"> Initial values are given as follows.</Paragraph> <Paragraph position="30"> After calculating the recurrence relation, the value of the objective function is obtained,</Paragraph> <Paragraph position="32"> where $1 \le p \le M$. The best sentence and its dependency structure are given through $Y_{1,N,p}$, where $p$ maximizes Eq. (9). The parsing table is shown in Fig. 5 and the parsing algorithm is shown in Table 1. In Fig. 5, the first row corresponds to S, and the others correspond to D. The phrase sequence for the first to N-th phrases corresponds to the top right-most cell. Each cell is composed of ML sub-cells. Arrows show the sequence of calculating the recurrence relation. 
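The rth-max[ ] operation and the beam of width L can be illustrated with a small Python sketch. The helper names rth_max and keep_beam are hypothetical, and the scores are made-up values; the sketch only shows the selection step, not Eqs. (5') and (6') themselves.

```python
import heapq


def rth_max(values, r):
    """Return the r-th best (largest) value of `values`; r = 1 gives the maximum."""
    best = heapq.nlargest(r, values)
    if r > len(best) or 1 > r:
        raise ValueError("r must be between 1 and len(values)")
    return best[r - 1]


def keep_beam(values, beam_width):
    """Keep the beam_width best values, best first (one beam per sub-cell)."""
    return heapq.nlargest(beam_width, values)


scores = [-3.0, 5.0, 1.5, 4.0]  # made-up candidate scores for illustration
second_best = rth_max(scores, 2)
beam = keep_beam(scores, 3)
```

Under these assumptions, candidates ranked below the beam width are simply dropped, which is why the beam search trades guaranteed optimality for a bounded amount of work per cell. 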
The order of the amount of processing for this algorithm is $O(N^3 M^2 L^2)$.</Paragraph> <Paragraph position="33"> Comparing the theoretical amount of processing for these two parsing algorithms, the breadth-first parsing algorithm clearly requires much less processing than the depth-first parsing algorithm.</Paragraph> <Paragraph position="34"> The amount of processing for each parsing algorithm is shown in Fig. 6.</Paragraph> <Paragraph position="35"> 2.3 Predictor To pre-select the phrase hypotheses for speech recognition, a predictor was devised [9], using prediction rules created by integrating connection rules and dependency structures of phrases. These rules are described with rewriting rules: $(X_{i,j}) \to (X_{i,k})(X_{k+1,j})$, where $X_{i,j}$ is the phrase sequence for the i-th to j-th phrases. $(X_{i,j})$ is a sequence with a closed dependency structure, in which the tail phrase $x_j$ has dependency relationships with phrases outside $X_{i,j}$, and the other phrases in $X_{i,j}$ have dependency relationships with phrases within $X_{i,j}$. $(X_{i,j})$ is divided into two phrase sequences with closed dependency structures by having $x_k$ modify $x_j$ and by following $X_{i,k}$ with $X_{k+1,j}$. A single phrase $x_i$ is also</Paragraph> <Paragraph position="37"> regarded as a phrase sequence with a closed dependency structure. For a sequence from the i-th phrase to the j-th phrase in which the i-th phrase modifies the j-th phrase, the rule is written as $(X_{i,j}) \to (x_i)(X_{i+1,j})$. The hypotheses are predicted as follows.</Paragraph> <Paragraph position="38"> <1> $x_i$ is detected as a reliable phrase recognition result. 
If there are no reliable phrase candidates in the i-th phrase recognition results, the following procedure is not carried out.</Paragraph> <Paragraph position="39"> <2> The rules whose left-hand side is $(X_{i+1,j})$ are scanned, such as $(X_{i+1,j}) \to (X_{i+1,k})(X_{k+1,j})$. After the left-most derivation is repeated to detect hypotheses for the (i+1)-th phrase speech recognition, $x_{i+1}$ is detected in the following form.</Paragraph> <Paragraph position="40"> $(X_{i+1,j}) \to (x_{i+1})(X_{i+2,h}) \cdots (X_{k+1,j})$ Generally, there is more than one $(X_{i+1,j})$, so $x_{i+1}$ is a set of phrases.</Paragraph> <Paragraph position="41"> <3> Phrase recognition is carried out for the (i+1)-th phrase utterance, whose hypotheses are the elements of the set $x_{i+1}$.</Paragraph> <Paragraph position="42"> <4> If a reliable phrase recognition result is detected in operation <3>, the rules that derived the elements of $x_{i+1}$ are scanned again and hypotheses for the next utterance are derived using the same procedure as in <2>.</Paragraph> <Paragraph position="43"> <5> These operations, namely hypothesis derivation and its phonetic verification, are carried out until $x_j$ is detected.</Paragraph> <Paragraph position="44"> <6> The detected phrase sequence $X_{i,j}$ and its dependency structure $Y_{i,j}$ are passed to the parser. During these operations, if the phrase recognition results are unreliable in operation <3>, the detection process for $X_{i,j}$ is halted and phrase recognition is carried out for all hypotheses. Although Japanese is a phrase-order-free language, there are some parts of a sentence with relatively fixed phrase order. These rules are applied to these parts. The number of hypotheses and the amount of acoustic processing can thus be reduced while maintaining the above characteristics of the dependency structure grammar. By linking the predictor to the parser, parsing can be accomplished using the dependency structures detected in operation <6> of the prediction procedure. 
This linkage method greatly increases parsing speed.</Paragraph> </Section> <Section position="4" start_page="403" end_page="403" type="metho"> <SectionTitle> 3. Speech recognition experiments 3.1 Speech recognition system </SectionTitle> <Paragraph position="0"> The speech recognition and understanding system is shown in Fig. 7. The system is composed of acoustic and linguistic processing stages. The acoustic processing stage consists of a feature extraction part and a phoneme recognition part [10,11]. The linguistic processing stage consists of a phrase recognition part [11], a parsing part (a dependency relationship analysis part), and a phrase prediction part. The linguistic processing stage uses a word dictionary, word connection rules for intra-phrase syntax, dependency relationship rules and phrase prediction rules. The word dictionary is composed of pronunciation expressions, parts of speech and case structures. Dependency relationship rules assign negative evaluation values to dependency relationships that violate case structure constraints.</Paragraph> </Section> <Section position="5" start_page="403" end_page="406" type="metho"> <SectionTitle> 3.2 Speech recognition process </SectionTitle> <Paragraph position="0"> For separately uttered phrases, acoustic feature parameters are extracted and bottom-up phoneme recognition is carried out. The phrase hypotheses for top-down phoneme recognition are pre-selected by the predictor. Table 1. Parsing algorithm: {1} Loop over the end phrase of the partial sequence: DO {2} to {5} for j = 1, 2, ..., N. {2} Loop over the candidates: DO {3} to {5} for p = 1, 2, ..., $M_j$. {3} Setting the initial value: SET $S(1, x_{1,p}, 1)$ or $D(j, x_{j,p}, 1)$ (Eqs. 
(7), (8)). If j = 1, go back to {2}.</Paragraph> <Paragraph position="1"> {4} Loop over the beginning phrase of the partial sequence: DO {5} for i = j-1, j-2, ..., 1. {5} Calculation of the recurrence relation. (Loop over the end phrase of the former sequence) {5-1} DO {5-2} to {5-4} for k = i, i+1, ..., j-1. {5-2} DO {5-3} to {5-4} for q = 1, 2, ..., $M_k$. (Loop over the beam width) {5-3} DO {5-4} for $r_1$ = 1, 2, ..., L. {5-4} DO for $r_2$ = 1, 2, ..., L: * Evaluation of $S(1, x_{j,p}, r)$ or $D(i, x_{j,p}, r)$ taking account of $Y_{1,j,p}$ or $Y_{i,j,p}$ (Eqs. (5'), (6')).</Paragraph> <Paragraph position="2"> * Storage of $Y_{i,j,p}$. {6} Acquisition of the parsing results: * Detection of the value p maximizing Eq. (9). * Acquisition of the phrase sequence and its dependency structure. The predictor's pre-selection also uses the bottom-up phoneme recognition results [12]. Next, top-down phoneme verification is carried out and phrase recognition results are generated. Phrase recognition results are represented in the form of a score matrix, with phonetic recognition scores averaged for each hypothesized phrase. When the end of a sentence is detected, the parser extracts the phrases with the best sentence likelihood by scanning this matrix, and determines the dependency structure of the extracted phrases.</Paragraph> <Paragraph position="3"> 3.3 Performance The effectiveness of the proposed linguistic processor was tested in speech recognition experiments. The experiments were carried out using 20 sentences containing 320 phrases uttered by 2 male speakers. The results are shown in Table 2. [Fig. 7: speech recognition and understanding system.] 
The proposed parser using the depth-first parsing algorithm increased the phrase recognition rate by approximately 20% (from 57% without the parser to 77% with the parser). This result shows the effectiveness of a parser using a dependency structure grammar. The processing time with the breadth-first algorithm was reduced to approximately 1% of that with the depth-first algorithm for sentence parsing, while keeping the same level of speech recognition rate as with the depth-first algorithm.</Paragraph> <Paragraph position="4"> [Table 2: phrase recognition rate (%) and parsing time; surviving entries: 77, 1.; 78, 0.89.] This result shows the great effectiveness of the breadth-first parsing algorithm. This result is shown in Fig. 8 for each speaker when M is 3 and L is 8.</Paragraph> <Paragraph position="5"> Next, using 26 rules, the prediction was carried out for 33% of the total input phrases. It reduced acoustic processing time to 60% at these parts in a sentence, and it increased speech recognition speed. Finally, linking the predictor to the parser reduced parsing time to less than 10% of the time for the depth-first parser, and to approximately 90% of the time for the breadth-first parser. This shows the usefulness of the linkage.</Paragraph> <Paragraph position="6"> 4. Breadth-first parsing algorithm for sentence speech recognition The breadth-first parsing algorithm for sentence speech, or connected phrase speech, was devised [13] by the same procedure as in section 2.2. The algorithm builds on basic algorithms [14,15] that extend phrase-wise processing to sentence speech, and its speech recognition and understanding accuracy is much higher than that of the basic algorithms. 
For sentence speech, phrase recognition results after phonetic processing are represented in the form of a score lattice, with phonetic recognition scores averaged. The parser extracts the best sentence composed of a phrase sequence by scanning this lattice. The processing order is $O(N^5 M^2 L^2)$, which is a practical amount of computation, where N is the number of detected phrase boundaries in the uttered sentence, M is the maximum number of phonetic recognition candidates for each phrase segment from one boundary to the next boundary, and L is the maximum number of beams.</Paragraph> <Paragraph position="7"> The effectiveness of this parser was tested through sentence speech recognition with one speaker uttering 10 sentences containing a total of 67 phrases. This parser increased phrase recognition performance in the sentences by approximately 49% (from 27% without the parser to 76% with the parser).</Paragraph> </Section> </Paper>