<?xml version="1.0" standalone="yes"?> <Paper uid="C92-2120"> <Title>CONSTRUCTION OF CORPUS-BASED SYNTACTIC RULES FOR ACCURATE SPEECH RECOGNITION JUNKO HOSAKA TOSHIYUKI TAKEZAWA ATR Interpreting Telephony Research Laboratories</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> CONSTRUCTION OF CORPUS-BASED SYNTACTIC RULES FOR ACCURATE SPEECH RECOGNITION JUNKO HOSAKA TOSHIYUKI TAKEZAWA ATR Interpreting Telephony Research Laboratories </SectionTitle> <Paragraph position="0"> hosaka@atr-la.atr.co.jp takezawa@atr-la.atr.co.jp</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper describes the syntactic rules which are applied in the Japanese speech recognition module of a speech-to-speech translation system. Japanese is considered to be a free word/phrase order language.</Paragraph> <Paragraph position="1"> Since syntactic rules are applied as constraints to reduce the search space in speech recognition, applying rules which take into account all possible phrase orders can have almost the same effect as using no constraints. Instead, we take into consideration the recognition weaknesses of certain syntactic categories and treat them precisely, so that a minimal number of rules can work most effectively. In this paper we first examine which syntactic categories are easily misrecognized. Second, we consult our dialogue corpus, in order to provide the rules with great generality. Based on both studies, we refine the rules.
Finally, we verify the validity of the refinement through speech recognition experiments.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We are developing the Spoken Language TRANSlation system (SL-TRANS)[1], in which both speech recognition processing and natural language processing are integrated. Currently we are studying automatic speech translation from Japanese into English in the domain of dialogues with the reception service of an international conference office. In this framework we are constructing syntactic rules for recognition of Japanese speech.</Paragraph> <Paragraph position="1"> In speech recognition, the most significant concern is raising the recognition accuracy. For that purpose, applying linguistic information turns out to be promising. Various approaches have been taken, such as using stochastic models[2], syntactic rules[3], semantic information[4] and discourse plans[5]. Among stochastic models, the bigram and trigram succeeded in achieving a high recognition accuracy in languages that have a strong tendency toward a standard word order, such as English. Japanese, in contrast, is a free word order language[6]. For such a language, semantic information is more suitable as a constraint. However, building semantic constraints for a large vocabulary needs a tremendous amount of data. Currently, our data consist of dialogues between the conference registration office and prospective conference participants with approximately 199,000 words in telephone conversations and approximately 72,000 words in keyboard conversations. But our data are still not sufficient to build appropriate semantic constraints for sentences with 700 distinct words. Processing a discourse plan requires excessive calculation and the study of discourse itself must be further developed to be applicable to speech recognition.
On the other hand, syntax has been studied in more detail and makes increasing the vocabulary easier.</Paragraph> <Paragraph position="2"> As we are working on spoken language, we try to reflect real language usage. For this purpose, a stochastic approach beyond trigrams, namely stochastic sentence parsing[7], seems most promising. Ideally, syntactic rules should be generated automatically from a large dialogue corpus and probabilities should also be automatically assigned to each node. But to do so, we need underlying rules. Moreover, coping with phoneme perplexity, which is crucial to speech recognition, with rules created from a dialogue corpus, requires additional research[8].</Paragraph> <Paragraph position="3"> In this paper we propose taking into account the weaknesses of the speech recognition system in the earliest stage, namely when we construct underlying syntactic rules. First, we examined the speech recognition results to determine which syntactic categories tend to be recognized erroneously. Second, we utilized our dialogue corpus[9] to support the refinement of rules concerning those categories. As examples, we discuss formal nouns 1 and conjunctive postpositions 2.</Paragraph> <Paragraph position="4"> Finally, we carried out a speech recognition experiment with the refined rules to verify the validity of our approach.</Paragraph> <Paragraph position="5"> In the Japanese speech recognition module of our experimental system, the combination of generalized LR parsing and Hidden Markov Model (HMM) is realized as HMM-LR [10]. The system predicts phonemes by using an LR parsing table and drives HMM phoneme verifiers to detect/verify them without any intervening structure, such as a phoneme lattice.</Paragraph> <Paragraph position="6"> The speech recognition unit is a Japanese bunsetsu, which roughly corresponds to a phrase and is the next largest unit after the word. The ending of the bunsetsu (phrase) is usually marked by a breath point.
This justifies its treatment as a distinct unit. A Japanese phrase consists of one independent word (e.g. noun, adverb, verb) and zero, one or more than one dependent words (e.g. postposition, auxiliary verb). The number of words in a phrase ranges from 1 to 14, and the mean number is about 3, according to our dialogue corpus.</Paragraph> <Paragraph position="7"> We will clarify the weaknesses of HMM-LR speech recognition both in phrases and in sentences.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Phrase Recognition Errors </SectionTitle> <Paragraph position="0"> We examined which syntactic categories tend to be erroneously recognized, when using HMM-LR phrase speech recognition. For this purpose, we applied syntactic rules containing no constraints on word</Paragraph> <Paragraph position="2"> each branch, the local beam width, 10.</Paragraph> <Paragraph position="3"> In the examples, the symbols >, -, ng and N have special meaning: * A correctly recognized phrase is marked with >. * A word boundary is marked with -.</Paragraph> <Paragraph position="4"> * A nasalized /g/ is transcribed ng.</Paragraph> <Paragraph position="5"> * A syllabic nasal is transcribed N.</Paragraph> <Paragraph position="6"> In (1), after recognizing the first word, the system selected subsequent words solely to produce a phoneme string similar to the original utterance. (2) is an example of phrase recognition which failed. In this example tou was erroneously recognized as to. Subsequently, no further correct words were selected. Examples (1) and (2) both show that HMM-LR tends to select words consisting of extremely few phonemes when it fails in word recognition. To avoid this problem, precise rules should be written for sequences of words with small numbers of phonemes. In Japanese, postpositions (e.g. ga, o, ni), wh-pronouns (e.g. itsu, nani, dare)[11], numerals (e.g.
ichi, ni, san) and certain nouns (e.g. kata, mono) particularly fit this description.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Sentence Recognition Errors </SectionTitle> <Paragraph position="0"> To examine the error tendency of sentence speech recognition we applied a two-step method[12]. First, we applied phrase rules to the HMM-LR speech recognition 5. Second, we applied phrase-based sentence rules to the phrase candidates as a post-filter, in order to obtain sentence candidates, while filtering out unacceptable candidates. We experimented with the 353 phrases making up 137 sentences. The recognition rate for the top candidates was 68.3 % by exact string matching, and for the top 5 candidates 95.5 %.</Paragraph> <Paragraph position="1"> Based on the top 5 phrase candidates, we conducted a sentence experiment. In this experiment we applied loosely constrained sentence rules. With these rules, approximately 80 % of all the possible combinations of phrase candidates were accepted.</Paragraph> <Paragraph position="2"> Following are examples which did not exactly match the uttered sentences 6. Notice that misrecognized words consist of a relatively small number of phonemes, as we have seen in section 2.1.</Paragraph> <Paragraph position="3"> 5 The global beam width is set to 100 and the local beam width to 10. 6 Since the phrase candidates are obtained by the HMM-LR speech recognition, word boundaries are already marked by -. ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 807 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 (3) kaingi-ni moushiko-mi-tai-no-desu-nga (I would like to register for the conference.) 3a: kaingi-ni moushiko-mi-tai-N-desu-nga 3b: kaingi-ni moushiko-mi-tai-no-desu-ka (4) kochira-wa kaingizimukyoku-desu (This is the conference office.)
Though the phoneme string in 3a is different from the uttered phoneme string, the difference between no and N in meaning is minor, and has no effect on translation with the current technique. While (3) is affirmative, 3b is interrogative, which is indicated by the sentence final postposition ka. This cannot be treated with sentence rules. To handle this problem, we need dialogue management.</Paragraph> <Paragraph position="5"> The uttered phrase kochira-wa in (4), meaning &quot;this,&quot; was recognized erroneously as kata-wa in 4a, meaning &quot;person.&quot; The word kata belongs to the formal noun group, a kind of noun which should be modified by a verbal phrase [13]. Sentence 4a is acceptable, if modified by a verbal phrase, as in 4a': 4a': midori-no seihuku-o kiteiru kata-wa kaigizimukyoku-desu (The person who is wearing a green uniform is [with] the conference office.) This is also true of the phrase mono in 5c meaning &quot;thing,&quot; which was erroneously recognized instead of doumo meaning &quot;very much&quot;: 5c': kouka-na mono aringat-ou-gozaima-shi-ta (Thank you for the expensive thing.) In sentence candidates 5a and 5b, the numeral go, meaning &quot;five,&quot; is used. These sentences may seem strange at first glance, but in a situation such as playing cards, these sentences are quite natural. If someone plays a 5 when you need one, you would say: &quot;Thanks for the five.&quot; Similarly, when you need a 3 and a 5, and someone plays a 3 and after that someone else plays a 5, you would say: &quot;Thanks for the five, too.&quot; In the sentence candidate 6a, the conjunctive postposition (conj-pp) shi is used sentence finally. In principle, a conj-pp combines two sentences, functioning like a conjunction, such as &quot;while&quot; and &quot;though,&quot; and is used in the middle of a sentence.</Paragraph> <Paragraph position="6"> Erroneous sentence recognition such as in the case of 3a-b cannot be treated by sentence rules.
We therefore concentrate on the erroneous recognition seen in sentence candidates 4a, 5a-c and 6a, which we try to handle with sentence rules.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="40" type="metho"> <SectionTitle> 3 Dealing with Speech Recognition Errors </SectionTitle> <Paragraph position="0"> We are going to deal with sentences containing the following phrases: In order to decide how to cope with the above problems, we used our dialogue corpus. Currently we have 177 keyboard conversations consisting of approximately 72,000 words and 181 telephone conversations consisting of approximately 199,000 words 7. We regard keyboard conversations as representing written Japanese and telephone conversations as representing spoken Japanese. When retrieving the dialogue corpus, we always compare written and spoken Japanese, in order to clarify the features of the latter. We examined the actual usage of formal nouns as well as that of conj-pps.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Formal Nouns </SectionTitle> <Paragraph position="0"> We examined the behavior of formal nouns, such as koto and mono. Formal nouns are considered to be a kind of noun which lacks the content usually found in common nouns such as &quot;sky&quot; or &quot;apple.&quot; They function similarly to relative pronouns and therefore are used with a verbal modifier[13], as in examples 7 and 8: 7: kinou itta koto-wa torikeshitai.</Paragraph> <Paragraph position="1"> (I would like to take back what I said yesterday.) 8: nedan-ga takai mono-ga shitsu-ga ii wakedewanai. (It is not always true that an expensive thing has good quality.) In examples 7 and 8, the formal nouns, koto and mono, are modified by kinou itta (yesterday said) and nedan-ga takai (price expensive), respectively.
But it is also true that these nouns behave like common nouns and can be used without any verbal modifier, as in examples 9 and 10: Considering the examples 7-10, we could define two kinds of usage for formal nouns. This distinction is applicable to sentence analysis, but is meaningless from the standpoint of applying syntactic rules as constraints.</Paragraph> <Paragraph position="2"> In our dialogue corpus, koto, mono, hou and kata are the most frequently used formal nouns. Table 1 shows how often the formal nouns are used with a verbal modifier. We have also retrieved formal nouns used in the sentence initial position, as in example 10. written Japanese, when we allow only formal nouns preceded by a verbal modifier in the syntactic rules. However, the coverage remains at 40 %, which is less than half, in the spoken Japanese we are dealing with. We have further examined those sentences in which formal nouns are not modified by verbals. Most of them are modified by phrases consisting of a noun and postposition no, which approximately corresponds to &quot;of.&quot; Further, some are modified by phrases consisting of a verb followed by postpositions to and no. Others are modified by words which can be used exclusively as nominal modifiers such as donna (what kind of) and sono (that). We found only one example in the keyboard conversation in which a formal noun is not modified at all: 11: osoraku kyouju-ni koto-no shidai-o tsutaeru koto-ga ii-to omoimasu.</Paragraph> <Paragraph position="3"> (It might be good if you tell the professor how the thing is going.) In our dialogue corpus we found 2,491 phrases containing the formal nouns koto, mono, hou and kata. Out of 2,491 examples, there is only one which is not modified at all. If we define formal nouns as those which are always modified in some manner, i.e. even if we do not allow formal nouns to be used alone, the coverage still exceeds 99 %.
Since the occurrence rate of formal nouns without any modifier is very low, we can treat the usage of formal nouns (as in examples 9-11) as semi-frozen expressions.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Conjunctive Postpositions </SectionTitle> <Paragraph position="0"> Japanese postpositions such as ga, o and ni, which function as case markers, are usually attached to nominals. Different from this kind of postposition, conj-pps such as ga, te and ba are used after verbals.</Paragraph> <Paragraph position="1"> Conj-pps combine two clauses, functioning similarly to conjunctions such as &quot;because&quot; and &quot;while,&quot; and are thus often used in the middle of a sentence, as in example 12. But they can also be used in the sentence final position, as in example 13.</Paragraph> <Paragraph position="2"> There should follow some additional words to express the complete meaning. Sentences finishing with a conj-pp leave the interpretation to the hearer. And, in general, the hearer can correctly interpret the sentence from the context. Understanding conj-pps, therefore, plays an important role in treating spoken Japanese.</Paragraph> <Paragraph position="3"> In the dialogue corpus the following conj-pps are used: ga (because, while), node and nde (because), te and de (and), kara (because, after), keredomo, keredo, kedo and kedomo (though, but), shi (and, and then), ....... de (because), tara (if), to (if, when), ba (if) and nagara (while).</Paragraph> <Paragraph position="4"> Table 2 shows conj-pps used sentence finally.</Paragraph> <Paragraph position="5"> According to Table 2, the conj-pp ga is the one most used in keyboard conversations. While the usage of conj-pps in keyboard conversations is heavily concentrated on ga with an occurrence rate of 85%, it is more balanced in telephone conversations.
In addition to ga (38%), keredomo (30%) and conj-pps which carry a similar meaning such as keredo, kedo and kedomo are frequently used. In telephone conversations, node (13%) is also frequently used. Treating only the six conj-pps in sentence final position, the coverage reaches 91% for spoken Japanese. Differentiating conj-pps which can be used in sentence final position from those which can be used only in the middle of a sentence is also supported by the speech recognition results[14]. The conj-pps shi and cha are especially subject to erroneous recognition.</Paragraph> </Section> <Section position="3" start_page="0" end_page="40" type="sub_section"> <SectionTitle> 3.3 Syntactic Rules for Speech Recognition </SectionTitle> <Paragraph position="0"> Based on the corpus retrieval we decided to deal with formal nouns and conj-pps as described below. Also we decided to treat numerals only in a restricted environment, because they are significant noise factors in speech recognition 8: * Phrases with formal nouns must be modified. * Phrases with numerals can be used only in certain environments. Numerals are allowed in addresses, telephone numbers, dates and prices. Japanese numerals consist of an extremely small number of phonemes, e.g. ichi, ni, san (1, 2, 3) and are therefore especially easy to misrecognize 9. Thus, they should be strongly constrained. The domain we have chosen is limited to dialogues between an international conference receptionist and prospective participants and we are going to deal only with the anticipated usage in the domain.
Another condition, such as playing cards, will be treated when speech recognition is further improved.</Paragraph> <Paragraph position="1"> * We classify conj-pps into two groups: conj-pps which can be used in the sentence final position as well as in the middle of a sentence, and conj-pps which can be used only in the middle of a sentence.</Paragraph> <Paragraph position="2"> We refined the loosely constrained syntactic rules introduced in section 2.2. In the new version of the sentence rules, formal nouns, numerals and conj-pps are more precisely treated. In the following, we explain the rules for formal nouns and conj-pps.</Paragraph> <Paragraph position="3"> 8 See Figure 2.</Paragraph> <Paragraph position="4"> 9 Numbers greater than ten are in principle the combination of basic numbers.</Paragraph> <Paragraph position="5"> The format for syntactic rules is as follows: (&lt;CAT1&gt; &lt;--&gt; (&lt;CAT2&gt; &lt;CAT3&gt;)) Nonterminals are surrounded by &lt;&gt; 10. The above rule indicates that CAT1 consists of CAT2 and CAT3. To make the distinction between phrase categories which are terminals in phrase-based sentence rules and those which are not, we will write the former all in lower-case.</Paragraph> <Paragraph position="6"> In the process of sentence construction, phrases containing a formal noun np-formal are treated as The above rules say that noun phrases M-NN can, in principle, be modified by some modifier MOD-K. In the case of a common noun NN, the phrase can be modified but need not be. But in the case of a formal noun NN-FORM, the phrase must be modified.</Paragraph> <Paragraph position="7"> Phrases with a conj-pp which is exclusively used in the middle of a sentence vaux-s, those with a conj-pp which is used both in the middle of a sentence and in the sentence final vaux-s+f, and verb phrases without any conj-pps vaux, are treated as follows: A sentence SS can consist of only one verb phrase VC, or can be preceded by adverbial phrases ADVPH.
A sentence SS can end either with a verb phrase without a conj-pp vaux or with a verb phrase with a certain kind of conj-pp vaux-s+f. An adverbial phrase ADVPH can consist of only adverbs ADVI and can also consist of verbal phrases VADVS. The verbal phrases VADVS can contain any conj-pps, which means both vaux-s and vaux-s+f.</Paragraph> <Paragraph position="8"> 10 For terminals we have a different notation. Terminals in phrase rules are phoneme strings, whose transcription is defined by the HMM-LR phoneme model. 11 For the sake of explanation, the rules are simplified.</Paragraph> <Paragraph position="9"> Compared with the first version, which accepts approximately 80 % of the sentence candidates consisting of all the possible combinations of phrase candidates, the refined version only accepts approximately ... the phrase rules and phrase-based sentence rules.</Paragraph> </Section> </Section> <Section position="5" start_page="40" end_page="40" type="metho"> <SectionTitle> 4 Validity of Rule Refinements </SectionTitle> <Paragraph position="0"> We tested the improvement in two ways: speech recognition accuracy and the acceptance rate[12].</Paragraph> <Paragraph position="1"> To estimate the latter we checked how many sentence candidates were filtered out by applying phrase-based sentence rules as a post-filter. We verified the rule refinements through comparison of results gained by five different rule sets: the refined version of sentence rules which contain all three refinements (New Grammar); the refined version without conj-pp treatment (No Sentence Final Conj-pp), without formal noun treatment (No Formal Noun Treating), and without numeral treatment (No Numeral Treating); and rules which allow all combinations of phrase candidates (No Grammar).
For the first four of these rule sets we determined ranks based on the probabilities of phoneme strings predicted by syntactic rules. But in the No Grammar case we determined the rank solely based on phoneme probability. We experimented with the same 353 phrases which make up 137 sentences as in section 2.2. The phrase recognition rate for the top 5 candidates was again 95.5% by exact string matching.</Paragraph> <Section position="1" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 4.1 Speech Recognition Accuracy </SectionTitle> <Paragraph position="0"> We conducted speech recognition experiments. Figure 1 shows the constraint effectiveness of the phrase-based sentence rules given the five conditions examined. These five conditions are compared in the graph, based on their abilities to correctly recognize the spoken sentences among the top ranked 20 candidates.</Paragraph> <Paragraph position="2"> While the sentence recognition rate for the top candidates remains 37.2 % when probability is the only factor in determining the candidates, the recognition rate rises to 70.1% when the refined syntactic rules are applied as constraints. Differentiating conj-pps is highly effective. Without this treatment, the recognition rate remains 48.2%. Formal noun and numeral treatments are not as effective. Figure 1 indicates that the effect of each syntactic constraint is especially distinct up to rank 5, and that the recognition rates saturate when we take into account sentence candidates up to rank 10.</Paragraph> </Section> <Section position="2" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 4.2 Acceptance Rate </SectionTitle> <Paragraph position="0"> We also verified the validity of sentence rules through the acceptance rate. We examined how many sentence candidates were filtered out.
Table 4 shows the frequencies of sentences consisting of different numbers of phrases in our test corpus: Figure 2 shows the acceptance rates when applying four different syntactic rules. When applying rules which allow all combinations of phrase candidates, the acceptance rate remains 100 %.</Paragraph> <Paragraph position="1"> The effect of constraints is especially clear for sentences with a small number of phrases. In sentences with one phrase, the acceptance rate for the revised version is 41%, and for the version without conj-pp constraints 70%. In comparison with Figure 1, treating numerals contributes toward filtering out sentence candidates rather than raising speech recognition accuracy. Independent of the constraint strength, the more phrases there are in a sentence, the more effectively the rules work. The value for a sentence with</Paragraph> </Section> </Section> </Paper>