<?xml version="1.0" standalone="yes"?> <Paper uid="N04-2001"> <Title>Multilingual Speech Recognition for Information Retrieval in Indian context Udhyakumar.N, Swaminathan.R and Ramakrishnan.S.K</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Multilingual Recognition System </SectionTitle>
<Paragraph position="0"> The multilingual phoneme set is obtained from the monolingual models by combining acoustically similar phones. The combination rests on the assumption that the articulatory representations of phones are similar enough across languages that phones can be treated as units independent of the underlying language.</Paragraph>
<Paragraph position="1"> Such a combination has the following benefits (Schultz et al., 1998): Model sharing across languages makes the system compact by reducing its complexity.</Paragraph>
<Paragraph position="2"> Data sharing yields more reliable estimates of model parameters, especially for less frequent phonemes.</Paragraph>
<Paragraph position="3"> Multilingual models, when bootstrapped as seed models for an unseen target language, improve recognition accuracy considerably.</Paragraph>
<Paragraph position="4"> A global phoneme pool allows more accurate modeling of out-of-vocabulary (OOV) words.</Paragraph>
<Paragraph position="5"> The International Phonetic Association has classified sounds on purely phonetic grounds, independently of any particular language. Hence IPA mapping is used to form the global phoneme pool of the multilingual recognizer: phones of Tamil and Hindi that share the same IPA symbol are combined and trained with data from both languages (IPA, 1999).</Paragraph>
<Paragraph position="6"> The phonetic inventory Y_ML of the multilingual recognizer can be expressed as the set of language-independent phones G_LI unified with the sets of language-dependent phones G_LD that are unique to Hindi or Tamil: Y_ML = G_LI ∪ G_LDT ∪ G_LDH, where G_LDT is the set of Tamil-dependent models and G_LDH is the set of Hindi-dependent models.</Paragraph>
<Paragraph position="7"> The share factor SF is calculated as SF = (|Tamil phone set| + |Hindi phone set|) / |Y_ML|, i.e., the ratio of the summed sizes of the language-specific phone sets to the size of the global phoneme set, which implies a sharing rate of 75% between the two languages. The high overlap of the Hindi and Tamil phonetic spaces is evident from the value of SF, and this property has been a motivating factor for developing a multilingual system for these languages.</Paragraph>
<Paragraph position="8"> After merging the monophone models, context-dependent triphones are created as stated earlier. Alternative data-driven techniques can also be used for acoustic model combination, but they have been shown to be outperformed by IPA mapping (Schultz et al., 1998).</Paragraph> </Section>
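To make the phone-pool construction and the share factor concrete, the following minimal Python sketch merges toy phone inventories by IPA symbol and computes SF. The phone labels and IPA symbols below are illustrative placeholders, not the actual Tamil and Hindi inventories used in the paper.

# Toy monolingual inventories: language-specific phone label -> IPA symbol.
# These sets are illustrative only; the real ones come from the Tamil and
# Hindi monophone models described above.
tamil_phones = {"t_a": "a", "t_i": "i", "t_k": "k", "t_zh": "ɻ"}   # "ɻ" unique to Tamil
hindi_phones = {"h_a": "a", "h_i": "i", "h_k": "k", "h_q": "q"}    # "q" unique to Hindi

def build_global_pool(*inventories):
    """Merge phones that share an IPA symbol into one multilingual model."""
    pool = {}                                  # IPA symbol -> contributing phones
    for inv in inventories:
        for phone, ipa in inv.items():
            pool.setdefault(ipa, []).append(phone)
    return pool

pool = build_global_pool(tamil_phones, hindi_phones)

# Share factor: summed monolingual inventory sizes divided by the size of the
# global phoneme set (the paper reports a value implying roughly 75% sharing).
sf = (len(tamil_phones) + len(hindi_phones)) / len(pool)
print(f"global pool size = {len(pool)}, share factor = {sf:.2f}")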
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Cross Language Adaptation </SectionTitle>
<Paragraph position="0"> A major time and cost limitation in developing LVCSR systems for Indian languages is the need for large amounts of training data. Cross-lingual bootstrapping is used to overcome this drawback. The key idea in this approach is to initialize a recognizer in the target language with acoustic models already developed for another language as seed models. After initialization, the resulting system is rebuilt using training data from the target language. Cross-language seed models perform better than flat starts or random models. Hence the Hindi-Tamil phonetic space Y_ML is populated with English models Y_E in the following steps.</Paragraph>
<Paragraph position="1"> The English phones are trained on the Network TIMIT (NTIMIT) speech database, the telephone-bandwidth version of the widely used TIMIT database.</Paragraph>
<Paragraph position="2"> To suit Indian telephony conditions, the 16 kHz NTIMIT speech data is down-sampled to 8 kHz.</Paragraph>
<Paragraph position="3"> A heuristic IPA mapping combined with a data-driven approach is used to map the English models to the multilingual models. The mappings are shown in Table 3.</Paragraph>
<Paragraph position="4"> If any phone in Y_E maps to two or more phones in Y_ML, the training vectors are randomly divided between those phones, since duplication reduces the classification rate.</Paragraph>
<Paragraph position="5"> After bootstrapping, the models are trained with data from both languages.</Paragraph>
<Paragraph position="7"> Table 3: Mapping between the multilingual phoneme set and English phones for cross-lingual bootstrapping.</Paragraph>
<Paragraph position="8"> The improvement in accuracy due to cross-lingual bootstrapping is evident from the results shown in Table 4 and stems from the implicit initial alignment provided by the bootstrapped seed models. The results are calculated for context-dependent triphones in each case. The degradation in accuracy of the multilingual system relative to its monolingual counterparts is attributed to generalization and parameter reduction.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Grapheme to Phoneme conversion </SectionTitle>
<Paragraph position="0"> Generating phonetic base forms by direct dictionary lookup is limited by the time, effort and knowledge brought to bear on the construction process. The dictionary can never be exhaustive because of proper names and pronunciation variants, and a detailed lexicon also occupies considerable disk space. The solution is to derive the pronunciation of a word from its orthography. Automatic grapheme-to-phoneme conversion is essential in both speech synthesis and automatic speech recognition, and unlike a fixed lookup dictionary it helps to solve the out-of-vocabulary word problem. We have examined both rule-based and data-driven self-learning approaches for automatic letter-to-sound (LTS) conversion.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Rule-Based LTS </SectionTitle>
<Paragraph position="0"> Because Indian languages are largely phonetic, grapheme-to-phoneme conversion is usually carried out with a set of handcrafted phonological rules. For example, the set of rules that maps the letter /p/ to its corresponding phones is given below; in Tamil, /p/ can be pronounced as 'p', 'b' or 'P' depending on its context.</Paragraph>
<Paragraph position="1"> P_Rules: 1. {Anything, &quot;pp&quot;, Anything, &quot;p h&quot;}, 2. {Nothing, &quot;p&quot;, Anything, &quot;p*b&quot;}, 3. {Anything, &quot;p&quot;, CONSONANT, &quot;p&quot;}, 4. {NASAL, &quot;p&quot;, Anything, &quot;b&quot;}, 5. {Anything, &quot;p&quot;, Anything, &quot;P&quot;}. Here * indicates a pronunciation variant. It may be noted that for any word, these context-sensitive rules give a phonemic transcription. The rules are ordered most specific first, with special conventions about lookup, context and target (Xuedong Huang et al., 2001). For example, rule 4 means that the letter /p/, when preceded by a nasal and followed by anything, is pronounced as 'b'. Such rules cannot cover every case, so the exceptions are stored in an exception list. For any given word the system first searches this lookup dictionary; if a match is found, it reads the transcription from the list, and otherwise it generates the pronunciation using the rules. This approach accommodates pronunciation variants specific to particular words without redrafting the complete rule set. A sketch of such a rule interpreter is given after this subsection.</Paragraph> </Section>
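The following minimal Python sketch shows one way the ordered context-sensitive rules above could be interpreted. The transliteration scheme, the NASAL and CONSONANT character classes and the example word are assumptions made for illustration, not the system's actual tables.

# Each rule: (left context, target letters, right context, phone string).
# Contexts: None = Anything; "Nothing" = word boundary; otherwise a class name.
NASALS = set("mn")                      # assumed transliteration classes
CONSONANTS = set("kgcjtdpbmnyrlvzs")

P_RULES = [
    (None,      "pp", None,        "p h"),   # 1. geminate
    ("Nothing", "p",  None,        "p*b"),   # 2. word-initial (variant p or b)
    (None,      "p",  "CONSONANT", "p"),     # 3. before a consonant
    ("NASAL",   "p",  None,        "b"),     # 4. after a nasal
    (None,      "p",  None,        "P"),     # 5. default
]

def _matches(cls, ch):
    if cls is None:                     # Anything
        return True
    if cls == "Nothing":                # word boundary
        return ch is None
    if cls == "NASAL":
        return ch is not None and ch in NASALS
    if cls == "CONSONANT":
        return ch is not None and ch in CONSONANTS
    return ch == cls

def transcribe(word, rules=P_RULES):
    """Apply the first (most specific) matching rule at each position."""
    phones, i = [], 0
    while i < len(word):
        for left, target, right, phone in rules:
            if not word.startswith(target, i):
                continue
            prev = word[i - 1] if i > 0 else None
            nxt_i = i + len(target)
            nxt = word[nxt_i] if nxt_i < len(word) else None
            if _matches(left, prev) and _matches(right, nxt):
                phones.append(phone)
                i = nxt_i
                break
        else:                           # no rule fired; pass the letter through
            phones.append(word[i])
            i += 1
    return " ".join(phones)

print(transcribe("ampa"))   # nasal + p -> "a m b a" under these toy classes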
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 CART Based LTS </SectionTitle>
<Paragraph position="0"> Extensive linguistic knowledge is necessary to develop LTS rules. As with any expert system, it is difficult to anticipate all relevant cases and sometimes hard to check for rule interference and redundancy. Because developing phonological rules manually is tedious, machine-learning algorithms are used to automate the acquisition of LTS conversion rules. We have used statistical modeling based on CART (Classification And Regression Trees) to predict phones from letters and their context.</Paragraph>
<Paragraph position="1"> Indian languages usually have a one-to-one mapping between letters and their corresponding phones.</Paragraph>
<Paragraph position="2"> This avoids the need for complex alignment methods.</Paragraph>
<Paragraph position="3"> The basic CART component is a set of yes-no questions about the set membership of the phones and letters that provide the orthographic context. At each node, the question giving the best entropy reduction is chosen to grow the tree from the root. Performance can be improved by including composite questions, which are conjunctive and disjunctive combinations of primitive questions and their negations. Composite questions can achieve a longer-range optimum, improve entropy reduction and avoid the data fragmentation caused by the greedy nature of CART (Breiman et al., 1984). The target classes, or leaves, consist of individual phones; in the case of alternate pronunciations the phone variants are combined into a single class. A toy illustration of this data-driven approach is sketched below.</Paragraph> </Section>
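As a rough illustration of the data-driven alternative, the sketch below trains a decision tree to predict a phone from a letter and its immediate context. scikit-learn's DecisionTreeClassifier with an entropy criterion stands in for CART with composite questions (it uses binary threshold splits over encoded features rather than set-membership questions), and the tiny aligned word list is invented; the real system is trained on the hand-crafted lexicon described in the experiments below.

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer

# Tiny invented letter->phone training pairs (Indian-language scripts are
# largely phonetic, so letters and phones align one-to-one as noted above).
words = [("ampa", "a m b a"), ("appam", "a p h a m"), ("padi", "b a d i")]

def context_features(word):
    """One feature dict per letter: the letter plus its neighbours."""
    feats = []
    for i, ch in enumerate(word):
        feats.append({
            "prev": word[i - 1] if i > 0 else "#",
            "curr": ch,
            "next": word[i + 1] if i + 1 < len(word) else "#",
        })
    return feats

X_dicts, y = [], []
for word, phones in words:
    phone_list = phones.split()
    # assumes a one-to-one letter/phone alignment, as argued above
    assert len(phone_list) == len(word)
    X_dicts.extend(context_features(word))
    y.extend(phone_list)

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dicts)

# Entropy-based splits approximate the CART question selection described above.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

test = vec.transform(context_features("ampa"))
print(list(tree.predict(test)))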
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Experiments </SectionTitle>
<Paragraph position="0"> Both rule-based and CART-based LTS systems are developed for Tamil and evaluated on a 2k-word hand-crafted lexicon. A transliterated version of the Tamil text is given as input to the rule-based system. The performance of the decision trees is comparable to that of the phonological rules (Table 5). We observed some interesting results when the constructed tree was visualized after pruning.</Paragraph>
<Paragraph position="1"> The composite questions generated sensible clusters of letters: nasals, rounded vowels and consonants are grouped together. The other phenomenon is that CART derived some intricate rules for words that the linguists who prepared the phonological rules had treated as exceptions. Statistical methods that use phonemic trigrams to rescore the n-best lists generated by the decision tree, and the use of Weighted Finite State Transducers, are possible extensions. The results show that automatic rule generation with CART performs better than manually coded rules.</Paragraph> </Section> </Section>
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Adaptation for Accented English </SectionTitle>
<Paragraph position="0"> English words are common in Indian conversation: in the OGI multi-language database, 32% of Hindi and 24% of Tamil sentences contain English words. It is therefore necessary to include English in a multilingual recognizer designed for Indian languages. Because English is a non-native language, Indian-accented English suffers from disfluencies, a low speaking rate and repetitions. Hence the accuracy of a system trained on American English degrades significantly when it is used to recognize Indian-accented English. Various techniques, such as lexical modeling, automatic pronunciation modeling using FSTs, speaker adaptation, retraining with pooled data and model interpolation, are being explored to reduce word error rates for non-native speakers.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Speech Corpus </SectionTitle>
<Paragraph position="0"> We have used the Foreign Accented English corpus from the Center for Spoken Language Understanding (CSLU) and the native American-accented Network TIMIT corpus for the experiments. Table 6 gives the details of the speech databases used in our study.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Previous Work </SectionTitle>
<Paragraph position="0"> Lexical adaptation techniques introduce pronunciation variants into the lexicon for decoding accented speech (Tomokiyo, 2000). The problem with these methods is that the context-dependent phones in the adapted lexicon seldom appear in the training data and hence are not trained properly. Statistical methods for the automatic acquisition of pronunciation variants have produced successful results (Chao Huang et al., 2001), but these algorithms are costly in memory and execution time, which makes them difficult to use for real-time speech. In acoustic modeling techniques, the native models are modified with the available accented data and data from the speaker's native language.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Experiments </SectionTitle>
<Paragraph position="0"> The baseline system is trained on the N_ENG corpus and used to recognize NN_TAE utterances. It is well known that accented speech is influenced by the native language of the speaker, so we also tried decoding the NN_TAE data with the Tamil recognizer, using a lexicon generated with the grapheme-to-phoneme rules. The accuracy dropped below the baseline, which indicates that there is no direct relationship between N_TAM and NN_TAE speech; this result has already been reported by Tomokiyo (2000).</Paragraph>
<Paragraph position="1"> Careful analysis of the accented speech shows that the perception of a target phone is close to an acoustically related phone in the speaker's native language. As the speaker gains proficiency, the pronunciation is tuned towards the target phone and the influence of the interfering phone becomes less pronounced. This suggests that any acoustic modeling technique should start from native-language models and suitably modify them to handle accented English; accordingly, attempts to retrain or adapt the N_ENG models using N_TAM data degraded the accuracy. The first set of experiments is carried out using the N_ENG models: MLLR adaptation and retraining with NN_TAE data increase the accuracy. In the second set of experiments, the English models are bootstrapped from the N_TAM models by heuristic IPA mapping and then trained by pooling the N_ENG and NN_TAE data. This method shows better performance than the other approaches. The comparative results are shown in Figure 1, and a toy sketch of adapting native models with accented data follows this section.</Paragraph>
<Paragraph position="2"> Figure 1: Comparison of accuracies of acoustic modeling techniques on Tamil-accented English.</Paragraph> </Section> </Section>
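As a simplified illustration of modifying native acoustic models with accented data, the toy numpy sketch below interpolates single Gaussian means per phone in MAP style. The experiments above use MLLR adaptation, retraining and cross-lingual bootstrapping over full HMM sets; the phone names, feature dimensionality and relevance factor here are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
DIM = 13                      # e.g. 13 MFCCs; an assumption, not the paper's setup

# Toy "native English" model: one Gaussian mean per phone.
native_means = {"aa": rng.normal(size=DIM), "t": rng.normal(size=DIM)}

# Toy accented-English adaptation data: feature frames assigned to each phone.
accented_frames = {"aa": rng.normal(loc=0.5, size=(200, DIM)),
                   "t": rng.normal(loc=-0.3, size=(40, DIM))}

def map_adapt_means(means, frames, tau=20.0):
    """Interpolate each native mean towards the accented-data mean.

    tau is a relevance factor: phones with many accented frames move almost
    all the way to the accented statistics, rarely seen phones stay close to
    the native model.
    """
    adapted = {}
    for phone, mu in means.items():
        x = frames.get(phone)
        if x is None or len(x) == 0:
            adapted[phone] = mu.copy()
            continue
        n = len(x)
        adapted[phone] = (tau * mu + x.sum(axis=0)) / (tau + n)
    return adapted

adapted_means = map_adapt_means(native_means, accented_frames)
for phone in adapted_means:
    shift = np.linalg.norm(adapted_means[phone] - native_means[phone])
    print(f"{phone}: mean shifted by {shift:.2f}")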
<Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Language Identification </SectionTitle>
<Paragraph position="0"> Automatic language identification (LID) has received increased interest with the development of multilingual spoken language systems. Telephone companies handling foreign calls can use LID to route a call automatically to an operator who is fluent in the caller's language. In information retrieval systems it can be used by the speech synthesis module to respond in the user's native language, and it can also serve as a front end for speech-to-speech translation. LVCSR-based LID can be incorporated in both the acoustic and the language modeling.</Paragraph>
<Paragraph position="1"> In the language-independent approach, a multilingual recognizer with language-independent models is used.</Paragraph>
<Paragraph position="2"> Tamil and Hindi bigram models are used to rescore the recognized phone string, and the language giving the highest log probability is hypothesized; the bigrams for both languages are estimated on the transcribed training data. In the language-dependent approach, each phone carries a language tag along with its label, and the models are trained solely on data from their own language; the language is then identified implicitly from the recognized phone labels. This approach has the advantage of context- and text-independent language identification (Lamel et al., 1994). The results for both approaches are given in Table 7. A toy sketch of the bigram-rescoring approach is given below.</Paragraph> </Section> </Paper>
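Finally, a toy Python sketch of the language-independent LID scheme described above: the recognized phone string is scored with per-language phone-bigram models and the language with the highest log probability is hypothesized. The bigram training transcriptions and the test phone sequence are invented; real models would be estimated from the transcribed Tamil and Hindi training data.

import math
from collections import defaultdict

def train_bigrams(transcriptions):
    """Estimate phone-bigram probabilities with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for phones in transcriptions:
        seq = ["<s>"] + phones + ["</s>"]
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    V = len(vocab)

    def logprob(phones):
        seq = ["<s>"] + phones + ["</s>"]
        total = 0.0
        for a, b in zip(seq, seq[1:]):
            total += math.log((counts[a][b] + 1) / (sum(counts[a].values()) + V))
        return total

    return logprob

# Invented training transcriptions standing in for the transcribed data.
tamil_lm = train_bigrams([["k", "a", "zh", "a"], ["a", "m", "b", "a"]])
hindi_lm = train_bigrams([["k", "q", "a"], ["a", "p", "k", "a"]])

def identify(phones):
    """Score the recognized phone string under each language model."""
    scores = {"Tamil": tamil_lm(phones), "Hindi": hindi_lm(phones)}
    return max(scores, key=scores.get), scores

print(identify(["a", "m", "b", "a"]))   # hypothesizes the higher-scoring language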