Multilingual Speech Recognition for Information Retrieval in Indian Context
Udhyakumar N., Swaminathan R. and Ramakrishnan S.K.

2 Monolingual Baseline Systems

Monolingual baseline systems are designed for Tamil and Hindi using HTK as the first step towards multilingual recognition. We have used the OGI Multi-language Telephone Speech Corpus for our experiments (Muthusamy et al., 1992). The database is first cleaned up and transcribed at both the word and the phone level. The phoneme sets for Hindi and Tamil are obtained from Rajaram (1990) and Rajput et al. (2002). Spontaneous-speech effects such as filled pauses (ah, uh, hm), laughter, breathing and sighing are modeled with explicit words. Background noises from radio, fans and crosstalk are pooled together and represented by a single model to ensure sufficient training data. Front-end features are 39-dimensional mel-scale cepstral coefficients, and Vocal Tract Length Normalization (VTLN) is used to reduce inter- and intra-speaker variability (an illustrative front-end sketch appears after Section 2.2).

2.1 Train and Test Sets

The OGI Multi-language corpus consists of up to nine separate responses from each caller, ranging from single words, through short topic-specific descriptions, to 60 seconds of unconstrained spontaneous speech. The Tamil data totals around 3 hours and the Hindi data around 2.5 hours of continuous speech. The training and test sets used in our experiments are detailed in Table 1.

2.2 Context-Independent Training

The context-independent monophones are modeled by individual HMMs: three-state strict left-to-right models with a single-Gaussian output probability density function per state (see the topology sketch after this section). Baum-Welch training is carried out to estimate the HMM parameters. The results of the monolingual baseline systems are shown in Table 2.

The difference in accuracy between the two languages cannot be attributed to language difficulty alone, because the two datasets also vary significantly in quality, vocabulary and quantity.

TAMIL: The monophone recognition results show that the most prominent errors are substitutions between phones that are acoustic variants of the same alphabet (e.g., ch and s, or th and dh). The lexicon is therefore updated with alternate pronunciations for these words, which improves accuracy to 56%.

HINDI: Consonant clusters are the main source of errors in Hindi. In the lexicon they are replaced with a single consonant followed by a short spelled 'h' phone, which increases accuracy to 52.9%. (A sketch of both lexicon fixes follows.)
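As a concrete illustration of the front end described at the start of Section 2, the Python sketch below derives 39-dimensional features under the common 13-static-plus-deltas-plus-delta-deltas decomposition (the paper states only the total dimension) and shows a piecewise-linear VTLN warp of the kind applied to the mel filterbank. The file name, cut-off frequency and warp-factor range are illustrative assumptions, not values from the paper.

    import numpy as np
    import librosa

    # 39-dimensional front end: 13 MFCCs plus deltas and delta-deltas
    # (a common decomposition; the paper gives only the total dimension).
    y, sr = librosa.load("call.wav", sr=8000)             # telephone-band speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])  # (39, n_frames)

    def vtln_warp(f, alpha, f_cut=3400.0, f_max=4000.0):
        """Piecewise-linear VTLN warp of a frequency axis (HTK-style sketch).

        Frequencies up to f_cut are scaled by alpha; above f_cut a linear
        segment maps the remainder back onto the full bandwidth, so the
        warped axis still ends at f_max.
        """
        f = np.asarray(f, dtype=float)
        mid = alpha * f
        edge = alpha * f_cut + (f - f_cut) * (f_max - alpha * f_cut) / (f_max - f_cut)
        return np.where(f > f_cut, edge, mid)

    # In practice a per-speaker alpha (typically searched over about
    # 0.88-1.12 by likelihood) warps the mel filterbank centre frequencies.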
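The three-state strict left-to-right topology of Section 2.2 can be written down directly. The paper trains with HTK; the sketch below substitutes the hmmlearn library for brevity, with illustrative transition probabilities and iteration counts. Baum-Welch (EM) re-estimation preserves the structural zeros of the transition matrix, so the left-to-right constraint survives training.

    import numpy as np
    from hmmlearn import hmm

    # One monophone model: three emitting states, one diagonal Gaussian each.
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=20, init_params="mc", params="stmc")
    model.startprob_ = np.array([1.0, 0.0, 0.0])   # always enter at state 0
    model.transmat_ = np.array([[0.6, 0.4, 0.0],   # strict left-to-right:
                                [0.0, 0.6, 0.4],   # self-loop or advance,
                                [0.0, 0.0, 1.0]])  # no skips, no back-jumps
    # X: (n_frames, 39) features pooled over all segments of this phone;
    # lengths: the frame count of each segment.
    # model.fit(X, lengths)  # Baum-Welch; zero transitions remain zero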
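Both lexicon fixes in Section 2.2 amount to simple dictionary edits. In the sketch below the words, phone symbols and cluster inventory are invented for illustration, and the Hindi rewrite follows one possible reading of "a single consonant followed by a short spelled 'h' phone".

    # Tamil: add alternate pronunciations for phones that are acoustic
    # variants of the same alphabet (e.g., ch ~ s, th ~ dh).
    VARIANTS = {"ch": "s", "th": "dh"}

    def alternates(pron):
        """Yield the canonical pronunciation plus single-swap variants."""
        yield list(pron)
        for i, p in enumerate(pron):
            if p in VARIANTS:
                alt = list(pron)
                alt[i] = VARIANTS[p]
                yield alt

    tamil_lexicon = {"pachai": list(alternates(["p", "a", "ch", "ai"]))}
    # -> [['p', 'a', 'ch', 'ai'], ['p', 'a', 's', 'ai']]

    # Hindi: break consonant clusters with a short 'h' phone.
    CLUSTERS = {("k", "r"), ("t", "r"), ("s", "t")}

    def split_clusters(pron):
        out, i = [], 0
        while i < len(pron):
            if i + 1 < len(pron) and (pron[i], pron[i + 1]) in CLUSTERS:
                out += [pron[i], "h", pron[i + 1]]
                i += 2
            else:
                out.append(pron[i])
                i += 1
        return out
    # split_clusters(["k", "r", "a", "m"]) -> ["k", "h", "r", "a", "m"]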
2.3 Context-Dependent Training

Each monophone encountered in the training data is cloned into a triphone with left and right contexts. All triphones that share the same central phone thus start out as distinct HMMs with identical initial parameter values; these HMMs are then clustered using decision trees and incrementally trained.

The phonetic questions (nasals, sibilants, etc.) used for tree-based state tying require linguistic knowledge about the acoustic realization of the phones. Hence the decision tree built for American English is modified to model context dependency in Hindi and Tamil (an illustrative sketch follows this section). Further, unsupervised adaptation using Maximum Likelihood Linear Regression (MLLR) is applied to handle calls from non-native speakers, and environment adaptation is analyzed for handling background noise.
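To make the cloning and clustering of Section 2.3 concrete, the sketch below expands a phone string into triphones and evaluates illustrative phonetic questions. Real systems grow a full decision tree per HMM state by likelihood gain; only the question test and a single split are shown here, and all symbols and question sets are invented for the example.

    # Clone monophones into context-dependent triphones "left-centre+right".
    def to_triphones(phones, boundary="sil"):
        tri = []
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else boundary
            right = phones[i + 1] if i + 1 < len(phones) else boundary
            tri.append(f"{left}-{p}+{right}")
        return tri

    # to_triphones(["a", "m", "a"]) -> ["sil-a+m", "a-m+a", "m-a+sil"]

    # Phonetic questions for tree-based state tying (illustrative sets);
    # each asks about the left (L_) or right (R_) context of a triphone.
    QUESTIONS = {
        "R_Nasal":    lambda t: t.split("+")[1] in {"m", "n", "ng"},
        "L_Sibilant": lambda t: t.split("-")[0] in {"s", "sh", "zh"},
    }

    def apply_question(triphones, name):
        """Partition triphones by one question (a single tree split)."""
        q = QUESTIONS[name]
        yes = [t for t in triphones if q(t)]
        no = [t for t in triphones if not q(t)]
        return yes, no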
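MLLR, used above for non-native and environment adaptation, estimates an affine transform that is shared by all Gaussian means in a regression class, so a small amount of adaptation data can shift the entire model set. The numpy sketch below only applies such a transform; the maximum-likelihood estimation of A and b from adaptation data is omitted, and the dimensions are illustrative.

    import numpy as np

    def mllr_adapt_means(means, A, b):
        """Apply a shared MLLR mean transform: mu_hat = A @ mu + b.

        means: (n_gaussians, d) stacked means of one regression class
        A, b:  (d, d) matrix and (d,) bias estimated by maximising the
               likelihood of the adaptation data under the transformed model
        """
        return means @ A.T + b

    # Sanity check: the identity transform leaves the models unchanged.
    d = 39
    mu = np.random.randn(10, d)
    assert np.allclose(mllr_adapt_means(mu, np.eye(d), np.zeros(d)), mu)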