<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3711"> <Title>IBM MASTOR SYSTEM: Multilingual Automatic Speech-to-speech Translator *</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. SYSTEM OVERVIEW </SectionTitle>
<Paragraph position="0"> The general framework of our speech translation system is illustrated in Figure 1. The MASTOR system is a cascade of ASR, MT, and TTS components. This cascaded approach allows us to deploy the power of existing advanced speech and language processing techniques while concentrating on the problems unique to speech-to-speech translation. Figure 2 illustrates the MASTOR GUI (Graphical User Interface) on a laptop and a PDA, respectively.</Paragraph>
<Paragraph position="1"> The baseline acoustic models for English and Mandarin are developed for large-vocabulary continuous speech and trained on over 200 hours of speech collected from about 2,000 speakers for each language. The Arabic dialect speech recognizer, however, was trained on only about 50 hours of dialectal speech. The Arabic training data consist of about 200K short utterances. A large effort was invested in the initial cleaning and normalization of the training data because of the large number of irregular dialectal words and spelling variations. We experimented with three approaches to pronunciation and acoustic modeling, namely grapheme, phonetic, and context-sensitive grapheme, as described in Section 3.A. We found that using context-sensitive pronunciation rules reduces the WER of the grapheme-based acoustic model by about 3% relative (from 36.7% to 35.8%). Based on these results, we decided to use context-sensitive grapheme models in our system.</Paragraph>
<Paragraph position="2"> The Arabic language model (LM) is an interpolated model consisting of a trigram LM, a class-based LM, and a morphologically processed LM, all trained on a corpus of a few hundred thousand words. We also built a compact language model for the hand-held system, in which singletons are eliminated and bigram and trigram counts are pruned with increased thresholds. The LM footprint is 10MB.</Paragraph>
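For illustration, the following minimal sketch shows the two LM techniques just described, linear interpolation of component LMs and count pruning for the compact model; the component models, weights, and thresholds are hypothetical, since the paper does not specify them.

```python
# A minimal sketch of the two LM ideas above: linear interpolation of
# component models, and count pruning for the compact hand-held LM.
# Component models, weights, and thresholds are hypothetical.
from collections import Counter

def interpolated_lm(components, weights):
    """Combine component LMs p_i(word | history) with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    def p(word, history):
        return sum(w * lm(word, history) for lm, w in zip(components, weights))
    return p

def prune_counts(counts, bigram_min=2, trigram_min=3):
    """Eliminate singletons and prune bigram/trigram counts below thresholds."""
    thresholds = {1: 2, 2: bigram_min, 3: trigram_min}
    return Counter({ng: c for ng, c in counts.items()
                    if c >= thresholds[len(ng)]})

# Toy usage: a trigram LM, a class-based LM, and a morphological LM
# (stubbed here as uniform models) interpolated with weights 0.6/0.2/0.2.
uniform = lambda word, history: 1.0 / 50000
lm = interpolated_lm([uniform, uniform, uniform], [0.6, 0.2, 0.2])
print(lm("kitab", ("qara", "al")))  # -> 2e-05
```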
<Paragraph position="3"> There are two approaches to translation. The concept-based approach uses natural language understanding (NLU) and natural language generation models trained on an annotated corpus.</Paragraph>
<Paragraph position="4"> The other approach is a phrase-based finite-state transducer trained on an unannotated parallel corpus.</Paragraph>
<Paragraph position="5"> A trainable, phrase-splicing and variable-substitution TTS system is adopted to synthesize speech from the translated sentences; it has the special ability to seamlessly generate speech in mixed languages [9]. In addition, a small-footprint TTS engine was developed for hand-held devices using embedded concatenative TTS technology [10]. Next, we describe our approaches to automatic speech recognition and machine translation in greater detail.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. AUTOMATIC SPEECH RECOGNITION A. Acoustic Models </SectionTitle>
<Paragraph position="0"> The acoustic models and the pronunciation dictionary greatly influence ASR performance. In particular, creating an accurate pronunciation dictionary poses a major challenge when moving to a new language. Deriving pronunciations for resource-rich languages such as English or Mandarin is relatively straightforward using existing dictionaries or letter-to-sound models. In languages such as Arabic and Hebrew, however, the written form typically omits the short vowels that a native speaker infers from context. Automatically deriving phonetic transcriptions for speech corpora is therefore difficult, and the problem is even more apparent in colloquial Arabic, mainly because of the large number of irregular dialectal words.</Paragraph>
<Paragraph position="1"> One approach to overcoming the absence of short vowels is to use grapheme-based acoustic models. This makes the construction of pronunciation lexicons straightforward and hence facilitates model training and decoding. However, the same grapheme may be realized as different phonetic sounds depending on its context, which yields less accurate acoustic models. For this reason we experimented with two further approaches: a full phonetic approach that models short vowels, and a context-sensitive grapheme approach in which two different phonemes are used for the letter &quot;A&quot; (Alif) depending on its position in the word.</Paragraph>
<Paragraph position="2"> Using phoneme-based pronunciations requires the vowelization of every word. To perform vowelization, we used a mix of dictionary search and a statistical approach: each word is first looked up in an existing vowelized dictionary, and if it is not found it is passed to a statistical vowelizer [11]. Owing to the difficulty of accurately vowelizing dialectal words, our experiments showed no improvement for phoneme-based ASR over the grapheme-based models.</Paragraph>
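For illustration, a minimal sketch that combines, in one toy function, the strategies described above: dictionary-based vowelization, the statistical fallback of [11], and a context-sensitive grapheme rule for Alif. The lexicon entries, the vowelizer stub, and the phoneme names are all hypothetical; the paper treats the phonetic and grapheme approaches as separate systems.

```python
# Hypothetical sketch of the pronunciation strategies described above.
# Lexicon contents, the statistical vowelizer, and the phoneme names
# (e.g. "A_initial" vs "A_medial") are invented for illustration.

VOWELIZED_DICT = {"ktb": "kataba"}  # toy vowelized dictionary entry

def statistical_vowelize(word):
    """Stand-in for the statistical vowelizer of [11]."""
    return None  # unknown word; fall back to graphemes

def grapheme_pronunciation(word):
    """Context-sensitive grapheme mapping: Alif gets one of two phonemes
    depending on its position in the word (word-initial vs elsewhere)."""
    phones = []
    for i, ch in enumerate(word):
        if ch == "A":  # Alif
            phones.append("A_initial" if i == 0 else "A_medial")
        else:
            phones.append(ch)  # other graphemes map one-to-one
    return phones

def pronunciation(word):
    # 1) look up an existing vowelized dictionary
    if word in VOWELIZED_DICT:
        return list(VOWELIZED_DICT[word])
    # 2) pass the word to the statistical vowelizer
    vowelized = statistical_vowelize(word)
    if vowelized is not None:
        return list(vowelized)
    # 3) fall back to context-sensitive graphemes
    return grapheme_pronunciation(word)

print(pronunciation("Almktb"))  # Alif in word-initial position
```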
<Paragraph position="3"> Speech recognition for both the laptop and hand-held systems is based on the IBM ViaVoice engine. This highly robust and efficient framework uses rank-based acoustic scores [12] derived from tree-clustered context-dependent Gaussian models.</Paragraph>
<Paragraph position="4"> These acoustic scores, together with n-gram LM probabilities, are incorporated into a stack-based search algorithm to yield the most probable word sequence given the input speech.</Paragraph>
<Paragraph position="5"> The English acoustic models use an alphabet of 52 phones. Each phone is modeled with a 3-state left-to-right hidden Markov model (HMM). The system has approximately 3,500 context-dependent states, modeled with 42K Gaussian distributions and trained on 40-dimensional features. The context-dependent states are generated by a decision-tree classifier. The colloquial Arabic acoustic models use about 30 phones that essentially correspond to the graphemes of the Arabic alphabet; the HMM structure is the same as in the English model.</Paragraph>
<Paragraph position="6"> The Arabic acoustic models are likewise built on 40-dimensional features. The compact model for the PDA has about 2K leaves and 28K Gaussian distributions; the laptop version has over 3K leaves and 60K Gaussians. All acoustic models are trained with discriminative training [13].</Paragraph>
<Paragraph position="7"> B. Language Models Language modeling is particularly challenging for open-ended conversational systems. Our approaches to building statistical trigram LMs fall into three categories: 1) obtaining additional training material automatically; 2) interpolating domain-specific LMs with other LMs; and 3) improving the robustness and accuracy of distribution estimation with limited in-domain resources. Automatic data collection and expansion is the most straightforward way to obtain an effective LM, especially when little in-domain data is available. For resource-rich languages such as English and Chinese, we retrieve additional data from the World Wide Web (WWW) to augment our limited domain-specific data, which yields significant improvements [6].</Paragraph>
<Paragraph position="8"> In Arabic, words can take prefixes and suffixes that generate new words semantically related to the root form of the word (the stem). As a result, the Arabic vocabulary can become very large even in specific domains. To alleviate this problem, we built a language model on morphologically tokenized data, applying morphological analysis to split words into prefix+stem+suffix, prefix+stem, or stem+suffix forms.</Paragraph>
<Paragraph position="9"> We refer the reader to [14] for the details of the morphological tokenization algorithm. Morphological analysis reduced the vocabulary size by about 30% without sacrificing coverage.</Paragraph>
<Paragraph position="10"> More specifically, in our MASTOR system the English language model has two linearly interpolated components. The first is built on in-domain data. The second acts as a background model and is built on a very large, domain-independent generic text inventory; its n-gram counts are pruned to control its size. The colloquial Arabic language model for our laptop system is composed of three linearly interpolated components: a basic word trigram model; a class-based language model with 13 classes covering English and Arabic names, numbers, months, days, etc.; and the morphological language model described above.</Paragraph>
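For illustration, a minimal sketch of the kind of prefix+stem+suffix splitting described above; the affix inventories and the greedy strategy are hypothetical stand-ins for the actual tokenization algorithm of [14].

```python
# Hypothetical sketch of morphological tokenization as described above:
# split a word into prefix+stem+suffix, prefix+stem, or stem+suffix.
# The affix inventories are invented; the real algorithm is in [14].

PREFIXES = ["Al", "w", "b", "l"]   # e.g. the definite article "Al-"
SUFFIXES = ["At", "wn", "h", "y"]  # e.g. plural / possessive endings

def tokenize(word, min_stem_len=2):
    """Greedily strip at most one prefix and one suffix, keeping the stem."""
    prefix, suffix = "", ""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem_len:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem_len:
            suffix, word = s, word[:-len(s)]
            break
    tokens = []
    if prefix:
        tokens.append(prefix + "+")   # mark as a prefix token
    tokens.append(word)               # the stem
    if suffix:
        tokens.append("+" + suffix)   # mark as a suffix token
    return tokens

print(tokenize("AlmktbAt"))  # -> ['Al+', 'mktb', '+At'] (hypothetical)
```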
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. SPEECH TRANSLATION </SectionTitle>
<Paragraph position="0"> A. NLU/NLG-based Speech Translation One of the translation algorithms we proposed and applied in MASTOR is a statistical translation method based on natural language understanding (NLU) and natural language generation (NLG). Statistical machine translation methods translate a sentence W in the source language into a sentence A in the target language using a statistical model that estimates the probability of A given W, i.e. p(A|W). Conventionally, p(A|W) is optimized on a set of sentence pairs that are translations of one another, which makes the estimate prone to data sparseness. To alleviate this problem and hence enhance both the accuracy and robustness of estimating p(A|W), we proposed a statistical concept-based machine translation paradigm that predicts A using not only W but also the underlying concepts embedded in W and/or A. As a result, the optimal sentence A is picked by first understanding the meaning of the source sentence W.</Paragraph>
<Paragraph position="1"> Let C denote the concepts in the source language and S denote the concepts in the target language. Our statistical concept-based algorithm selects the word sequence A^ as</Paragraph>
<Paragraph position="2"> A^ = argmax_A p(A|W) = argmax_A Σ_{S,C} p(A|S,C,W) p(S|C,W) p(C|W) </Paragraph>
<Paragraph position="3"> where the conditional probabilities p(C|W), p(S|C,W), and p(A|S,C,W) are estimated by the Natural Language Understanding (NLU), Natural Concept Generation (NCG), and Natural Word Generation (NWG) procedures, respectively. The probability distributions are estimated and optimized on a pre-annotated bilingual corpus. In our MASTOR system, p(C|W) is estimated by a decision-tree-based statistical semantic parser, while p(S|C,W) and p(A|S,C,W) are estimated by maximizing the conditional entropy, as described in [2] and [7], respectively.</Paragraph>
<Paragraph position="4"> We are currently developing a new translation method that unifies statistical phrase-based translation models with the NLU/NLG-based approach above; we will discuss this work in future publications.</Paragraph>
<Paragraph position="5"> B. Fast and Memory-Efficient Machine Translation Using SIPL The other translation method we proposed for MASTOR is based on weighted finite-state transducers (WFSTs). In particular, we developed a novel phrase-based translation framework using WFSTs that achieves both memory efficiency and high speed, making it suitable for real-time speech-to-speech translation on scalable computational platforms. In this framework [15], which we refer to as Statistical Integrated Phrase Lattices (SIPLs), we statically construct a single optimized WFST encoding the entire translation model. In addition, we introduce a Viterbi decoder that efficiently combines the translation model and language model FSTs with the input lattice, resulting in translation speeds of up to thousands of words per second on a PC and hundreds of words per second on a PDA. The WFST-based approach is thus well suited to devices with limited computation and memory. We achieve this efficiency by performing more composition and graph optimization offline than in previous work (such as the determinization of the phrase segmentation transducer P), and by using a specialized decoder built around a multilayer search.</Paragraph>
<Paragraph position="6"> During offline training, we separate the entire translation lattice H into two pieces, the language model L and the translation model M:</Paragraph>
<Paragraph position="7"> M = Min(Det(P) ∘ T ∘ W), H = M ∘ L </Paragraph>
<Paragraph position="8"> where ∘ is the composition operator, Min denotes the minimization operation, and Det denotes the determinization operation; P is the phrase segmentation transducer, T is the phrase translation transducer, and W is the phrase-to-word transducer. Due to the determinizability of P, M can be computed offline using a moderate amount of memory.</Paragraph>
<Paragraph position="9"> The translation problem can then be framed as finding the best path in the full search lattice for a given input sentence/automaton I. To compute I ∘ M ∘ L efficiently, we developed a multilayer search algorithm.</Paragraph>
<Paragraph position="10"> Specifically, we have one layer for each of the input FSMs: I, L, and M. At each layer, the search proceeds via a state traversal procedure that starts from the start state s_0 and consumes one input word per step in a left-to-right manner.</Paragraph>
<Paragraph position="11"> We represent each state s in the search space by the 7-tuple (s_I, s_M, s_L, c_M, c_L, h, s_prev), where s_I, s_M, and s_L record the current state in each input FSM; c_M and c_L record the accumulated cost in M and L along the best path up to this point; h records the target word sequence labeling the best path up to this point; and s_prev records the best previous state.</Paragraph>
<Paragraph position="12"> To reduce the search space, two active search states are merged whenever they have identical s_I, s_M, and s_L values; the remaining state components are inherited from the state with the lower cost.</Paragraph>
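For illustration, a minimal sketch of the state bookkeeping described above, including the merging of states that agree on (s_I, s_M, s_L) and the beam and histogram pruning discussed in the next paragraph; the data layout, costs, and limits are hypothetical, not the actual SIPL decoder.

```python
# Hypothetical sketch of the multilayer-search bookkeeping described above:
# active states are keyed by (s_I, s_M, s_L); colliding states are merged
# by keeping the lower-cost one, and pruning keeps only promising states.
from dataclasses import dataclass

@dataclass
class State:
    s_I: int              # current state in the input automaton I
    s_M: int              # current state in the translation model M
    s_L: int              # current state in the language model L
    c_M: float = 0.0      # accumulated cost in M on the best path
    c_L: float = 0.0      # accumulated cost in L on the best path
    h: tuple = ()         # target words labeling the best path
    s_prev: object = None # best previous state, for backtrace

    @property
    def cost(self):
        return self.c_M + self.c_L

def merge(active, new_state):
    """Merge states with identical (s_I, s_M, s_L), keeping the cheaper."""
    key = (new_state.s_I, new_state.s_M, new_state.s_L)
    old = active.get(key)
    if old is None or new_state.cost < old.cost:
        active[key] = new_state

def prune(active, beam=10.0, max_states=1000):
    """Threshold (beam) pruning plus histogram pruning."""
    if not active:
        return active
    best = min(s.cost for s in active.values())
    kept = sorted((s for s in active.values() if s.cost <= best + beam),
                  key=lambda s: s.cost)
    return {(s.s_I, s.s_M, s.s_L): s for s in kept[:max_states]}

# Toy usage: two hypotheses reach the same (s_I, s_M, s_L); the cheaper wins.
active = {}
merge(active, State(1, 4, 7, c_M=2.0, c_L=1.0, h=("book",)))
merge(active, State(1, 4, 7, c_M=1.5, c_L=1.0, h=("the", "book")))
active = prune(active)
print(active[(1, 4, 7)].h)  # -> ('the', 'book')
```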
<Paragraph position="13"> In addition, two pruning methods, histogram pruning and threshold (beam) pruning, are used to achieve the desired balance between translation accuracy and speed. Because PDA devices lack a floating-point processor, the search algorithm is also implemented in fixed-point arithmetic so that the decoder can run on hand-held hardware.</Paragraph> </Section> </Paper>