<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1018"> <Title>Domain Portability in Speech-to-Speech Translation</Title>
<Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 4. EXPERIMENT 2: PORTING TO A NEW DOMAIN USING A HYBRID RULE-BASED AND STATISTICAL ANALYSIS APPROACH </SectionTitle>
<Paragraph position="0"> We are developing a new, alternative analysis approach for our interlingua-based speech-translation systems that combines rule-based and statistical methods and that, we believe, inherently supports faster porting to new domains. The main aspects of the approach are the following. Rather than developing complete semantic grammars for analyzing utterances into our interlingua (either completely manually or using grammar-induction techniques), we separate the task into two main levels. We continue to develop and maintain rule-based grammars for phrases that correspond to argument-level concepts of our interlingua representation (e.g., time expressions, locations, and symptom names). However, instead of developing grammar rules for assembling the argument-level phrases into appropriate domain actions, we apply machine learning and classification techniques [1] to learn these mappings from a corpus of interlingua-tagged utterances. (Earlier work on this task is reported in [6].) We believe this approach should prove more suitable for fast porting to new domains for the following reasons. Many of the required argument-level phrase grammars for a new domain are likely to be covered by existing grammar modules, as can be seen from the XDM (cross-domain) nodes in Figure 1.</Paragraph>
<Paragraph position="1"> The remaining new phrase grammars are fairly fast and straightforward to develop. The central questions, however, are whether the statistical methods used for classifying strings of arguments into domain actions are accurate enough, and how much tagged data is required to obtain reasonable levels of performance. To assess the latter question, we tested the performance of the current speech-act and concept classifiers for the expanded travel domain when trained on increasing amounts of data. The results of these experiments are shown in Figure 3. We also report the performance of the domain-action classification derived from the combined speech-act and concept classifications. As can be seen, performance reaches a relative plateau at around 4000-5000 utterances. We take these results as an indication that this approach should indeed prove significantly easier to port to new domains. Creating a tagged database of this order of magnitude can be done in a few weeks, rather than the months required for complete manual grammar development.</Paragraph>
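As a concrete illustration of this two-level division, the sketch below tags argument-level concepts with a few toy rule-based patterns and then classifies the resulting argument string into a domain action with a simple Naive Bayes model. The phrase patterns, argument labels, domain actions, and training examples are hypothetical stand-ins; the actual system uses the full grammar modules and classifiers trained on the interlingua-tagged corpus described above.

```python
# Minimal sketch of the two-level analysis idea (hypothetical names and data).
# Level 1: rule-based patterns map word spans to argument-level concepts such
# as [time] or [symptom].  Level 2: a bag-of-arguments Naive Bayes classifier
# maps the resulting argument string to a domain action.
import math
import re
from collections import Counter, defaultdict

# Level 1: toy "phrase grammars" as regular expressions (stand-ins for the
# hand-written grammar modules).
ARGUMENT_PATTERNS = {
    "time":     re.compile(r"\b(\d{1,2}(:\d{2})?\s*(am|pm)|tomorrow|tonight)\b", re.I),
    "location": re.compile(r"\b(airport|hotel|station|downtown)\b", re.I),
    "symptom":  re.compile(r"\b(fever|headache|cough|nausea)\b", re.I),
}

def tag_arguments(utterance):
    """Return the list of argument-level concepts found in an utterance."""
    return [name for name, pat in ARGUMENT_PATTERNS.items() if pat.search(utterance)]

# Level 2: Naive Bayes over argument-label features, learned from pairs of
# (argument list, domain action) derived from interlingua-tagged utterances.
class DomainActionClassifier:
    def __init__(self):
        self.action_counts = Counter()
        self.feature_counts = defaultdict(Counter)

    def train(self, examples):
        for args, action in examples:
            self.action_counts[action] += 1
            for a in args:
                self.feature_counts[action][a] += 1

    def classify(self, args):
        total = sum(self.action_counts.values())
        best, best_score = None, float("-inf")
        for action, count in self.action_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[action].values()) + len(ARGUMENT_PATTERNS)
            for a in args:
                # add-one smoothing over the argument-label vocabulary
                score += math.log((self.feature_counts[action][a] + 1) / denom)
            if score > best_score:
                best, best_score = action, score
        return best

# Hypothetical training corpus: argument strings paired with domain actions.
corpus = [
    (["time", "location"], "request-information+reservation"),
    (["symptom"],          "give-information+symptom"),
    (["time"],             "request-information+schedule"),
]
clf = DomainActionClassifier()
clf.train(corpus)
utterance = "I have had a fever since tonight"
print(tag_arguments(utterance))
print(clf.classify(tag_arguments(utterance)))
```

In a realistic setting, the regular expressions would be replaced by the existing argument-level phrase grammars, and the classifier would be trained on the several thousand tagged utterances discussed above.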
</Section>
<Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 5. EXPERIMENT 3: PORTING THE SPEECH RECOGNIZER TO NEW DOMAINS </SectionTitle>
<Paragraph position="0"> When the speech recognition components (acoustic models, pronunciation dictionary, vocabulary, and language model) are ported across domains and languages, three main types of mismatch occur: (1) recording-condition mismatches; (2) speaking-style mismatches; and (3) vocabulary and language-model mismatches. In the past, these problems have mostly been addressed by collecting large amounts of acoustic data for acoustic model training and pronunciation dictionary development, as well as large amounts of text data for vocabulary coverage and language model estimation. However, especially for highly specialized domains and conversational speaking styles, large databases are not always available. Therefore, our research has focused on how to build LVCSR systems for new tasks and languages [7, 9] using only a limited amount of data. In this third experiment, we investigate porting the speech recognition component of our MT system to several new domains. The experiments and improvements were conducted with the Janus Speech Recognition Toolkit (JRTk) [13].</Paragraph>
<Paragraph position="1"> Table 2 shows the results of porting four baseline speech recognition systems to the doctor-patient domain and to the meeting domain. The four baseline systems are trained on Broadcast News (BN), the English Spontaneous Scheduling Task (ESST), combined BN and ESST, and the travel-planning domain of the C-STAR consortium (http://www.c-star.org). These tasks cover a range of domain sizes, speaking styles, and recording conditions, from clean spontaneous speech in a very limited domain (ESST, C-STAR) to highly conversational multi-party speech in an extremely broad domain (Meeting). As a consequence, the error rates on the meeting data are quite high, but with MAP (maximum a posteriori) acoustic model adaptation and language model adaptation, the error rate can be reduced by about 10.2% relative to the BN baseline system. For the doctor-patient data, the reduction in error rate was less pronounced, which can be explained by the similar speaking style and recording conditions of the C-STAR and doctor-patient data.</Paragraph>
<Paragraph position="2"> Details about the recognition engines used can be found in [10] for ESST and in [11] for the BN system.</Paragraph>
</Section> </Paper>
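To illustrate the acoustic-model side of this adaptation, the following is a minimal numpy sketch of the standard MAP mean update for a single Gaussian of an HMM/GMM acoustic model. The feature dimensionality, prior weight tau, and frame posteriors are hypothetical and are not taken from the JRTk systems described above.

```python
# Standard MAP update of a Gaussian mean (textbook form, not the exact JRTk
# implementation):
#   mu_hat = (tau * mu0 + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
# mu0 is the out-of-domain prior mean, x_t are adaptation frames from the new
# domain, gamma_t are their posterior occupancies for this Gaussian, and tau
# controls how strongly the prior resists the (possibly scarce) in-domain data.
import numpy as np

def map_adapt_mean(mu0, feats, gammas, tau=10.0):
    """Return the MAP-adapted mean for one Gaussian."""
    gammas = np.asarray(gammas, dtype=float)
    weighted_sum = (gammas[:, None] * feats).sum(axis=0)
    occupancy = gammas.sum()
    return (tau * mu0 + weighted_sum) / (tau + occupancy)

# Hypothetical example: a 39-dimensional mean adapted with 200 in-domain frames.
rng = np.random.default_rng(0)
mu0 = rng.normal(size=39)                              # prior (out-of-domain) mean
feats = mu0 + 0.5 + 0.1 * rng.normal(size=(200, 39))   # shifted in-domain frames
gammas = rng.uniform(0.5, 1.0, size=200)               # posterior occupancies
mu_hat = map_adapt_mean(mu0, feats, gammas)
print(np.round(mu_hat[:5] - mu0[:5], 3))               # mean moves toward the data
```

With little adaptation data the occupancy term stays small and the adapted mean remains close to the prior, which is why this style of adaptation is attractive when large in-domain databases are not available.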