<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2029"> <Title>Exploiting Variant Corpora for Machine Translation</Title> <Section position="2" start_page="0" end_page="113" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Corpus-based approaches to machine translation (MT) have made considerable progress over the last decades. Despite high average performance, these approaches can still produce translations with severe errors. Input sentences featuring linguistic phenomena that are not sufficiently covered by the models used cannot be translated accurately.</Paragraph> <Paragraph position="1"> This paper proposes using multiple variant corpora, i.e., parallel text corpora that are equal in meaning but use different vocabulary and grammatical constructions to express the same content. Training on corpora of the same content from different sources results in translation models that focus on specific linguistic phenomena, thus reducing translation ambiguities compared to models trained on a larger corpus obtained by merging all variant corpora. The proposed method applies each variant model separately to an input sentence, resulting in multiple translation hypotheses. The best translation is selected according to statistical models. We show that the combination of variant translation models is effective: it outperforms not only every single-variant model but also translation models trained on the union of all variant corpora.</Paragraph> <Paragraph position="2"> In addition, we extend the proposed method to multi-engine MT. Combining multiple MT engines can further boost system performance by exploiting the strengths of each MT engine. For each variant, all MT engines are trained on the same corpus and used in parallel to translate the input. 
We first select the best translation hypotheses created by all MT engines trained on the same variant and then verify the translation quality of the translation hypotheses selected for each variant.</Paragraph> <Paragraph position="3"> In this paper, we use two variants of a parallel text corpus for Chinese (C) and English (E) from the travel domain (cf. Section 2). These variant corpora are used to acquire the translation knowledge for seven corpus-based MT engines. The method for selecting the best translation hypotheses of MT engines trained on the same variant is described in Section 3.1. Finally, the selected translations of the different variants are combined according to a statistical significance test, as described in Section 3.2. The effectiveness of the proposed method is verified in Section 4 for the Chinese-English translation task of last year's IWSLT evaluation campaign.</Paragraph> </Section> </Paper>