<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2029">
  <Title>Exploiting Variant Corpora for Machine Translation</Title>
  <Section position="3" start_page="113" end_page="113" type="metho">
    <SectionTitle>
2 Variant Corpora
</SectionTitle>
    <Paragraph position="0"> The Basic Travel Expressions Corpus (BTEC) is a collection of sentences that bilingual travel experts consider useful for people going to or coming from another country; it covers utterances in travel situations (Kikui et al., 2003). The original Japanese-English corpus consists of 500K aligned sentence pairs, and the Japanese sentences were also translated into Chinese.</Paragraph>
    <Paragraph position="1"> In addition, parts of the original English corpus were translated separately into Chinese, resulting in a variant corpus comprising 162K Chinese-English (CE) sentence pairs.</Paragraph>
    <Paragraph position="2"> Details of both the original (BTECO) and the variant (BTECV) corpus are given in Table 1, where word token refers to the number of words in the corpus and word type refers to the vocabulary size.</Paragraph>
    <Paragraph position="3"> Only 4.8% of the sentences occurred in both corpora and only 68.1% of the BTECV vocabulary was covered in the BTECO corpus.</Paragraph>
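    <Paragraph>The two statistics above (sentence overlap and vocabulary coverage) can be illustrated with a minimal sketch. The function names, the toy English sentences, and the whitespace tokenization are assumptions made for illustration only; the source does not state which corpus serves as the denominator for the overlap figure, so the sketch normalizes by the second corpus.

```python
def sentence_overlap(corpus_a, corpus_b):
    """Fraction of distinct sentences in corpus_b that also occur in corpus_a.

    Assumption: the overlap is normalized by corpus_b; the source does not
    specify the denominator.
    """
    set_a, set_b = set(corpus_a), set(corpus_b)
    if not set_b:
        return 0.0
    return len(set_a.intersection(set_b)) / len(set_b)


def vocabulary_coverage(corpus_a, corpus_b):
    """Fraction of word types in corpus_b that also appear in corpus_a.

    Assumption: sentences are already segmented into whitespace-separated words.
    """
    vocab_a = {word for sentence in corpus_a for word in sentence.split()}
    vocab_b = {word for sentence in corpus_b for word in sentence.split()}
    if not vocab_b:
        return 0.0
    return len(vocab_b.intersection(vocab_a)) / len(vocab_b)


# Hypothetical toy data standing in for the original and variant corpora.
btec_o = ["where is the station", "i would like a coffee", "thank you"]
btec_v = ["where is the station", "please give me a coffee"]

print(sentence_overlap(btec_o, btec_v))     # share of btec_v sentences found in btec_o
print(vocabulary_coverage(btec_o, btec_v))  # share of btec_v word types found in btec_o
```

Low values for both measures, as reported for BTECO and BTECV, indicate that the two corpora contribute largely different sentences and vocabulary.</Paragraph>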
    <Paragraph position="4"> A further comparison of the two corpora revealed that each variant closely reflects the linguistic structure of the source language used to produce the Chinese translations of the respective data set. The differences between the BTECO and BTECV variants can be categorized into: (1) literalness: BTECO sentences are translated on the basis of their meaning and context, resulting in freer translations than the BTECV sentences, which are translated more literally; (2) syntax: the degree of literalness also affects the syntactic structure, such as word order (CV sentences closely reflect the word order of the corresponding English sentences) or sentence type (question vs. imperative); (3) lexical choice: alternations in lexical choice also contribute largely to variations between the corpora. Moreover, most of the pronouns found in the English sentences are translated explicitly in the CV sentences, but are omitted in CO; (4) orthography: orthographic differences occur especially for proper nouns (Kanji vs. transliteration) and numbers (numerals vs. spelling-out). (Footnote 1: http://penance.is.cs.cmu.edu/iwslt2005)</Paragraph>
  </Section>
  <Section position="4" start_page="113" end_page="114" type="metho">
    <SectionTitle>
3 Corpus-based Machine Translation
</SectionTitle>
    <Paragraph position="0"> The differences between variant corpora directly affect the translation quality of corpus-based MT approaches.</Paragraph>
    <Paragraph position="1"> Simply merging variant corpora for training increases the coverage of linguistic phenomena by the obtained translation model. However, due to an increase in translation ambiguities, more erroneous translations might be generated.</Paragraph>
    <Paragraph position="2"> In contrast, the proposed method trains MT engines separately on each variant, focusing on the linguistic phenomena covered in the respective corpus. If specific linguistic phenomena are not covered by a variant corpus, the translation quality of the respective output is expected to be significantly lower.</Paragraph>
    <Paragraph position="3"> Therefore, we first judge the translation quality of all translation hypotheses created by MT engines trained on the same variant corpus by testing for statistically significant differences in the statistical scores (cf. Section 3.1). Next, we compare the outcomes of the statistical significance test between the translation hypotheses selected for each variant in order to identify the variant that best fits the given input sentence (cf. Section 3.2).</Paragraph>
    <Section position="1" start_page="113" end_page="114" type="sub_section">
      <SectionTitle>
3.1 Hypothesis Selection
</SectionTitle>
      <Paragraph position="0"> In order to select the best translation among the outputs generated by multiple MT systems, we employ an SMT-based method that scores MT outputs using multiple pairs of language models (LM) and translation models (TM) trained on different subsets of the training data. It uses a statistical test to check whether the obtained TM·LM scores of one MT output are significantly higher than those of another MT output (Akiba et al., 2002). Given an input sentence, m translation hypotheses are produced by the element MT engines, and n different TM·LM scores are assigned to each hypothesis. In order to check whether the highest-scoring hypothesis is significantly better than the other MT outputs, a multiple comparison test based on the Kruskal-Wallis test is used. If one of the MT outputs is significantly better, this output is selected. Otherwise, the output of the MT engine that performs best on a development set is selected.</Paragraph>
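      <Paragraph>The selection step can be sketched as follows. This is a simplified illustration, not the procedure of Akiba et al. (2002): it replaces the multiple comparison test with a single omnibus Kruskal-Wallis H statistic (without tie correction) that is compared against a caller-supplied chi-square critical value. The function names, the engine labels, and the score values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of score lists."""
    pooled = [score for group in groups for score in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, offset = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted += rank_sum ** 2 / len(group)
        offset += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


def select_hypothesis(scores, dev_best, critical_value):
    """Pick the output of one engine given n scores per engine.

    scores: dict mapping an engine name to its list of n scores.
    dev_best: engine to fall back on (best on a development set).
    critical_value: chi-square threshold chosen by the caller (assumption).
    Returns (engine, significant).
    """
    engines = list(scores)
    h = kruskal_wallis_h([scores[e] for e in engines])
    if h > critical_value:
        best = max(engines, key=lambda e: sum(scores[e]) / len(scores[e]))
        return best, True
    return dev_best, False


# Hypothetical log-scores: m = 3 engines, n = 4 model pairs each.
scores = {
    "A": [-10.1, -10.4, -9.8, -10.0],
    "B": [-14.2, -13.9, -14.5, -14.0],
    "C": [-17.3, -18.0, -17.1, -17.6],
}
# 5.991 is the chi-square critical value for 2 degrees of freedom at the 0.05 level.
print(select_hypothesis(scores, dev_best="B", critical_value=5.991))
```

With only a few scores per hypothesis the chi-square approximation is coarse, so the threshold should be read as illustrative rather than as a calibrated significance level.</Paragraph>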
    </Section>
    <Section position="2" start_page="114" end_page="114" type="sub_section">
      <SectionTitle>
3.2 Variant Selection
</SectionTitle>
      <Paragraph position="0"> In order to judge which variant should be selected for the translation of a given input sentence, the outcomes of the statistical significance test carried out during the hypothesis selection are employed.</Paragraph>
      <Paragraph position="1"> The hypothesis selection method is applied to each variant separately, i.e., the BTECO corpus is used to train multiple statistical model pairs (SELO), and the best translation (MTOSEL) is selected from the set of translation hypotheses created by the MT engines trained on the BTECO corpus. Accordingly, the SELV models are trained on the BTECV corpus and applied to select the best translation (MTVSEL) among the outputs of the MT engines trained on the BTECV corpus. In addition, the SELO models are used to verify whether a significant difference can be found for the translation hypothesis MTVSEL, and, vice versa, the SELV models are applied to MTOSEL.</Paragraph>
      <Paragraph position="2"> The outcomes of the statistical significance tests are then compared. If a significant difference between the statistical scores is obtained for one variant but not for the other, the significantly better hypothesis is selected as the output. However, if a significant difference is found for both variants or for neither, the translation hypothesis produced by the MT engine that performs best on a development set is selected.</Paragraph>
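      <Paragraph>The decision rule described in this section can be written down as a small sketch. It assumes that the significance tests have already been reduced to one boolean outcome per variant; how those outcomes are derived from the SELO and SELV models is not modeled here. The function and parameter names are hypothetical.

```python
def select_variant(significant_o, significant_v, mt_o_sel, mt_v_sel, dev_best_output):
    """Choose the final translation from the two per-variant selections.

    significant_o / significant_v: whether a significant score difference was
    found for the original / variant corpus (assumed to be precomputed).
    mt_o_sel / mt_v_sel: hypotheses selected for the original / variant corpus.
    dev_best_output: output of the engine that performs best on a development set.
    """
    if significant_o != significant_v:
        # Exactly one variant shows a significant difference: trust that variant.
        return mt_o_sel if significant_o else mt_v_sel
    # Both or neither are significant: fall back to the development-set winner.
    return dev_best_output


print(select_variant(True, False, "hyp_O", "hyp_V", "hyp_dev"))   # original variant wins
print(select_variant(False, True, "hyp_O", "hyp_V", "hyp_dev"))   # variant corpus wins
print(select_variant(True, True, "hyp_O", "hyp_V", "hyp_dev"))    # fallback
```

The fallback branch covers both ambiguous cases with the same choice, which mirrors the text: significance for both variants is treated as no more informative than significance for neither.</Paragraph>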
    </Section>
  </Section>
</Paper>