<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3121">
  <Title>Phramer - An Open Source Statistical Phrase-Based Translator</Title>
  <Section position="4" start_page="146" end_page="148" type="metho">
    <SectionTitle>
3 WMT06 Shared Task
</SectionTitle>
    <Paragraph position="0"> We have assembled a system for participation in the WMT 2006 shared task based on Phramer and other tools. We participated in 5 subtasks: DE-EN, FR-EN, ES-EN, EN-FR and EN-ES.</Paragraph>
    <Section position="1" start_page="146" end_page="146" type="sub_section">
      <SectionTitle>
3.1 Baseline system
3.1.1 Translation table generation
</SectionTitle>
      <Paragraph position="0"> To generate a translation table for each pair of languages, starting from a sentence-aligned parallel corpus, we used a modified version of the Pharaoh training software. The software also required the GIZA++ word alignment tool (Och and Ney, 2003).</Paragraph>
      <Paragraph position="1"> For each phrase pair in the translation table we generated 5 features: phrase translation probability (both directions), lexical weighting (Koehn et al., 2003) (both directions), and phrase penalty (a constant value).</Paragraph>
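As an illustration, the two relative-frequency translation probabilities and the constant phrase penalty can be computed from phrase-pair counts as in the minimal Python sketch below (made-up counts and function name; the two lexical-weighting features additionally need the word alignments, so they are omitted):

```python
from collections import Counter

# Made-up phrase-pair counts; in the real pipeline these come from the
# phrases extracted out of the GIZA++ word alignments.
pair_counts = Counter({
    ("la casa", "the house"): 8,
    ("la casa", "home"): 2,
    ("casa", "house"): 5,
})

src_totals, tgt_totals = Counter(), Counter()
for (src, tgt), c in pair_counts.items():
    src_totals[src] += c
    tgt_totals[tgt] += c

def phrase_features(src, tgt):
    """Three of the five features: relative-frequency phrase translation
    probability in both directions, plus the constant phrase penalty.
    Lexical weighting (both directions) would also need the alignments."""
    c = pair_counts[(src, tgt)]
    return {
        "p(tgt|src)": c / src_totals[src],
        "p(src|tgt)": c / tgt_totals[tgt],
        "phrase_penalty": 2.718,  # a constant value for every phrase pair
    }

print(phrase_features("la casa", "the house"))
```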
      <Paragraph position="2">  The Phramer decoder was used to translate the devtest2006 and test2006 files. We accelerated the decoding process by using the distributed decoding tool.</Paragraph>
      <Paragraph position="3">  We determined the weights to combine the models using the MERT component in Phramer. Because of the time constraints for the shared task submission, we ran MERT with the Pharaoh decoder. Before the optimizations (LM optimizations, fixing bugs that affected performance), Phramer was 5 to 15 times slower than Pharaoh.</Paragraph>
      <Paragraph position="4">  We removed from the source text the words that appear neither in the source side of the training corpus (so we know the translation table will not be able to translate them) nor in the language model for the target language (so we estimate there is little chance that the untranslated word is actually part of the reference translation). The purpose of this procedure is to minimize the risk of inserting words into the automatic translation that are not in the reference translation. We applied this preprocessing step only when the target language was English.</Paragraph>
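This filtering step can be sketched as follows (hypothetical vocabularies and function name; the real system works on tokenized corpus files):

```python
def filter_oov(sentence, source_vocab, target_lm_vocab):
    """Drop source tokens seen neither in the training corpus's source side
    nor in the target-language LM vocabulary: the former cannot be
    translated, and the latter are unlikely to appear in the reference."""
    return [tok for tok in sentence.split()
            if tok in source_vocab or tok in target_lm_vocab]

src_vocab = {"existen", "asentamientos", "en"}   # seen in training source side
lm_vocab = {"Gaza"}  # e.g. a proper noun known to the English LM
print(filter_oov("existen asentamientos zzqx en Gaza", src_vocab, lm_vocab))
```

The unknown token `zzqx` is dropped, while `Gaza`, unseen in the source side but present in the target LM, is kept and passed through untranslated.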
    </Section>
    <Section position="2" start_page="146" end_page="147" type="sub_section">
      <SectionTitle>
3.2 Enhancements to the baseline systems
</SectionTitle>
      <Paragraph position="0"> Our goal was to improve the translation quality by enhancing the translation table.</Paragraph>
      <Paragraph position="1"> The following enhancements were implemented:
* reduce the vocabulary size perceived by GIZA++ and preset the alignment for certain words
* &amp;quot;normalize&amp;quot; the distortion between pairs of languages by reordering noun-adjective constructions
The first enhancement identifies pairs of tokens in the parallel sentences that, with a very high probability, align to each other and to no other tokens in the sentence. These tokens are replaced with special identifiers, chosen so that GIZA++ learns the alignment between them more easily than before the replacement. The targeted token types are proper nouns (detected when the same upper-cased token is present in both the foreign sentence and the English sentence) and numbers, taking into account the differences in number representation between languages (e.g., 399.99 vs. 399,99). Each distinct proper noun in a sentence was replaced with a specific identifier, distinct from the other replacement identifiers already used in that sentence; the same procedure was also applied to numbers. The specific identifiers were reused in other sentences. This reduces the vocabulary and provides a large number of instances for the special token forms. The change in</Paragraph>
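A simplified sketch of this replacement, assuming whitespace-tokenized input and using a single [xN] identifier family for both proper nouns and numbers (the paper's actual output also uses a separate family such as [y1] for proper nouns):

```python
import re

NUM = re.compile(r"\d+([.,]\d+)?$")

def canon(tok):
    """Canonical form for matching: numbers with ',' or '.' as the decimal
    separator compare equal (399,99 vs. 399.99)."""
    return tok.replace(",", ".") if NUM.fullmatch(tok) else tok

def replace_pairs(foreign, english):
    """Replace proper nouns (identical capitalized tokens in both sentences)
    and matching numbers with shared identifiers, one real token to one
    special token, so GIZA++ sees a much smaller vocabulary with many
    instances per special token."""
    f, e = foreign.split(), english.split()
    out_f, out_e = f[:], e[:]
    next_id = 1
    for i, ft in enumerate(f):
        is_num = bool(NUM.fullmatch(ft))
        is_proper = ft[:1].isupper() and not is_num
        if not (is_proper or is_num):
            continue
        matches = [j for j, et in enumerate(e) if canon(et) == canon(ft)]
        if len(matches) == 1 and f.count(ft) == 1:  # unambiguous one-to-one
            ident = f"[x{next_id}]"
            next_id += 1
            out_f[i] = out_e[matches[0]] = ident
    return " ".join(out_f), " ".join(out_e)

f, e = replace_pairs("Existen 145 asentamientos en Gaza",
                     "There are 145 settlements in Gaza")
print(f)  # Existen [x1] asentamientos en [x2]
print(e)  # There are [x1] settlements in [x2]
```

Ambiguous cases (a token occurring twice, or matching two candidates) are left untouched here, mirroring the restriction to one-to-one replacements described below.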
    </Section>
    <Section position="3" start_page="147" end_page="147" type="sub_section">
      <SectionTitle>
ment
</SectionTitle>
      <Paragraph position="0"> the vocabulary size is shown in Table 1. To simplify the process, we limited the replacement of tokens to one-to-one (one real token to one special token), so that the word-alignment file can be used directly with the original parallel corpus to extract the phrases required for generating the translation table. Table 2 shows an example of the output.</Paragraph>
      <Paragraph position="1"> The second enhancement tries to improve the quality of the translation by rearranging the words in the source sentence to better match the correct word order in the target language (Collins et al., 2005).</Paragraph>
      <Paragraph position="2"> We focused on a very specific pattern - based on the part-of-speech tags, changing the order of NN-ADJ phrases in the non-English sentences. This process was also applied to the input dev/test files, when the target language was English. Figure 1 shows the re-ordering process and its effect on the alignment.</Paragraph>
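A minimal sketch of such a reordering pass, with an illustrative tagset (the actual system used TreeTagger's Spanish tags and its own pattern matching):

```python
def reorder_nn_adj(tokens, tags):
    """Swap adjacent NOUN-ADJ pairs in the foreign (e.g. Spanish) sentence
    so its word order better matches the English ADJ-NOUN order.
    The tag names here are illustrative, not TreeTagger's."""
    toks, tg = list(tokens), list(tags)
    i = 0
    while i < len(toks) - 1:
        if tg[i] == "NOUN" and tg[i + 1] == "ADJ":
            toks[i], toks[i + 1] = toks[i + 1], toks[i]
            tg[i], tg[i + 1] = tg[i + 1], tg[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return toks

print(reorder_nn_adj(["la", "casa", "blanca"], ["DET", "NOUN", "ADJ"]))
# -> ['la', 'blanca', 'casa']
```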
      <Paragraph position="3"> The expected benefits are: * Better word alignment, because the actual alignment is closer to the expected (monotone) alignment.</Paragraph>
      <Paragraph position="4"> * More phrases extracted from the word aligned corpus. Monotone alignment tends to generate more phrases than a random alignment.</Paragraph>
      <Paragraph position="5"> * A higher mixture weight for the monotone distortion model: because less reordering is needed, MERT increases the value of the monotone distortion model, &amp;quot;tightening&amp;quot; the translation.</Paragraph>
    </Section>
    <Section position="4" start_page="147" end_page="148" type="sub_section">
      <SectionTitle>
3.3 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> We implemented the second enhancement on the ES-EN subtask by part-of-speech tagging the Spanish text using TreeTagger, followed by the NN-ADJ inversion heuristic.</Paragraph>
      <Paragraph position="1"> The language models provided for the task were used.</Paragraph>
      <Paragraph position="2"> We used 1,000 of the 2,000 sentences in each of the dev2006 datasets to determine, through MERT, the weights for the 8 models used during decoding (one monotone distortion model, one language model, five translation models, one sentence length model). The weights were determined individually for each pair of source-target languages.
Table 2 example (a sentence pair before and after token replacement):
EN: There are 145 settlements in the West Bank , 16 in Gaza , 9 in East Jerusalem ; 400,000 people live in them .
ES: Existen 145 asentamientos en Cisjordania , 16 en Gaza y 9 en Jerusalén Este ; en ellos viven 400.000 personas .
EN (replaced): There are [x1] settlements in the West Bank , [x2] in [y1] , [x3] in East Jerusalem ; [x4] people live in them .
ES (replaced): Existen [x1] asentamientos en Cisjordania , [x2] en [y1] y [x3] en Jerusalén Este ; en ellos viven [x4] personas .
Using these weights, we measured the BLEU score on the devtest2006 datasets. Based on the model chosen, we decoded the test2006 datasets using the same weights as for devtest2006.</Paragraph>
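The decoder combines the 8 model scores log-linearly; a toy sketch with made-up feature names, log-scores, and weights standing in for the MERT-tuned values:

```python
# Made-up names, weights, and log-scores for the 8 models: one monotone
# distortion model, one language model, five translation models, and one
# sentence-length model. MERT tunes the weights to maximize BLEU on the
# held-out development half.
weights = {"distortion": 0.1, "lm": 0.5, "p_t_s": 0.2, "p_s_t": 0.2,
           "lex_t_s": 0.1, "lex_s_t": 0.1, "phrase_pen": -0.2, "length": 0.3}

def score(log_feats):
    """Log-linear model: the decoder ranks translation hypotheses by the
    weighted sum of their per-model log-scores."""
    return sum(weights[k] * log_feats[k] for k in weights)

hyp = {"distortion": -2.0, "lm": -10.0, "p_t_s": -1.0, "p_s_t": -1.5,
       "lex_t_s": -2.0, "lex_s_t": -2.5, "phrase_pen": -3.0, "length": -4.0}
print(score(hyp))
```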
    </Section>
  </Section>
</Paper>