<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1023">
  <Title>Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Improved alignment quality
</SectionTitle>
      <Paragraph position="0"> As a starting point for comparison we trained GIZA++ using four different sized portions of the Verbmobil corpus. For each of those portions we output the most probable alignments of the testing data for Model 1, the HMM, Model 3, and Model 4, and evaluated their AERs. (Note that we stripped out probable alignments from our manually produced alignments. Probable alignments are large blocks of words which the annotator was uncertain how to align. The many possible word-to-word translations implied by the manual alignments led to lower results than with the automatic alignments, which contained fewer word-to-word translation possibilities.)</Paragraph>
      <Paragraph position="1"> Table 1 gives alignment error rates when training on 500, 2000, 8000, and 16000 sentence pairs from the Verbmobil corpus without using any word-aligned training data.</Paragraph>
      <Paragraph position="2"> We obtained much better results when incorporating word-alignments with our mixed likelihood function. Table 2 shows the results for the different corpus sizes, when all of the sentence pairs have been word-aligned. The best performing model in the unmodified GIZA++ code was the HMM trained on 16,000 sentence pairs, which had an alignment error rate of 12.04%. In our modified code the best performing model was Model 4 trained on 16,000 sentence pairs (where all the sentence pairs are word-aligned), with an alignment error rate of 7.52%. The difference in the best performing models represents a 38% relative reduction in AER. Interestingly, we achieve a lower AER than the best performing unmodified models using a corpus that is one-eighth the size of the sentence-aligned data.</Paragraph>
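The AER figures above follow the standard definition of Och and Ney (2003): given a set S of sure links, a superset P of probable links, and hypothesis links A, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch (the link sets below are invented for illustration):

```python
# Alignment Error Rate (Och and Ney, 2003):
#   AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
# where S = sure links, P = probable links (S is a subset of P),
# and A = hypothesis links produced by the aligner.

def aer(sure, probable, hypothesis):
    """All arguments are sets of (source_index, target_index) link pairs."""
    a, s, p = set(hypothesis), set(sure), set(probable)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Invented alignments for a 3-word sentence pair.
sure = {(0, 0), (1, 1)}
probable = sure | {(2, 2)}
hyp = {(0, 0), (1, 1), (2, 2)}
print(aer(sure, probable, hyp))  # hypothesis matches all links -> 0.0
```

Because probable links only appear in the numerator, a hypothesis is never penalized for proposing them, which is why stripping uncertain probable blocks from the gold standard (as described above) changes the measured error.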
      <Paragraph position="3"> Figure 1 shows an example of the improved alignments that are achieved when using the word-aligned data. The example alignments are for held-out sentence pairs that were aligned after training on 500 sentence pairs. The alignments produced when training on word-aligned data are dramatically better than when training on sentence-aligned data.</Paragraph>
      <Paragraph position="4"> We contrasted these improvements with the improvements that are to be had from incorporating a bilingual dictionary into the estimation process. For this experiment we allowed a bilingual dictionary to constrain which words can act as translations of each other during the initial estimates of translation probabilities (as described in Och and Ney (2003)).</Paragraph>
      <Paragraph position="5"> As can be seen in Table 3, using a dictionary reduces the AER when compared to using GIZA++ without a dictionary, but not as dramatically as integrating the word-alignments. We further tried combining a dictionary with our word-alignments but found that the dictionary results in only very minimal improvement.</Paragraph>
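The dictionary constraint described above restricts which word pairs receive probability mass when translation probabilities are initialized. A minimal sketch of one way such a constraint could be applied to a Model 1-style translation table (this is not the GIZA++ implementation, and the words and dictionary are invented):

```python
# Hypothetical sketch: constrain initial translation probabilities t(e|f)
# with a bilingual dictionary. Pairs listed in the dictionary share the
# probability mass uniformly; unlisted pairs start at zero. Source words
# absent from the dictionary fall back to a uniform distribution.

def init_translation_table(src_vocab, tgt_vocab, dictionary):
    """dictionary maps each source word to the set of allowed target words."""
    table = {}
    for f in src_vocab:
        allowed = dictionary.get(f, set(tgt_vocab))  # unconstrained fallback
        table[f] = {e: (1.0 / len(allowed) if e in allowed else 0.0)
                    for e in tgt_vocab}
    return table

t = init_translation_table(
    ["haus"], ["house", "home", "the"], {"haus": {"house", "home"}})
print(t["haus"]["house"])  # 0.5; t["haus"]["the"] starts at 0.0
```

Zeroed entries stay zero under EM re-estimation, so the constraint persists through training, which is why the dictionary shapes the final alignments at all.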
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Improved translation quality
</SectionTitle>
      <Paragraph position="0"> The fact that using word-aligned data in estimating the parameters for machine translation leads to better alignments is predictable. A more significant result is whether it leads to improved translation quality. In order to test that our improved parameter estimates lead to better translation quality, we used a state-of-the-art phrase-based decoder to translate a held out set of German sentences into English. The phrase-based decoder extracts phrases from the word alignments produced by GIZA++, and computes translation probabilities based on the frequency of one phrase being aligned with another (Koehn et al., 2003). We trained a language model using the 34,000 English sentences from the training set.</Paragraph>
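The phrase translation probabilities mentioned above are relative frequencies over the phrase pairs extracted from the word alignments: phi(f|e) = count(f, e) / count(e). A small sketch with invented phrase pairs:

```python
from collections import Counter

# Relative-frequency phrase translation probability used in phrase-based
# models (Koehn et al., 2003): phi(f|e) = count(f aligned with e) / count(e).
# The extracted phrase pairs below are invented for illustration.

def phrase_probs(extracted_pairs):
    """extracted_pairs is a list of (foreign_phrase, english_phrase) tuples."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

pairs = [("das haus", "the house"), ("das haus", "the house"),
         ("ein haus", "the house")]
probs = phrase_probs(pairs)
print(probs[("das haus", "the house")])  # 2/3
```

Since the counts come directly from the extracted pairs, noisier word alignments yield noisier phrase tables, which is the mechanism by which better alignments can improve translation quality.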
      <Paragraph position="1"> Table 4 shows that using word-aligned data leads to better translation quality than using sentence-aligned data. In particular, significantly less data is needed to achieve a high Bleu score when using word alignments. Training on a corpus of 8,000 sentence pairs with word alignments results in a higher Bleu score than training on a corpus of 16,000 sentence pairs without word alignments.</Paragraph>
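For reference, the Bleu scores reported here combine modified n-gram precisions with a brevity penalty (Papineni et al., 2002). A simplified single-sentence, single-reference sketch (real evaluations accumulate counts over the whole test corpus):

```python
import math
from collections import Counter

# Simplified single-reference Bleu: geometric mean of clipped n-gram
# precisions for n = 1..4, times a brevity penalty. Illustrative only;
# corpus-level Bleu sums counts over all test sentences before combining.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(1, sum(cand.values()))
        if clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec += math.log(clipped / total) / max_n
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(1, len(c)))
    return bp * math.exp(log_prec)

print(bleu("the house is small", "the house is small"))  # identical -> 1.0
```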
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Weighting the word-aligned data
</SectionTitle>
      <Paragraph position="0"> We have seen that using training data consisting of entirely word-aligned sentence pairs leads to better alignment accuracy and translation quality.</Paragraph>
      <Paragraph position="1"> However, because manually word-aligning sentence pairs costs more than just using sentence-aligned data, it is unlikely that we will ever want to label an entire corpus. Instead we will likely have a relatively small portion of the corpus word aligned.</Paragraph>
      <Paragraph position="2"> We want to be sure that this small amount of data labeled with word alignments does not get overwhelmed by a larger amount of unlabeled data.</Paragraph>
      <Paragraph position="3"> Thus we introduced the λ weight into our mixed likelihood function.</Paragraph>
      <Paragraph position="4"> Table 5 compares the natural setting of λ (where it is proportional to the amount of labeled data in the corpus) to a value that amplifies the contribution of the word-aligned data. Figure 2 shows a variety of values for λ: as λ increases, AER decreases. Placing nearly all the weight onto the word-aligned data seems to be most effective. Note that this did not vary the training data size - only the relative contributions between sentence- and word-aligned training material.</Paragraph>
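The λ weighting described above can be pictured as a convex combination of the two data sources' log-likelihoods, with λ scaling the word-aligned portion and (1 - λ) the sentence-aligned portion. A toy sketch, where the per-pair log-likelihood functions are hypothetical placeholders rather than the actual model likelihoods:

```python
# Sketch of a lambda-weighted mixed log-likelihood: word-aligned (labeled)
# pairs are scaled by lam, sentence-aligned (unlabeled) pairs by (1 - lam).
# ll_labeled and ll_unlabeled are placeholder per-pair log-likelihoods,
# standing in for the model's actual alignment likelihoods.

def mixed_log_likelihood(word_aligned, sentence_aligned, lam,
                         ll_labeled, ll_unlabeled):
    """lam near 1.0 amplifies the word-aligned data's contribution."""
    labeled = sum(ll_labeled(pair) for pair in word_aligned)
    unlabeled = sum(ll_unlabeled(pair) for pair in sentence_aligned)
    return lam * labeled + (1.0 - lam) * unlabeled

# Toy usage with constant placeholder likelihoods:
val = mixed_log_likelihood([1, 2], [3, 4, 5], 0.9,
                           lambda p: -1.0, lambda p: -2.0)
print(val)  # 0.9 * (-2.0) + 0.1 * (-6.0) = -2.4
```

At λ = 1 the sentence-aligned term vanishes entirely, matching the observation in the text that the purely sentence-aligned data is then ignored.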
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Ratio of word- to sentence-aligned data
</SectionTitle>
      <Paragraph position="0"> We also varied the ratio of word-aligned to sentence-aligned data and evaluated the AER and Bleu scores, fixing λ at a high value (0.9).</Paragraph>
      <Paragraph position="1"> Figure 3 shows how AER improves as more word-aligned data is added. Each curve on the graph represents a corpus size and shows its reduction in error rate as more word-aligned data is added. For example, the bottom curve shows the performance of a corpus of 16,000 sentence pairs which starts with an AER of just over 12% with no word-aligned training data and decreases to an AER of 7.5% when all 16,000 sentence pairs are word-aligned. This curve essentially levels off after 30% of the data is word-aligned. This shows that a small amount of word-aligned data is very useful, and if we wanted to achieve a low AER, we would only have to label 4,800 examples with their word alignments rather than the entire corpus.</Paragraph>
      <Paragraph position="2"> Figure 4 shows how the Bleu score improves as more word-aligned data is added. (At λ = 1, not shown in Figure 2, the data that is only sentence-aligned is ignored, and the AER is therefore higher.) This graph also reinforces the fact that a small amount of word-aligned data is useful. A corpus of 8,000 sentence pairs with only 800 of them labeled with word alignments achieves a higher Bleu score than a corpus of 16,000 sentence pairs with no word alignments.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Evaluation using a larger training corpus
</SectionTitle>
      <Paragraph position="0"> We additionally tested whether incorporating word-level alignments into the estimation improved results for a larger corpus. We repeated our experiments using the Canadian Hansards French-English parallel corpus. Figure 6 gives a summary of the improvements in AER and Bleu score for that corpus, when testing on a held out set of 484 hand-aligned sentences.</Paragraph>
      <Paragraph position="1"> On the whole, alignment error rates are higher and Bleu scores are considerably lower for the Hansards corpus. This is probably due to differences in the corpora. Whereas the Verbmobil corpus has a small vocabulary (&lt;10,000 words per language), the Hansards corpus has ten times that many vocabulary items and a much longer average sentence length. This made it more difficult for us to create a simulated set of hand alignments; we measured the AER of our simulated alignments at 11.3% (compared to 6.5% for our simulated alignments for the Verbmobil corpus).</Paragraph>
      <Paragraph position="2"> Nevertheless, the trend of decreased AER and increased Bleu score still holds. For each size of training corpus we tested we found better results using the word-aligned data.</Paragraph>
    </Section>
  </Section>
</Paper>