<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0836">
  <Title>Training and Evaluating Error Minimization Rules for Statistical Machine Translation</Title>
  <Section position="6" start_page="212" end_page="213" type="evalu">
    <SectionTitle>6 Results</SectionTitle>
    <Paragraph position="0"> The results in Table 1 compare the BLEU score achieved by each training method on the development and test data for both Pharaoh and CMU-Pharaoh. Score-sampling training was run for 150 iterations to find the scaling vector λ for each decision rule. MAP-MER training was performed to evaluate the effect of the greedy search method on the generalization of the development set results. Each row represents an alternative training method described in this paper, while the test set columns indicate the criteria used to select the final translation output e. The boldface scores are the scores for matching training and testing methods. The underlined score is the highest test set score, achieved by MBR decoding using the CMU-Pharaoh system trained for the MBR decision rule with the score-sampling algorithm. When comparing MER training for MAP decoding with score-sampling training for MAP decoding, score-sampling surprisingly outperforms MER training for both Pharaoh and CMU-Pharaoh, although MER training is specifically tailored to the MAP metric. Note, however, that our score-sampling algorithm has a considerably longer running time (several hours) than the MER algorithm (several minutes). Interestingly, within MER train- [...] former; we believe the reason for this disparity between training and test methods is the impact of phrasal consistency as a valuable measure within the n-best list.</Paragraph>
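The score-sampling training referenced above searches for a scaling vector λ over many iterations. The paper's exact algorithm is not reproduced here; the following is a generic random-search sketch of the same idea — sample candidate scaling vectors, rescore fixed n-best lists with each, and keep the vector whose 1-best selections score highest on a development metric. All names, the sampling range, and the toy metric are illustrative assumptions.

```python
import random

def rescore_best(nbest_feats, lam):
    """Index of the hypothesis whose feature vector has the highest
    dot product with the scaling vector lam."""
    return max(range(len(nbest_feats)),
               key=lambda i: sum(l * f for l, f in zip(lam, nbest_feats[i])))

def random_search(nbest_lists, metric, dim, iters=150, seed=0):
    """Generic random search over the scaling vector (a stand-in for
    the paper's score-sampling procedure, not its actual algorithm):
    sample candidate vectors and keep the one whose 1-best selections
    score highest under `metric`."""
    rng = random.Random(seed)
    best_lam, best_score = None, float("-inf")
    for _ in range(iters):
        lam = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        picks = [rescore_best(feats, lam) for feats in nbest_lists]
        score = metric(picks)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score

# Toy development set: two sentences with two hypotheses each, given as
# 2-dimensional feature vectors; the metric rewards picking index 0.
nbest_lists = [[(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (0.0, 1.0)]]
metric = lambda picks: sum(1 for p in picks if p == 0)
best_lam, best_score = random_search(nbest_lists, metric, dim=2)
```

The running-time gap noted in the text follows directly from this shape: each of the 150 iterations rescores every n-best list in the development set, whereas MER training performs a directed line search.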
    <Paragraph position="1"> The relative performance of MBR score-sampling with respect to MAP and 0/1-loss score-sampling differs markedly between Pharaoh and CMU-Pharaoh: while MBR score-sampling performs worse than MAP and 0/1-loss score-sampling for Pharaoh, it yields the best test scores across the board for CMU-Pharaoh.</Paragraph>
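MBR decoding over an n-best list picks the hypothesis that minimizes expected loss under the posterior induced by the model scores. A minimal sketch follows; the token-overlap loss and the softmax normalization of log-scores are stand-in assumptions, not the loss or posterior used in the paper (which is BLEU-based).

```python
import math

def mbr_select(nbest, scores, loss):
    """nbest: hypothesis strings; scores: parallel log model scores.
    Returns the hypothesis minimizing expected loss under the
    posterior P(e) proportional to exp(score(e))."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    def expected_loss(e):
        return sum(p * loss(e, e2) for e2, p in zip(nbest, probs))
    return min(nbest, key=expected_loss)

def overlap_loss(e, e2):
    """Stand-in loss: 1 - token-overlap F1 (not the BLEU-based loss)."""
    a, b = e.split(), e2.split()
    common = len(set(a) & set(b))
    if common == 0:
        return 1.0
    p, r = common / len(a), common / len(b)
    return 1.0 - 2 * p * r / (p + r)

nbest = ["the house is small", "the house is tiny", "a dwelling is small"]
scores = [-1.0, -1.1, -0.9]  # MAP would pick the last (highest score)
print(mbr_select(nbest, scores, overlap_loss))  # → "the house is small"
```

Note how the MAP choice ("a dwelling is small") loses to the consensus hypothesis: MBR rewards translations close to the rest of the list, which is why list diversity, discussed next, matters so much for this decision rule.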
    <Paragraph position="2"> A possible reason is that the n-best lists generated by Pharaoh contain a large percentage of lexically identical translations, differing only in their phrase segmentations. As a result, the 1000-best lists generated by Pharaoh contain only a small percentage of unique translations, a condition that limits the potential of Minimum Bayes Risk methods. The CMU decoder, by contrast, prunes away alternatives below a certain score threshold during decoding and does not recover them when generating the n-best list. The n-best lists of this system are therefore typically more diverse and in particular contain far more unique translations.</Paragraph>
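The diversity effect described above can be quantified by collapsing segmentation markup and counting distinct surface strings. A small sketch, where the "|" phrase-boundary marker is a hypothetical stand-in for whatever segmentation annotation the decoder emits:

```python
def unique_translations(nbest):
    """Return the distinct surface strings in an n-best list,
    ignoring phrase-segmentation markers (here a hypothetical "|")."""
    seen = []
    for hyp in nbest:
        surface = " ".join(tok for tok in hyp.split() if tok != "|")
        if surface not in seen:
            seen.append(surface)
    return seen

nbest = [
    "the | house is | small",
    "the house | is small",      # same words, different segmentation
    "the | house is | little",
]
print(len(unique_translations(nbest)))  # → 2 distinct translations
```

Under this measure, a Pharaoh-style 1000-best list collapses to a much smaller set of candidates, which starves the MBR expectation of genuinely alternative hypotheses.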
  </Section>
</Paper>