<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0833"> <Title>Hybrid Example-Based SMT: the Best of Both Worlds?</Title> <Section position="5" start_page="184" end_page="185" type="metho"> <SectionTitle> 3 Comparing EBMT and Word-Based SMT </SectionTitle> <Paragraph position="0"> (Way and Gough, 2005) obtained a large translation memory from Sun Microsystems containing 207,468 English-French sentence pairs, of which 3,939 sentence pairs were randomly extracted as a test set, with the remaining 203,529 sentences used as training data. The average sentence length for the English test set was 13.1 words and 15.2 words for the corresponding French test set. The EBMT system used was their Marker-based system as described in section 2.1 above. In order to create the necessary SMT language and translation models, they used: (Turian et al., 2003), and Word- and Sentence Error Rates. In order to see whether the amount of training data affected the (relative) performance of the EBMT and SMT systems, (Way and Gough, 2005) split the training data into three sets, of 50K (1.1M words), 100K (2.4M words) and 203K (4.8M words) sentence pairs (TS1-TS3 in what follows).</Paragraph> <Section position="1" start_page="185" end_page="185" type="sub_section"> <SectionTitle> 3.1 English-French Results </SectionTitle> <Paragraph position="0"> The results obtained by (Gough & Way, 2004b) for English-French for their EBMT system and word-based SMT (WB-SMT) are given in Table 1.</Paragraph> <Paragraph position="1"> Essentially, all the automatic evaluation metrics bar one (Precision) suggest that EBMT can outperform SMT from English-French. Surprisingly, however, apart from SER, all evaluation scores are higher using 100K sentence pairs as training data rather than the full 203K sentences. It is generally assumed that increasing the size of the training data for corpus-based MT systems will improve the quality of the output translations. (Way and Gough, 2005) observe that while this dip in performance may be due to a degree of over-fitting, they intend to carry out some variance analysis on these results (e.g. performing bootstrap-resampling on the test set (Koehn, 2004)), or re-test with different sample test sets in order to investigate whether the same phenomenon is observed. null With respect to SER, however, for both SMT and EBMT, the figures improve as more training data is made available. However, the improvement is much more significant for EBMT (20.6%) than for SMT (0.1%). While the WER scores are much the same, indicating that both systems are identifying reasonable target vocabulary that should appear in the output translation, the vast differences in SER using TS3 indicate that a system containing essentially no information about target syntax has very little hope of arranging these target words in the right order.</Paragraph> <Paragraph position="2"> On the contrary, even a system containing some basic knowledge of how phrases fit together such as the Marker-based EBMT system of (Gough & Way, 2004b) will generate translations of far higher quality. null</Paragraph> </Section> <Section position="2" start_page="185" end_page="185" type="sub_section"> <SectionTitle> 3.2 French-English Results </SectionTitle> <Paragraph position="0"> The results obtained by (Way and Gough, 2005) for French-English translations are presented in Table 2. Translating in this language direction is inherently 'easier' than for English-French as far fewer agreement errors and cases of boundary friction are likely. 
</Section> <Section position="2" start_page="185" end_page="185" type="sub_section"> <SectionTitle> 3.2 French-English Results </SectionTitle>
<Paragraph position="0"> The results obtained by (Way and Gough, 2005) for French-English translations are presented in Table 2. Translating in this language direction is inherently 'easier' than for English-French, as far fewer agreement errors and cases of boundary friction are likely. Accordingly, all WB-SMT results in Table 2 are better than for the reverse direction, while for EBMT, improved results are to be seen for BLEU, Recall and SER.</Paragraph>
<Paragraph position="1"> While the majority of metrics obtained for English-French indicate that EBMT outperforms WB-SMT, the results for French-English are by no means as conclusive. Of the 15 tests, WB-SMT outperforms EBMT in nine.</Paragraph>
</Section> </Section> <Section position="6" start_page="185" end_page="186" type="metho"> <SectionTitle> 4 Comparing EBMT and Phrase-Based SMT </SectionTitle>
<Paragraph position="0"> From the results in the previous sections for French-English and for English-French, (Way and Gough, 2005) observe that EBMT outperforms WB-SMT in the majority of tests. If we are to treat each of the metrics as being equally significant, it can be said that EBMT appears to outperform WB-SMT by a factor of two to one. In fact, the only metric on which EBMT seems to consistently underperform is Precision for French-English, which, taken together with the WER scores, indicates that the EBMT system's knowledge of word correspondences is not as comprehensive as that of the WB-SMT system.</Paragraph>
<Paragraph position="1"> However, it has been apparent for some time now that phrase-based SMT outperforms previous systems using word-based models. The results obtained by (Way and Gough, 2005) for SER also indicate that if phrase-based SMT were used, then improvements in translation quality ought to be seen. Accordingly, in this section we describe a set of experiments which extends the work of (Way and Gough, 2005) by evaluating the Marker-based EBMT system of (Gough & Way, 2004b) against a phrase-based SMT system built using the following components: * Giza++, to extract the word-level correspondences; * The Giza++ word alignments are then refined and used to extract phrasal alignments (Och & Ney, 2003; see Koehn et al., 2003 for a more recent implementation); * Probabilities of the extracted phrases are calculated from relative frequencies; * The resulting phrase translation table is passed to the Pharaoh phrase-based SMT decoder, which, together with the SRI language modelling toolkit, performs translation.</Paragraph>
<Section position="1" start_page="186" end_page="186" type="sub_section"> <SectionTitle> 4.1 English-French Results </SectionTitle>
<Paragraph position="0"> We seeded the phrase-based SMT system constructed from the publicly available resources listed above with the word- and phrase-alignments derived via both Giza++ and the Marker-based EBMT system of (Gough & Way, 2004b). Using the full 203K training set of (Gough & Way, 2004b), and testing on their near 4K test set, the results are given in Table 3. It is clear that the Giza++ alignments obtain better scores than the EBMT sub-sentential data. Before considering the full impact of these results, one should take into account that the size of the EBMT data set (word- and phrase-alignments) is 403,317 items, while there are over four times as many SMT sub-sentential alignments (1,732,715).</Paragraph>
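<Paragraph> To illustrate how such sub-sentential data can be fed to Pharaoh, the sketch below estimates phrase translation probabilities as relative frequencies over a collection of extracted phrase pairs and writes them out in a simplified '|||'-separated phrase-table format. This is a rough illustration under stated assumptions: the input representation (plain (source, target) string pairs), the helper name and the exact output format are ours rather than those used by (Way and Gough, 2005), and a real phrase table would normally carry further scores such as lexical weights.

from collections import defaultdict

def build_phrase_table(phrase_pairs, out_path):
    """Estimate p(target | source) by relative frequency over extracted
    phrase pairs and write a simplified Pharaoh-style phrase table.

    phrase_pairs: iterable of (source_phrase, target_phrase) tuples,
    e.g. alignments derived from Giza++, from the EBMT system, or both."""
    counts = defaultdict(lambda: defaultdict(int))
    source_totals = defaultdict(int)
    for src, tgt in phrase_pairs:
        counts[src][tgt] += 1
        source_totals[src] += 1
    with open(out_path, "w", encoding="utf-8") as out:
        for src, targets in counts.items():
            for tgt, pair_count in targets.items():
                # Relative-frequency estimate of the translation probability.
                prob = pair_count / source_totals[src]
                out.write(f"{src} ||| {tgt} ||| {prob:.6f}\n")

Seeding the decoder with EBMT data then amounts to no more than changing which phrase pairs are passed in before the probabilities are estimated. </Paragraph>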
<Paragraph position="1"> Comparing these results with those in Table 1, we can see that for the same training-test data, the phrase-based SMT system outperforms the WB-SMT system on most metrics, considerably so with respect to BLEU score (.3753 vs. .3223). WER, however, is somewhat worse (.585 vs. .535), and SER remains disappointingly high. Compared to the EBMT system of (Gough & Way, 2004b), the phrase-based SMT system still falls well short with respect to BLEU score (.4409 for EBMT vs. .3573 for SMT), and again notably for SER (.656 for EBMT vs. .868 for SMT).</Paragraph>
</Section> </Section> <Section position="7" start_page="186" end_page="186" type="metho"> <SectionTitle> 4.2 French-English Results </SectionTitle>
<Paragraph position="0"> Again, the phrase-based SMT system was seeded with the Giza++ and EBMT alignments, trained on the full 203K training set, and tested on the 4K test set. The results are given in Table 4. As for English-French, the Giza++ alignments obtain better scores than when the EBMT sub-sentential data is used.</Paragraph>
<Paragraph position="1"> Comparing these results with those in Table 2, we see that the phrase-based SMT system actually does worse than WB-SMT, which is an unexpected result. As expected, therefore, the results for phrase-based SMT here are worse still compared to EBMT.</Paragraph>
</Section> <Section position="8" start_page="186" end_page="187" type="metho"> <SectionTitle> 5 Towards Hybridity: Merging SMT and EBMT Alignments </SectionTitle>
<Paragraph position="0"> We decided to experiment further by combining parts of the EBMT sub-sentential alignments with parts of the data induced by Giza++. In the following sections, for both English-French and French-English, we seed the Pharaoh phrase-based SMT system with: 1. the EBMT phrase-alignments together with the Giza++ word-alignments; 2. all the EBMT and Giza++ sub-sentential alignments (both words and phrases).</Paragraph>
<Section position="1" start_page="187" end_page="187" type="sub_section"> <SectionTitle> 5.1 Giza++ Words and EBMT Phrases </SectionTitle>
<Paragraph position="0"> Here we seeded Pharaoh with the word-alignments induced by Giza++ and the EBMT phrasal chunks only (i.e. no Giza++ phrases and no EBMT lexical alignments).</Paragraph>
<Paragraph position="1"> Using the full 203K training set of (Gough & Way, 2004b), and testing on their near 4K test set, the results are given in Table 5. Comparing these figures to those in Table 3, we can see that all automatic evaluation metrics improve with this hybrid system configuration. Note that the data set size is 430,336, compared to 1.73M for the phrase-based SMT system seeded solely with Giza++ alignments. Compared with the results for the EBMT system itself in Table 1, these results remain slightly lower (except for Precision).</Paragraph>
<Paragraph position="2"> Running the same experimental set-up for the reverse language direction gives the results in Table 6. While Recall drops slightly, all the other metrics show a slight increase compared to the performance obtained when Pharaoh is seeded with Giza++ word- and phrase-alignments (cf. Table 4).</Paragraph>
</Section> <Section position="2" start_page="187" end_page="187" type="sub_section"> <SectionTitle> 5.2 Merging All Data </SectionTitle>
<Paragraph position="0"> The following two experiments were carried out by seeding Pharaoh with all the EBMT and Giza++ sub-sentential alignments, i.e. both words and phrases.</Paragraph>
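<Paragraph> Concretely, this 'fully hybrid' configuration amounts to pooling the two sets of sub-sentential alignments before the translation and target language probability models are estimated. The short sketch below shows one way such pooling might be done; it assumes the alignments from each source are available as (source, target) string pairs, and whether duplicate pairs are kept (so that a pair found by both systems carries more weight in the relative-frequency estimates) or collapsed is a design choice we leave open.

from itertools import chain

def merge_alignments(giza_pairs, ebmt_pairs, deduplicate=False):
    """Pool Giza++ and EBMT sub-sentential alignments (words and phrases)
    into a single collection prior to probability estimation.

    With deduplicate=False, a pair extracted by both systems is counted
    twice when relative frequencies are re-estimated over the pooled data."""
    pooled = chain(giza_pairs, ebmt_pairs)
    if deduplicate:
        # Keep the first occurrence of each pair, preserving order.
        return list(dict.fromkeys(pooled))
    return list(pooled)

The pooled collection can then be used to build a single phrase table in the same way as for the individual data sets. </Paragraph>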
<Paragraph position="1"> The English-French results are given in Table 7. These are considerably better than the scores for the 'semi-hybrid' system described in section 5.1. This indicates that a phrase-based SMT system is likely to perform better when EBMT word- and phrase-alignments are used in the calculation of the translation and target language probability models. Note, however, that the size of the data set increases to over 2M items. Despite this, compared to the results for the EBMT system of (Gough & Way, 2004b) shown in Table 1, these results for the 'fully hybrid' SMT system still fall somewhat short (except for Precision: .6727 vs. .7026).</Paragraph>
<Paragraph position="2"> Carrying out a similar experiment for the reverse language direction gives the results in Table 8. This time this hybrid SMT system does outperform the EBMT system of (Gough & Way, 2004b) with respect to BLEU score (.4888 vs. .4611) and Precision (.6927 vs. .6782), but the EBMT system still wins out where Recall, WER and SER are concerned. Regarding the latter, it seems that the correlation between low SER and high BLEU score is not as important as is claimed in (Way and Gough, 2005).</Paragraph>
</Section> </Section> <Section position="9" start_page="187" end_page="188" type="metho"> <SectionTitle> 6 Conclusions </SectionTitle>
<Paragraph position="0"> (Way and Gough, 2005) carried out a number of experiments designed to test their large-scale Marker-based EBMT system described in (Gough & Way, 2004b) against a WB-SMT system constructed from publicly available tools. While the results were a little mixed, the EBMT system won out overall.</Paragraph>
<Paragraph position="1"> Nonetheless, WB-SMT has long been abandoned in favour of phrase-based models. We extended the work of (Way and Gough, 2005) by performing a range of experiments using the Pharaoh phrase-based decoder. Our main observations are as follows: * Seeding Pharaoh with word- and phrase-alignments induced via Giza++ generates better results than if EBMT sub-sentential data is used.</Paragraph>
<Paragraph position="2"> * Seeding Pharaoh with a 'hybrid' dataset of Giza++ word alignments and EBMT phrases improves over the baseline phrase-based SMT system primed solely with Giza++ data. This would appear to indicate that the quality of the EBMT phrases is better than that of the SMT phrases, and that SMT practitioners should use EBMT phrasal data in the calculation of their language and translation models, if available.</Paragraph>
<Paragraph position="3"> * Seeding Pharaoh with all the data induced by Giza++ and the EBMT system leads to the best-performing hybrid SMT system: for English-French, the EBMT word alignments as well as the EBMT phrasal data contribute to the improvement over the 'semi-hybrid' configuration, while for French-English this configuration outperforms the EBMT system itself on BLEU score and Precision.</Paragraph>
<Paragraph position="4"> A number of avenues of further work remain open to us. We would like to extend our investigations into hybrid example-based statistical approaches to machine translation by experimenting with seeding the Marker-based system of (Gough & Way, 2004b) with the SMT data, and combinations thereof with the EBMT sub-sentential alignments, to investigate the effect on translation quality. Given our findings here, we are optimistic that 'hybrid statistical EBMT' will outperform the baseline EBMT system, and that our findings will prompt EBMT practitioners to augment their data resources with SMT alignments, something which to our knowledge is currently not done. In addition, we intend to continue this line of research on different and larger data sets, and for other language pairs.</Paragraph>
</Section> </Paper>