<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1033">
  <Title>Improvements in Phrase-Based Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Phrase-Based Translation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Motivation
</SectionTitle>
      <Paragraph position="0"> One major disadvantage of single-word based approaches is that contextual information is not taken into account.</Paragraph>
      <Paragraph position="1"> The lexicon probabilities are based only on single words.</Paragraph>
      <Paragraph position="2"> For many words, the translation depends heavily on the surrounding words. In the single-word based translation approach, this disambiguation is addressed by the language model only, which is often not capable of doing this.</Paragraph>
      <Paragraph position="3"> One way to incorporate the context into the translation model is to learn translations for whole phrases instead of single words. Here, a phrase is simply a sequence of words. So, the basic idea of phrase-based translation is to segment the given source sentence into phrases, then translate each phrase and finally compose the target sentence from these phrase translations.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Phrase Extraction
</SectionTitle>
      <Paragraph position="0"> The system somehow has to learn which phrases are translations of each other. Therefore, we use the following approach: first, we train statistical alignment models using GIZA++ and compute the Viterbi word alignment of the training corpus. This is done for both translation directions. We take the union of both alignments to obtain a symmetrized word alignment matrix. This alignment matrix is the starting point for the phrase extraction. The following criterion defines the set of bilingual phrases BP of the sentence pair (fJ1 ;eI1) and the alignment matrix A J PSI that is used in the translation system.</Paragraph>
      <Paragraph position="2"> This criterion is identical to the alignment template criterion described in (Och et al., 1999). It means that two phrases are considered to be translations of each other, if the words are aligned only within the phrase pair and not to words outside. The phrases have to be contiguous.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Translation Model
</SectionTitle>
      <Paragraph position="0"> To use phrases in the translation model, we introduce the hidden variable S. This is a segmentation of the sentence pair (fJ1 ;eI1) into K phrases (~fK1 ;~eK1 ). We use a one-to-one phrase alignment, i.e. one source phrase is translated by exactly one target phrase. Thus, we obtain:</Paragraph>
      <Paragraph position="2"> In the preceding step, we used the maximum approximation for the sum over all segmentations. Next, we allow only translations that are monotone at the phrase level.</Paragraph>
      <Paragraph position="3"> So, the phrase ~f1 is produced by ~e1, the phrase ~f2 is produced by ~e2, and so on. Within the phrases, the re-ordering is learned during training. Therefore, there is no constraint on the reordering within the phrases.</Paragraph>
      <Paragraph position="5"> Here, we have assumed a zero-order model at the phrase level. Finally, we have to estimate the phrase translation probabilities p(~fj~e). This is done via relative frequencies:</Paragraph>
      <Paragraph position="7"> Here, N(~f;~e) denotes the count of the event that ~f has been seen as a translation of ~e. If one occurrence of ~e has N &gt; 1 possible translations, each of them contributes to N(~f;~e) with 1=N. These counts are calculated from the training corpus.</Paragraph>
      <Paragraph position="8"> Using a bigram language model and assuming Bayes decision rule, Equation (2), we obtain the following search criterion:</Paragraph>
      <Paragraph position="10"> For the preceding equation, we assumed the segmentation probability p(SjeI1) to be constant. The result is a simple translation model. If we interpret this model as a feature function in the direct approach, we obtain:</Paragraph>
      <Paragraph position="12"> We use the maximum approximation for the hidden variable S. Therefore, the feature functions are dependent on S. Although the number of phrases K is implicitly given by the segmentation S, we used both S and K to make this dependency more obvious.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>