<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0827">
  <Title>Improving Phrase-Based Statistical Translation by modifying phrase extraction and including several features</Title>
  <Section position="5" start_page="149" end_page="149" type="metho">
    <SectionTitle>
2 Baseline
</SectionTitle>
    <Paragraph position="0"> The baseline is based on the source-channel approach, and it is composed of the following models which later will be combined in the decoder.</Paragraph>
    <Paragraph position="1"> The Translation Model. It is based on bilingual phrases, where a bilingual phrase (BP) is simply two monolingual phrases (MP) in which each one is supposed to be the translation of each other. A monolingual phrase is a sequence of words.</Paragraph>
    <Paragraph position="2"> Therefore, the basic idea of phrase-based translation is to segment the given source sentence into phrases, then translate each phrase and finally compose the target sentence from these phrase translations [17].</Paragraph>
    <Paragraph position="3"> During training, the system has to learn a dictionary of phrases. We begin by aligning the training corpus using GIZA++ [6], which is done in both translation directions. We take the union of both alignments to obtain a symmetrized word alignment matrix. This alignment matrix is the starting point for the phrase based extraction.</Paragraph>
    <Paragraph position="4"> Next, we define the criterion to extract the set of BP of the sentence pair (fj2j1 ; ei2i1) and the alignment matrix A [?] J[?]I, which is identical to the alignment criterion described in [11].</Paragraph>
    <Paragraph position="6"> The set of BP is consistent with the alignment and consists of all BP pairs where all words within the foreign language phrase are only aligned to the words of the English language phrase and viceversa.</Paragraph>
    <Paragraph position="7"> At least one word in the foreign language phrase has to be aligned with at least one word of the English language. Finally, the algorithm takes into account possibly unaligned words at the boundaries of the foreign or English language phrases.</Paragraph>
    <Paragraph position="8"> The target language model. It is combined with the translation probability as showed in equation (2). It gives coherence to the target text obtained by the concatenated phrases.</Paragraph>
  </Section>
  <Section position="6" start_page="149" end_page="150" type="metho">
    <SectionTitle>
3 Phrase Extraction
</SectionTitle>
    <Paragraph position="0"> Motivation. The length of a MP is defined as its number of words. The length of a BP is the greatest of the lengths of its MP.</Paragraph>
    <Paragraph position="1"> As we are working with a huge amount of data (see corpus statistics), it is unfeasible to build a dictionary with all the phrases longer than length 4. Moreover, the huge increase in computational and storage cost of including longer phrases does not provide a significant improve in quality [8].</Paragraph>
    <Paragraph position="2"> X-length In our system we considered two length limits. We first extract all the phrases of length 3 or less. Then, we also add phrases up to length 5 if they cannot be generated by smaller phrases.</Paragraph>
    <Paragraph position="3"> Empirically, we chose 5, as the probability of reappearence of larger phrases decreases.</Paragraph>
    <Paragraph position="4"> Basically, we select additional phrases with source words that otherwise would be missed because of cross or long alignments. For example, from the following sentence, Cuando el Parlamento Europeo , que tan frecuentemente insiste en los derechos de los traba-</Paragraph>
    <Paragraph position="6"> where the number inside the clauses is the aligned word(s). And the phrase that we are looking for is the following one.</Paragraph>
    <Paragraph position="7"> los derechos de los trabajadores # workers ' rights which only could appear in the case the maximum length was 5.</Paragraph>
  </Section>
  <Section position="7" start_page="150" end_page="150" type="metho">
    <SectionTitle>
4 Phrase ranking
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="150" end_page="150" type="sub_section">
      <SectionTitle>
4.1 Conditional probability P(f|e)
</SectionTitle>
      <Paragraph position="0"> Given the collected phrase pairs, we estimated the phrase translation probability distribution by relative frecuency.</Paragraph>
      <Paragraph position="2"> where N(f,e) means the number of times the phrase f is translated by e. If a phrase e has N &gt; 1 possible translations, then each one contributes as 1/N [17].</Paragraph>
      <Paragraph position="3"> Note that no smoothing is performed, which may cause an overestimation of the probability of rare phrases. This is specially harmful given a BP where the source part has a big frecuency of appearence but the target part appears rarely. For example, from our database we can extract the following BP: &amp;quot;you # la que no&amp;quot;, where the English is the source language and the Spanish, the target language. Clearly, &amp;quot;la que no&amp;quot; is not a good translation of &amp;quot;you&amp;quot;, so this phrase should have a low probability. However, from our aligned training database we obtain,</Paragraph>
      <Paragraph position="5"> This BP is clearly overestimated due to sparseness. On the other, note that &amp;quot;la que no&amp;quot; cannot be considered an unusual trigram in Spanish.</Paragraph>
      <Paragraph position="6"> Hence, the language model does not penalise this target sequence either. So, the total probability (P(f|e)P(e)) would be higher than desired.</Paragraph>
      <Paragraph position="7"> In order to somehow compensate these unreiliable probabilities we have studied the inclusion of the posterior [12] and lexical probabilities [1] [10] as additional features.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="150" end_page="151" type="metho">
    <SectionTitle>
4.2 Feature P(e|f)
</SectionTitle>
    <Paragraph position="0"> In order to estimate the posterior phrase probability, we compute again the relative frequency but replacing the count of the target phrase by the count of the source phrase.</Paragraph>
    <Paragraph position="2"> where N'(f,e) means the number of times the phrase e is translated by f. If a phrase f has N &gt; 1 possible translations, then each one contributes as 1/N.</Paragraph>
    <Paragraph position="3"> Adding this feature function we reduce the number of cases in which the overall probability is overestimated. This results in an important improvement in translation quality.</Paragraph>
    <Paragraph position="4">  We used IBM Model 1 to estimate the probability of a BP. As IBM Model 1 is a word translation and it gives the sum of all possible alignment probabilities, a lexical co-ocurrence effect is expected. This captures a sort of semantic coherence in translations. null Therefore, the probability of a sentence pair is given by the following equation.</Paragraph>
    <Paragraph position="6"> The p(fj|ei) are the source-target IBM Model 1 word probabilities trained by GIZA++. Because the phrases are formed from the union of source-to-target and target-to-source alignments, there can be words that are not in the P(fj|ei) table. In this case, the probability was taken to be 10[?]40.</Paragraph>
    <Paragraph position="7"> In addition, we have calculated the IBM[?]1 Model</Paragraph>
    <Paragraph position="9"/>
    <Section position="1" start_page="150" end_page="150" type="sub_section">
      <SectionTitle>
4.4 Language Model
</SectionTitle>
      <Paragraph position="0"> The English language model plays an important role in the source channel model, see equation (2), and also in its modification, see equation (3). The English language model should give an idea of the sentence quality that is generated.</Paragraph>
      <Paragraph position="1"> As default language model feature, we use a standard word-based trigram language model generated with smoothing Kneser-Ney and interpolation (by using SRILM [16]).</Paragraph>
    </Section>
    <Section position="2" start_page="150" end_page="151" type="sub_section">
      <SectionTitle>
4.5 Word and Phrase Penalty
</SectionTitle>
      <Paragraph position="0"> To compensate the preference of the target language model for shorter sentences, we added two  simple features which are widely used [17] [7]. The word penalty provides means to ensure that the translations do not get too long or too short. Negative values for the word penalty favor longer output, positive values favor shorter output [7]. The phrase penalty is a constant cost per produced phrase. Here, a negative weight, which means reducing the costs per phrase, results in a preference for adding phrases. Alternatively, by using a positive scaling factors, the system will favor less phrases.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML