<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3110">
  <Title>N-Gram Posterior Probabilities for Statistical Machine Translation</Title>
  <Section position="4" start_page="72" end_page="72" type="metho">
    <SectionTitle>
3 N-Gram Posterior Probabilities
</SectionTitle>
    <Paragraph position="0"> The idea is similar to the word posterior probabilities: we sum the sentence posterior probabilities for each occurrence of an n-gram.</Paragraph>
    <Paragraph position="1"> Let d(*,*) denote the Kronecker function. Then, we define the fractional count C(en1,fJ1 ) of an n-gram en1 for a source sentence fJ1 as:</Paragraph>
    <Paragraph position="3"> The sums over the target language sentences are limited to an N-best list, i.e. the N best translation candidates according to the baseline model. In this equation, the term d(e'i+n[?]1i ,en1) is one if and only if the n-gram en1 occurs in the target sentence e'I1 starting at position i.</Paragraph>
    <Paragraph position="4"> Then, the posterior probability of an n-gram is obtained as:</Paragraph>
    <Paragraph position="6"> Note that the widely used word posterior probability is obtained as a special case, namely if n is set to one.</Paragraph>
  </Section>
  <Section position="5" start_page="72" end_page="72" type="metho">
    <SectionTitle>
4 Sentence Length Posterior Probability
</SectionTitle>
    <Paragraph position="0"> The common phrase-based translation systems, such as (Och et al., 1999; Koehn, 2004), do not use an explicit sentence length model. Only the simple word penalty goes into that direction. It can be adjusted to prefer longer or shorter translations.</Paragraph>
    <Paragraph position="1"> Here, we will use the posterior probability of a specific target sentence length I as length model:</Paragraph>
    <Paragraph position="3"> Note that the sum is carried out only over target sentences eI1 with the a specific length I. Again, the candidate target language sentences are limited to an N-best list.</Paragraph>
  </Section>
  <Section position="6" start_page="72" end_page="73" type="metho">
    <SectionTitle>
5 Rescoring/Reranking
</SectionTitle>
    <Paragraph position="0"> A straightforward application of the posterior probabilities is to use them as additional features in a rescoring/reranking approach (Och et al., 2004).</Paragraph>
    <Paragraph position="1"> The use of N-best lists in machine translation has several advantages. It alleviates the effects of the huge search space which is represented in word  graphs by using a compact excerpt of the N best hypotheses generated by the system. N-best lists are suitable for easily applying several rescoring techniques since the hypotheses are already fully generated. In comparison, word graph rescoring techniques need specialized tools which can traverse the graph accordingly.</Paragraph>
    <Paragraph position="2"> The n-gram posterior probabilities can be used similar to an n-gram language model:</Paragraph>
    <Paragraph position="4"> Note that the models do not require smoothing as long as they are applied to the same N-best list they are trained on.</Paragraph>
    <Paragraph position="5"> If the models are used for unseen sentences, smoothing is important to avoid zero probabilities. We use a linear interpolation with weights an and the smoothed (n [?] 1)-gram model as generalized distribution.</Paragraph>
    <Paragraph position="7"> Note that absolute discounting techniques that are often used in language modeling cannot be applied in a straightforward way, because here we have fractional counts.</Paragraph>
    <Paragraph position="8"> The usage of the sentence length posterior probability for rescoring is even simpler. The resulting feature is:</Paragraph>
    <Paragraph position="10"> Again, the model does not require smoothing as long as it is applied to the same N-best list it is trained on. If it is applied to other sentences, smoothing becomes important. We propose to smooth the sentence length model with a Poisson distribution.</Paragraph>
    <Paragraph position="12"> We use a linear interpolation with weight b. The mean l of the Poisson distribution is chosen to be identical to the mean of the unsmoothed length model:</Paragraph>
    <Paragraph position="14"/>
  </Section>
  <Section position="7" start_page="73" end_page="74" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
6.1 Corpus Statistics
</SectionTitle>
      <Paragraph position="0"> The experiments were carried out on the large data track of the Chinese-English NIST task. The corpus statistics of the bilingual training corpus are shown in Table 1. The language model was trained on the English part of the bilingual training corpus and additional monolingual English data from the GigaWord corpus. The total amount of language model training data was about 600M running words. We use a fourgram language model with modified Kneser-Ney smoothing as implemented in the SRILM toolkit (Stolcke, 2002).</Paragraph>
      <Paragraph position="1"> To measure the translation quality, we use the BLEU score (Papineni et al., 2002) and the NIST score (Doddington, 2002). The BLEU score is the geometric mean of the n-gram precision in combination with a brevity penalty for too short sentences. The NIST score is the arithmetic mean of a weighted n-gram precision in combination with a brevity penalty for too short sentences. Both scores are computed case-sensitive with respect to four reference translations using the mteval-v11b tool1. As the BLEU and NIST scores measure accuracy higher scores are better.</Paragraph>
      <Paragraph position="2"> We use the BLEU score as primary criterion which is optimized on the development set using the Downhill Simplex algorithm (Press et al., 2002). As development set, we use the NIST 2002 evaluation set. Note that the baseline system is already welltuned and would have obtained a high rank in the last NIST evaluation (NIST, 2005).</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
6.2 Translation Results
</SectionTitle>
      <Paragraph position="0"> The translation results for the Chinese-English NIST task are presented in Table 2. We carried out experiments for evaluation sets of several years. For these rescoring experiments, we use the 10 000 best translation candidates, i.e. N-best lists of size N=10 000.</Paragraph>
      <Paragraph position="1">  conventional word posterior probabilities, there is only a very small improvement, or no improvement at all. This is consistent with the findings of the JHU workshop on confidence estimation for statistical machine translation 2003 (Blatz et al., 2003), where the word-level confidence measures also did not help to improve the BLEU or NIST scores.</Paragraph>
      <Paragraph position="2"> Successively adding higher order n-gram posterior probabilities, the translation quality improves consistently across all evaluation sets. We also performed experiments with n-gram orders beyond four, but these did not result in further improvements. null Adding the sentence length posterior probability feature is also helpful for all evaluation sets. For the development set, the overall improvement is 1.5% for the BLEU score. On the blind evaluation sets, the overall improvement of the translation quality ranges between 1.1% and 1.6% BLEU.</Paragraph>
      <Paragraph position="3"> Some translation examples are shown in Table 3.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="74" end_page="74" type="metho">
    <SectionTitle>
7 Future Applications
</SectionTitle>
    <Paragraph position="0"> We have shown that the n-gram posterior probabilities are very useful in a rescoring/reranking framework. In addition, there are several other potential applications. In this section, we will describe two of them.</Paragraph>
    <Section position="1" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
7.1 Iterative Search
</SectionTitle>
      <Paragraph position="0"> The n-gram posterior probability can be used for rescoring as described in Section 5. An alternative is to use them directly during the search. In this second search pass, we use the models from the first pass, i.e. the baseline system, and additionally the n-gram and sentence length posterior probabilities. As the n-gram posterior probabilities are basically a kind of sentence-specific language model, it is straight-forward to integrate them. This process can also be iterated. Thus, using the N-best list of the second pass to recompute the n-gram and sentence length posterior probabilities and do a third search pass, etc..</Paragraph>
    </Section>
    <Section position="2" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
7.2 Computer Assisted Translation
</SectionTitle>
      <Paragraph position="0"> In the computer assisted translation (CAT) framework, the goal is to improve the productivity of human translators. The machine translation system takes not only the current source language sentence but also the already typed partial translation into account. Based on this information, the system suggest completions of the sentence. Word-level posterior probabilities have been used to select the most appropriate completion of the system, for more details see e.g. (Gandrabur and Foster, 2003; Ueffing and Ney, 2005). The n-gram based posterior probabilities as described in this work, might be better suited for this task as they explicitly model the dependency on the previous words, i.e. the given prefix.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>