<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1027">
  <Title>Refined Lexicon Models for Statistical Machine Translation using a Maximum Entropy Approach</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Statistical Machine Translation
</SectionTitle>
    <Paragraph position="0"> The goal of the translation process in statistical machine translation can be formulated as follows: A source language string a4a6a5a7a9a8 a4 a7a6a10a11a10a11a10 a4 a5is to be translated into a target language string</Paragraph>
    <Paragraph position="2"> this paper, the source language is German and the target language is English. Every target string is considered as a possible translation for the input.</Paragraph>
    <Paragraph position="3"> If we assign a probability a15a17a16a19a18 a12 a13a7a21a20a4a6a5a7a23a22 to each pair of strings a18 a12a14a13a7a25a24 a4 a5a7 a22 , then according to Bayes' decision rule, we have to choose the target string that maximizes the product of the target language model a15a17a16a19a18 a12a14a13a7 a22 and the string translation model</Paragraph>
    <Paragraph position="5"> Many existing systems for statistical machine translation (Berger et al., 1994; Wang and Waibel, 1997; Tillmann et al., 1997; Niessen et al., 1998) make use of a special way of structuring the string translation model like proposed by (Brown et al., 1993): The correspondence between the words in the source and the target string is described by alignments that assign one target word position to each source word position. The lexicon probability a27a28a18a26a4 a20a12 a22 of a certain target word a12 to occur in the target string is assumed to depend basically only on the source word a4 aligned to it.</Paragraph>
    <Paragraph position="6"> These alignment models are similar to the concept of Hidden Markov models (HMM) in speech recognition. The alignment mapping is a29a31a30</Paragraph>
    <Paragraph position="8"> ments a33a37a35a38a8a40a39 with the 'empty' word a12a14a41 to account for source words that are not aligned to any target word. In (statistical) alignment models</Paragraph>
    <Paragraph position="10"> a hidden variable.</Paragraph>
    <Paragraph position="11"> Typically, the search is performed using the so-called maximum approximation:</Paragraph>
    <Paragraph position="13"> The search space consists of the set of all possible target language strings a12a25a13a7 and all possible align-</Paragraph>
    <Paragraph position="15"> The overall architecture of the statistical translation approach is depicted in Figure 1.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Maximum entropy modeling
</SectionTitle>
    <Paragraph position="0"> The translation probability a15a17a16a19a18a26a4 a5a7 a24 a33 a5a7 a20a12a25a13a7 a22 can be rewritten as follows:</Paragraph>
    <Paragraph position="2"> based on Bayes' decision rule.</Paragraph>
    <Paragraph position="3"> Typically, the probability a15a70a16a71a18a26a4 a35 a20a4 a35a11a72</Paragraph>
    <Paragraph position="5"> approximated by a lexicon model a27a28a18a26a4 a35 a20a12</Paragraph>
    <Paragraph position="7"> Obviously, this simplification is not true for a lot of natural language phenomena. The straightforward approach to include more dependencies in the lexicon model would be to add additional dependencies(e.g. a27a74a18a26a4 a35 a20a12</Paragraph>
    <Paragraph position="9"> would yield a significant data sparseness problem.</Paragraph>
    <Paragraph position="10"> Here, the role of maximum entropy (ME) is to build a stochastic model that efficiently takes a larger context into account. In the following, we will use a27a74a18a26a4 a20a79a80a22 to denote the probability that the ME model assigns to a4 in the context a79 in order to distinguish this model from the basic lexicon model a27a28a18a26a4 a20a12 a22 .</Paragraph>
    <Paragraph position="11"> In the maximum entropy approach we describe all properties that we feel are useful by so-called feature functions a81a28a18 a79 a24 a4 a22 . For example, if we want to model the existence or absence of a specific word a12a11a82 in the context of an English word a12 which has the translation a4 we can express this dependency using the following feature function:</Paragraph>
    <Paragraph position="13"> The ME principle suggests that the optimal parametric form of a model a27a28a18a26a4 a20a79a90a22 taking into account only the feature functions a81a92a91 a24a94a93 a8</Paragraph>
    <Paragraph position="15"> Here a96 a18 a79a80a22 is a normalization factor. The resulting model has an exponential form with free</Paragraph>
    <Paragraph position="17"> values which maximize the likelihood for a given training corpus can be computed with the so-called GIS algorithm (general iterative scaling) or its improved version IIS (Pietra et al., 1997; Berger et al., 1996).</Paragraph>
    <Paragraph position="18"> It is important to notice that we will have to obtain one ME model for each target word observed in the training data.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Contextual information and training events
</SectionTitle>
    <Paragraph position="0"> events In order to train the ME modela27 a51 a18a26a4 a20a79a80a22 associated to a target word a12 , we need to construct a corresponding training sample from the whole bilingual corpus depending on the contextual information that we want to use. To construct this sample, we need to know the word-to-word alignment between each sentence pair within the corpus. That is obtained using the Viterbi alignment provided by a translation model as described in (Brown et al., 1993). Specifically, we use the Viterbi alignment that was produced by Model 5. We use the program GIZA++ (Och and Ney, 2000b; Och and Ney, 2000a), which is an extension of the training program available in EGYPT (Al-Onaizan et al., 1999).</Paragraph>
    <Paragraph position="1"> Berger et al. (1996) use the words that surround a specific word pair a18 a12 a24 a4 a22 as contextual information. The authors propose as context the 3 words to the left and the 3 words to the right of the target word. In this work we use the following contextual information: a3 Target context: As in (Berger et al., 1996) we consider a window of 3 words to the left and to the right of the target word considered.</Paragraph>
    <Paragraph position="2"> a3 Source context: In addition, we consider a window of 3 words to the left of the source word a4 which is connected to a12 according to the Viterbi alignment.</Paragraph>
    <Paragraph position="3"> a3 Word classes: Instead of using a dependency on the word identity we include also a dependency on word classes. By doing this, we improve the generalization of the models and include some semantic and syntactic information with. The word classes are computed automatically using another statistical training procedure (Och, 1999) which often produces word classes including words with the same semantic meaning in the same class.</Paragraph>
    <Paragraph position="4"> A training event, for a specific target word a12 , is composed by three items:  appears.</Paragraph>
    <Paragraph position="5"> a3 The number of occurrences of the event in the training corpus.</Paragraph>
    <Paragraph position="6"> Table 1 shows some examples of training events for the target word &amp;quot;which&amp;quot;.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Features
</SectionTitle>
    <Paragraph position="0"> Once we have a set of training events for each target word we need to describe our feature functions. We do this by first specifying a large pool  English word &amp;quot;which&amp;quot; in the English context. In the German part the placeholder (&amp;quot; &amp;quot;) corresponds to the word aligned to &amp;quot;which&amp;quot;, in the first example the German word &amp;quot;die&amp;quot;, the word &amp;quot;das&amp;quot; in the second and the word &amp;quot;was&amp;quot; in the third. The considered English and German contexts are separated by the double bar &amp;quot;a20a112a20&amp;quot;.The last number in the rightmost position is the number of occurrences of the event in the whole corpus.</Paragraph>
    <Paragraph position="1"> Alig. word (a4 ) Context (a79 ) # of occur.</Paragraph>
    <Paragraph position="2"> die bar there , I just already nette Bar , 2 das hotel best , is very centrally ein Hotel , 1 was now , one do we jetzt , 1  sents a specific source word.</Paragraph>
    <Paragraph position="3"> Category a81 a51a53a117 a18 a79 a24 a4 a35 a22 a8 a87 if and only if ...</Paragraph>
    <Paragraph position="5"> for the word pair a18 a12a11a110 a24 a4 a35 a22 , we use the following categories of features:</Paragraph>
    <Paragraph position="7"> Category 1 features depend only on the source word a4 a35 and the target word a12a25a110 . A ME model that uses only those, predicts each source translation</Paragraph>
    <Paragraph position="9"> empirical data. This is exactly the standard lexicon probability a27a28a18a26a4 a20a12 a22 employed in the translation model described in (Brown et al., 1993) and in Section 2.</Paragraph>
    <Paragraph position="10"> Category 2 describes features which depend in addition on the word a12 a82 one position to the left or to the right of a12a25a110 . The same explanation is valid for category 3 but in this case a12 a82 could appears in any position of the context a79 . Categories 4 and 5 are the analogous categories to 2 and 3 using word classes instead of words. In the categories 6, 7, 8 and 9 the source context is used instead of the target context. Table 2 gives an overview of the different feature categories.</Paragraph>
    <Paragraph position="11"> Examples of specific features and their respective category are shown in Table 3.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Feature selection
</SectionTitle>
      <Paragraph position="0"> The number of possible features that can be used according to the German and English vocabularies and word classes is huge. In order to reduce the number of features we perform a threshold based feature selection, that is every feature which occurs less than a138 times is not used. The aim of the feature selection is two-fold. Firstly, we obtain smaller models by using less features, and secondly, we hope to avoid overfitting on the training data.</Paragraph>
      <Paragraph position="1"> In order to obtain the threshold a138 we compare the test corpus perplexity for various thresholds.</Paragraph>
      <Paragraph position="2"> The different threshold used in the experiments range from 0 to 512. The threshold is used as a cut-off for the number of occurrences that a specific feature must appear. So a cut-off of 0 means that all features observed in the training data are used. A cut-off of 32 means those features that appear 32 times or more are considered to train the maximum entropy models.</Paragraph>
      <Paragraph position="3"> We select the English words that appear at least 150 times in the training sample which are in total 348 of the 4673 words contained in the English vocabulary. Table 4 shows the different number of features considered for the 348 English words selected using different thresholds.</Paragraph>
      <Paragraph position="4"> In choosing a reasonable threshold we have to balance the number of features and observed perplexity. null  different cut-off threshold. In the second column of the table are shown the number of features used when only the English context is considered. The third column correspond to English, German and</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>