<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1038"> <Title>Discriminative Training and Maximum Entropy Models for Statistical Machine Translation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We are given a source ('French') sentence $f_1^J = f_1, \ldots, f_j, \ldots, f_J$, which is to be translated into a target ('English') sentence $e_1^I = e_1, \ldots, e_i, \ldots, e_I$. Among all possible target sentences, we will choose the sentence with the highest probability:</Paragraph> <Paragraph position="1"> $$\hat{e}_1^I = \mathop{\rm argmax}_{e_1^I} \Pr(e_1^I \mid f_1^J) \qquad (1)$$ </Paragraph> <Paragraph position="2"> The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language.</Paragraph> <Paragraph position="3"> We use the symbol $\Pr(\cdot)$ to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol $p(\cdot)$.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Source-Channel Model </SectionTitle> <Paragraph position="0"> According to Bayes' decision rule, we can, equivalently to Eq. 1, perform the following maximization:</Paragraph> <Paragraph position="1"> $$\hat{e}_1^I = \mathop{\rm argmax}_{e_1^I} \bigl\{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \bigr\} \qquad (2)$$ </Paragraph> <Paragraph position="2"> This approach is referred to as the source-channel approach to statistical MT. Sometimes, it is also referred to as the 'fundamental equation of statistical MT' (Brown et al., 1993). Here, $\Pr(e_1^I)$ is the language model of the target language, whereas $\Pr(f_1^J \mid e_1^I)$ is the translation model. Typically, Eq. 2 is favored over the direct translation model of Eq. 1 with the argument that it yields a modular approach.</Paragraph> <Paragraph position="3"> Instead of modeling one probability distribution, we obtain two different knowledge sources that are trained independently.</Paragraph> <Paragraph position="4"> The overall architecture of the source-channel approach is summarized in Figure 1. 
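As an illustrative sketch (not the paper's implementation), the source-channel decision rule of Eq. 2 can be viewed as rescoring a finite candidate list in log space; the two model functions below are hypothetical toy stand-ins for a real language model and translation model:

```python
# Sketch of the source-channel decision rule (Eq. 2): choose the target
# sentence maximizing Pr(e) * Pr(f | e), computed as a sum of log scores.
# Both model functions are illustrative placeholders, not trained models.

def log_lm(e_words):
    # placeholder target language model log Pr(e): prefers shorter output
    return -0.5 * len(e_words)

def log_tm(f_words, e_words):
    # placeholder translation model log Pr(f | e): prefers matched lengths
    return -abs(len(f_words) - len(e_words))

def source_channel_decode(f_words, candidates):
    # argmax over candidate target sentences; sums of logs replace products
    return max(candidates, key=lambda e: log_lm(e) + log_tm(f_words, e))
```

In a real system the candidate set is not enumerable, so the argmax is realized by a dedicated search procedure rather than a list maximum.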
In general, as shown in this figure, there may be additional transformations to make the translation task simpler for the algorithm. Typically, training is performed by applying a maximum likelihood approach. If the language model $\Pr(e_1^I) = p_\gamma(e_1^I)$ depends on parameters $\gamma$ and the translation model $\Pr(f_1^J \mid e_1^I) = p_\theta(f_1^J \mid e_1^I)$ depends on parameters $\theta$, then the optimal parameter values are obtained by maximizing the likelihood on a parallel training corpus $f_1^S, e_1^S$ (Brown et al., 1993):</Paragraph> <Paragraph position="5"> $$\hat{\gamma} = \mathop{\rm argmax}_{\gamma} \prod_{s=1}^{S} p_\gamma(e_s) \qquad (3)$$ $$\hat{\theta} = \mathop{\rm argmax}_{\theta} \prod_{s=1}^{S} p_\theta(f_s \mid e_s) \qquad (4)$$ </Paragraph> <Paragraph position="6"> We obtain the following decision rule:</Paragraph> <Paragraph position="7"> $$\hat{e}_1^I = \mathop{\rm argmax}_{e_1^I} \bigl\{ p_{\hat{\gamma}}(e_1^I) \cdot p_{\hat{\theta}}(f_1^J \mid e_1^I) \bigr\} \qquad (5)$$ </Paragraph> <Paragraph position="8"> State-of-the-art statistical MT systems are based on this approach. Yet, the use of this decision rule has various problems: 1. The combination of the language model $p_{\hat{\gamma}}(e_1^I)$ and the translation model $p_{\hat{\theta}}(f_1^J \mid e_1^I)$ as shown in Eq. 5 can only be shown to be optimal if the true probability distributions $p_{\hat{\gamma}}(e_1^I) = \Pr(e_1^I)$ and $p_{\hat{\theta}}(f_1^J \mid e_1^I) = \Pr(f_1^J \mid e_1^I)$ are used. Yet, we know that the models and training methods in use provide only poor approximations of the true probability distributions. Therefore, a different combination of language model and translation model might yield better results.</Paragraph> <Paragraph position="9"> 2. There is no straightforward way to extend a baseline statistical MT model by including additional dependencies.</Paragraph> <Paragraph position="10"> 3. Often, we observe that comparable results are obtained by using the following decision rule instead of Eq. 5 (Och et al., 1999):</Paragraph> <Paragraph position="11"> $$\hat{e}_1^I = \mathop{\rm argmax}_{e_1^I} \bigl\{ p_{\hat{\gamma}}(e_1^I) \cdot p_{\hat{\theta}}(e_1^I \mid f_1^J) \bigr\} \qquad (6)$$ </Paragraph> <Paragraph position="12"> Here, we replaced $p_{\hat{\theta}}(f_1^J \mid e_1^I)$ by $p_{\hat{\theta}}(e_1^I \mid f_1^J)$.</Paragraph> <Paragraph position="13"> From the theoretical framework of the source-channel approach, this approach is hard to justify. 
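To make the observation in point 3 concrete, here is a minimal sketch (with hypothetical stand-in models, not the paper's systems) showing that the standard and 'inverted' decision rules differ only in which conditional feature enters the argmax; which one yields better translations is an empirical question:

```python
# Generic decision rule: argmax over e of [log Pr(e) + log_cond(f, e)].
# Passing a stand-in for log p(f | e) corresponds to Eq. 5; passing a
# stand-in for log p(e | f) corresponds to the inverted rule of Eq. 6.

def decode(f, candidates, log_lm, log_cond):
    return max(candidates, key=lambda e: log_lm(e) + log_cond(f, e))

# hypothetical toy models for illustration only
def toy_lm(e):
    return -0.1 * len(e)

def toy_f_given_e(f, e):
    # stands in for the translation model log p(f | e)
    return -abs(len(f) - len(e))

def toy_e_given_f(f, e):
    # stands in for the inverted model log p(e | f)
    return -0.5 * abs(len(e) - len(f))
```

With both rules implemented over the same candidate set, comparing their outputs on held-out data is a direct way to test whether they yield the same translation quality.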
Yet, if both decision rules yield the same translation quality, we can use the decision rule that is better suited for efficient search.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Direct Maximum Entropy Translation Model </SectionTitle> <Paragraph position="0"> As an alternative to the source-channel approach, we directly model the posterior probability $\Pr(e_1^I \mid f_1^J)$.</Paragraph> <Paragraph position="1"> An especially well-founded framework for doing this is maximum entropy (Berger et al., 1996). In this framework, we have a set of $M$ feature functions $h_m(e_1^I, f_1^J)$, $m = 1, \ldots, M$. For each feature function, there exists a model parameter $\lambda_m$, $m = 1, \ldots, M$. The direct translation probability is given by:</Paragraph> <Paragraph position="2"> $$\Pr(e_1^I \mid f_1^J) = p_{\lambda_1^M}(e_1^I \mid f_1^J) \qquad (7)$$ $$= \frac{\exp\bigl[\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\bigr]}{\sum_{\tilde{e}_1^I} \exp\bigl[\sum_{m=1}^{M} \lambda_m h_m(\tilde{e}_1^I, f_1^J)\bigr]} \qquad (8)$$ </Paragraph> <Paragraph position="3"> This approach has been suggested by (Papineni et al., 1997; Papineni et al., 1998) for a natural language understanding task.</Paragraph> <Paragraph position="4"> We obtain the following decision rule:</Paragraph> <Paragraph position="5"> $$\hat{e}_1^I = \mathop{\rm argmax}_{e_1^I} \Bigl\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \Bigr\} \qquad (9)$$ </Paragraph> <Paragraph position="6"> Hence, the time-consuming renormalization in Eq. 8 is not needed in search. The overall architecture of the direct maximum entropy models is summarized in Figure 2.</Paragraph> <Paragraph position="7"> Interestingly, this framework contains as a special case the source-channel approach (Eq. 5) if we use the following two feature functions:</Paragraph> <Paragraph position="8"> $$h_1(e_1^I, f_1^J) = \log \Pr(e_1^I) \qquad (10)$$ $$h_2(e_1^I, f_1^J) = \log \Pr(f_1^J \mid e_1^I) \qquad (11)$$ </Paragraph> <Paragraph position="9"> and set $\lambda_1 = \lambda_2 = 1$. Optimizing the corresponding parameters $\lambda_1$ and $\lambda_2$ of the model in Eq. 8 is equivalent to the optimization of model scaling factors, which is a standard approach in other areas such as speech recognition or pattern recognition.</Paragraph> <Paragraph position="10"> The use of an 'inverted' translation model in the unconventional decision rule of Eq. 6 results if we use the feature function $\log \Pr(e_1^I \mid f_1^J)$ instead of $\log \Pr(f_1^J \mid e_1^I)$. In this framework, this feature can be as good as $\log \Pr(f_1^J \mid e_1^I)$. 
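The computation behind the log-linear model and its decision rule can be sketched as follows over a finite candidate set (function names and feature vectors here are illustrative, not from the paper); the key point is that the normalization constant is the same for every candidate and therefore cancels in the argmax:

```python
import math

def score(lambdas, feats):
    # unnormalized log-linear score: sum over m of lambda_m * h_m(e, f)
    return sum(l * h for l, h in zip(lambdas, feats))

def posterior(lambdas, feats_by_cand, cand):
    # renormalized posterior p(e | f) of Eq. 8, over a finite candidate set
    z = sum(math.exp(score(lambdas, f)) for f in feats_by_cand.values())
    return math.exp(score(lambdas, feats_by_cand[cand])) / z

def decide(lambdas, feats_by_cand):
    # decision rule of Eq. 9: z is constant in e, so search
    # only needs the unnormalized scores
    return max(feats_by_cand, key=lambda c: score(lambdas, feats_by_cand[c]))
```

With `feats_by_cand` mapping each candidate translation to its feature vector, `decide` never computes `z`, which is what makes the unnormalized decision rule attractive in search.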
It has to be verified empirically which of the two features yields better results. We can even use both features $\log \Pr(e_1^I \mid f_1^J)$ and $\log \Pr(f_1^J \mid e_1^I)$, obtaining a more symmetric translation model.</Paragraph> <Paragraph position="11"> As training criterion, we use the maximum class posterior probability criterion:</Paragraph> <Paragraph position="12"> $$\hat{\lambda}_1^M = \mathop{\rm argmax}_{\lambda_1^M} \sum_{s=1}^{S} \log p_{\lambda_1^M}(e_s \mid f_s) \qquad (12)$$ </Paragraph> <Paragraph position="13"> This corresponds to minimizing the equivocation, i.e. maximizing the likelihood of the direct translation model. This direct optimization of the posterior probability in Bayes' decision rule is referred to as discriminative training (Ney, 1995) because we directly take into account the overlap in the probability distributions. The optimization problem has one global optimum, and the optimization criterion is convex.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.3 Alignment Models and Maximum Approximation </SectionTitle> <Paragraph position="0"> Typically, the probability $\Pr(f_1^J \mid e_1^I)$ is decomposed via additional hidden variables. In statistical alignment models $\Pr(f_1^J, a_1^J \mid e_1^I)$, the alignment $a_1^J$ is introduced as a hidden variable:</Paragraph> <Paragraph position="1"> $$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I) \qquad (13)$$ </Paragraph> <Paragraph position="2"> The alignment mapping is $j \to i = a_j$ from source position $j$ to target position $i = a_j$.</Paragraph> <Paragraph position="3"> Search is performed using the so-called maximum approximation:</Paragraph> <Paragraph position="4"> $$\hat{e}_1^I = \mathop{\rm argmax}_{e_1^I} \Bigl\{ \Pr(e_1^I) \cdot \max_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I) \Bigr\} \qquad (14)$$ </Paragraph> <Paragraph position="5"> Hence, the search space consists of the set of all possible target language sentences $e_1^I$ and all possible alignments $a_1^J$.</Paragraph> <Paragraph position="6"> Generalizing this approach to direct translation models, we extend the feature functions to include the dependence on the additional hidden variable. 
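The sum over hidden alignments and its maximum approximation can be sketched as follows; `log_joint` is a hypothetical stand-in for log Pr(f, a | e), and the summation is done in log space for numerical stability:

```python
import math

def log_sum_alignments(f, e, log_joint, alignments):
    # log Pr(f | e): log of the sum over all hidden alignments a of
    # Pr(f, a | e), computed stably via the log-sum-exp trick
    m = max(log_joint(f, a, e) for a in alignments)
    return m + math.log(sum(math.exp(log_joint(f, a, e) - m)
                            for a in alignments))

def max_approximation(f, e, log_joint, alignments):
    # maximum approximation: keep only the single best alignment,
    # a lower bound on the full sum
    return max(log_joint(f, a, e) for a in alignments)
```

The approximation is exact when one alignment dominates the sum, which is the usual justification for using it in search.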
Using $M$ feature functions of the form $h_m(e_1^I, f_1^J, a_1^J)$, $m = 1, \ldots, M$, we obtain the following model:</Paragraph> <Paragraph position="7"> $$\Pr(e_1^I, a_1^J \mid f_1^J) = \frac{\exp\bigl[\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J, a_1^J)\bigr]}{\sum_{\tilde{e}_1^I, \tilde{a}_1^J} \exp\bigl[\sum_{m=1}^{M} \lambda_m h_m(\tilde{e}_1^I, f_1^J, \tilde{a}_1^J)\bigr]} \qquad (15)$$ </Paragraph> <Paragraph position="8"> Obviously, we can perform the same step for translation models with an even richer structure of hidden variables than only the alignment $a_1^J$. To simplify the notation, we shall omit in the following the dependence on the hidden variables of the model.</Paragraph> </Section> </Section> </Paper>