<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-4002">
  <Title>© 2004 Association for Computational Linguistics. The Alignment Template Approach to Statistical Machine Translation</Title>
  <Section position="2" start_page="0" end_page="420" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Machine translation (MT) is a hard problem, because natural languages are highly complex, many words have various meanings and different possible translations, sentences might have various readings, and the relationships between linguistic entities are often vague. In addition, it is sometimes necessary to take world knowledge into account. The number of relevant dependencies is much too large and those dependencies are too complex to take them all into account in a machine translation system.</Paragraph>
    <Paragraph position="1"> Given these boundary conditions, a machine translation system has to make decisions (produce translations) given incomplete knowledge. In such a case, a principled approach to solving that problem is to use the concepts of statistical decision theory to try to make optimal decisions given incomplete knowledge. This is the goal of statistical machine translation.</Paragraph>
    <Paragraph position="2"> The use of statistical techniques in machine translation has led to dramatic improvements in the quality of research systems in recent years. For example, the statistical approaches of the Verbmobil evaluations (Wahlster 2000) or the U.S. National [...] obtain the best results. In addition, the field of statistical machine translation is progressing rapidly, and the quality of systems keeps improving. An important factor in these improvements is the availability of large amounts of data for training statistical models. Yet the modeling, training, and search methods have also improved since the field of statistical machine translation was pioneered by IBM in the late 1980s and early 1990s (Brown et al. 1990; Brown et al. 1993; Berger et al. 1994). This article focuses on an important improvement, namely, the use of (generalized) phrases instead of just single words as the core elements of the statistical translation model.</Paragraph>
    <Paragraph position="3"> * 1600 Amphitheatre Parkway, Mountain View, CA 94043. E-mail: och@google.com. † Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen University of Technology, Ahornstr. 55, 52056 Aachen, Germany. E-mail: ney@cs.rwth-aachen.de.</Paragraph>
    <Paragraph position="4"> Submission received: 19 November 2002; Revised submission received: 7 October 2003; Accepted for publication: 1 June 2004.</Paragraph>
    <Paragraph position="5"> We describe in Section 2 the basics of our statistical translation model. We suggest the use of a log-linear model to incorporate the various knowledge sources into an overall translation system and to perform discriminative training of the free model parameters. This approach can be seen as a generalization of the originally suggested source-channel modeling framework for statistical machine translation.</Paragraph>
    <Paragraph position="6"> In Section 3, we describe the statistical alignment models used to obtain a word alignment and techniques for learning phrase translations from word alignments. Here, the term phrase just refers to a consecutive sequence of words occurring in text and has to be distinguished from the use of the term in a linguistic sense. The learned bilingual phrases are not constrained by linguistic phrase boundaries. Compared to the word-based statistical translation models in Brown et al. (1993), this model is based on a (statistical) phrase lexicon instead of a single-word-based lexicon. Looking at the results of the recent machine translation evaluations, this approach seems currently to give the best results, and an increasing number of researchers are working on different methods for learning phrase translation lexica for machine translation purposes (Marcu and Wong 2002; Venugopal, Vogel, and Waibel 2003; Tillmann 2003; Koehn, Och, and Marcu 2003). Our approach to learning a phrase translation lexicon works in two stages: In the first stage, we compute an alignment between words, and in the second stage, we extract the aligned phrase pairs. In our machine translation system, we then use generalized versions of these phrases, called alignment templates, that also include the word alignment and use word classes instead of the words themselves.</Paragraph>
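    <Paragraph> As a rough illustration of the second stage described above, the following Python sketch extracts phrase pairs from one word-aligned sentence pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the consistency test (no word inside the target span may be linked to a word outside the source span), the function name extract_phrase_pairs, the max_len limit, and the toy sentences are all illustrative choices that do not appear in this section.

```python
def extract_phrase_pairs(src, tgt, links, max_len=4):
    """Collect phrase pairs that are consistent with a word alignment.

    src, tgt: lists of words; links: set of (source_index, target_index) pairs.
    max_len is a hypothetical limit on phrase length, assumed for this sketch.
    """
    pairs = set()
    for j1 in range(len(src)):
        for j2 in range(j1, min(j1 + max_len, len(src))):
            src_span = range(j1, j2 + 1)
            tgt_pos = [i for (j, i) in links if j in src_span]
            if not tgt_pos:
                continue  # no aligned word in this source span
            i1, i2 = min(tgt_pos), max(tgt_pos)
            if i2 - i1 + 1 > max_len:
                continue  # target side would be too long
            tgt_span = range(i1, i2 + 1)
            # Assumed consistency test: every link that touches the target
            # span must start inside the source span.
            if all(j in src_span for (j, i) in links if i in tgt_span):
                pairs.add((" ".join(src[j1:j2 + 1]), " ".join(tgt[i1:i2 + 1])))
    return pairs


# Toy example (invented data): a monotone one-to-one alignment.
source = ["das", "ist", "gut"]
target = ["that", "is", "good"]
alignment = {(0, 0), (1, 1), (2, 2)}
for pair in sorted(extract_phrase_pairs(source, target, alignment)):
    print(pair)
```

Because the extracted phrases are plain consecutive word sequences, nothing in the sketch depends on linguistic phrase boundaries. The generalization to alignment templates (word classes plus the internal word alignment) is not covered here.</Paragraph>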
    <Paragraph position="7"> In Section 4, we describe the various components of the statistical translation model. The backbone of the translation model is the alignment template feature function, which requires that a translation of a new sentence be composed of a set of alignment templates that covers the source sentence and the produced translation. Other feature functions score the well-formedness of the produced target language sentence (i.e., language model feature functions), the number of produced words, or the order of the alignment templates. Note that all components of our statistical machine translation model are purely data-driven and that there is no need for linguistically annotated corpora. This is an important advantage compared to syntax-based translation models (Yamada and Knight 2001; Gildea 2003; Charniak, Knight, and Yamada 2003) that require a parser for source or target language.</Paragraph>
    <Paragraph position="8"> In Section 5, we describe in detail our search algorithm and discuss an efficient implementation. We use a dynamic-programming-based beam search algorithm that allows a trade-off between efficiency and quality. We also discuss the use of heuristic functions to reduce the number of search errors for a fixed beam size.</Paragraph>
    <Paragraph position="9"> In Section 6, we describe various results obtained on different tasks. For the German-English Verbmobil task, we analyze the effect of various system components. On the French-English Canadian Hansards task, the alignment template system obtains significantly better results than a single-word-based translation model. In the Chinese-English 2002 NIST machine translation evaluation, it yields statistically significantly better results than all competing research and commercial translation systems.</Paragraph>
    <Paragraph position="10"> Och and Ney, The Alignment Template Approach to Statistical Machine Translation. Figure 1: Architecture of the translation approach based on a log-linear modeling approach.</Paragraph>
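    <Paragraph position="11"> 2. Log-Linear Models for Statistical Machine Translation. We are given a source (French) sentence $f = f_1^J = f_1, \ldots, f_j, \ldots, f_J$, which is to be translated into a target (English) sentence $e = e_1^I = e_1, \ldots, e_i, \ldots, e_I$. Among all possible target sentences, we choose the sentence with the highest probability: $\hat{e}_1^I = \operatorname{argmax}_{e_1^I} \{ Pr(e_1^I \mid f_1^J) \}$ (1).</Paragraph>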
    <Paragraph position="13"> The argmax operation denotes the search problem, that is, the generation of the output sentence in the target language.</Paragraph>
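    <Paragraph position="14"> As an alternative to the often used source-channel approach (Brown et al. 1993), we directly model the posterior probability $Pr(e_1^I \mid f_1^J)$ with a log-linear model built on a set of $M$ feature functions $h_m(e_1^I, f_1^J)$, $m = 1, \ldots, M$. For each feature function, there exists a model parameter $\lambda_m$, $m = 1, \ldots, M$. The direct translation probability is given by $Pr(e_1^I \mid f_1^J) = p_{\lambda_1^M}(e_1^I \mid f_1^J)$ (2) $= \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right]}{\sum_{{e'}_1^I} \exp\left[\sum_{m=1}^{M} \lambda_m h_m({e'}_1^I, f_1^J)\right]}$ (3). Footnote 2: The notational convention employed in this article is as follows. We use the symbol $Pr(\cdot)$ to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol $p(\cdot)$.</Paragraph>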
    <Paragraph position="15">  This approach has been suggested by Papineni, Roukos, and Ward (1997, 1998) for a natural language understanding task.</Paragraph>
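    <Paragraph> A log-linear model of the kind discussed here scores each candidate translation by a weighted sum of feature function values and renormalizes over the candidates. The Python sketch below shows that computation. It is only an illustration under assumptions: the renormalization runs over a small finite candidate list rather than over all target sentences, and the function name log_linear_posterior, the feature values, and the weights are invented for the example.

```python
import math


def log_linear_posterior(feature_values, lambdas):
    """Return p(e_k | f) for each candidate k in a finite candidate list.

    feature_values[k][m] stands for h_m(e_k, f); lambdas[m] is the weight of
    feature m. Restricting the sum to a candidate list is an assumption.
    """
    scores = [sum(l * h for l, h in zip(lambdas, row)) for row in feature_values]
    top = max(scores)  # shift by the maximum to avoid overflow in exp
    unnormalized = [math.exp(s - top) for s in scores]
    z = sum(unnormalized)  # renormalization over all candidates
    return [u / z for u in unnormalized]


# Invented feature values for three candidate translations of one sentence.
candidates = ["the house", "house the", "a house"]
feature_values = [[-1.0, 2.0], [-4.0, 2.0], [-2.0, 2.0]]
lambdas = [1.0, 0.5]
for e, p in zip(candidates, log_linear_posterior(feature_values, lambdas)):
    print(e, round(p, 3))
```

Subtracting the maximum score before exponentiating leaves the posterior unchanged, since the shift cancels between numerator and denominator.</Paragraph>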
    <Paragraph position="16"> We obtain the following decision rule: $\hat{e}_1^I = \operatorname{argmax}_{e_1^I} \{ Pr(e_1^I \mid f_1^J) \} = \operatorname{argmax}_{e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\}$, where $h_m$ are the feature functions and $\lambda_m$ the corresponding model parameters. Hence, the time-consuming renormalization in equation (3) is not needed in search. The overall architecture of the log-linear modeling approach is summarized in Figure 1.</Paragraph>
    <Paragraph position="18"> A standard criterion on a parallel training corpus consisting of $S$ sentence pairs $\{(f_s, e_s) : s = 1, \ldots, S\}$ for log-linear models is the maximum class posterior probability criterion, which can be derived from the maximum-entropy principle: $\hat{\lambda}_1^M = \operatorname{argmax}_{\lambda_1^M} \left\{ \sum_{s=1}^{S} \log p_{\lambda_1^M}(e_s \mid f_s) \right\}$. This corresponds to maximizing the equivocation or maximizing the likelihood of the direct-translation model. This direct optimization of the posterior probability in Bayes' decision rule is referred to as discriminative training (Ney 1995) because we directly take into account the overlap in the probability distributions. The optimization problem under this criterion has very nice properties: There is one unique global optimum, and there are algorithms (e.g., gradient descent) that are guaranteed to converge to the global optimum. Yet the ultimate goal is to obtain good translation quality on unseen test data. An alternative training criterion therefore directly optimizes translation quality as measured by an automatic evaluation criterion (Och 2003).</Paragraph>
    <Paragraph position="19"> Typically, the translation probability $Pr(e_1^I \mid f_1^J)$ is decomposed via additional hidden variables. To include these dependencies in our log-linear model, we extend the feature functions to include the dependence on the additional hidden variable. Using for example the alignment a</Paragraph>
  </Section>
</Paper>