<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1021">
  <Title>Minimum Error Rate Training in Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Statistical Machine Translation with
Log-linear Models
</SectionTitle>
    <Paragraph position="0"> Let us assume that we are given a source ('French') sentence a0a2a1a4a3a6a5a7 a1a8a3 a7a10a9a10a11a10a11a10a11a12a9 a3a10a13 a9a10a11a10a11a10a11a14a9 a3 a5 , which is to be translated into a target ('English') sentence  target sentences, we will choose the sentence with the highest probability:1</Paragraph>
    <Paragraph position="2"> The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language. The decision in Eq. 1 minimizes the number of decision errors. Hence, under a so-called zero-one loss function this decision rule is optimal (Duda and Hart, 1973). Note that using a different loss function--for example, one induced by the  symbol Pra44a39a45 a46 to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol a47a48a44a39a45 a46 . As the true probability distribution Pra23a39a15a41a40 a0a26a25 is unknown, we have to develop a model a0 a23a39a15a24a40 a0a30a25 that approximates Pra23a39a15a41a40 a0a26a25 . We directly model the posterior probability Pra23a39a15a41a40 a0a26a25 by using a log-linear model. In this framework, we have a set of a1 feature functions  there exists a model parameter a9 a3 a9a6a5 a1a10a7 a9a10a11a10a11a10a11a14a9 a1 . The direct translation probability is given by:</Paragraph>
    <Paragraph position="4"> In this framework, the modeling problem amounts to developing suitable feature functions that capture the relevant properties of the translation task. The training problem amounts to obtaining suitable pa-</Paragraph>
    <Paragraph position="6"> linear models is the MMI (maximum mutual information) criterion, which can be derived from the maximum entropy principle:</Paragraph>
    <Paragraph position="8"> The optimization problem under this criterion has very nice properties: there is one unique global optimum, and there are algorithms (e.g. gradient descent) that are guaranteed to converge to the global optimum. Yet, the ultimate goal is to obtain good translation quality on unseen test data. Experience shows that good results can be obtained using this approach, yet there is no reason to assume that an optimization of the model parameters using Eq. 4 yields parameters that are optimal with respect to translation quality.</Paragraph>
    <Paragraph position="9"> The goal of this paper is to investigate alternative training criteria and corresponding training algorithms, which are directly related to translation quality measured with automatic evaluation criteria.</Paragraph>
    <Paragraph position="10"> In Section 3, we review various automatic evaluation criteria used in statistical machine translation. In Section 4, we present two different training criteria which try to directly optimize an error count. In Section 5, we sketch a new training algorithm which efficiently optimizes an unsmoothed error count. In Section 6, we describe the used feature functions and our approach to compute the candidate translations that are the basis for our training procedure. In Section 7, we evaluate the different training criteria in the context of several MT experiments.</Paragraph>
  </Section>
class="xml-element"></Paper>