<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1091">
  <Title>A Discriminative Global Training Algorithm for Statistical MT</Title>
  <Section position="3" start_page="0" end_page="721" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper presents a view of phrase-based SMT as a sequential process that generates block orientation sequences. A block is a pair of phrases which are translations of each other. For example, Figure 1 shows an Arabic-English translation example that uses four blocks. During decoding, we view translation as a block segmentation process, where the input sentence is segmented from left to right and the target sentence is generated from bottom to top, one block at a time. A monotone block sequence is generated except for the possibility to handle some local phrase re-ordering. In this local re-ordering model (Tillmann and Zhang, 2005; Kumar and Byrne, 2005) a block a0 with orientation a1 is generated relative to its predecessor block a0a3a2 . During decoding, we maximize the score a4a6a5a8a7 a0a10a9a11a13a12 a1 a9a11a15a14 of a block orientation sequence  ample, where the Arabic words are romanized.</Paragraph>
    <Paragraph position="1"> The following orientation sequence is generated:</Paragraph>
    <Paragraph position="3"> where a0 a67 is a block, a0 a67a69a77 a11 is its predecessor block, and a1 a67a80a79a82a81 a60 a7 efta14a78a12 a65 a7 ighta14a78a12 a57 a7 eutrala14a78a83 is a three-valued orientation component linked to the block  is the number of blocks in the translation. We are interested in learning the weight vector  quencea86 is generated under the restriction that the concatenated source phrases of the blocks a0 a67 yield the input sentence. In modeling a block sequence, we emphasize adjacent block neighbors that have right or left orientation, since in the current experiments only local block swapping is handled (neutral orientation is used for 'detached' blocks as described in (Tillmann and Zhang, 2005)).</Paragraph>
    <Paragraph position="4"> This paper focuses on the discriminative training of the weight vector  used in Eq. 1. The decoding process is decomposed into local decision steps based on Eq. 1, but the model is trained in a global setting as shown below. The advantage of this approach is that it can easily handle tens of millions of features, e.g. up to a87a89a88 million features for the experiments in this paper. Moreover, under this view, SMT becomes quite similar to sequential natural language annotation problems such as part-of-speech tagging and shallow parsing, and the novel training algorithm presented in this paper is actually most similar to work on training algorithms presented for these task, e.g. the on-line training algorithm presented in (McDonald et al., 2005) and the perceptron training algorithm presented in (Collins, 2002). The current approach does not use specialized probability features as in (Och, 2003) in any stage during decoder parameter training. Such probability features include language model, translation or distortion probabilities, which are commonly used in current SMT approaches 1. We are able to achieve comparable performance to (Tillmann and Zhang, 2005). The novel algorithm differs computationally from earlier work in discriminative training algorithms for SMT (Och, 2003) as follows: a90 No computationally expensive a57 -best lists are generated during training: for each input sentence a single block sequence is generated on each iteration over the training data.</Paragraph>
    <Paragraph position="5"> a90 No additional development data set is necessary as the weight vector  is trained on bilingual training data only.</Paragraph>
    <Paragraph position="6"> The paper is structured as follows: Section 2 presents the baseline block sequence model and the feature representation. Section 3 presents the discriminative training algorithm that learns 1A translation and distortion model is used in generating the block set used in the experiments, but these translation probabilities are not used during decoding.</Paragraph>
    <Paragraph position="7"> a good global ranking function used during decoding. Section 4 presents results on a standard Arabic-English translation task. Finally, some discussion and future work is presented in Section 5.</Paragraph>
  </Section>
class="xml-element"></Paper>