<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1031">
  <Title>Relabeling Syntax Trees to Improve Syntax-Based Machine Translation Quality</Title>
  <Section position="2" start_page="0" end_page="240" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recent work in statistical machine translation (MT) has sought to overcome the limitations of phrase-based models (Marcu and Wong, 2002; Koehn et al., 2003; Och and Ney, 2004) by making use of syntactic information. Syntax-based MT offers the potential advantages of enforcing syntax-motivated constraints in translation and capturing long-distance/non-contiguous dependencies. Some approaches have used syntax at the core (Wu, 1997; Alshawi et al., 2000; Yamada and Knight, 2001; Gildea, 2003; Eisner, 2003; Hearne and Way, 2003; Melamed, 2004) while others have integrated syntax into existing phrase-based frameworks (Xia and McCord, 2004; Chiang, 2005; Collins et al., 2005; Quirk et al., 2005).</Paragraph>
    <Paragraph position="1"> In this work, we employ a syntax-based model that applies a series of tree/string (xRS) rules (Galley et al., 2004; Graehl and Knight, 2004) to a source language string to produce a target language phrase structure tree. Figure 1 exemplifies the translation process, which is called a derivation, from Chinese into English. The source string to translate (a0a0a0a2a1a1a1</Paragraph>
    <Paragraph position="3"> replaces the Chinese word a4a4a4a2a6a6a6 (shaded) with the English NP-C police. Rule 2(c) then builds a VP over the a3a3a3 NP-C a8a8a8a12a10a10a10 sequence. Next, a0a0a0a11a1a1a1 is translated as the NP-C the gunman by rule 3(c). Finally, rule 4(c) combines the sequence of NP-C VP . into an S, denoting a complete tree. The yield of this tree gives the target translation: the gunman was killed by police .</Paragraph>
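    <Paragraph> The bottom-up derivation described above can be pictured with a small sketch. Only the order of the four rule applications and the final yield come from the text; the placeholder source tokens (SRC_GUNMAN and so on), their order, the flat shape of the VP, and the helper names Tree, apply_rule, and derive are assumptions made for illustration, not the xRS formalism itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Tree:
    """A target-side phrase structure node; children are subtrees or words."""
    label: str
    children: List[Union["Tree", str]] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the yield (the words at the leaves, left to right)."""
        words: List[str] = []
        for child in self.children:
            if isinstance(child, Tree):
                words.extend(child.leaves())
            else:
                words.append(child)
        return words


def apply_rule(items: list, start: int, length: int,
               build: Callable[[list], Tree]) -> list:
    """Replace items[start:start+length] with the single tree built from that span."""
    span = items[start:start + length]
    return items[:start] + [build(span)] + items[start + length:]


def derive() -> Tree:
    # Hypothetical placeholders for the Chinese source words (assumed order).
    items: list = ["SRC_GUNMAN", "SRC_PASSIVE", "SRC_POLICE", "SRC_KILLED", "."]
    # Rule 1: translate one source word as the NP-C "police".
    items = apply_rule(items, 2, 1, lambda s: Tree("NP-C", ["police"]))
    # Rule 2: build a VP over the (source word, NP-C, source word) sequence.
    # The flat VP shape is an assumption made for brevity.
    items = apply_rule(items, 1, 3,
                       lambda s: Tree("VP", ["was", "killed", "by", s[1]]))
    # Rule 3: translate the remaining source word as the NP-C "the gunman".
    items = apply_rule(items, 0, 1, lambda s: Tree("NP-C", ["the", "gunman"]))
    # Rule 4: combine NP-C VP . into an S, completing the tree.
    items = apply_rule(items, 0, 3, lambda s: Tree("S", list(s)))
    return items[0]


if __name__ == "__main__":
    print(" ".join(derive().leaves()))
```

    Each rule consumes a contiguous span of the working sequence and leaves one tree in its place, so the derivation ends when a single S remains.</Paragraph>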
    <Paragraph position="4"> The Penn English Treebank (PTB) (Marcus et al., 1993) is our source of syntactic information, largely due to the availability of reliable parsers. It is not clear, however, whether this resource is suitable, as is, for the task of MT. In this paper, we argue that the overly-general tagset of the PTB is problematic for MT because it fails to capture important grammatical distinctions that are critical in translation. As a solution, we propose methods of relabeling the syntax trees that effectively improve translation quality.</Paragraph>
    <Paragraph position="5"> Consider the derivation in Figure 2. The output translation has two salient errors: determiner/noun number disagreement (*this Turkish positions) and auxiliary/verb tense disagreement (*has demonstrate). The first problem arises because the DT tag, which does not distinguish between singular and plural determiners, allows singular this to be used with plural NNS positions. In the second problem, the VP-C tag fails to communicate that it is headed by the base verb (VB) demonstrate, which should prevent it from being used with the auxiliary VBZ has.</Paragraph>
    <Paragraph position="6"> Information-poor tags like DT and VP-C can be relabeled to encourage more fluent translations, which is the thrust of this paper.</Paragraph>
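    <Paragraph> To make the idea of relabeling concrete, here is a minimal sketch of one possible scheme, not necessarily any of the methods evaluated later in the paper. It splits DT by grammatical number and annotates VP-C with the tag of its verbal head. The label names (DT_SG, DT_PL, VP-C_VB), the tiny determiner lexicon, and the rule that the head is the first child tagged VB* are all assumptions made for illustration.

```python
# A node is [label, child, child, ...]; a preterminal is [tag, word].
SINGULAR_DT = {"this", "that", "a", "an"}   # hypothetical mini-lexicon
PLURAL_DT = {"these", "those"}              # hypothetical mini-lexicon


def is_preterminal(node: list) -> bool:
    """True for [tag, word] nodes that sit directly above a word."""
    return len(node) == 2 and isinstance(node[1], str)


def relabel(node: list) -> list:
    """Return a copy of the tree with more informative DT and VP-C labels."""
    label = node[0]
    if is_preterminal(node):
        word = node[1]
        if label == "DT":
            lowered = word.lower()
            if lowered in SINGULAR_DT:
                label = "DT_SG"
            elif lowered in PLURAL_DT:
                label = "DT_PL"
            # Number-neutral determiners such as "the" keep the plain DT tag.
        return [label, word]
    children = [relabel(child) for child in node[1:]]
    if label == "VP-C":
        # Assumed head rule: the first child whose original tag starts with VB.
        for child in node[1:]:
            if child[0].startswith("VB"):
                label = "VP-C_" + child[0]
                break
    return [label] + children


if __name__ == "__main__":
    noun_phrase = ["NP-C", ["DT", "this"], ["JJ", "Turkish"], ["NNS", "positions"]]
    verb_phrase = ["VP-C", ["VB", "demonstrate"], noun_phrase]
    print(relabel(verb_phrase))
```

    With labels like these, a rule learned for a singular determiner no longer matches contexts that call for a plural one, and a VP-C headed by a base verb is distinguishable from one that can follow has.</Paragraph>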
    <Paragraph position="8"> Section 2 describes our data and experimental procedure. Section 3 explores different relabeling approaches and their impact on translation quality. Section 4 reports a substantial improvement in BLEU achieved by combining the most effective relabeling methods. Section 5 concludes.</Paragraph>
  </Section>
</Paper>