<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1608">
  <Title>The impact of parse quality on syntactically-informed statistical machine translation</Title>
  <Section position="5" start_page="63" end_page="67" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> Our goal in the current paper is to measure the impact of parse quality on syntactically-informed statistical machine translation. One method for producing parsers of varying quality might be to train a parser and then to transform its output, e.g.</Paragraph>
    <Paragraph position="1">  by replacing the parser's selection of the parent for certain tokens with different nodes.</Paragraph>
    <Paragraph position="2"> Rather than randomly adding noise to the parses, we decided to vary the quality in ways that more closely mimic the situation that confronts us as we develop machine translation systems. Annotating data for POS requires considerably less human time and expertise than annotating syntactic relations. We therefore used an automatic POS tagger (Toutanova et al., 2003) trained on the complete training section of the Penn Treebank (sections 02-21). Annotating syntactic dependencies is time consuming and requires considerable linguistic expertise.1 We can well imagine annotating syntactic dependencies in order to develop a machine translation system by annotating first a small quantity of data, training a parser, training a system that uses the parses produced by that parser and assessing the quality of the machine translation output. Having assessed the quality of the output, one might annotate additional data and train systems until it appears that the quality of the machine translation output is no longer improving.</Paragraph>
    <Paragraph position="3"> We therefore produced parsers of varying quality by training on the first n sentences of sections 02-21 of the Penn Treebank, where n ranged from 250 to 39,892 (the complete training section). At training time, the gold-standard POS tags were used.</Paragraph>
    <Paragraph position="4"> For parser evaluation and for the machine translation experiments reported here, we used an automatic POS tagger (Toutanova et al., 2003) trained on sections 02-21 of the Penn Treebank.</Paragraph>
    <Paragraph position="5"> We trained English-to-German and English-to-Japanese treelet translation systems on approximately 500,000 manually aligned sentence pairs drawn from technical computer documentation.</Paragraph>
    <Paragraph position="6"> The sentence pairs consisted of the English source sentence and a human-translation of that sentence.</Paragraph>
    <Paragraph position="7"> Table 1 summarizes the characteristics of this data.</Paragraph>
    <Paragraph position="8"> Note that German vocabulary and singleton counts are slightly more than double the corresponding English counts due to complex morphology and pervasive compounding (see section 2.3.1).</Paragraph>
    <Section position="1" start_page="64" end_page="65" type="sub_section">
      <SectionTitle>
3.1 Parser accuracy
</SectionTitle>
      <Paragraph position="0"> To evaluate the accuracy of the parsers trained on different samples of sentences we used the tradi1Various people have suggested to us that the linguistic expertise required to annotate syntactic dependencies is less than the expertise required to apply a formal theory of constituency like the one that informs the Penn Treebank. We tend to agree, but have not put this claim to the test.</Paragraph>
      <Paragraph position="1">  parsers trained on different numbers of sentences.</Paragraph>
      <Paragraph position="2"> The graph compares accuracy on the blind test section of the Penn Treebank to accuracy on a set of 250 sentences drawn from technical text. Punctuation tokens are excluded from the measurement of dependency accuracy.</Paragraph>
      <Paragraph position="3"> tional blind test section of the Penn Treebank (section 23). As is well-known in the parsing community, parse quality degrades when a parser trained on the Wall Street Journal text in the Penn Tree-bank is applied to a different genre or semantic domain. Since the technical materials that we were training the translation system on differ from the Wall Street Journal in lexicon and syntax, we annotated a set of 250 sentences of technical material to use in evaluating the parser. Each of the authors independently annotated the same set of 250 sentences. The annotation took less than six hours for each author to complete. Inter-annotator agreement excluding punctuation was 91.8%. Differences in annotation were resolved by discussion, and the resulting set of annotations was used to evaluate the parsers.</Paragraph>
      <Paragraph position="4"> Figure 2 shows the accuracy of parsers trained on samples of various sizes, excluding punctuation tokens from the evaluation, as is customary in evaluating dependency parsers. When measured against section 23 of the Penn Treebank, the section traditionally used for blind evaluation, the parsers range in accuracy from 77.8% when trained on 250 sentences to 90.8% when trained on all of sections 02-21. As expected, parse accuracy degrades when measured on text that differs greatly from the training text. A parser trained on  accuracy of 76.6% on the technical text. A parser trained on the complete Penn Treebank training section has a dependency accuracy of 84.3% on the technical text.</Paragraph>
      <Paragraph position="5"> Since the parsers make extensive use of lexical features, it is not surprising that the performance on the two corpora should be so similar with only 250 training sentences; there were not sufficient instances of each lexical item to train reliable weights or lexical features. As the amount of training data increases, the parsers are able to learn interesting facts about specific lexical items, leading to improved accuracy on the Penn Treebank. Many of the lexical items that occur in the Penn Treebank, however, occur infrequently or not at all in the technical materials so the lexical information is of little benefit. This reflects the mis-match of content. The Wall Street Journal articles in the Penn Treebank concern such topics as world affairs and the policies of the Reagan administration; these topics are absent in the technical materials. Conversely, the Wall Street Journal articles contain no discussion of such topics as the intricacies of SQL database queries.</Paragraph>
    </Section>
    <Section position="2" start_page="65" end_page="67" type="sub_section">
      <SectionTitle>
3.2 Translation quality
</SectionTitle>
      <Paragraph position="0"> Table 2 presents the impact of parse quality on a treelet translation system, measured using BLEU (Papineni et al., 2002). Since our main goal is to investigate the impact of parser accuracy on translation quality, we have varied the parser training data, but have held the MT training data, part-ofspeech-tagger, and all other factors constant. We observe an upward trend in BLEU score as more training data is made available to the parser; the trend is even clearer in Japanese.2 As a baseline, we include right-branching dependency trees, i.e., trees in which the parent of each word is its left  ants. Here sentences refer to the amount of parser training data, not MT training data.</Paragraph>
      <Paragraph position="1"> neighbor and the root of a sentence is the first word. With this analysis, treelets are simply sub-sequences of the sentence, and therefore are very similar to the phrases of Phrasal SMT. In Englishto-German, this result produces results very comparable to a phrasal SMT system (Koehn et al., 2003) trained on the same data. For English-to-Japanese, however, this baseline performs much worse than a phrasal SMT system. Although phrases and treelets should be nearly identical under this scenario, the decoding constraints are somewhat different: the treelet decoder assumes phrasal cohesion during translation. This constraint may account for the drop in quality.</Paragraph>
      <Paragraph position="2"> Since the confidence intervals for many pairs overlap, we ran pairwise tests for each system to determine which differences were significant at the p &lt; 0.05 level using the bootstrap method described in (Zhang and Vogel, 2004); Table 3 summarizes this comparison. Neither language pair achieves a statistically significant improvement from increasing the training data from 25,000 pairs to the full training set; this is not surprising since the increase in parse accuracy is quite small (90.2% to 90.8% on Wall Street Journal text).</Paragraph>
      <Paragraph position="3"> To further understand what differences in dependency analysis were affecting translation quality, we compared a treelet translation system that  used to train the dependency parser used a parser trained on 250 Penn Treebank sentences to a treelet translation system that used a parser trained on 39,892 Treebank sentences.</Paragraph>
      <Paragraph position="4"> From the test data, we selected 250 sentences where these two parsers produced different analyses. A native speaker of German categorized the differences in machine translation output as either improvements or regressions. We then examined and categorized the differences in the dependency analyses. Table 4 summarizes the results of this comparison. Note that this table simply identifies correlations between parse changes and translation changes; it does not attempt to identify a causal link. In the analysis, we borrow the term &amp;quot;NP [Noun Phrase] identification&amp;quot; from constituency analysis to describe the identification of dependency treelets spanning complete noun phrases.</Paragraph>
      <Paragraph position="5"> There were 141 sentences for which the machine translated output improved, 71 sentences for which the output regressed and 38 sentences for which the output was identical. Improvements in the attachment of prepositions, adverbs, gerunds and dependent verbs were common amongst improved translations, but rare amongst regressed translations. Correct identification of the dependent of a preposition3 was also much more common amongst improvements.</Paragraph>
      <Paragraph position="6"> Certain changes, such as improved root identification and final punctuation attachment, were very common across the corpus. Therefore their common occurrence amongst regressions is not very surprising. It was often the case that improvements in root identification or final punctuation attachment were offset by regressions elsewhere in the same sentence.</Paragraph>
      <Paragraph position="7"> Improvements in the parsers are cases where the syntactic analysis more closely resembles the analysis of dependency structure that results from applying Yamada and Matsumoto's head-finding rules to the Penn Treebank. Figure 4 shows different parses produced by parsers trained on dif3In terms of constituency analysis, a prepositional phrase should consist of a preposition governing a single noun phrase</Paragraph>
      <Paragraph position="9"> ferent numbers of sentences. The parser trained on 250 sentences incorrectly attaches the preposition &amp;quot;from&amp;quot; as a dependent of the noun &amp;quot;objects&amp;quot; whereas the parser trained on the complete Penn Treebank training section correctly attaches the preposition as a dependent of the verb &amp;quot;manipulate&amp;quot;. These two parsers also yield different analyses of the phrase &amp;quot;Microsoft Access objects&amp;quot;.</Paragraph>
      <Paragraph position="10"> In parse (a), &amp;quot;objects&amp;quot; governs &amp;quot;Office&amp;quot; and &amp;quot;Office&amp;quot; in turn governs &amp;quot;Microsoft&amp;quot;. This analysis is linguistically well-motivated, and makes a treelet spanning &amp;quot;Microsoft Office&amp;quot; available to the treelet translation system. In parse (b), the parser has analyzed this phrase so that &amp;quot;objects&amp;quot; directly governs &amp;quot;Microsoft&amp;quot; and &amp;quot;Office&amp;quot;. The analysis more closely reflects the flat branching structure of the Penn Treebank but obscures the affinity of &amp;quot;Microsoft&amp;quot; and &amp;quot;Office&amp;quot;.</Paragraph>
      <Paragraph position="11"> An additional measure of parse utility for MT is the amount of translation material that can be extracted from a parallel corpus. We increased the parser training data from 250 sentences to 39,986 sentences, but held the number of aligned sentence pairs used train other modules constant. The count of treelet translation pairs occurring at least twice in the English-German parallel corpus grew from 1,895,007 to 2,010,451.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>