<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1021">
  <Title>Discriminative Reranking for Machine Translation</Title>
  <Section position="9" start_page="0" end_page="0" type="concl">
    <SectionTitle>
7 Conclusions
</SectionTitle>
    <Paragraph position="0"> The use of discriminative reranking of an n-best list produced with a state-of-the-art statistical MT system allowed us to rapidly evaluate the benefits of off-the-shelf parsers, chunkers, and POS taggers for improving the syntactic well-formedness of the MT output. Results are summarized in Table 2; the best single new feature improved the %BLEU score from 31.6 to 32.5. The 95% confidence intervals, computed with the bootstrap resampling method, are about 0.8%. In addition to the experiments with single features, we also integrated multiple features using a greedy approach, adding at each step the feature that most improves the BLEU score. This feature integration produced a statistically significant absolute improvement of 1.3%, to a %BLEU score of 32.9.</Paragraph>
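The greedy feature integration described above can be sketched as forward selection: at each step, tentatively add each remaining feature, keep the one that raises the metric most, and stop when no addition helps. This is a minimal illustration, not the paper's implementation; the feature names and the toy score function (a stand-in for %BLEU on the development corpus) are hypothetical.

```python
def greedy_feature_selection(features, score_fn):
    """Greedily add the feature that most improves score_fn(selected).

    `features` is a list of feature names; `score_fn` maps a list of
    selected features to a quality score (here a stand-in for %BLEU).
    Stops when no single addition improves the score.
    """
    selected = []
    best = score_fn(selected)
    while True:
        candidates = []
        for f in features:
            if f in selected:
                continue
            candidates.append((score_fn(selected + [f]), f))
        if not candidates:
            break
        top_score, top_f = max(candidates)
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_f)
        best = top_score
    return selected, best

# Hypothetical per-feature gains (not the paper's numbers), with
# diminishing returns: each feature beyond the first contributes half
# as much as the previous one.
gains = {"ibm1": 0.9, "parser": 0.1, "pos_lm": 0.3}

def score(sel):
    total = 31.6  # baseline %BLEU from the paper
    for i, f in enumerate(sorted(sel, key=lambda f: -gains[f])):
        total += gains[f] * (0.5 ** i)
    return total
```

Under these toy gains, the selection picks the strongest feature first, mirroring how the IBM Model 1 score dominated in the experiments.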
    <Paragraph position="1"> Our single best feature, and in fact the only single feature to produce a truly significant improvement, was the IBM Model 1 score. We attribute its success to the fact that it addresses the baseline system's weakness of omitting content words, and that it improves word selection through a triggering effect. We hypothesize that this allows for better use of context in, for example, choosing among senses of the source language word.</Paragraph>
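The IBM Model 1 score mentioned above sums, for each target word, the lexical translation probabilities triggered by every source word (plus a NULL token), so a candidate is rewarded when its content words are covered by something in the source. A minimal sketch, assuming a hypothetical lexical table `t_table` of probabilities t(e|f):

```python
import math

def ibm1_log_score(src_words, tgt_words, t_table, null_token="<NULL>"):
    """Log of the IBM Model 1 probability of a candidate translation.

    t_table[(e, f)] is a lexical translation probability t(e|f); missing
    pairs fall back to a small floor so the log stays finite. Every
    target word may be "triggered" by any source word or by NULL.
    """
    floor = 1e-9
    src = [null_token] + list(src_words)
    # Uniform alignment normalization: 1 / (l_f + 1) per target word.
    logp = -len(tgt_words) * math.log(len(src))
    for e in tgt_words:
        p = sum(t_table.get((e, f), floor) for f in src)
        logp += math.log(p)
    return logp
```

Because the sum ranges over all source positions, a target word only needs one plausible trigger anywhere in the source sentence, which is the effect the paragraph above hypothesizes helps word-sense selection.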
    <Paragraph position="2"> A major goal of this work was to find out if we can exploit annotated data such as treebanks for Chinese and English and make use of state-of-the-art deep or shallow parsers to improve MT quality. Unfortunately, none of the implemented syntactic features achieved a statistically significant improvement in the BLEU score. Potential reasons for this might be: * As described in Section 3.2, the use of off-the-shelf taggers and parsers is problematic because of mismatches between the parser training data and our application domain. This might explain why the use of the parser probability as a feature function was not successful. A potential improvement might be to adapt the parser by retraining it on the full training data that has been used by the baseline system.</Paragraph>
    <Paragraph position="3"> * The use of a 1000-best list limits the potential improvements. It is possible that more improvements could be obtained using a larger n-best list or a word graph representation of the candidates.</Paragraph>
    <Paragraph position="4"> * The BLEU score is possibly not sufficiently sensitive to the grammaticality of MT output. This could not only make it difficult to see an improvement in the system's output, but also potentially mislead the BLEU-based optimization of the feature weights. A significantly larger corpus for discriminative training and for evaluation would yield much smaller confidence intervals.</Paragraph>
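The bootstrap confidence intervals discussed in this section come from resampling the test sentences with replacement and recomputing the corpus metric many times. The sketch below simplifies the metric to a mean of per-sentence scores so it stays small; the real procedure recomputes BLEU on each resample.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Bootstrap (1 - alpha) confidence interval for a corpus-level
    metric, here simplified to the mean of per-sentence scores.

    Resamples the sentence list with replacement, recomputes the
    metric on each resample, and reads the interval off the sorted
    resampled values (percentile method).
    """
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Since the interval width shrinks roughly with the square root of the corpus size, this also illustrates why the paragraph above argues that a significantly larger evaluation corpus would yield much smaller confidence intervals.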
    <Paragraph position="5"> * Our discriminative training technique, which directly optimizes the BLEU score on a development corpus, seems to have overfitting problems with a large number of features. One could use a larger development corpus for discriminative training or investigate alternative discriminative training criteria.</Paragraph>
    <Paragraph position="6"> * The amount of annotated data that has been used to train the taggers and parsers is two orders of magnitude smaller than the parallel training data that has been used to train the baseline system (or the word-based features). Possibly, a comparable amount of annotated data (e.g. a treebank with 100 million words) is needed to obtain significant improvements. This is the first large-scale integration of syntactic analysis operating on many different levels with a state-of-the-art phrase-based MT system. The methodology of log-linear feature combination and discriminative reranking of n-best lists computed with a state-of-the-art baseline system allowed members of a large team to simultaneously experiment with hundreds of syntactic feature functions on a common platform.</Paragraph>
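The log-linear feature combination methodology summarized above scores each n-best candidate as a weighted sum of feature values, sum over m of lambda_m * h_m(candidate), and keeps the highest-scoring hypothesis. A minimal sketch, with illustrative feature names (not the paper's) and weights standing in for the lambdas tuned on the development corpus:

```python
def rerank(nbest, weights):
    """Return the candidate maximizing the log-linear model score.

    `nbest` maps candidate strings to feature dicts h_m(candidate);
    `weights` holds the lambdas (features absent from `weights` get
    weight 0, so new features can be tried without retuning everything).
    """
    def score(feats):
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return max(nbest, key=lambda c: score(nbest[c]))
```

Because every feature enters only through this weighted sum, a large team can drop hundreds of new feature functions into the same n-best lists and rerank on a common platform, as the paragraph above describes.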
  </Section>
</Paper>