<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4003"> <Title>Example-based Rescoring of Statistical Machine Translation Output</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The statistical machine translation (SMT) framework formulates the problem of translating a sentence from a source language S into a target language T as the maximization of a conditional probability:</Paragraph> <Paragraph position="1"> T* = argmax_T p(S|T) * p(T) </Paragraph> <Paragraph position="2"> where p(S|T) is called the translation model (TM), representing the generation probability from T into S, and p(T) is called the language model (LM), representing the likelihood of the target language (Brown et al., 1993). The TM and LM probabilities are trained automatically from a parallel text corpus (parameter estimation). They represent the general translation knowledge used to map a sequence of words from the source language into the target language. During the translation process (decoding), a statistical score based on the probabilities of the translation and language models is assigned to each translation candidate, and the candidate with the highest combined TM * LM score is selected as the translation output.</Paragraph> <Paragraph position="3"> However, the system might fail to find a good translation because of parameter estimation problems in the statistical models (caused by data sparseness when the model probabilities are estimated) and search errors during the translation process. (Footnote 1: The research reported here was supported in part by a contract with the Telecommunications Advancement Organization of Japan entitled, &quot;A study of speech dialogue translation technology based on a large corpus&quot;.)</Paragraph> <Paragraph position="4"> Moreover, conventional SMT approaches use words as the translation unit. 
Therefore, the optimization is carried out locally, generating the translation word by word.</Paragraph> <Paragraph position="5"> In the framework of example-based machine translation (EBMT), in contrast, a parallel text corpus is used directly to obtain the translation (Nagao, 1984). Given an input sentence, the translation examples from the corpus that best match the input are retrieved and adjusted to obtain the translation. Thus the translation unit used in EBMT approaches is a complete sentence, providing a larger context for generating an appropriate translation. However, this approach requires suitable translation examples to achieve an accurate translation.</Paragraph> <Paragraph position="6"> Combining the statistical and example-based MT approaches is a promising way to overcome the shortcomings of each. In this paper, we propose an example-based rescoring method (EBRS) for selecting translation candidates generated by a statistical decoder, as illustrated in Figure 1.</Paragraph> <Paragraph position="7"> The method first retrieves translation examples that are similar to the input from a parallel text corpus (cf. Section 2). The target parts of these examples (the seeds), each paired with the input, form the input to a statistical decoder (cf. Section 3). The statistical score of each generated translation candidate is then rescored using information about how much the seed sentence is modified during decoding: the distance between the word sequences of the decoder output and its seed sentence is measured based on the costs of edit-distance operations (cf. Section 4). We combine this distance measure with the statistical scores of the SMT engine, resulting in a reliability measure that identifies modeling problems in statistically optimized translation candidates and rejects inappropriate solutions (cf. Section 5).</Paragraph> </Section> </Paper>
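The rescoring idea outlined above can be illustrated with a minimal Python sketch: each decoder output is penalized by its word-level edit distance from the seed sentence it was generated from, and the penalty is combined with the decoder's statistical score. This is a sketch under stated assumptions, not the paper's implementation: unit costs for insertion/deletion/substitution and the weight `alpha` are illustrative choices, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b
    (unit costs for insertion, deletion, and substitution are an
    illustrative assumption; the paper may weight operations differently)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]


def rescore(candidates, alpha=1.0):
    """candidates: list of (smt_log_score, output_tokens, seed_tokens).
    Returns the candidates sorted best-first by the SMT log-score minus
    an edit-distance penalty; alpha is a hypothetical mixing weight."""
    def combined(c):
        smt_score, output, seed = c
        return smt_score - alpha * edit_distance(output, seed)
    return sorted(candidates, key=combined, reverse=True)
```

A candidate that scores slightly better under the SMT models but drifts far from its seed sentence can thus be rejected in favor of one that stays close to an attested translation example.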