<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3115">
<Title>NTT System Description for the WMT2006 Shared Task</Title>
<Section position="3" start_page="0" end_page="123" type="metho">
<SectionTitle> 2 Translation Models </SectionTitle>
<Paragraph position="0"> We used a log-linear approach (Och and Ney, 2002) in which a foreign language sentence $f_1^J = f_1, f_2, \ldots, f_J$ is translated into another language, i.e. English, $e_1^I = e_1, e_2, \ldots, e_I$, by seeking a maximum likelihood solution of</Paragraph>
<Paragraph position="1"> $$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \Pr(e_1^I \mid f_1^J) = \operatorname*{argmax}_{e_1^I} \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{e'^{I'}_1} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e'^{I'}_1, f_1^J)\right)}.$$ </Paragraph>
<Paragraph position="2"> In this framework, the posterior probability $\Pr(e_1^I \mid f_1^J)$ is directly maximized using a log-linear combination of feature functions $h_m(e_1^I, f_1^J)$, such as an n-gram language model or a translation model.</Paragraph>
<Paragraph position="3"> When decoding, the denominator is dropped since it depends only on $f_1^J$. The feature function scaling factors $\lambda_m$ are optimized based on a maximum likelihood approach (Och and Ney, 2002) or on a direct error minimization approach (Och, 2003). This modeling allows the integration of various feature functions depending on the scenario of how a translation is constituted.</Paragraph>
<Paragraph position="4"> In phrase-based statistical translation (Koehn et al., 2003), a bilingual text is decomposed into $K$ phrase translation pairs $(\bar{e}_1, \bar{f}_{\bar{a}_1}), (\bar{e}_2, \bar{f}_{\bar{a}_2}), \ldots$: the input foreign sentence is segmented into phrases $\bar{f}_1^K$, mapped into corresponding English phrases $\bar{e}_1^K$, and then reordered to form the output English sentence according to a phrase alignment index mapping $\bar{a}$.</Paragraph>
<Paragraph position="5"> In hierarchical phrase-based translation (Chiang, 2005), translation is modeled by a weighted synchronous CFG consisting of production rules whose right-hand sides are paired (Aho and Ullman, 1969):</Paragraph>
<Paragraph position="6"> $$X \rightarrow \langle \gamma, \alpha, \sim \rangle,$$ </Paragraph>
<Paragraph position="7"> where $X$ is a non-terminal, $\gamma$ and $\alpha$ are strings of terminals and non-terminals, and $\sim$ is a one-to-one correspondence between the non-terminals appearing in $\gamma$ and $\alpha$. Starting from an initial non-terminal, each rule rewrites the non-terminals in $\gamma$ and $\alpha$ that are associated by $\sim$.</Paragraph>
<Section position="1" start_page="122" end_page="122" type="sub_section">
<SectionTitle> 2.1 Phrase/Rule Extraction </SectionTitle>
<Paragraph position="0"> The phrase extraction algorithm is based on the one presented by Koehn et al. (2003). First, many-to-many word alignments are induced by running a one-to-many word alignment model, such as GIZA++ (Och and Ney, 2003), in both directions and combining the results based on a heuristic (Och and Ney, 2004). Second, phrase translation pairs are extracted from the word-aligned corpus (Koehn et al., 2003). The method exhaustively extracts phrase pairs $(f_j^{j+m}, e_i^{i+n})$ from a sentence pair $(f_1^J, e_1^I)$ that do not violate the word alignment constraints $a$.</Paragraph>
<Paragraph position="1"> In the hierarchical phrase-based model, production rules are accumulated by computing "holes" for extracted contiguous phrases (Chiang, 2005):
1. A phrase pair $(\bar{f}, \bar{e})$ constitutes a rule: $X \rightarrow \langle \bar{f}, \bar{e} \rangle$.
2. A rule $X \rightarrow \langle \gamma, \alpha \rangle$ and a phrase pair $(\bar{f}, \bar{e})$ such that $\gamma = \gamma_1 \bar{f} \gamma_2$ and $\alpha = \alpha_1 \bar{e} \alpha_2$ constitute a rule: $X \rightarrow \langle \gamma_1 X_k \gamma_2, \alpha_1 X_k \alpha_2 \rangle$.</Paragraph>
</Section>
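The extraction step above can be illustrated with a short sketch of the consistency criterion of Koehn et al. (2003): a phrase pair is kept iff it contains at least one alignment link and no link connects a word inside one of its spans to a word outside the other. This is a minimal Python sketch under our own naming, span-length limit, and toy alignment, not the system's implementation; the hierarchical rules would then be obtained by replacing embedded sub-phrase pairs with the non-terminal $X$, as described above.

```python
def extract_phrases(alignment, J, I, max_len=7):
    """Enumerate phrase pairs (f_{j1}..f_{j2}, e_{i1}..e_{i2}) that are
    consistent with the word alignment: the pair must contain at least
    one link, and every link touching either span must lie inside both."""
    pairs = []
    for j1 in range(J):
        for j2 in range(j1, min(J, j1 + max_len)):
            for i1 in range(I):
                for i2 in range(i1, min(I, i1 + max_len)):
                    inside = [(j, i) for (j, i) in alignment
                              if j1 <= j <= j2 and i1 <= i <= i2]
                    touching = [(j, i) for (j, i) in alignment
                                if j1 <= j <= j2 or i1 <= i <= i2]
                    # consistent iff no link crosses the pair's boundary
                    if inside and inside == touching:
                        pairs.append(((j1, j2), (i1, i2)))
    return pairs

# Toy example: f = "das Haus", e = "the house", links f0-e0 and f1-e1.
print(extract_phrases({(0, 0), (1, 1)}, J=2, I=2))
# -> [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 1), (1, 1))]
```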
<Section position="2" start_page="122" end_page="122" type="sub_section">
<SectionTitle> 2.2 Decoding </SectionTitle>
<Paragraph position="0"> The decoder for the phrase-based model is a left-to-right generation decoder with a beam search strategy synchronized with the cardinality of the already translated foreign words. The decoding process is very similar to that described in (Koehn et al., 2003): it starts from an initial empty hypothesis. From an existing hypothesis, a new hypothesis is generated by consuming a phrase translation pair that covers untranslated foreign word positions. The score of the newly generated hypothesis is updated by combining the scores of the feature functions described in Section 2.3. The English side of the phrase is simply concatenated to form a new prefix of the English sentence.</Paragraph>
<Paragraph position="1"> In the hierarchical phrase-based model, decoding is realized as an Earley-style top-down parser on the foreign language side with a beam search strategy synchronized with the cardinality of the already translated foreign words (Watanabe et al., 2006). The major difference from the phrase-based model's decoder is the handling of non-terminals, or holes, in each rule.</Paragraph>
</Section>
<Section position="3" start_page="122" end_page="123" type="sub_section">
<SectionTitle> 2.3 Feature Functions </SectionTitle>
<Paragraph position="0"> Our phrase-based model uses the standard Pharaoh feature functions (Koehn et al., 2003):
* Relative count-based phrase translation probabilities in both directions.
* Lexically weighted feature functions in both directions.
* The supplied trigram language model.
* A distortion model that counts the number of words skipped.
* The number of words on the English side and the number of phrases that constitute the translation.
For details, please refer to Koehn et al. (2003).</Paragraph>
<Paragraph position="1"> In addition, we added three feature functions to restrict reorderings and to represent the global insertion/deletion of words:
* A lexicalized reordering feature function that scores whether a phrase translation pair is monotonically translated or not (Och et al., 2004): $$h_{\text{lex}}(\bar{e}_1^K, \bar{f}_1^K, \bar{a}_1^K) = \sum_{k=1}^{K} d_k,$$ where $d_k = 1$ iff $\bar{a}_k - \bar{a}_{k-1} = 1$, otherwise $d_k = 0$.
* A deletion feature function that penalizes words not constituting a translation according to a lexicon model $t(e \mid f)$: $$h_{\text{del}}(e_1^I, f_1^J) = \sum_{j=1}^{J} \left[\, \max_i t(e_i \mid f_j) < t_{\text{del}} \,\right].$$ The deletion model simply counts the number of words whose lexicon model probability is lower than a threshold $t_{\text{del}}$. Likewise, we also added an insertion model $h_{\text{ins}}(e_1^I, f_1^J)$ that penalizes spuriously inserted English words using a lexicon model $t(e \mid f)$.</Paragraph>
<Paragraph position="2"> For the hierarchical phrase-based model, we employed the same feature set except for the distortion model and the lexicalized reordering model.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="123" end_page="123" type="metho">
<SectionTitle> 3 Phrase Extraction from Different Word Alignment </SectionTitle>
<Paragraph position="0"> We prepared three kinds of corpora differentiated by tokenization method. First, the simplest preprocessing is lower-casing (lower). Second, corpora were transformed by a multilingual stemmer based on Porter's algorithm (stem; tartarus.org). Third, mixed-cased corpora were truncated to the prefix of four letters of each word (prefix4). For each differently tokenized corpus, we computed word alignments with an HMM translation model (Och and Ney, 2003) and the "grow-diag-final" word alignment refinement heuristic (Koehn et al., 2003). Different preprocessings yield quite divergent alignment points, as illustrated in Table 1. The table also shows the numbers for the intersection and union of the three alignment annotations.</Paragraph>
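The three token views can be made concrete with a short sketch. This is a minimal illustration, not the paper's tooling: the function name is ours, and NLTK's PorterStemmer stands in, as an assumption, for the multilingual tartarus.org stemmers actually used.

```python
from nltk.stem.porter import PorterStemmer  # assumption: stand-in for the
                                            # multilingual tartarus.org stemmers

stemmer = PorterStemmer()

def preprocess(tokens, scheme):
    """Return one of the three token views used for word alignment."""
    if scheme == "lower":    # simplest preprocessing: lower-casing
        return [w.lower() for w in tokens]
    if scheme == "stem":     # Porter-algorithm stemming
        return [stemmer.stem(w) for w in tokens]
    if scheme == "prefix4":  # mixed-cased words truncated to four letters
        return [w[:4] for w in tokens]
    raise ValueError("unknown scheme: %s" % scheme)

tokens = "The translations were reordered".split()
for scheme in ("lower", "stem", "prefix4"):
    print(scheme, preprocess(tokens, scheme))
```

Since the three views differ only at the word-form level, alignments computed over them can be mapped back to the original lower-cased words, which is what the extraction step described next relies on.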
<Paragraph position="1"> The (hierarchical) phrase translation pairs are extracted from the three distinctly word-aligned corpora. In this process, each word is recovered into its lower-cased form. The associated counts are aggregated to constitute the relative count-based feature functions. Table 2 summarizes the sizes of the phrase tables induced from the corpora. The number of rules extracted for the hierarchical phrase-based model was roughly twice as large as that for the phrase-based model. Fewer word alignments resulted in a larger phrase translation table, as observed for the "prefix4" corpus. The size is further increased by our aggregation step (merged).</Paragraph>
<Paragraph position="2"> Different induction/refinement algorithms or preprocessings of a corpus bias the word alignment. We found that some word alignments were consistent even across different preprocessings, though we could not justify whether such alignments would match human intuition. If we could trust such consistently aligned words, reliable (hierarchical) phrase translation pairs would be extracted, which, in turn, would result in better estimates for the relative count-based feature functions. At the same time, differently biased word alignment annotations suggest alternative phrase translation pairs that are useful for increasing the coverage of translations.</Paragraph>
</Section>
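The aggregation ("merged") step admits an equally short sketch: phrase pair counts from the three tables, keyed by the recovered lower-cased forms, are summed and renormalized into relative count-based translation probabilities. The table representation and all names below are our assumptions, not the paper's implementation.

```python
from collections import Counter, defaultdict

def merge_phrase_tables(tables):
    """Sum phrase pair counts from differently preprocessed corpora
    (lower / stem / prefix4) and renormalize to relative count-based
    translation probabilities phi(e | f)."""
    merged = Counter()
    for table in tables:  # each table maps (f_phrase, e_phrase) -> count
        merged.update(table)
    totals = defaultdict(int)
    for (f, e), c in merged.items():
        totals[f] += c
    return {(f, e): c / totals[f] for (f, e), c in merged.items()}

# Hypothetical counts for one source phrase seen in all three tables.
lower   = {("la maison", "the house"): 8}
stem    = {("la maison", "the house"): 6, ("la maison", "house"): 2}
prefix4 = {("la maison", "the house"): 5, ("la maison", "a house"): 1}
print(merge_phrase_tables([lower, stem, prefix4]))
```

Merging in this way both smooths the estimates for pairs observed under several alignments and adds pairs observed under only one, matching the coverage argument above.
</Paper>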