<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1033"> <Title>A Hierarchical Phrase-Based Model for Statistical Machine Translation</Title> <Section position="3" start_page="264" end_page="265" type="metho"> <SectionTitle> 2 The model </SectionTitle> <Paragraph position="0"> Our model is based on a weighted synchronous CFG (Aho and Ullman, 1969). In a synchronous CFG the elementary structures are rewrite rules with aligned pairs of right-hand sides:</Paragraph> <Paragraph position="2"> where X is a nonterminal, g and a are both strings of terminals and nonterminals, and [?] is a one-to-one correspondence between nonterminal occurrences in g and nonterminal occurrences in a. Rewriting begins with a pair of linked start symbols. At each step, two coindexed nonterminals are rewritten using the two components of a single rule, such that none of the newly introduced symbols is linked to any symbols already present.</Paragraph> <Paragraph position="3"> Thus the hierarchical phrase pairs from our above example could be formalized in a synchronous CFG as:</Paragraph> <Paragraph position="5"> where we have used boxed indices to indicate which occurrences of X are linked by [?].</Paragraph> <Paragraph position="6"> Note that we have used only a single nonterminal symbol X instead of assigning syntactic categories to phrases. In the grammar we extract from a bitext (described below), all of our rules use only X, except for two special &quot;glue&quot; rules, which combine a sequence of Xs to form an S:</Paragraph> <Paragraph position="8"> These give the model the option to build only partial translations using hierarchical phrases, and then combine them serially as in a standard phrase-based model. For a partial example of a synchronous CFG derivation, see Figure 1.</Paragraph> <Paragraph position="9"> Following Och and Ney (2002), we depart from the traditional noisy-channel approach and use a more general log-linear model. The weight of each rule is:</Paragraph> <Paragraph position="11"> where the phi are features defined on rules. For our experiments we used the following features, analogous to Pharaoh's default feature set: * P(g |a) and P(a |g), the latter of which is not found in the noisy-channel model, but has been previously found to be a helpful feature (Och and Ney, 2002); * the lexical weights Pw(g |a) and Pw(a |g) (Koehn et al., 2003), which estimate how well the words in a translate the words in g;2 * a phrase penalty exp(1), which allows the model to learn a preference for longer or shorter derivations, analogous to Koehn's phrase penalty (Koehn, 2003).</Paragraph> <Paragraph position="12"> The exceptions to the above are the two glue rules, (13), which has weight one, and (14), which has weight (16) w(S - <S 1 X 2 , S 1 X 2 > ) = exp([?]lg) the idea being that lg controls the model's preference for hierarchical phrases over serial combination of phrases.</Paragraph> <Paragraph position="13"> Let D be a derivation of the grammar, and let f (D) and e(D) be the French and English strings generated by D. Let us represent D as a set of triples <r, i, j> , each of which stands for an application of a grammar rule r to rewrite a nonterminal that spans f (D) ji on the French side.3 Then the weight of D 2This feature uses word alignment information, which is discarded in the final grammar. If a rule occurs in training with more than one possible word alignment, Koehn et al. take the maximum lexical weight; we take a weighted average. 
Let D be a derivation of the grammar, and let f(D) and e(D) be the French and English strings generated by D. Let us represent D as a set of triples ⟨r, i, j⟩, each of which stands for an application of a grammar rule r to rewrite a nonterminal that spans f(D)_i^j on the French side. Then the weight of D is the product of the weights of the rules used in the translation, multiplied by the following extra factors:

    w(D) = ∏_{⟨r,i,j⟩ ∈ D} w(r) × p_LM(e)^λ_LM × exp(−λ_wp |e|)

where p_LM is the language model and exp(−λ_wp |e|), the word penalty, gives some control over the length of the English output.

We have separated these factors out from the rule weights for notational convenience, but it is conceptually cleaner (and necessary for polynomial-time decoding) to integrate them into the rule weights, so that the whole model is a weighted synchronous CFG. The word penalty is easy to integrate; the language model is integrated by intersecting the English-side CFG with the language model, which is a weighted finite-state automaton.

3 Training

The training process begins with a word-aligned corpus: a set of triples ⟨f, e, ∼⟩, where f is a French sentence, e is an English sentence, and ∼ is a (many-to-many) binary relation between positions of f and positions of e. We obtain the word alignments using the method of Koehn et al. (2003), which is based on that of Och and Ney (2004). This involves running GIZA++ (Och and Ney, 2000) on the corpus in both directions and applying refinement rules (the variant they designate "final-and") to obtain a single many-to-many word alignment for each sentence.

Then, following Och and others, we use heuristics to hypothesize a distribution of possible derivations of each training example, and estimate the phrase translation parameters from the hypothesized distribution. To do this, we first identify initial phrase pairs using the same criterion as previous systems (Och and Ney, 2004; Koehn et al., 2003):

Definition 1. Given a word-aligned sentence pair ⟨f, e, ∼⟩, a rule ⟨f_i^j, e_i'^j'⟩ is an initial phrase pair of ⟨f, e, ∼⟩ iff:

1. f_k ∼ e_k' for some k ∈ [i, j] and k' ∈ [i', j'];
2. f_k ≁ e_k' for all k ∈ [i, j] and k' ∉ [i', j'];
3. f_k ≁ e_k' for all k ∉ [i, j] and k' ∈ [i', j'].

Next, we form all possible differences of phrase pairs:

Definition 2. The set of rules of ⟨f, e, ∼⟩ is the smallest set satisfying the following:

1. If ⟨f_i^j, e_i'^j'⟩ is an initial phrase pair, then X → ⟨f_i^j, e_i'^j'⟩ is a rule.
2. If r = X → ⟨γ, α⟩ is a rule and ⟨f_i^j, e_i'^j'⟩ is an initial phrase pair such that γ = γ_1 f_i^j γ_2 and α = α_1 e_i'^j' α_2, then

    X → ⟨γ_1 X_k γ_2, α_1 X_k α_2⟩

is a rule, where k is an index not used in r.
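Definition 1 can be checked directly against the alignment points. The following brute-force sketch is illustrative only: it uses 0-based positions, ignores the length limits discussed below, and omits the rule-subtraction step of Definition 2. It simply enumerates the span pairs that qualify as initial phrase pairs.

```python
def initial_phrase_pairs(f_len, e_len, alignment):
    """Enumerate initial phrase pairs (Definition 1) by brute force.

    alignment is a set of (k, kp) pairs meaning French position k is aligned
    to English position kp.  A span pair ((i, j), (ip, jp)) qualifies iff it
    contains at least one alignment point and no alignment point links a
    position inside one span to a position outside the other.
    """
    pairs = []
    for i in range(f_len):
        for j in range(i, f_len):
            for ip in range(e_len):
                for jp in range(ip, e_len):
                    inside = any(i <= k <= j and ip <= kp <= jp
                                 for (k, kp) in alignment)
                    crossing = any((i <= k <= j) != (ip <= kp <= jp)
                                   for (k, kp) in alignment)
                    if inside and not crossing:
                        pairs.append(((i, j), (ip, jp)))
    return pairs

# Toy example: a 3-word French sentence aligned to a 3-word English sentence.
print(initial_phrase_pairs(3, 3, {(0, 0), (1, 2), (2, 1)}))
```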
The above scheme generates a very large number of rules, which is undesirable not only because it makes training and decoding very slow, but also because it creates spurious ambiguity: a situation in which the decoder produces many derivations that are distinct yet have the same model feature vectors and give the same translation. This can result in n-best lists with very few different translations or feature vectors, which is problematic for the algorithm we use to tune the feature weights. Therefore we filter our grammar according to the following principles, chosen to balance grammar size and performance on our development set:

1. If there are multiple initial phrase pairs containing the same set of alignment points, we keep only the smallest.

2. Initial phrases are limited to a length of 10 words on the French side, and rules are limited to five symbols (nonterminals plus terminals) on the French right-hand side.

3. In the subtraction step, f_i^j must have length greater than one. The rationale is that little would be gained by creating a new rule that is no shorter than the original.

4. Rules can have at most two nonterminals, which simplifies the decoder implementation. Moreover, we prohibit nonterminals that are adjacent on the French side, a major cause of spurious ambiguity.

5. A rule must have at least one pair of aligned words, so that translation decisions are always based on some lexical evidence.

Now we must hypothesize weights for all the derivations. Och's method gives equal weight to all the extracted phrase occurrences. However, our method may extract many rules from a single initial phrase pair; therefore we distribute weight equally among initial phrase pairs, then distribute each phrase pair's weight equally among the rules extracted from it. Treating this distribution as our observed data, we use relative-frequency estimation to obtain P(γ | α) and P(α | γ).

4 Decoding

Our decoder is a CKY parser with beam search, together with a postprocessor for mapping French derivations to English derivations. Given a French sentence f, it finds the best derivation (or the n best derivations, with little overhead) that generates ⟨f, e⟩ for some e. Note that we find the English yield of the highest-probability derivation, and not necessarily the highest-probability e, which would require a more expensive summation over derivations.

We prune the search space in several ways. First, an item that has a score worse than β times the best score in the same cell is discarded; second, an item that is worse than the b-th best item in the same cell is discarded. Each cell contains all the items standing for X spanning f_i^j. We choose b and β to balance speed and performance on our development set. For our experiments, we set b = 40, β = 10^-1 for X cells, and b = 15, β = 10^-1 for S cells. We also prune rules that have the same French side (b = 100).
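The two beam conditions on a single chart cell can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the decoder's actual data structures: scores are treated as probabilities (higher is better) and the example items are invented.

```python
def prune_cell(items, b, beta):
    """Apply the two beam conditions to one chart cell.

    items is a list of (score, item) pairs; higher scores are better.
    An item survives only if it is among the b best items in the cell and
    its score is at least beta times the best score in the cell.
    """
    if not items:
        return []
    ranked = sorted(items, key=lambda pair: pair[0], reverse=True)
    best = ranked[0][0]
    return [(s, it) for (s, it) in ranked[:b] if s >= beta * best]

# X-cell settings from the text (b = 40, beta = 10^-1); the items are made up.
cell = [(0.9, "item A"), (0.5, "item B"), (0.04, "item C")]
print(prune_cell(cell, b=40, beta=0.1))   # item C falls below 0.1 * 0.9
```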
The parser operates only on the French-side grammar; the English-side grammar affects parsing only by increasing the effective grammar size, both because there may be multiple rules with the same French side but different English sides, and because intersecting the language model with the English-side grammar introduces many states into the nonterminal alphabet, which are projected over to the French side. Thus our decoder's search space is many times larger than a monolingual parser's would be. To reduce this effect, we apply the following heuristic when filling a cell: if an item falls outside the beam, then any item that would be generated using a lower-scoring rule or a lower-scoring antecedent item is also assumed to fall outside the beam. This heuristic greatly increases decoding speed, at the cost of some search errors.

Finally, the decoder has a constraint that prohibits any X from spanning a substring longer than 10 words on the French side, corresponding to the maximum length constraint on initial phrases during training. This makes the decoding algorithm asymptotically linear-time.

The decoder is implemented in Python, an interpreted language, with C++ code from the SRI Language Modeling Toolkit (Stolcke, 2002). Using the settings described above, on a 2.4 GHz Pentium IV it takes about 20 seconds to translate each sentence (average sentence length about 30 words). This is faster than our Python implementation of a standard phrase-based decoder, so we expect that a future optimized implementation of the hierarchical decoder will run at a speed competitive with other phrase-based systems.
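To see why the span-length constraint mentioned above makes decoding asymptotically linear-time, note that it bounds the number of X chart cells by roughly 10 per starting position. The short sketch below is illustrative only; it just counts the candidate spans and is not part of the decoder.

```python
def x_spans(n, max_x_span=10):
    """Enumerate the (i, j) half-open spans a CKY-style decoder would fill
    for X cells, under the constraint that no X may cover more than
    max_x_span French words.  The number of such spans is at most
    max_x_span * n, i.e. linear in the sentence length n.
    """
    spans = []
    for length in range(1, min(n, max_x_span) + 1):
        for i in range(n - length + 1):
            spans.append((i, i + length))
    return spans

# For a 30-word sentence only 255 X spans are considered,
# versus 465 without the constraint.
print(len(x_spans(30)))
```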