<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3105"> <Title>Why Generative Phrase Models Underperform Surface Heuristics</Title> <Section position="3" start_page="31" end_page="33" type="intro"> <SectionTitle> 2 Approach and Evaluation Methodology </SectionTitle>
<Paragraph position="0"> The generative model defined below is evaluated based on the BLEU score it produces in an end-to-end machine translation system from English to French. The top-performing diag-and extraction heuristic (Zens et al., 2002) serves as the baseline for evaluation.[1] Each approach - the generative model and the heuristic baseline - produces an estimated conditional distribution of English phrases given French phrases. We will refer to the distribution derived from the baseline heuristic as $\phi_H$. The distribution learned via the generative model, denoted $\phi_{EM}$, is described in detail below.

[1] The baseline extracts phrases from each sentence pair by computing a word-level alignment for the sentence and then enumerating all phrases compatible with that alignment. The word alignment is computed by first intersecting the directional alignments produced by a generative IBM model (e.g., model 4 with minor enhancements) in each translation direction, then adding certain alignments from the union of the directional alignments based on local growth rules.</Paragraph>
<Section position="1" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 2.1 A Generative Phrase Model </SectionTitle>
<Paragraph position="0"> While our model for computing $\phi_{EM}$ is novel, it is meant to exemplify a class of models that are not only clear extensions to generative word alignment models, but also compatible with the statistical framework assumed during phrase-based decoding.</Paragraph>
<Paragraph position="1"> The generative process we modeled produces a phrase-aligned English sentence from a French sentence, where the former is a translation of the latter. Note that this generative process is opposite to the translation direction of the larger system because of the standard noisy-channel decomposition. The learned parameters from this model will be used to translate sentences from English to French. The generative process modeled has four steps:[2]

[2] Our notation matches the literature for phrase-based translation: $e$ is an English word, $\bar{e}$ is an English phrase, $\bar{e}_1^I$ is a sequence of $I$ English phrases, and $\mathbf{e}$ is an English sentence.</Paragraph>
<Paragraph position="2"> 1. Begin with a French sentence $f$.
2. Segment $f$ into a sequence of $I$ multi-word phrases that span the sentence, $\bar{f}_1^I$.
3. For each phrase $\bar{f}_i \in \bar{f}_1^I$, choose a corresponding position $j$ in the English sentence and establish the alignment $a_j = i$, then generate exactly one English phrase $\bar{e}_j$ from $\bar{f}_i$.
4. The sequence $\bar{e}_j$ ordered by $a$ describes an English sentence $e$.</Paragraph>
<Paragraph position="3"> The corresponding probabilistic model for this generative process is:</Paragraph>
<Paragraph position="4"> \[ P(e \mid f) \;=\; \sum_{\bar{f}_1^I,\, \bar{e}_1^I,\, a} P(e, \bar{f}_1^I, \bar{e}_1^I, a \mid f) \;=\; \sum_{\bar{f}_1^I,\, \bar{e}_1^I,\, a} s(\bar{f}_1^I \mid f) \prod_{\bar{f}_i \in \bar{f}_1^I} \phi(\bar{e}_j \mid \bar{f}_i)\, d(a_j = i \mid f) \] </Paragraph>
<Paragraph position="5"> where $P(e, \bar{f}_1^I, \bar{e}_1^I, a \mid f)$ factors into a segmentation model $s$, a translation model $\phi$, and a distortion model $d$.
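</Paragraph>
<Paragraph position="6"> To make the factorization concrete, the following short Python sketch scores a single aligned segmentation under the three-component model. It is an illustration under assumed representations, not the paper's implementation: the toy phrase table, the uniform segmentation probability, and the geometric distortion discount are stand-ins for the components whose estimation is detailed just below.

    def segmentation_prob(num_segmentations):
        # s(f-bar_1^I | f): assumed uniform over all segmentations of the sentence.
        return 1.0 / num_segmentations

    def distortion_prob(j, i):
        # d(a_j = i | f): a simple position-based discount, loosely akin to
        # IBM model 3's absolute-position distortion (assumed functional form).
        return 0.5 ** abs(j - i)

    def model_prob(f_phrases, e_phrases, a, phi, num_segmentations):
        """P(e, f-bar_1^I, e-bar_1^I, a | f) for one aligned segmentation.

        a[j] = i means English phrase j was generated from French phrase i;
        phi maps (french_phrase, english_phrase) pairs to probabilities.
        """
        p = segmentation_prob(num_segmentations)
        for j, e_bar in enumerate(e_phrases):
            i = a[j]
            p *= phi.get((f_phrases[i], e_bar), 0.0) * distortion_prob(j, i)
        return p

    # Toy usage with a hypothetical two-entry phrase table.
    phi = {("le chat", "the cat"): 0.7, ("dort", "sleeps"): 0.6}
    print(model_prob(["le chat", "dort"], ["the cat", "sleeps"],
                     a=[0, 1], phi=phi, num_segmentations=2))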
<Paragraph position="7"> The parameters for each component of this model are estimated differently:
* The segmentation model $s(\bar{f}_1^I \mid f)$ is assumed to be uniform over all possible segmentations for a sentence.[3]
* The phrase translation model $\phi(\bar{e}_j \mid \bar{f}_i)$ is parameterized by a large table of phrase translation probabilities.
* The distortion model $d(a_j = i \mid f)$ is a discounting function based on absolute sentence position, akin to the one used in IBM model 3.

[3] This segmentation model is deficient given a maximum phrase length: many segmentations are disallowed in practice.</Paragraph>
<Paragraph position="8"> While similar to the joint model in Marcu and Wong (2002), our model takes a conditional form compatible with the statistical assumptions used by the Pharaoh decoder. Thus, after training, the parameters of the phrase translation model $\phi_{EM}$ can be used directly for decoding.</Paragraph> </Section>
<Section position="2" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 2.2 Training </SectionTitle>
<Paragraph position="0"> Significant approximation and pruning are required to train a generative phrase model and table, such as $\phi_{EM}$, with hidden segmentation and alignment variables using the expectation-maximization (EM) algorithm. Computing the likelihood of the data for a set of parameters (the e-step) involves summing over exponentially many possible segmentations for each training sentence. Unlike previous attempts to train a similar model (Marcu and Wong, 2002), we allow information from a word-alignment model to inform our approximation. This approach allowed us to directly estimate translation probabilities even for rare phrase pairs, which were estimated heuristically in previous work.</Paragraph>
<Paragraph position="1"> In each iteration of EM, we re-estimate each phrase translation probability by summing fractional phrase counts (soft counts) from the data given the current model parameters:</Paragraph>
<Paragraph position="2"> \[ \phi_{\text{new}}(\bar{e} \mid \bar{f}) \;=\; \frac{c(\bar{f}, \bar{e})}{\sum_{\bar{e}'} c(\bar{f}, \bar{e}')}, \qquad c(\bar{f}, \bar{e}) \;=\; \sum_{(f,\,e)} \;\; \sum_{(\bar{f}_1^I, \bar{e}_1^I, a) \,\ni\, (\bar{f}, \bar{e})} P(\bar{f}_1^I, \bar{e}_1^I, a \mid e, f) \] where $c(\bar{f}, \bar{e})$ is the expected count of the phrase pair over all training sentence pairs under the current parameters.</Paragraph>
<Paragraph position="3"> This training loop necessitates approximation because summing over all possible segmentations and alignments for each sentence is intractable, requiring time exponential in the length of the sentences. Additionally, the set of possible phrase pairs grows too large to fit in memory. Using word alignments, we can address both problems. In particular, we can determine for any aligned segmentation $(\bar{f}_1^I, \bar{e}_1^I, a)$ whether it is compatible with the word-level alignment for the sentence pair. We define a phrase pair to be compatible with a word alignment if no word in either phrase is aligned with a word outside the other phrase (Zens et al., 2002). Then, $(\bar{f}_1^I, \bar{e}_1^I, a)$ is compatible with the word alignment if each of its aligned phrases is a compatible phrase pair.</Paragraph>
<Paragraph position="4"> The training process is then constrained such that, when evaluating the above sum, only compatible aligned segmentations are considered. That is, we allow $P(e, \bar{f}_1^I, \bar{e}_1^I, a \mid f) > 0$ only for aligned segmentations $(\bar{f}_1^I, \bar{e}_1^I, a)$ such that $a$ provides a one-to-one mapping from $\bar{f}_1^I$ to $\bar{e}_1^I$ where all phrase pairs $(\bar{f}_{a_j}, \bar{e}_j)$ are compatible with the word alignment.</Paragraph>
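<Paragraph position="5"> The compatibility test and the constrained soft-count computation can be sketched in a few lines of Python. This is a minimal illustration under assumed representations (a word alignment as a set of (french_position, english_position) links, phrase spans as half-open index ranges, and a caller that enumerates only the segmentations passing the compatible test); the helper names are hypothetical and this is not the authors' implementation.

    from collections import defaultdict

    def compatible(links, f_span, e_span):
        """Zens et al. (2002) criterion: a phrase pair is compatible with a
        word alignment if no word in either phrase is aligned to a word
        outside the other phrase. Spans are half-open (start, end) ranges."""
        f_lo, f_hi = f_span
        e_lo, e_hi = e_span
        for f_pos, e_pos in links:
            inside_f = f_lo <= f_pos < f_hi
            inside_e = e_lo <= e_pos < e_hi
            if inside_f != inside_e:  # the link crosses a phrase boundary
                return False
        return True

    def soft_counts(segmentations, joint_prob):
        """E-step for one sentence pair: fractional (soft) phrase-pair counts.

        `segmentations` holds the compatible aligned segmentations only, each
        a list of (f_phrase, e_phrase) pairs; `joint_prob` scores a whole
        segmentation under the current model parameters.
        """
        counts = defaultdict(float)
        z = sum(joint_prob(s) for s in segmentations)
        for s in segmentations:
            weight = joint_prob(s) / z  # posterior weight of this segmentation
            for pair in s:
                counts[pair] += weight
        return counts

    def reestimate_phi(counts):
        # M-step: normalize soft counts per French phrase to obtain the new
        # phrase translation probabilities phi(e-bar | f-bar).
        totals = defaultdict(float)
        for (f_bar, _e_bar), c in counts.items():
            totals[f_bar] += c
        return {(f, e): c / totals[f] for (f, e), c in counts.items()}

In full training, the soft counts from all sentence pairs would be pooled before the normalization in reestimate_phi, and the e-step/m-step loop repeated until convergence.</Paragraph>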
<Paragraph position="6"> This constraint has two important effects. First, we force $\phi(\bar{e}_j \mid \bar{f}_i) = 0$ for all phrase pairs not compatible with the word-level alignment for some sentence pair. This restriction successfully reduced the total legal phrase pair types from approximately 250 million to 17 million for 100,000 training sentences. However, some desirable phrases were eliminated because of errors in the word alignments.</Paragraph>
<Paragraph position="7"> Second, the time to compute the e-step is reduced. While in principle it is still intractable, in practice we can compute most sentence pairs' contributions in under a second each. However, some spurious word alignments can disallow all segmentations for a sentence pair, rendering it unusable for training. Several factors, including errors in the word-level alignments, sparse word alignments, and non-literal translations, cause our constraint to rule out approximately 54% of the training set. Thus, the reduced size of the usable training set accounts for some of the degraded performance of $\phi_{EM}$ relative to $\phi_H$. However, the results in Figure 1 of the following section show that $\phi_{EM}$ trained on twice as much data as $\phi_H$ still underperforms the heuristic, indicating a larger issue than decreased training set size.</Paragraph> </Section>
<Section position="3" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 2.3 Experimental Design </SectionTitle>
<Paragraph position="0"> To test the relative performance of $\phi_{EM}$ and $\phi_H$, we evaluated each using an end-to-end translation system from English to French. We chose this non-standard translation direction so that the examples in this paper would be more accessible to a primarily English-speaking audience. All training and test data were drawn from the French/English section of the Europarl sentence-aligned corpus. We tested on the first 1,000 unique sentences of length 5 to 15 in the corpus and trained on sentences of length 1 to 60, starting after the first 10,000.</Paragraph>
<Paragraph position="1"> The system follows the structure proposed in the documentation for the Pharaoh decoder and uses many publicly available components (Koehn, 2003b). The language model was generated from the Europarl corpus using the SRI Language Modeling Toolkit (Stolcke, 2002). Pharaoh performed decoding using a set of default parameters for weighting the relative influence of the language, translation, and distortion models (Koehn, 2003b). A maximum phrase length of three was used for all experiments.</Paragraph>
<Paragraph position="2"> To properly compare $\phi_{EM}$ to $\phi_H$, all aspects of the translation pipeline were held constant except for the parameters of the phrase translation table. In particular, we did not tune the decoding hyperparameters for the different phrase tables.</Paragraph> </Section> </Section> </Paper>