Phrase-Based SMT with Shallow Tree-Phrases

3 The Translation Engine

We built a translation engine very similar to the statistical phrase-based engine PHARAOH described in (Koehn, 2004), which we extended to use tree-phrases. Our decoder differs from PHARAOH not only in its use of TPs; it also uses direct translation models. We know from (Och and Ney, 2002) that departing from the noisy-channel approach does not impact the quality of the translations produced.

3.1 The maximization setting

For a source sentence f, our engine incrementally generates a set of translation hypotheses H by combining tree-phrase (TP) units and phrase-phrase (PP) units. (What we call a phrase-phrase unit here is simply a pair of source/target sequences of words.) We define a hypothesis in this set as $h = \{U_i \equiv (F_i, E_i)\}_{i \in [1,u]}$, a set of u pairs of source ($F_i$) and target ($E_i$) sequences of $n_i$ and $m_i$ words respectively:

$$F_i = f_{j^i_1} \ldots f_{j^i_{n_i}} \qquad E_i = e_{k^i_1} \ldots e_{k^i_{m_i}}$$

under the constraints that for all $i \in [1,u]$, $j^i_n < j^i_{n+1}, \forall n \in [1,n_i[$ for a source treelet (similar constraints apply on the target side), and $j^i_{n+1} = j^i_n + 1, \forall n \in [1,n_i[$ for a source phrase. The way the hypotheses are built imposes additional constraints between units, which are described in Section 3.3. Note that, at decoding time, $|e|$, the number of words of the translation, is unknown, but is bounded according to $|f|$ (in our case, $|e|_{max} = 2 \times |f| + 5$).

[Figure 2: SYNTEX parses for the sentence pair of Figure 1. Non-contiguous structures are marked with a star. Each dependent node of a given governor token is displayed as a list surrounding the governor node, e.g. {governor {right-dependent}}. Along with the tokens of each node, we present their respective offsets (the governor/root node has offset 0 by definition). The format we use to represent the treelets is similar to the one proposed in (Quirk et al., 2005).]

We define the source and target projections of a hypothesis h by the proj operator, which collects in order the words of the hypothesis along one language ($\mathrm{proj}_F$ on the source side, $\mathrm{proj}_E$ on the target side).

If we denote by $H_f$ the set of hypotheses that have f as a source projection (that is, $H_f = \{h : \mathrm{proj}_F(h) \equiv f\}$), then our translation engine seeks

$$\hat{e} = \mathrm{proj}_E\big(\operatorname*{argmax}_{h \in H_f} s(h)\big)$$

The function we seek to maximize, s(h), is a log-linear combination of 9 components, and might be better understood as the numerator of a maximum entropy model popular in several statistical MT systems (Och and Ney, 2002; Bertoldi et al., 2004; Zens and Ney, 2004; Simard et al., 2005; Quirk et al., 2005). The components are the so-called feature functions $f_k$ (described below) and the weighting coefficients $\lambda_k$ are the parameters of the model:

$$s(h) = \sum_k \lambda_k \log f_k(h)$$
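The combination itself is straightforward to compute. Below is a minimal sketch of this scoring step, assuming each feature function returns a strictly positive value; the names `features` and `weights` are illustrative, not part of the original system.

```python
import math

def loglinear_score(h, features, weights):
    """Compute s(h) = sum_k lambda_k * log f_k(h).

    `features` maps a component name to a function returning f_k(h) > 0;
    `weights` maps the same name to its tuned coefficient lambda_k.
    """
    return sum(weights[k] * math.log(f(h)) for k, f in features.items())
```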
3.2 The components of the scoring function

We briefly enumerate the features used in this study.

Translation models -- Even if a tree-phrase is a generalization of a standard phrase-phrase unit, for investigation purposes we differentiate in our MT system between two kinds of models: a TP-based model $p_{tp}$ and a phrase-phrase model $p_{pp}$. Both rely on conditional distributions whose parameters are learned over a corpus. Thus, each model is assigned its own weighting coefficient, allowing the tuning process to bias the engine toward a specific kind of unit (TP or PP). We have, for $k \in \{rf, ibm\}$:

$$p^k_{tp}(h) = \prod_{i : U_i \in \mathrm{TP}} p^k(E_i \mid F_i) \qquad p^k_{pp}(h) = \prod_{i : U_i \in \mathrm{PP}} p^k(E_i \mid F_i)$$

with $p^{rf}$ standing for a model trained by relative frequency, whereas $p^{ibm}$ designates a non-normalized score computed from an IBM model-1 translation model p, where $f_0$ designates the so-called NULL word:

$$p^{ibm}(E_i \mid F_i) = \prod_{e \in E_i} \sum_{f \in F_i \cup \{f_0\}} p(e \mid f)$$

Note that by setting $\lambda^{rf}_{tp}$ and $\lambda^{ibm}_{tp}$ to zero, we revert to a standard phrase-based translation engine. This will serve as a reference system in the experiments reported below (see Section 4).

The language model -- Following standard practice, we use a trigram target language model $p_{lm}(\mathrm{proj}_E(h))$ to control the fluency of the translation produced. See Section 3.3 for technical subtleties related to its use in our engine.

Distortion model d -- This feature is very similar to the one described in (Koehn, 2004) and only depends on the offsets of the source units; the only difference arises when TPs are used to build a translation hypothesis. This score encourages the decoder to produce a monotone translation, unless the language model strongly favors the opposite.

Global bias features -- Finally, three simple features help control the translation produced. Each TP (resp. PP) unit used to produce a hypothesis receives a fixed weight $\lambda_t$ (resp. $\lambda_p$). This allows the introduction of an artificial bias favoring either PPs or TPs during decoding. Each target word produced is furthermore given a so-called word penalty $\lambda_w$, which provides a weak way of controlling the preference of the decoder for long or short translations.
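As an illustration, the non-normalized IBM model-1 score above can be computed in a few lines. This is a sketch under the assumption that the lexical distribution is available as a callable `p(e, f)` returning p(e|f); the interface is hypothetical, only the formula comes from this section.

```python
def ibm1_score(E, F, p, f0="NULL"):
    """Non-normalized IBM model-1 score: prod_{e in E} sum_{f in F + {f0}} p(e|f).

    `p(e, f)` is a hypothetical accessor for the lexical probability p(e|f);
    `f0` is the so-called NULL source word, always a candidate alignment.
    """
    score = 1.0
    for e in E:
        score *= sum(p(e, f) for f in list(F) + [f0])
    return score
```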
3.3 The search procedure

The search procedure is described by the algorithm in Figure 3. The first stage of the search consists in collecting all the units (TPs or PPs) whose source part matches the source sentence f. We call U the set of those matching units.

In this study, we apply a simple match policy that we call the exact match policy. A TL t matches a source sentence f if its root matches f at a source position denoted r and if every other word w of t satisfies $f_{o_w + r} = w$, where $o_w$ designates the offset of w in t.

Hypotheses are built synchronously along the target side (by appending the target material to the right of the translation being produced) while progressively covering the positions of the source sentence f being translated.

[Figure 3: the search procedure. It requires a source sentence f and loops over all hypotheses alive $h \in S[s]$ and all units $u \in U$, extending h with u whenever EXTENDS(u, h) holds. In the figure, $\leftarrow$ is used in place of assignments, while $\equiv$ denotes unification (as in languages such as Prolog).]

The search space is organized into a set S of $|f|$ stacks, where a stack S[s] ($s \in [1, |f|]$) contains all the hypotheses covering exactly s source words. A hypothesis $h = (p, t, \rho)$ is composed of its target material t, the source positions covered p, as well as its score $\rho$. The search space is initialized with an empty hypothesis: $S[0] = \{(\emptyset, \epsilon, 0)\}$.

The search procedure consists in extending each partial hypothesis h with every unit that can continue it. This process ends when all partial hypotheses have been expanded. The translation returned is the best one contained in S[|f|]:

$$\hat{h} = \operatorname*{argmax}_{h \in S[|f|]} s(h)$$

To keep the search tractable, each stack S[s] is pruned before being expanded. Only the hypotheses whose scores are within a fraction (controlled by a meta-parameter b, typically 0.0001 in our experiments) of the score of the best hypothesis in that stack are considered for expansion. We also limit the number of hypotheses maintained in a given stack to the top maxStack ones (maxStack is typically set to 500).

Because beam pruning tends to promote within a stack the partial hypotheses that translate easy parts (i.e. parts that are highly scored by the translation and language models), the score considered while pruning involves not only the cost of the partial hypothesis so far, but also an estimate of the future cost that will be incurred by fully expanding it.

FUTURECOST -- We followed the heuristic described in (Koehn, 2004), which consists in computing, for each source range [i,j], the minimum cost c(i,j) with which we can translate the source sequence $f_i^j$. This is pre-computed efficiently at an early stage of the decoding (second line of the algorithm in Figure 3) by a bottom-up dynamic programming scheme relying on the following recursion:

$$c(i,j) = \min\Big(\min_{u \in U : \mathrm{proj}_F(u) \equiv f_i^j} score(u),\ \min_{k \in [i,j[} c(i,k) + c(k+1,j)\Big)$$

where $u_s$ stands for the projection of u on the target side ($u_s \equiv \mathrm{proj}_E(u)$), and score(u) is computed by considering the language model (applied to $u_s$) and the translation components $p_{pp}$ of the s(h) score. The future cost of h is then computed by summing the costs c(i,j) of all its empty source ranges [i,j].
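A minimal sketch of this pre-computation, under the assumption that costs are expressed as negative log-scores (lower is better) and that the helpers `matching_units` and `score` are available; both names are illustrative:

```python
def precompute_future_costs(f, matching_units, score):
    """Bottom-up DP for c(i, j), the minimum cost of translating f[i..j].

    `matching_units(i, j)` is assumed to return the units whose source part
    matches exactly f[i..j]; `score(u)` combines the language-model and
    p_pp components of s(h), expressed as a cost.
    """
    n = len(f)
    INF = float("inf")
    c = [[INF] * n for _ in range(n)]
    for span in range(1, n + 1):                 # bottom-up on span length
        for i in range(n - span + 1):
            j = i + span - 1
            best = min((score(u) for u in matching_units(i, j)), default=INF)
            for k in range(i, j):                # or split the range in two
                best = min(best, c[i][k] + c[k + 1][j])
            c[i][j] = best
    return c
```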
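Putting the pieces together, the overall loop of Figure 3 can be sketched as follows; EXTENDS and UPDATE, detailed next, are passed in as functions, and a hypothesis is represented as a (covered positions, target material, score) triple. This is an illustrative reading of the algorithm, not the original implementation.

```python
import math

def prune(stack, b=1e-4, max_stack=500):
    """Beam and histogram pruning: keep the top max_stack hypotheses whose
    log-score (including the future-cost estimate) lies within a fraction
    b of the best score in the stack."""
    stack = sorted(stack, key=lambda h: h[2], reverse=True)[:max_stack]
    return [h for h in stack if h[2] >= stack[0][2] + math.log(b)]

def decode(f, U, extends, update, s):
    """Stack-based search: S[k] holds the hypotheses covering exactly k
    source words; the best complete hypothesis is taken from S[|f|]."""
    S = [[] for _ in range(len(f) + 1)]
    S[0] = [(frozenset(), (), 0.0)]              # empty hypothesis
    for k in range(len(f)):                      # expand shorter hypotheses first
        for h in prune(S[k]):
            for u in U:
                if extends(u, h):                # u can continue h
                    h2 = update(u, h)            # merge material, rescore
                    S[len(h2[0])].append(h2)     # stack index = words covered
    return max(S[len(f)], key=s)
```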
EXTENDS -- When we simply deal with standard (contiguous) phrases, extending a hypothesis h by a unit u basically requires that the source positions of u be empty in h. The target material of u is then appended to the current hypothesis h.

Because we work with treelets here, things are a little more intricate. Conceptually, we are confronted with the construction of a (partial) source dependency tree while collecting the target material in order. The decoder therefore needs to check whether a given TL (the source part of u) is compatible with the TLs belonging to h. Since we decided in this study to use depth-one treelets, we consider that two TLs are compatible if either they do not share any source word or, if they do, this shared word is the governor of one TL and a dependent in the other.

So, for instance, in the case of Figure 2, the two treelets are deemed compatible (they obviously should be, since they both belong to the same original parse tree) because crédit is the governor in the right-hand treelet while being a dependent in the left-hand one. On the other hand, the two treelets in Figure 4 are not, since président is the governor of both treelets, even though mr. le président suppléant would be a valid source phrase. Note that the treelet {{mr.@-2} {le@-1} président {suppléant@1}} might have been observed during training, in which case it will compete with the treelets in Figure 2.

[Figure 4: two incompatible source treelets and their respective English translations.]

Therefore, extending a hypothesis containing a treelet with a new treelet consists in merging the two treelets (if they are compatible) and combining the target material accordingly. This operation is more complicated than in a standard phrase-based decoder, since we allow gaps on the target side as well. Moreover, the target material of two compatible treelets may intersect. This is for instance the case for the two TPs in Figure 2, where the word funding is common to both phrases.
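The compatibility test itself reduces to a check on shared source positions. A minimal sketch, under the assumption that a depth-one treelet is encoded as its governor position plus the set of its dependent positions (an encoding chosen for the example, not the paper's own data structure):

```python
def compatible(tl_a, tl_b):
    """Two depth-one treelets are compatible if they share no source
    position, or if every shared position is the governor of one
    treelet and a dependent of the other."""
    gov_a, deps_a = tl_a
    gov_b, deps_b = tl_b
    shared = ({gov_a} | deps_a) & ({gov_b} | deps_b)
    return all((p == gov_a and p in deps_b) or (p == gov_b and p in deps_a)
               for p in shared)

# The situation of Figure 4, with hypothetical source positions:
left = (2, {0, 1})     # {mr.} {le} governed by président
right = (2, {3})       # président {suppléant}
print(compatible(left, right))  # False: position 2 governs both treelets
```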
UPDATE -- Whenever u extends h, we add a new hypothesis h' to the corresponding stack $S[|\mathrm{proj}_F(h')|]$. Its score is computed by adding to that of h the score of each component involved in s(h). For all but the language model component, this is straightforward. Care must be taken, however, to update the language model score, since the target material of u does not necessarily come right after that of h, as it would if we only manipulated PP units.

Figure 5 illustrates the kind of bookkeeping required. In practice, the target material of a hypothesis is encoded as a vector of triplets $\{\langle w_i, \log p_{lm}(w_i|c_i), l_i \rangle\}_{i \in [1,|e|_{max}]}$, where $w_i$ is the word at position i in the translation, $\log p_{lm}(w_i|c_i)$ is its score as given by the language model, $c_i$ denotes the largest conditioning context possible, and $l_i$ indicates the length (in words) of $c_i$ (0 means a unigram probability, 1 a bigram probability and 2 a trigram probability). This vector is updated at each extension.

[Figure 5: a new unit extends an existing hypothesis (rectangles). The tag inside each occupied target position shows whether that word has been scored by a unigram, a bigram or a trigram probability.]
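A sketch of that update, assuming a backoff trigram model exposed as a callable `lm(word, context)` returning a probability; the triplet encoding follows the text, while everything else (names, 0-based positions, the None marker for empty slots) is illustrative:

```python
from math import log

def place_word(target, lm, pos, word):
    """Write `word` at target position `pos` and refresh the triplets
    <w_i, log p_lm(w_i|c_i), l_i> of every filled position whose trigram
    window contains `pos`. Empty positions hold None."""
    target[pos] = (word, 0.0, 0)
    for i in range(pos, min(pos + 3, len(target))):
        if target[i] is None:
            continue                              # a gap: nothing to rescore
        w = target[i][0]
        ctx = []                                  # largest usable context
        j = i - 1
        while j >= 0 and len(ctx) < 2 and target[j] is not None:
            ctx.insert(0, target[j][0])
            j -= 1
        target[i] = (w, log(lm(w, tuple(ctx))), len(ctx))
```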
4 Experimental Setting

4.1 Corpora

We conducted our experiments on an in-house version of the Canadian Hansards, focussing on the translation of French into English. The split of this material into train, development and test corpora is detailed in Table 1. The TEST corpus is subdivided into 16 disjoint slices of 500 sentences each that we translated separately. The vocabulary is atypically large, since some tokens are merged by SYNTEX, such as étaient#financées (were financed in English).

The training corpus has been aligned at the word level by two Viterbi word alignments (French-to-English and English-to-French) that we combined in a heuristic way similar to the refined method described in (Och and Ney, 2003). The parameters of the word models (IBM model 2) were trained with the GIZA++ package (Och and Ney, 2000).

[Table 1: the TRAIN, DEV and TEST corpora used in this study. For each language l, l-toks is the number of tokens, l-toks/sent is the average number of tokens per sentence (± the standard deviation), l-types is the number of different token forms, and l-hapax is the number of tokens that appear only once in the corpus.]

4.2 Models

Tree-phrases -- Out of 1.7 million pairs of sentences, we collected more than 3 million different kinds of TLs, from which we projected 6.5 million different kinds of EPs. Slightly less than half of the treelets are contiguous (i.e. involve a sequence of adjacent words); 40% of the EPs are contiguous. When the respective frequency of each TL or EP is factored in, we have approximately 11 million TLs and 10 million EPs. Roughly half of the treelets collected have exactly two dependents (i.e. they are three words long).

Since the word alignment of non-contiguous phrases is likely to be less accurate than the alignment of adjacent word sequences, we further filter the repository of TPs by keeping, for each TL, the most likely EPs according to an estimate of p(EP|TL) that does not take into account the offsets of the EP or the TL.

PP-model -- We collected the PP parameters by simply reading the alignment matrices resulting from the word alignment, in a way similar to the one described in (Koehn et al., 2003). We use an in-house tool to collect pairs of phrases of up to 8 words. Freely available packages such as THOT (Ortiz-Martínez et al., 2005) could be used as well for that purpose.

Language model -- We trained a Kneser-Ney trigram language model using the SRILM toolkit (Stolcke, 2002).

4.3 Protocol

We compared the performance of two versions of our engine: one which employs TPs and PPs (TP-ENGINE hereafter), and one which only uses PPs (PP-ENGINE). We translated the 16 disjoint sub-corpora of the TEST corpus with and without TPs.

We measure the quality of the translations produced with three automatic metrics: two error rates, the sentence error rate (SER) and the word error rate (WER), which we seek to minimize, and BLEU (Papineni et al., 2002), which we seek to maximize. This last metric was computed with the multi-bleu.perl script available at www.statmt.org/wmt06/shared-task/.

We separately tuned both systems on the DEV corpus by applying a brute-force strategy, i.e. by sampling uniformly the range of each parameter (λ) and picking the configuration which led to the best BLEU score. This strategy is inelegant, but in early experiments we found better configurations this way than by applying the Simplex method with multiple starting points. The tuning takes roughly 24 hours of computation on a cluster of 16 computers clocked at 3 GHz but, in practice, we found that one hour of computation is sufficient to obtain a configuration whose performance, while suboptimal, is close enough to the best one reachable by an exhaustive search.

Both configurations were set up to avoid distortions exceeding 3 (maxDist = 3). Stacks were allowed to contain no more than 500 hypotheses (maxStack = 500), and we further restrained the number of hypotheses considered by keeping, for each matching unit (treelet or phrase), the 5 best-ranked target associations. This setting was fixed experimentally on the DEV corpus.
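The brute-force tuning loop amounts to uniform sampling over the weight space. A sketch, where `decode_and_bleu` is a hypothetical helper that translates the DEV corpus under a given weight vector and returns its BLEU score; the trial budget is illustrative:

```python
import random

def brute_force_tune(decode_and_bleu, n_params=9, n_trials=1000, seed=0):
    """Draw each lambda uniformly at random and keep the configuration
    that yields the best BLEU score on the development corpus."""
    rng = random.Random(seed)
    best, best_bleu = None, float("-inf")
    for _ in range(n_trials):
        lambdas = [rng.uniform(0.0, 1.0) for _ in range(n_params)]
        bleu = decode_and_bleu(lambdas)
        if bleu > best_bleu:
            best, best_bleu = lambdas, bleu
    return best, best_bleu
```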