<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1023"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 177-184, Vancouver, October 2005. c(c)2005 Association for Computational Linguistics Inner-Outer Bracket Models for Word Alignment using Hidden Blocks</Title> <Section position="5" start_page="177" end_page="179" type="metho"> <SectionTitle> 3 Inner-Outer Bracket Models </SectionTitle> <Paragraph position="0"> We treat the constraining block as a hidden variable in a generative model shown in Eqn. 2.</Paragraph> <Paragraph position="2"> where d[] = (de,df) is the hidden block. In the generative process, the model first generates a bracket de for e with a monolingual bracketing model of P(de|e). It then uses the segmentation of the English (de,e) to generate the projected bracket df of f using a generative translation model P(f,df|de,e) = P(df/[?],df[?]|de/[?],de[?]) -- the key model to implement our proposed inner-outer constraints. With the hidden block d[] inferred, the model then generates word alignments within the inner and outer parts separately. We present two generating processes for the inner and outer parts induced by d[] and corresponding two models of P(f,df|de,e). These models are described in the following secions.</Paragraph> <Section position="1" start_page="177" end_page="178" type="sub_section"> <SectionTitle> 3.1 Inner-Outer Bracket Model-A </SectionTitle> <Paragraph position="0"> The first model assumes that the inner part and the outer part are generated independently. By the formal equivalence of (f,df) with (df[?],df/[?]), Eqn. 2 can be approximated as:</Paragraph> <Paragraph position="2"> where P(df[?]|de[?]) and P(df/[?]|de/[?]) are two independent generative models for inner and outer parts, respec- null tively and are futher decompsed into:</Paragraph> <Paragraph position="4"> where {aJ1} is the word alignment vector. Given the block segmentation and word alignment, the generative process first randomly selects a ei according to either P(ei|de[?]) or P(ei|de/[?]); and then generates fj indexed by word alignment aj with i = aj according to a word level lexicon P(fj|eaj). This generative process using the two models of P(df[?]|de[?]) and P(df/[?]|de/[?]) mustsatisfytheconstraintsofsegmentationsinduced by the hidden block d[] = (de,df). The English wordsde[?] insidetheblockcanonlygeneratethewords in df[?] and nothing else; likewise de/[?] only generates df/[?]. Overall, the combination of P(df[?]|de[?])P(df/[?]|de/[?]) in Eqn. 3 collaborates each other quite well in practice. For a particular observation df[?], if de[?] is too small (i.e., missing translations), P(df[?]|de[?]) will suffer; and if de[?] is too big (i.e., robbing useful words from de/[?]), P(df/[?]|de/[?]) will suffer. Therefore, our proposed model in Eqn. 3 combines the two costs and requires both inner and outer parts to be explained well at the same time.</Paragraph> <Paragraph position="5"> Because the model in Eqn. 3 is essentially a two-level (d[] and a) mixture model similar to IBM Models, the EM algorithm is quite straight forward as in IBM models. Shown in the following are several key E-step computations of the posteriors. 
<Paragraph position="6"> The M-step (optimization) is simply the normalization of the fractional counts collected using the posteriors inferred in the E-step. Assuming P(d_e | e) to be a uniform distribution, the posterior of selecting a hidden block given the observations, P(d_{[]} = (d_e, d_f) | e, f), is proportional to the block-level relative frequency P_{rel}(d_f^{\in} | d_e^{\in}) updated in each iteration; it can be smoothed over the inner and outer parts independently to reduce the risk of data sparseness in estimation.</Paragraph> <Paragraph position="8"> In principle, d_e can be a bracket of any length not exceeding the sentence length. If we fix the bracket length to the sentence length, we recover IBM Model-1. Figure 2 summarizes the generation process for Inner-Outer Bracket Model-A.</Paragraph> </Section> <Section position="2" start_page="178" end_page="179" type="sub_section"> <SectionTitle> 3.2 Inner-Outer Bracket Model-B </SectionTitle> <Paragraph position="0"> A block d_{[]} invokes both the inner and the outer generation simultaneously in Bracket Model A (BM-A).</Paragraph> <Paragraph position="1"> However, the generative process is usually more effective in the inner part, as d_{[]} is generally small and accurate. We can build a model focusing on generating only the inner part, with careful inference to avoid errors from noisy blocks. To ensure that all of f_1^J are generated, we need to propose enough blocks to cover each observation f_j. This constraint can be met by treating the whole sentence pair as one block.</Paragraph> <Paragraph position="2"> The generative process is as follows. First the model generates an English bracket d_e as before. The model then generates a projection d_f in f to localize all the a_j's for the given d_e, according to P(d_f | d_e, e); d_e and d_f form a hidden block d_{[]}. Given d_{[]}, the model then generates only the inner part f_j \in d_f^{\in} via

P(f, d_f | d_e, e) \simeq P(d_f^{\in} | d_f, d_e, e) \, P([j_l, j_r] | d_e, e).   (6)

P(d_f^{\in} | d_f, d_e, e) is a bracket-level emission model which generates a bag of contiguous words f_j \in d_f^{\in} under the constraints from the given hidden block d_{[]} = (d_f, d_e). The model is simplified in Eqn. 7 with the assumption of bag-of-words independence within the bracket d_f:

P(d_f^{\in} | d_f, d_e, e) \simeq \prod_{f_j \in d_f^{\in}} P(f_j | d_e, e).   (7)

P([j_l, j_r] | d_e, e) in Eqn. 6 is a localization model, which resembles an HMM's transition probability P(a_j | a_{j-1}) in implementing the assumption that "close-in-source" is aligned to "close-in-target". However, instead of using the simple position variable a_j, P([j_l, j_r] | d_e, e) is more expressive, using word identities to localize the words {f_j} aligned to d_e^{\in}. To model P([j_l, j_r] | d_e, e) reliably, d_f = [j_l, j_r] is equivalently defined by the center and width of the bracket d_f: (\odot_{d_f}, w_{d_f}). To simplify it further, we assume that w_{d_f} and \odot_{d_f} can be predicted independently.</Paragraph> <Paragraph position="7"> The width model P(w_{d_f} | d_e, e) depends on the length of the English bracket and the fertilities of the English words in it. To simplify the M-step computations, we compute the expected width as in Eqn. 8:

E{w_{d_f} | d_e, e} \simeq \gamma \cdot |i_r - i_l + 1|,   (8)

where \gamma is the expected bracket length ratio, approximated by the average sentence length ratio computed over the whole parallel corpus. For Chinese-English, \gamma = 1/1.3 = 0.77. In practice, this estimation is quite reliable.</Paragraph>
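A minimal sketch of the width model of Eqn. 8 under the reading above: \gamma is estimated once as the corpus-level target/source length ratio and then scales the English bracket length. The function names and the corpus representation are assumptions, not the authors' code.

    # Sketch of the width model (Eqn. 8): the expected projected width is the
    # English bracket length scaled by gamma, the average length ratio.

    def estimate_gamma(parallel_corpus):
        """parallel_corpus: list of (f_tokens, e_tokens) pairs."""
        pairs = list(parallel_corpus)
        f_len = sum(len(f) for f, _ in pairs)
        e_len = sum(len(e) for _, e in pairs)
        return f_len / e_len  # roughly 1/1.3 = 0.77 for Chinese-English in the paper

    def expected_width(i_left, i_right, gamma):
        # Eqn. 8: E{w_df | d_e, e} ~ gamma * |i_r - i_l + 1|
        return gamma * (i_right - i_left + 1)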
<Paragraph position="9"> The center model P(\odot_{d_f} | d_e, e) is harder. It is conditioned on the translational equivalence between the English bracket and its projection. We compute the expected center \odot_{d_f} by averaging the weighted expected centers of all the English words in d_e, as in Eqn. 9:

E{\odot_{d_f} | d_e, e} \simeq \frac{1}{|i_r - i_l + 1|} \sum_{i = i_l}^{i_r} \frac{\sum_j j \cdot P(f_j | e_i)}{\sum_{j'} P(f_{j'} | e_i)}.   (9)

The expectations of (\odot_{d_f}, w_{d_f}) from Eqn. 8 and Eqn. 9 give a reliable starting point for a local search for the optimal estimate (\hat{\odot}_{d_f}, \hat{w}_{d_f}), as in Eqn. 10:

(\hat{\odot}_{d_f}, \hat{w}_{d_f}) = \arg\max_{(\odot_{d_f}, w_{d_f})} P(d_f^{\in} | d_e^{\in}) \, P(d_f^{\notin} | d_e^{\notin}),   (10)

where the score functions P(d_f^{\in} | d_e^{\in}) and P(d_f^{\notin} | d_e^{\notin}) are as in Eqn. 4, with the word alignment explicitly given from the previous iteration. For the very first iteration, a full alignment is assumed; this means that every word pair in the parallel sentences is connected. During the local search in Eqn. 10, one can choose the top-1 (Viterbi) candidate (\hat{\odot}_{d_f}, \hat{w}_{d_f}), or take the top-N candidates and normalize over them to obtain posteriors. Except for the local search of (\hat{\odot}_{d_f}, \hat{w}_{d_f}), the remaining EM steps are similar to those of Bracket Model-A, though with different interpretations.</Paragraph> <Paragraph position="14"> By performing the local search in Eqn. 10, Model-B localizes hidden blocks more accurately than the smoothed relative-frequency scheme used in Model-A's EM iterations. The model is also more focused on the predictions in the inner part. Figure 3 summarizes the generative process of Model-B (BM-B).</Paragraph> </Section>
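One way to realize the local search of Eqn. 10 is to enumerate (center, width) candidates around the expectations of Eqns. 8-9 and keep the block with the best inner-times-outer score. The sketch below assumes a scoring callback such as the bracket_model_a_score sketch from Section 3.1; the search radius and span conventions are illustrative choices, not the paper's.

    # Sketch of the local search in Eqn. 10: explore (center, width) candidates
    # around their expected values and keep the block with the best score.

    def local_search(f_sent, e_sent, d_e, exp_center, exp_width, t_table,
                     score_fn, radius=2):
        """score_fn: e.g. the bracket_model_a_score sketch (Eqn. 4 scores).
        Foreign spans d_f are half-open (lo, hi) index pairs."""
        best_score, best_block = -1.0, None
        c0, w0 = int(round(exp_center)), int(round(exp_width))
        for c in range(c0 - radius, c0 + radius + 1):
            for w in range(max(1, w0 - radius), w0 + radius + 1):
                j_left, j_right = c - w // 2, c + (w - 1) // 2
                if j_left < 0 or j_right >= len(f_sent):
                    continue  # candidate bracket falls outside the sentence
                d_f = (j_left, j_right + 1)
                s = score_fn(f_sent, e_sent, d_e, d_f, t_table)
                if s > best_score:
                    best_score, best_block = s, d_f
        return best_block, best_score

Keeping the top-N candidates instead of only the argmax and normalizing their scores yields the posteriors mentioned in the text.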
<Section position="3" start_page="179" end_page="179" type="sub_section"> <SectionTitle> 3.3 A Null Word Model </SectionTitle> <Paragraph position="0"> The null word model allows words to be aligned to nothing. In the traditional IBM models, a universal null word is attached to every sentence pair to compete with the word generators. In our inner-outer bracket models, we use two context-specific null word models, which use both the left and the right context as competitors in the generative process for each observation f_j: P(f_j | f_{j-1}, e) and P(f_j | f_{j+1}, e). This is similar to the approach in (Toutanova et al., 2002), in which the null word model is part of an extended HMM using the left context only. With two null word models, we can associate f_j with its left or right context (i.e., a null link) when the null word models are very strong, or when the word's alignment is too far from the expected center \hat{\odot}_{d_f} of Eqn. 9.</Paragraph> </Section> </Section> <Section position="6" start_page="179" end_page="180" type="metho"> <SectionTitle> 4 A Max-Posterior for Word Alignment </SectionTitle> <Paragraph position="0"> In the HMM framework, (Ge, 2004) proposed a maximum-posterior method which worked much better than Viterbi for Arabic-to-English translation. The difference between maximum-posterior and Viterbi, in a nutshell, is that while Viterbi computes the best state sequence given the observations, maximum-posterior computes the best states one at a time.</Paragraph> <Paragraph position="1"> In HMM terminology, let the states be the words of the foreign sentence f_1^J and the observations be the words of the English sentence e_1^T. We use the subscript t to denote that e_t is observed (or emitted) at time step t. The posterior probabilities P(f_j | e_t) (state given observation) are obtained after forward-backward training. The maximum-posterior word alignments are obtained by first computing the pair

(j, t)^* = \arg\max_{(j, t)} P(f_j | e_t),   (11)

that is, the point at which the posterior is maximum. The pair (j, t)^* defines a word pair (f_j, e_t), which is then aligned. The procedure continues by finding the next maximum in the posterior matrix. Contrast this with Viterbi alignment, where one computes the single best state sequence

\hat{s}_1^T = \arg\max_{s_1^T} P(s_1^T | e_1^T), with each state s_t ranging over the foreign words f_1, ..., f_J.</Paragraph> <Paragraph position="6"> We observe, in parallel corpora, that when one word translates into multiple words in another language, it usually translates into a contiguous sequence of words. Therefore, we impose a contiguity constraint on word alignments: when one word f_j aligns to multiple English words, the English words must be contiguous in e, and vice versa.</Paragraph> <Paragraph position="7"> The algorithm for finding word alignments using max-posterior with the contiguity constraint is given in Algorithm 1.</Paragraph> <Paragraph position="8"> Algorithm 1: A maximum-posterior algorithm with contiguity constraint
 1: while (j, t) = (j, t)^* (as computed in Eqn. 11) do
 2:   if (f_j, e_t) is not yet aligned then
 3:     align(f_j, e_t);
 4:   else if (e_t is contiguous to what f_j is aligned to) or (f_j is contiguous to what e_t is aligned to) then
 5:     align(f_j, e_t);
 6:   end if
 7: end while
The algorithm terminates when there is no 'next' posterior maximum to be found. By definition, there are at most J x T 'next' maxima in the posterior matrix, and because of the contiguity constraint not all (f_j, e_t) pairs are valid alignments, so the algorithm is guaranteed to terminate. The algorithm is, in a sense, directionless: one f_j can align to multiple e_t's and vice versa, as long as the multiple connections are contiguous. Viterbi, however, is directional: one state can emit multiple observations, but one observation can come from only one state.</Paragraph> </Section> </Paper>
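The following Python sketch is one direct transcription of Algorithm 1: it visits the cells of the J x T posterior matrix in decreasing order of posterior (the repeated "next maximum" search) and adds a link only when the pair is new or contiguous to existing links. The data structures and the reading of line 2 as "neither word is aligned yet" are assumptions of this sketch.

    # Sketch of Algorithm 1: greedy maximum-posterior alignment with the
    # contiguity constraint, given posterior[j][t] = P(f_j | e_t).

    def max_posterior_align(posterior):
        """posterior: J x T list of lists; returns a set of (j, t) links."""
        J, T = len(posterior), len(posterior[0])
        f_links = {j: set() for j in range(J)}   # English positions linked to f_j
        e_links = {t: set() for t in range(T)}   # foreign positions linked to e_t
        # visit all (j, t) cells in order of decreasing posterior;
        # no threshold is applied, mirroring the listing above
        cells = sorted(((posterior[j][t], j, t)
                        for j in range(J) for t in range(T)), reverse=True)
        alignment = set()
        for _, j, t in cells:
            if not f_links[j] and not e_links[t]:
                ok = True                        # neither word is aligned yet
            else:
                # allow the link only if it is contiguous to what f_j or e_t
                # is already aligned to (lines 4-5 of Algorithm 1)
                ok = (t - 1 in f_links[j] or t + 1 in f_links[j] or
                      j - 1 in e_links[t] or j + 1 in e_links[t])
            if ok:
                alignment.add((j, t))
                f_links[j].add(t)
                e_links[t].add(j)
        return alignment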