<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2124"> <Title>BiTAM: Bilingual Topic AdMixture Models for Word Alignment</Title> <Section position="5" start_page="969" end_page="971" type="metho"> <SectionTitle> 3 Bilingual Topic AdMixture Model </SectionTitle> <Paragraph position="0"> We now describe the BiTAM formalism, which captures the latent topical structure and generalizes word alignments and translations beyond the sentence level via topic sharing across sentence-pairs:</Paragraph> <Paragraph position="2"> where p(F|E) is a document-level translation model, generating the document F as one entity.</Paragraph> <Paragraph position="3"> In a BiTAM model, a document-pair (F,E) is treated as an admixture of topics, which is induced by random draws of a topic, from a pool of topics, for each sentence-pair. A unique normalized, real-valued vector θ, referred to as the topic-weight vector, which captures the contributions of the different topics, is instantiated for each document-pair, so that the sentence-pairs with their alignments are generated from topics mixed according to these common proportions. Marginally, a sentence-pair is word-aligned according to a unique bilingual model governed by the hidden topical assignments. Therefore, the sentence-level translations are coupled, rather than being independent as assumed in the IBM models and their extensions.</Paragraph> <Paragraph position="4"> Because of this coupling of sentence-pairs (via topic sharing across sentence-pairs according to a common topic-weight vector), BiTAM is likely to improve the coherence of translations by treating the document as a whole entity, instead of as uncorrelated segments that have to be independently aligned and then assembled. There are at least two levels at which the hidden topics can be sampled for a document-pair, namely the sentence-pair and the word-pair levels. We propose three variants of the BiTAM model to capture the latent topics of bilingual documents at these different levels.</Paragraph> <Section position="1" start_page="969" end_page="971" type="sub_section"> <SectionTitle> 3.1 BiTAM-1: The Frameworks </SectionTitle> <Paragraph position="0"> In the first BiTAM model, we assume that topics are sampled at the sentence level. Each document-pair is represented as a random mixture of latent topics. Each topic, topic-k, is represented by a topic-specific word-translation table B_k, which is a translation lexicon: B_{i,j,k} = p(f=f_j|e=e_i, z=k), where z is an indicator variable denoting the choice of a topic.

[Figure 1 (graphical model representations of the BiTAMs): A hexagon denotes a parameter; un-shaded nodes are hidden variables; all plates represent replicates. The outermost M-plate represents the M bilingual document-pairs, the inner N-plate represents the N repeated choices of topics for the sentence-pairs in a document, and the inner J-plate represents the J word-pairs within each sentence-pair. (a) BiTAM-1 samples one topic (denoted by z) per sentence-pair; (b) BiTAM-2 uses the sentence-level topics for both the translation model (i.e., p(f|e,z)) and the monolingual word distribution (i.e., p(e|z)); (c) BiTAM-3 samples one topic per word-pair.]

Given a specific topic-weight vector θ_d for a document-pair, each sentence-pair draws its conditionally independent topic from this mixture of topics. This generative process, for a document-pair (F_d, E_d), is summarized below (a schematic sampler is sketched after the list): 1. Sample the sentence number N from a Poisson(γ).</Paragraph> <Paragraph position="1"> 2. Sample the topic-weight vector θ_d from a Dirichlet(α).</Paragraph> <Paragraph position="2"> 3. For each sentence-pair (f_n, e_n) in the d-th document-pair: (a) Sample the sentence length J_n from a Poisson(δ); (b) Sample a topic z_{dn} from a Multinomial(θ_d); (c) Sample e_j from a monolingual model p(e_j); (d) Sample each word-alignment link a_j from a uniform model p(a_j) (or an HMM); (e) Sample each f_j according to a topic-specific translation lexicon p(f_j|e, a_j, z_n, B).</Paragraph>
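The sampling scheme above can be made concrete with a short simulation. The following is a minimal sketch rather than the authors' implementation: the function name sample_bitam1_doc, the tensor layout of B, the use of numpy, and the choice to give the English side the same length J as the French side are illustrative assumptions, and the length models Poisson(γ), Poisson(δ) and the monolingual model p(e) are kept as simple placeholders.

import numpy as np

def sample_bitam1_doc(alpha, B, p_e, gamma=8.0, delta=12.0, rng=None):
    """Sample one document-pair under the BiTAM-1 generative scheme (steps 1-3).

    alpha : (K,) Dirichlet hyperparameter over topics
    B     : (K, V_f, V_e) topic-specific lexicons, B[k, f, e] = p(f | e, z=k),
            assumed normalized over the French axis
    p_e   : (V_e,) a simple placeholder monolingual model for English words
    """
    rng = np.random.default_rng() if rng is None else rng
    N = max(1, rng.poisson(gamma))              # step 1: number of sentence-pairs
    theta_d = rng.dirichlet(alpha)              # step 2: topic-weight vector for the document
    doc = []
    for _ in range(N):                          # step 3: generate each sentence-pair
        J = max(1, rng.poisson(delta))          # (a) sentence length (English length set to J too)
        z = rng.choice(len(alpha), p=theta_d)   # (b) one topic per sentence-pair
        e = rng.choice(len(p_e), size=J, p=p_e) # (c) English words from p(e)
        a = rng.integers(0, J, size=J)          # (d) uniform alignment links into the English side
        f = np.array([rng.choice(B.shape[1], p=B[z, :, e[a[j]]])  # (e) French words from the
                      for j in range(J)])                         #     topic-specific lexicon
        doc.append((f, e, a, z))
    return theta_d, doc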
<Paragraph position="3"> We assume that, in our model, there are K possible topics that a document-pair can bear. For each document-pair, a K-dimensional Dirichlet random variable θ_d, referred to as the topic-weight vector of the document, can take values in the (K-1)-simplex following the probability density:</Paragraph> <Paragraph position="4"> p(θ_d|α) = ( Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k) ) θ_{d1}^{α_1 - 1} ... θ_{dK}^{α_K - 1},</Paragraph> <Paragraph position="5"> where the hyperparameter α is a K-dimensional vector with each component α_k > 0, and Γ(x) is the Gamma function. The alignment is represented by a J-dimensional vector a = {a_1, a_2, ..., a_J}; for each French word f_j at position j, a position variable a_j maps it to an English word e_{a_j} at position a_j in the English sentence. The word-level translation lexicon probabilities are topic-specific, and they are parameterized by the matrix B = {B_k}.</Paragraph> <Paragraph position="6"> For simplicity, in our current models we omit the modeling of the sentence number N and the sentence length J_n, and focus only on the bilingual translation model. Figure 1 (a) shows the graphical model representation of the BiTAM generative scheme discussed so far. Note that the sentence-pairs are now connected by the node θ_d.</Paragraph> <Paragraph position="7"> Therefore, marginally, the sentence-pairs are not independent of each other as in traditional SMT models; instead, they are conditionally independent given the topic-weight vector θ_d. Specifically, BiTAM-1 assumes that each sentence-pair has one single topic. Thus, the word-pairs within a sentence-pair are conditionally independent of each other given the hidden topic index z of the sentence-pair.</Paragraph> <Paragraph position="8"> The last two sub-steps (3.d and 3.e) in the BiTAM sampling scheme define a translation model, in which an alignment link a_j is proposed and an observation f_j is generated according to the proposed distributions. We simplify the alignment model of a, as in IBM-1, by assuming that a_j is sampled uniformly at random. Given the parameters α, B, and the English part E, the joint conditional distribution of the topic-weight vector θ, the topic indicators z, the alignment vectors A, and the document F can be written as:</Paragraph> <Paragraph position="10"> where N is the number of sentence-pairs.</Paragraph> <Paragraph position="11"> Marginalizing out θ and z, we obtain the marginal conditional probability of generating F from E for each document-pair:</Paragraph> <Paragraph position="13"> where p(f_n, a_n|e_n, B_{z_n}) is a topic-specific sentence-level translation model. For simplicity, we assume that the French words f_j are conditionally independent of each other, and that the alignment variables a_j are independent of the other variables and are uniformly distributed a priori. Therefore, the distribution for each sentence-pair is:</Paragraph> <Paragraph position="15"> Thus, the conditional likelihood for the entire parallel corpus is given by taking the product of the marginal probabilities of the individual document-pairs in Eqn. 5.</Paragraph>
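For concreteness, the sketch below evaluates the topic-specific sentence-pair term with the alignment summed out under the uniform-alignment assumption stated above, in the style of IBM-1. It is a minimal illustration, not a reproduction of the omitted Eqn. 5; the function names, the 1/(I+1) normalization, and the convention that index 0 of the English vocabulary is the "Null" word are assumptions.

import numpy as np

def sentence_loglik(f_ids, e_ids, B_k, null_id=0):
    """log p(f | e, B_k) with alignments marginalized out under a uniform prior.

    f_ids : French word indices of one sentence (length J)
    e_ids : English word indices of one sentence (length I)
    B_k   : (V_f, V_e) topic-specific lexicon, B_k[f, e] = p(f | e, z=k)
    """
    e_aug = np.concatenate(([null_id], np.asarray(e_ids)))   # attach "Null" to the target side
    # p(f_j | e, B_k) = (1 / (I+1)) * sum_i B_k[f_j, e_i], independently for each j
    per_word = B_k[np.asarray(f_ids)][:, e_aug].mean(axis=1)
    return float(np.sum(np.log(per_word + 1e-12)))

def doc_loglik_given_topics(sent_pairs, B, z_per_sent):
    """Sum of topic-specific sentence log-likelihoods for one document-pair,
    given a fixed topic assignment z_n per sentence-pair."""
    return sum(sentence_loglik(f, e, B[z])
               for (f, e), z in zip(sent_pairs, z_per_sent))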
</Section> <Section position="2" start_page="971" end_page="971" type="sub_section"> <SectionTitle> 3.2 BiTAM-2: Monolingual Admixture </SectionTitle> <Paragraph position="0"> In general, the monolingual model for English can also be a rich topic mixture. This is realized by using the same topic-weight vector θ_d and the same topic indicator z_{dn} sampled according to θ_d, as described in §3.1, to introduce not only a topic-dependent translation lexicon but also a topic-dependent monolingual model of the source language, English in this case, for generating each sentence-pair (Figure 1 (b)). Now e is generated from a topic-based language model β, instead of from a uniform distribution as in BiTAM-1. We refer to this model as BiTAM-2.</Paragraph> <Paragraph position="1"> Unlike BiTAM-1, where the information observed in e_i is passed to z only indirectly via the node f_j and the hidden variable a_j, in BiTAM-2 the topics of corresponding English and French sentences are also strictly aligned, so that the information observed in e_i can be passed directly to z, in the hope of finding more accurate topics. The topics are thus inferred more directly from the observed bilingual data and, as a result, improve alignment.</Paragraph> </Section> <Section position="3" start_page="971" end_page="971" type="sub_section"> <SectionTitle> 3.3 BiTAM-3: Word-level Admixture </SectionTitle> <Paragraph position="0"> It is straightforward to extend the sentence-level BiTAM-1 to a word-level admixture model by sampling a topic indicator z_{n,j} for each word-pair (f_j, e_{a_j}) in the n-th sentence-pair, rather than once for all words in the sentence (Figure 1 (c)).</Paragraph> <Paragraph position="1"> This gives rise to our BiTAM-3. The conditional likelihood functions can be obtained by extending the formulas in §3.1 to move the variable z_{n,j} inside the loop over each of the f_{n,j} (the sketch below illustrates the change to the sampler).</Paragraph> </Section>
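The only change relative to the sentence-level sampler sketched in §3.1 is where the topic indicator is drawn. The fragment below is a hedged illustration of that difference, reusing the tensor layout assumed earlier; it is not the authors' implementation, and the function name is an assumption.

import numpy as np

def sample_bitam3_sentence(theta_d, B, e, rng):
    """BiTAM-3: draw one topic per word-pair instead of one per sentence-pair.

    theta_d : (K,) document topic-weight vector
    B       : (K, V_f, V_e) topic-specific lexicons
    e       : English word indices of the sentence
    """
    f, a, z = [], [], []
    for j in range(len(e)):
        z_nj = rng.choice(len(theta_d), p=theta_d)   # topic indicator z_{n,j}, drawn per word-pair
        a_j = int(rng.integers(0, len(e)))           # uniform alignment link
        f_j = rng.choice(B.shape[1], p=B[z_nj, :, e[a_j]])
        z.append(z_nj); a.append(a_j); f.append(f_j)
    return np.array(f), np.array(a), np.array(z)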
<Section position="4" start_page="971" end_page="971" type="sub_section"> <SectionTitle> 3.4 Incorporation of Word "Null" </SectionTitle> <Paragraph position="0"> Similar to the IBM models, a "Null" word is used for source words that have no translation counterparts in the target language. For example, the Chinese words "de" (的), "ba" (把) and "bei" (被) generally do not have translations in English.</Paragraph> <Paragraph position="1"> "Null" is attached to every target sentence to align the source words that lack translations.</Paragraph> <Paragraph position="2"> Specifically, the latent Dirichlet allocation (LDA) model of (Blei et al., 2003) can be viewed as a special case of BiTAM-3, in which the target sentence contains only one word, "Null", and the alignment link a is no longer a hidden variable.</Paragraph> </Section> </Section> <Section position="6" start_page="971" end_page="972" type="metho"> <SectionTitle> 4 Learning and Inference </SectionTitle> <Paragraph position="0"> Due to the hybrid nature of the BiTAM models, exact posterior inference of the hidden variables A, z and θ is intractable. Variational inference is used to approximate the true posteriors of these hidden variables. The inference scheme is presented for BiTAM-1; the algorithms for BiTAM-2 and BiTAM-3 are straightforward extensions and are omitted.</Paragraph> <Section position="1" start_page="971" end_page="972" type="sub_section"> <SectionTitle> 4.1 Variational Approximation </SectionTitle> <Paragraph position="0"> To approximate the joint posterior p(θ, z, A|E, F, α, B), we use a fully factorized distribution over the same set of hidden variables:</Paragraph> <Paragraph position="2"> where the Dirichlet parameter γ, the multinomial parameters (φ_1, ..., φ_N), and the parameters (φ_{n1}, ..., φ_{nJ_n}) are known as variational parameters, and can be optimized with respect to the Kullback-Leibler divergence from q(·) to the original p(·) via an iterative fixed-point algorithm. It can be shown that the fixed-point equations for the variational parameters in BiTAM-1 are as follows:</Paragraph> <Paragraph position="4"> where Ψ(·) is the digamma function. Note that, in the above formulas, φ_{dnk} is the variational parameter underlying the topic indicator z_{dn} of the n-th sentence-pair in document d, and it can be used to predict the topic distribution of that sentence-pair.</Paragraph> <Paragraph position="5"> Following a variational EM scheme (Beal and Ghahramani, 2002), we estimate the model parameters α and B in an unsupervised fashion. Essentially, Eqs. (8-10) above constitute the E-step, in which the posterior estimates of the latent variables are obtained. In the M-step, we update α and B so that they improve a lower bound of the log-likelihood defined below:</Paragraph> <Paragraph position="7"> The closed-form iterative updating formula for B is:</Paragraph> <Paragraph position="9"> For α, a closed-form update is not available, and we resort to gradient ascent, as in (Sjölander et al., 1996), with restarts to ensure that each updated α_k > 0.</Paragraph> </Section>
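The E-step described above can be sketched with a textbook mean-field derivation for the BiTAM-1 generative process: the alignment posteriors combine topic-specific log-lexicons weighted by the topic posterior, the topic posteriors combine digamma terms from the Dirichlet with alignment-weighted log-lexicons, and γ accumulates the topic posteriors. The code below follows that standard recipe and the structure described in the text, but it is only assumed, not guaranteed, to match the omitted Eqs. (8-10) symbol for symbol; the function name and the initialization are also illustrative.

import numpy as np
from scipy.special import digamma

def e_step_fixed_point(doc_pair, alpha, B, n_iters=20):
    """Mean-field fixed-point updates for one document-pair (BiTAM-1 style).

    doc_pair : list of (f_ids, e_ids) sentence-pairs (word index sequences)
    alpha    : (K,) Dirichlet hyperparameter
    B        : (K, V_f, V_e) topic-specific lexicons, B[k, f, e] = p(f | e, z=k)
    """
    alpha = np.asarray(alpha, dtype=float)
    K, N = len(alpha), len(doc_pair)
    gamma = alpha + float(N) / K                       # variational Dirichlet parameter
    logB = np.log(B + 1e-12)
    phi_topic = [np.full(K, 1.0 / K) for _ in range(N)]
    phi_align = [np.full((len(f), len(e)), 1.0 / len(e)) for f, e in doc_pair]

    for _ in range(n_iters):
        for n, (f, e) in enumerate(doc_pair):
            # log-lexicon entries for this sentence-pair: shape (K, J, I)
            lb = logB[:, np.asarray(f)[:, None], np.asarray(e)[None, :]]
            # q(a_nj = i) proportional to exp( sum_k q(z_n=k) log B[k, f_j, e_i] )
            s = np.einsum('k,kji->ji', phi_topic[n], lb)
            phi_align[n] = np.exp(s - s.max(axis=1, keepdims=True))
            phi_align[n] /= phi_align[n].sum(axis=1, keepdims=True)
            # q(z_n = k) proportional to exp( digamma(gamma_k) + sum_{j,i} q(a_nj=i) log B[k, f_j, e_i] )
            t = digamma(gamma) + np.einsum('ji,kji->k', phi_align[n], lb)
            phi_topic[n] = np.exp(t - t.max())
            phi_topic[n] /= phi_topic[n].sum()
        gamma = alpha + np.sum(phi_topic, axis=0)      # gamma_k = alpha_k + sum_n phi_{dnk}
    return gamma, phi_topic, phi_align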
<Section position="2" start_page="972" end_page="972" type="sub_section"> <SectionTitle> 4.2 Data Sparseness and Smoothing </SectionTitle> <Paragraph position="0"> The translation lexicons B_{f,e,k} have a potential size of V^2 K, assuming the vocabulary sizes of both languages are V. Data sparsity (i.e., the lack of a large volume of document-pairs) poses a more serious problem in estimating B_{f,e,k} than in the monolingual case, for instance in (Blei et al., 2003). To reduce the data sparsity problem, we introduce two remedies in our models. First, Laplace smoothing: the matrix set B, whose columns correspond to the parameters of conditional multinomial distributions, is treated as a collection of random vectors, all under a symmetric Dirichlet prior; the posterior expectation of these multinomial parameter vectors can then be estimated using Bayesian theory. Second, interpolation smoothing: empirically, we can employ a linear interpolation with IBM-1 to avoid overfitting:</Paragraph> <Paragraph position="2"> As in Eqn. 1, p(f|e) is learned via IBM-1; λ is estimated via EM on held-out data.</Paragraph> </Section> <Section position="3" start_page="972" end_page="972" type="sub_section"> <SectionTitle> 4.3 Retrieving Word Alignments </SectionTitle> <Paragraph position="0"> Two word-alignment retrieval schemes are designed for the BiTAMs: the uni-direction alignment (UDA) and the bi-direction alignment (BDA). Both use the posterior means of the alignment indicators a_{dnji}, captured by what we call the posterior alignment matrix φ ≡ {φ_{dnji}}. UDA uses a French word f_{dnj} (at the j-th position of the n-th sentence in the d-th document) to query φ and retrieve the best-aligned English word, by taking the maximum point in a row of φ:</Paragraph> <Paragraph position="2"> BDA selects iteratively, for each f, the best-aligned e, such that the word-pair (f,e) is the maximum of both its row and its column, or such that its neighbors have more aligned pairs than the other competing candidates (a small sketch of both retrieval rules is given at the end of this section).</Paragraph> <Paragraph position="3"> A close check of {φ_{dnji}} in Eqn. 10 reveals that it is essentially an exponential model: weighted log-probabilities from the individual topic-specific translation lexicons; equivalently, it can be viewed as a weighted geometric mean of the individual lexicons' strengths.</Paragraph> </Section> </Section>
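A minimal sketch of the two retrieval rules, assuming the posterior alignment matrix φ for one sentence-pair is available as a J x I array with French words on the rows: UDA is the row-wise argmax, and only the row-and-column-maximum core of BDA is shown, without the neighborhood-growing heuristic described above. The function names are illustrative assumptions.

import numpy as np

def retrieve_uda(phi_align):
    """Uni-direction alignment: for each French position j, pick argmax_i phi[j, i]."""
    return np.argmax(phi_align, axis=1)

def retrieve_bda_core(phi_align):
    """Simplified bi-direction criterion: keep (j, i) only when phi[j, i] is the
    maximum of both its row and its column; the paper's BDA additionally grows
    this set with a neighborhood heuristic, which is omitted here."""
    row_best = np.argmax(phi_align, axis=1)   # best English word per French word
    col_best = np.argmax(phi_align, axis=0)   # best French word per English word
    return [(j, int(i)) for j, i in enumerate(row_best) if col_best[i] == j]
</Paper>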