<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0804"> <Title>Bilingual Word Spectral Clustering for Statistical Machine Translation</Title> <Section position="3" start_page="25" end_page="26" type="metho"> <SectionTitle> 2 Statistical Machine Translation </SectionTitle> <Paragraph position="0"> The task of translation is to translate one sentence in some source language F into a target language E.</Paragraph> <Paragraph position="1"> For example, given a French sentence with J words denoted as fJ1 = f1f2...fJ, an SMT system automatically translates it into an English sentence with I words denoted by eI1 = e1e2...eI. The SMT system first proposes multiple English hypotheses in its model space. Among all the hypotheses, the system selects the one with the highest conditional probability according to Bayes's decision rule:</Paragraph> <Paragraph position="3"> where P(fJ1 |eI1) is called translation model, and P(eI1) is called language model. The translation model is the key component, which is the focus in this paper.</Paragraph> <Section position="1" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.1 HMM-based Translation Model </SectionTitle> <Paragraph position="0"> HMM is one of the effective translation models (Vogel et al., 1996), which is easily scalable to very large training corpus.</Paragraph> <Paragraph position="1"> To model word-to-word translation, we introduce the mapping j - aj, which assigns a French word fj in position j to a English word ei in position i = aj denoted as eaj. Each French word fj is an observation, and it is generated by a HMM state defined as [eaj,aj], where the alignment aj for position j is considered to have a dependency on the previous alignment aj[?]1. Thus the first-order HMM is defined as follows:</Paragraph> <Paragraph position="3"> where P(aj|aj[?]1) is the transition probability. This model captures the assumption that words close in the source sentence are aligned to words close in the target sentence. An additional pseudo word of &quot;NULL&quot; is used as the beginning of English sentence for HMM to start with. The (Och and Ney, 2003) model includes other refinements such as special treatment of a jump to a Null word, and a uniform smoothing prior. The HMM with these refinements is used as our baseline. Motivated by the work in both (Och and Ney, 2000) and (Toutanova et al., 2002), we propose the two following simplest versions of extended HMMs to utilize bilingual word clusters.</Paragraph> </Section> <Section position="2" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 2.2 Extensions to HMM with word clusters </SectionTitle> <Paragraph position="0"> Let F denote the cluster mapping fj - F(fj), which assigns French word fj to its cluster ID Fj = F(fj).</Paragraph> <Paragraph position="1"> Similarly E maps English word ei to its cluster ID of Ei = E(ei). In this paper, we assume each word belongs to one cluster only.</Paragraph> <Paragraph position="2"> With bilingual word clusters, we can extend the HMM model in Eqn. 1 in the following two ways:</Paragraph> <Paragraph position="4"> where E(eaj[?]1) and F(fj[?]1) are non overlapping word clusters (Eaj[?]1,Fj[?]1)for English and French respectively.</Paragraph> <Paragraph position="5"> Another explicit way of utilizing bilingual word clusters can be considered as a two-stream HMM as follows:</Paragraph> <Paragraph position="7"> This model introduces the translation of bilingual word clusters directly as an extra factor to Eqn. 
<Paragraph position="8"> Intuitively, the role of this factor is to boost the translation probabilities of words sharing the same concept. This is a more expressive model because it models translation equivalence at both the word level and the cluster level. Also, compared with the model in Eqn. 3, this model is easier to train, as it uses a two-dimensional table instead of a four-dimensional table.</Paragraph>
<Paragraph position="10"> However, we do not want $P(F_j \mid E_{a_j})$ to dominate the HMM transition structure and the observation probability $P(f_j \mid e_{a_j})$ during the EM iterations. Thus a uniform prior $P(F_j) = 1/|F|$ is introduced as a smoothing factor for $P(F_j \mid E_{a_j})$:</Paragraph>
<Paragraph position="12"> $$\tilde{P}(F_j \mid E_{a_j}) = \lambda\, P(F_j \mid E_{a_j}) + (1-\lambda)\, P(F_j), \qquad (5)$$ where $|F|$ is the total number of word clusters in French (we use the same number of clusters for both languages). $\lambda$ can be chosen to optimize performance on a development set; in our case, we fix it to 0.5 in all our experiments.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="26" end_page="28" type="metho">
<SectionTitle> 3 Bilingual Word Clustering </SectionTitle>
<Paragraph position="0"> In bilingual word clustering, the task is to build word clusterings F and E that form partitions of the vocabularies of the two languages. The two partitions are intended to be suitable for machine translation in the sense that the cluster-level translation equivalence is reliable and focused enough to handle data sparseness, and that the translation model using these clusters explains the parallel corpus $\{(f_1^J, e_1^I)\}$ better in terms of perplexity or joint likelihood.</Paragraph>
<Section position="1" start_page="26" end_page="26" type="sub_section">
<SectionTitle> 3.1 From Monolingual to Bilingual </SectionTitle>
<Paragraph position="0"> To infer bilingual word clusters (F, E), one can optimize the joint probability of the parallel corpus $\{(f_1^J, e_1^I)\}$ given the clusters as follows:</Paragraph>
<Paragraph position="2"> $$(\hat{F}, \hat{E}) = \arg\max_{F,E} P(f_1^J, e_1^I \mid F, E) = \arg\max_{F,E} P(e_1^I \mid E)\, P(f_1^J \mid e_1^I, F, E). \qquad (6)$$ Eqn. 6 separates the optimization process into two parts: the monolingual part for E, and the bilingual part for F given a fixed E. The monolingual part is treated as a prior probability $P(e_1^I \mid E)$, and E can be inferred from corpus bigram statistics as follows:</Paragraph>
<Paragraph position="4"> $$\hat{E} = \arg\max_{E} \prod_{i=1}^{I} P(e_i \mid E_i)\, P(E_i \mid E_{i-1}). \qquad (7)$$ We need to fix the number of clusters beforehand; otherwise the optimum is reached when each word is a class of its own. There exists an efficient leave-one-out style algorithm (Kneser and Ney, 1993) that can automatically determine the number of clusters.</Paragraph>
<Paragraph position="5"> For the bilingual part $P(f_1^J \mid e_1^I, F, E)$, we can slightly modify the same algorithm as in (Kneser and Ney, 1993). Given the word alignment $\{a_1^J\}$ between $f_1^J$ and $e_1^I$, collected from the Viterbi path of the HMM-based translation model, we can infer $\hat{F}$ as follows:</Paragraph>
<Paragraph position="7"> $$\hat{F} = \arg\max_{F} \prod_{j=1}^{J} P(f_j \mid F_j)\, P(F_j \mid E_{a_j}). \qquad (8)$$ Overall, this bilingual word clustering algorithm is essentially a two-step approach: in the first step, E is inferred by optimizing the monolingual likelihood of the English data; in the second step, F is inferred by optimizing the bilingual part without changing E. In this way, the algorithm is easy to implement with little change from its monolingual counterpart.</Paragraph>
<Paragraph position="8"> This approach was shown to give the best results in (Och, 1999). We use it as our baseline for comparison.</Paragraph>
</Section>
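To make the bilingual step concrete, the sketch below scores a candidate French clustering F under a simple relative-frequency version of Eqn. 8, given fixed English clusters and Viterbi alignment links. The data structures and the function name are illustrative assumptions, not the published implementation; an exchange-style search in the spirit of (Kneser and Ney, 1993) would then move one French word at a time to whichever cluster increases this score, until no move helps.

```python
import math
from collections import defaultdict

def bilingual_loglik(aligned_pairs, f_cluster, e_cluster):
    """Log-likelihood of the bilingual part (Eqn. 8) for one assignment F:
    sum over Viterbi links (f, e) of log P(F(f) | E(e)) + log P(f | F(f)),
    with both factors estimated as relative frequencies from the same links.
    aligned_pairs: list of (french_word, english_word) links (hypothetical format).
    f_cluster / e_cluster: word -> cluster-ID maps for French / English."""
    pair_n  = defaultdict(int)   # (F, E) cluster-pair counts
    eclus_n = defaultdict(int)   # English-cluster counts
    word_n  = defaultdict(int)   # (f, F) word-in-cluster counts
    fclus_n = defaultdict(int)   # French-cluster counts
    for f, e in aligned_pairs:
        F, E = f_cluster[f], e_cluster[e]
        pair_n[(F, E)] += 1
        eclus_n[E] += 1
        word_n[(f, F)] += 1
        fclus_n[F] += 1
    loglik = 0.0
    for f, e in aligned_pairs:
        F, E = f_cluster[f], e_cluster[e]
        loglik += math.log(pair_n[(F, E)] / eclus_n[E])   # log P(F | E)
        loglik += math.log(word_n[(f, F)] / fclus_n[F])   # log P(f | F)
    return loglik
```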
<Section position="2" start_page="26" end_page="28" type="sub_section">
<SectionTitle> 3.2 Bilingual Word Spectral Clustering </SectionTitle>
<Paragraph position="0"> Instead of using word alignment to bridge the parallel sentence pairs and optimizing the likelihood in two separate steps, we develop an alignment-free algorithm using a variant of spectral clustering. The goal is to obtain high cluster-level translation quality suitable for translation modelling, while at the same time maintaining high intra-cluster similarity and low inter-cluster similarity for the monolingual clusters.</Paragraph>
<Paragraph position="1"> We define $V_F$ as the French vocabulary, with size $|V_F|$, and $V_E$ as the English vocabulary, with size $|V_E|$. A co-occurrence matrix $C_{F,E}$ is built with $|V_F|$ rows and $|V_E|$ columns; each element is the co-occurrence count of the corresponding French word $f_j$ and English word $e_i$. In this way, each French word forms a row vector of dimension $|V_E|$, in which each dimension corresponds to a co-occurring English word and each element is a co-occurrence count. We can also view each column as a vector for an English word, with a similar interpretation.</Paragraph>
<Paragraph position="2"> With $C_{F,E}$, we can infer two affinity matrices as follows:</Paragraph>
<Paragraph position="4"> $$A_E = C_{F,E}^{T} C_{F,E}, \qquad A_F = C_{F,E}\, C_{F,E}^{T},$$ where $A_E$ is a $|V_E| \times |V_E|$ affinity matrix for English words, with rows and columns representing English words and each element being the inner product between the column vectors of two English words. Correspondingly, $A_F$ is an affinity matrix of size $|V_F| \times |V_F|$ for French words, defined analogously. Both $A_E$ and $A_F$ are symmetric and non-negative. We can now compute the eigenstructure of both $A_E$ and $A_F$. In fact, their eigenvectors are, respectively, the right and left sub-spaces of the original co-occurrence matrix $C_{F,E}$.</Paragraph>
<Paragraph position="5"> This can be computed using singular value decomposition (SVD): $C_{F,E} = U S V^{T}$, $A_E = V S^{2} V^{T}$, and $A_F = U S^{2} U^{T}$, where U is the left sub-space and V the right sub-space of the co-occurrence matrix $C_{F,E}$, and S is a diagonal matrix with the singular values ranked from large to small along the diagonal.</Paragraph>
<Paragraph position="6"> Obviously, the left sub-space U is the eigenstructure of $A_F$, and the right sub-space V is the eigenstructure of $A_E$.</Paragraph>
<Paragraph position="7"> By choosing the top k singular values (the square roots of the eigenvalues of both $A_E$ and $A_F$), the sub-spaces are reduced to $U_{|V_F| \times k}$ and $V_{|V_E| \times k}$ respectively. Based on these subspaces, we can carry out K-means or other clustering algorithms to infer word clusters for both languages. Our algorithm goes as follows:
* Initialize the bilingual co-occurrence matrix $C_{F,E}$ with rows representing French words and columns English words; $C_{ji}$ is the raw co-occurrence count of French word $f_j$ and English word $e_i$;
* Form the affinity matrices $A_E = C_{F,E}^{T} C_{F,E}$ and $A_F = C_{F,E}\, C_{F,E}^{T}$. Kernels can also be applied here, e.g. an exponential kernel with a suitably chosen scaling parameter. Normalize each row to unit length;
* Compute the eigenstructure of the normalized matrix $A_E$ and find its k largest eigenvectors $v_1, v_2, \ldots, v_k$; similarly, find the k largest eigenvectors $u_1, u_2, \ldots, u_k$ of $A_F$;
* Stack the eigenvectors $v_1, v_2, \ldots, v_k$ in the columns of $Y_E$, and stack the eigenvectors $u_1, u_2, \ldots, u_k$ in the columns of $Y_F$; normalize the rows of both $Y_E$ and $Y_F$ to unit length. $Y_E$ is of size $|V_E| \times k$ and $Y_F$ is of size $|V_F| \times k$;
* Treat each row of $Y_E$ as a point in $R^{k}$ and cluster the rows into K English word clusters using K-means; likewise, treat each row of $Y_F$ as a point in $R^{k}$ and cluster the rows into K French word clusters;
* Finally, assign the original word $e_i$ to cluster $E_k$ if row i of the matrix $Y_E$ is clustered into $E_k$; French words are assigned similarly.</Paragraph>
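A compact way to picture the whole procedure is the sketch below, which collapses the two eigen-decompositions into a single SVD of $C_{F,E}$ (as noted above, U and V span the eigenstructures of $A_F$ and $A_E$) and then row-normalizes the reduced subspaces before K-means. It is an illustrative reconstruction under stated assumptions, not the authors' code: numpy and scikit-learn are assumed available, the function name is invented, and scikit-learn's default k-means++ initialization is used instead of the farthest-point initialization described below.

```python
import numpy as np
from sklearn.cluster import KMeans  # any K-means implementation would do

def bilingual_spectral_clusters(C, k, K, seed=0):
    """Illustrative sketch of the algorithm above.
    C: |V_F| x |V_E| co-occurrence count matrix (numpy array).
    k: number of leading singular vectors kept for the subspaces.
    K: number of word clusters per language.
    Returns (french_labels, english_labels)."""
    # SVD of the co-occurrence matrix: C = U S V^T.
    # U spans the eigenstructure of A_F = C C^T, V that of A_E = C^T C.
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    Y_F = U[:, :k]        # |V_F| x k subspace for French words
    Y_E = Vt[:k, :].T     # |V_E| x k subspace for English words
    # Normalize each row to unit length before clustering.
    Y_F = Y_F / np.maximum(np.linalg.norm(Y_F, axis=1, keepdims=True), 1e-12)
    Y_E = Y_E / np.maximum(np.linalg.norm(Y_E, axis=1, keepdims=True), 1e-12)
    # K-means in the reduced subspaces yields the two partitions.
    french_labels  = KMeans(n_clusters=K, random_state=seed).fit_predict(Y_F)
    english_labels = KMeans(n_clusters=K, random_state=seed).fit_predict(Y_E)
    return french_labels, english_labels
```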
<Paragraph position="11"> Here $A_E$ and $A_F$ are affinity matrices of pair-wise inner products between the monolingual words: the more similar two words are, the larger the value. In our implementation, we did not apply a kernel function as in the algorithm of (Ng et al., 2001). However, a kernel function such as the exponential function mentioned above can be applied here to control how rapidly the similarity falls off, using a carefully chosen scaling parameter.</Paragraph>
<Paragraph position="12"> The above algorithm is very close to the variants of the large family of spectral clustering algorithms introduced in (Meila and Shi, 2000) and studied in (Ng et al., 2001). Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with high intra-cluster similarity and low inter-cluster similarity. It has been shown to compute the k-way normalized cut $K - \mathrm{tr}(Y^{T} D^{-1/2} A D^{-1/2} Y)$ for any matrix $Y \in R^{M \times N}$, where A is the affinity matrix, D is the diagonal matrix of the row sums of A, and Y in our algorithm corresponds to the subspaces U and V.</Paragraph>
<Paragraph position="13"> Experimentally, it has been observed that using more eigenvectors and directly computing a k-way partitioning usually gives better performance. In our implementation, we used the top 500 eigenvectors to construct the subspaces U and V for K-means clustering.</Paragraph>
<Paragraph position="14"> The K-means step can be considered a post-processing step in our proposed bilingual word clustering. For the initial centroids, we first compute the center of the whole data set. The point farthest from the center is chosen as the first initial centroid; after that, the remaining K-1 centroids are chosen one by one so as to be well separated from all previously chosen centroids.</Paragraph>
<Paragraph position="15"> The stopping criterion is: if the maximal change of the cluster centroids between two iterations is less than a threshold of 1e-3, the clustering algorithm stops.</Paragraph>
</Section>
</Section>
</Paper>