<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2221"> <Title>Modeling with Structures in Statistical Machine Translation</Title> <Section position="4" start_page="0" end_page="1358" type="metho"> <SectionTitle> 2 Word-based Alignment Model </SectionTitle> <Paragraph position="0"> In a word-based alignment translation model, the transformation from a sentence at the source end of a communication channel to a sentence at the target end can be described with the following random process: 1. Pick a length for the sentence at the target end.</Paragraph> <Paragraph position="1"> 2. For each word position in the target sentence, align it with a source word. 3. Produce a word at each target word po- null sition according to the source word with which the target word position has been aligned.</Paragraph> <Paragraph position="2"> IBM Alignment Model 2 is a typical example of word-based alignment. Assuming a sentence s = Sl,...,st at the source of a channel, the model picks a length m of the target sentence t according to the distribution P(m I s) = e, where e is a small, fixed number. Then for each position i (0 < i _< m) in t, it finds its corresponding position ai in s according to an alignment distribution P(ai l i, a~ -1, m, s) = a(ai l i, re, l). Finally, it generates a word ti at the position i of t from the source word s~, at the aligned position ai, according to a translation z 1 m distribution P(ti \] t~- , a 1 , s) -- t(ti I s~,). Alignment Model 2, the bottom one is the 'ideal' alignment. fiter der zweiten Terrain im Mai koennte ich den Mittwoch den fuenf und zwanzigsten anbieten 1 could offer ~ou Wednesday the twenty fifth for the second date in May fuer der zweiten Termin im Mai koennte ich den Mittwoch den fuenf und zwanzigsten anbieten I could offer you Wednesday the twenty fifth for the second date in May Therefore, P(t\]s) is the sum of the probabilities of generating t from s over all possible alignments A, in which the position i in t is aligned with the position ai in s:</Paragraph> <Paragraph position="4"> A word-based model may have severe problems when there are deletions in translation (this may be a result of erroneous sentence alignment) or the two languages have different word orders, like English and German. Figure 1 and Figure 2 show some problematic alignments between English/German sentences made by IBM Model 2, together with the 'ideal' alignments for the sentences. Here the alignment parameters penalize the alignment of English words with their German translation equivalents because the translation equivalents are far away from the words.</Paragraph> <Paragraph position="5"> An experiment reveals how often this kind of &quot;skewed&quot; alignment happens in our English/German scheduling conversation parallel corpus (Wang and Waibel, 1997). The experiment was based on the following observation: IBM translation Model 1 (where the alignment distribution is uniform) and Model 2 found similar Viterbi alignments when there were no movements or deletions, and they predicted very different Viterbi alignments when the skewness was severe ill a sentence pair, since the alignment parameters in Model 2 penalize the long distance alignment. Figure 3 shows the Viterbi alignment discovered by Model 1 for the same sentences in Figure 21 .</Paragraph> <Paragraph position="6"> We measured the distance of a Model 1 alignment a 1 and a Model 2 alignment a z ~--,Igl la ~ _ a2\]. 
<Paragraph position="4"> A word-based model may have severe problems when there are deletions in translation (these may result from erroneous sentence alignment) or when the two languages have different word orders, as English and German do. Figure 1 and Figure 2 show some problematic alignments between English/German sentences made by IBM Model 2, together with the 'ideal' alignments for the same sentences. Here the alignment parameters penalize the alignment of English words with their German translation equivalents because those equivalents are far away from the words.</Paragraph> <Paragraph position="5"> An experiment reveals how often this kind of "skewed" alignment occurs in our English/German scheduling conversation parallel corpus (Wang and Waibel, 1997). The experiment was based on the following observation: IBM translation Model 1 (where the alignment distribution is uniform) and Model 2 find similar Viterbi alignments when there are no movements or deletions, and they predict very different Viterbi alignments when the skewness in a sentence pair is severe, since the alignment parameters in Model 2 penalize long-distance alignments. Figure 3 shows the Viterbi alignment discovered by Model 1 for the same sentences as in Figure 2. (The better alignment on a given pair of sentences does not mean that Model 1 is a better model; a non-uniform alignment distribution is desirable, since otherwise the language model would be the only factor determining the source sentence word order in decoding.) We measured the distance between a Model 1 alignment a^1 and a Model 2 alignment a^2 as Σ_i |a_i^1 − a_i^2|, summed over all target word positions i.</Paragraph> <Paragraph position="6"> To estimate the skewness of the corpus, we collected statistics on the percentage of sentence pairs (with at least five words per sentence) whose Model 1 and Model 2 alignment distance is greater than 1/4, 2/4, 3/4, ..., 10/4 of the target sentence length. By checking the Viterbi alignments made by both models, it is almost certain that whenever the distance is greater than 3/4 of the target sentence length, there is either a movement or a deletion in the sentence pair. Figure 4 plots this statistic: around 30% of the sentence pairs in our training data show some degree of skewness in their alignments.</Paragraph> </Section> <Section position="5" start_page="1358" end_page="1359" type="metho"> <SectionTitle> 3 Structure-based Alignment Model </SectionTitle> <Paragraph position="0"> To solve the problems of the word-based alignment models, we present a structure-based alignment model. The idea is to model phrase movement directly with a rough alignment, and then model the word alignment within phrases with a detailed alignment.</Paragraph> <Paragraph position="1"> Given an English sentence e = e_1 e_2 ... e_l, its German translation g = g_1 g_2 ... g_m can be generated by the following process (an illustrative code sketch follows the enumeration):
1. Parse e into a sequence of phrases, e = E_0 E_1 E_2 ... E_n, where E_0 is a null phrase.
2. With probability P(q | e, E), determine q ≤ n + 1, the number of phrases in g. Let G_1 ... G_q denote these q phrases. Each source phrase can be aligned with at most one target phrase. Unlike English phrases, the words in a German phrase do not have to form a consecutive sequence, so g may be expressed as something like g = g_11 g_12 g_21 g_13 g_22 ..., where g_ij represents the j-th word of the i-th phrase.
3. For each German phrase G_i, 1 ≤ i ≤ q, align it with an English phrase E_{r_i} with probability P(r_i | i, r_0^{i-1}, E, e).
4. For each German phrase G_i, 1 ≤ i ≤ q, determine its beginning position b_i in g with the distribution P(b_i | i, b_0^{i-1}, r_0^q, e, E).
5. Now the individual words in the German phrases are generated through the detailed alignment, which works like IBM Model 4. For each word e_ij in the phrase E_i, its fertility φ_ij has the distribution P(φ_ij | i, j, φ_i1^{j-1}, φ_0^{i-1}, b_0^q, r_0^q, e, E).
6. Each word e_ij in the phrase E_i generates a tablet τ_ij = {τ_ij1, τ_ij2, ..., τ_ijφ_ij} by producing each of its words in turn, with probability P(τ_ijk | τ_ij1^{k-1}, τ_i^{j-1}, τ_0^{i-1}, φ_0^i, b_0^q, r_0^q, e, E) for the k-th word in the tablet.
7. For each element τ_ijk in the tablet τ_ij, the permutation π_ijk determines its position in the target sentence according to the distribution P(π_ijk | π_ij1^{k-1}, π_i1^{j-1}, π_0^{i-1}, τ_0^i, φ_0^i, b_0^q, r_0^q, e, E).</Paragraph>
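As a rough illustration of steps 1-7 (an editorial sketch, not the authors' implementation), the Python fragment below forward-samples a German sentence from a parsed English sentence. Here `dists` is a hypothetical object bundling sampling functions for the distributions named above, and the conditioning of each distribution is simplified.

def generate_translation(e_phrases, dists):
    """Schematic forward sampler for the generative process above. `e_phrases` is a
    list of English phrases E_0 .. E_n (each a list of words), with E_0 the null phrase.
    `dists` bundles hypothetical sampling functions standing in for P(q | .), P(r_i | .),
    P(b_i | .) and the fertility, tablet and permutation distributions."""
    n = len(e_phrases) - 1
    q = dists.num_target_phrases(n)              # step 2: q <= n + 1 German phrases
    used, r, b = set(), {}, {}
    for i in range(1, q + 1):
        r[i] = dists.align_phrase(i, used)       # step 3: G_i is aligned with E_{r_i}
        used.add(r[i])                           # each English phrase aligned at most once
        b[i] = dists.begin_position(i, b.get(i - 1, 0), r)   # step 4: start of G_i in g
    placed = {}                                  # target position -> German word
    for i in range(1, q + 1):
        for j, e_word in enumerate(e_phrases[r[i]], start=1):
            phi = dists.fertility(e_word)        # step 5: number of German words for e_ij
            tablet = [dists.translate(e_word) for _ in range(phi)]    # step 6: tablet
            for k, g_word in enumerate(tablet, start=1):
                pos = dists.permute(i, j, k, b[i], placed)            # step 7: position
                placed[pos] = g_word
    return [placed[p] for p in sorted(placed)]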
<Paragraph position="2"> We made the following independence assumptions:
1. The number of target sentence phrases depends only on the number of phrases in the source sentence: P(q | e, E) = p_n(q | n).
2. P(r_i | i, r_0^{i-1}, E, e) = a(r_i | i) Π_{0≤j<i} (1 − δ(r_i, r_j)). This assumption states that P(r_i | i, r_0^{i-1}, E, e) depends on i and r_i; it also depends on r_0^{i-1} through the factor Π_{0≤j<i} (1 − δ(r_i, r_j)), which ensures that each English phrase is aligned with at most one German phrase.
3. The beginning position of a target phrase depends on its distance from the beginning position of the preceding phrase, as well as on the length of the source phrase aligned with the preceding phrase: P(b_i | i, b_0^{i-1}, r_0^q, e, E) = α(b_i − b_{i-1} | |E_{r_{i-1}}|).
4. The fertility and translation tablet of a source word depend on the word only: P(φ_ij | i, j, φ_i1^{j-1}, φ_0^{i-1}, b_0^q, r_0^q, e, E) = φ(φ_ij | e_ij) and P(τ_ijk | τ_ij1^{k-1}, τ_i^{j-1}, τ_0^{i-1}, φ_0^i, b_0^q, r_0^q, e, E) = τ(τ_ijk | e_ij).
5. The leftmost position of the translations of a source word depends on its distance from the beginning of the target phrase aligned with the source phrase containing that word; it also depends on the identity of that source phrase and on the position of the source word within it: P(π_ij1 | π_i1^{j-1}, π_0^{i-1}, τ_0^i, φ_0^i, b_0^q, r_0^q, e, E) = d_1(π_ij1 − b_i | E_i, j).
6. For a target word τ_ijk other than the leftmost τ_ij1 in the translation tablet of the source word e_ij, its position depends on its distance from the position of the tablet word τ_ij(k−1) closest to its left, on the class of the target word τ_ijk, and on the fertility of the source word e_ij: P(π_ijk | π_ij1^{k-1}, π_i1^{j-1}, π_0^{i-1}, τ_0^i, φ_0^i, b_0^q, r_0^q, e, E) = d_2(π_ijk − π_ij(k−1) | G(τ_ijk), φ_ij), where G(g) is the equivalence class of g.</Paragraph> <Section position="1" start_page="1359" end_page="1359" type="sub_section"> <SectionTitle> 3.1 Parameter Estimation </SectionTitle> <Paragraph position="0"> The EM algorithm was used to estimate the seven types of parameters: p_n, a, α, φ, τ, d_1 and d_2. We used only a subset of probable alignments in EM training, since the total number of alignments is exponential in the target sentence length. The subset consisted of the neighboring alignments (Brown et al., 1993) of the Viterbi alignments discovered by Model 1 and Model 2. We chose to include the Model 1 Viterbi alignment because the Model 1 alignment is closer to the "ideal" one when strong skewness exists in a sentence pair.</Paragraph> </Section> </Section> <Section position="6" start_page="1359" end_page="1360" type="metho"> <SectionTitle> 4 Finding the Structures </SectionTitle> <Paragraph position="0"> The structure-based alignment model would be of little interest if we had to find the language structures manually and write a grammar for them, since the primary merit of statistical machine translation is to reduce human labor. In this section we introduce a grammar inference technique that finds the phrases used in the structure-based alignment model. It is based on the work in (Ries, Buø, and Wang, 1995), where the following two operators are used:
Clustering: cluster words/phrases with similar meanings or grammatical functions into equivalence classes. The mutual information clustering algorithm (Brown et al., 1992) was used for this.
Phrasing: the equivalence class sequence c_1, c_2, ..., c_k forms a phrase if P(c_1, c_2, ..., c_k) log [ P(c_1, c_2, ..., c_k) / (P(c_1) P(c_2) ... P(c_k)) ] > θ, where θ is a threshold. By changing the threshold, we obtain different numbers of phrases.</Paragraph> <Paragraph position="1"> The two operators are applied iteratively to the training corpus in alternating steps. This results in hierarchical phrases in the form of sequences of equivalence classes of words/phrases.</Paragraph>
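The phrasing criterion above can be computed from class n-gram relative frequencies. The sketch below (an editorial illustration, not the original implementation) scores candidate class k-grams against a threshold theta, assuming the corpus has already been mapped to word-class labels.

from collections import Counter
from math import log

def phrase_candidates(class_corpus, k, theta):
    """Return class k-grams whose score P(c_1..c_k) * log[ P(c_1..c_k) / prod_i P(c_i) ]
    exceeds theta. `class_corpus` is a list of sentences, each a list of word-class
    labels (data layout assumed for illustration)."""
    unigrams, kgrams = Counter(), Counter()
    n_uni = n_k = 0
    for sent in class_corpus:
        unigrams.update(sent)
        n_uni += len(sent)
        grams = [tuple(sent[i:i + k]) for i in range(len(sent) - k + 1)]
        kgrams.update(grams)
        n_k += len(grams)
    if n_k == 0:
        return []
    candidates = []
    for gram, count in kgrams.items():
        p_joint = count / n_k                    # relative frequency of the class sequence
        p_indep = 1.0
        for c in gram:
            p_indep *= unigrams[c] / n_uni       # product of unigram class probabilities
        score = p_joint * log(p_joint / p_indep)
        if score > theta:
            candidates.append((gram, score))
    return sorted(candidates, key=lambda x: -x[1])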
<Paragraph position="2"> Since the algorithm uses only a monolingual corpus, it often introduces language-specific structures that result from biased usage within one language. In machine translation we are more interested in cross-linguistic structures, much as an interlingua is used to represent cross-linguistic information in knowledge-based MT.</Paragraph> <Paragraph position="3"> To obtain structures that are common to both languages, a bilingual mutual information clustering algorithm (Wang, Lafferty, and Waibel, 1996) was used as the clustering operator; it takes constraints from the parallel corpus. We also introduced an additional constraint in clustering, requiring that words in the same class share at least one potential part of speech.</Paragraph> <Paragraph position="4"> Bilingual constraints are also imposed on the phrasing operator. We used bilingual heuristics to filter out sequences acquired by the phrasing operator that may not be common to both languages. The heuristics include:
Average Translation Span: Given a phrase candidate, its average translation span is the distance between the leftmost and the rightmost target positions aligned with the words inside the candidate, averaged over the Model 1 Viterbi alignments of sample sentences. A candidate is filtered out if its average translation span is greater than the length of the candidate multiplied by a threshold. This criterion states that the words in the translation of a phrase have to be close enough together to form a phrase in the other language.
Ambiguity Reduction: A word occurring in a phrase should be less ambiguous than in an arbitrary context, so a phrase should reduce the ambiguity (uncertainty) of the words inside it. For each source language word class c, its translation entropy is defined as −Σ_g t(g | c) log t(g | c). The average per-class entropy reduction induced by the introduction of a phrase P is the difference between the translation entropy of each class in P computed outside and inside the phrase context, averaged over the classes in P. A threshold was set for the minimum entropy reduction.</Paragraph> <Paragraph position="5"> By applying the clustering operator followed by the phrasing operator, we obtained shallow phrase structures, partly shown in Figure 5. Given a set of phrases, we can deterministically parse a sentence into a sequence of phrases by repeatedly replacing the leftmost unparsed substring with the longest matching phrase in the set (a minimal sketch of this greedy segmentation is given below).</Paragraph> </Section> </Paper>
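As an editorial illustration of the deterministic segmentation just described (not code from the paper), the sketch below greedily replaces the leftmost unparsed words with the longest phrase found in a given phrase set; the phrase set is assumed to be a container of word (or word-class) tuples, and max_len is an assumed bound on phrase length.

def parse_into_phrases(sentence, phrase_set, max_len=6):
    """Greedy left-to-right segmentation: at each position take the longest
    matching phrase from `phrase_set`, falling back to a single word."""
    phrases, i = [], 0
    while i < len(sentence):
        match = (sentence[i],)                     # fall back to a one-word phrase
        for k in range(min(max_len, len(sentence) - i), 1, -1):
            cand = tuple(sentence[i:i + k])
            if cand in phrase_set:
                match = cand                       # longest match wins
                break
        phrases.append(match)
        i += len(match)
    return phrases

# Example with a hypothetical phrase set:
# parse_into_phrases("I could offer you Wednesday the twenty fifth".split(),
#                    {("could", "offer"), ("the", "twenty", "fifth")})
# -> [('I',), ('could', 'offer'), ('you',), ('Wednesday',), ('the', 'twenty', 'fifth')]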