<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1006"> <Title>Improved Word Alignment Using a Symmetric Lexicon Model</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Statistical Word Alignment Models </SectionTitle>
<Paragraph position="0"> In this section, we will give a short description of the commonly used statistical word alignment models. These alignment models stem from the source-channel approach to statistical machine translation (Brown et al., 1993). We are given a source language sentence f_1^J := f_1 ... f_j ... f_J which has to be translated into a target language sentence e_1^I := e_1 ... e_i ... e_I.</Paragraph>
<Paragraph position="1"> Among all possible target language sentences, we will choose the sentence with the highest probability:</Paragraph>
<Paragraph position="2"> ê_1^I = argmax_{e_1^I} { Pr(e_1^I) · Pr(f_1^J | e_1^I) }</Paragraph>
<Paragraph position="3"> This decomposition into two knowledge sources allows for an independent modeling of the target language model Pr(e_1^I) and the translation model Pr(f_1^J | e_1^I). Into the translation model, the word alignment A is introduced as a hidden variable:</Paragraph>
<Paragraph position="4"> Pr(f_1^J | e_1^I) = Σ_A Pr(f_1^J, A | e_1^I)</Paragraph>
<Paragraph position="5"> Usually, we use restricted alignments in the sense that each source word is aligned to at most one target word, i.e. A = a_1^J. A detailed description of the popular translation models IBM-1 to IBM-5 (Brown et al., 1993), as well as the Hidden Markov alignment model (HMM) (Vogel et al., 1996), can be found in (Och and Ney, 2003). All these models include parameters p(f|e) for the single-word based lexicon. They differ in the alignment model.</Paragraph>
<Paragraph position="6"> A Viterbi alignment Â of a specific model is an alignment for which the following equation holds: Â = argmax_A Pr(f_1^J, A | e_1^I). We measure the quality of an alignment model using the quality of the Viterbi alignment compared to a manually produced reference alignment. In Section 3, we will apply the lexicon symmetrization methods to the models described previously. Therefore, we will now sketch the standard training procedure for the lexicon model. The EM algorithm is used to train the free lexicon parameters p(f|e).</Paragraph>
<Paragraph position="7"> In the E-step, the lexical counts for each sentence pair (f_1^J, e_1^I) are calculated and then summed over all sentence pairs in the training corpus:</Paragraph>
<Paragraph position="9"/>
</Section>
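To make the E-step and M-step above concrete, the following Python fragment sketches one EM iteration for an IBM-1 style lexicon, where the alignment posterior of a source word factorizes over the target positions. It is a minimal illustration only: the function name, the corpus representation as tokenized sentence pairs, and the uniform initialization are assumptions for the example and are not taken from the paper.

```python
from collections import defaultdict

def em_iteration(corpus, p):
    """One EM iteration for an IBM-1 style lexicon p(f|e).

    corpus: list of (source_words, target_words) sentence pairs
    p:      dict mapping (f, e) -> current lexicon probability p(f|e)
    Returns a dict with the re-normalized lexicon for the next iteration.
    """
    counts = defaultdict(float)   # lexical counts N(f, e)
    totals = defaultdict(float)   # marginal counts N(e) = sum_f N(f, e)

    # E-step: accumulate expected lexical counts over all sentence pairs.
    for source, target in corpus:
        for f in source:
            # Posterior of aligning f to each target word e
            # (IBM-1: proportional to p(f|e) under a uniform alignment prior).
            norm = sum(p.get((f, e), 0.0) for e in target)
            if norm == 0.0:
                continue
            for e in target:
                c = p.get((f, e), 0.0) / norm
                counts[(f, e)] += c
                totals[e] += c

    # M-step: relative frequencies give the new estimates of p(f|e).
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}


# Toy usage: start from a uniform lexicon over co-occurring word pairs.
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
pairs = {(f, e) for src, tgt in corpus for f in src for e in tgt}
p = {pair: 1.0 / len(pairs) for pair in pairs}
for _ in range(5):
    p = em_iteration(corpus, p)
```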
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Symmetrized Lexicon Model </SectionTitle>
<Paragraph position="0"> During the standard training procedure, the lexicon parameters p(f|e) and p(e|f) were estimated independently of each other in strictly separate trainings. In this section, we present two symmetrization methods for the lexicon model. As a starting point, we use the joint lexicon probability p(f,e) and determine the conditional probabilities for the source-to-target direction p(f|e) and the target-to-source direction p(e|f) as the corresponding marginal distributions:</Paragraph>
<Paragraph position="1"> p(f|e) = p(f,e) / Σ_f' p(f',e)   (1)        p(e|f) = p(f,e) / Σ_e' p(f,e')   (2)</Paragraph>
<Paragraph position="2"> The nonsymmetric auxiliary Q-functions for reestimating the lexicon probabilities during the EM algorithm can be represented as follows. Here, N_ST(f,e) and N_TS(f,e) denote the lexicon counts for the source-to-target (ST) direction and the target-to-source (TS) direction, respectively.</Paragraph>
<Paragraph position="4"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Linear Interpolation </SectionTitle>
<Paragraph position="0"> To estimate the joint probability using the EM algorithm, we define the auxiliary Q-function as a linear interpolation of the Q-functions for the source-to-target and the target-to-source direction:</Paragraph>
<Paragraph position="2"> The unigram counts N(e) and N(f) are determined, for each of the two translation directions, by summing N(f,e) over f and over e, respectively. We define the combined lexicon count N_α(f,e):</Paragraph>
<Paragraph position="4"> Now, we take the derivative of the symmetrized Q-function with respect to p(f,e) for a certain word pair (f,e).</Paragraph>
<Paragraph position="5"> Then, we set this derivative to zero to determine the reestimation formula for p(f,e) and obtain the following equation:</Paragraph>
<Paragraph position="7"> We do not know a closed-form solution for this equation. As an approximation, we use the following term:</Paragraph>
<Paragraph position="9"> This estimate is an exact solution if the unigram counts for f and e are independent of the translation direction, i.e. N_ST(f) = N_TS(f) and N_ST(e) = N_TS(e). We make this approximation and thus interpolate the lexicon counts linearly after each iteration of the EM algorithm. Then, we normalize these counts (according to Equations 1 and 2) to determine the lexicon probabilities for each of the two translation directions.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Loglinear Interpolation </SectionTitle>
<Paragraph position="0"> We will show in Section 5 that the linear interpolation results in significant improvements over the nonsymmetric system. Motivated by these experiments, we also investigated the loglinear interpolation of the lexicon counts of the two translation directions. The combined lexicon count N_α(f,e) is now defined as:</Paragraph>
<Paragraph position="2"> The normalization is done in the same way as for the linear interpolation. The linear interpolation resembles more a union of the two lexica, whereas the loglinear interpolation is more similar to an intersection of both lexica. Thus, for the linear interpolation, a word pair (f,e) obtains a large combined count if the count in at least one direction is large. For the loglinear interpolation, the combined count is large only if both lexicon counts are large.</Paragraph>
<Paragraph position="3"> In the experiments, we will use the interpolation weight α = 0.5 for both the linear and the loglinear interpolation, i.e. both translation directions are weighted equally.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Evidence Trimming </SectionTitle>
<Paragraph position="0"> Initially, the lexicon contains all word pairs that cooccur in the bilingual training corpus.</Paragraph>
<Paragraph position="1"> The majority of these word pairs are not translations of each other. Therefore, we would like to remove those lexicon entries. Evidence trimming is one way to do this. The evidence of a word pair (f,e) is the estimated count N(f,e). Now, we discard a word pair if its evidence is below a certain threshold.¹ In the case of the symmetric lexicon, we can further refine this method. For estimating the lexicon in the source-to-target direction p̂(f|e), the idea is to keep all entries from this direction and to boost the entries that have a high evidence in the target-to-source direction N_TS(f,e). We obtain the following formula:</Paragraph>
<Paragraph position="3"> The count N̄_ST(f,e) is now used to estimate the source-to-target lexicon p̂(f|e). With this method, we do not keep entries in the source-to-target lexicon p̂(f|e) if their evidence is low, even if their evidence in the target-to-source direction N_TS(f,e) is high. For the target-to-source direction, we apply this method in a similar way.</Paragraph>
<Paragraph position="4"> ¹ Actually, there is always implicit evidence trimming caused by the limited machine precision.</Paragraph>
</Section>
</Section>
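The symmetrized training step of Section 3 can be sketched as follows. The sketch assumes the combined lexicon count takes the simple forms α·N_ST(f,e) + (1−α)·N_TS(f,e) (linear) and N_ST(f,e)^α · N_TS(f,e)^(1−α) (loglinear) with α = 0.5, and it implements only the basic threshold variant of evidence trimming, not the boosting refinement; all names and data structures are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def symmetrize_counts(n_st, n_ts, alpha=0.5, mode="linear", threshold=0.0):
    """Combine source-to-target and target-to-source lexicon counts.

    n_st, n_ts: dicts mapping (f, e) -> lexicon count from the two training
                directions (both assumed to be keyed as (f, e)).
    mode:       "linear"    -> alpha * N_ST + (1 - alpha) * N_TS
                "loglinear" -> N_ST ** alpha * N_TS ** (1 - alpha)
    threshold:  basic evidence trimming; word pairs whose combined count
                falls below this value are discarded.
    """
    combined = {}
    for pair in set(n_st) | set(n_ts):
        st, ts = n_st.get(pair, 0.0), n_ts.get(pair, 0.0)
        if mode == "linear":
            c = alpha * st + (1.0 - alpha) * ts
        else:  # loglinear: a zero count in either direction removes the entry
            c = (st ** alpha) * (ts ** (1.0 - alpha))
        if c > threshold:
            combined[pair] = c
    return combined

def normalize(combined):
    """Turn combined counts into p(f|e) and p(e|f) by marginal normalization."""
    sum_e = defaultdict(float)  # N(e) = sum_f N(f, e)
    sum_f = defaultdict(float)  # N(f) = sum_e N(f, e)
    for (f, e), c in combined.items():
        sum_e[e] += c
        sum_f[f] += c
    p_st = {(f, e): c / sum_e[e] for (f, e), c in combined.items()}  # p(f|e)
    p_ts = {(f, e): c / sum_f[f] for (f, e), c in combined.items()}  # p(e|f)
    return p_st, p_ts
```

With threshold = 0, the linear variant keeps the union of the two lexica, while the loglinear variant effectively keeps only their intersection, mirroring the discussion in Section 3.2.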
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Lexicon Smoothing </SectionTitle>
<Paragraph position="0"> The lexicon model described so far is based on full-form words. For highly inflected languages such as German this might cause problems, because many full-form words occur only a few times in the training corpus. Compared to English, the token/type ratio for German is usually much lower (e.g. Verbmobil: English 99.4, German 56.3). The information that multiple full-form words share the same base form is not used in the lexicon model. To take this information into account, we smooth the lexicon model with a backing-off lexicon that is based on word base forms. The smoothing method we apply is absolute discounting with interpolation:</Paragraph>
<Paragraph position="1"> p(f|e) = max{ N(f,e) − d, 0 } / N(e) + α(e) · β(f,ē)</Paragraph>
<Paragraph position="2"> This method is well known from language modeling (Ney et al., 1997). Here, ē denotes the generalization, i.e. the base form, of the word e. The nonnegative value d is the discounting parameter, α(e) is a normalization constant and β(f,ē) is the normalized backing-off distribution.</Paragraph>
<Paragraph position="3"> The formula for α(e) is:</Paragraph>
<Paragraph position="5"> This formula is a generalization of the one typically used in publications on language modeling. This generalization is necessary because the lexicon counts may be fractional, whereas in language modeling typically integer counts are used. Additionally, we want to allow for discounting values d greater than one. The backing-off distribution β(f,ē) is estimated using relative frequencies:</Paragraph>
<Paragraph position="6"> β(f,ē) = N(f,ē) / Σ_f' N(f',ē)</Paragraph>
<Paragraph position="7"> Here, N(f,ē) denotes the count of the event that the source language word f and the target language base form ē occur together. These counts are computed by summing the lexicon counts N(f,e) over all full-form words e which share the same base form ē.</Paragraph>
</Section> </Paper>