<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1065"> <Title>Improved Discriminative Bilingual Word Alignment</Title> <Section position="5" start_page="513" end_page="514" type="metho"> <SectionTitle> 3 Stage 1 Model </SectionTitle> <Paragraph position="0"> In our previous stage 1 model, we used five features. The most informative feature was the sum of bilingual word-association scores for all linked word pairs, computed as a log likelihood ratio. We used two features to measure the degree of non-monotonicity of alignments, based on traversing the alignment in the order of the source sentence tokens and noting the instances where the corresponding target sentence tokens were not in left-to-right order. One feature counted the number of times there was a backwards jump in the order of the target sentence tokens, and the other summed the magnitudes of these jumps. In order to model the trade-off between one-to-one and many-to-one alignments, we included a feature that counted the number of alignment links such that one of the linked words participated in another link. Our fifth feature was the count of the number of words in the sentence pair left unaligned.</Paragraph> <Paragraph position="1"> In addition to these five features, we employed two hard constraints. One constraint was that the only alignment patterns allowed were 1-1, 1-2, 1-3, 2-1, and 3-1. Thus, many-to-many link patterns were disallowed, and a single word could be linked to at most three other words. The second constraint was that a possible link was considered only if it involved the strongest degree of association within the sentence pair for at least one of the words to be linked. If both words had stronger associations with other words in the sentence pair, then the link was disallowed.</Paragraph> <Paragraph position="2"> Our new stage 1 model includes all the features we used previously, plus the constraint on alignment patterns. The constraint involving strongest association is not used. In addition, our new stage 1 model employs the following features: association score rank features We define the rank of an association with respect to a word in a sentence pair to be the number of association types (word-type to word-type) for that word that have higher association scores, such that words of both types occur in the sentence pair. The constraint on strength of association we previously used can be stated as a requirement that no link be considered unless the corresponding association is of rank 0 for at least one of the words. We replace this hard constraint with two features based on association rank. One feature sums the association ranks with respect to both words involved in each link. The second feature sums the minimum of the association ranks with respect to the two words involved in each link. For alignments that obey the previous hard constraint, the value of this second feature would always be 0.</Paragraph>
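To make the rank definition concrete, here is a minimal sketch of the two rank features. The representation (an alignment as a list of (source, target) index pairs, plus a `score` function over word pairs) is ours for illustration, not the authors' implementation:

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[str, str], float]  # bilingual word-association score (e.g., LLR)

def rank_features(links: Sequence[Tuple[int, int]],
                  src: List[str], tgt: List[str],
                  score: Score) -> Tuple[int, int]:
    """Sum-of-ranks and sum-of-minimum-ranks features for one alignment.

    The rank of a link's association with respect to one of its words is
    the number of word types in the sentence pair that have a strictly
    higher association score with that word.
    """
    sum_ranks, sum_min_ranks = 0, 0
    for i, j in links:
        s, t = src[i], tgt[j]
        r_s = sum(1 for t2 in set(tgt) if score(s, t2) > score(s, t))
        r_t = sum(1 for s2 in set(src) if score(s2, t) > score(s, t))
        sum_ranks += r_s + r_t
        sum_min_ranks += min(r_s, r_t)
    return sum_ranks, sum_min_ranks
```

An alignment satisfying the old hard constraint has `min(r_s, r_t) == 0` for every link, so its sum-of-minimum-ranks feature is 0, as noted above.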
<Paragraph position="3"> jump distance difference feature In our original models, the only features relating to word order were those measuring nonmonotonicity. The likelihoods of various forward jump distances were not modeled. If alignments are dense enough, measuring nonmonotonicity gets at this indirectly; if every word is aligned, it is impossible to have large forward jumps without correspondingly large backwards jumps, because something has to link to the words that are jumped over. If word alignments are sparse, however, due to free translation, it is possible to have alignments with very different forward jumps but the same backwards jumps. To differentiate such alignments, we introduce a feature that sums the differences between the distance between consecutive aligned source words and the distance between the closest target words they are aligned to.</Paragraph> <Paragraph position="4"> many-to-one jump distance features It seems intuitive that the likelihood of a large forward jump on either the source or target side of an alignment is much less if the jump is between words that are both linked to the same word of the other language. This motivates the distinction between the d1 and d>1 parameters in IBM Models 4 and 5.</Paragraph> <Paragraph position="5"> We model this by including two features. One feature sums, for each word w, the number of words not linked to w that fall between the first and last words linked to w. The other feature counts only such words that are linked to some word other than w. The intuition here is that it is not so bad to have a function word that is linked to nothing between two words linked to the same word. (Both of these word-order features are sketched in code at the end of this section.)</Paragraph> <Paragraph position="6"> exact match feature We have a feature that sums the number of words linked to identical words. This is motivated by the fact that proper names or specialized terms are often the same in both languages, and we want to take advantage of this to link such words even when they are too rare to have a high association score.</Paragraph> <Paragraph position="7"> lexical features Taskar et al. (2005) gain considerable benefit by including features counting the links between particular high-frequency words. They use 25 such features, covering all pairs of the five most frequent non-punctuation words in each language. We adopt this type of feature but do so more aggressively. We include features for all bilingual word pairs that have at least two co-occurrences in the labeled training data. In addition, we include features counting the number of unlinked occurrences of each word having at least two occurrences in the labeled training data.</Paragraph> <Paragraph position="9"> In training our new stage 1 model, we were concerned that using so many lexical features might result in overfitting to the training data. To try to prevent this, we train the stage 1 model by first optimizing the weights for all other features, then optimizing the weights for the lexical features, with the other weights held fixed at the optimum values found without lexical features.</Paragraph>
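As promised above, a minimal sketch of the jump distance difference feature and the two many-to-one jump distance features, under the same illustrative representation as before. The absolute-difference convention in the first function is our assumption; the text does not spell out the sign handling:

```python
from typing import Sequence, Tuple

def jump_diff_feature(links: Sequence[Tuple[int, int]]) -> int:
    """For each pair of consecutive aligned source words, sums the
    difference between their source distance and the distance between
    the closest target words they are linked to."""
    by_src = {}
    for i, j in links:
        by_src.setdefault(i, []).append(j)
    positions = sorted(by_src)
    total = 0
    for a, b in zip(positions, positions[1:]):
        src_jump = b - a
        tgt_jump = min(abs(j2 - j1) for j1 in by_src[a] for j2 in by_src[b])
        total += abs(src_jump - tgt_jump)
    return total

def many_to_one_span_features(links: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Two counts, summed over words w on both sides: (a) words not linked
    to w that fall between the first and last words linked to w, and (b)
    the subset of those that are linked to some word other than w."""
    feat_all, feat_other = 0, 0
    for side in (0, 1):               # 0: w is a source word, 1: a target word
        linked_to = {}
        for link in links:
            w, other = link[side], link[1 - side]
            linked_to.setdefault(w, set()).add(other)
        aligned = {p for ps in linked_to.values() for p in ps}
        for others in linked_to.values():
            if len(others) < 2:
                continue
            between = set(range(min(others) + 1, max(others))) - others
            feat_all += len(between)
            feat_other += len(between & aligned)  # linked, hence to a word != w
    return feat_all, feat_other
```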
</Section> <Section position="6" start_page="514" end_page="515" type="metho"> <SectionTitle> 4 Stage 2 Model </SectionTitle> <Paragraph position="0"> In our original stage 2 model, we replaced the log-likelihood-based word association statistic with the logarithm of the estimated conditional probability of a cluster of words being linked by the stage 1 model, given that they co-occur in a pair of aligned sentences, computed over the full (500,000 sentence pairs) training data. We estimated these probabilities using a discounted maximum likelihood estimate, in which a small fixed amount was subtracted from each link count:</Paragraph> <Paragraph position="1"> $LP_{d}(w_1,\ldots,w_k) = \frac{\mathrm{links}_1(w_1,\ldots,w_k) - d}{\mathrm{cooc}(w_1,\ldots,w_k)}$ </Paragraph> <Paragraph position="2"> LPd(w1,...,wk) represents the estimated conditional link probability for the cluster of words w1,...,wk; links1(w1,...,wk) is the number of times they are linked by the stage 1 model; d is the discount; and cooc(w1,...,wk) is the number of times they co-occur. We found that d = 0.4 seemed to minimize training set AER.</Paragraph> <Paragraph position="3"> An important difference between our stage 1 and stage 2 models is that the stage 1 model considers each word-to-word link separately, but allows multiple links per word, as long as they lead to an alignment consisting only of one-to-one and one-to-many links (in either direction). The stage 2 model, however, uses conditional probabilities for both one-to-one and one-to-many clusters, but requires all clusters to be disjoint. Our original stage 2 model incorporated the same additional features as our original stage 1 model, except that the feature that counts the number of links involved in non-one-to-one link clusters was omitted.</Paragraph> <Paragraph position="4"> Our new stage 2 model differs in a number of ways from the original version. First, we replace the estimated conditional probability of a cluster of words being linked with the estimated conditional odds of a cluster of words being linked:</Paragraph> <Paragraph position="5"> $LO(w_1,\ldots,w_k) = \frac{\mathrm{links}_1(w_1,\ldots,w_k) + 1}{\mathrm{cooc}(w_1,\ldots,w_k) - \mathrm{links}_1(w_1,\ldots,w_k) + 1}$ </Paragraph> <Paragraph position="6"> LO(w1,...,wk) represents the estimated conditional link odds for the cluster of words w1,...,wk. Note that we use &quot;add-one&quot; smoothing in place of a discount.</Paragraph>
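A sketch of how both statistics could be computed from stage 1 alignment counts; the count tables and function name are illustrative, not from the paper:

```python
import math
from typing import Dict, Tuple

Cluster = Tuple[str, ...]  # a one-to-one or one-to-many cluster of words

def link_scores(links1: Dict[Cluster, int], cooc: Dict[Cluster, int],
                d: float = 0.4) -> Tuple[Dict[Cluster, float], Dict[Cluster, float]]:
    """Log of the discounted conditional link probability LP_d (original
    stage 2 model) and log of the add-one smoothed conditional link odds
    LO (new stage 2 model), for every cluster seen co-occurring."""
    log_prob, log_odds = {}, {}
    for cluster, n_cooc in cooc.items():
        n_links = links1.get(cluster, 0)
        lp = (n_links - d) / n_cooc          # LP_d; non-positive for clusters
        if lp > 0:                           # never linked by stage 1
            log_prob[cluster] = math.log(lp)
        lo = (n_links + 1) / (n_cooc - n_links + 1)  # LO, add-one smoothed
        log_odds[cluster] = math.log(lo)
    return log_prob, log_odds
```

Note that the odds form stays finite even when a cluster is linked every time it co-occurs, one practical advantage of add-one smoothing over a discount.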
<Paragraph position="7"> Additional features in our new stage 2 model include the unaligned word feature used previously, plus the following features: symmetrized nonmonotonicity feature We symmetrize the previous nonmonotonicity feature that sums the magnitudes of backwards jumps, by averaging the sum of the backwards jumps in the target sentence order relative to the source sentence order with the sum of the backwards jumps in the source sentence order relative to the target sentence order. We omit the feature that counts the number of backwards jumps.</Paragraph> <Paragraph position="8"> multi-link feature This feature counts the number of link clusters that are not one-to-one. This enables us to model whether the link scores for these clusters are more or less reliable than the link scores for one-to-one clusters.</Paragraph> <Paragraph position="9"> empirically parameterized jump distance feature We take advantage of the stage 1 alignment to incorporate a feature measuring the jump distances between alignment links that is more sophisticated than simply measuring the difference in source and target distances, as in our stage 1 model. We measure the (signed) source and target distances between all pairs of links in the stage 1 alignment of the full training data. From this, we estimate the odds of each possible target distance given the corresponding source distance:</Paragraph> <Paragraph position="10"> $JO(\Delta t \mid \Delta s) = \frac{\mathrm{count}(\Delta t, \Delta s) + 1}{\mathrm{count}(\Delta s) - \mathrm{count}(\Delta t, \Delta s) + 1}$ </Paragraph> <Paragraph position="11"> We similarly estimate the odds of each possible source distance given the corresponding target distance. The feature values consist of the sum of the scaled log odds of the jumps between consecutive links in a hypothesized alignment, computed in both source sentence and target sentence order. This feature is applied only when both the source and target jump distances are non-zero, so that it applies only to jumps between clusters, not to jumps on the &quot;many&quot; side of a many-to-one cluster. We found it necessary to linearly scale these feature values in order to get good results (in terms of training set AER) when using perceptron training.1 We found empirically that we could get good results in terms of training set AER by dividing each log odds estimate by the largest absolute value of any such estimate computed.</Paragraph> 1 Note that this is purely for effective training, since after training, one could adjust the feature weights according to the scale factor and use the original feature values. </Section> <Section position="7" start_page="515" end_page="516" type="metho"> <SectionTitle> 5 Perceptron Training </SectionTitle> <Paragraph position="0"> We optimize feature weights using a modification of averaged perceptron learning as described by Collins (2002). Given an initial set of feature weight values, the algorithm iterates through the labeled training data multiple times, comparing, for each sentence pair, the best alignment ahyp according to the current model with the reference alignment aref. At each sentence pair, the weight for each feature is incremented by a multiple of the difference between the value of the feature for the reference alignment and its value for the best alignment according to the model: $\lambda_i \leftarrow \lambda_i + \epsilon\,(f_i(a_{\mathrm{ref}},e,f) - f_i(a_{\mathrm{hyp}},e,f))$ The updated feature weights are used to compute ahyp for the next sentence pair. The multiplier $\epsilon$ is called the learning rate. In the averaged perceptron, the feature weights for the final model are the average of the weight values over all the data, rather than simply the values after the final sentence pair of the final iteration.</Paragraph>
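A minimal sketch of one training pass under this update rule, with the per-pass averaging the authors use (helper names such as `decode` and `features` are hypothetical):

```python
import random
from typing import Callable, List, Tuple

def perceptron_pass(train: List[Tuple[object, object, object]],
                    weights: List[float], lr: float,
                    decode: Callable, features: Callable,
                    fixed: int = 0) -> List[float]:
    """One randomized averaged-perceptron pass over (e, f, a_ref) triples.
    `decode(weights, e, f)` returns the best alignment under the current
    weights; `features(a, e, f)` returns a feature vector. weights[fixed]
    (the most informative feature) is held at 1.0 and never updated."""
    totals = [0.0] * len(weights)
    random.shuffle(train)                 # data order is randomized per pass
    for e, f, a_ref in train:
        a_hyp = decode(weights, e, f)
        f_ref, f_hyp = features(a_ref, e, f), features(a_hyp, e, f)
        for i in range(len(weights)):
            if i != fixed:
                weights[i] += lr * (f_ref[i] - f_hyp[i])
        totals = [t + w for t, w in zip(totals, weights)]
    return [t / len(train) for t in totals]   # averaged over this pass
```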
<Paragraph position="2"> Differences between our approach and Collins's include averaging feature weights over each pass through the data, rather than over all passes; randomizing the order of the data for each learning pass; and performing an evaluation pass after each learning pass, with feature weights fixed to their average values for the preceding learning pass, during which training set AER is measured.</Paragraph> <Paragraph position="3"> This procedure is iterated until a local minimum on training set AER is found.</Paragraph> <Paragraph position="4"> We initialize the weight of the anticipated most informative feature (word-association scores in stage 1; conditional link probabilities or odds in stage 2) to 1.0, with the other feature weights initialized to 0. The weight for the most informative feature is not updated. Allowing all weights to vary would permit many equivalent sets of weights that differ only by a constant scale factor; fixing one weight eliminates a spurious apparent degree of freedom.</Paragraph> <Paragraph position="5"> Previously, we set the learning rate $\epsilon$ differently in training our stage 1 and stage 2 models. For the stage 2 model, we used a single learning rate of 0.01. For the stage 1 model, we used a sequence of learning rates: 1000, 100, 10, and 1.0. At each transition between learning rates, we re-initialized the feature weights to the optimum values found with the previous learning rate.</Paragraph> <Paragraph position="6"> In our current work, we make a number of modifications to this procedure. We reset the feature weights to the best averaged values we have yet seen at the beginning of each learning pass through the data. Anecdotally, this seems to result in faster convergence to a local AER minimum. We also use multiple learning rates for both the stage 1 and stage 2 models, setting the learning rates automatically. The initial learning rate is the maximum absolute value (for one word pair/cluster) of the word association, link probability, or link odds feature, divided by the number of labeled training sentence pairs. Since many of the feature values are simple counts, this allows a minimal difference of 1 in a feature value, if repeated in every training example, to give a count feature as large a weighted value as the most informative feature after a single pass through the data.</Paragraph> <Paragraph position="7"> After the learning search terminates for a given learning rate, we reduce the learning rate by a factor of 10, and iterate until we judge that we are at a local minimum for this learning rate. We continue with progressively smaller learning rates until an entire pass through the data produces feature weights that differ so little from their values at the beginning of the pass that the training set AER does not change.</Paragraph> <Paragraph position="8"> Two final modifications are inspired by the realization that the results of perceptron training are very sensitive to the order in which the data is presented. Since we randomize the order of the data on every pass, if we make a pass through the training data and the training set AER increases, it may be that we simply encountered an unfortunate ordering of the data. Therefore, when training set AER increases, we retry two additional times with the same initial weights, but different random orderings of the data, before giving up and trying a smaller learning rate. Finally, we repeat the entire training process multiple times, and average the feature weights resulting from each of these runs.</Paragraph> <Paragraph position="9"> We currently use 10 runs of each model. This final averaging is inspired by the idea of &quot;Bayes-point machines&quot; (Herbrich and Graepel, 2001).</Paragraph> </Section>
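The overall control loop could look like the following sketch. Helper names are hypothetical, and the stopping test is simplified to a learning-rate floor, whereas the paper stops when a pass leaves the weights (and hence training-set AER) essentially unchanged:

```python
def train_with_schedule(train, init_weights, init_lr, run_pass, train_aer,
                        min_lr=1e-6):
    """Learning-rate schedule with random-ordering retries, as described
    above. `run_pass(w, lr)` runs one randomized averaged-perceptron pass
    and returns the averaged weights; `train_aer(w)` measures training-set
    AER. `init_lr` is the largest absolute single-cluster value of the most
    informative feature divided by the number of labeled sentence pairs."""
    best_w = list(init_weights)
    best_err = train_aer(best_w)
    lr = init_lr
    while lr > min_lr:                      # progressively smaller rates
        failures = 0
        while failures < 3:                 # two retries with fresh data orders
            w = run_pass(list(best_w), lr)  # always restart from best weights
            err = train_aer(w)
            if err < best_err:
                best_w, best_err, failures = w, err, 0
            else:
                failures += 1               # perhaps just an unlucky ordering
        lr /= 10.0                          # give up at this rate; reduce by 10
    return best_w
```

The whole procedure would then be repeated (10 runs in the paper) and the resulting weight vectors averaged, in the spirit of Bayes-point machines.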
<Section position="8" start_page="516" end_page="516" type="metho"> <SectionTitle> 6 SVM Training </SectionTitle> <Paragraph position="0"> After extensive experiments with perceptron training, we wanted to see if we could improve the results obtained with our best stage 2 model by using a more sophisticated training method. Perceptron training has been shown to obtain good results for some problems, but occasionally very poor results are reported, notably by Taskar et al. (2005) for the word-alignment problem. We adopted the support vector machine (SVM) method for structured output spaces of Tsochantaridis et al. (2005), using Joachims' SVMstruct package.</Paragraph> <Paragraph position="1"> Like standard SVM learning, this method tries to find the hyperplane that separates the training examples with the largest margin. Despite a very large number of possible output labels (e.g., all possible alignments of a given pair of sentences), the optimal hyperplane can be efficiently approximated given the desired error rate, using a cutting plane algorithm. In each iteration of the algorithm, it adds the &quot;best&quot; incorrect predictions given the current model as constraints, and optimizes the weight vector subject only to them.</Paragraph> <Paragraph position="2"> The main advantage of this algorithm is that it does not impose special restrictions on the output structure, as long as &quot;decoding&quot; can be done efficiently. This is crucial for us because several features we found very effective in this task are difficult to incorporate into structured learning methods that require decomposable features. This method also allows a variety of loss functions, but we use only simple 0-1 loss, which in our case means whether or not the alignment of a sentence pair is completely correct, since this worked as well as anything else we tried.</Paragraph> <Paragraph position="3"> Our SVM method has a number of free parameters, which we tried tuning in two different ways. One way is minimizing training set AER, which is how we chose the stopping points in perceptron training. The other is five-fold cross-validation. In this method, we train five times on 80% of the training data and test on the other 20%, with five disjoint subsets used for testing. The parameter values yielding the best averaged AER on the five test subsets of the training set are used to train the final model on the entire training set.</Paragraph> </Section>
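A minimal sketch of the cross-validation tuning just described (the `fit`, `aer`, and `param_grid` names are illustrative):

```python
from typing import Callable, Iterable, List

def five_fold_cv(train: List, param_grid: Iterable,
                 fit: Callable, aer: Callable):
    """For each parameter setting, train on 80% of the labeled data and
    test on the held-out 20%, over five disjoint test subsets; the setting
    with the best mean AER is used to train the final model on all data.
    `fit(data, params)` returns a model; `aer(model, data)` returns its AER."""
    n = len(train)
    folds = [train[i * n // 5:(i + 1) * n // 5] for i in range(5)]
    best_params, best_score = None, float("inf")
    for params in param_grid:
        scores = []
        for k in range(5):
            held_out = folds[k]
            rest = [x for i, f in enumerate(folds) if i != k for x in f]
            scores.append(aer(fit(rest, params), held_out))
        mean = sum(scores) / len(scores)
        if mean < best_score:
            best_params, best_score = params, mean
    return fit(train, best_params)   # final model on the entire training set
```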
<Section position="9" start_page="516" end_page="517" type="metho"> <SectionTitle> 7 Evaluation </SectionTitle> <Paragraph position="0"> We used the same training and test data as in our previous work, a subset of the Canadian Hansards bilingual corpus supplied for the bilingual word alignment workshop held at HLT-NAACL 2003 (Mihalcea and Pedersen, 2003). This subset comprised 500,000 English-French sentence pairs, including 224 manually word-aligned sentence pairs for labeled training data, and 223 labeled sentence pairs as test data. Automatic sentence alignment of the training data was provided by Ulrich Germann, and the hand alignments of the labeled data were created by Franz Och and Hermann Ney (Och and Ney, 2003).</Paragraph> <Paragraph position="1"> For baselines, Table 1 shows the test set results we previously reported, along with results for IBM Model 4, trained with Och's Giza++ software package, using the default configuration file (Och and Ney, 2003).2 &quot;Prev LLR&quot; is our earlier stage 1 model, and CLP1 and CLP2 are two versions of our earlier stage 2 model. For CLP1, conditional link probabilities were estimated from the alignments produced by our &quot;Prev LLR&quot; model, and for CLP2, they were obtained from a yet earlier, heuristic alignment model. Results for IBM Model 4 are reported for models trained in both directions, English-to-French and French-to-English, and for the union, intersection, and what Och and Ney (2003) call the &quot;refined&quot; combination of those two alignments.</Paragraph> <Paragraph position="2"> Results for our new stage 1 model are presented in Table 2. The first line is for the model described in Section 3, optimizing non-lexical features before lexical features. The second line gives results for optimizing all features simultaneously. The next line omits lexical features entirely. The last line is for our original stage 1 model, but trained using our improved perceptron training method.</Paragraph> <Paragraph position="3"> As we can see, our best stage 1 model reduces the error rate of our previous stage 1 model by almost half. Comparing the first two lines shows that two-phase training of non-lexical and lexical features produces a 0.7% reduction in test set error. Although the purpose of the two-phase training was to mitigate overfitting to the training data, we also found training set AER was reduced (7.3% vs. 8.8%). Taken all together, the results show a 7.9% total reduction in error rate: 4.0% from new non-lexical features, 3.3% from lexical features with two-phase training, and 0.6% from other improvements in perceptron training.</Paragraph> <Paragraph position="5"> Table 3 presents results for perceptron training of our new stage 2 model. The first line is for the model as described in Section 4. Since the use of log odds is somewhat unusual, in the second line we show results for a similar model, but using log probabilities instead of log odds for both the link model and the jump model. This result is 0.2% worse than the log-odds-based model, but the difference is small enough to warrant testing its significance. Comparing the errors on each test sentence pair with a 2-tailed paired t test, the results were suggestive, but not significant (p = 0.28). The third line of Table 3 shows results for our earlier CLP1 model with probabilities estimated from our new stage 1 model alignments (&quot;new A&quot;), using our recent modifications to perceptron training (&quot;new T&quot;). These results are significantly worse than either of the two preceding models (p < 0.0008). The fourth line is for the same model and stage 1 alignments, but with our earlier perceptron training method. While the results are 0.3% worse than with our new training method, the difference is not significant (p = 0.62).</Paragraph> <Paragraph position="6"> Table 4 shows the results of SVM training of the model that was best under perceptron training, tuning free parameters either by minimizing error on the entire training set or by 5-fold cross-validation on the training set. The cross-validation method produced slightly lower test-set AER, but both results rounded to 4.7%. While these results are somewhat better than with perceptron training, the differences are not significant (p ≈ 0.47).</Paragraph> </Section>
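All of these comparisons use alignment error rate. For reference, a minimal implementation of the standard definition from Och and Ney (2003), in which S is the set of sure reference links and P the set of possible reference links (S ⊆ P):

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source_index, target_index)

def alignment_error_rate(hyp: Set[Link], sure: Set[Link],
                         possible: Set[Link]) -> float:
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), per Och and Ney (2003).
    Assumes `sure` is a subset of `possible`."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```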
<Section position="10" start_page="517" end_page="518" type="metho"> <SectionTitle> 8 Comparisons to Other Work </SectionTitle> <Paragraph position="0"> At the time we carried out the experiments described above, our sub-5% AER results were the best we were aware of for word alignment of Canadian Hansards bilingual data, although direct comparisons are problematic due to differences in total training data, labeled training data, and test data. The best previously reported result was by Och and Ney (2003), who obtained 5.2% AER for a combination including all the IBM models except Model 2, plus the HMM model and their Model 6, together with a bilingual dictionary, for the refined alignment combination, trained on three times as much data as we used.</Paragraph> <Paragraph position="1"> Cherry and Lin's (2003) method obtained an AER of 5.7% as reported by Mihalcea and Pedersen (2003), the previous lowest reported error rate for a method that makes no use of the IBM models. Cherry and Lin's method is similar to ours in using explicit estimates of the probability of a link given the co-occurrence of the linked words; but it is generative rather than discriminative, it requires a parser for the English side of the corpus, and it does not model many-to-one links. Taskar et al. (2005) reported 5.4% AER for a discriminative model that includes predictions from the intersection of IBM Model 4 alignments as a feature. Their best result without using information from the IBM models was 10.7% AER.</Paragraph> <Paragraph position="2"> After completing the experiments described in Section 7, we became aware of further developments in the line of research reported by Taskar et al. (Lacoste-Julien et al., 2006). By modifying their previous approach to allow many-to-one alignments and first-order interactions between alignments, Lacoste-Julien et al. have improved their best AER without using information from the more complex IBM models to 6.2%. Their best result, however, is obtained from a model that includes both a feature recording intersected IBM Model 4 predictions and a feature whose values are the alignment probabilities obtained from a pair of HMM alignment models trained in both directions in such a way that they agree on the alignment probabilities (Liang et al., 2006). With this model, they obtained a much lower 3.8% AER.</Paragraph> <Paragraph position="4"> Lacoste-Julien very graciously provided both the IBM Model 4 predictions and the probabilities estimated by the bidirectional HMM models that they had used to compute these additional feature values. We then added features based on this information to see how much we could improve our best model. We also eliminated one other difference between our results and those of Lacoste-Julien et al. by training on all 1.1 million English-French sentence pairs from the 2003 word alignment workshop, rather than the 500,000 sentence pairs we had been using.</Paragraph> <Paragraph position="5"> Since all our other feature values derived from probabilities are expressed as log odds, we also converted the HMM probabilities estimated by Liang et al. to log odds. To make this well defined in all cases, we thresholded high probabilities (including 1.0) at 0.999999, and low probabilities (including 0.0) at 0.1 (which we found produced lower training set error than using a very small non-zero probability, although we have not searched systematically for the optimal value).</Paragraph>
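The conversion with the thresholds just described is straightforward; a sketch, with the function name ours:

```python
import math

def prob_to_log_odds(p: float, high: float = 0.999999, low: float = 0.1) -> float:
    """Convert an HMM alignment probability to log odds, clamping extreme
    values so the result is defined even for p = 0.0 or p = 1.0. The low
    threshold of 0.1 is the empirically chosen value reported above."""
    p = min(max(p, low), high)
    return math.log(p / (1.0 - p))
```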
<Paragraph position="6"> In our latest experiments, we first established that simply increasing the unlabeled training data to 1.1 million sentence pairs made very little difference, reducing the test-set AER of our stage 2 model under perceptron training only from 4.9% to 4.8%. Combining our stage 2 model features with the HMM log odds feature using SVM training with 5-fold cross-validation yielded a substantial reduction in test-set AER, to 3.9% (96.9% precision, 95.1% recall). We found it somewhat difficult to improve these results further by including the IBM Model 4 intersection feature. We finally obtained our best results, however, for both training-set and test-set AER, by holding the stage 2 model feature weights at the values obtained by SVM training with the HMM log odds feature, and optimizing the HMM log odds feature weight and the IBM Model 4 intersection feature weight with perceptron training.3 This produced a test-set AER of 3.7% (96.9% precision, 95.5% recall).</Paragraph> </Section> </Paper>