<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1013"> <Title>A Maximum Entropy Approach to Combining Word Alignments</Title> <Section position="3" start_page="0" end_page="96" type="metho"> <SectionTitle> 2 Maximum Entropy (ME) Models </SectionTitle> <Paragraph position="0"> In a statistical classification problem, the goal is to estimate the probability of a class y in a given context x, i.e., p(y|x). In an ideal scenario, if the training data contain evidence for all pairs of (y, x), it is trivial to compute the probability distribution p. Unfortunately, due to training-data sparsity, p is generally modeled using only the available evidence.</Paragraph> <Paragraph position="1"> Given a collection of facts, ME chooses a model consistent with all the facts, but otherwise as uniform as possible (Berger et al., 1996). Formally, the evidence is represented as feature functions, i.e., binary-valued functions that map a class y and a context x to either 0 or 1: $h : Y \times X \to \{0, 1\}$,</Paragraph> <Paragraph position="3"> where Y is the set of all classes and X is the set of all facts. The biggest advantage of maximum entropy models is that they let the modeler focus on the selection of feature functions rather than on how such functions are used. Any context can be used to define feature functions without concern for the independence of the feature functions from each other or the relevance of the feature functions to the final decision (Ratnaparkhi, 1998).</Paragraph> <Paragraph position="4"> Each feature function $h_m$ is associated with a model parameter $\lambda_m$. Given a set of M feature functions $h_1, \ldots, h_M$,</Paragraph> <Paragraph position="6"> the probability of class y given a context x is equal to: $$p(y|x) = \frac{1}{Z_x} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(y, x)\right)$$</Paragraph> <Paragraph position="8"> where $Z_x = \sum_{y' \in Y} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(y', x)\right)$ is a normalization constant. The contribution of each feature function to the final decision, i.e., $\lambda_m$, can be automatically computed using the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972). 
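As a minimal illustrative sketch (not the implementation used in this work), the conditional probability of a log-linear ME model can be evaluated as follows. The function names me_probabilities and classify, the two toy feature functions, and the hand-set weights are all hypothetical; in practice the weights would be estimated with GIS rather than chosen by hand.

```python
import math

def me_probabilities(weights, features, classes, context):
    """Return p(y|x) for every class y under a log-linear (ME) model.

    weights  -- list of lambda_m values (hypothetical, hand-set here)
    features -- list of binary feature functions h_m(y, x) returning 0 or 1
    """
    scores = {}
    for y in classes:
        total = sum(lam * h(y, context) for lam, h in zip(weights, features))
        scores[y] = math.exp(total)
    z = sum(scores.values())  # normalization constant Z_x
    return {y: s / z for y, s in scores.items()}

def classify(weights, features, classes, context):
    """Pick the class with the highest conditional probability."""
    probs = me_probabilities(weights, features, classes, context)
    return max(probs, key=probs.get)

# Toy setup (assumed for illustration): one binary context attribute "out1".
features = [
    lambda y, x: 1 if y == "yes" and x["out1"] else 0,
    lambda y, x: 1 if y == "no" and not x["out1"] else 0,
]
weights = [1.5, 0.5]
context = {"out1": True}
probs = me_probabilities(weights, features, ["yes", "no"], context)
```

Because the scores are normalized by Z_x, the returned values sum to one over the classes, and classify simply takes the arg max.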
The final classification for a given instance is the class y that maximizes p(y|x).</Paragraph> </Section> <Section position="4" start_page="96" end_page="98" type="metho"> <SectionTitle> 3 Alignment Combination: ACME </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> Let $e = e_1, \ldots, e_I$ and $f = f_1, \ldots, f_J$ be two sentences in two different languages. An alignment link (i, j) corresponds to a translational equivalence between words $e_i$ and $f_j$.</Paragraph> <Paragraph position="4"> Let $A_k$ be an alignment between sentences e and f, where each element $a \in A_k$ is an alignment link (i, j).</Paragraph> <Paragraph position="6"> Let $A = \{A_1, \ldots, A_n\}$ be a set of alignments between e and f. We refer to the true alignment as T, where each $a \in T$ is of the form (i, j). The goal of ACME is to combine the information in A such that the combined alignment $A_C$ is closer to T. A straightforward solution is to take the intersection or union of the individual alignments. In this paper, an additional model is learned to combine outputs of $A_1, \ldots, A_n$.</Paragraph> <Paragraph position="8"> word alignments between a given English sentence and a foreign-language (FL) sentence. Then a Feature Extractor takes the output of these alignment systems and the parallel corpus (which might be enriched with linguistic features) and extracts a set of feature functions based on linguistic properties of the words and the input alignments. Each feature function $h_m$ is associated with a model parameter $\lambda_m$.</Paragraph> <Paragraph position="10"> Next, an Alignment Combiner decides whether to include or exclude an alignment link based on the extracted feature functions and the model parameters associated with them.</Paragraph> <Paragraph position="11"> For each possible alignment link, a set of features is extracted from the input alignments and linguistic properties of words. The features that are used for representing an alignment link (i, j) are as follows: 1. 
Part-of-speech tags (posE, posF, prevposE, prevposF, nextposE, nextposF): POS tags for the previous, current, and the next English and FL words.</Paragraph> <Paragraph position="12"> 2. Outputs of input aligners (out): Whether (i, j) exists in a given input alignment $A_k$.</Paragraph> <Paragraph position="13"> 3. Neighbors (neigh): A neighborhood of an alignment link (i, j), denoted by N(i, j), consists of 8 possible alignment links in a 3x3 window with (i, j) in the center of the window. Each element of N(i, j) is called a neighboring link of (i, j). Neighbor features include: (1) Whether a particular neighbor of (i, j) exists in a given input alignment $A_k$; and (2) Total number of neighbors of (i, j) in a given input alignment $A_k$.</Paragraph> <Paragraph position="14"> 4. Fertilities (fertE, fertF): The number of words that $e_i$ (or $f_j$) is aligned to in a given input alignment $A_k$.</Paragraph> <Paragraph position="15"> 5. Monotonicity (mon): The absolute difference between i and j.</Paragraph> <Paragraph position="16"> Our combination approach employs feature functions derived from a subset of the features above. Assuming Y = {yes, no} represents the set of classes, where each class denotes the existence or absence of a link in the combined alignment, and X is the set of features above, we generate various feature functions h(y, x), where $y \in Y$ and x are instantiations of one or more features in X. Table 1 lists the feature sets with an example feature function</Paragraph> <Paragraph position="18"> For example, the feature function in the fifth row has a value of 1 if there are 2 neighboring links to (i, j) that exist in the input alignment A</Paragraph> <Paragraph position="20"> In combining evidence from different alignments, it is assumed that, when an alignment link is left out by all aligners, that particular link should not be included in the final output. 
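The alignment-based features listed above (out, neigh, fertE, fertF, mon) and the restriction to links proposed by at least one aligner can be sketched as follows. This is only an illustration under stated assumptions: alignments are represented as Python sets of (i, j) pairs, the POS features are omitted because they need a tagger, and the names neighbors, link_features, and candidate_links are hypothetical.

```python
def neighbors(i, j):
    """The 8 links surrounding (i, j) in a 3x3 window."""
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]

def link_features(i, j, alignments):
    """Alignment-based features of link (i, j); POS features are omitted."""
    feats = {"mon": abs(i - j)}  # monotonicity
    for k, a in enumerate(alignments, start=1):
        feats["out%d" % k] = (i, j) in a
        feats["neigh_total%d" % k] = sum(1 for n in neighbors(i, j) if n in a)
        feats["fertE%d" % k] = sum(1 for (p, q) in a if p == i)
        feats["fertF%d" % k] = sum(1 for (p, q) in a if q == j)
    return feats

def candidate_links(alignments):
    """Links proposed by at least one input aligner."""
    return set().union(*alignments)

# Two toy input alignments (assumed data).
a1 = {(1, 1), (2, 2), (3, 2)}
a2 = {(1, 1), (2, 3)}
feats = link_features(2, 2, [a1, a2])
cands = candidate_links([a1, a2])
```

Only the links in cands would be classified; every other word pair is left unaligned by construction.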
Since the majority of all possible word pairs are unaligned in real data, the inclusion of all possible word pairs in the training data leads to skewed results, where the learning algorithm is biased toward labeling the links as invalid. To offset this problem, our training data includes only alignment links that appear in at least one input alignment.</Paragraph> <Paragraph position="21"> Once the feature functions are extracted, we learn the model parameters using the YASMET ME package (Och, 2002), which is an efficient implementation of the GIS algorithm.</Paragraph> <Paragraph position="22"> configurations. We indicate the two directions using the notation Aligner(en → fl) and Aligner(fl → en), where en is English, fl is either Chinese (ch), Arabic (ar), or Romanian (ro).</Paragraph> <Paragraph position="23"> To train both systems, additional data was used for the three language pairs, including 107K English-Chinese sentence pairs. POS tags were generated using the MXPOST tagger (Ratnaparkhi, 1998). The POS tagger for English was trained on Sections 0-18 of the Penn Treebank Wall Street Journal corpus. On the FL side, we used a POS tagger only for Chinese, trained on Sections 16-299 of the Chinese Treebank.</Paragraph> <Paragraph position="24"> For comparison purposes, three additional heuristically-induced alignments are generated for each system: (1) Intersection of both directions (Aligner(int)); (2) Union of both directions (Aligner(union)); and (3) The previously best-known heuristic combination approach called grow-diag-final (Koehn et al., 2003) (Aligner(gdf)). In our evaluation, we take A to be the set of alignment links for a set of sentences, S to be the set of sure alignment links, and P to be the set of probable alignment links (in the gold standard). Precision (Pr), recall (Rc) and alignment error rate (AER) are defined as follows: $$Pr = \frac{|A \cap P|}{|A|}, \quad Rc = \frac{|A \cap S|}{|S|}, \quad AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$$ Note that both GIZA++ and SAHMM are unsupervised learning systems. 
Sentence-aligned parallel texts are the only required input.</Paragraph> <Paragraph position="25"> Note that AER = 1 - F-score when there is no distinction between probable and sure alignment links.</Paragraph> <Paragraph position="26"> Our gold standard for each language pair is a manually aligned corpus. English-Chinese annotations distinguish between sure and probable alignment links (i.e., $S \subset P$), but there is no such distinction for the other two language pairs (i.e., P = S). Because only limited manually annotated data is available, evaluations are performed using 5-fold cross-validation. Once the alignments are generated for each fold (using one fold as the test set and the other 4 folds as the training set), the results are concatenated to compute precision, recall and error rate on the entire set of sentence pairs for each data set.</Paragraph> </Section> <Section position="5" start_page="98" end_page="98" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> This section presents several experiments and results comparing the AER of ACME to that of standard alignment approaches on English-Chinese data. 
We also present experiments on additional languages, analyses based on precision and recall, an upper-bound oracle analysis, and MT evaluations.</Paragraph> <Section position="1" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 5.1 English-Chinese Experiments </SectionTitle> <Paragraph position="0"> The experiments below test the effects of input alignments, feature set, data partitioning, number of inputs, and size of training data on the performance of ACME.</Paragraph> <Paragraph position="1"> 2 Input alignments: Table 3 shows the AER for GIZA++ and SAHMM (in each direction), three heuristic-based combinations and ACME using 2 uni-directional alignments as input and all features described in Section 3.</Paragraph> <Paragraph position="2"> (We use 'ACME[2]' in this section to refer to ACME applied to two input alignments and ACME[4] in later sections to refer to ACME applied to four input alignments.) Using 2 GIZA++ uni-directional alignments as input, ACME yields a 22.0% AER--a relative error reduction of 25.9% over GIZA++(gdf). Similarly, using 2 SAHMM uni-directional alignments as input, ACME produces a 20.6% AER--a relative error reduction of 28.0% and 25.4% over SAHMM(gdf) and SAHMM(int), respectively.</Paragraph> <Paragraph position="3"> Because the NIST MTEval data include sentences that may be related (according to the document in which they appear), the training and test material could potentially be related; however, given the types of features used in our experiments, we do not believe this biases our results.</Paragraph> <Paragraph position="4"> For ease of readability, in the rest of this paper, we will report precision, recall, and AER in percentages. Feature Set: To examine the effects of each feature on the performance of ACME, we compute the AER under a variety of conditions, removing each feature one at a time. 
ACME is evaluated using</Paragraph> </Section> </Section> <Section position="6" start_page="98" end_page="101" type="metho"> <SectionTitle> 2 uni-directional GIZA++ alignments as input on </SectionTitle> <Paragraph position="0"> English-Chinese data. Using all features, the AER is 22.0%. Our experiments show that there is no significant increase in AER for the removal of features corresponding to monotonicity (22.1%), neighbors (22.8%), POS on English side (22.9%), POS on foreign-language side (22.9%). On the other hand, deleting POS tags on both sides yields an AER of 25.2% and deleting the fertility features increases the AER to 25.9%. This indicates that both POS tags (or fertilities) contribute heavily toward the decision as to whether a particular alignment should be included/excluded.</Paragraph> <Paragraph position="1"> Partitioning Data: Previous work showed that partitioning the data into disjoint subsets and learning a different model for each partition improves the performance of the alignment systems (Ayan et al., 2005). To test whether this same principle applies to alignment combination with maximum entropy modeling, the training data was partitioned using POS tags for English and the FL, and different weights were learned for each partition.</Paragraph> <Paragraph position="2"> in AER: 19.8% (GIZA++) and 18.0% (SAHMM).</Paragraph> <Paragraph position="3"> Interestingly, using foreign-language (FL) tags on their own or together with English POS tags does not provide any improvement. 
Overall, when ACME[2] is applied to partitioned data (using posE for partitioning), a relative error reduction of 33-37% over GIZA++(gdf) and SAHMM(gdf) is achieved.</Paragraph> <Paragraph position="4"> Number of Input Alignments: Table 5 presents the English-Chinese AER for ACME[1] (using either GIZA++ or SAHMM in only one direction), ACME[2] (using either GIZA++ or SAHMM in two directions) and ACME[4] (using GIZA++ and SAHMM, each in two directions).</Paragraph> <Paragraph position="5"> Regardless of the number of inputs, partitioning the data (using English POS tags) yields lower AER than no partitioning. Using one GIZA++ alignment as input, ACME[1] with partitioning improves the AER to 26.9% and 25.5% for each direction, respectively. Similarly, using one SAHMM alignment as input, ACME[1] with partitioning reduces the AER to 22.9% and 24.7%. ACME[2] with partitioning reduces the AER to 19.8% and 18.0% for GIZA++ and SAHMM, respectively. Finally, using all four input alignments, ACME[4] with partitioning yields a 15.6% AER--a relative error reduction of 21.2% and 13.3% over each ACME[2] case.</Paragraph> <Section position="1" start_page="99" end_page="99" type="sub_section"> <SectionTitle> Size of Training Data to Obtain Input Alignments </SectionTitle> <Paragraph position="0"> In general, statistical alignment systems improve as the size of the training data increases.</Paragraph> <Paragraph position="1"> We present the AER for GIZA++ and ACME[2] using GIZA++ alignments as input, where GIZA++ is trained on different sizes of data. 
We started with 20K sentence pairs of FBIS data and increased it to 241K sentence pairs. With 20K sentence pairs, ACME[2] achieves an AER of 23.7% in contrast to 34.3% AER for GIZA++(gdf).</Paragraph> <Paragraph position="2"> With 241K sentence pairs, ACME[2] yields 18.3% AER in contrast to 27.7% AER for GIZA++(gdf).</Paragraph> <Paragraph position="3"> We should emphasize that ACME[2] on only 20K sentence pairs yields a lower AER than those of all GIZA++ alignments obtained on 241K sentence pairs. Overall, ACME[2] achieves a relative error reduction of 31-38% over the input alignments, and a relative error reduction of 31-34% over GIZA++(gdf) for different sizes of training data.</Paragraph> </Section> <Section position="2" start_page="99" end_page="100" type="sub_section"> <SectionTitle> 5.2 Expanding to Additional Languages </SectionTitle> <Paragraph position="0"> We also investigated the applicability of ACME to additional language pairs. Table 6 presents the AER for GIZA++ and SAHMM (in each direction), three combination heuristics (gdf, int and union), and ACME[2] and ACME[4] on English-Arabic and English-Romanian data. We should emphasize that no POS tagger on the FL side was used for these experiments.</Paragraph> <Paragraph position="1"> On English-Arabic data, ACME[2] (with POS partitioning and including all features) yields 21.4% (20.7%) AER--a relative error reduction of 24.6% (13.0%) over the best combination heuristic with GIZA++ (SAHMM) alignments. ACME[4] reduces the AER to 18.1%--a relative error reduction of 36.3% and 23.9% over GIZA++(int) and SAHMM(int), respectively.</Paragraph> <Paragraph position="2"> On English-Romanian data, ACME[2] (with POS partitioning and including all features) yields 24.7% (26.2%) AER--a relative error reduction of 14.3% (10.6%) over the best combination heuristic with GIZA++ (SAHMM) alignments. 
ACME[4] reduces the AER to 22.3%--a relative error reduction of 22.6% and 23.9% over GIZA++(int) and SAHMM(int), respectively.</Paragraph> </Section> <Section position="3" start_page="100" end_page="100" type="sub_section"> <SectionTitle> 5.3 Precision, Recall and Upper-Bound Analysis </SectionTitle> <Paragraph position="0"> We now turn to a precision vs. recall analysis of different alignments to elucidate the nature of the differences between two alignments.</Paragraph> <Paragraph position="1"> Figure 2 presents precision and recall values for three combined alignments using GIZA++ (int, union, gdf) as well as results for ACME[2] and ACME[4] on three different language pairs. For all three pairs, the ranking of the combined alignments is the same with respect to precision and recall. GIZA++(int) yields the highest precision (nearly 95%) but the lowest recall (53-57%). Both union and gdf methods achieve low precision (56-68%) but high recall (75-83%), and gdf is better than union. By contrast, ACME[2] yields significantly higher precision (nearly 87%) but lower recall (67-75%) with respect to union and gdf. ACME[4] has higher precision and recall than ACME[2]--an absolute increase of 2-3% and 4%, respectively.</Paragraph> <Paragraph position="2"> Next we compute an oracle upper-bound in AER where mismatched input alignments are assumed to be resolved perfectly within the alignment combination framework (i.e., an oracle chooses the correct output in cases where the input aligners make different choices).</Paragraph> <Paragraph position="3"> Table 7 presents the upper bounds using a generic alignment combiner (denoted Oracle) with 2 and 4 input alignments on three language pairs, assuming a perfect resolution of mismatched input alignments. 
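The oracle combination can be sketched as follows, purely as an illustration. The sketch assumes that alignments and the gold standard are sets of (i, j) pairs, that links on which all aligners agree are kept as they are, and that the usual definition AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|) applies; the function names and the toy data are hypothetical.

```python
def aer(a, sure, probable):
    """Alignment error rate; sure links are assumed to be a subset of probable.

    Assumes a and sure are not both empty (otherwise the ratio is undefined).
    """
    hits = len(a.intersection(sure)) + len(a.intersection(probable))
    return 1.0 - hits / (len(a) + len(sure))

def oracle_combination(alignments, gold):
    """Keep links all aligners agree on; resolve disputed links using gold."""
    agreed = set.intersection(*alignments)   # links every aligner proposes
    proposed = set.union(*alignments)        # links at least one aligner proposes
    disputed = proposed - agreed             # links the aligners disagree on
    return agreed.union(disputed.intersection(gold))

# Toy data (assumed): two input alignments and a gold standard with P = S.
a1 = {(1, 1), (2, 2), (3, 3)}
a2 = {(1, 1), (2, 3), (3, 3)}
gold = {(1, 1), (2, 2), (3, 4)}
oracle = oracle_combination([a1, a2], gold)
oracle_aer = aer(oracle, gold, gold)
```

In this toy case the oracle keeps the agreed-upon but incorrect link (3, 3) and cannot recover (3, 4), which no aligner proposed; this is why the upper bound is not a zero error rate.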
For English-Chinese, the upper bound is 9.4% (using Oracle[2]) and 4.7% (using Oracle[4]). If the input aligners agree on a particular link, that decision is taken as the final output in computing the upper bound.</Paragraph> </Section> <Section position="4" start_page="100" end_page="100" type="sub_section"> <SectionTitle> ment Combination </SectionTitle> <Paragraph position="0"> The English-Arabic data exhibits a slightly higher upper bound of 5.5% for Oracle[4]. The upper bounds for AER on English-Romanian data are even higher (up to 17.7%), which indicates that the input alignments for this pair are significantly worse than those for the other pairs. This may be one of the main contributing factors to the lower improvement of ACME on English-Romanian in comparison to the other two language pairs.</Paragraph> </Section> <Section position="5" start_page="100" end_page="101" type="sub_section"> <SectionTitle> 5.4 MT Evaluation </SectionTitle> <Paragraph position="0"> To determine the contribution of improved alignment in an external application, we examined the improvement in an off-the-shelf phrase-based MT system, Pharaoh (Koehn, 2004), on both Chinese and Arabic data. In these experiments, all components of the MT system were kept the same except for the component that generates a phrase table from a given alignment.</Paragraph> <Paragraph position="1"> The input alignments were generated using GIZA++ and SAHMM on 107K (44K) sentence pairs for Chinese (Arabic). ACME (with English POS partitioning) combines alignments using model parameters learned from the corresponding manually aligned data. MT output is evaluated using the standard MT evaluation metric BLEU (Papineni et al., 2002). We used the NIST script (version 11a) with its default settings: case-insensitive matching of n-grams up to n = 4, and the shortest reference sentence for the brevity penalty.</Paragraph> <Paragraph position="2"> Table 8 presents the BLEU scores on MTEval'03 data for 5 different Pharaoh runs, one for each alignment. 
The parameters of the MT system were optimized on MTEval'02 data using minimum error rate training (Och, 2003).</Paragraph> <Paragraph position="3"> For the language model, the SRI Language Modeling Toolkit was used to train a trigram model with modified Kneser-Ney smoothing on 155M words of English newswire text, mostly from the Xinhua portion of the Gigaword corpus. During decoding, the number of English phrases per FL phrase was limited to 100 and the distortion of phrases was limited to 4. Based on the observations of Koehn et al. (2003), we also limited the phrase length to 3 for computational reasons.</Paragraph> <Paragraph position="4"> For both languages, ACME[2] and ACME[4] outperform the other three alignment combination techniques. ACME[4], for instance, yields BLEU scores of 25.59% for Chinese and 45.54% for Arabic--an absolute 1.6-1.7% BLEU point increase over the best of the other three alignment combinations. The differences between the BLEU scores for ACME and the other three BLEU scores are statistically significant, using a significance test with bootstrap resampling (Zhang et al., 2004).</Paragraph> </Section> </Section> <Section position="7" start_page="101" end_page="102" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> ME models have been previously applied to several NLP problems, including word alignments.</Paragraph> 
<Paragraph position="1"> For instance, the IBM models (Brown et al., 1993) can be improved by adding more context dependencies into the translation model using an ME framework rather than using only p(f</Paragraph> <Paragraph position="3"> In a later study, Och and Ney (2003) present a log-linear combination of the HMM and IBM Model 4 that produces better alignments than either of those.</Paragraph> <Paragraph position="4"> The major advantage of these two methods is that they do not require manually annotated data.</Paragraph> <Paragraph position="5"> The alignment process can be modeled as a product of a transition model and an observation model, where ME models the observations (Ittycheriah and Roukos, 2005). Significant improvements are reported using this approach, but the need for large manually aligned data is a bottleneck. An alternative ME approach models alignment directly as a log-linear combination of feature functions (Liu et al., 2005). Moore (2005) and Taskar et al. (2005) represent alignments with several feature functions that are then combined in a weighted sum to model word alignments. Once a confidence score is assigned to all links, a non-trivial search is invoked to find the best alignment using the scores associated with the links. The major difference between these approaches and that of ACME is that we use the ME model to predict the correct class for each alignment link independently using outputs of existing alignment systems, instead of generating them from scratch at the level of the whole sentence, thus eliminating the need for an exhaustive search over all possible alignments. In other words, previous approaches work globally, while ACME is a localized model. 
A discussion of these two contrasting approaches can be found in (Tillmann and Zhang, 2005).</Paragraph> <Paragraph position="6"> A recent attempt to combine outputs of different alignments views the combination problem as a classifier ensemble in the neural network framework (Ayan et al., 2005). However, this method is subject to the unpredictability of random network initialization, whereas ACME is guaranteed to find the model that maximizes the likelihood of training data.</Paragraph> </Section> </Paper>