<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1057"> <Title>Log-linear Models for Word Alignment</Title> <Section position="5" start_page="459" end_page="459" type="metho"> <SectionTitle> 2 Log-linear Models </SectionTitle> <Paragraph position="0"> Formally, we use the following definition for alignment.</Paragraph> <Paragraph position="1"> Given a source language ('English') sentence $\mathbf{e} = e_1^I = e_1, \ldots, e_i, \ldots, e_I$ and a target language ('French') sentence $\mathbf{f} = f_1^J = f_1, \ldots, f_j, \ldots, f_J$, we define a link $l = (i,j)$ to exist if $e_i$ and $f_j$ are translations (or parts of a translation) of one another. We define the null link $l = (i,0)$ to exist if $e_i$ does not correspond to a translation of any French word in $\mathbf{f}$. The null link $l = (0,j)$ is defined similarly. An alignment $\mathbf{a}$ is defined as a subset of the Cartesian product of the word positions:
$$\mathbf{a} \subseteq \{(i,j) : i = 0,\ldots,I;\ j = 0,\ldots,J\} \quad (1)$$
We define the alignment problem as finding the alignment $\mathbf{a}$ that maximizes $Pr(\mathbf{a}|\mathbf{e},\mathbf{f})$ given $\mathbf{e}$ and $\mathbf{f}$.</Paragraph> <Paragraph position="2"> We directly model the probability $Pr(\mathbf{a}|\mathbf{e},\mathbf{f})$.</Paragraph> <Paragraph position="3"> An especially well-founded framework is maximum entropy (Berger et al., 1996). In this framework, we have a set of $M$ feature functions $h_m(\mathbf{a},\mathbf{e},\mathbf{f})$, $m = 1,\ldots,M$. For each feature function, there exists a model parameter $\lambda_m$, $m = 1,\ldots,M$. The direct alignment probability is given by:</Paragraph> <Paragraph position="4">
$$Pr(\mathbf{a}|\mathbf{e},\mathbf{f}) = \frac{\exp\left[\sum_{m=1}^{M}\lambda_m h_m(\mathbf{a},\mathbf{e},\mathbf{f})\right]}{\sum_{\mathbf{a}'}\exp\left[\sum_{m=1}^{M}\lambda_m h_m(\mathbf{a}',\mathbf{e},\mathbf{f})\right]} \quad (2)$$
</Paragraph> <Paragraph position="5"> This approach was suggested by Papineni et al. (1997) for a natural language understanding task and successfully applied to statistical machine translation by Och and Ney (2002).</Paragraph> <Paragraph position="6"> We obtain the following decision rule:</Paragraph> <Paragraph position="7">
$$\hat{\mathbf{a}} = \mathop{\mathrm{argmax}}_{\mathbf{a}}\left\{\sum_{m=1}^{M}\lambda_m h_m(\mathbf{a},\mathbf{e},\mathbf{f})\right\} \quad (3)$$
</Paragraph> <Paragraph position="8"> Typically, the source language sentence $\mathbf{e}$ and the target sentence $\mathbf{f}$ are the fundamental knowledge sources for the task of finding word alignments. Linguistic data, which can be used to identify associations between lexical items, are often ignored by traditional word alignment approaches. Linguistic tools such as part-of-speech taggers, parsers, and named-entity recognizers have become increasingly robust and are now available for many languages. It is important to make use of linguistic information to improve alignment strategies. Treated as feature functions, syntactic dependencies can be easily incorporated into log-linear models.</Paragraph> <Paragraph position="9"> In order to incorporate a new dependency which contains extra information other than the bilingual sentence pair, we modify Eq. 2 by adding a new variable $\mathbf{v}$:</Paragraph> <Paragraph position="10">
$$Pr(\mathbf{a}|\mathbf{e},\mathbf{f},\mathbf{v}) = \frac{\exp\left[\sum_{m=1}^{M}\lambda_m h_m(\mathbf{a},\mathbf{e},\mathbf{f},\mathbf{v})\right]}{\sum_{\mathbf{a}'}\exp\left[\sum_{m=1}^{M}\lambda_m h_m(\mathbf{a}',\mathbf{e},\mathbf{f},\mathbf{v})\right]} \quad (4)$$
</Paragraph> <Paragraph position="11"> Note that our log-linear models are different from Model 6 proposed by Och and Ney (2003), which defines the alignment problem as finding the alignment $\mathbf{a}$ that maximizes $Pr(\mathbf{f},\mathbf{a}|\mathbf{e})$ given $\mathbf{e}$.</Paragraph>
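To make Eqs. 2 and 3 concrete, the following is a minimal sketch (not the paper's implementation) of how the log-linear probability and the decision rule could be computed; the feature functions, weights, and the candidate (n-best) list used to approximate the normalization are all assumed to be given.

```python
import math
from typing import Callable, List, Set, Tuple

Alignment = Set[Tuple[int, int]]                                # a set of links (i, j)
Feature = Callable[[Alignment, List[str], List[str]], float]   # h_m(a, e, f)


def loglinear_prob(a: Alignment, e: List[str], f: List[str],
                   features: List[Feature], weights: List[float],
                   candidates: List[Alignment]) -> float:
    """Eq. 2: exp(sum_m lambda_m * h_m(a, e, f)) normalized over alignments.

    The true normalization sums over all alignments; as in Section 4,
    this sketch approximates it with a candidate (n-best) list.
    """
    def score(al: Alignment) -> float:
        return sum(w * h(al, e, f) for w, h in zip(weights, features))

    return math.exp(score(a)) / sum(math.exp(score(al)) for al in candidates)


def decide(e: List[str], f: List[str], features: List[Feature],
           weights: List[float], candidates: List[Alignment]) -> Alignment:
    """Eq. 3: choose the alignment with the highest weighted feature sum."""
    return max(candidates,
               key=lambda al: sum(w * h(al, e, f)
                                  for w, h in zip(weights, features)))
```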
</Section> <Section position="6" start_page="459" end_page="461" type="metho"> <SectionTitle> 3 Feature Functions </SectionTitle> <Paragraph position="0"> In this paper, we use IBM translation Model 3 as the base feature of our log-linear models. In addition, we also make use of syntactic information such as part-of-speech tags and bilingual dictionaries.</Paragraph> <Section position="1" start_page="460" end_page="460" type="sub_section"> <SectionTitle> 3.1 IBM Translation Models </SectionTitle> <Paragraph position="0"> Brown et al. (1993) proposed a series of statistical models of the translation process. IBM translation models try to model the translation probability $Pr(f_1^J|e_1^I)$, which describes the relationship between a source language sentence $e_1^I$ and a target language sentence $f_1^J$. In statistical alignment models $Pr(f_1^J, a_1^J|e_1^I)$, a 'hidden' alignment $\mathbf{a} = a_1^J$ is introduced, which describes a mapping from a target position $j$ to a source position $i = a_j$. The relationship between the translation model and the alignment model is given by:</Paragraph> <Paragraph position="1">
$$Pr(f_1^J|e_1^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J|e_1^I)$$
</Paragraph> <Paragraph position="2"> Although IBM models are considered more coherent than heuristic models, they have two drawbacks. First, IBM models are restricted in such a way that each target word $f_j$ is assigned to exactly one source word $e_{a_j}$. A more general approach is to model alignment as an arbitrary relation between source and target language positions. Second, IBM models are typically language-independent and may fail to tackle problems that arise from specific languages. In this paper, we use Model 3 as our base feature function, which is given by:</Paragraph> <Paragraph position="3">
$$Pr(\mathbf{f},\mathbf{a}|\mathbf{e}) = \binom{J-\phi_0}{\phi_0}\, p_0^{J-2\phi_0}\, p_1^{\phi_0} \prod_{i=1}^{I} \phi_i!\, n(\phi_i|e_i) \prod_{j=1}^{J} t(f_j|e_{a_j}) \prod_{j:\, a_j \neq 0} d(j|a_j, I, J)$$
where $\phi_i$ is the fertility of $e_i$, and $n$, $t$, and $d$ are the fertility, lexical translation, and distortion models, with $p_0$ and $p_1$ governing words generated by the empty word. If there is a target word which is assigned to more than one source word, $h(\mathbf{a},\mathbf{e},\mathbf{f}) = 0$.</Paragraph> <Paragraph position="4"> We distinguish between two translation directions to use Model 3 as feature functions: treating English as the source language and French as the target language, or vice versa.</Paragraph> </Section> <Section position="2" start_page="460" end_page="460" type="sub_section"> <SectionTitle> 3.2 POS Tags Transition Model </SectionTitle> <Paragraph position="0"> The first linguistic information we adopt other than the source language sentence $\mathbf{e}$ and the target language sentence $\mathbf{f}$ is part-of-speech tags. The use of POS information for improving the statistical alignment quality of the HMM-based model is described in (Toutanova et al., 2002). They introduce an additional lexicon probability for POS tags in both languages.</Paragraph> <Paragraph position="1"> In IBM models as well as HMM models, when one needs the model to take new information into account, one must create an extended model which can base its parameters on the previous model. In log-linear models, however, new information can be easily incorporated.</Paragraph> <Paragraph position="2"> We use a POS Tags Transition Model as a feature function. This feature learns POS tag transition probabilities from held-out data (via simple counting) and then applies the learned distributions to the ranking of various word alignments. We use $\mathbf{eT} = eT_1^I$ and $\mathbf{fT} = fT_1^J$ to denote the POS tag sequences of the sentence pair $\mathbf{e}$ and $\mathbf{f}$. The POS Tags Transition Model is formally described as:</Paragraph> <Paragraph position="5">
$$p(\mathbf{fT}|\mathbf{eT},\mathbf{a}) = \prod_{a \in \mathbf{a}} t(fT_{a(j)}|eT_{a(i)})$$
where $a$ is an element of $\mathbf{a}$, $a(i)$ is the corresponding source position of $a$, and $a(j)$ is the target position.</Paragraph> <Paragraph position="7"> Hence, the feature function is:</Paragraph> <Paragraph position="8">
$$h(\mathbf{a},\mathbf{e},\mathbf{f},\mathbf{eT},\mathbf{fT}) = \prod_{a \in \mathbf{a}} t(fT_{a(j)}|eT_{a(i)})$$
</Paragraph> <Paragraph position="9"> We still distinguish between two translation directions to use the POS Tags Transition Model as feature functions: treating English as the source language and French as the target language, or vice versa.</Paragraph>
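As an illustration, here is a minimal sketch of how such a POS transition feature could be computed for one translation direction, assuming the transition probabilities are stored in a nested dictionary and link positions are 1-based (the data layout and the handling of null links and unseen tag pairs are our assumptions, not the paper's):

```python
from typing import Dict, List, Set, Tuple

def pos_transition_feature(alignment: Set[Tuple[int, int]],
                           e_tags: List[str], f_tags: List[str],
                           t: Dict[str, Dict[str, float]],
                           unseen: float = 1e-6) -> float:
    """Product of POS tag transition probabilities t(fT_a(j) | eT_a(i)) over links.

    Position 0 denotes the null word; null links contribute no transition term
    in this sketch.
    """
    prob = 1.0
    for i, j in alignment:
        if i == 0 or j == 0:
            continue  # skip null links
        e_tag = e_tags[i - 1]
        f_tag = f_tags[j - 1]
        prob *= t.get(e_tag, {}).get(f_tag, unseen)
    return prob
```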
</Section> <Section position="3" start_page="460" end_page="461" type="sub_section"> <SectionTitle> 3.3 Bilingual Dictionary </SectionTitle> <Paragraph position="0"> A conventional bilingual dictionary can be considered an additional knowledge source. We can use a feature that counts how many entries of a conventional lexicon co-occur in a given alignment between the source sentence and the target sentence. The weight for the provided conventional dictionary can therefore be learned. The intuition is that the conventional dictionary is expected to be more reliable than the automatically trained lexicon and should therefore receive a larger weight.</Paragraph> <Paragraph position="1"> We define a bilingual dictionary as a set of entries $D = \{(e, f, conf)\}$, where $e$ is a source language word, $f$ is a target language word, and $conf$ is a positive real-valued number (usually $conf = 1.0$) assigned by lexicographers to evaluate the validity of the entry. Therefore, the feature function using a bilingual dictionary is:</Paragraph> <Paragraph position="2">
$$h(\mathbf{a},\mathbf{e},\mathbf{f},D) = \sum_{(i,j) \in \mathbf{a}} occur(e_i, f_j, D)$$
where $occur(e, f, D)$ is $conf$ if the entry $(e, f, conf)$ is in $D$ and $0$ otherwise.</Paragraph> </Section> </Section> <Section position="7" start_page="461" end_page="461" type="metho"> <SectionTitle> 4 Training </SectionTitle> <Paragraph position="0"> We use the GIS (Generalized Iterative Scaling) algorithm (Darroch and Ratcliff, 1972) to train the model parameters $\lambda_1^M$ of the log-linear models according to Eq. 4. By applying suitable transformations, the GIS algorithm is able to handle any type of real-valued features. In practice, we use the YASMET toolkit written by Franz J. Och to perform the training.</Paragraph> <Paragraph position="1"> The renormalization needed in Eq. 4 requires a sum over a large number of possible alignments. If $\mathbf{e}$ has length $l$ and $\mathbf{f}$ has length $m$, there are $2^{lm}$ possible alignments between $\mathbf{e}$ and $\mathbf{f}$ (Brown et al., 1993). It is unrealistic to enumerate all possible alignments when $lm$ is very large. Hence, we approximate this sum by sampling the space of all possible alignments with a large set of highly probable alignments. The set of considered alignments is also called the n-best list of alignments.</Paragraph> <Paragraph position="2"> We train model parameters on a development corpus, which consists of hundreds of manually aligned bilingual sentence pairs. Using an n-best approximation may result in the problem that the parameters trained with the GIS algorithm yield worse alignments even on the development corpus. This can happen because, with the modified model scaling factors, the n-best list can change significantly and can include alignments that have not been taken into account in training. To avoid this problem, we iteratively combine n-best lists to train model parameters until the resulting n-best list does not change, as suggested by Och (2002). However, as this training procedure is based on the maximum likelihood criterion, there is only a loose relation to the final alignment quality. Having a series of model parameters when the iteration ends, we select the model parameters that yield the best alignments on the development corpus.</Paragraph> <Paragraph position="3"> After the bilingual sentences in the development corpus are tokenized (or segmented) and POS tagged, they can be used to train POS tag transition probabilities by counting relative frequencies:</Paragraph> <Paragraph position="4">
$$t(fT|eT) = \frac{N_A(fT, eT)}{N(eT)}$$
</Paragraph> <Paragraph position="5"> Here, $N_A(fT, eT)$ is the frequency with which the POS tag $fT$ is aligned to the POS tag $eT$, and $N(eT)$ is the frequency of $eT$ in the development corpus.</Paragraph>
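A minimal sketch of this relative-frequency estimation, assuming each development-corpus item is a pair of POS tag sequences plus a set of 1-based alignment links (the data layout is our assumption):

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def train_pos_transitions(
    corpus: List[Tuple[List[str], List[str], Set[Tuple[int, int]]]]
) -> Dict[str, Dict[str, float]]:
    """Estimate t(fT | eT) = N_A(fT, eT) / N(eT) by simple counting.

    Each corpus item is (e_tags, f_tags, alignment); null links are skipped.
    """
    aligned = defaultdict(lambda: defaultdict(int))  # N_A(fT, eT)
    tag_count = defaultdict(int)                     # N(eT)

    for e_tags, f_tags, alignment in corpus:
        for e_tag in e_tags:
            tag_count[e_tag] += 1
        for i, j in alignment:
            if i == 0 or j == 0:
                continue  # skip null links in this sketch
            aligned[e_tags[i - 1]][f_tags[j - 1]] += 1

    return {e_tag: {f_tag: n / tag_count[e_tag] for f_tag, n in row.items()}
            for e_tag, row in aligned.items()}
```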
</Section> <Section position="8" start_page="461" end_page="462" type="metho"> <SectionTitle> 5 Search </SectionTitle> <Paragraph position="0"> We use a greedy search algorithm to search for the alignment with the highest probability in the space of all possible alignments. A state in this space is a partial alignment. A transition is defined as the addition of a single link to the current state. Our start state is the empty alignment, where all words in $\mathbf{e}$ and $\mathbf{f}$ are assigned to null. A terminal state is a state in which no more links can be added to increase the probability of the current alignment. Our task is to find the terminal state with the highest probability.</Paragraph> <Paragraph position="1"> For efficiency, we can compute a heuristic function, gain, instead of the probability. The gain is defined as follows:</Paragraph> <Paragraph position="2">
$$gain(\mathbf{a}, l) = \frac{Pr(\mathbf{a} \cup \{l\}\,|\,\mathbf{e},\mathbf{f})}{Pr(\mathbf{a}\,|\,\mathbf{e},\mathbf{f})} \quad (12)$$
where $l = (i,j)$ is a link added to $\mathbf{a}$.</Paragraph> <Paragraph position="4"> The greedy search algorithm for general log-linear models is formally described as follows:
Input: $\mathbf{e}$, $\mathbf{f}$, $\mathbf{eT}$, $\mathbf{fT}$, and $D$
Output: $\mathbf{a}$
1. Start with $\mathbf{a} = \emptyset$.
2. For each $l = (i,j)$ with $l \notin \mathbf{a}$: compute $gain(\mathbf{a}, l)$.
3. Terminate if $\forall l:\ gain(\mathbf{a}, l) \leq 1$.
4. Add the link $\hat{l}$ with the maximal $gain(\mathbf{a}, l)$ to $\mathbf{a}$.
5. Go to 2.</Paragraph> <Paragraph position="7"> The above search algorithm, however, is not efficient for our log-linear models. It is time-consuming for each feature to compute a probability when adding a new link, especially when the sentences are very long. For our models, $gain(\mathbf{a}, l)$ can be obtained in a more efficient way (we still call the new heuristic function gain to reduce notational overhead, although the gain in Eq. 13 is not equivalent to the one in Eq. 12). Note that we restrict $h(\mathbf{a},\mathbf{e},\mathbf{f}) \geq 0$ for all feature functions.</Paragraph> <Paragraph position="10"> The original termination condition for the greedy search algorithm is $\forall l:\ gain(\mathbf{a}, l) \leq 1$. With the new gain, we instead use a gain threshold $t$, a real-valued number which can be optimized on the development corpus.</Paragraph> <Paragraph position="13"> Therefore, we have a new search algorithm:
Input: $\mathbf{e}$, $\mathbf{f}$, $\mathbf{eT}$, $\mathbf{fT}$, $D$, and $t$
Output: $\mathbf{a}$
1. Start with $\mathbf{a} = \emptyset$.
2. For each $l = (i,j)$ with $l \notin \mathbf{a}$: compute $gain(\mathbf{a}, l)$.
3. Terminate if $\forall l:\ gain(\mathbf{a}, l) \leq t$.
4. Add the link $\hat{l}$ with the maximal $gain(\mathbf{a}, l)$ to $\mathbf{a}$.
5. Go to 2.</Paragraph> <Paragraph position="18"> The gain threshold $t$ depends on the added link $l$. We remove this dependency for simplicity when using it in the search algorithm by treating it as a fixed real-valued number.</Paragraph>
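The following is a runnable sketch of the thresholded greedy search above. For simplicity it computes the gain of Eq. 12 directly from the weighted feature sums (the probability ratio, whose normalization cancels), rather than the more efficient per-feature gain of Eq. 13; the feature functions are assumed to be closures with e, f, eT, fT, and D already bound in.

```python
import math
from typing import Callable, List, Set, Tuple

Alignment = Set[Tuple[int, int]]
Feature = Callable[[Alignment], float]   # h_m with e, f, eT, fT, D bound in


def greedy_align(I: int, J: int, features: List[Feature],
                 weights: List[float], threshold: float = 1.0) -> Alignment:
    """Greedy search of Section 5: repeatedly add the link with maximal gain.

    gain(a, l) = Pr(a + {l} | e, f) / Pr(a | e, f) (Eq. 12); the normalization
    cancels, so only the weighted feature sums are needed.
    """
    def score(a: Alignment) -> float:
        return sum(w * h(a) for w, h in zip(weights, features))

    a: Alignment = set()              # 1. start with the empty alignment
    current = score(a)
    while True:
        best_link, best_gain = None, threshold
        for i in range(1, I + 1):     # 2. compute gain(a, l) for every l not in a
            for j in range(1, J + 1):
                link = (i, j)
                if link in a:
                    continue
                gain = math.exp(score(a | {link}) - current)
                if gain > best_gain:
                    best_link, best_gain = link, gain
        if best_link is None:         # 3. terminate if all gains are <= threshold
            return a
        a.add(best_link)              # 4. add the best link, 5. repeat
        current = score(a)
```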
</Section> <Section position="9" start_page="462" end_page="464" type="metho"> <SectionTitle> 6 Experimental Results </SectionTitle> <Paragraph position="0"> We present in this section the results of experiments on a parallel corpus of Chinese-English texts. Statistics for the corpus are shown in Table 1. We use a training corpus, which is used to train the IBM translation models, a bilingual dictionary, a development corpus, and a test corpus.</Paragraph> <Paragraph position="1"> Table 1: Statistics of the training corpus (Train), bilingual dictionary (Dict), development corpus (Dev), and test corpus (Test).</Paragraph> <Paragraph position="2"> The Chinese sentences in both the development and test corpus are segmented and POS tagged by ICTCLAS (Zhang et al., 2003). The English sentences are tokenized by a simple tokenizer of ours and POS tagged by a rule-based tagger written by Eric Brill (Brill, 1995). We manually aligned 935 sentences, from which we selected 500 sentences as the test corpus. The remaining 435 sentences are used as the development corpus to train POS tag transition probabilities and to optimize the model parameters and the gain threshold.</Paragraph> <Paragraph position="3"> Provided with human-annotated word-level alignments, we use precision, recall, and AER (Och and Ney, 2003) to score the Viterbi alignments of each model against the gold-standard annotated alignments (a small computational sketch of these metrics is given at the end of this section):</Paragraph> <Paragraph position="4">
$$precision = \frac{|A \cap P|}{|A|}, \qquad recall = \frac{|A \cap S|}{|S|}, \qquad AER = 1 - \frac{|A \cap P| + |A \cap S|}{|A| + |S|}$$
</Paragraph> <Paragraph position="5"> where $A$ is the set of word pairs aligned by the word alignment system, $S$ is the set marked in the gold standard as "sure", and $P$ is the set marked as "possible" (including the "sure" pairs). In our Chinese-English corpus, only one type of alignment was marked, meaning that $S = P$.</Paragraph> <Paragraph position="6"> In the following, we present the results of log-linear models for word alignment. We used the GIZA++ package (Och and Ney, 2003) to train the IBM translation models. The training scheme is 15H535, which means that Model 1 is trained for five iterations, the HMM model for five iterations, and finally Model 3 for five iterations. Except for changing the number of iterations for each model, we use the default configuration of GIZA++. After that, we used three types of methods for performing a symmetrization of IBM models: intersection, union, and refined methods (Och and Ney, 2003).</Paragraph> <Paragraph position="7"> The base feature of our log-linear models, IBM Model 3, takes the parameters generated by GIZA++ as its own parameters. In other words, our log-linear models share the same parameters with GIZA++, apart from the POS transition probability table and the bilingual dictionary.</Paragraph> <Paragraph position="8"> Table 2 compares the results of our log-linear models with IBM Model 3. Rows 3 to 7 show results obtained by IBM Model 3; rows 8 to 12 show results obtained by the log-linear models.</Paragraph> <Paragraph position="9"> As shown in Table 2, our log-linear models achieve better results than IBM Model 3 for all training corpus sizes. Considering the Model 3 E-C results of GIZA++ and ours alone, the greedy search algorithm described in Section 5 yields surprisingly better alignments than the hill-climbing algorithm in GIZA++.</Paragraph> <Paragraph position="10"> Table 3 compares the results of the log-linear models with IBM Model 5. The training scheme is 15H5354555. Our log-linear models still make use of the parameters generated by GIZA++.</Paragraph> <Paragraph position="11"> Comparing Table 3 with Table 2, we notice that our log-linear models yield slightly better alignments by employing parameters generated by the training scheme 15H5354555 rather than 15H535, which can be attributed to the improvement of parameters after further Model 4 and Model 5 training. For the log-linear models, POS information and an additional dictionary are used, which is not the case for the GIZA++/IBM models. However, treated as a method for performing symmetrization, log-linear combination alone yields better results than the intersection, union, and refined methods.</Paragraph> <Paragraph position="12"> Figure 1 shows the effect of the gain threshold on precision, recall, and AER with fixed model scaling factors.</Paragraph> <Paragraph position="13"> Figure 2 shows the effect of the number of features and the size of the training corpus on search efficiency for log-linear models.</Paragraph> <Paragraph position="14"> Table 4 shows the resulting normalized model scaling factors. We see that adding new features also has an effect on the other model scaling factors.</Paragraph>
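For reference, here is a minimal sketch of the precision, recall, and AER computations defined earlier in this section, assuming alignments are represented as sets of (i, j) word-pair links:

```python
from typing import Set, Tuple

Links = Set[Tuple[int, int]]

def alignment_scores(A: Links, S: Links, P: Links) -> Tuple[float, float, float]:
    """Precision, recall, and AER (Och and Ney, 2003).

    A: links produced by the aligner, S: 'sure' gold links,
    P: 'possible' gold links (a superset of S). With only one annotation
    type, as in the Chinese-English corpus here, pass S = P.
    """
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1.0 - (len(A & P) + len(A & S)) / (len(A) + len(S))
    return precision, recall, aer
```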
</Section> </Paper>