File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/j00-2004_metho.xml
Size: 38,233 bytes
Last Modified: 2025-10-06 14:07:13
<?xml version="1.0" standalone="yes"?> <Paper uid="J00-2004"> <Title>Models of Translational Equivalence among Words</Title> <Section position="5" start_page="224" end_page="224" type="metho"> <SectionTitle> 3. The One-to-One Assumption </SectionTitle> <Paragraph position="0"> The most general word-to-word translation model trans(ū, v̄), where ū and v̄ range over word sequences in 𝓛1 and 𝓛2, has an infinite number of parameters. This model can be constrained in various ways to make it more practical. The models presented in this article are based on the one-to-one assumption: Each word is translated to at most one other word. In these models, ū and v̄ may consist of at most one word each.</Paragraph> <Paragraph position="1"> As before, one of the two sequences (but not both) may be empty. I shall describe empty sequences as consisting of a special NULL word, so that each word sequence will contain exactly one word and can be treated as a scalar. Henceforth, I shall write u and v instead of ū and v̄. Under the one-to-one assumption, a pair of bags containing m and n nonempty words can be generated by a process where the bag size l is anywhere between max(m, n) and m + n.6</Paragraph> </Section> <Section position="6" start_page="224" end_page="225" type="metho"> <Paragraph position="0"> 6 The number of permutations is smaller when either bag contains two or more identical elements, but this detail will not affect the estimation algorithms presented here.</Paragraph> <Paragraph position="1"> The one-to-one assumption is not as restrictive as it may appear: The explanatory power of a model based on this assumption may be raised to an arbitrary level by extending Western notions of what words are to include words that contain spaces (e.g., in English) or several characters (e.g., in Chinese). For example, I have shown elsewhere how to estimate word-to-word translation models where a word can be a noncompositional compound consisting of several space-delimited tokens (Melamed, to appear). For the purposes of this article, however, words are the tokens generated by my tokenizers and stemmers for the languages in question. Therefore, the models in this article are only a first approximation to the vast complexities of translational equivalence between natural languages. They are intended mainly as stepping stones towards better models.</Paragraph> </Section> <Section position="7" start_page="225" end_page="229" type="metho"> <SectionTitle> 4. Previous Work </SectionTitle> <Section position="1" start_page="225" end_page="226" type="sub_section"> <SectionTitle> 4.1 Models of Co-occurrence </SectionTitle> <Paragraph position="0"> Most methods for estimating translation models from bitexts start with the following intuition: Words that are translations of each other are more likely to appear in corresponding bitext regions than other pairs of words. Following this intuition, most authors begin by counting the number of times that word types in one half of the bitext co-occur with word types in the other half. Different co-occurrence counting methods stem from different models of co-occurrence.</Paragraph> <Paragraph position="1"> A model of co-occurrence is a Boolean predicate, which indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space.</Paragraph>
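To make the definition concrete, here is a minimal sketch (mine, not the paper's) of the boundary-based co-occurrence predicate assumed in the rest of the article: two word tokens co-occur exactly when they fall in aligned segments with the same index. The token representation is an illustrative assumption.

```python
def cooccur(token1, token2):
    """Boundary-based model of co-occurrence, expressed as a Boolean predicate.

    Each token is represented as a (segment_index, word) pair, where segment i
    in one half of the bitext is assumed to be a translation of segment i in
    the other half.
    """
    (i, _u), (j, _v) = token1, token2
    return i == j
```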
<Paragraph position="2"> Different models of co-occurrence are possible, depending on the kind of bitext map that is available, the language-specific information that is available, and the assumptions made about the nature of translational equivalence. All the translation models reviewed and introduced in this article can be based on any of the co-occurrence models described by Melamed (1998a). For expository purposes, however, I shall assume a boundary-based model of co-occurrence throughout this article. A boundary-based model of co-occurrence assumes that both halves of the bitext have been segmented into s segments, so that segment Ui in one half of the bitext and segment Vi in the other half are mutual translations, 1 ≤ i ≤ s.</Paragraph> <Paragraph position="3"> Under the boundary-based model of co-occurrence, there are several ways to compute co-occurrence counts cooc(u, v) between word types u and v. In the models of Brown, Della Pietra, Della Pietra, and Mercer (1993), reviewed in Section 4.3,
cooc(u, v) = Σ_i e_i(u) · f_i(v)
where e_i and f_i are the unigram frequencies of u and v, respectively, in each aligned segment pair i. For most translation models, this method produces suboptimal results, however, when e_i(u) > 1 and f_i(v) > 1. I argue elsewhere (Melamed 1998a) that
cooc(u, v) = Σ_i min[e_i(u), f_i(v)]
is preferable, and this is the method used for the models introduced in Section 5.</Paragraph> <Paragraph position="4"> Figure 1: nods and hoche often co-occur, as do nods and head. The direct association between nods and hoche, and the direct association between nods and head, give rise to an indirect association between hoche and head.</Paragraph> </Section> <Section position="2" start_page="226" end_page="227" type="sub_section"> <SectionTitle> 4.2 Nonprobabilistic Translation Lexicons </SectionTitle> <Paragraph position="0"> Many researchers have proposed greedy algorithms for estimating nonprobabilistic word-to-word translation models, also known as translation lexicons (e.g., Catizone, Russell, and Warwick 1989; Gale and Church 1991; Fung 1995; Kumano and Hirakawa 1994; Melamed 1995; Wu and Xia 1994). Most of these algorithms can be summarized as follows:
1. Choose a similarity function S between word types in 𝓛1 and word types in 𝓛2.
2. Compute association scores S(u, v) for a set of word type pairs (u, v) ∈ 𝓛1 × 𝓛2 that occur in training data.
3. Sort the word pairs in descending order of their association scores.
4. Discard all word pairs for which S(u, v) is less than a chosen threshold. The remaining word pairs become the entries in the translation lexicon.</Paragraph> <Paragraph position="2"> The various proposals differ mainly in their choice of similarity function. Almost all the similarity functions in the literature are based on a model of co-occurrence with some linguistically motivated filtering (see Fung [1995] for a notable exception). Given a reasonable similarity function, the greedy algorithm works remarkably well, considering how simple it is.</Paragraph>
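To ground Sections 4.1 and 4.2 in something executable, here is a small sketch of boundary-based co-occurrence counting under both definitions above, together with the four-step greedy lexicon construction. This is my illustration, not code from the paper or from the cited systems; the function names are invented, and the min-based cooc count is used as a crude stand-in for the similarity function S.

```python
from collections import Counter
from itertools import product

def unigram_counts(segment):
    """Unigram frequencies of word types in one text segment (a list of tokens)."""
    return Counter(segment)

def cooc_counts(bitext, method="min"):
    """Boundary-based co-occurrence counts cooc(u, v) over a segment-aligned bitext.

    bitext: list of (U, V) pairs, where U and V are token lists of aligned segments.
    method: "product" sums e_i(u) * f_i(v) (the Brown et al. formulation);
            "min" sums min(e_i(u), f_i(v)) (the formulation argued for above).
    """
    cooc = Counter()
    for U, V in bitext:
        e, f = unigram_counts(U), unigram_counts(V)
        for u, v in product(e, f):
            if method == "product":
                cooc[u, v] += e[u] * f[v]
            else:
                cooc[u, v] += min(e[u], f[v])
    return cooc

def greedy_lexicon(bitext, similarity, threshold):
    """Steps 1-4 of the greedy algorithm: score, sort, and threshold word type pairs."""
    scores = similarity(bitext)                                            # Steps 1-2
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)    # Step 3
    return [(pair, s) for pair, s in ranked if s >= threshold]             # Step 4

# Toy usage: the min-based cooc count itself serves as the similarity function.
bitext = [("nods her head".split(), "hoche la tête".split())]
lexicon = greedy_lexicon(bitext, cooc_counts, threshold=1)
```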
<Paragraph position="3"> However, the association scores in Step 2 are typically computed independently of each other. The problem with this independence assumption is illustrated in Figure 1. The two word sequences represent corresponding regions of an English/French bitext. If nods and hoche co-occur much more often than expected by chance, then any reasonable similarity metric will deem them likely to be mutual translations. Nods and hoche are indeed mutual translations, so their tendency to co-occur is called a direct association. Now, suppose that nods and head often co-occur in English. Then hoche and head will also co-occur more often than expected by chance. The dashed arrow between hoche and head in Figure 1 represents an indirect association, since the association between hoche and head arises only by virtue of the association between each of them and nods. Models of translational equivalence that are ignorant of indirect associations have "a tendency ... to be confused by collocates" (Dagan, Church, and Gale 1993, 5).</Paragraph> <Paragraph position="4"> Paradoxically, the irregularities (noise) in text and in translation mitigate the problem. If noise in the data reduces the strength of a direct association, then the same noise will reduce the strengths of any indirect associations that are based on this direct association. On the other hand, noise can reduce the strength of an indirect association without affecting any direct associations. Therefore, direct associations are usually stronger than indirect associations. If all the entries in a translation lexicon are sorted by their association scores, the direct associations will be very dense near the top of the list, and sparser towards the bottom.</Paragraph> <Paragraph position="5"> Gale and Church (1991) have shown that entries at the very top of the list can be over 98% correct. Their algorithm gleaned lexicon entries for about 61% of the word tokens in a sample of 800 English sentences. To obtain 98% precision, their algorithm selected only entries for which it had high confidence that the association score was high. These would be the word pairs that co-occur most frequently. A random sample of 800 sentences from the same corpus showed that those 61% of the word tokens, being tokens of the most frequent types, represent 4.5% of all the word types.</Paragraph> <Paragraph position="6"> A similar strategy was employed by Wu and Xia (1994) and by Fung (1995). Fung skimmed off the top 23.8% of the noun-noun entries in her lexicon to achieve a precision of 71.6%. Wu and Xia have reported automatic acquisition of 6,517 lexicon entries from a 3.3-million-word corpus, with a precision of 86%. The first 3.3 million word tokens in an English corpus from a similar genre contained 33,490 different word types, suggesting a recall of roughly 19%. Note, however, that Wu and Xia chose to weight their precision estimates by the probabilities attached to each entry: For example, if the translation set for English word detect has the two correct Chinese candidates [...] with 0.533 probability and [...] with 0.277 probability, and the incorrect translation [...] with 0.190 probability, then we count this as 0.810 correct translations and 0.190 incorrect translations. (Wu and Xia 1994, 211) This is a reasonable evaluation method, but it is not comparable to methods that simply count each lexicon entry as either right or wrong (e.g., Daille, Gaussier, and Langé 1994; Melamed 1996b). A weighted precision estimate pays more attention to entries that are more frequent and hence easier to estimate. Therefore, weighted precision estimates are generally higher than unweighted ones.</Paragraph>
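As a quick illustration of the difference (my own sketch; the probabilities are the ones in the Wu and Xia example quoted above), weighted precision credits each entry by its attached probability, whereas unweighted precision counts each entry as simply right or wrong:

```python
def weighted_precision(entries):
    """entries: list of (probability, is_correct) pairs for the entries being evaluated.
    Each entry contributes its probability mass as correct or incorrect."""
    correct = sum(p for p, ok in entries if ok)
    total = sum(p for p, _ in entries)
    return correct / total

def unweighted_precision(entries):
    """Each entry counts as simply right or wrong, regardless of its probability."""
    return sum(1 for _, ok in entries if ok) / len(entries)

# The 'detect' example: two correct candidates and one incorrect one.
detect = [(0.533, True), (0.277, True), (0.190, False)]
print(weighted_precision(detect))    # 0.810
print(unweighted_precision(detect))  # 0.667
```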
</Section> <Section position="3" start_page="227" end_page="229" type="sub_section"> <SectionTitle> 4.3 Reestimated Sequence-to-Sequence Translation Models </SectionTitle> <Paragraph position="0"> Most probabilistic translation model reestimation algorithms published to date are variations on the theme proposed by Brown et al. (1993b). These models involve conditional probabilities, but they can be compared to symmetric models if the latter are normalized by the appropriate marginal distribution. I shall review these models using the notation in Table 1.</Paragraph> <Paragraph position="1"> Table 1. Variables used to describe translation models.
(𝒰, 𝒱) = the two halves of the bitext
(U, V) = a pair of aligned text segments in (𝒰, 𝒱)
e(u) = the unigram frequency of u in U
f(v) = the unigram frequency of v in V
cooc(u, v) = the number of times that u and v co-occur
trans(v|u) = the probability that a token of u will be translated as a token of v</Paragraph> <Paragraph position="2"> Brown et al. (1993b) employ the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to estimate the parameters of their Model 1. On iteration i, the EM algorithm reestimates the model parameters trans_i(v|u) based on their estimates from iteration i−1. In Model 1, the relationship between the new parameter estimates and the old ones is
trans_i(v|u) = (1/z) Σ_(U,V)∈(𝒰,𝒱) [ trans_(i−1)(v|u) / Σ_(u′∈U) trans_(i−1)(v|u′) ] e(u) f(v)   (14)
where z is a normalizing factor.7 It is instructive to consider the form of Equation 14 when all the translation probabilities trans(v|u) for a particular u are initialized to the same constant p, as Brown et al. (1993b, 273) actually do:
trans_1(v|u) = (1/z) Σ_(U,V)∈(𝒰,𝒱) e(u) f(v) / |U|   (16)
The initial translation probability trans_1(v|u) is set proportional to the co-occurrence count of u and v and inversely proportional to the length of each segment U in which u occurs. The intuition behind the numerator is central to most bitext-based translation models: The more often two words co-occur, the more likely they are to be mutual translations. The intuition behind the denominator is that the co-occurrence count of u and v should be discounted to the degree that v also co-occurs with other words in the same segment pair.</Paragraph> <Paragraph position="3"> Now consider how Equation 16 would behave if all the text segments on each side were of the same length,8 so that each token of v co-occurs with exactly c words (where c is constant):
trans_1(v|u) = (1/zc) Σ_(U,V)∈(𝒰,𝒱) e(u) f(v) = cooc(u, v) / (zc)   (18)
The normalizing coefficient z is constant over all words. The only difference between Equations 16 and 18 is that the former discounts co-occurrences proportionally to the segment lengths. When information about segment lengths is not available, the only information available to initialize Model 1 is the co-occurrence counts. This property makes Model 1 an appropriate baseline for comparison to more sophisticated models that use other information sources, both in the work of Brown and his colleagues and in the work described here.</Paragraph> <Paragraph position="4"> 7 This expression is obtained by substituting Brown, Della Pietra, Della Pietra, and Mercer's (1993) Equation 17 into their Equation 14.</Paragraph> <Paragraph position="5"> 8 Or, equivalently, if the notion of segments were dispensed with altogether, as under the distance-based model of co-occurrence (Melamed 1998a).</Paragraph>
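For concreteness, the following is a rough sketch of a Model 1-style reestimation loop in the spirit of Equations 14 and 16, not Brown et al.'s actual implementation: the NULL word and many refinements are omitted, and all names are illustrative.

```python
from collections import Counter, defaultdict

def model1_reestimate(bitext, iterations=5):
    """EM-style reestimation of trans(v|u) for a Model 1-like translation model.

    bitext: list of (U, V) pairs of token lists for aligned segments.
    Starting from a uniform trans(v|u), the first iteration reduces to
    co-occurrence counts discounted by segment length, as in Equation 16.
    """
    # Uniform initialization: every pair starts with the same constant.
    trans = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        counts = Counter()
        for U, V in bitext:
            for v in V:
                # Each token of v distributes one unit of "link mass" over the
                # words u it co-occurs with, in proportion to trans(v|u).
                norm = sum(trans[u, v] for u in U)
                for u in U:
                    counts[u, v] += trans[u, v] / norm
        # Normalize the expected counts into conditional probabilities trans(v|u).
        totals = Counter()
        for (u, v), c in counts.items():
            totals[u] += c
        trans = defaultdict(float, {(u, v): c / totals[u] for (u, v), c in counts.items()})
    return trans
```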
<Paragraph position="6"> The positions of word tokens in the true bitext map correlate with the positions of their translations. The correlation is stronger for language pairs with more similar word order. Brown et al. (1988) introduced the idea that this correlation can be encoded in translation model parameters. Dagan, Church, and Gale (1993) expanded on this idea by replacing Brown et al.'s (1988) word alignment parameters, which were based on absolute word positions in aligned segments, with a much smaller set of relative offset parameters. The much smaller number of parameters allowed Dagan, Church, and Gale's model to be effectively trained on much smaller bitexts. Vogel, Ney, and Tillmann (1996) have shown how some additional assumptions can turn this model into a hidden Markov model, enabling even more efficient parameter estimation.</Paragraph> <Paragraph position="7"> It cannot be overemphasized that the word order correlation bias is just knowledge about the problem domain, which can be used to guide the search for the optimum model parameters. Translational equivalence can be empirically modeled for any pair of languages, but some models and model biases work better for some language pairs than for others. The word order correlation bias is most useful when it has high predictive power, i.e., when the distribution of alignments or offsets has low entropy. The entropy of this distribution is indeed relatively low for the language pair that both Brown and his colleagues and Dagan, Church, and Gale were working with: French and English have very similar word order. A word order correlation bias, as well as the phrase structure biases in Brown et al.'s (1993b) Models 4 and 5, would be less beneficial with noisier training bitexts or for language pairs with less similar word order. Nevertheless, one should use all available information sources if one wants to build the best possible translation model. Section 5.3 suggests a way to add the word order correlation bias to the models presented in this article.</Paragraph> </Section> <Section position="4" start_page="229" end_page="229" type="sub_section"> <SectionTitle> 4.4 Reestimated Bag-to-Bag Translation Models </SectionTitle> <Paragraph position="0"> At about the same time that I developed the models in this article, Hiemstra (1996) independently developed his own bag-to-bag model of translational equivalence. His model is also based on a one-to-one assumption, but it differs from my models in that it allows empty words in only one of the two bags, the one representing the shorter sentence. Thus, Hiemstra's model is similar to the first model in Section 5, but it has a little less explanatory power. Hiemstra's approach also differs from mine in his use of the Iterative Proportional Fitting Procedure (IPFP) (Deming and Stephan 1940) for parameter estimation.</Paragraph> <Paragraph position="1"> The IPFP is quite sensitive to initial conditions, so Hiemstra investigated a number of initialization options. Choosing the most advantageous, Hiemstra has published parts of the translational distributions of certain words, induced using both his method and Brown et al.'s (1993b) Model 1 from the same training bitext. Subjective comparison of these examples suggests that Hiemstra's method is more accurate. Hiemstra (1998) has also evaluated the recall and precision of his method and of Model 1 on a small hand-constructed set of link tokens in a particular bitext. Model 1 fared worse, on average.</Paragraph> </Section> </Section> <Section position="8" start_page="229" end_page="237" type="metho"> <SectionTitle> 5. Parameter Estimation </SectionTitle> <Paragraph position="0"> This section describes my methods for estimating the parameters of a symmetric word-to-word translation model from a bitext.
For most applications, we are interested in estimating the probability trans(u, v) of jointly generating the pair of words (u, v). Unfortunately, these parameters cannot be directly inferred from a training bitext, because we don't know which words in one half of the bitext were generated together with which words in the other half. The observable features of the bitext are only the co-occurrence counts cooc(u, v) (see Section 4.1).</Paragraph> <Paragraph position="1"> Methods for estimating translation parameters from co-occurrence counts typically involve link counts links(u, v), which represent hypotheses about the number of times that u and v were generated together, for each u and v in the bitext. A link token is an ordered pair of word tokens, one from each half of the bitext. A link type is an ordered pair of word types. The link counts links(u, v) range over link types. We can always estimate trans(u, v) by normalizing link counts so that Σ_(u,v) trans(u, v) = 1:
trans(u, v) = links(u, v) / Σ_(u′,v′) links(u′, v′)   (19)
For estimation purposes, it is convenient to also employ a separate set of nonprobabilistic parameters score(u, v), which represent the chances that u and v can ever be mutual translations, i.e., that there exists some context where tokens u and v are generated from the same concept. The relationship between score(u, v) and trans(u, v) can be more or less direct, depending on the model and its estimation method. Each of the models presented below uses a different score formulation.</Paragraph> <Paragraph position="2"> All my methods for estimating the translation parameters trans(u, v) share the following general outline:
1. Initialize the score parameters to a first approximation, based only on the co-occurrence counts.
2. Approximate the expected link counts links(u, v), as a function of the score parameters and the co-occurrence counts.
3. Estimate trans(u, v), by normalizing the link counts as in Equation 19.
4. If less than .0001 of the trans(u, v) distribution changed from the previous iteration, then stop.
5. Reestimate the parameters score(u, v), as a function of the link counts and the co-occurrence counts.
6. Repeat from Step 2.</Paragraph>
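Schematically, the outline above can be written as the following loop (a sketch in my own words, not the paper's code; initialize_scores, estimate_links, and reestimate_scores stand for the method-specific choices described in Sections 5.1 and 5.2):

```python
def estimate_translation_model(cooc, initialize_scores, estimate_links, reestimate_scores,
                               tolerance=0.0001):
    """Generic estimation loop shared by the methods below (Steps 1-6 above).

    cooc: dict mapping (u, v) word type pairs to co-occurrence counts.
    The three function arguments supply the method-specific score initialization,
    link-count approximation, and score reestimation.
    """
    score = initialize_scores(cooc)                                   # Step 1
    trans = {}
    while True:
        links = estimate_links(score, cooc)                           # Step 2
        total = sum(links.values())
        new_trans = {pair: n / total for pair, n in links.items()}    # Step 3 (Equation 19)
        change = sum(abs(new_trans.get(p, 0.0) - trans.get(p, 0.0))
                     for p in set(new_trans) | set(trans))
        trans = new_trans
        if change < tolerance:                                        # Step 4
            return trans
        score = reestimate_scores(links, cooc)                        # Step 5; repeat from Step 2
```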
<Paragraph position="3"> Under certain conditions, a parameter estimation process of this sort is an instance of the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). As explained below, meeting these conditions is computationally too expensive for my models.9 Therefore, I employ some approximations, which lack the EM algorithm's convergence guarantee.</Paragraph> <Paragraph position="4"> The maximum likelihood approach to estimating the unknown parameters is to find the set of parameters Θ that maximize the probability of the training bitext (𝒰, 𝒱). The probability of the bitext is a sum over the distribution 𝒜 of possible assignments:
Pr(𝒰, 𝒱 | Θ) = Σ_(A∈𝒜) Pr(𝒰, 𝒱, A | Θ)
The number of possible assignments grows exponentially with the size of aligned text segments in the bitext. Due to the parameter interdependencies introduced by the one-to-one assumption, we are unlikely to find a method for decomposing the assignments into parameters that can be estimated independently of each other (as in Brown et al. 1993b, Equation 26). Barring such a decomposition method, the MLE approach is infeasible. This is why we must make do with approximations to the EM algorithm.</Paragraph> <Paragraph position="5"> In this situation, Brown et al. (1993b, 293) recommend "evaluating the expectations using only a single, probable alignment." The single most probable assignment Amax is the maximum a posteriori (MAP) assignment:
Amax = argmax_(A∈𝒜) Σ_((u,v)∈A) log trans(u, v)   (26)
If we represent the bitext as a bipartite graph and weight the edges by log trans(u, v), then the right-hand side of Equation 26 is an instance of the weighted maximum matching problem and Amax is its solution. For a bipartite graph G = (V1 ∪ V2, E), with v = |V1 ∪ V2| and e = |E|, the lowest currently known upper bound on the computational complexity of this problem is O(ve + v² log v) (Ahuja, Magnanti, and Orlin 1993, 500). Although this upper bound is polynomial, it is still too expensive for typical bitexts.10 Subsection 5.1.2 describes a greedy approximation to the MAP approximation.</Paragraph> <Section position="1" start_page="231" end_page="233" type="sub_section"> <SectionTitle> 5.1 Method A: The Competitive Linking Algorithm </SectionTitle> <Paragraph position="0"> 5.1.1 Step 1: Initialization. Almost every translation model estimation algorithm exploits the well-known correlation between translation probabilities and co-occurrence counts. Many algorithms also normalize the co-occurrence counts cooc(u, v) by the marginal frequencies of u and v. However, these quantities account for only the three shaded cells in Table 2. The statistical interdependence between two word types can be estimated more robustly by considering the whole table. For example, Gale and Church (1991, 154) suggest that "φ², a χ²-like statistic, seems to be a particularly good choice because it makes good use of the off-diagonal cells" in the contingency table.</Paragraph> <Paragraph position="1"> The statistic used for initialization here is
G²(u, v) = −2 log [ B(a | a+b, p) B(c | c+d, p) / ( B(a | a+b, p1) B(c | c+d, p2) ) ]
where a, b, c, and d are the four cells of the contingency table for u and v, and B(k | n, p) = C(n, k) p^k (1−p)^(n−k) are binomial probabilities. The statistic uses maximum likelihood estimates for the probability parameters: p1 = a/(a+b), p2 = c/(c+d), p = (a+c)/(a+b+c+d). G² is easy to compute because the binomial coefficients in the numerator and in the denominator cancel each other out. All my methods initialize the parameters score(u, v) to G²(u, v), except that any pairing with NULL is initialized to an infinitesimal value. I have also found it useful to smooth the co-occurrence counts, e.g., using the Simple Good-Turing smoothing method (Gale and Sampson 1995), before computing G².</Paragraph>
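A small sketch of this initialization step (my own illustration, not the paper's code). The mapping from co-occurrence and marginal counts to the contingency cells a, b, c, d is an assumption on my part, since the excerpt omits Table 2; the G² formula follows the reconstruction given above.

```python
from math import log

def log_binom_pmf_no_coef(k, n, p):
    """log of p^k * (1-p)^(n-k); the binomial coefficient is omitted because it
    cancels between the numerator and the denominator of the G^2 ratio."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    return k * log(p) + (n - k) * log(1.0 - p)

def g_squared(a, b, c, d):
    """Log-likelihood-ratio association score from a 2x2 contingency table
    with cells a, b, c, d (a = joint count of u and v)."""
    p1 = a / (a + b)
    p2 = c / (c + d)
    p = (a + c) / (a + b + c + d)
    return -2.0 * (log_binom_pmf_no_coef(a, a + b, p)
                   + log_binom_pmf_no_coef(c, c + d, p)
                   - log_binom_pmf_no_coef(a, a + b, p1)
                   - log_binom_pmf_no_coef(c, c + d, p2))

def initialize_scores_g2(cooc, e, f, total):
    """Step 1: score(u, v) = G^2(u, v); pairings with NULL get an infinitesimal value.

    e, f: marginal frequencies of word types in the two halves of the bitext;
    total: total number of co-occurrence opportunities (assumed cell bookkeeping).
    """
    eps = 1e-12
    score = {}
    for (u, v), a in cooc.items():
        if u == "NULL" or v == "NULL":
            score[u, v] = eps
            continue
        b = e[u] - a           # u without v (assumed cell definition)
        c = f[v] - a           # v without u (assumed cell definition)
        d = total - a - b - c  # neither u nor v
        score[u, v] = g_squared(a, b, c, d)
    return score
```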
<Paragraph position="2"> To estimate the expected link counts, I employ the competitive linking algorithm, which is a greedy approximation to the MAP approximation:
1. Sort all the score(u, v) from highest to lowest.
2. For each score(u, v), in order:
(a) If u (resp., v) is NULL, consider all tokens of v (resp., u) in the bitext linked to NULL. Otherwise, link all co-occurring token pairs (u, v) in the bitext.
(b) The one-to-one assumption implies that linked words cannot be linked again. Therefore, remove all linked word tokens from their respective halves of the bitext.</Paragraph> <Paragraph position="3"> The competitive linking algorithm can be viewed as a heuristic search for the most likely assignment in the space of all possible assignments. The heuristic is that the most likely assignments contain links that are individually the most likely. The search proceeds by a process of elimination. In the first search iteration, all the assignments that do not contain the most likely link are discarded. In the second iteration, all the assignments that do not contain the second most likely link are discarded, and so on until only one assignment remains.11 The algorithm greedily selects the most likely links first, and then selects less likely links only if they don't conflict with previous selections. The probability of a link being rejected increases with the number of links that are selected before it, and thus decreases with the link's score. In this problem domain, the competitive linking algorithm usually finds one of the most likely assignments, as I will show in Section 6. Under an appropriate hashing scheme, the expected running time of the competitive linking algorithm is linear in the size of the input bitext.</Paragraph> <Paragraph position="4"> 11 The competitive linking algorithm can be generalized to stop searching before the number of possible assignments is reduced to one, at which point the link counts can be computed as probabilistically weighted averages over the remaining assignments. I use this method to resolve ties.</Paragraph> <Paragraph position="5"> The competitive linking algorithm and its one-to-one assumption are potent weapons against the ever-present sparse data problem. They enable accurate estimation of translational distributions even for words that occur only once, as long as the surrounding words are more frequent. In most translation models, link scores are correlated with co-occurrence frequency. So, links between tokens u and v for which score(u, v) is highest are the ones for which there is the most evidence, and thus also the ones that are easiest to predict correctly. Winner-take-all link assignment methods, such as the competitive linking algorithm, can prevent links based on indirect associations (see Section 4.2), thereby leveraging their accuracy on the more confident links to raise the accuracy of the less confident links. For example, suppose that u1 and u2 co-occur with v1 and v2 in the training data, and the model estimates score(u1, v1) = .05, score(u1, v2) = .02, and score(u2, v2) = .01. According to the one-to-one assumption, (u1, v2) is an indirect association and the correct translation of v2 is u2. To the extent that the one-to-one assumption is valid, it reduces the probability of spurious links for the rarer words. The more incorrect candidate translations can be eliminated for a given rare word, the more likely the correct translation is to be found. So, the probability of a correct match for a rare word is proportional to the fraction of words around it that can be linked with higher confidence. This fraction is largely determined by two bitext properties: the distribution of word frequencies, and the distribution of co-occurrence counts. Melamed (to appear) explores these properties in greater depth.</Paragraph>
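A compact sketch of the competitive linking step (again my own illustration; ties and the generalization mentioned in footnote 11 are not handled, and the data structures are chosen for clarity rather than for the linear expected running time discussed above):

```python
from collections import Counter

def competitive_linking(bitext, score):
    """Greedy approximation of the MAP assignment (Step 2 of the general outline).

    bitext: list of (U, V) token-list pairs for aligned segments.
    score: dict mapping (u, v) word type pairs to association scores.
    Returns link counts links(u, v) over link types.
    """
    links = Counter()
    # Remaining (still unlinked) tokens per segment pair, as multisets of word types.
    remaining = [(Counter(U), Counter(V)) for U, V in bitext]
    for (u, v), _ in sorted(score.items(), key=lambda kv: kv[1], reverse=True):
        for rem_U, rem_V in remaining:
            if u == "NULL":
                links["NULL", v] += rem_V[v]     # all remaining tokens of v link to NULL
                rem_V[v] = 0
            elif v == "NULL":
                links[u, "NULL"] += rem_U[u]
                rem_U[u] = 0
            else:
                n = min(rem_U[u], rem_V[v])      # link co-occurring, still-unlinked tokens
                if n:
                    links[u, v] += n
                    rem_U[u] -= n                # linked tokens cannot be linked again
                    rem_V[v] -= n
    return links
```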
<Paragraph position="6"> Method A reestimates the score parameters as the logarithm of the trans parameters. The competitive linking algorithm only cares about the relative magnitudes of the various score(u, v). However, Equation 26 is a sum rather than a product, so I scale the trans parameters logarithmically, to be consistent with its probabilistic interpretation:
scoreA(u, v) = log trans(u, v)   (28)</Paragraph> </Section> <Section position="2" start_page="233" end_page="237" type="sub_section"> <SectionTitle> 5.2 Method B: Improved Estimation Using an Explicit Noise Model </SectionTitle> <Paragraph position="0"> Yarowsky (1993, 271) has shown that "for several definitions of sense and collocation, an ambiguous word has only one sense in a given collocation with a probability of 90-99%." In other words, a single contextual clue can be a highly reliable indicator of a word's sense. One of the definitions of "sense" studied by Yarowsky was a word token's translation in the other half of a bitext. For example, the English word sentence may be considered to have two senses, corresponding to its French translations peine (judicial sentence) and phrase (grammatical sentence). If a token of sentence occurs in the vicinity of a word like jury or prison, then it is far more likely to be translated as peine than as phrase. "In the vicinity of" is one kind of collocation. Co-occurrence in bitext space is another kind of collocation. If each word's translation is treated as a sense tag (Resnik and Yarowsky 1997), then "translational" collocations have the unique property that the collocate and the word sense are one and the same! Method B exploits this property under the hypothesis that "one sense per collocation" holds for translational collocations. This hypothesis implies that if u and v are possible mutual translations, and a token u co-occurs with a token v in the bitext, then with very high probability the pair (u, v) was generated from the same concept and should be linked. To test this hypothesis, I ran one iteration of Method A on 300,000 aligned sentence pairs from the Canadian Hansards bitext. I then plotted the ratio links(u, v)/cooc(u, v) for several values of cooc(u, v) in Figure 2. The curves show that this ratio tends to be either very high or very low. This bimodality is not an artifact of the competitive linking process, because in the first iteration, linking decisions are based only on the initial similarity metric.</Paragraph> <Paragraph position="1"> Figure 2: The ratio links(u, v)/cooc(u, v), for several values of cooc(u, v).</Paragraph> <Paragraph position="2"> Information about how often words co-occur without being linked can be used to bias the estimation of translation model parameters. The smaller the ratio links(u, v)/cooc(u, v), the more likely it is that u and v are not mutual translations, and that links posited between tokens of u and v are noise. The bias can be implemented via auxiliary parameters that model the curve illustrated in Figure 2. The competitive linking algorithm creates all the links of a given type independently of each other.12 So, the distribution of the number links(u, v) of links connecting word types u and v can be modeled by a binomial distribution with parameters cooc(u, v) and p(u, v).
p(u, v) is the probability that u and v will be linked when they co-occur. There is never enough data to robustly estimate each p parameter separately. Instead, I shall model all the p's with just two parameters. For u and v that are mutual translations, p(u, v) will average to a relatively high probability, which I will call λ+. For u and v that are not mutual translations, p(u, v) will average to a relatively low probability, which I will call λ−. λ+ and λ− correspond to the two peaks of the distribution of links(u, v)/cooc(u, v) illustrated in Figure 2. The two parameters can also be interpreted as the rates of true and false positives. If the translation in the bitext is consistent and the translation model is accurate, then λ+ will be close to one and λ− will be close to zero.</Paragraph> <Paragraph position="3"> 12 Except for the case when multiple tokens of the same word type occur near each other, which I hereby sweep under the carpet.</Paragraph> <Paragraph position="4"> To find the most likely values of the auxiliary parameters λ+ and λ−, I adopt the standard method of maximum likelihood estimation, and find the values that maximize the probability of the link frequency distributions, under the usual independence assumptions:
Pr(links | cooc, λ+, λ−) = Π_(u,v) Pr(links(u, v) | cooc(u, v), λ+, λ−)   (29)
Table 3 summarizes the variables involved in this auxiliary estimation process.</Paragraph> <Paragraph position="5"> Table 3. Variables involved in estimating the auxiliary parameters.
links(u, v) = the number of times that u and v are hypothesized to co-occur as mutual translations
B(k | n, p) = probability of k being generated from a binomial distribution with parameters n and p
λ+ = probability of a link given mutual translations
λ− = probability of a link given not mutual translations
λ = probability of a link
τ = probability of mutual translations
K = total number of links in the bitext
N = total number of co-occurrences in the bitext</Paragraph> <Paragraph position="6"> The factors on the right-hand side of Equation 29 can be written explicitly with the help of a mixture coefficient. Let τ be the probability that an arbitrary co-occurring pair of word types are mutual translations. Let B(k | n, p) denote the probability that k links are observed out of n co-occurrences, where k has a binomial distribution with parameters n and p. Then the probability that word types u and v will be linked links(u, v) times out of cooc(u, v) co-occurrences is a mixture of two binomials:
Pr(links(u, v) | cooc(u, v), λ+, λ−) = τ B(links(u, v) | cooc(u, v), λ+) + (1 − τ) B(links(u, v) | cooc(u, v), λ−)   (30)</Paragraph> <Paragraph position="7"> One more variable allows us to express τ in terms of λ+ and λ−: Let λ be the probability that an arbitrary co-occurring pair of word tokens will be linked, regardless of whether they are mutual translations. Since τ is constant over all word types, it also represents the probability that an arbitrary co-occurring pair of word tokens are mutual translations. Therefore,
λ = τ λ+ + (1 − τ) λ−   (31)
λ can also be estimated empirically. Let K = Σ_(u,v) links(u, v) be the total number of links in the bitext, and let N be the total number of word token pair co-occurrences:
N = Σ_(u,v) cooc(u, v)   (33)
so that
λ = K/N   (34)
Equating the right-hand sides of Equations 31 and 34 and rearranging the terms, we get:
τ = (K/N − λ−) / (λ+ − λ−)   (35)
Since τ is now a function of λ+ and λ−, only the latter two variables represent degrees of freedom in the model.</Paragraph> <Paragraph position="8"> In the preceding equations, either u or v can be NULL. However, the number of times that a word co-occurs with NULL is not an observable feature of bitexts.
To make sense of co-occurrences with NULL, we can view co-occurrences as potential links and cooc(u, v) as the maximum number of times that tokens of u and v might be linked. From this point of view, cooc(u, NULL) should be set to the unigram frequency of u, since each token of u represents one potential link to NULL. Similarly for cooc(NULL, v). These co-occurrence counts should be summed together with all the others in Equation 33.</Paragraph> <Paragraph position="9"> The probability function expressed by Equations 29 and 30 may have many local maxima. In practice, these local maxima are like pebbles on a mountain, invisible at low resolution. I computed Equation 29 over various combinations of λ+ and λ− after one iteration of Method A over 300,000 aligned sentence pairs from the Canadian Hansard bitext. Figure 3 illustrates that the region of interest in the parameter space, where 1 > λ+ > λ > λ− > 0, has only one dominant global maximum. This global maximum can be found by standard hill-climbing methods, as long as the step size is large enough to avoid getting stuck on the pebbles.</Paragraph> <Paragraph position="10"> Figure 3: The probability of the link frequency distributions, as given in Equation 29, has only one global maximum in the region of interest, where 1 > λ+ > λ > λ− > 0.</Paragraph> <Paragraph position="11"> Given estimates for λ+ and λ−, we can compute B(links(u, v) | cooc(u, v), λ+) and B(links(u, v) | cooc(u, v), λ−) for each occurring combination of links and cooc values. These are the probabilities that links(u, v) links were generated out of cooc(u, v) possible links by a process that generates correct links and by a process that generates incorrect links, respectively. The ratio of these probabilities is the likelihood ratio in favor of the types u and v being possible mutual translations, for all u and v:
scoreB(u, v) = log [ B(links(u, v) | cooc(u, v), λ+) / B(links(u, v) | cooc(u, v), λ−) ]   (36)
Method B differs from Method A only in its redefinition of the score function in Equation 36. The auxiliary parameters λ+ and λ− and the noise model that they represent can be employed the same way in translation models that are not based on the one-to-one assumption.</Paragraph>
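The following sketch (mine, not the paper's code) evaluates the mixture likelihood of Equations 29 and 30 for candidate (λ+, λ−) values and computes the scoreB likelihood ratio of Equation 36. A crude grid search stands in for the hill climbing described above, and all names are illustrative.

```python
from math import comb, log

def binom_pmf(k, n, p):
    """B(k | n, p): probability of k links out of n co-occurrences."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def log_likelihood(links, cooc, lam_plus, lam_minus):
    """Log of Equation 29 using the mixture of Equation 30; tau follows Equation 35."""
    K = sum(links.values())
    N = sum(cooc.values())
    tau = (K / N - lam_minus) / (lam_plus - lam_minus)
    total = 0.0
    for pair, n in cooc.items():
        k = links.get(pair, 0)
        mix = tau * binom_pmf(k, n, lam_plus) + (1.0 - tau) * binom_pmf(k, n, lam_minus)
        total += log(mix)
    return total

def fit_lambdas(links, cooc, grid=20):
    """Crude grid search over 1 > lam_plus > lambda > lam_minus > 0
    (the paper uses hill climbing over the same region)."""
    lam = sum(links.values()) / sum(cooc.values())
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return max(((lp, lm) for lp in candidates for lm in candidates if lm < lam < lp),
               key=lambda lms: log_likelihood(links, cooc, *lms))

def score_B(links, cooc, lam_plus, lam_minus):
    """Equation 36: log B(k|n, lam+) - log B(k|n, lam-); binomial coefficients cancel."""
    scores = {}
    for pair, n in cooc.items():
        k = links.get(pair, 0)
        scores[pair] = (k * log(lam_plus / lam_minus)
                        + (n - k) * log((1.0 - lam_plus) / (1.0 - lam_minus)))
    return scores
```

Method C, described next, amounts to fitting λ+ and λ− separately on the word pairs in each link class and using the class-specific values in the same likelihood ratio, as in Equation 37.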
</Section> <Section position="3" start_page="237" end_page="237" type="sub_section"> <SectionTitle> 5.3 Method C: Improved Estimation Using Preexisting Word Classes </SectionTitle> <Paragraph position="0"> In Method B, the estimation of the auxiliary parameters λ+ and λ− depends only on the overall distribution of co-occurrence counts and link frequencies. All word pairs that co-occur the same number of times and are linked the same number of times are assigned the same score. More accurate models can be induced by taking into account various features of the linked tokens. For example, frequent words are translated less consistently than rare words (Catizone, Russell, and Warwick 1989). To account for these differences, we can estimate separate values of λ+ and λ− for different ranges of cooc(u, v). Similarly, the auxiliary parameters can be conditioned on the linked parts of speech. A kind of word order correlation bias can be effected by conditioning the auxiliary parameters on the relative positions of linked word tokens in their respective texts. Just as easily, we can model link types that coincide with entries in an on-line bilingual dictionary separately from those that do not (cf. Brown et al. 1993). When the auxiliary parameters are conditioned on different link classes, their optimization is carried out separately for each class:
scoreC(u, v | Z = class(u, v)) = log [ B(links(u, v) | cooc(u, v), λZ+) / B(links(u, v) | cooc(u, v), λZ−) ]   (37)
Section 6.1.1 describes the link classes used in the experiments below.</Paragraph> </Section> </Section> </Paper>