<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1026"> <Title>Identifying Word Correspondences in Parallel Texts</Title> <Section position="2" start_page="152" end_page="152" type="metho"> <SectionTitle> 2. Applications Beyond MT </SectionTitle> <Paragraph position="0"> As mentioned above, MT is not the only motivation for sentence alignment and word correspondence.</Paragraph> <Paragraph position="1"> Computational linguists (e.g., Klavans and Tzoukermann, 1990) have recently become interested in bilingual concordances. Table 1, for example, shows a bilingual concordance contrasting the uses of bank that are translated as banque with those that are translated as banc. Of course it is well-known that sense disambiguation is important for many natural language applications (including MT as well as many others). In the past, this fact has been seen as a serious obstacle blocking progress in natural language research, since sense disambiguation is a very tricky unsolved problem, and it is unlikely that it will be solved in the near future. However, we prefer to view these same facts in a more optimistic light. In many cases, the French text can be used to disambiguate the English text, so that the French can be used to generate a corpus of (partially) sense-disambiguated English text. Such a sense-disambiguated corpus would be a valuable resource for all kinds of natural language applications. In particular, the corpus could be used to develop and test sense-disambiguation algorithms. For example, if you have an algorithm that is intended to distinguish the &quot;money&quot; sense of bank from the &quot;place&quot; sense of bank, then you might apply your algorithm to all of the uses of bank in the English portion of the parallel corpus and use the French text to grade the results.</Paragraph> <Paragraph position="2"> That is, you would say that your program was correct if it identified a use of bank as a &quot;money&quot; sense and it was translated as banque, and you would say that the program was incorrect if the program identified the use as a &quot;money&quot; sense and it was translated as banc. Thus, the availability of the French text provides a valuable research opportunity, for both monolingual and bilingual applications. The French text can be used to help clarify distinctions in the English text that may not be obvious to a dumb computer.</Paragraph> <Paragraph position="3"> [Table 1: A bilingual concordance contrasting bank/banque (the &quot;money&quot; sense) with bank/banc (the &quot;place&quot; sense).]</Paragraph>
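The grading procedure described in the two paragraphs above can be made concrete in a few lines. The following is a minimal sketch, assuming a hypothetical tagger function and a list of (English, French) aligned regions; neither the function names nor the data format come from the paper:

```python
def grade_disambiguator(tag_sense, aligned_regions):
    """Score a bank sense tagger against the French side of a parallel corpus.

    tag_sense(english_region) is assumed to return "money" or "place" for a
    region containing "bank"; aligned_regions is assumed to be a list of
    (english_text, french_text) pairs from the aligned corpus.
    """
    correct = incorrect = 0
    for english, french in aligned_regions:
        if "bank" not in english.lower():
            continue
        sense = tag_sense(english)
        # The French translation acts as the answer key: banque marks the
        # "money" sense, banc marks the "place" sense.
        if "banque" in french.lower():
            truth = "money"
        elif "banc" in french.lower():
            truth = "place"
        else:
            continue  # no clear French evidence; skip this use
        if sense == truth:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect
```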
<Paragraph position="4"> 3. Using Word Correspondences Rather than Sentence Alignments Most bilingual concordance programs, such as ISSCO's BCP program mentioned in footnote 1 of (Warwick and Russell, 1990) and a similar program mentioned on page 20 of (Klavans and Tzoukermann, 1990), are based on aligned sentences rather than word correspondences.</Paragraph> <Paragraph position="5"> Table 1 shows an example of such a sentence-based concordance program. These sentence-based programs require the user to supply the program with both an English and a French word (e.g., bank and banque). In contrast, a word-based concordance program is given just bank and finds the French translations by making use of the word correspondences.</Paragraph> <Paragraph position="6"> The advantage of the word-based approach becomes important for complicated words like take, where it is difficult for users to generate many of the possible translations. take is often used in complex idiomatic expressions, and consequently, there are many uses of take that should not be translated with prendre. In fact, most uses of take are not translated with prendre (or any of its morphological variants). The word-based bilingual concordances show this fairly clearly. We find that only 23% of the uses of take are translated with a form of prendre, a figure that is fairly consistent with IBM's estimate of 28% (Brown, personal communication). The striking absence of prendre is consistent with the observation in the Cobuild dictionary (Sinclair et al., 1987, p. 1488) that &quot;[t]he most frequent use of take is in expressions where it does not have a very distinct meaning of its own, but where most of the meaning is in ... the direct object.&quot; 4. Two Possible Problems with the EM Algorithm This paper is primarily concerned with the task of identifying word correspondences. There is relatively little discussion of this topic in Brown et al. (1990), although a brief mention of the EM algorithm is made.</Paragraph> <Paragraph position="8"> We decided to look for an alternative estimation algorithm for two reasons.</Paragraph> <Paragraph position="9"> First, their procedure appears to require a prohibitive amount of memory. We observed that they limited the sizes of the English and French vocabularies, V_E and V_F, respectively, to just 9000 words each. Having constrained the vocabularies in this way, there were a mere 81 million parameters to estimate, all of which could be squeezed into memory at the same time.</Paragraph> <Paragraph position="10"> However, if the two vocabularies are increased to a more realistic size of 10^6 words, then there are 10^12 parameters to estimate, and it is no longer practical to store all of them in memory. (Apparently, in some more recent unpublished work (Brown, personal communication), they have also found a way to scale up the size of the vocabulary.)</Paragraph> <Paragraph position="11"> Secondly, we were concerned that their estimates might lack robustness (at least in some cases): &quot;This algorithm leads to a local maximum of the probability of the observed pairs as a function of the parameters of the model. There may be many such local maxima. The particular one at which we arrive will, in general, depend on the initial choice of parameters.&quot; (Brown et al., p. 82) In particular, we looked at their estimates for the word hear, which is surprisingly often translated as bravo (especially Hear, hear! → Bravo!), though it is not clear just how common this is. Brown et al.
reported that more than 99% of the uses of hear were translated with bravo, whereas we estimate the fraction to be much closer to 60% (which is fairly consistent with their more recent estimates (Brown, personal communication)). The fact that estimates can vary so widely, from 99% to 60%, indicates that there might be a serious problem with robustness. It became clear after more private discussions that our methods were coming up with substantially different probability estimates for quite a number of words. It is not clear that the maximum likelihood methods are robust enough to produce estimates that can be reliably replicated in other laboratories.</Paragraph> </Section> <Section position="3" start_page="152" end_page="152" type="metho"> <SectionTitle> 5. Contingency Tables </SectionTitle> <Paragraph position="0"> Because of the memory and robustness questions, we decided to explore an alternative to the EM algorithm.</Paragraph> <Paragraph position="1"> Table 2 illustrates a two-by-two contingency table for the English word house and the French word chambre.</Paragraph> <Paragraph position="2"> Cell a (upper-left) counts the number of sentences (aligned regions) that contain both house and chambre.</Paragraph> <Paragraph position="3"> Cell b (upper-right) counts the number of regions that contain house but not chambre. Cells c and d fill out the pattern in the obvious way.</Paragraph> <Paragraph position="4"> The table can be computed from freq(house, chambre), freq(house) and freq(chambre), the number of aligned regions that contain one or both of these words, and from N, the total number of aligned regions:</Paragraph> <Paragraph position="5"> a = freq(house, chambre), b = freq(house) - a, c = freq(chambre) - a, d = N - a - b - c.</Paragraph> <Paragraph position="6"> We can now measure the association between house and chambre by making use of any one of a number of association measures such as mutual information. φ², a χ²-like statistic, seems to be a particularly good choice because it makes good use of the off-diagonal cells b and c.</Paragraph> <Paragraph position="7"> φ² = (ad - bc)² / ((a + b)(a + c)(b + d)(c + d))</Paragraph> <Paragraph position="8"> φ² is bounded between 0 and 1. In this case, φ² is 0.62, a relatively high value, indicating that the two words are strongly associated, and that they may be translations of one another. One can make this argument more rigorous by measuring the confidence that φ² is different from chance (zero). In this case, the variance of φ² is estimated to be 2.5 × 10^-5 (see the section &quot;Calculation of Variances&quot;), and hence</Paragraph> <Paragraph position="9"> t = φ² / sqrt(var(φ²)) = 0.62 / sqrt(2.5 × 10^-5) ≈ 124.</Paragraph> <Paragraph position="10"> With such a large t, we can very confidently reject the null hypothesis and assume that there is very likely to be an association between house and chambre.</Paragraph> </Section> <Section position="4" start_page="152" end_page="156" type="metho"> <SectionTitle> 6. A Near Miss </SectionTitle> <Paragraph position="0"> Unfortunately, the pair house/communes (Table 3) is also significantly different from zero (t = 31), because there are many references in the Canadian Hansard to the English phrase House of Commons and its French equivalent Chambre des Communes. How do we know that house is more associated with chambre than with communes? Note that mutual information does not distinguish these two pairs. Recall that the mutual information I(x;y) is</Paragraph> <Paragraph position="1"> I(x;y) = log2 ( P(x,y) / ( P(x) P(y) ) ).</Paragraph> <Paragraph position="2"> Plugging the counts into the formulas, we find that house and chambre actually have a lower mutual information value than house and communes: I(house;chambre) = 4.1 while I(house;communes) = 4.2.</Paragraph>
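As a minimal illustration of this calculation, the mutual information of a word pair can be estimated from the same four cell counts a, b, c and d defined above (the function below is a sketch, not the authors' code):

```python
import math

def mutual_information(a, b, c, d):
    """Estimate I(x;y) = log2( P(x,y) / (P(x) P(y)) ) from a 2x2 contingency
    table, where a = regions containing both words, a + b = freq(x),
    a + c = freq(y), and N = a + b + c + d aligned regions."""
    n = a + b + c + d
    p_xy = a / n          # P(x, y)
    p_x = (a + b) / n     # P(x)
    p_y = (a + c) / n     # P(y)
    return math.log2(p_xy / (p_x * p_y))
```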
<Paragraph position="3"> Mutual information picks up the fact that there are strong associations in both cases. Unfortunately, it is not very good at deciding which association is stronger. Crucially, it does not make very good use of the off-diagonal cells b and c, which are often better estimated than cell a, since the counts in b and c are often larger than those in a.</Paragraph> <Paragraph position="4"> In this case, the crucial difference is that cell b is much smaller in Table 2 than in Table 3. φ² picks up this difference; Table 3 has a φ² of 0.098, significantly less than Table 2's φ² of 0.62:</Paragraph> <Paragraph position="5"> t = (φ²(house, chambre) - φ²(house, communes)) / sqrt(var(φ²(house, chambre)) + var(φ²(house, communes)))</Paragraph> <Paragraph position="6"> Thus, we can very confidently say that house (h) is more associated with chambre (ch) than with communes (co).</Paragraph> <Paragraph position="7"> 7. Calculation of Variances The estimate of var(φ²) is very important to this argument. We use the following reasoning:</Paragraph> <Paragraph position="9"> As φ² approaches 1, var(φ²) decreases to 0, which makes the equation for var_normal unsuitable as an estimate of the variance. We calculate a variance for this case by assuming that bc << ad, which implies</Paragraph> <Paragraph position="11"> We do not have an exact relation to specify when φ² is large and when it is small. Rather, we observe that each estimate produces a value that is small in its domain, so we estimate the variance of φ² by the minimum of the two cases: var(φ²) = min(var_normal, var_large).</Paragraph> <Paragraph position="12"> 8. Selecting Pairs We have now seen how we could decide that house and chambre are more associated than house and communes. But why did we decide to look at these pairs of words and not some others? As we mentioned before, we probably can't afford to look at all V_E × V_F pairs unless we limit the vocabulary sizes down to something like the 9000 word limit in Brown et al. And even then, there would be 81 million pairs to consider. If the training corpus is not too large (e.g., 50,000 regions), then it is possible to consider all pairs of words that actually co-occur in at least one region (i.e., a ≠ 0). Unfortunately, with a training corpus of N = 890,000 regions, we have found that there are too many such pairs and it becomes necessary to be more selective (heuristic).</Paragraph> <Paragraph position="13"> We have had fairly good success with a progressive deepening strategy. That is, select a small set of regions (e.g., 10,000) and use all of the training material to compute φ² for all pairs of words that appear in any of these 10,000 regions. Select the best pairs. That is, take a pair (x, y) if it has a φ² significantly better than any other pair of the form (x, z) or (w, y). This procedure would take house/chambre but not house/communes. Repeat this operation, using larger and larger samples of the training corpus to suggest possibly interesting pairs. On each iteration, remove pairs of words from the training corpus that have already been selected so that other alternatives can be identified. We have completed four passes of this algorithm, and selected more than a thousand pairs on each iteration.</Paragraph> <Paragraph position="14"> A few of the selected pairs are shown below. The first column indicates the iteration that the pair was selected on. The second column indicates the number of sentences (aligned regions) that the pair appears in. Note that the most frequent pairs are usually selected first, leaving less important pairs to be picked up on subsequent iterations. Thus, for example, accept/accepter is selected before accept/accepte.</Paragraph>
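A minimal sketch of the φ² score and the t test used in the selection procedure above; the φ² formula is the one given in Section 5, while the variance is passed in as a number (for example, the 2.5 × 10^-5 reported for house/chambre) rather than derived here:

```python
import math

def phi_squared(a, b, c, d):
    """phi-squared association score for a 2x2 contingency table; bounded
    between 0 and 1, and sensitive to the off-diagonal cells b and c."""
    return ((a * d - b * c) ** 2) / ((a + b) * (a + c) * (b + d) * (c + d))

def t_score(phi2, var_phi2):
    """Confidence that phi-squared differs from chance (zero)."""
    return phi2 / math.sqrt(var_phi2)

# With the values reported for house/chambre (Table 2):
# phi-squared = 0.62 and var = 2.5e-5 give t = 0.62 / 0.005 = 124.
print(round(t_score(0.62, 2.5e-5)))
```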
<Paragraph position="15"> Based on a sample of 1000 pairs, about 98% of the selected pairs of words are translations. Here, as elsewhere, we act to keep our errors of commission low.</Paragraph> <Section position="1" start_page="154" end_page="156" type="sub_section"> <SectionTitle> Iteration Freq English French </SectionTitle> <Paragraph position="0"> After a few iterations, it became clear that many of the pairs that were being selected were morphologically related to pairs that had already been selected on a previous iteration. A remarkably simple heuristic seemed to work fairly well to incorporate this observation. That is, assume that two pairs are morphologically related if both words start with the same first 5 characters. Then, select a pair if it is morphologically related to a pair that is already selected and it appears &quot;significantly often&quot; (in many more sentences than you would expect by chance) on any iteration. This very simple heuristic more than doubled the number of pairs that had been selected on the first four iterations, from 6419 to 13,466. As we will see in the next section, these 13 thousand pairs cover more than half of the words in the text. Again, the error rate for pairs selected by this procedure was low, less than two percent.</Paragraph> <Paragraph position="1"> 9. Returning to the Sentence Context It is now time to try to put these pairs back into their sentence context. Consider the pair of sentences mentioned previously.</Paragraph> <Paragraph position="2"> English: we took the initiative in assessing and amending current legislation and policies to ensure that they reflect a broad interpretation of the charter.</Paragraph> <Paragraph position="3"> French: nous avons pris l' initiative d' évaluer et de modifier des lois et des politiques en vigueur afin qu' elles correspondent à une interprétation généreuse de la charte.</Paragraph> <Paragraph position="4"> The matching procedure attempts to match English and French words using the selected pairs. When there are several possibilities, the procedure uses a slope condition to select the best pair. Thus, for example, there are two instances of the word and in the English sentence and two instances of the word et in the French sentence. We prefer to match the first and to the first et and the second and to the second et, as illustrated below. (The i and j columns give the positions into the English and French sentences, respectively. The column labeled slope indicates the difference between the j values for the current French word and the last previous non-NULL French word.)</Paragraph> <Paragraph position="5"> The matching procedure uses a dynamic programming optimization to find the sequence of j values with the best score. A sequence of j values is scored with Σ_j log prob(match | slope_j). Using Bayes' rule, prob(match | slope_j) is rewritten as prob(slope_j | match) prob(match). Both terms were estimated empirically.</Paragraph> <Paragraph position="6"> The second term is determined by the fan-in, the number of possible matches that a particular j value might play a role in. In this example, most of the j values had a fan-in of 1. However, the two instances of et had a fan-in of 2 because they could match either of the two instances of and. The score is smaller for both of these uses of et because there is more uncertainty. We considered three cases: the fan-in is 1, 2 or many.
The log prob(match) in each of these three cases is -0.05, -0.34 and -0.43, respectively.</Paragraph> <Paragraph position="7"> The first term is also determined empirically. The score is maximized for a slope of 1; in this case, log prob(slope | match) is -0.46. The score falls off rapidly with larger or smaller slopes.</Paragraph> <Paragraph position="8"> The dynamic programming optimization is also given the choice to match an English word to NULL. If the procedure elects this option, then a constant, log prob(NULL), is added to the score. This value is set so that the matching procedure will avoid making hard decisions when it isn't sure. For example, the 5th English word (in) could have been matched with the 16th French word (en), but it didn't do so because log prob(NULL) was more than the score of such a radical reordering. We have found that -5 is a good setting for log prob(NULL). If we set the value much higher, then the matching procedure attempts to reorder the text too much. If we set the value much lower, then the matching procedure does not attempt to reorder the text enough.</Paragraph> <Paragraph position="9"> This matching procedure works remarkably well. As mentioned above, based on a sample of 800 sentences, we estimate that the procedure matches 61% of the English words with some French word, and about 95% of these pairs match the English word with the appropriate French word. All but one of these errors of commission involved a function word, usually one surrounded on both sides by words that could not be matched.</Paragraph> </Section> </Section> </Paper>
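A rough sketch of the matching procedure described in Section 9 is given below. The numeric constants are the ones quoted in the text; the shape of the slope penalty away from its peak, the starting position of 0, and the dynamic-programming formulation keyed on the last non-NULL j are illustrative assumptions rather than the authors' implementation.

```python
# Log-probabilities quoted in the text: prob(match) by fan-in (1, 2, many),
# the best slope score (at slope = 1), and the NULL option.
LOG_P_MATCH = {1: -0.05, 2: -0.34, "many": -0.43}
LOG_P_BEST_SLOPE = -0.46
LOG_P_NULL = -5.0

def log_p_slope(slope):
    # Peaks at slope = 1 and "falls off rapidly" for larger or smaller slopes;
    # the linear fall-off below is an assumed shape, purely for illustration.
    return LOG_P_BEST_SLOPE - abs(slope - 1)

def log_p_match(fan_in):
    return LOG_P_MATCH[fan_in] if fan_in in (1, 2) else LOG_P_MATCH["many"]

def match_words(english, french, selected_pairs):
    """Assign each English word a French position j (or None for NULL),
    maximizing the summed log prob(match | slope_j) over the sentence."""
    candidates = [[j for j, f in enumerate(french) if (e, f) in selected_pairs]
                  for e in english]
    fan_in = {}
    for cs in candidates:
        for j in cs:
            fan_in[j] = fan_in.get(j, 0) + 1

    # Dynamic program: state = last non-NULL j (assumed to start at 0).
    states = {0: (0.0, [])}
    for cs in candidates:
        new_states = {}
        for last_j, (score, assignment) in states.items():
            choices = [(None, LOG_P_NULL, last_j)]
            for j in cs:
                s = log_p_slope(j - last_j) + log_p_match(fan_in[j])
                choices.append((j, s, j))
            for j, s, next_j in choices:
                cand = (score + s, assignment + [j])
                if next_j not in new_states or cand[0] > new_states[next_j][0]:
                    new_states[next_j] = cand
        states = new_states
    return max(states.values(), key=lambda sa: sa[0])[1]

# Toy usage: match_words(["we", "took"], ["nous", "avons", "pris"],
#                        {("took", "pris")}) returns [None, 2].
```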