<?xml version="1.0" standalone="yes"?>
<Paper uid="J96-1001">
  <Title>Translating Collocations for Bilingual Lexicons</Title>
  <Section position="5" start_page="5" end_page="14" type="metho">
    <SectionTitle>
4. The Similarity Measure
</SectionTitle>
    <Paragraph position="0"> To rank the proposed translations so that the best one is selected, Champollion uses a quantitative measure of correlation between the source collocation and its complete or partial translations. This measure is also used to reduce the search space to a manageable size, by filtering out partial translations that are not highly correlated with the source collocation. In this section, we discuss the properties of similarity measures that are appropriate for our application. We explain why the Dice coefficient meets these criteria and why this measure is more appropriate than another frequently used measure, mutual information.</Paragraph>
    <Paragraph position="1"> Our approach is based on the assumption that each collocation is unambiguous in the source language and has a unique translation in the target language (at least in a clear majority of the cases). In this way, we can ignore the context of the collocations and their translations, and base our decisions only on the patterns of co-occurrence of each collocation and its candidate translations across the entire corpus. This approach is quite different from those adopted for the translation of single words (Klavans and Tzoukermann 1990; Dorr 1992; Klavans and Tzoukermann 1996), since for single words polysemy cannot be ignored; indeed, the problem of sense disambiguation has been linked to the problem of translating ambiguous words (Brown et al. 1991; Dagan, Itai, and Schwall 1991; Dagan and Itai 1994). The assumption of a single meaning per collocation was based on our previous experience with English collocations (Smadja 1993), is supported for less opaque collocations by the fact that their constituent words tend to have a single sense when they appear in the collocation (Yarowsky 1993), and was verified during our evaluation of Champollion (Section 7).</Paragraph>
    <Paragraph position="2"> We construct a mathematical model of the events we want to correlate, namely, the appearance of any word or group of words in the sentences of our corpus, as follows: To each group of words G, in either the source or the target language, we map a binary random variable X_G that takes the value &quot;1&quot; if G appears in a particular sentence and &quot;0&quot; if not. Then, the corpus of paired sentences comprising our database represents a collection of samples for the various random variables X for the various groups of words. Each new sentence in the corpus provides a new independent sample for every variable X_G. For example, if G is unemployment rate and the words unemployment rate appear only in the fifth and fifty-fifth sentences of our corpus (not necessarily in that order and perhaps with other words intervening), then in our sample collection, X_G [...] [Running head: Smadja, McKeown, and Hatzivassiloglou, Translating Collocations for Bilingual Lexicons] [...] mutual information represents the log-likelihood ratio of the joint probability of seeing a &quot;1&quot; in both variables over the probability that such an event would have if the two variables were independent, and thus provides a measure of the departure from independence.</Paragraph>
    <Paragraph position="3"> The Dice coefficient, on the other hand, combines the conditional probabilities p(X=1 | Y=1) and p(Y=1 | X=1) with equal weights in a single number. This can be shown by replacing p(X=1, Y=1) on the right side of equation (1) [3]:</Paragraph>
    <Paragraph position="5"> As is evident from the above equation, the Dice coefficient depends only on the conditional probabilities of seeing a &quot;1&quot; for one of the variables after seeing a &quot;1&quot; for the other variable, and not on the marginal probabilities of &quot;1&quot;s for the two variables.</Paragraph>
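    <Paragraph> The dependence on the conditional probabilities alone can be checked numerically. The following is a minimal sketch, assuming the usual count-based form Dice(X, Y) = 2 f_XY / (f_X + f_Y); rewriting it in terms of the two conditional probabilities gives 2 / (1/p(X=1 | Y=1) + 1/p(Y=1 | X=1)), their harmonic mean. The function names and the counts are illustrative and do not come from the paper.

```python
def dice(f_xy, f_x, f_y):
    """Dice coefficient from the co-occurrence count f_xy and the individual counts f_x, f_y."""
    return 2.0 * f_xy / (f_x + f_y)


def dice_from_conditionals(p_x_given_y, p_y_given_x):
    """The same quantity written as the harmonic mean of the two conditional probabilities."""
    return 2.0 / (1.0 / p_x_given_y + 1.0 / p_y_given_x)


# Hypothetical counts: X appears in 8 sentences, Y in 4, both together in 2.
f_xy, f_x, f_y = 2, 8, 4
direct = dice(f_xy, f_x, f_y)
via_conditionals = dice_from_conditionals(f_xy / f_y, f_xy / f_x)
print(direct, via_conditionals)
```

The two values agree, although the second computation never sees the marginal frequencies or the corpus size.</Paragraph>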
    <Paragraph position="6"> In contrast, both the average and the specific mutual information depend on both the conditional and the marginal probabilities. For SI(X, Y) in particular, we have</Paragraph>
    <Paragraph position="8"> To select among the three measures, we first observe that for our application, 1-1 matches (paired samples where both X and Y are 1) are significant while 0-0 matches (samples where both X and Y are 0) are not. These two types of matches correspond to the cases where either both word groups of interest appear in a pair of aligned sentences or neither word group does. Seeing the two word groups in aligned sentences (a 1-1 match) certainly contributes to their association and increases our belief that one is the translation of the other. Similarly, seeing only one of them (a 1-0 or 0-1 mismatch) decreases our belief in their association. But, given the many possible groups of words that can appear in each sentence, the fact that neither of two groups of words appears in a pair of aligned sentences does not offer any information about their similarity. Even when the word groups have been observed relatively few times (together or separately), seeing additional sentences containing none of the groups of words we are interested in should not affect our estimate of their similarity.</Paragraph>
    <Paragraph position="9"> 3 In the remainder of this discussion, we assume that p(X=1, Y=1) is not zero. This is a justified assumption for our model, since we cannot say that two words or word groups will not occur in the same sentence or in a sentence and its translation; such an event may well happen by chance, or because the words or word groups are parts of different syntactic constituents, even for unrelated words and word groups. The above assumption guarantees that all three measures are always well-defined; in particular, it guarantees that the marginal probabilities p(X=1) and p(Y=1) and the conditional probabilities p(X=1 | Y=1) and p(Y=1 | X=1) are all nonzero. [Running head: Computational Linguistics Volume 22, Number 1]</Paragraph>
    <Paragraph position="10"> In other words, in our case, X and Y are highly asymmetric; a &quot;1&quot; value (and a 1-1 match) is much more informative than a &quot;0&quot; value (or 0-0 match). Therefore, we should select a similarity measure that is based only on 1-1 matches and mismatches. 0-0 matches should be completely ignored; otherwise, they would dominate the similarity measure, given the overall relatively low frequency of any particular word or word group in our corpus.</Paragraph>
    <Paragraph position="11"> The Dice coefficient satisfies the above requirement of asymmetry: adding 0-0 matches does not change any of the absolute frequencies f_XY, f_X, and f_Y, and so does not affect Dice(X, Y). On the other hand, average mutual information depends only on the distribution of X and Y and not on the actual values of the random variables.</Paragraph>
    <Paragraph position="12"> In fact, I(X, Y) is a completely symmetric measure. If the variables X and Y are transformed so that every &quot;1&quot; is replaced with a &quot;0&quot; and vice versa, the average mutual information between X and Y remains the same. This is appropriate in the context of communications for which mutual information was originally developed (Shannon 1948), where the ones and zeros encode two different states with no special preference for either of them. But in the context of translation, exchanging the &quot;1&quot;s and &quot;0&quot;s is equivalent to considering a word or word group to be present when it was absent and vice versa, thus converting all 1-1 matches to 0-0 matches and all 0-0 matches to 1-1 matches. As explained above, such a change should not be considered similarity preserving, since 1-1 matches are much more significant than 0-0 ones.</Paragraph>
    <Paragraph position="13"> As a concrete example, consider a corpus of 100 matched sentences, where each of the word groups associated with X and Y appears five times. Furthermore, suppose that the two groups appear twice in a pair of aligned sentences and each word group also appears three times by itself. This situation is depicted in the column labeled &quot;Original Variables&quot; in Table 1. Since each word group appears two times with the other group and three times by itself, we would normally consider the source and target groups somewhat similar but not strongly related. And indeed, the value of the Dice coefficient ($\frac{2 \times 2}{5 + 5} = 0.4$) intuitively corresponds to that assessment of similarity.[4] Now, suppose that the &quot;0&quot;s and &quot;1&quot;s in X and Y are exchanged, so that the situation is now described by the last column of Table 1. The transformed variables now indicate that out of 100 sentences, the two word groups appear together 92 times, while each appears by itself three times and there are two sentences that contain none of the groups. We would consider such evidence to strongly indicate very high similarity between the two groups, and indeed the Dice coefficient of the transformed variables is now $\frac{2 \times 92}{95 + 95} = 0.9684$. However, the average mutual information of the variables would remain the same.</Paragraph>
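    <Paragraph> The figures in this example can be reproduced with a short script. This is a sketch, assuming the 2x2 table implied by the text (2 sentence pairs containing both groups, 3 containing only one group in each direction, 92 containing neither), the count-based form of the Dice coefficient, and base-2 logarithms for average mutual information; the function names are illustrative.

```python
import math


def dice(n11, n10, n01):
    """Dice coefficient from 1-1 matches and the two kinds of mismatches."""
    return 2.0 * n11 / ((n11 + n10) + (n11 + n01))


def average_mutual_information(n11, n10, n01, n00):
    """Average mutual information I(X, Y) in bits from a 2x2 contingency table."""
    total = n11 + n10 + n01 + n00
    # Each entry: (joint count, total for that value of X, total for that value of Y).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    result = 0.0
    for joint, x_total, y_total in cells:
        if joint > 0:  # a zero cell contributes nothing to the sum
            result += (joint / total) * math.log2(joint * total / (x_total * y_total))
    return result


original = (2, 3, 3, 92)      # 1-1, 1-0, 0-1, 0-0 counts described in the text
transformed = (92, 3, 3, 2)   # the same table with every 0 and 1 exchanged

print(round(dice(*original[:3]), 4))      # 0.4
print(round(dice(*transformed[:3]), 4))   # 0.9684
print(math.isclose(average_mutual_information(*original),
                   average_mutual_information(*transformed)))  # True
```

The Dice coefficient moves from 0.4 to 0.9684 under the exchange, while the average mutual information is the same for both tables.</Paragraph>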
    <Paragraph position="14"> Specific mutual information falls somewhere in between the Dice coefficient and average mutual information: it is not completely symmetric but neither does it ignore 0-0 matches. This measure is very sensitive to the marginal probabilities (relative frequencies) of the &quot;1&quot;s in the two variables, tending to give higher values as these probabilities decrease. Adding 0-0 matches lowers the relative frequencies of &quot;1&quot;s, and therefore always increases the estimate of SI(X, Y). Furthermore, as the marginal probabilities of the two word groups become very small, SI(X, Y) tends to infinity, independently of the distribution of matches (including 1-1 and 0-0 ones) and mismatches, as long as the joint probability of 1-1 matches is not zero. By taking the limit of SI(X, Y) for p(X=1) → 0 or p(Y=1) → 0 in equation (2) we can easily verify that this happens even if the conditional probabilities p(X=1 | Y=1) and p(Y=1 | X=1) remain constant, a fact that should indicate a constant degree of relatedness between the two variables. Neither of these problems occurs with the Dice coefficient, exactly because that measure combines the conditional probabilities of &quot;1&quot;s in both directions without looking at the marginal distributions of the two variables. In fact, in cases such as the examples of Table 1, where p(X=1 | Y=1) = p(Y=1 | X=1), the Dice coefficient becomes equal to these conditional probabilities.</Paragraph>
    <Paragraph position="15"> The dependence of SI(X, Y) on the marginal probabilities of &quot;1&quot;s shows that using it would make rare word groups look more similar than they really are. For our example in Table 1, the specific mutual information is $SI(X, Y) = \log \frac{0.02}{0.05 \times 0.05} = \log 8 = 3$ bits for the original variables, but $SI(X', Y') = \log \frac{0.92}{0.95 \times 0.95} = \log 1.019391 = 0.027707$ bits for the transformed variables. Note, however, that the change is in the opposite direction from the appropriate one; that is, the new variables are deemed far less similar than the old ones. This can be attributed to the fact that the number of &quot;1&quot;s in the original variables is far smaller.</Paragraph>
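    <Paragraph> The sensitivity of specific mutual information to the marginal probabilities can be reproduced as follows. This is a sketch, assuming SI(X, Y) is the base-2 logarithm of p(X=1, Y=1) / (p(X=1) p(Y=1)), as in the computation above, with probabilities estimated as relative frequencies; the padded corpus sizes are hypothetical.

```python
import math


def dice(n11, n10, n01):
    """Dice coefficient; the 0-0 count is not an input at all."""
    return 2.0 * n11 / (2 * n11 + n10 + n01)


def specific_mi(n11, n10, n01, n00):
    """Specific mutual information in bits from a 2x2 table of sentence counts."""
    total = n11 + n10 + n01 + n00
    return math.log2(n11 * total / ((n11 + n10) * (n11 + n01)))


print(specific_mi(2, 3, 3, 92))             # original variables: 3.0
print(round(specific_mi(92, 3, 3, 2), 6))   # transformed variables: 0.027707

# Pad the corpus with sentences that contain neither group (hypothetical sizes).
for n00 in (92, 992, 9992):
    print(n00, dice(2, 3, 3), round(specific_mi(2, 3, 3, n00), 3))
```

In the loop the Dice value stays at 0.4 while the SI estimate keeps growing, even though the co-occurrence pattern of the two groups has not changed.</Paragraph>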
    <Paragraph position="16"> SI(X, Y) also suffers disproportionately from estimation errors when the observed counts of &quot;1&quot;s are very small. While all similarity measures will be inaccurate when the data is sparse, the results produced by specific mutual information can be more misleading than the results of other measures, because SI is not bounded. This is not a problem for our application, as Champollion applies absolute frequency thresholds to avoid considering very rare words and word groups; but it indicates another potential problem with the use of SI to measure similarity.</Paragraph>
    <Paragraph position="17"> Finally, another criterion for selecting a similarity measure is its suitability for testing for a particular outcome, where outcome is determined by the application. In our case, we need a clear-cut test to decide when two events are correlated. Both for mutual information and the Dice coefficient, this involves comparison with an experimentally determined threshold. Although the two measures are similar in that they compare the joint probability p(X=1, Y=1) with the marginal probabilities, they have different asymptotic behaviors. This was demonstrated in the previous paragraphs for the cases of small and decreasing relative frequencies. Here we examine two more cases associated with specific tests. We consider the two extreme cases, where
* The two events are perfectly independent. In this case, p(X=x, Y=y) = p(X=x) p(Y=y) for all x and y.</Paragraph>
    <Paragraph position="19"> * The two events are perfectly correlated in the positive direction: each word group appears every time (and only when) the other appears in the corresponding sentence. Then p(X=1, Y=1) = p(X=1) = p(Y=1).</Paragraph>
    <Paragraph position="21"> In the first case, both average and specific mutual information are equal to 0, since $\log \frac{p(X=x, Y=y)}{p(X=x)\,p(Y=y)} = \log 1 = 0$ for all x and y, and are thus easily testable, whereas the Dice coefficient is equal to $\frac{2 \times p(X=1) \times p(Y=1)}{p(X=1) + p(Y=1)}$ and is thus a function of the individual frequencies of the two word groups. In this case, the test is easier to decide using mutual information. In the second case, the results are reversed; specific mutual information is equal to $\log \frac{p(X=1)}{(p(X=1))^2} = -\log(p(X=1))$, and it can be shown that the average mutual information becomes equal to the entropy H(X) of X (or Y). Both of these measures depend on the individual probabilities (or relative frequencies) of the word groups, whereas the Dice coefficient is equal to $\frac{2 \times p(X=1)}{p(X=1) + p(X=1)} = 1$. In this case, the test is easier to decide using the Dice coefficient. Since we are looking for a way to identify positively correlated events we must be able to easily test the second case, while testing the first case is not relevant. Specific mutual information is a good measure of independence (which it was designed to measure), but good measures of independence are not necessarily good measures of similarity.</Paragraph>
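    <Paragraph> The two extreme cases can be illustrated numerically. This is a sketch, assuming for simplicity that both word groups have the same marginal probability p (a hypothetical choice) and that logarithms are taken in base 2; the function names are illustrative.

```python
import math


def dice_p(p_xy, p_x, p_y):
    """Dice coefficient from the joint and marginal probabilities of a 1."""
    return 2.0 * p_xy / (p_x + p_y)


def si_p(p_xy, p_x, p_y):
    """Specific mutual information in bits."""
    return math.log2(p_xy / (p_x * p_y))


def extreme_cases(p):
    """Return (Dice, SI) under perfect independence and under perfect positive correlation."""
    # Perfect independence: the joint probability is the product of the marginals.
    independent = (dice_p(p * p, p, p), si_p(p * p, p, p))
    # Perfect positive correlation: the joint probability equals both marginals.
    correlated = (dice_p(p, p, p), si_p(p, p, p))
    return independent, correlated


for p in (0.5, 0.05, 0.005):
    print(p, extreme_cases(p))
```

Under independence SI is always 0 while Dice varies with p; under perfect correlation Dice is always 1 while SI varies with p, which is why a fixed threshold on Dice is the easier test for positive correlation.</Paragraph>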
    <Paragraph position="22"> The above arguments all support the use of the Dice coefficient over either average or specific mutual information. We have confirmed the theoretically expected behavior of the similarity measures through testing. In our early work on Champollion (Smadja 1992), we used specific mutual information (SI) as a correlation metric. After carefully studying the errors produced, we suspected that the Dice measure would produce better results for our task, according to the arguments given above.</Paragraph>
    <Paragraph position="23"> Consider the example given in Table 2. In the table, the second column represents candidate French word pairs for translating the single word today. The third column gives the frequency of the word today in a subset of the Hansards containing 182,584 sentences. The fourth column gives the frequency of each French word pair in the French counterpart of the same corpus, and the fifth column gives the frequency of appearance of today and each French word pair in matched sentences. Finally, the sixth and seventh columns give the similarity scores for today and each French word pair computed according to the Dice measure or specific mutual information (in bits) respectively. Of the four candidates, aujourd hui (shown in bold) is the only correct translation.[5] We see from the table that the specific mutual information scores fail to identify aujourd hui as the best candidate; it is ranked only fourth. Furthermore, the four SI scores are very similar, thus not clearly differentiating the results. In contrast, the Dice coefficient clearly identifies aujourd hui as the group of words most similar to today, which is what we want.</Paragraph>
    <Paragraph position="24"> Table 2: Dice versus specific mutual information scores for the English word today. The correct translation is shown in bold.</Paragraph>
    <Paragraph position="25"> 5 Note that the correct translation is really a single word in contemporary French. Aujourd'hui has evolved from a collocation (au jour d'hui) which has become so rigid that it is now considered a single word. Hui can still appear on its own, but aujourd is not a French word, so Champollion's French tokenizer erroneously considered the apostrophe character as a word separator in this case. Champollion will correct this error by putting aujourd and hui back together and identifying them as a rigid collocation.</Paragraph>
    <Paragraph position="26"> After implementing Champollion, we attempted to generalize these results and confirm our theoretical argumentation by performing an experiment to compare SI and the Dice coefficient in the context of Champollion. We selected a set of 45 collocations with mid-range frequency identified by XTRACT and we ran Champollion on them using sample training corpora (databases). For each run of Champollion, and for each input collocation, we took the final set of candidate translations of different lengths produced by Champollion (with the intermediate stages driven by the Dice coefficient) and compared the results obtained using both the Dice coefficient and SI at the last stage for selecting the proposed translation. The 45 collocations were randomly selected from a larger set of 300 collocations so that the Dice coefficient's performance on them is representative (i.e., approximately 70% of them are translated correctly by Champollion when the Dice measure is used), and the correct translation is always included in the final set of candidate translations. In this way, the number of erroneous decisions made when SI is used at the final pass is a lower bound on the number of errors that would have been made if SI had also been used in the intermediate stages.</Paragraph>
    <Paragraph position="27"> We compared the results and found that out of the 45 source collocations, * 2 were not frequent enough in the database to produce any candidate translations.</Paragraph>
    <Paragraph position="28"> * Using the Dice coefficient, 36 were correctly translated and 7 were incorrectly translated.</Paragraph>
    <Paragraph position="29"> * Using SI, 26 were correctly translated and 17 incorrectly.[6]
Table 3 summarizes these results and shows the breakdown across categories. In the table, the numbers of collocations correctly and incorrectly translated when the Dice coefficient is used are shown in the second and third rows respectively. For both cases, the second column indicates the number of collocations that were correctly translated with SI and the third column indicates the number of these collocations that were incorrectly translated with SI. The last column and the last row show the total number of collocations correctly and incorrectly translated when the Dice coefficient [...] represents candidate translations in French (for the credit cards example: cartes, cartes crédit, cartes crédit taux, and cartes crédit taux paient). The correct translations are again shown in bold. The third and fourth columns give the independent frequencies of each word group, while the fifth column gives the number of times that both groups appear in matched sentences. The two subsequent columns give the similarity values computed according to the Dice coefficient and specific mutual information (in bits). The corpus used for these examples contained 54,944 sentences in each language. We see from Table 4 that, as for the today example in Table 2, the SI scores are very close to each other and fail to select the correct candidate whereas the Dice scores cover a wider range and clearly peak for the correct translation.</Paragraph>
    <Paragraph position="30"> In conclusion, both theoretical arguments and experimental results support the choice of the Dice coefficient over average or specific mutual information for our  in Champollion.</Paragraph>
    <Paragraph position="31"> 5. Champollion: The Algorithm and the Implementation
Champollion translates single words or collocations in one language into collocations (including single word translations) in a second language using the aligned corpus as a reference database. Before running Champollion there are two steps that must be carried out: source and target language sentences of the database corpus must be aligned and a list of collocations to be translated must be provided in the source language. For our experiments, we used corpora that had been aligned by Gale and Church's sentence alignment program (Gale and Church 1991b) as our input data.[8] Since our intent in this paper is to evaluate Champollion, we tried not to introduce errors into the training data; for this purpose, we kept only the 1-1 alignments. Indeed, more complex sentence alignments tend to have a much higher alignment error rate (Gale and Church 1991b).</Paragraph>
    <Paragraph position="32"> By doing so, we lost an estimated 10% of the text (Brown, Lai, and Mercer 1991), which was not problematic since we had enough data. In the future, we plan to design more flexible techniques that would work from a loosely aligned corpus (see Section 9).</Paragraph>
    <Paragraph position="33"> To compile collocations, we used XTRACT on the English version of the Hansards.</Paragraph>
    <Paragraph position="34"> Some of the collocations retrieved are shown in Table 5. Collocations labeled &quot;fixed,&quot; such as International Human Rights Covenants, are rigid compounds. Collocations labeled &quot;flexible&quot; are pairs of words that can be separated by intervening words or occur in reverse order, possibly with different inflected forms.</Paragraph>
    <Paragraph position="35"> Given a source English collocation, Champollion first identifies in the database corpus all the sentences containing the source collocation. It then attempts to find all words that can be part of the translation of the collocation, producing all words that are highly correlated with the source collocation as a whole. Once this set of words is identified, Champollion iteratively combines these words in groups, so that each group is in turn highly correlated with the source collocation. Finally, Champollion produces as the translation the largest group of words having a high correlation with the source collocation.</Paragraph>
    <Paragraph position="36"> More precisely, for a given source collocation, Champollion initially identifies a set S of k words that are highly correlated with the source collocation. This operation is described in detail in Section 5.1 below. Champollion assumes that the target collocation is a combination of some subset of these words. Its search space at this point thus consists of the powerset P(S) of S containing 2^k elements. Instead of computing a correlation factor for each of the 2^k elements with the source collocation, Champollion searches a part of this space in an iterative manner. Champollion first forms all pairs of words in S, evaluates the correlation between each pair and the source collocation using the Dice coefficient, and keeps only those pairs that score above some threshold. Subsequently, it constructs the three-word elements of P(S) containing one of</Paragraph>
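    <Paragraph> The iterative search over part of the powerset can be sketched as follows. This is an illustration only: the text above is cut off before it finishes describing how larger groups are built, so the sketch assumes that each surviving group is extended by one more word from S. The toy corpus, the threshold value, and all function names are hypothetical.

```python
def dice(source_ids, target_ids):
    """Dice coefficient between two sets of sentence identifiers."""
    if not source_ids and not target_ids:
        return 0.0
    return 2.0 * len(source_ids.intersection(target_ids)) / (len(source_ids) + len(target_ids))


def sentences_with(group, target_sentences):
    """Identifiers of target sentences that contain every word of the group."""
    return {i for i, words in target_sentences.items() if group.issubset(words)}


def iterative_search(source_ids, seed_words, target_sentences, threshold):
    """Grow word groups one word at a time, keeping only groups that pass the threshold."""
    best_per_size = []
    current = {frozenset([w]) for w in seed_words}
    while current:
        scored = {g: dice(source_ids, sentences_with(g, target_sentences)) for g in current}
        kept = {g: s for g, s in scored.items() if s >= threshold}
        if not kept:
            break
        best = max(kept, key=kept.get)
        best_per_size.append((sorted(best), kept[best]))
        # Assumption: larger groups are built by adding one seed word to a surviving group.
        current = {g.union([w]) for g in kept for w in seed_words if w not in g}
    return best_per_size


# Toy target-language corpus: sentence identifier -> set of words (made-up data).
target_sentences = {
    1: {"langues", "officielles", "loi"},
    2: {"langues", "officielles"},
    3: {"langues", "programme"},
    4: {"officielles", "honneur"},
    5: {"programme"},
    6: {"officielles"},
}
source_ids = {1, 2}  # sentences whose source side contains the source collocation
result = iterative_search(source_ids, ["langues", "officielles", "programme"], target_sentences, 0.5)
for group, score in result:
    print(group, round(score, 2))
```

Because groups that fall below the threshold are never extended, only a small part of the 2^k subsets is ever scored.</Paragraph>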
  </Section>
  <Section position="6" start_page="14" end_page="14" type="metho">
    <SectionTitle>
Footnotes
</SectionTitle>
    <Paragraph position="0"> 7 The choice of the Dice coefficient is not crucial; for example, using the Jaccard coefficient or any other similarity measure that is monotonically related to the Dice coefficient would be equivalent. What is important is that the selected measure satisfy the conditions of asymmetry, insensitivity to marginal word probabilities, and convenience in testing for correlation. There are many other possible measures of association, and the general points made in this section may apply to them insofar as they also exhibit the properties we discussed. For example, the normalized chi-square measure (φ²) used in Gale and Church (1991a) shares some of the important properties of average mutual information (for example, it is completely symmetric with respect to 1-1 and 0-0 matches).
8 We are thankful to Ken Church and the AT&amp;T Bell Laboratories for providing us with a prealigned Hansards corpus.</Paragraph>
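    <Paragraph> The equivalence claimed for the Jaccard coefficient follows from the identity Jaccard = Dice / (2 - Dice), which is increasing in Dice, so both measures rank candidates identically. A minimal sketch with hypothetical counts (f_xy, f_x, f_y) for three candidate translations labeled a, b, and c:

```python
def dice(f_xy, f_x, f_y):
    """Dice coefficient from the joint and individual counts."""
    return 2.0 * f_xy / (f_x + f_y)


def jaccard(f_xy, f_x, f_y):
    """Jaccard coefficient: joint count over the size of the union."""
    return f_xy / (f_x + f_y - f_xy)


# Hypothetical (f_xy, f_x, f_y) counts for three candidate translations.
candidates = {"a": (2, 5, 5), "b": (9, 10, 12), "c": (1, 20, 3)}
by_dice = sorted(candidates, key=lambda k: dice(*candidates[k]))
by_jaccard = sorted(candidates, key=lambda k: jaccard(*candidates[k]))
print(by_dice == by_jaccard)  # True
```

A threshold on one measure would have to be mapped through the same identity to be used with the other.</Paragraph>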
  </Section>
  <Section position="7" start_page="14" end_page="14" type="metho">
    <SectionTitle>
SOURCE COLLOCATION:
</SectionTitle>
    <Paragraph position="0"> official, 492 languages, 266 The numbers indicate the frequencies of the input words in the English corpus.</Paragraph>
  </Section>
  <Section position="8" start_page="14" end_page="14" type="metho">
    <SectionTitle>
NUMBER OF SENTENCES IN COMMON: 167
</SectionTitle>
    <Paragraph position="0"> The words appear together in 167 English sentences.</Paragraph>
    <Paragraph position="1"> Champollion now gives all the candidate final translations; that is, the best translations at each stage of the iteration process. The best single word translation is thus (officielles), the best pair (officielles, langues), the best translation with 8 words (suivantes, doug, déposer, lewis, pétitions, honneur, officielles, langues). The word groups are treated as sets, with no ordering. The numbers are the associated similarity score (using the Dice coefficient) for the best translation at each iteration and the number of candidate translations that passed the threshold among the word groups considered at that iteration. There are thus 11 single words that pass the thresholds at the first iteration, 35 pairs of words, and so on.</Paragraph>
  </Section>
  <Section position="9" start_page="14" end_page="14" type="metho">
    <SectionTitle>
CANDIDATE TRANSLATIONS:
</SectionTitle>
    <Paragraph position="0">
officielles, 0.94 out of 11
officielles langues, 0.95 out of 35
honneur officielles langues, 0.45 out of 61
déposer honneur officielles langues, 0.36 out of 71
déposer pétitions honneur officielles langues, 0.34 out of 56
déposer lewis pétitions honneur officielles langues, 0.32 out of 28
doug déposer lewis pétitions honneur officielles langues, 0.32 out of 8
suivantes doug déposer lewis pétitions honneur officielles langues, 0.20 out of 1
Champollion then selects the optimal translation, which is the translation with the highest similarity score. In this case the result is correct.</Paragraph>
  </Section>
  <Section position="10" start_page="14" end_page="14" type="metho">
    <SectionTitle>
SELECTED TRANSLATION:
</SectionTitle>
    <Paragraph position="0"> officielles langues 0.951070 An example sentence in French where the selected translation is used is also shown.</Paragraph>
  </Section>
  <Section position="11" start_page="14" end_page="34" type="metho">
    <SectionTitle>
EXAMPLE SENTENCE:
</SectionTitle>
    <Paragraph position="0"> Le député n'ignore pas que le gouvernement compte présenter, avant la fin de l'année, un projet de révision de la Loi sur les langues officielles.</Paragraph>
    <Paragraph position="1"> Finally, additional information concerning word order is computed and presented. For a rigid collocation such as this one, Champollion will print for all words in the selected translation except the first one their distance from the first word. In our example, the second word (langues) appears in most cases one word before officielles, to form the compound langues officielles. Note that this information is added during postprocessing after the translation has been selected, and takes very little time to compute because of the indexing. In this case, it took a few seconds to compute this information.</Paragraph>
    <Paragraph position="2"> [...] language that satisfy the following two conditions:
1. The value of the Dice coefficient between the word and the source collocation W is at least Td, where Td is an empirically chosen threshold, and
2. The word appears in the target language opposite the source collocation at least Tf times, where Tf is another empirically chosen threshold.</Paragraph>
    <Paragraph position="3"> Words that pass these tests are collected in a set S, from which the final translation will eventually be produced. When given official languages as input (see Figure 2), this step produces a set S with the following eleven words: suivantes, doug, déposer, suprématie, lewis, pétitions, honneur, programme, mixte, officielles, and langues. The Dice threshold Td (currently set at 0.10) is the major criterion that Champollion uses to decide which words or partial collocations should be kept as candidates for the final translation of the source collocation. In Section 6 we explain why this incremental filtering process is necessary and we show that it does not significantly degrade the quality of Champollion's output. To our surprise, we found that the filtering process may even increase the quality of the proposed translation.</Paragraph>
    <Paragraph position="4"> The absolute frequency threshold Tf (currently set at 5) also helps limit the size of S, by rejecting words that appear too few times opposite the source collocation. Its most important function, however, is to remove from consideration words that appear too few times for our statistical methods to be meaningful. Applying the Dice measure (or any other statistical similarity measure) to very sparse data can produce misleading results, so we use Tf as a guide for the applicability of our method to low frequency words.</Paragraph>
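    <Paragraph> The two tests applied to each target-language word can be sketched as follows. The default values 0.10 and 5 are the ones quoted in the text; the function name, the data layout, and all counts in the example are made up for illustration.

```python
DICE_THRESHOLD = 0.10      # Td, the default value quoted in the text
FREQUENCY_THRESHOLD = 5    # Tf, the default value quoted in the text


def select_seed_words(source_count, word_stats, td=DICE_THRESHOLD, tf=FREQUENCY_THRESHOLD):
    """Keep target words that pass both the Dice and the absolute frequency thresholds.

    word_stats maps a target word to a pair:
    (sentences containing the word, sentences where it appears opposite the source collocation).
    """
    selected = []
    for word, (word_count, joint_count) in word_stats.items():
        score = 2.0 * joint_count / (source_count + word_count)
        if score >= td and joint_count >= tf:
            selected.append(word)
    return selected


# Made-up counts for a source collocation that appears in 20 sentences.
word_stats = {
    "officielles": (22, 18),   # passes both tests
    "de": (5000, 20),          # frequent everywhere, so its Dice score is too low
    "lewis": (4, 4),           # Dice is high enough, but fewer than Tf co-occurrences
}
print(select_seed_words(20, word_stats))  # ['officielles']
```

Passing td and tf as arguments mirrors the option, mentioned in the text, of replacing the default thresholds with values better suited to the corpus at hand.</Paragraph>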
    <Paragraph position="5"> It is possible to modify the thresholds Td and Tf according to properties of the database corpus and the collocations that are translated. Such an approach would use lower values of the thresholds, especially of Tf, for smaller corpora or less frequent collocations. In that case, a separate estimation phase is needed to automatically determine the values of the thresholds. The alternative we currently support is to allow the user to replace the default thresholds during the execution of Champollion with values that are more appropriate for the corpus at hand.</Paragraph>
    <Paragraph position="6"> After all words have been collected in S, the initial set of possible translations P is set equal to S, and Champollion proceeds with the next stage.</Paragraph>
    <Paragraph position="7"> Stage 2--Step 2: Scoring of possible translations. In this step, Champollion examines all members of the set P of possible translations. For each member x of P, Champollion computes the Dice coefficient between the source language collocation W and x. If the Dice coefficient is below the threshold Td, x is discarded from further consideration; otherwise, x is saved in a set P'.</Paragraph>
    <Paragraph position="8"> When given official languages as input, the first iteration of Step 2 simply sets P~ to P, the second iteration selects 35 word pairs out of the possible 110 candidates, the third iteration selects 61 word triplets, and so on until the final (ninth) iteration when none of the three elements of P passes the threshold Ta and thus P~ has no elements.</Paragraph>
    <Paragraph position="9"> Stage 2--Step 3: Identifying the locally best translation. Once the set of surviving translations P' has been computed, Champollion checks if it is empty. If it is, there cannot be any more translations to be considered, so Champollion proceeds to Step 5. If P' is not empty, Champollion locates the translation that looks locally the best; that is, among all members of P' analyzed at this iteration, the translation that has the highest Dice coefficient value with the source collocation. This translation is saved in a table C of candidate final translations, along with its length in words and its similarity score. Champollion then continues with the next step.</Paragraph>
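The selection in Step 3 amounts to an arg-max over P'. The sketch below assumes that P' is a mapping from a candidate (a tuple of words) to its Dice score and that C is a list of (translation, length, score) rows; both layouts, the function name, and the scores are illustrative choices rather than details taken from the system.

```python
def record_locally_best(p_prime, table_c):
    """Append the highest-scoring member of P' to table C.

    Returns False when P' is empty (the caller should then go to Step 5),
    True otherwise.
    """
    if not p_prime:
        return False
    best = max(p_prime, key=p_prime.get)
    table_c.append((best, len(best), p_prime[best]))
    return True


C = []
# Hypothetical scores for one iteration over word pairs.
pairs = {("officielles", "langues"): 0.95, ("honneur", "langues"): 0.31}
record_locally_best(pairs, C)
```

Storing the length next to the score keeps the information needed later, when ties in the similarity measure are broken by translation length.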
    <Paragraph position="10"> The first iteration of Step 3 on our example collocation would select the word officielles (among the 11 words in S) as the first candidate translation, with a score of 0.94. On the second iteration, the word pair (officielles, langues) is selected (out of 35 pairs that pass the threshold) with a score of 0.95. On the third run, the word triplet (honneur, officielles, langues) is selected (out of 61 triplets) with a score of 0.45. On the</Paragraph>
    <Section position="1" start_page="18" end_page="21" type="sub_section">
      <SectionTitle>
Computational Linguistics Volume 22, Number 1
5.1 Computational and Implementation Features
</SectionTitle>
      <Paragraph position="0"> Considering the size of the corpora that must be handled by Champollion, special care has been taken to minimize the number of disk accesses made during processing. We have experimented on up to two full years of the Hansards corpus, amounting to some 640,000 sentences in each language or about 220 megabytes of uncompressed text. With corpora of this magnitude, Champollion takes between one and two minutes to translate a collocation, thus enabling its practical use as a bilingual lexicography tool.</Paragraph>
      <Paragraph position="1"> To achieve efficient processing of the corpus database, Champollion is implemented in two phases: the preparation phase and the actual translation phase. The preparation phase reads in the database corpus and indexes it for fast future access using a commercial B-tree package (Informix 1990). Each word in the original corpus is associated with a set of pointers to all the sentences containing it and to the positions of the word in each of these sentences. The frequency of each word (in sentences) is also computed at this stage. Thus, all the necessary information is collected from the corpus database at this preprocessing phase with only one pass over the corpus file. At the translation phase, only the indices are accessed.</Paragraph>
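As a rough illustration of what the preparation phase collects, the sketch below builds an in-memory inverted index in one pass over a toy corpus. It is only an approximation under stated assumptions: the real system keeps its indices on disk through a B-tree package, whereas this version uses Python dictionaries, whitespace tokenization, and invented function names.

```python
from collections import defaultdict


def build_index(sentences):
    """One pass over the corpus: word -> {sentence id -> [positions]}."""
    index = defaultdict(dict)
    for sid, sentence in enumerate(sentences):
        for pos, word in enumerate(sentence.lower().split()):
            index[word].setdefault(sid, []).append(pos)
    return index


def sentence_frequency(index, word):
    """Frequency of a word counted in sentences, not in tokens."""
    return len(index.get(word, {}))


def sentences_matching(index, words):
    """Ids of the sentences that contain every word of a collocation."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()


corpus = [
    "the official languages act",
    "both official languages are used",
    "the languages of the official record",
    "the act was amended",
]
idx = build_index(corpus)
```

Once such an index exists, the sentences matching a source collocation can be found by intersecting posting sets, without rereading the corpus file.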
      <Paragraph position="2"> For the translation phase, we developed an algorithm that avoids computing the Dice coefficient for French words when the result must necessarily fall below the threshold. Using the index file on the English part of the corpus, we collect all French sentences that match the source collocation, and produce a list of all words that appear in these sentences, together with their frequency (in sentences) in this subset of the French corpus. This operation takes only a few seconds to perform, and yields a list of a few thousand French words. The list also contains the local frequency of these words (i.e., frequency within this subset of the French corpus), and is sorted by this frequency in decreasing order. We start from the top of this list and work our way downwards until we find a word that fails either of the following tests:</Paragraph>
      <Paragraph position="4"> The word's local frequency is lower than the threshold Tf.</Paragraph>
      <Paragraph position="5"> The word's local frequency is so low that we know it would be impossible for the Dice coefficient between it and the source collocation to be higher than the threshold Td.</Paragraph>
      <Paragraph position="6"> Once a word fails one of the above tests, we are guaranteed that all subsequent words in the list (with lower local frequencies) will also fail the same test. By applying these two tests and removing all closed-class words from the list, we greatly reduce the number of words that must be considered. In practice, about 90-98% of the words in the list fail to meet the two tests above, so we dramatically reduce our search space without having to perform any relatively expensive operations. For the remaining words in the list, we need to compute their Dice coefficient value so as to select the best-ranking one-word translation of the source collocation.</Paragraph>
      <Paragraph position="7"> The first of the above tests is rather obviously valid and easy to apply. For the second test, we compute an upper bound for the Dice coefficient between the word under consideration and the source collocation. Let X and Y stand for the source collocation and the French word under consideration, respectively, at some step of the loop through the word list. At this point, we know the global frequency of the source collocation (fX) and the local frequency of the candidate translation word (fXY), but not the global frequency of the candidate word (fY). We need all three quantities to compute the Dice coefficient, but while fX is computed once for all Y, and it is very efficient to compute fXY at the same time as the set of sentences matching X is identified, it is more costly to find fY even if a special access structure is maintained. So, we first check whether there is any possibility that this word correlates with the source collocation highly enough to pass the Dice threshold by assuming temporarily that the word does not appear at all outside the sentences matching the source collocation. By setting fY = fXY, we can efficiently compute the Dice coefficient between X and Y under this assumption:</Paragraph>
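Written out, and assuming the Dice coefficient has its usual form in terms of the sentence frequencies defined above, substituting the local frequency of Y for its global frequency gives the following optimistic value (the subscript a marks the value computed under the assumption; this is a reconstruction from the definitions, not a quotation):

```latex
\mathrm{Dice}(X,Y) = \frac{2 f_{XY}}{f_X + f_Y}
\qquad\Longrightarrow\qquad
\mathrm{Dice}_a(X,Y) = \frac{2 f_{XY}}{f_X + f_{XY}}
```

Because the denominator can only grow when the true global frequency of Y replaces its local frequency, the right-hand value can only overestimate the true coefficient.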
      <Paragraph position="9"> Of course, this assumption most likely won't be true. But since we know that fXY ≤ fY, it follows that Dice_a(X,Y) is never less than the true value of the Dice coefficient between X and Y (footnote 10). Comparing Dice_a(X,Y) with the Dice threshold Td will only filter out words that are guaranteed not to have a high enough Dice coefficient value independently of their overall frequency fY; thus, this is the most efficient process for this task that also guarantees correctness (footnote 11). Another possible implementation involves representing the words as integers using hashing. Then it would be possible to compute fY and the Dice coefficient in linear time. Our method, in comparison, takes O(n log n) time to sort n candidates by their local frequency fXY, but it retrieves the frequency fY and computes the Dice coefficient for a much smaller percentage of them.</Paragraph>
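The two tests and the early exit from the sorted list can be sketched as a single loop. The code below is an illustration under assumptions, not the system's implementation: the parameter global_freq is a hypothetical callback standing in for the costly lookup of the global frequency of a word, the removal of closed-class words is omitted, and the counts are invented.

```python
def dice(f_x, f_y, f_xy):
    return 2.0 * f_xy / (f_x + f_y)


def prune_and_score(f_x, local_freqs, global_freq, t_f=5, t_d=0.1):
    """Return ({word: Dice score}, number of global-frequency lookups).

    f_x         -- global frequency of the source collocation X
    local_freqs -- dict: French word -> f_xy, its frequency in the sentences
                   matching X
    global_freq -- function: word -> f_y (hypothetical stand-in for the
                   expensive index lookup)
    """
    scored = {}
    lookups = 0
    # Sort once by local frequency, highest first: O(n log n).
    for word, f_xy in sorted(local_freqs.items(), key=lambda kv: -kv[1]):
        if t_f > f_xy:
            break  # test 1: every later word is at least as rare
        if t_d > dice(f_x, f_xy, f_xy):
            break  # test 2: even the optimistic bound (f_y = f_xy) fails
        f_y = global_freq(word)   # only now pay for the real frequency
        lookups += 1
        score = dice(f_x, f_y, f_xy)
        if score >= t_d:
            scored[word] = score
    return scored, lookups


# Invented counts for illustration.
local = {"officielles": 94, "langues": 90, "honneur": 6, "rare": 4, "tiny": 2}
overall = {"officielles": 100, "langues": 300, "honneur": 900, "rare": 4, "tiny": 2}
scored, lookups = prune_and_score(100, local, overall.get)
```

Both tests are monotone in the local frequency, which is why the loop can stop at the first failure instead of inspecting the rest of the list.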
      <Paragraph position="10"> 6. Analysis of Champollion's Heuristic Filtering Stage
      In this section, we analyze the generative capacity of our algorithm. In particular, we compare it to the obvious method of exhaustively generating and testing all possible groups of k words, with k varying from 1 to some maximum length of the translation m.</Paragraph>
      <Paragraph position="11"> Our concern is whether our algorithm will actually generate all valid translations--those with final Dice coefficient above the threshold--while it is clear that the exhaustive algorithm would (footnote 12). Does the filtering process we use sometimes cause our algorithm to omit a valid translation? In other words, is there a possibility that a group of words has high similarity with the source collocation (above the threshold) and at the same time one or more of its subgroups have similarity below the threshold? In the worst case, as we show below, the answer to this question is affirmative. However, if only very few translations are missed in practice, the algorithm is indeed a good choice. In this section, we first show why the filtering we use is necessary and how it can miss valid translations, and then present the results of Monte Carlo simulation experiments (Rubenstein 1981) showing that with appropriate selection of the threshold, the algorithm misses very few translations, that this rate of failure can be reduced even more by using different thresholds at each level, and that the missed translations are in general the less interesting ones, so that the rejection of some of the valid (according to the Dice coefficient) translations most likely leads to an increase of Champollion's performance.</Paragraph>
      <Paragraph position="12">  10 And actually is a tight upper bound, realized when f_{X=0,Y=1} = 0.</Paragraph>
      <Paragraph position="13"> 11 Heuristic filtering of words with low local frequency may be more or less efficient, depending on the word, but a higher percentage of discarded words will come at the cost of inadvertently throwing out some valid words.</Paragraph>
      <Paragraph position="14"> 12 In this section we refer to missed valid translations or failures, using these terms to describe  candidate translations that are above the Dice threshold but are nevertheless rejected due to the non-exhaustive algorithm we use. These candidate translations are not necessarily correct translations from a performance perspective.</Paragraph>
      <Paragraph position="15">  Smadja, McKeown, and Hatzivassiloglou Translating Collocations for Bilingual Lexicons and with a similar derivation, for the upper bound (i ≥ 3),</Paragraph>
      <Paragraph position="17"> The sums of the bounds on the values Pi for i = 3 to m, plus the value P1 + P2 = Q + (Q choose 2), give upper and lower bounds on the total number of candidate translations generated and examined by Champollion. When the ri's are high, the actual number of candidate translations will be close to the lower bound. On the other hand, low values for the ri's (i.e., a low threshold Td) will result in the actual number of candidate translations being close to the upper bound. To estimate the average number of candidate translations examined, we make the simplifying assumption that the decisions to reject each candidate translation with i words are made independently with constant probability ri. Under these assumptions, the probability γi of generating a particular candidate translation with i words is the same for all translations with length i; the same applies to the probability λi that a translation with i words is included in the set of translations of length i that will generate the candidate translations of length i + 1. Clearly, λ1 = γ1 = γ2 = 1 and λi = ri γi for i ≥ 2. For a particular translation with i ≥ 3 words to be generated, at least one of its i subsets with i - 1 words must have survived the threshold. With our assumptions, we have</Paragraph>
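Under the independence assumption just stated, the recurrence can be written as follows. This is a reconstruction from the surrounding definitions rather than a quotation, writing gamma_i for the probability that a given i-word candidate is generated and lambda_i for the probability that it is kept to generate candidates of length i + 1: a candidate with i words is generated unless all i of its subsets with i - 1 words were excluded, each with probability 1 - lambda_{i-1}.

```latex
\gamma_i = 1 - \left(1 - \lambda_{i-1}\right)^{i} \quad (i \ge 3),
\qquad
\lambda_i = r_i \, \gamma_i \quad (i \ge 2),
\qquad
\gamma_1 = \gamma_2 = 1 .
```

Treating the i subsets as independent is itself part of the simplifying assumption, since overlapping subsets share words.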
      <Paragraph position="19"> From this recurrence equation and the boundary conditions given above we can compute the values of γi and λi for all i. Then the expected (average) number of candidate translations with i ≥ 3 words examined by Champollion will be γi (Q choose i), and the sum of these terms for i = 3 to m, plus the terms Q and (Q choose 2), gives the total complexity of our algorithm. In Table 6 we show the number of candidate translations examined by the exhaustive algorithm and the corresponding best-, worst-, and average-case behavior of Champollion for several values of Q and m, using empirical estimates of the ri's.</Paragraph>
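The average-case count can be computed numerically along the following lines. The sketch rests on several assumptions that should be read as such: r_i is taken to be the fraction of generated i-word candidates that pass the threshold, the generation probability for length i is taken to be one minus the probability that all i sub-translations were dropped, the expected count for length i is that probability times the number of i-word groups, and the pass rates used below are invented rather than the empirical estimates behind Table 6.

```python
from math import comb


def expected_candidates(q, m, r):
    """Expected number of candidates examined for each length 1..m.

    q -- number of single words that survive the first stage (Q)
    m -- maximum translation length in words
    r -- dict: length i -> r_i, taken here as the fraction of generated
         i-word candidates that pass the threshold (an assumption about r_i)
    """
    gamma = {1: 1.0, 2: 1.0}          # all words and all pairs are generated
    lam = {2: r[2] * gamma[2]}
    counts = {1: float(q), 2: float(comb(q, 2))}
    for i in range(3, m + 1):
        # generated unless all i sub-translations of length i-1 were dropped
        gamma[i] = 1.0 - (1.0 - lam[i - 1]) ** i
        lam[i] = r[i] * gamma[i]
        counts[i] = gamma[i] * comb(q, i)
    return counts


def exhaustive_candidates(q, m):
    """Number of word groups an exhaustive generate-and-test would examine."""
    return sum(comb(q, k) for k in range(1, m + 1))


# Hypothetical pass rates, chosen only to make the example run.
rates = {i: 0.1 for i in range(2, 11)}
filtered = sum(expected_candidates(50, 10, rates).values())
exhaustive = exhaustive_candidates(50, 10)
```

Comparing filtered with exhaustive for a few values of Q and m gives a feel for how quickly the exhaustive count grows relative to the filtered one.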
    </Section>
    <Section position="2" start_page="21" end_page="25" type="sub_section">
      <SectionTitle>
6.2 Effects of the Filtering Process
</SectionTitle>
      <Paragraph position="0"> We showed above that filtering is necessary to bring the number of proposed translations down to manageable levels. For any corpus of reasonable size, we can find cases where a valid translation is missed because a part of it does not pass the threshold. Let N be the size of the corpus in terms of matched sentences. Separate the N sentences into eight categories, depending on whether each of the source collocation (X) and the partial translations (i.e., A and B) appear in it. Let the counts of these sentences be n_{ABX}, n_{AB\bar{X}}, n_{A\bar{B}X}, ..., n_{\bar{A}\bar{B}\bar{X}}, where a bar indicates that the corresponding term is absent.</Paragraph>
      <Paragraph position="1"> We can then find values of the n...'s that cause the algorithm to miss a valid translation as long as the corpus contains a modest number of sentences. This happens when one or more of the parts of the final translation appear frequently in the corpus but not together with the other parts or the source collocation. This phenomenon occurs even if we are allowed to vary the Dice thresholds at each stage of the algorithm. With our current constant Dice threshold Td = 0.1, we may miss a valid translation as long as the corpus contains at least 20 sentences.</Paragraph>
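One configuration consistent with the 20-sentence figure can be checked directly. The sketch assumes the usual form of the Dice coefficient, assumes that a candidate passes when its score is at least Td, and ignores the frequency threshold Tf; the corpus itself is hypothetical.

```python
def dice(f_u, f_v, f_uv):
    return 2.0 * f_uv / (f_u + f_v)


T_D = 0.1

# Hypothetical 20-sentence corpus:
#   1 sentence contains X, A and B together;
#  19 sentences contain A alone (no B, no X).
n_abx = 1
n_a_only = 19

f_x = n_abx                 # X occurs once
f_a = n_abx + n_a_only      # A occurs 20 times
f_ab = n_abx                # the pair AB occurs once

dice_a_x = dice(f_a, f_x, n_abx)     # 2 / 21
dice_ab_x = dice(f_ab, f_x, n_abx)   # 2 / 2

a_is_filtered = T_D > dice_a_x       # A alone falls below the threshold
ab_is_valid = dice_ab_x >= T_D       # the pair AB would have passed
```

With one sentence fewer containing A alone, the score of A would be exactly 0.1 and A would no longer be filtered under these assumptions, which matches the stated minimum of 20 sentences.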
      <Paragraph position="2">  While our algorithm will necessarily miss some valid translations, this is a worst case scenario. To study the average-case behavior of our algorithm, we simulated its performance with randomly selected points with integer non-negative coordinates (n_{ABX}, n_{AB\bar{X}}, n_{A\bar{B}X}, n_{A\bar{B}\bar{X}}, n_{\bar{A}BX}, n_{\bar{A}B\bar{X}}, n_{\bar{A}\bar{B}X}) from the hyperplane defined by the equation n_{ABX} + n_{AB\bar{X}} + n_{A\bar{B}X} + n_{A\bar{B}\bar{X}} + n_{\bar{A}BX} + n_{\bar{A}B\bar{X}} + n_{\bar{A}\bar{B}X} = N_0, where N_0 is the number of "interesting" sentences in the corpus for the translation under consideration, that is, the number of sentences that contain at least one of X, A, or B (footnote 13). Sampling from this six-dimensional polytope in seven-dimensional space is not easy. We accomplish it by constructing a mapping from the uniform distribution to each allowed value for the n...'s, using combinatorial methods. For example, for N_0 = 50, there are 3,478,761 different points with n_{ABX} = 0 but only one with n_{ABX} = 50. Using the above method, we sampled 20,000 points for each of several values for N_0 (N_0 = 50, 100, 500, and 1000). The results of the simulation were very similar for the different values of N_0, with no apparent pattern emerging as N_0 increased. Therefore, in the following we give averages over the values of N_0 tried.</Paragraph>
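The counting argument, and one way of drawing such points uniformly, can be sketched as follows. The sampler uses the standard stars-and-bars construction, which is a substitute chosen for brevity and not necessarily the mapping used in the experiments; the function names are hypothetical.

```python
import random
from math import comb


def count_points(n0, k=7):
    """Number of k-tuples of non-negative integers that sum to n0."""
    return comb(n0 + k - 1, k - 1)


def sample_point(n0, k=7, rng=random):
    """Draw one k-tuple of non-negative integers summing to n0, uniformly.

    Stars and bars: choose k-1 bar positions among n0 + k - 1 slots; the
    runs of stars between consecutive bars are the coordinates.
    """
    bars = sorted(rng.sample(range(n0 + k - 1), k - 1))
    edges = [-1] + bars + [n0 + k - 1]
    return tuple(edges[j + 1] - edges[j] - 1 for j in range(k))


rng = random.Random(0)
point = sample_point(50, rng=rng)
```

Fixing the first coordinate at 0 leaves six coordinates that must sum to 50, and count_points(50, 6) reproduces the figure of 3,478,761 quoted above; fixing it at 50 leaves a single point.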
      <Paragraph position="3"> We first measured the percentage of missed valid translations when either A or B, or both, do not pass the threshold but AB should, for different values of the threshold parameter (solid line in Figure 3). We observed that for low values of the threshold, less than 1% of the valid translations are missed; for example, for the threshold value of 0.10 we currently use, the error rate is 0.74%. However, as the threshold increases, the rate of failure can become unacceptable.</Paragraph>
      <Paragraph position="4"> 13 Note that the number of sentences that do not contain any of X, A, or B does not enter any of the Dice coefficients computed by Champollion and consequently does not affect the algorithm's decisions. As discussed in Section 4, this gives a definite advantage to the Dice method over other measures of similarity.
      Figure 3: Failure rate of the translation algorithm with constant and increasing thresholds. The case α = 1 (solid line) represents the basic algorithm with no threshold changes.</Paragraph>
      <Paragraph position="5"> A higher value for the threshold has two advantages: First, it offers higher selectivity, allowing fewer false positives (proposed translations that are not considered accurate by the human judges). Second, it speeds up the execution of the algorithm, as all the fractions ri decrease and the overall number of candidate translations is reduced. However, as Figure 3 shows, high values of the threshold parameter cause the algorithm to miss a significant percentage of valid translations. Intuitively, we expect this problem to be alleviated if a higher threshold value is used for the final admittance of a translation, but a lower threshold is used internally when the subparts of the translation are considered. Our second simulation experiment tested this expectation for various values of the final threshold using a lower initial threshold equal to a constant α &lt; 1 times the final threshold. The results are represented by the remaining curves of Figure 3. Surprisingly, we found that with moderate values of α (close to 1) this method gives a very low failure rate even for high final threshold values, and is preferable to using a constant but lower threshold just to reduce the failure rate.</Paragraph>
      <Paragraph position="6"> For example, running the algorithm at an initial threshold of 0.3 and a final threshold of 0.6 gives a failure rate of 0.45%, much less than the failure rate of 6.59% which corresponds to a constant threshold of 0.3 for both stages (footnote 14). The above analyses show that the algorithm fails quite rarely when the threshold is low, and its performance can be improved with a sequence of increasing thresholds.</Paragraph>
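A small Monte Carlo experiment in the spirit of the two simulations can be sketched as below. It is a sketch under assumptions, not a replication: the sampler is the stars-and-bars construction, the Dice coefficient is assumed to have its usual form, a failure is counted when AB reaches the final threshold but A or B falls below the initial one, and the resulting rates need not match the percentages reported above.

```python
import random


def sample_point(n0, rng, k=7):
    """Uniform k-tuple of non-negative integers summing to n0 (stars and bars)."""
    bars = sorted(rng.sample(range(n0 + k - 1), k - 1))
    edges = [-1] + bars + [n0 + k - 1]
    return tuple(edges[j + 1] - edges[j] - 1 for j in range(k))


def dice(f_u, f_v, f_uv):
    return 2.0 * f_uv / (f_u + f_v) if f_u + f_v else 0.0


def failure_rate(points, final_t, alpha=1.0):
    """Share of valid pairs AB (Dice at least final_t) that are never generated
    because A or B falls below the initial threshold alpha * final_t."""
    valid = missed = 0
    initial_t = alpha * final_t
    # Each name lists the terms present in that class of sentences.
    for n_abx, n_ab, n_ax, n_a, n_bx, n_b, n_x in points:
        f_x = n_abx + n_ax + n_bx + n_x
        f_a = n_abx + n_ab + n_ax + n_a
        f_b = n_abx + n_ab + n_bx + n_b
        f_ab = n_abx + n_ab
        if final_t > dice(f_ab, f_x, n_abx):
            continue                      # AB is not a valid translation
        valid += 1
        d_a = dice(f_a, f_x, n_abx + n_ax)
        d_b = dice(f_b, f_x, n_abx + n_bx)
        if initial_t > d_a or initial_t > d_b:
            missed += 1
    return missed / valid if valid else 0.0


rng = random.Random(1)
sample = [sample_point(100, rng) for _ in range(20000)]
constant = failure_rate(sample, 0.6, alpha=1.0)
relaxed = failure_rate(sample, 0.6, alpha=0.5)
```

Because both rates are computed on the same sample, lowering the initial threshold can only remove failures, so relaxed is never larger than constant.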
      <Paragraph position="7"> We also studied cases where the algorithm does fail. For this purpose, we stratified
      14 The curves in Figure 3 become noticeably less smooth for values of the final threshold that are greater than 0.8. This happens for all settings of α in Figure 3. This apparently different behavior for high threshold values can be traced to sampling issues. Since few of the 20,000 points in each sample meet the criterion of having Dice(AB, X) greater than or equal to the threshold for high final threshold values, the estimate of the percentage of failures is more susceptible to random variation in such cases. Furthermore, since the same sample (for a given N_0) is used for all values of α, any such random variation due to small sample size will be replicated in all curves of Figure 3.</Paragraph>
      <Paragraph position="8">  prendre ... décision prendre ... mesures and year, taken from the aligned Hansards. Table 8 illustrates the range of translations which Champollion produces. Flexible collocations are shown with ellipsis points (...) indicating where additional, variable words could appear. These examples show cases where a two word collocation is translated as one word (e.g., health insurance), a two word collocation is translated as three words (e.g., employment equity), and how words can be inverted in the translation (e.g., additional costs). In this section, we discuss the design of the separate tests and our evaluation methodology, and present the results of our evaluation.</Paragraph>
    </Section>
    <Section position="3" start_page="25" end_page="34" type="sub_section">
      <SectionTitle>
7.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> We carried out three tests with Champollion using two database corpora and three sets of source collocations. The first database corpus (DB1) consists of 8 months of Hansards aligned data taken from 1986 (16 megabytes, 3.5 million words) and the second database corpus (DB2) consists of all of the 1986 and 1987 transcripts of the Canadian Parliament (a total of approximately 45 megabytes and 8.5 million words).</Paragraph>
      <Paragraph position="1"> For the first corpus (DB1), we ran XTRACT and obtained a set of approximately 3,000 collocations from which we randomly selected a subset of 300 for manual evaluation purposes. The 300 collocations were selected from among the collocations of mid-range frequency--collocations appearing more than 10 times in the corpus. We call this first set of source collocations C1. The second set (C2) is a set of 300 collocations similarly selected from the set of approximately 5,000 collocations identified by XTRACT on all data from 1987. The third set of collocations (C3) consists of 300 collocations selected
      8. Applications
      A bilingual lexicon of collocations has a variety of potential uses. The most obvious are machine translation and machine-assisted human translation, but other multilingual applications, including information retrieval, summarization, and computational lexicography, also require access to bilingual lexicons.</Paragraph>
      <Paragraph position="2"> While some researchers are attempting machine translation through purely statistical techniques, the more common approach is to use some hybrid of interlingual and transfer techniques. These symbolic machine translation systems must have access to a bilingual lexicon and the ability to construct one semi-automatically would ease the development of such systems. Champollion is particularly promising for this purpose for two reasons. First, it constructs translations for multiword collocations. Collocations are known to be opaque; that is, their meaning often derives from the combination of the words and not from the meaning of the individual words themselves. As a result, translation of collocations cannot be done on a word-by-word basis, and some representation of collocations in both languages is needed if the system is to translate fluently. Second, collocations are domain dependent. Particularly in technical domains, the collocations differ from those in general use. Accordingly, the ability to automatically discover collocations for a given domain by using a new corpus as input to Champollion would ease the work required to transfer an MT system to a new domain.</Paragraph>
      <Paragraph position="3"> Multilingual systems are now being developed in addition to pure machine translation systems. These systems also need access to bilingual phrases. We are currently developing a multilingual summarization system, in which we will use the results from Champollion. An early version of this system (McKeown and Radev 1995) produces short summaries of multiple news articles covering the same event using as input the templates produced by information extraction systems developed under the ARPA message understanding program. Since some information extraction systems, such as General Electric's NLToolset (Jacobs and Rau 1990), already produce similar representations for Japanese and English news articles, the addition of an English summary generator will automatically allow for English summarization of Japanese.</Paragraph>
      <Paragraph position="4"> In addition, we are planning to add a second language for the summaries. While the output is not a direct translation of input articles, collocations that appear frequently in the news articles will also appear in summaries. Thus, a list of bilingual collocations would be useful for the summarization process.</Paragraph>
      <Paragraph position="5"> Information retrieval is another prospective application. As shown in Maarek and Smadja (1989) and more recently in Broglio et al. (1995), the precision of information retrieval systems can be improved through the use of collocations in addition to the more traditional single word indexing units. A collocation gives the context in which a given word was used, which will help retrieve documents using the word with the same sense and thus improve precision. The well-known New Mexico example in information retrieval describes an oft-encountered problem when single word searches are employed: searching for new and Mexico independently will retrieve a multitude of documents that do not relate to New Mexico. Automatically identifying and explicitly using collocations such as New Mexico at search or indexing time can help solve this problem. We have licensed XTRACT to several sites that are using it to improve the accuracy of their retrieval or text categorization systems.</Paragraph>
      <Paragraph position="6"> A bilingual list of collocations could be used for the development of a multilingual information retrieval system. In cases where the database of texts includes documents written in multiple languages, the search query need only be expressed in one language. The bilingual collocations could be used to translate the query (particularly
      protection de l'environnement
      taxe de vente fédérale
      Tools for the target language. Tools in French, such as a morphological analyzer, a tagger, a list of acronyms, a robust parser, and various lists of tagged words, would be most helpful and would allow us to improve our results. For example, a tagger for French would allow us to run XTRACT on the French part of the corpus, and thus to translate from either French or English as input. In addition, running XTRACT on the French part of the corpus would allow for independent confirmation of the proposed translations, which should be French collocations. Similarly, a morphological analyzer would allow us to produce richer results, since several forms of the same word would be conflated, increasing both the expected and the actual frequencies of the co-occurrence events; this has been found empirically to have a positive effect in overall performance in other problems (Hatzivassiloglou in press). Note that ignoring inflectional distinctions can sometimes have a detrimental effect if only particular forms of a word participate in a given collocation. Consequently, it might be beneficial to take into account both the distribution of the base form and the differences between the distributions of the various inflected forms.</Paragraph>
      <Paragraph position="7"> In the current implementation of Champollion, we were restricted to using tools for only one of the two languages, since at the time of implementation tools for French were not readily available. However, from the above discussion it is clear that certain tools would improve the system's performance.</Paragraph>
      <Paragraph position="8"> Separating corpus-dependent translations from general ones. Champollion identifies translations for the source collocations using the aligned corpora database as its entire knowledge of the two languages. Consequently, sometimes the results are specific to the domain and seem peculiar when viewed in a more general context. For example, we have already mentioned that Mr. Speaker was translated as Monsieur le Président, which is obviously only valid for this domain. Canadian family is another example; it is often translated as famille (the Canadian qualifier is dropped in the French version). This is an important feature of the system, since in this way the sublanguage of the domain is employed for the translation. However, many of the collocations that Champollion identifies are general, domain-independent ones. Champollion cannot make any distinction between domain-specific and general collocations. What is clearly needed is a way to determine the generality of each produced translation, as many translations found by Champollion are of general use and could be directly applied to other domains. This may be possible by intersecting the output of Champollion on corpora from many different domains.</Paragraph>
      <Paragraph position="9"> Handling low frequency collocations. The statistics we used do not produce good results when the frequencies are low. This shows up clearly when our evaluation results on the first two experiments are compared. Running the collocation set C2 over the database DB1 produced our worst results, and this can be attributed to the low frequency in DB1 of many collocations in C2. Recall that C2 was extracted from a different (and larger) corpus from DB1. This problem is due not only to the frequencies of the source collocations or of the words involved but also to the frequencies of their "official" translations. Indeed, while most collocations exhibit unique senses in a given domain, sometimes a source collocation appearing multiple times in the corpus is not consistently translated into the same target collocation in the database. This sampling problem, which generally affects all statistical approaches, was not addressed in the paper. We reduced the effects of low frequencies by purposefully limiting ourselves to source collocations of frequencies higher than 10, containing individual words with frequencies higher than 15.</Paragraph>
      <Paragraph position="10"> Analysis of the effects of our thresholds. Various thresholds are used in Champollion's algorithm to reduce the search space. A threshold too low would significantly slow down the search as, according to Zipf's law (Zipf 1949), the number of terms occurring n times in a general English corpus is a decreasing function of n^2. Unfortunately, sometimes this filtering step causes Champollion to miss a valid translation. For example, one of the incorrect translations made by Champollion is that important factor was translated into facteur (factor) alone instead of the proper translation facteur important. The error is due to the fact that the French word important did not pass the first step of the algorithm as its Dice coefficient with important factor was too low. Important occurs a total of 858 times in the French part of the corpus and only 8 times in the right context, whereas a minimum of 10 appearances is required to pass this step.</Paragraph>
      <Paragraph position="11"> Although the theoretical analysis and simulation experiments of Section 6.2 show that such cases of missing the correct translation are rare, more work needs to be done in quantifying this phenomenon. In particular, experiments with actual corpus data should supplement the theoretical results (based on uniform distributions). Furthermore, more experimentation with the values of the thresholds needs to be done, to locate the optimum trade-off point between efficiency and accuracy. An additional direction for future experiments is to vary the thresholds (and especially the frequency threshold Tf) according to the size of the database corpus and the frequency of the collocation being translated.</Paragraph>
      <Paragraph position="12"> Incorporating the length of the translation into the score. Currently our scoring method only uses the lengths of candidate translations to break a tie in the similarity measure. It seems, however, that longer translations should get a "bonus." For example, using our scoring technique the correlation of the collocation official languages with the French word officielles is equal to 0.94 and the correlation with the French collocation langues officielles is 0.95. Our scoring only uses the relative frequencies of the events without taking into account that some of these events are composed of multiple single events.</Paragraph>
      <Paragraph position="13"> We plan to refine our scoring method so that the length (number of words involved) of the events is taken into account.</Paragraph>
      <Paragraph position="14"> Using nonparallel corpora. Champollion requires an aligned bilingual corpus as input.</Paragraph>
      <Paragraph position="15"> However, finding bilingual corpora can be problematic in some domains. Although organizations such as the United Nations, the European Community, and governments of countries with several official languages are big producers, such corpora are still difficult to obtain for research purposes. While aligned bilingual corpora will become more available in the future, it would be helpful if we could relax the constraint for aligned data. Bilingual corpora in the same domain, which are not necessarily translations of each other, are more easily available. For example, news agencies such as the Associated Press and Reuters publish in several languages. News stories often relate similar facts but they are not direct translations of one another. Even though the stories probably use equivalent terminology, totally different techniques would be necessary to be able to use such "nonalignable" corpora as databases. Ultimately, such techniques would be more useful than those currently used, because they would be able to extract knowledge from noisy data. While this is definitely a large research problem, our research team at Columbia University has begun work in this area (Fung and McKeown 1994) that shows promise for noisy parallel corpora (in which the target corpus may contain either additional or deleted paragraphs and where the languages themselves do not involve neat sentence-by-sentence translations). Bilingual word correspondences extracted from nonparallel corpora with techniques such as those proposed by Fung (1995a) also look promising.</Paragraph>
      <Paragraph position="16"> 10. Conclusion We have presented a method for translating collocations, implemented in Champollion.</Paragraph>
      <Paragraph position="17"> The ability to provide translations for collocations is important for three main reasons. First, because they are opaque constructions, they cannot be translated on a word-by-word basis. Instead, translations must be provided for the phrase as a whole. Second, collocations are domain dependent. Each domain includes a variety of phrases that have specific meanings and translations that apply only in the given domain. Finally, a quick look at a bilingual dictionary, even for two widely studied languages such as English and French, shows that correspondences between collocations in two languages are largely unexplored. Thus, the ability to compile a set of translations for a new domain automatically will ultimately increase the portability of machine translation systems. By applying Champollion to a corpus in a new domain, translations for the domain-specific collocations can be automatically compiled and inaccurate results filtered by a native speaker of the target language.</Paragraph>
      <Paragraph position="18"> The output of our system is a bilingual list of collocations that can be used in a variety of multilingual applications. It is directly applicable to machine translation systems that use a transfer approach, since such systems rely on correspondences between words and phrases of the source and target languages. For interlingua systems, identification of collocations and their translations provides a means of augmenting the interlingua. Since such phrases cannot be translated compositionally, they indicate where concepts representing such phrases must be added to the interlingua. Such bilingual phrases are also useful for other multilingual tasks, including information retrieval of multilingual documents given a phrase in one language, summarization in one language of texts in another, and multilingual generation.</Paragraph>
      <Paragraph position="19"> Finally, we have carried out three evaluations of the system on three separate years of the Hansards corpus. These evaluations indicate that Champollion has a high rate of accuracy: in the best case, 78% of the French translations of valid English collocations were judged to be good. This is a good score in comparison with evaluations carried out on full machine translation systems. We conjecture that by using statistical techniques to translate a particular type of construction, known to be easily observable in language, we can achieve better results than by applying the same technique to all constructions uniformly.</Paragraph>
      <Paragraph position="20"> Our work is part of a paradigm of research that focuses on the development of tools using statistical analysis of text corpora. This line of research aims at producing tools that satisfactorily handle relatively simple tasks. These tools can then be used by other systems to address more complex tasks. For example, previous work has addressed low-level tasks such as tagging a free-style corpus with part-of-speech information (Church 1988), aligning a bilingual corpus (Gale and Church 1991b; Brown, Lai, and Mercer 1991), and producing a list of collocations (Smadja 1993). While each of these tools is based on simple statistics and tackles elementary tasks, we have demonstrated with our work on Champollion that by combining them, one can reach new levels of complexity in the automatic treatment of natural languages.</Paragraph>
    </Section>
  </Section>
</Paper>