<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0119">
  <Title>Finding Terminology Translations from Non-parallel Corpora</Title>
  <Section position="5" start_page="194" end_page="195" type="metho">
    <SectionTitle>
5 A word in relation to seed words
</SectionTitle>
    <Paragraph position="0"> Word correlations are important statistical information which has been successfully employed to find bilingual word pairs from parallel corpora. Word correlations W(ws, wt) are computed from general likelihood scores based on the co-occurrence of words in common segments. Segments are either sentences, paragraphs, or string groups delimited by anchor points:</Paragraph>
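    <Paragraph position="1"> (A standard reconstruction of the contingency counts underlying these scores; the exact symbols are assumed, not taken from the original.) For a seed word $w_s$ and an unknown word $w_t$ over $N$ common segments:
\[
\begin{aligned}
a &= \#\{\text{segments in which both } w_s \text{ and } w_t \text{ occur}\}\\
b &= \#\{\text{segments in which } w_s \text{ occurs but } w_t \text{ does not}\}\\
c &= \#\{\text{segments in which } w_t \text{ occurs but } w_s \text{ does not}\}\\
d &= \#\{\text{segments in which neither occurs}\}
\end{aligned}
\]
so that $\Pr(w_s = 1) = (a+b)/N$, $\Pr(w_t = 1) = (a+c)/N$, and $\Pr(w_s = 1, w_t = 1) = a/N$, where $N = a+b+c+d$.</Paragraph>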
    <Paragraph position="2"> All correlation measures use the above likelihood scores in different formulations. In our Word Relation Matrix (WoRM) representation, we use the correlation measure W(ws, wt) between a seed word ws and an unknown word wt. a, b, c and d are computed from the segments in the monolingual texts of the non-parallel corpus.</Paragraph>
    <Paragraph position="3"> W(ws, wt) is the weighted mutual information in our algorithm, since it is most suitable for lexicon compilation of mid-frequency technical words or terms:</Paragraph>
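    <Paragraph position="4"> (Reconstructed in the standard form of weighted mutual information, using the probabilities defined above.)
\[
W(w_s, w_t) \;=\; \Pr(w_s = 1, w_t = 1)\,\log_2 \frac{\Pr(w_s = 1, w_t = 1)}{\Pr(w_s = 1)\,\Pr(w_t = 1)}
\]
</Paragraph>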
    <Paragraph position="5"> Given n seed words (ws1, ws2, ..., wsn), we thus obtain a Word Relation Matrix for wt: (W(wt, ws1), W(wt, ws2), ..., W(wt, wsn)). As an initial step, all Pr(ws = 1) are pre-computed for the seed words in both languages. We have experimented with various segment sizes, ranging from phrases delimited by punctuation, through sentences, to entire paragraphs.</Paragraph>
    <Paragraph position="6"> From our experimental results, we conclude that the right segment size is a function of the frequency of the seed words: segment size ∝ 1 / frequency(ws). If the seed words are frequent, and if the segment size is as large as a paragraph, then these frequent seed words could occur in every single segment. In this case, the chance of co-occurrence between such seed words and any new word is very high, close to one. With large segments, such seed words are too biasing, and thus a smaller segment size must be used. Conversely, we need a larger segment size if seed word frequency is low.</Paragraph>
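    <Paragraph> A sketch in Python of how this heuristic might be applied when segmenting a corpus; the threshold value and function name are illustrative assumptions, not from the paper:
```python
import re

def segment_corpus(text, seed_frequency, threshold=1000):
    """Pick segment granularity inversely to seed word frequency:
    frequent seeds get small punctuation-delimited segments,
    rarer seeds get whole paragraphs."""
    if seed_frequency > threshold:
        # small segments: split on any punctuation mark
        return [s.split() for s in re.split(r"[.,;:!?]", text) if s.strip()]
    # large segments: blank-line delimited paragraphs
    return [p.split() for p in text.split("\n\n") if p.strip()]
```
</Paragraph>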
    <Paragraph position="7"> Consequently, we use the paragraph as the segment size for our experiment on the Wall Street Journal/Nikkei corpus, since all the seed words are mid-frequency content words. We computed binary vectors for all 1,416 seed words ws, where the i-th dimension of the vector is 1 if the seed word occurs in the i-th paragraph of the text, and zero otherwise.</Paragraph>
    <Paragraph position="8"> We use a smaller segment size - between any two punctuation marks - for the Wall Street Journal English/English corpus, since many of the seed words are frequent. Next, Pr(wt = 1) is computed for all unknown words wt in both texts. The WoRM vectors are then sorted according to W(wt, wsi); the most correlated seed word wsi will have the top score. As an example, using 307 seed word pairs in the WSJ/WSJ corpus, we obtain the most correlated seed words with debentures in two different years of the Wall Street Journal, as shown in Figure 2. In both texts, the same set of words correlates closely with debentures.</Paragraph>
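    <Paragraph> A minimal Python sketch of the WoRM computation, assuming segments are already tokenized into lists of words; the helper name worm_vector and the data layout are illustrative, not from the paper:
```python
import math
from collections import defaultdict

def worm_vector(word, seed_words, segments):
    """Compute the WoRM vector of `word`: its weighted mutual
    information with each seed word, from co-occurrence in common
    segments (paragraphs or punctuation-delimited phrases)."""
    n = len(segments)
    # For each vocabulary item, the set of segment indices it occurs in
    # (equivalent to the binary occurrence vectors described above).
    occurs_in = defaultdict(set)
    for i, seg in enumerate(segments):
        for w in set(seg):
            occurs_in[w].add(i)

    x = occurs_in[word]
    p_x = len(x) / n                        # Pr(wt = 1)
    vector = []
    for ws in seed_words:
        s = occurs_in[ws]
        p_s = len(s) / n                    # Pr(ws = 1)
        p_joint = len(x & s) / n            # Pr(wt = 1, ws = 1)
        if p_joint == 0.0:
            vector.append(0.0)              # no co-occurrence: score 0
        else:
            # weighted mutual information W(wt, ws)
            vector.append(p_joint * math.log2(p_joint / (p_x * p_s)))
    return vector
```
</Paragraph>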
    <Paragraph position="9"> WoRM plots of debentures and administration are shown in Figures 3 and 4 respectively. The horizontal axis has 307 points representing the seed words; the vertical axis gives the correlation scores between these 307 seed words and our example words. These figures show that WoRMs of the same word are similar to each other, while WoRMs of different words are different.</Paragraph>
  </Section>
  <Section position="6" start_page="195" end_page="199" type="metho">
    <SectionTitle>
6 Matching Word Relation Matrices
</SectionTitle>
    <Paragraph position="0"> When all unknown words are represented as WoRMs, a matching function is needed to find the best WoRM pairs as bilingual lexicon entries. There are many metrics we can use to measure the closeness of two WoRMs.</Paragraph>
    <Paragraph position="1"> When matching vectors are very similar such as those in the WSJ English/English corpus, a simple metric like the Euclidean Distance could be used to find those matching pairs:</Paragraph>
    <Paragraph position="3"> (Reconstructed in standard form.) With x and y the WoRM vectors of the two words over the n seed word pairs:
\[
\mathrm{Dist}(w_s, w_t) \;=\; \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]
</Paragraph>
    <Paragraph position="5"> For noisier matches, such as those across the WSJ/Nikkei corpus, we use the Cosine Measure (reconstructed in standard form):
\[
\cos(w_s, w_t) \;=\; \frac{\sum_{i=1}^{n} x_i\, y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
\]
</Paragraph>
    <Paragraph position="7"> The Cosine Measure gives the highest value to vector pairs which share the most dimensions with non-zero values. Therefore, it favors word pairs which share the largest number of closely related seed words.</Paragraph>
    <Paragraph position="8"> However, the Cosine Measure is also directly proportional to another parameter, namely the magnitudes of the products x_i y_i. Consequently, if ws has high values everywhere, then the Cosine Measure between any wt and this ws would be high. This violates our assumptions, in that although ws and wt might not correlate closely with the same set of seed words, the matching score would nevertheless be high. This is another supporting reason for choosing mid-frequency content words as seed words.</Paragraph>
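    <Paragraph> A minimal Python sketch of the two matching functions and the candidate ranking they support; best_candidates and its parameters are illustrative assumptions, not from the paper:
```python
import math

def euclidean_distance(x, y):
    """Distance between two WoRM vectors; smaller is a better match.
    Suitable when the vectors are very similar (English/English)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_measure(x, y):
    """Cosine of the angle between two WoRM vectors; larger is a
    better match. Favors pairs sharing many closely related seeds."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = (math.sqrt(sum(xi * xi for xi in x))
            * math.sqrt(sum(yi * yi for yi in y)))
    return dot / norm if norm else 0.0

def best_candidates(source_vec, candidate_vecs, k=20, use_cosine=True):
    """Rank candidate WoRM vectors against a source WoRM vector and
    return the indices of the top-k matches."""
    if use_cosine:
        scored = sorted(enumerate(candidate_vecs),
                        key=lambda p: cosine_measure(source_vec, p[1]),
                        reverse=True)
    else:
        scored = sorted(enumerate(candidate_vecs),
                        key=lambda p: euclidean_distance(source_vec, p[1]))
    return [i for i, _ in scored[:k]]
```
</Paragraph>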
    <Paragraph position="9"> 7 Evaluation 1: Matching English words to English
The evaluation on the WSJ/WSJ English/English corpus is intended as a pilot test of the discriminative power of the Word Relation Matrix. This non-parallel corpus has minimal content and style differences. Furthermore, using such an English/English test set, the output can be evaluated automatically: a translated pair is considered correct if the two are identical English words. 307 seed words are chosen according to their occurrence frequency (400-3900) to minimize the number of function words. However, a frequency of 3900 in a corpus of 1.5M words is quite high. As a result, a segment delimited by two punctuation marks is used as the context window. Furthermore, the frequent nature of the seed words led to our choice of the Euclidean Distance instead of the Cosine Measure. The choices of segment size, seed words, and Euclidean Distance measure are all direct consequences of the atypical nature of the English/English pilot test set.</Paragraph>
    <Paragraph position="10"> We selected a test set of 582 (set A) by 687 (set B) single words with mid-range frequencies from the WSJ texts. We computed the WoRM feature for each of these test words and the Euclidean Distance between every word pair across the two sets. We then calculated accuracy by counting the number of words whose top candidate is the word itself, obtaining a precision of 29%. Allowing the top N candidates, accuracy improves as shown in the graph for the 582-word output in Figure 5 (i.e., a translation is correct if it appears among the first N candidates). If we accept the correct translation anywhere among the top 100 candidates, precision rises to around 58%.</Paragraph>
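    <Paragraph> A minimal Python sketch of this top-N evaluation under the same assumptions as the sketches above; the function name and dictionary layout are illustrative:
```python
import math

def precision_at_n(test_words, vectors_a, vectors_b, n=1):
    """Fraction of test words whose correct counterpart ranks among
    the top-n candidates by Euclidean Distance. vectors_a / vectors_b
    map each word to its WoRM vector from the two corpus years."""
    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    correct = 0
    for w in test_words:
        # rank all candidates in set B by distance to w's vector in set A
        ranked = sorted(vectors_b,
                        key=lambda c: dist(vectors_a[w], vectors_b[c]))
        if w in ranked[:n]:
            correct += 1
    return correct / len(test_words)
```
</Paragraph>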
    <Paragraph position="11"> N-top candidates are useful as translator aids.</Paragraph>
    <Paragraph position="12"> Meanwhile, precision for translating less polysemous content words is higher. If only the 445 content words (manually selected) are kept from the 582-word set, the precision at different top-N candidate levels is higher, as shown in Figure 5 by the dotted line. We believe the accuracy would be even higher if we looked only at truly unambiguous test words, such as entire technical terms. It is well known that polysemous words usually have only one sense when used as part of a collocation or technical term (Yarowsky 1993).</Paragraph>
    <Paragraph position="13"> 8 Evaluation 2: Matching Japanese terms to English
Evaluations are also carried out on the Wall Street Journal and Nikkei Financial News corpus, matching technical terms in Japanese to their counterparts in English. This evaluation is a difficult test case because (1) the two languages, English and Japanese, belong to different language groups; (2) the two texts, Wall Street Journal and Nikkei Financial News, do not focus on the same topics; and (3) the two texts are not written by the same authors.</Paragraph>
    <Paragraph position="14"> 1,416 entries from the Japanese/English online dictionary EDICT with occurrence frequencies between 100 and 1000 are chosen as seed words. Since these seed words have relatively low frequencies compared to the corpus size of around 7 million words for the WSJ text, we chose the segment size to be that of an entire paragraph. For the same reason, the Cosine Measure is chosen as a matching function.</Paragraph>
    <Paragraph position="15"> For evaluation, we need a test set of known technical term translations. We hand-translated a selected set of technical terms from the Nikkei Financial News corpus and looked them up in the Wall Street Journal text. Among these, 19 terms, shown in Figure 6, have counterparts in the WSJ text.</Paragraph>
    <Paragraph position="16"> Three evaluations were carried out. In all cases, a translation is counted as correct if the top candidate is the right one. Test I tries to find the correct translation for each of the nineteen Japanese terms among the nineteen English terms. To increase the number of candidates, Test II is carried out on the 19 Japanese terms with their English counterparts plus 293 other English terms, giving a total of 312 possible English candidates. The third test set, Test III, consists of the nineteen Japanese terms paired with their translations plus 383 additional single English words. The accuracies for the three test sets are shown in Figure 7; precision ranges from 21.1% to 52.6%.</Paragraph>
    <Paragraph position="17"> Figure 8 shows the ranking of the true translations among all the candidates for all 19 cases, for the purpose of a translator aid. Most of the correct translations can be found among the top 20 candidates.</Paragraph>
    <Section position="1" start_page="197" end_page="199" type="sub_section">
      <SectionTitle>
8.1 Translator-aid results
</SectionTitle>
      <Paragraph position="0"> The previous two evaluations show that the precision of best-candidate translation using our algorithm is around 30% on average. While it is far from ideal, this is the first result of terminology translation from non-parallel corpora. Meanwhile, we have found that the correct translation is often among the top 20 candidates. This leads us to conjecture that the output from this algorithm can be used as a translator-aid.</Paragraph>
      <Paragraph position="1"> To evaluate this, we again chose the nineteen English/Japanese terms from the WSJ/Nikkei non-parallel corpus as a test set. We chose three evaluators, all native Chinese speakers with bilingual knowledge of English and Chinese. Chinese speakers are able to recognize most Japanese technical terms since they are very similar to Chinese. We asked the evaluators to translate the nineteen Japanese terms into English without using dictionaries or other reference material. The translators have some general knowledge of international news; however, none of them specializes in economics or finance, which is the domain of the WSJ/Nikkei corpus. Their output is in SET A. Our system then proposes two sets of outputs: (1) for each Japanese term, our system proposes the top-20 candidates from the set of 312 noun phrases; using this candidate list, the translators translate the nineteen terms again, and their output based on this information is in SET B; (2) for each Japanese term, our system proposes the top-20 candidates from the set containing the 383 single words plus the nineteen terms; the result of human translation based on this candidate list is in SET C. Sets A, B, and C are all compared to the original translations in the corpus. If a translation is the same as in the corpus, it is judged correct. The results are shown in Figure 9. Evaluators on average are able to translate 8 of the 19 terms by themselves, whereas they can translate 18 terms on average with the aid of our output. Translation precision increases on average by 50.9%.</Paragraph>
    </Section>
  </Section>
</Paper>