<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1067">
  <Title>Automatic Identification of Word Translations from Unrelated English and German Corpora</Title>
  <Section position="4" start_page="519" end_page="520" type="metho">
    <SectionTitle>
2 Approach
</SectionTitle>
    <Paragraph position="0"> As mentioned above, it is assumed that across languages there is a correlation between the co-occurrences of words that are translations of each other. If - for example - in a text of one language two words A and B co-occur more often than expected by chance, then in a text of another language those words that are translations of A and B should also co-occur more frequently than expected. This is the only statistical clue used throughout this paper.</Paragraph>
    <Paragraph position="1"> It is further assumed that there is a small dictionary available at the beginning, and that our aim is to expand this base lexicon. Using a corpus of the target language, we first compute a co-occurrence matrix whose rows are all word types occurring in the corpus and whose colunms are all target words appearing in the base lexicon. We now select a word of the source language whose translation is to be determined.</Paragraph>
    <Paragraph position="2"> Using our source-language corpus, we compute  a co-occurrence vector for this word. We translate all known words in this vector to the target language. Since our base lexicon is small, only some of the translations are known. All unknown words are discarded from the vector and the vector positions are sorted in order to match the vectors of the target-language matrix. With the resulting vector, we now perform a similarity computation to all vectors in the co-occurrence matrix of the target language. The vector with the highest similarity is considered to be the translation of our source-language word.</Paragraph>
  </Section>
  <Section position="5" start_page="520" end_page="522" type="metho">
    <SectionTitle>
3 Simulation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="520" end_page="520" type="sub_section">
      <SectionTitle>
3.1 Language Resources
</SectionTitle>
      <Paragraph position="0"> To conduct the simulation, a number of resources were required. These are  1. a German corpus 2. an English corpus 3. a number of German test words with known English translations 4. a small base lexicon, German to English  As the German corpus, we used 135 million words of the newspaper Frankfurter Allgemeine Zeitung (1993 to 1996), and as the English corpus 163 million words of the Guardian (1990 to 1994). Since the orientation of the two newspapers is quite different, and since the time spans covered are only in part overlapping, the two corpora can be considered as more or less unrelated.</Paragraph>
      <Paragraph position="1"> For testing our results, we started with a list of 100 German test words as proposed by Russell (1970), which he used for an association experiment with German subjects. By looking up the translations for each of these 100 words, we obtained a test set for evaluation.</Paragraph>
      <Paragraph position="2"> Our German/English base lexicon is derived from the Collins Gem German Dictionary with about 22,300 entries. From this we eliminated all multi-word entries, so 16,380 entries remained. Because we had decided on our test word list beforehand, and since it would not make much sense to apply our method to words that are already in the base lexicon, we also removed all entries belonging to the 100 test words.</Paragraph>
    </Section>
    <Section position="2" start_page="520" end_page="521" type="sub_section">
      <SectionTitle>
3.2 Pre-processing
</SectionTitle>
      <Paragraph position="0"> Since our corpora are very large, to save disk space and processing time we decided to remove all function words from the texts. This was done on the basis of a list of approximately 600 German and another list of about 200 English function words. These lists were compiled by looking at the closed class words (mainly articles, pronouns, and particles) in an English and a German morphological lexicon (for details see Lezius, Rapp, &amp; Wettler, 1998) and at word frequency lists derived from our corpora. 1 By eliminating function words, we assumed we would lose little information: Function words are often highly ambiguous and their co-occurrences are mostly based on syntactic instead of semantic patterns. Since semantic patterns are more reliable than syntactic patterns across language families, we hoped that eliminating the function words would give our method more generality.</Paragraph>
      <Paragraph position="1"> We also decided to lemmatize our corpora.</Paragraph>
      <Paragraph position="2"> Since we were interested in the translations of base forms only, it was clear that lemmatization would be useful. It not only reduces the sparse-data problem but also takes into account that German is a highly inflectional language, whereas English is not. For both languages we conducted a partial lemmatization procedure that was based only on a morphological lexicon and did not take the context of a word form into account. This means that we could not lemmatize those ambiguous word forms that can be derived from more than one base form. However, this is a relatively rare case. (According to Lezius, Rapp, &amp; Wettler, 1998, 93% of the tokens of a German text had only one lemma.) Although we had a context-sensitive lemmatizer for German available (Lezius, Rapp, &amp; Wettler, 1998), this was not the case for English, so for reasons of symmetry we decided not to use the context feature.</Paragraph>
      <Paragraph position="3"> I In cases in which an ambiguous word can be both a content and a function word (e.g., can), preference was given to those interpretations that appeared to occur more frequently.</Paragraph>
    </Section>
    <Section position="3" start_page="521" end_page="521" type="sub_section">
      <SectionTitle>
3.3 Co-occurrence Counting
</SectionTitle>
      <Paragraph position="0"> For counting word co-occurrences, in most other studies a fixed window size is chosen and it is determined how often each pair of words occurs within a text window of this size. However, this approach does not take word order within a window into account. Since it has been empirically observed that word order of content words is often similar between languages (even between unrelated languages such as English and Chinese), and since this may be a useful statistical clue, we decided to modify the common approach in the way proposed by Rapp (1996, p.</Paragraph>
      <Paragraph position="1"> 162). Instead of computing a single co-occurrence vector for a word A, we compute several, one for each position within the window. For example, if we have chosen the window size 2, we would compute a first co-occurrence vector for the case that word A is two words ahead of another word B, a second vector for the case that word A is one word ahead of word B, a third vector for A directly following B, and a fourth vector for A following two words after B. If we added up these four vectors, the result would be the co-occurrence vector as obtained when not taking word order into account. However, this is not what we do. Instead, we combine the four vectors of length n into a single vector of length 4n.</Paragraph>
      <Paragraph position="2"> Since preliminary experiments showed that a window size of 3 with consideration of word order seemed to give somewhat better results than other window types, the results reported here are based on vectors of this kind. However, the computational methods described below are in the same way applicable to window sizes of any length with or without consideration of word order.</Paragraph>
    </Section>
    <Section position="4" start_page="521" end_page="522" type="sub_section">
      <SectionTitle>
3.4 Association Formula
</SectionTitle>
      <Paragraph position="0"> Our method is based on the assumption that there is a correlation between the patterns of word co-occurrences in texts of different languages. However, as Rapp (1995) proposed, this correlation may be strengthened by not using the co-occurrence counts directly, but association strengths between words instead. The idea is to eliminate word-frequency effects and to emphasize significant word pairs by comparing their observed co-occurrence counts with their expected co-occurrence counts. In the past, for this purpose a number of measures have been proposed. They were based on mutual information (Church &amp; Hanks, 1989), conditional probabilities (Rapp, 1996), or on some standard statistical tests, such as the chi-square test or the log-likelihood ratio (Dunning, 1993). For the purpose of this paper, we decided to use the log-likelihood ratio, which is theoretically well justified and more appropriate for sparse data than chi-square. In preliminary experiments it also led to slightly better results than the conditional probability measure. Results based on mutual information or co-occurrence counts were significantly worse. For efficient computation of the log-likelihood ratio we used the following formula: 2</Paragraph>
      <Paragraph position="2"> with parameters kij expressed in terms of corpus frequencies: kl~ = frequency of common occurrence of word A and word B</Paragraph>
      <Paragraph position="4"> All co-occurrence vectors were transformed using this formula. Thereafter, they were normalized in such a way that for each vector the sum of its entries adds up to one. In the rest of the paper, we refer to the transformed and normalized vectors as association vectors.</Paragraph>
    </Section>
    <Section position="5" start_page="522" end_page="522" type="sub_section">
      <SectionTitle>
3.5 Vector Similarity
</SectionTitle>
      <Paragraph position="0"> To determine the English translation of an unknown German word, the association vector of the German word is computed and compared to all association vectors in the English association matrix. For comparison, the correspondences between the vector positions and the columns of the matrix are determined by using the base lexicon. Thus, for each vector in the English matrix a similarity value is computed and the English words are ranked according to these values. It is expected that the correct translation is ranked first in the sorted list.</Paragraph>
      <Paragraph position="1"> For vector comparison, different similarity measures can be considered. Salton &amp; McGill (1983) proposed a number of measures, such as the Cosine coefficient, the Jaccard coefficient, and the Dice coefficient (see also Jones &amp; Furnas, 1987). For the computation of related terms and synonyms, Ruge (1995), Landauer and Dumais (1997), and Fung and McKeown (1997) used the cosine measure, whereas Grefenstette (1994, p. 48) used a weighted Jaccard measure.</Paragraph>
      <Paragraph position="2"> We propose here the city-block metric, which computes the similarity between two vectors X and Y as the sum of the absolute differences of corresponding vector positions: S:Z\[Xi -Yi\[ i=l In a number of experiments we compared it to other similarity measures, such as the cosine measure, the Jaccard measure (standard and binary), the Euclidean distance, and the scalar product, and found that the city-block metric yielded the best results. This may seem surprising, since the formula is very simple and the computational effort smaller than with the other measures. It must be noted, however, that the other authors applied their similarity measures directly to the (log of the) co-occurrence vectors, whereas we applied the measures to the association vectors based on the log-likelihood ratio. According to our observations, estimates based on the log-likelihood ratio are generally more reliable across different corpora and languages. null</Paragraph>
    </Section>
    <Section position="6" start_page="522" end_page="522" type="sub_section">
      <SectionTitle>
3.6 Simulation Procedure
</SectionTitle>
      <Paragraph position="0"> The results reported in the next section were obtained using the following procedure: 1. Based on the word co-occurrences in the German corpus, for each of the 100 German test words its association vector was computed. In these vectors, all entries belonging to words not found in the English part of the base lexicon were deleted.</Paragraph>
      <Paragraph position="1"> 2. Based on the word co-occurrences in the English corpus, an association matrix was computed whose rows were all word types of the corpus with a frequency of 100 or higher 3 and whose columns were all English words occurring as first translations of the German words in the base lexicon. 4 3. Using the similarity function, each of the German vectors was compared to all vectors of the English matrix. The mapping between vector positions was based on the first translations given in the base lexicon. For each of the German source words, the English vocabulary was ranked according to the resuiting similarity value.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="522" end_page="523" type="metho">
    <SectionTitle>
3 The limitation to words with frequencies above 99
</SectionTitle>
    <Paragraph position="0"> was introduced for computational reasons to reduce the number of vector comparisons and thus speed up the program. (The English corpus contains 657,787 word types after lemmatization, which leads to extremely large matrices.) The purpose of this limitation was not to limit the number of translation candidates considered. Experiments with lower thresholds showed that this choice has little effect on the results to our set of test words.</Paragraph>
    <Paragraph position="1"> 4 This means that alternative translations of a word were not considered. Another approach, as conducted by Fung &amp; Yee (1998), would be to consider all possible translations listed in the lexicon and to give them equal (or possibly descending) weight. Our decision was motivated by the observation that many words have a salient first translation and that this translation is listed first in the Collins Gem Dictionary German-English. We did not explore this issue further since in a small pocket dictionary only few ambiguities are listed.</Paragraph>
  </Section>
class="xml-element"></Paper>