<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1067">
<Title>Automatic Identification of Word Translations from Unrelated English and German Corpora</Title>
<Section position="3" start_page="0" end_page="519" type="intro">
<SectionTitle> languages </SectionTitle>
<Paragraph position="0"> All these clues usually work well for parallel texts. However, despite serious efforts in the compilation of parallel corpora (Armstrong et al., 1998), the availability of a large-enough parallel corpus in a specific domain and for a given pair of languages is still the exception. Since monolingual corpora are much easier to acquire, it would be desirable to have a program that can determine the translations of words from comparable (same domain) or even unrelated monolingual texts of two languages.</Paragraph>
<Paragraph position="1"> This is what translators and interpreters usually do when preparing terminology in a specific field: they read texts from this field in both languages and draw their conclusions on word correspondences from the usage of the terms. Of course, translators and interpreters can understand the texts, whereas our programs only consider a few statistical clues.</Paragraph>
<Paragraph position="2"> For non-parallel texts the first clue, which is usually by far the strongest of the three mentioned above, is not applicable at all. The second clue is generally less powerful than the first, since most words are ambiguous in natural languages, and many ambiguities differ across languages. Nevertheless, this clue is applicable to comparable texts, although with lower reliability than for parallel texts. In the case of unrelated texts, however, its usefulness may be near zero. The third clue is generally limited to the identification of word pairs with similar spelling. For all other pairs, it is usually used in combination with the first clue. Since the first clue does not work with non-parallel texts, the third clue is useless for identifying the majority of pairs. For unrelated languages, it is not applicable anyway.</Paragraph>
<Paragraph position="3"> In this situation, Rapp (1995) proposed using a clue different from the three mentioned above: his co-occurrence clue is based on the assumption that there is a correlation between co-occurrence patterns in different languages. For example, if the words teacher and school co-occur more often than expected by chance in a corpus of English, then the German translations of teacher and school, Lehrer and Schule, should also co-occur more often than expected in a corpus of German. In a feasibility study he showed that this assumption actually holds for the language pair English/German even in the case of unrelated texts. When comparing an English and a German co-occurrence matrix of corresponding words, he found a high correlation between the co-occurrence patterns of the two matrices when the rows and columns of both matrices were in corresponding word order, and a low correlation when they were in random order.</Paragraph>
<Paragraph position="4"> The validity of the co-occurrence clue is obvious for parallel corpora, but - as empirically shown by Rapp - it also holds for non-parallel corpora. It can be expected that this clue will work best with parallel corpora, second-best with comparable corpora, and somewhat worse with unrelated corpora. In all three cases, the problem of robustness - as observed when applying the word-order clue to parallel corpora - is not severe. Transpositions of text segments have virtually no negative effect, and omissions or insertions are not critical. However, the co-occurrence clue applied to comparable corpora is much weaker than the word-order clue applied to parallel corpora, so larger corpora and well-chosen statistical methods are required.</Paragraph>
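The feasibility check described above can be pictured with a small sketch. The Python code below builds sentence-level co-occurrence matrices for a handful of corresponding English and German words and compares the correlation of the two matrices in corresponding versus random word order. This is only an illustration of the idea: the toy corpora, the sentence-window co-occurrence counts, and the Pearson correlation are assumptions made for this sketch, not the setup of Rapp's original study.

```python
# Illustrative sketch (not the original experiment): compare co-occurrence
# matrices of corresponding English and German words in matched vs. random
# word order and observe the drop in correlation.
import itertools
import random

def cooc_matrix(sentences, words):
    """Count how often each pair of listed words co-occurs in the same sentence."""
    index = {w: i for i, w in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for sentence in sentences:
        present = [w for w in set(sentence.split()) if w in index]
        for a, b in itertools.permutations(present, 2):
            matrix[index[a]][index[b]] += 1
    return matrix

def pearson(x, y):
    """Pearson correlation of two equal-length number sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def flatten(matrix):
    return [value for row in matrix for value in row]

# Toy corpora, assumed purely for illustration.
english = ["the teacher went to school", "the teacher met a pupil",
           "the school hired a teacher", "a dog chased a cat"]
german = ["der Lehrer ging zur Schule", "der Lehrer traf einen Schüler",
          "die Schule stellte einen Lehrer ein", "der Schüler ging zur Schule",
          "ein Hund jagte eine Katze"]

en_words = ["teacher", "school", "pupil", "dog", "cat"]
de_words = ["Lehrer", "Schule", "Schüler", "Hund", "Katze"]  # corresponding order

en_m = cooc_matrix(english, en_words)
de_m = cooc_matrix(german, de_words)
matched = pearson(flatten(en_m), flatten(de_m))

# Destroy the correspondence by permuting the German rows and columns.
perm = list(range(len(de_words)))
random.shuffle(perm)
shuffled = [[de_m[i][j] for j in perm] for i in perm]
random_order = pearson(flatten(en_m), flatten(shuffled))

print(f"correlation, corresponding order: {matched:.2f}")
print(f"correlation, random order:        {random_order:.2f}")
```

On real corpora the contrast is of course less clean than on these toy sentences, which is exactly why larger corpora and careful statistics are needed, as noted above.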
<Paragraph position="5"> After an initial attempt to identify word translations with a context heterogeneity measure (Fung, 1995), Fung also based her later work on the co-occurrence assumption (Fung & Yee, 1998; Fung & McKeown, 1997). By presupposing a lexicon of seed words, she avoids the prohibitively expensive computational effort encountered by Rapp (1995). The method described here - although developed independently of Fung's work - goes in the same direction.</Paragraph>
<Paragraph position="6"> Conceptually, it is a trivial case of Rapp's matrix permutation method. By simply assuming an initial lexicon, the large number of permutations to be considered is reduced to a much smaller number of vector comparisons. The main contribution of this paper is to describe a practical implementation based on the co-occurrence clue that yields good results.</Paragraph>
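To make the reduction from matrix permutations to vector comparisons concrete, the following Python sketch ranks German candidate translations for an English word by mapping its co-occurrence vector over seed words into German dimensions and comparing it against the candidates' vectors. The seed lexicon, the counts, and the cosine similarity are hypothetical choices for illustration; they are not the corpora, association measure, or similarity metric used in the paper.

```python
# Sketch of seed-lexicon-based vector comparison (illustrative assumptions only).
from math import sqrt

# Assumed small seed lexicon of known translations (English -> German).
seed_lexicon = {"school": "Schule", "read": "lesen", "book": "Buch"}

# Assumed co-occurrence counts with the seed words; in practice these come
# from large monolingual corpora of the two languages.
en_vectors = {"teacher": {"school": 12, "read": 3, "book": 5}}
de_vectors = {
    "Lehrer":    {"Schule": 11, "lesen": 2, "Buch": 6},
    "Katze":     {"Schule": 0,  "lesen": 1, "Buch": 1},
    "Buchladen": {"Schule": 1,  "lesen": 4, "Buch": 9},
}

def translate_vector(vec, lexicon):
    """Map an English co-occurrence vector into German seed-word dimensions."""
    return {lexicon[w]: count for w, count in vec.items() if w in lexicon}

def cosine(u, v):
    dims = set(u) | set(v)
    dot = sum(u.get(d, 0) * v.get(d, 0) for d in dims)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_translations(en_word):
    """Rank German candidates by similarity of their co-occurrence vectors."""
    mapped = translate_vector(en_vectors[en_word], seed_lexicon)
    scores = {de: cosine(mapped, vec) for de, vec in de_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_translations("teacher"))
# "Lehrer" ranks first because its profile over the seed words is closest to
# the mapped vector of "teacher". Each English word now costs one vector
# comparison per German candidate instead of a search over matrix permutations.
```

</Section>
</Paper>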