<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0119">
  <Title>Finding Terminology Translations from Non-parallel Corpora</Title>
  <Section position="4" start_page="0" end_page="194" type="intro">
    <SectionTitle>
2 Related work
</SectionTitle>
    <Paragraph position="0"> Few attempts have been made to explore non-parallel corpora of monolingual texts in the same domain. Early work uses a pair of non-parallel texts for the task of lexical disambiguation between several senses of a word (Dagan 1990).</Paragraph>
    <Paragraph position="1"> This basic idea extends to choosing a translation among multiple candidates (Dagan &amp; Itai 1994) given collocation information. A similar idea was later applied by (Rapp 1995) to show the plausibility of correlations between words in non-parallel texts. He proposed a matrix permutation method matching co-occurrence patterns in two non-parallel texts, but noted that computational limitations hamper further extension of this method. Using the same idea, (Tanaka &amp; Iwasaki 1996) demonstrated how to eliminate candidate words in a bilingual dictionary.</Paragraph>
    <Paragraph position="2"> All the above works point to a certain discriminatory feature in monolingual texts: context and word relations. However, these works remain in the realm of resolving ambiguities or choosing the best candidate among a small set of possibilities. It is argued in (Gale &amp; Church 1994) that feature vectors of 100,000 dimensions are likely to be needed for high-resolution discriminant analysis. It remains questionable whether feature vectors of lower dimensions are discriminating enough for extracting bilingual lexical pairs from non-parallel corpora with a large number of candidates. Is it possible to achieve bilingual lexicon translation by looking at words in relation to other words? In this paper, we hope to shed some light on this question.</Paragraph>
    <Paragraph position="3"> 3 Two pilot non-parallel corpora
In our experiments, we use two sets of non-parallel corpora: (1) Wall Street Journal (WSJ) from 1993 and 1994, divided into two non-overlapping parts. Each resulting English corpus has 10.36M bytes of data. (2) Wall Street Journal in English and Nikkei Financial News in Japanese, from the same time period. The WSJ text contains 49M bytes of data, and the Nikkei 127M bytes. Since the Nikkei is encoded in two-byte Japanese character sets, the latter is equivalent to about 60M bytes of data in English.</Paragraph>
    <Paragraph position="4"> The English Wall Street Journal non-parallel corpus gives us an easier test set on which to start. The output of this corpus should consist of words matching to themselves as translations. It is useful as a baseline evaluation test set providing an estimate on performance. The WSJ/Nikkei corpus is the most non-parallel type of corpus. In addition to being written in languages from different linguistic families by different journalists, WSJ/Nikkei also share only a limited number of common topics. The Wall Street Journal tends to focus on U.S. domestic economic and political news, whereas the Nikkei Financial News focuses on economic and political events in Japan and in Asia. Due to the large differences in content, language, and writing style, we consider this corpus more difficult than others. However, the result we obtain from this corpus gives us a lower bound on the performance of our algorithm.
4 An algorithm for finding terminology translations from non-parallel corpora
Bilingual lexicon translation algorithms for parallel corpora in general make use of fixed correlations between a pair of bilingual terms, reflected in their frequent co-occurrences in translated texts, to find lexicon translations. We use correlations both between monolingual lexical units, and between bilingual or multilingual lexical units, to find a consistent pattern which is represented as statistical word features for translation.</Paragraph>
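The frequent-co-occurrence signal described above can be sketched in a few lines of Python. This is a toy illustration only, not the paper's implementation; the window size and the simple counting scheme are our assumptions.

```python
from collections import Counter

def cooccurrence_counts(tokens, target, window=5):
    """Count how often each word occurs within `window` tokens of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# Two of the Figure 1 segments, run together as a toy text.
text = ("sold $75 million of 6% debentures priced at par and due Sept "
        "GTE offered a $250 million issue of 8 1/2% debentures due in 30 years").split()
c = cooccurrence_counts(text, "debentures")
print(c["million"], c["due"])  # prints "2 2"
```

On real corpora the counts would be accumulated over millions of tokens, but the principle is the same: a translation pair should exhibit similar co-occurrence profiles in its own language.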
    <Paragraph position="5"> We illustrate the possible correlations using the word debentures in the two different parts of WSJ. Figure 1 shows segments from both texts containing the word debentures.</Paragraph>
    <Paragraph position="6"> [Figure 1: concordance segments from both parts of the WSJ containing the word debentures, e.g. "sold $75 million of 6% debentures priced at par and due Sept", "$20 million of convertible debentures due June 1".]
Figure 1 shows that:
1. debentures co-occurs most frequently with words such as million and due in both texts.
2. debentures is less related to engineering, which does not appear in any segment containing debentures.</Paragraph>
    <Paragraph position="7"> 3. Given all words in one text, debentures is closely correlated with a subset of all words in the texts. In Figure 1, this subset consists of million, due, convertible, subordinated, etc.
Following the above observations, we propose the following algorithm for finding word or term translation pairs from non-parallel corpora:
1. Start from a bilingual list of known translation pairs (i.e. seed words).
2. For every unknown word or term e in language 1, find its correlation1 with every word in the seed word list in language 1, giving a relation vector WORM1.
3. Similarly, for every unknown word c in language 2, find its correlation1 with every word in the seed word list in language 2, giving a relation vector WORM2.
4. Compute correlation2(WORM1, WORM2); if it is high, e and c are considered a translation pair.</Paragraph>
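The four steps above can be sketched as follows. For illustration we take correlation1 to be sentence-level co-occurrence counts with the seed words, and correlation2 to be cosine similarity between the two relation vectors; the paper's actual correlation measures may differ, and the corpora and seed list here are invented.

```python
import math
from collections import Counter

def relation_vector(sentences, word, seeds):
    """correlation1: co-occurrence of `word` with each seed word, per sentence."""
    counts = Counter()
    for sent in sentences:
        toks = set(sent.split())
        if word in toks:
            for s in seeds:
                if s in toks:
                    counts[s] += 1
    return [counts[s] for s in seeds]

def cosine(u, v):
    """correlation2: cosine similarity between two relation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "non-parallel" corpora in languages 1 and 2; the seed list consists of
# known translation pairs (identity pairs here, as in the WSJ/WSJ baseline).
corpus1 = ["debentures due million", "convertible debentures million", "engineering firm"]
corpus2 = ["debentures million due", "subordinated debentures due", "engineering award"]
seeds = ["million", "due", "convertible", "engineering"]

v1 = relation_vector(corpus1, "debentures", seeds)
v2 = relation_vector(corpus2, "debentures", seeds)
print(round(cosine(v1, v2), 3))  # prints 0.73
```

A high correlation2 score marks e and c as a candidate translation pair; in practice one would score every cross-language pair and rank the candidates.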
    <Paragraph position="8"> We use online dictionaries to provide the seed word lists. To avoid problems of polysemy and non-standardization in dictionary entries, we choose a more reliable, less ambiguous subset of dictionary entries as the seed word list. This subset contains dictionary entries which occur at mid-range frequency in the corpus, so that they are more likely to be content words. They must occur on both sides of the non-parallel corpora, and have a small number of candidate translations. Such seed words serve as the textual anchor points in non-parallel corpora. For example, we obtained 1,416 entries from the Japanese/English online dictionary EDICT using these criteria.</Paragraph>
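The seed-word filtering criteria might be sketched as below. The frequency thresholds, the ambiguity cutoff, and the toy dictionary are all invented for illustration; they are not the values used with EDICT.

```python
from collections import Counter

def select_seeds(dictionary, tokens1, tokens2, lo=2, hi=5, max_translations=2):
    """Keep dictionary entries that occur at mid-range frequency on both
    sides of the corpora and have few candidate translations."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    seeds = []
    for word, translations in dictionary.items():
        if len(translations) > max_translations:
            continue  # too ambiguous to anchor on
        if f1[word] not in range(lo, hi + 1):
            continue  # too rare or too frequent in corpus 1
        kept = [t for t in translations if f2[t] in range(lo, hi + 1)]
        if kept:
            seeds.append((word, kept[0]))
    return seeds

# Toy data: "run" is dropped for having too many translations,
# "rare" for occurring below the frequency floor.
dic = {"bank": ["ginkou"], "run": ["hashiru", "keiei", "unten"], "rare": ["mare"]}
tok1 = "bank bank bank run run rare".split()
tok2 = "ginkou ginkou ginkou mare".split()
print(select_seeds(dic, tok1, tok2))  # prints [('bank', 'ginkou')]
```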
  </Section>
</Paper>