File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1089_intro.xml

Size: 2,858 bytes

Last Modified: 2025-10-06 14:02:12

<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1089">
  <Title>Mining New Word Translations from Comparable Corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> New words such as person names, organization names, technical terms, etc. appear frequently. In order for a machine translation system to translate these new words correctly, its bilingual lexicon needs to be constantly updated with new word translations.</Paragraph>
    <Paragraph position="1"> Much research has been done on using parallel corpora to learn bilingual lexicons (Melamed, 1997; Moore, 2003). But parallel corpora are scarce resources, especially for uncommon language pairs. Comparable corpora refer to texts that are not direct translation but are about the same topic. For example, various news agencies report major world events in different languages, and such news documents form a readily available source of comparable corpora. Being more readily available, comparable corpora are thus more suitable than parallel corpora for the task of acquiring new word translations, although relatively less research has been done in the past on comparable corpora. Previous research efforts on acquiring translations from comparable corpora include (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999).</Paragraph>
    <Paragraph position="2"> When translating a word w, two sources of information can be used to determine its translation: the word w itself and the surrounding words in the neighborhood (i.e., the context) of w. Most previous research only considers one of the two sources of information, but not both. For example, the work of (Al-Onaizan and Knight, 2002a; Al-Onaizan and Knight, 2002b; Knight and Graehl, 1998) used the pronunciation of w in translation. On the other hand, the work of (Cao and Li, 2002; Fung and Yee, 1998; Koehn and Knight, 2002; Rapp, 1995; Rapp, 1999) used the context of w to locate its translation in a second language.</Paragraph>
    <Paragraph position="3"> In this paper, we propose a new approach for the task of mining new word translations from comparable corpora, by combining both context and transliteration information. Since both sources of information are complementary, the accuracy of our combined approach is better than the accuracy of using just context or transliteration information alone. We fully implemented our method and tested it on Chinese-English comparable corpora. We translated Chinese words into English. That is, Chinese is the source language and English is the target language. We achieved encouraging results.</Paragraph>
    <Paragraph position="4"> While we have only tested our method on Chinese-English comparable corpora, our method is general and applicable to other language pairs.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML