<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1011">
  <Title>Base Noun Phrase Translation Using Web Data and the EM Algorithm</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2. Related Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Translation with Non-parallel Corpora
</SectionTitle>
      <Paragraph position="0"> A straightforward approach to word or phrase translation is to use parallel bilingual corpora (e.g., Brown et al., 1993).</Paragraph>
      <Paragraph position="1"> Parallel corpora are, however, difficult to obtain in practice.</Paragraph>
      <Paragraph position="2"> To deal with this difficulty, a number of methods have been proposed that make use of relatively easily obtainable non-parallel corpora (e.g., Fung and Yee, 1998; Rapp, 1999; Diab and Finch, 2000). These methods usually assume that a number of translation candidates for a word or phrase are given (or can be easily collected), so the problem reduces to translation selection.</Paragraph>
      <Paragraph position="3"> All of the proposed methods find the translation(s) of a given word or phrase on the basis of the linguistic phenomenon that the contexts of a translation tend to be similar to the contexts of the given word or phrase. Fung and Yee (1998), for example, proposed representing the contexts of a word or phrase with a real-valued vector (e.g., a TF-IDF vector), in which each element corresponds to one word in the contexts. In translation selection, they select the translation candidates whose context vectors are closest to that of the given word or phrase. The context vector of the word or phrase to be translated is defined over words in the source language, whereas the context vector of a translation candidate is defined over words in the target language. Moreover, the words in the two languages have a many-to-many relationship (i.e., translation ambiguities). It is therefore necessary to accurately transform the context vector in the source language into a context vector in the target language before calculating distances.</Paragraph>
      <Paragraph position="4"> The vector-transformation problem, however, was not well resolved in previous work. Fung and Yee assumed that in a specific domain there is only a one-to-one mapping between words in the two languages. The assumption is reasonable in a specific domain, but is too strict for the general domain, in which we intend to perform translation here. A straightforward extension of Fung and Yee's assumption to the general domain is to restrict the many-to-many relationship to a many-to-one mapping (or one-to-one mapping). This approach, however, has the drawback of losing information in vector transformation, as will be described.</Paragraph>
      <Paragraph position="5"> For other methods using non-parallel corpora, see also (Tanaka and Iwasaki, 1996; Kikui, 1999; Koehn and Knight, 2000; Sumita, 2000; Nakagawa, 2001; Gao et al., 2001).</Paragraph>
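      <Paragraph position="6"> The context-vector selection scheme attributed to Fung and Yee (1998) in this subsection can be illustrated with a minimal sketch. It is not their implementation: it assumes plain term-frequency vectors instead of TF-IDF, cosine similarity as the closeness measure, and a hypothetical one-to-one dictionary whose target-side tokens (t_technology, cand_era, and so on) are invented placeholders. Source words missing from the dictionary are simply dropped, which is one way information can be lost in vector transformation.

```python
import math
from collections import Counter


def context_vector(context_words):
    """Term-frequency vector of the words seen around a phrase (assumption: no IDF)."""
    return Counter(context_words)


def transform(source_vec, one_to_one_dict):
    """Map a source-language vector into the target language.

    Assumes every source word has exactly one translation. Words that are
    missing from the dictionary are dropped, so information is lost.
    """
    target_vec = Counter()
    for word, weight in source_vec.items():
        if word in one_to_one_dict:
            target_vec[one_to_one_dict[word]] += weight
    return target_vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def select_translation(source_context, candidate_contexts, one_to_one_dict):
    """Return the candidate whose context vector is closest to the transformed source vector."""
    src = transform(context_vector(source_context), one_to_one_dict)
    return max(
        candidate_contexts,
        key=lambda cand: cosine(src, context_vector(candidate_contexts[cand])),
    )


# Hypothetical toy data: all target-side tokens are placeholders, not real Chinese words.
toy_dict = {
    "technology": "t_technology",
    "computer": "t_computer",
    "society": "t_society",
}
source_context = ["technology", "computer", "society"]
candidate_contexts = {
    "cand_era": ["t_technology", "t_society", "t_computer"],
    "cand_years_old": ["t_birthday", "t_child"],
}

print(select_translation(source_context, candidate_contexts, toy_dict))
```

With this toy data the candidate sharing context words with the transformed source vector is selected. Under a many-to-many dictionary, transform would have to choose or weight several translations per word, which is the difficulty discussed above.</Paragraph>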
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Translation Using Web Data
</SectionTitle>
      <Paragraph position="0"> The web is an extremely rich source of data for natural language processing, not only in terms of data size but also in terms of data type (e.g., multilingual data, link data). Recently, a new trend has arisen in natural language processing that seeks breakthroughs in the field by making effective use of web data (e.g., Brill et al., 2001).</Paragraph>
      <Paragraph position="1"> Nagata et al. (2001), for example, proposed collecting partial parallel corpus data on the web to create a translation dictionary. They observed that there are many partial parallel corpora between English and Japanese on the web; most typically, English translations of Japanese terms (words or phrases) are parenthesized and inserted immediately after the Japanese terms in documents written in Japanese.</Paragraph>
      <Paragraph position="2"> 3. Base Noun Phrase Translation Our method for Base NP translation comprises two steps: translation candidate collection and translation selection. In translation candidate collection, we look for translation candidates of a given Base NP. In translation selection, we determine the possible translation(s) from among the translation candidates.</Paragraph>
      <Paragraph position="3"> In this paper, we confine ourselves to translation of noun-noun pairs from English to Chinese; our method, however, can be extended to translations of other types of Base NPs between other language pairs.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Translation Candidate Collection
</SectionTitle>
      <Paragraph position="0"> We use heuristics for translation candidate collection. Figure 1 illustrates the process of collecting Chinese translation candidates for an English Base NP 'information age' with the heuristics.</Paragraph>
      <Paragraph position="1"> [Figure 1 (excerpt; Chinese text not recoverable)] 1. Input 'information age'; 2. Consult English-Chinese word translation dictionary: information -&gt; [Chinese translation]; age -&gt; [Chinese translation] (how old somebody is)</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Translation Selection -- EM-NBC-Ensemble
</SectionTitle>
      <Paragraph position="0"> We view the translation selection problem as one of classification and employ EM-NBC-Ensemble to perform the task. For ease of explanation, we first describe the algorithm using only EM-NBC and then extend it to EM-NBC-Ensemble. Context Information: As input data, we use 'contexts' in English that contain the phrase to be translated. We also use contexts in Chinese that contain the translation candidates.</Paragraph>
      <Paragraph position="1"> Here, a context containing a phrase is defined as the surrounding words within a window of a predetermined size that covers the phrase. We can easily obtain the data by searching for them on the web. In fact, the contexts containing the candidates are obtained at the same time as we conduct translation candidate collection (Step 4 in Figure 1).</Paragraph>
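      <Paragraph position="2"> The definition of a context given above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name contexts, the whitespace tokenization, the symmetric window, and the choice to exclude the phrase itself from its context are all assumptions, and Chinese text would first need word segmentation.

```python
def contexts(tokens, phrase, window=2):
    """Collect the words within `window` tokens on each side of every
    occurrence of `phrase` (a list of tokens) in `tokens`.

    Assumption: the phrase itself is not part of its own context.
    """
    n = len(phrase)
    found = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            found.append(left + right)
    return found


# Toy English sentence; a real run would use text retrieved from the web.
tokens = "we live in the information age of rapid change".split()
print(contexts(tokens, ["information", "age"], window=2))
# prints [['in', 'the', 'of', 'rapid']]
```

Each returned list is one context, and the contexts gathered for a phrase or a candidate form the input data described in this section.</Paragraph>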
    </Section>
  </Section>
</Paper>