<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1011">
  <Title>Base Noun Phrase Translation Using Web Data and the EM Algorithm</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> We address here the problem of Base NP translation: given a Base Noun Phrase in a source language (e.g., 'information age' in English), we are to find its possible translation(s) in a target language (e.g., '信息时代' in Chinese).</Paragraph>
    <Paragraph position="1"> We define a Base NP as a simple and non-recursive noun phrase. In many cases, Base NPs represent holistic and indivisible concepts, and thus translating them accurately from one language to another is extremely important in applications like machine translation, cross-language information retrieval, and foreign language writing assistance.</Paragraph>
    <Paragraph position="2"> In this paper, we propose a new method for Base NP translation, which consists of two steps: (1) translation candidate collection, and (2) translation selection. In translation candidate collection, for a given Base NP in the source language, we look for its translation candidates in the target language. To do so, we use a word-to-word translation dictionary and corpus data in the target language on the web. In translation selection, we determine the possible translation(s) from among the candidates. We use non-parallel corpus data in the two languages on the web and employ one of two methods that we have developed. In the first method, we view the problem as that of classification and employ an ensemble of Naive Bayesian Classifiers constructed with the EM Algorithm.</Paragraph>
    <Paragraph position="3"> Hereafter, we use 'EM-NBC-Ensemble' to denote the first method. In the second method, we view the problem as that of calculating similarities between context vectors and use TF-IDF vectors also constructed with the EM Algorithm. We use 'EM-TF-IDF' to denote the second method.</Paragraph>
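The two steps described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the dictionary, web hit counts, context counts, and IDF weights are hypothetical toy data, all function names are our own, and the EM step is replaced by a plain uniform split of each source context count across its dictionary translations. The selection step shown is a simplified stand-in for the TF-IDF similarity view only.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy word-to-word dictionary (English to Chinese); illustrative only.
DICTIONARY = {
    "information": ["信息", "情报"],
    "age": ["时代", "年龄"],
    "technology": ["技术"],
    "internet": ["网络"],
    "military": ["军事"],
}

# Hypothetical hit counts standing in for target-language web data.
WEB_COUNTS = {"信息时代": 1200, "情报时代": 4}

# Hypothetical IDF weights for target-language context words.
IDF = {"技术": 1.0, "网络": 1.5, "军事": 2.0}


def collect_candidates(base_np, dictionary, web_counts):
    """Step 1: compose word-by-word translations, keep those attested in web data."""
    options = [dictionary.get(word, []) for word in base_np.split()]
    composed = ["".join(parts) for parts in product(*options)]
    return [c for c in composed if web_counts.get(c, 0) != 0]


def project_context(source_context, dictionary):
    """Map source context counts onto target words.

    Assumption: each count is split uniformly over the dictionary
    translations. This replaces the EM estimation and is NOT the EM step.
    """
    projected = Counter()
    for word, count in source_context.items():
        translations = dictionary.get(word, [])
        for target_word in translations:
            projected[target_word] += count / len(translations)
    return projected


def tfidf(counts, idf):
    """Weight raw context counts by IDF; words without an IDF entry get weight 0."""
    return {word: tf * idf.get(word, 0.0) for word, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(word, 0.0) for word, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_context, candidate_contexts, dictionary, idf):
    """Step 2: rank candidates by similarity of their context to the projected source context."""
    query = tfidf(project_context(source_context, dictionary), idf)
    scored = [
        (cosine(query, tfidf(context, idf)), candidate)
        for candidate, context in candidate_contexts.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical context counts around the source phrase and around each candidate.
SOURCE_CONTEXT = {"technology": 3, "internet": 2}
CANDIDATE_CONTEXTS = {
    "信息时代": {"技术": 5, "网络": 3},
    "情报时代": {"军事": 2, "技术": 1},
}

if __name__ == "__main__":
    print(collect_candidates("information age", DICTIONARY, WEB_COUNTS))
    print(rank_candidates(SOURCE_CONTEXT, CANDIDATE_CONTEXTS, DICTIONARY, IDF))
```

With this toy data, the candidate whose target-language context shares the most weighted words with the projected source context is ranked first. With real web data, the uniform split would be the part to replace with an estimated translation distribution.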
    <Paragraph position="4"> Experimental results indicate that our method is very effective: the coverage and top-3 accuracy of translation at the final stage are 91.4% and 79.8%, respectively. The results are significantly better than those of the baseline methods relying on existing technologies. The higher performance of our method can be attributed to the sheer volume of the web data used and the employment of the EM Algorithm.</Paragraph>
  </Section>
</Paper>