<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1041">
  <Title>Machine Translation vs. Dictionary Term Translation - a Comparison for English-Japanese News Article Alignment</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In this paper we compare the effectiveness of full machine translation (MT) and simple dictionary term lookup (DTL) for the task of English-Japanese news article alignment using the vector space model from multi-lingual information retrieval. Matching texts depends essentially on lexical coincidence between the English text and the Japanese translation, and we see that the two methods show the trade-off between reduced transfer ambiguity in MT and increased synonymy in DTL.</Paragraph>
    <Paragraph position="1"> Corpus-based approaches to natural language processing are now well established for tasks such as vocabulary and phrase acquisition, word sense disambiguation and pattern learning. The continued practical application of corpus-based methods is critically dependent on the availability of corpus resources. In machine translation we are concerned with the provision of bilingual knowledge, and we have found that the language domains users are interested in, such as news, current affairs and technology, are poorly represented in today's publicly available corpora. Our main area of interest is English-Japanese translation, but few clean parallel corpora are available in large quantities.</Paragraph>
    <Paragraph position="2"> As a result we have looked at ways of automatically acquiring large amounts of parallel text for vocabulary acquisition.</Paragraph>
    <Paragraph position="3"> The World Wide Web and other Internet resources provide a potentially valuable source of parallel texts. Newswire companies, for example, publish news articles in various languages and various domains every day. We can expect a coincidence of content in these collections of text, but the degree of parallelism is likely to be lower than for texts such as United Nations and parliamentary proceedings. Nevertheless, we can expect a coincidence of vocabulary in the names of people, places, organisations and events. This time-sensitive bilingual vocabulary is valuable for machine translation and makes a significant difference to user satisfaction by improving the comprehensibility of the output.</Paragraph>
    <Paragraph position="4"> Our goal is to automatically produce a parallel corpus of aligned articles from collections of English and Japanese news texts for bilingual vocabulary acquisition. The first stage in this process is to align the news texts. Previously, Collier et al. (1998) adapted multi-lingual (also called &quot;translingual&quot; or &quot;cross-language&quot;) information retrieval (MLIR) for this purpose and showed the practicality of the method. In this paper we extend their investigation by comparing the performance of machine translation and conventional dictionary term translation for this task.</Paragraph>
  </Section>
</Paper>