<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1006">
  <Title>Extracting Word Correspondences from Bilingual Corpora Based on Word Co-occurrence Information</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Bilingual dictionaries are essential components for machine translation systems. One of the major problems with bilingual dictionaries is that they are expensive to build, since a huge number of terms are used in a variety of fields. Computer support is thus needed to reduce the cost of dictionary building.</Paragraph>
    <Paragraph position="1"> With the growing volume of text available in electronic form, a number of methods have been proposed for extracting word correspondences from bilingual corpora automatically. These methods can be divided into those taking a statistical approach (Gale &amp; Church 1991a; Kupiec 1993; Dagan et al. 1993; Inoue &amp; Nogaito 1993; Fung 1995) and those taking a linguistic approach (Yamamoto &amp; Sakamoto 1993; Kumano &amp; Hirakawa 1994; Ishimoto &amp; Nagao 1994). The statistical approach utilizes the occurrence frequencies and locations of words in a parallel corpus to calculate the pairwise correlations between the words in the two languages. The linguistic approach primarily extracts correspondences between compound words by consulting a bilingual dictionary of simple words.</Paragraph>
    <Paragraph position="2"> These proposed methods for extracting word correspondences from bilingual corpora have the following drawbacks. First, most of them assume that the input corpora are aligned sentence by sentence, which reduces their applicability remarkably. Although a number of automatic sentence alignment methods have been proposed (Brown et al. 1991; Gale &amp; Church 1991b; Kay &amp; Roscheisen 1993; Chen 1993), they are not very reliable for real noisy bilingual texts. Second, the statistical methods usually require a very large corpus as their input. However, it is not easy to obtain a very large corpus. Third, the linguistic methods are restricted to extracting correspondences between compound words.</Paragraph>
    <Paragraph position="3"> We have developed an extraction method that is free from the above drawbacks. In Sec. 2 we describe the basic idea of our method and give an overview. In Sec. 3 we describe the technical details, and in Sec. 4 we describe an experiment using patent-specification texts.</Paragraph>
    <Paragraph position="4"> In Sec. 5 we make a remark on the effectiveness of the proposed method, and discuss directions for improvement.</Paragraph>
  </Section>
</Paper>