<?xml version="1.0" standalone="yes"?>
<Paper uid="J96-1001">
  <Title>NetPatrol Consulting</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Hieroglyphics remained undeciphered for centuries until the discovery of the Rosetta stone at the beginning of the 19th century in Rosetta, Egypt. The Rosetta stone is a tablet of black basalt containing parallel inscriptions in three different scripts: Greek and two forms of ancient Egyptian writing (demotic and hieroglyphics). Jean-François Champollion, a linguist and Egyptologist, made the assumption that these inscriptions were parallel and managed after several years of research to decipher the hieroglyphic inscriptions. He used his work on the Rosetta stone as a basis from which to produce the first comprehensive hieroglyphics dictionary (Budge 1989).</Paragraph>
    <Paragraph position="1"> In this paper, we describe a modern version of a similar approach: given a large corpus in two languages, our system produces translations of common word pairs and phrases that can form the basis of a bilingual lexicon. Our focus is on the use of statistical methods for the translation of multiword expressions, such as collocations, which are often idiomatic in nature. Published translations of such collocations are not readily available, even for languages such as French and English, despite the fact that collocations have been recognized as one of the main obstacles to second language acquisition (Leed and Nakhimovsky 1979).</Paragraph>
    <Paragraph position="2"> * The work reported in this paper was done while the author was at Columbia University. His current address is NetPatrol Consulting, Tel Maneh 6, Haifa 34363, Israel. E-mail: smadja@netvision.net.il. † Department of Computer Science, 450 Computer Science Building, Columbia University, New York, NY 10027, USA. E-mail: kathy@cs.columbia.edu, vh@cs.columbia.edu.</Paragraph>
    <Paragraph position="3"> © 1996 Association for Computational Linguistics. Computational Linguistics, Volume 22, Number 1. We have developed a program named Champollion,¹ which, given a sentence-aligned parallel bilingual corpus, translates collocations (or individual words) in the source language into collocations (or individual words) in the target language. The aligned corpus is used as a reference, or database corpus, and represents Champollion's knowledge of both languages. Champollion uses statistical methods to incrementally construct the collocation translation, adding one word at a time. As a correlation measure, Champollion uses the Dice coefficient (Dice 1945; Sørensen 1948) commonly used in information retrieval (Salton and McGill 1983; Frakes and Baeza-Yates 1992).</Paragraph>
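    <Paragraph> As an illustration of the correlation measure named above, the following minimal sketch computes the Dice coefficient, 2|X ∩ Y| / (|X| + |Y|), over sets of aligned-sentence indices. The function name dice, the set-based representation, and the toy indices are assumptions made for this example; they are not taken from Champollion itself.

```python
def dice(source_hits: set, target_hits: set) -> float:
    """Dice coefficient between two sets of aligned-sentence indices.

    source_hits: indices of aligned pairs whose source side contains the item
    target_hits: indices of aligned pairs whose target side contains the item
    (Both representations are assumptions made for this illustration.)
    """
    total = len(source_hits) + len(target_hits)
    if total == 0:
        # Neither item occurs anywhere: treat the correlation as zero.
        return 0.0
    return 2 * len(source_hits.intersection(target_hits)) / total


# Toy data (hypothetical): the source collocation occurs in pairs 0-3,
# the candidate target word in pairs 1, 2, 3 and 7.
source_hits = {0, 1, 2, 3}
target_hits = {1, 2, 3, 7}
score = dice(source_hits, target_hits)
print(score)
```

    The score is 1.0 when the two items always occur in the same aligned sentence pairs and 0.0 when they never co-occur.</Paragraph>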
    <Paragraph position="4"> For a given source language collocation, Champollion identifies individual words in the target language that are highly correlated with the source collocation, thus producing a set of words in the target language. These words are then combined in a systematic, iterative manner to produce a translation of the source language collocation. Champollion considers all pairs of these words and identifies any that are highly correlated with the source collocation. Next, triplets are produced by adding a highly correlated word to a highly correlated pair, and the triplets that are highly correlated with the source language collocation are passed to the next stage. This process is repeated until no more highly correlated combinations of words can be found. Champollion selects the group of words with the highest cardinality and correlation factor as the target collocation. Finally, it produces the correct word ordering of the target collocation by examining samples in the corpus. If word order is variable in the target collocation, Champollion labels it flexible (for example, "to take steps to" can appear as "took immediate steps to", "steps were taken to", etc.); otherwise, the correct word order is reported and the collocation is labeled rigid.</Paragraph>
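    <Paragraph> The iterative construction described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names translate, sentences_with_all and target_index, the fixed threshold of 0.5, the tie-breaking by (size, score), and the toy French data are all hypothetical, and the final word-ordering step is omitted.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient between two sets of aligned-sentence indices."""
    total = len(a) + len(b)
    return 2 * len(a.intersection(b)) / total if total else 0.0


def sentences_with_all(words, target_index):
    """Indices of aligned pairs whose target side contains every word."""
    return set.intersection(*(target_index.get(w, set()) for w in words))


def translate(source_hits, target_index, threshold=0.5):
    """Grow target word groups one word at a time (hypothetical sketch).

    Returns (size, score, words) for the largest, best-correlated group,
    or None if no single target word passes the threshold.
    """
    # Stage 1: individual target words correlated with the source collocation.
    singles = [w for w, hits in target_index.items()
               if dice(source_hits, hits) >= threshold]
    current = {frozenset([w]) for w in singles}
    best = None
    while current:
        # Record the best group seen so far: larger size first, then score.
        for group in current:
            score = dice(source_hits, sentences_with_all(group, target_index))
            candidate = (len(group), score, tuple(sorted(group)))
            if best is None or candidate > best:
                best = candidate
        # Next stage: extend each surviving group by one correlated word.
        extended = set()
        for group in current:
            for w in singles:
                if w in group:
                    continue
                bigger = group | {w}
                hits = sentences_with_all(bigger, target_index)
                if dice(source_hits, hits) >= threshold:
                    extended.add(bigger)
        current = extended
    return best


# Toy data (hypothetical): target word -> indices of aligned pairs containing it.
target_index = {
    "prendre": {0, 1, 2, 3},
    "mesures": {0, 1, 2, 4},
    "des": set(range(20)),  # very frequent word, weakly correlated
}
source_hits = {0, 1, 2, 3}  # pairs whose source side has the collocation
result = translate(source_hits, target_index)
print(result)
```

    In this toy run the frequent word "des" is filtered out at the first stage, and the pair of remaining words is preferred over either single word because it has higher cardinality while still passing the threshold.</Paragraph>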
    <Paragraph position="5"> To evaluate Champollion, we used a collocation compiler, XTRACT (Smadja 1993), to automatically produce several lists of source (English) collocations. These source collocations contain both flexible word pairs, which can be separated by an arbitrary number of words, and fixed constituents, such as compound noun phrases. Using XTRACT on three parts of the English data in the Hansards corpus, each representing one year's worth of data, we extracted three sets of collocations, each consisting of 300 randomly selected collocations occurring with medium frequency. We then ran Champollion on each of these sets, using three separate database corpora of varying size, also taken from the Hansards corpus. We asked several people fluent in both French and English to judge the results, and the accuracy of Champollion was found to range from 65% to 78%. In our discussion of results, we show how the problems behind the lower score can be alleviated by increasing the size of the database corpus.</Paragraph>
    <Paragraph position="6"> In the following sections, we first present a review of related work in statistical natural language processing dealing with bilingual data. Our algorithm depends on using a measure of correlation to find words that are highly correlated across languages. We describe the measure that we use and then provide a detailed description of the algorithm, following this with a theoretical analysis of the performance of our algorithm. Next, we turn to a description of the results and evaluation. Finally, we show how the results can be used for a variety of applications, closing with a discussion of the limitations of our approach and of future work.</Paragraph>
    <Paragraph position="7"> 1 None of the authors is affiliated with Boitet's research center on machine translation in Grenoble, France, which is also named &quot;Champollion&quot;.</Paragraph>
    <Paragraph position="8"> Smadja, McKeown, and Hatzivassiloglou Translating Collocations for Bilingual Lexicons</Paragraph>
  </Section>
</Paper>