<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2020">
  <Title>Effect of Domain-Specific Corpus in Compositional Translation Estimation for Technical Terms</Title>
  <Section position="2" start_page="0" end_page="114" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper studies issues in compiling a bilingual lexicon for technical terms. So far, several techniques for estimating bilingual term correspondences from a parallel/comparable corpus have been studied (Matsumoto and Utsuro, 2000). For example, for estimation from comparable corpora, Fung and Yee (1998) and Rapp (1999) proposed standard techniques in which contextual similarity between a source language term and its translation candidate is measured across the languages, and all the translation candidates are re-ranked according to the contextual similarities. However, only a limited number of parallel/comparable corpora are available for the purpose of estimating bilingual term correspondences. Therefore, even if one wants to apply those existing techniques to the task of estimating bilingual term correspondences of technical terms, it is usually quite difficult to find an existing corpus for the domain of such technical terms.</Paragraph>
    <Paragraph position="1"> Given this situation, we take the approach of collecting a corpus for the domain of such technical terms from the Web. In this approach, in order to compile a bilingual lexicon for technical terms, the following two issues have to be addressed: collecting technical terms to be listed as the headwords of a bilingual lexicon, and estimating translations of those technical terms.</Paragraph>
    <Paragraph position="2"> Among those two issues, this paper focuses on the second issue of translation estimation of technical terms, and proposes a method for translation estimation for technical terms using a domain/topic specific corpus collected from the Web.</Paragraph>
    <Paragraph position="3"> More specifically, the overall framework of compiling a bilingual lexicon from the Web can be illustrated as in Figure 1. Given sample terms of a specific domain/topic, technical terms to be listed as the headwords of a bilingual lexicon are collected from the Web by the related term collection method of Sato and Sasaki (2003). Those collected technical terms can be divided into three subsets according to the number of translation candidates they have in an existing bilingual lexicon, i.e., the subset $X^U_S$ of terms for which the number of translations in the existing bilingual lexicon is one, the subset $X^M_S$ of terms for which the number of translations is more than one, and the subset $Y_S$ of terms which are not found in the existing bilingual lexicon. (Henceforth, the union $X^U_S \cup X^M_S$ is denoted as $X_S$.)</Paragraph>
    <Paragraph position="9"> The translation estimation task here is to estimate translations for the terms of $X^M_S$ and $Y_S$. For the terms of $X^M_S$, it is required to select an appropriate translation from the translation candidates found in the existing bilingual lexicon. For example, as a translation of the Japanese technical term &quot;レジスタ&quot;, which belongs to the logic circuit field, the term &quot;register&quot; should be selected but not the term &quot;regista&quot; of the football field. On the other hand, for the terms of $Y_S$, it is required to generate and validate translation candidates. In this paper, for the above two tasks, we use a domain/topic specific corpus. Each term of $X^U_S$ has only one translation in the existing bilingual lexicon. The set of the translations of the terms of $X^U_S$ is denoted as $X^U_T$. Then, the domain/topic specific corpus is collected from the Web using the terms in the set $X^U_T$. A new bilingual lexicon is compiled from the result of translation estimation for the terms of $X^M_S$ and $Y_S$, as well as the terms of $X^U_S$ and their translations found in the existing bilingual lexicon.</Paragraph>
    <Paragraph position="18"> For each term of $X^M_S$, that is, each term with more than one translation candidate in the existing bilingual lexicon, we select the candidate that appears most frequently in the domain/topic specific corpus. The experimental result of this translation selection process is described in Section 5.2.</Paragraph>
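The frequency-based selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `count_occurrences` and `select_translation`, the whitespace tokenization, and the toy corpus are all assumptions made for the example.

```python
def count_occurrences(phrase, corpus_tokens):
    """Count how often a (possibly multi-word) phrase occurs in a token list."""
    words = phrase.lower().split()
    n = len(words)
    return sum(
        1
        for i in range(len(corpus_tokens) - n + 1)
        if corpus_tokens[i:i + n] == words
    )


def select_translation(candidates, corpus_tokens):
    """Return the candidate with the highest corpus frequency.

    Ties are resolved in favor of the earliest candidate (behavior of max).
    """
    return max(candidates, key=lambda c: count_occurrences(c, corpus_tokens))


# Toy domain corpus (hypothetical), already lowercased and tokenized.
corpus = "the shift register stores bits and each register is clocked".split()
best = select_translation(["register", "regista"], corpus)
```

With a logic-circuit corpus such as this toy one, the candidate from the football field never occurs, so the in-domain candidate is selected.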
    <Paragraph position="21"> As a method of translation generation/validation for technical terms, we propose a compositional translation estimation technique.</Paragraph>
    <Paragraph position="22"> Compositional translation estimation of a term proceeds by generating translation candidates compositionally, that is, by concatenating the translations of the constituents of the term. Those translation candidates are then validated using the domain/topic specific corpus.</Paragraph>
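A rough sketch of this generate-then-validate idea is given below. It assumes the term has already been split into constituents, uses a hypothetical toy constituent lexicon, and validates a candidate by a naive substring check against the corpus text; the paper's actual segmentation, scoring, and validation procedures are not reproduced here.

```python
from itertools import product


def generate_candidates(constituents, lexicon):
    """Concatenate every combination of constituent translations."""
    options = [lexicon.get(c, []) for c in constituents]
    # If any constituent has no translation, product() yields nothing.
    return [" ".join(combo) for combo in product(*options)]


def validate(candidates, corpus_text):
    """Keep only candidates that occur in the domain corpus (naive substring test)."""
    text = corpus_text.lower()
    return [c for c in candidates if c.lower() in text]


# Hypothetical constituent lexicon and corpus, for illustration only.
lexicon = {
    "応用": ["application", "applied"],
    "行動": ["behavior", "action"],
    "分析": ["analysis"],
}
corpus_text = "Applied behavior analysis is widely used in education."

candidates = generate_candidates(["応用", "行動", "分析"], lexicon)
validated = validate(candidates, corpus_text)
```

Here four candidates are generated and only the one attested in the corpus survives validation, which illustrates why a corpus from the right domain matters for filtering.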
    <Paragraph position="23"> In order to assess the applicability of the compositional translation estimation technique, we randomly select 667 Japanese-English technical term translation pairs from 10 domains in existing technical term bilingual lexicons. We then manually examine their compositionality and find that 88% of them are actually compositional, which is a very encouraging result. Based on this assessment, this paper proposes a method of compositional translation estimation for technical terms and shows, through experimental evaluation, that the domain/topic specific corpus contributes to improving the performance of compositional translation estimation.</Paragraph>
  </Section>
</Paper>