<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4003">
  <Title>Improving Machine Translation Performance by Exploiting Non-Parallel Corpora</Title>
  <Section position="2" start_page="0" end_page="479" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Parallel texts--texts that are translations of each other--are an important resource in many NLP applications. They provide indispensable training data for statistical machine translation (Brown et al. 1990; Och and Ney 2002) and have been found useful in research on automatic lexical acquisition (Gale and Church 1991; Melamed 1997), cross-language information retrieval (Davis and Dunning 1995; Oard 1997), and annotation projection (Diab and Resnik 2002; Yarowsky and Ngai 2001; Yarowsky, Ngai, and Wicentowski 2001).</Paragraph>
    <Paragraph position="1"> Unfortunately, parallel texts are also scarce resources: limited in size, language coverage, and language register. There are relatively few language pairs for which parallel corpora of reasonable sizes are available; and even for those pairs, the corpora come mostly from one domain, that of political discourse (proceedings of the Canadian or European Parliament, or of the United Nations). This is especially problematic for the field of statistical machine translation (SMT), because translation systems trained on data from a particular domain (e.g., parliamentary proceedings) will perform poorly when translating texts from a different domain (e.g., news articles).</Paragraph>
    <Paragraph position="2"> One way to alleviate this lack of parallel data is to exploit a much more available and diverse resource: comparable non-parallel corpora. Comparable corpora are texts that, while not parallel in the strict sense, are somewhat related and convey overlapping information. Good examples are the multilingual news feeds produced by news agencies such as Agence France Presse, Xinhua News, Reuters, CNN, BBC, etc. Such texts are widely available on the Web for many language pairs and domains. They often [?] 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292. E-mail: {dragos,marcu}@isi.edu. Submission received: 5 November 2004; Accepted for publication: 3 March 2005.</Paragraph>
    <Paragraph position="3"> (c) 2006 Association for Computational Linguistics Computational Linguistics Volume 31, Number 4 contain many sentence pairs that are fairly good translations of each other. The ability to reliably identify these pairs would enable the automatic creation of large and diverse parallel corpora.</Paragraph>
    <Paragraph position="4"> However, identifying good translations in comparable corpora is hard. Even texts that convey the same information will exhibit great differences at the sentence level. Consider the two newspaper articles in Figure 1. They have been published by the English and French editors of Agence France Presse, and report on the same event, an epidemic of cholera in Pyongyang. The lines in the figure connect sentence pairs that are approximate translations of each other. Discovering these links automatically is clearly non-trivial. Traditional sentence alignment algorithms (Gale and Church 1991; Wu 1994; Fung and Church 1994; Melamed 1999; Moore 2002) are designed to align sentences in parallel corpora and operate on the assumption that there are no reorderings and only limited insertions and deletions between the two renderings of a parallel document. Thus, they perform poorly on comparable, non-parallel texts.</Paragraph>
    <Paragraph position="5"> What we need are methods able to judge sentence pairs in isolation, independent of the (potentially misleading) context.</Paragraph>
    <Paragraph position="6"> This article describes a method for identifying parallel sentences in comparable corpora and builds on our earlier work on parallel sentence extraction (Munteanu, Fraser, and Marcu 2004). We describe how to build a maximum entropy-based classifier that can reliably judge whether two sentences are translations of each other, without making use of any context. Using this classifier, we extract parallel sentences from very large comparable corpora of newspaper articles. We demonstrate the quality of our Figure 1 A pair of comparable texts.</Paragraph>
    <Paragraph position="7">  Munteanu and Marcu Exploiting Non-Parallel Corpora extracted sentences by showing that adding them to the training data of an SMT system improves the system's performance. We also show that language pairs for which very little parallel data is available are likely to benefit the most from our method; by running our extraction system on a large comparable corpus in a bootstrapping manner, we can obtain performance improvements of more than 50% over a baseline MT system trained only on existing parallel data.</Paragraph>
    <Paragraph position="8"> Our main experimental framework is designed to address the commonly encountered situation that exists when the MT training and test data come from different domains. In such a situation, the test data is in-domain, and the training data is out-of-domain. The problem is that in such conditions, translation performance is quite poor; the out-of-domain data doesn't really help the system to produce good translations. What is needed is additional in-domain training data. Our goal is to get such data from a large in-domain comparable corpus and use it to improve the performance of an out-of-domain MT system. We work in the context of Arabic-English and Chinese-English statistical machine translation systems. Our out-of-domain data comes from translated United Nations proceedings, and our in-domain data consists of news articles. In this experimental framework we have access to a variety of resources, all of which are available from the Linguistic Data  in-domain comparable corpora: large collections of Arabic, Chinese, and English news articles from various news agencies.</Paragraph>
    <Paragraph position="9"> In summary, we call in-domain the domain of the test data that we wish to translate; in this article, that in-domain data consists of news articles. Out-of-domain data is data that belongs to any other domain; in this article, the out-of-domain data is drawn from United Nations (UN) parliamentary proceedings. We are interested in the situation that exists when we need to translate news data but only have UN data available for training. The solution we propose is to get comparable news data, automatically extract parallel sentences from it, and use these sentences as additional training data; we will show that doing this improves translation performance on a news test set. The Arabic-English and Chinese-English resources described in the previous paragraph enable us to simulate our conditions of interest and perform detailed measurements of the impact of our proposed solution. We can train baseline systems on UN parallel data (using the data from the first bullet in the previous paragraph), extract additional news data from the large comparable corpora (the fourth bullet), accurately measure translation performance on news data against four reference translations (the third bullet), and compare the impact of the automatically extracted news data with that of similar amounts of human-translated news data (the second bullet).</Paragraph>
    <Paragraph position="10"> In the next section, we give a high-level overview of our parallel sentence extraction system. In Section 3, we describe in detail the core of the system, the parallel sen- null tence classifier. In Section 4, we discuss several data extraction experiments. In Section 5, we evaluate the extracted data by showing that adding it to out-of-domain parallel data improves the in-domain performance of an out-of-domain MT system, and in Section 6, we show that in certain cases, even larger improvements can be obtained by using bootstrapping. In Section 7, we present examples of sentence pairs extracted by our method and discuss some of its weaknesses. Before concluding, we discuss related work.</Paragraph>
  </Section>
class="xml-element"></Paper>