<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1011"> <Title>Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora</Title> <Section position="3" start_page="0" end_page="81" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Recently, there has been a surge of interest in the automatic creation of parallel corpora. Several researchers (Zhao and Vogel, 2002; Vogel, 2003; Resnik and Smith, 2003; Fung and Cheung, 2004a; Wu and Fung, 2005; Munteanu and Marcu, 2005) have shown how fairly good-quality parallel sentence pairs can be automatically extracted from comparable corpora, and used to improve the performance of machine translation (MT) systems.</Paragraph> <Paragraph position="1"> This work addresses a major bottleneck in the development of Statistical MT (SMT) systems: the lack of sufficiently large parallel corpora for most language pairs. Since comparable corpora exist in large quantities and for many languages - tens of thousands of words of news describing the same events are produced daily - the ability to exploit them for parallel data acquisition is highly beneficial for the SMT field.</Paragraph> <Paragraph position="2"> Comparable corpora exhibit various degrees of parallelism. Fung and Cheung (2004a) describe corpora ranging from noisy parallel, to comparable, and finally to very non-parallel. Corpora from the last category contain &quot;... disparate, very non-parallel bilingual documents that could either be on the same topic (on-topic) or not&quot;. These are the corpora that we aim to exploit in this paper.</Paragraph> <Paragraph position="3"> Existing methods for exploiting comparable corpora look for parallel data at the sentence level. However, we believe that very non-parallel corpora have few or no good sentence pairs; most of their parallel data exists at the sub-sentential level.
As an example, consider Figure 1, which presents two news articles from the English and Romanian editions of the BBC. The articles report on the same event (the one-year anniversary of Ukraine's Orange Revolution), were published within 25 minutes of each other, and express overlapping content.</Paragraph> <Paragraph position="4"> Although they are &quot;on-topic&quot;, these two documents are non-parallel. In particular, they contain no parallel sentence pairs; methods designed to extract full parallel sentences will not find any useful data in them. Still, as the lines and boxes in the figure show, some parallel fragments do exist, but only at the sub-sentential level.</Paragraph> <Paragraph position="5"> In this paper, we present a method for extracting such parallel fragments from comparable corpora.</Paragraph> <Paragraph position="6"> Figure 2 illustrates our goals. It shows two sentences belonging to the articles in Figure 1, and highlights and connects their parallel fragments.</Paragraph> <Paragraph position="7"> Although the sentences share some common meaning, each of them has content that is not translated on the other side. The English phrase &quot;reports the BBC's Helen Fawkes in Kiev&quot; and the Romanian phrase &quot;De altfel, vorbind inaintea aniversarii&quot; have no translation correspondent, either in the other sentence or anywhere in the whole document. Since the sentence pair contains so much untranslated text, it is unlikely that any parallel sentence detection method would consider it useful. And, even if the sentences were used for MT training, considering the amount of noise they contain, they might do more harm than good for the system's performance. The best way to make use of this sentence pair is to extract and use for training just the translated (highlighted) fragments. This is the aim of our work.</Paragraph> <Paragraph position="8"> Identifying parallel sub-sentential fragments is a difficult task.
It requires the ability to recognize translational equivalence in very noisy environments, namely sentence pairs that express different (although overlapping) content. However, a good solution to this problem would have a strong impact on parallel data acquisition efforts.</Paragraph> <Paragraph position="9"> Enabling the exploitation of corpora that do not share parallel sentences would greatly increase the amount of comparable data that can be used for SMT.</Paragraph> </Section></Paper>