<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3208">
  <Title>Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Parallel sentences are important resources for training and improving statistical machine translation and cross-lingual information retrieval systems.</Paragraph>
    <Paragraph position="1"> Various methods have been previously proposed to extract parallel sentences from multilingual corpora.</Paragraph>
    <Paragraph position="2"> Some of them are described in detail in Manning and Schütze (1999), Wu (2001), and Veronis (2001). The difficulty of these tasks varies with the degree of parallel-ness of the input multilingual documents.</Paragraph>
    <Paragraph position="3"> However, the non-parallel corpora used in previous work tend to be quite comparable. Zhao and Vogel (2002) used a corpus of Chinese and English versions of news stories from the Xinhua News agency, with &quot;roughly similar sentence order of content&quot;. This corpus can be more accurately described as a noisy parallel corpus. Barzilay and Elhadad (2003) mined paraphrasing sentences from weather reports. Munteanu et al. (2004) used news articles published within the same 5-day window. All these corpora have documents on the same, matching topics. They can be described as on-topic documents. In fact, both Zhao and Vogel (2002) and Barzilay and Elhadad (2003) assumed similar sentence orders and applied dynamic programming in their work.</Paragraph>
    <Paragraph position="4"> In our work, we try to find parallel sentences from far more disparate, very-non-parallel corpora than in any previous work. Since many more multilingual texts available today contain documents that do not have matching documents in the other language, we propose finding more parallel sentences from off-topic documents, as well as on-topic documents.</Paragraph>
    <Paragraph position="5"> An example is the TDT corpus, which is an aggregation of multiple news sources from different time periods. We suggest the &quot;find-one-get-more&quot; principle, which states that as long as two documents are found to contain one pair of parallel sentences, they must contain others as well. Based on this principle, we propose an effective bootstrapping method to accomplish our task (Figure 1).</Paragraph>
    <Paragraph position="6"> We also apply IBM Model 4 EM lexical learning to find unknown word translations from the parallel sentences extracted by our system. The IBM models are commonly used for word alignment in statistical MT systems. This EM method differs from some previous work, which used a seed-word lexicon to extract new word translations or word senses from comparable corpora (Rapp 1995, Fung &amp; McKeown 1997, Grefenstette 1998, Fung and Lo 1998, Kikui 1999, Kaji 2003).</Paragraph>
    <Paragraph position="7"> 2. Bilingual Sentence Alignment  There have been conflicting definitions of the term &quot;comparable corpora&quot; in the research community. In this paper, we contrast and analyze different bilingual corpora, ranging from parallel and noisy parallel to comparable and very-non-parallel corpora.</Paragraph>
    <Paragraph position="8"> A parallel corpus is a sentence-aligned corpus containing bilingual translations of the same document. The Hong Kong Laws Corpus is a parallel corpus with manually aligned sentences, and is used as a parallel sentence resource for statistical machine translation systems. There are 313,659 sentence pairs in Chinese and English. Alignment of parallel sentences from this type of database has been the focus of research throughout the last decade and can be accomplished by many off-the-shelf, publicly available alignment tools.</Paragraph>
    <Paragraph position="9"> A noisy parallel corpus, sometimes also called a &quot;comparable&quot; corpus, contains non-aligned sentences that are nevertheless mostly bilingual translations of the same document. Fung and McKeown (1997), Kikui (1999), and Zhao and Vogel (2002) extracted bilingual word senses, lexicons and parallel sentence pairs from such corpora. A corpus such as Hong Kong News contains documents that are in fact rough translations of each other, focused on the same thematic topics, with some insertions and deletions of paragraphs.</Paragraph>
    <Paragraph position="10"> Another type of comparable corpus is one that contains non-sentence-aligned, non-translated bilingual documents that are topic-aligned. For example, newspaper articles from two sources in different languages, within the same window of published dates, can constitute a comparable corpus. Rapp (1995), Grefenstette (1998), Fung and Lo (1998), and Kaji (2003) derived bilingual lexicons or word senses from such corpora. Munteanu et al. (2004) constructed a comparable corpus of Arabic and English news stories by matching the publishing dates of the articles.</Paragraph>
    <Paragraph position="11"> Finally, a very-non-parallel corpus is one that contains far more disparate bilingual documents that may be either on the same topic (in-topic) or not (off-topic). The TDT3 Corpus is such a corpus. It contains transcriptions of various news stories from radio broadcasts or TV news reports from 1998-2000 in English and Chinese. In this corpus, there are about 7,500 Chinese and 12,400 English documents, covering around 60 different topics. Among these, 1,200 Chinese and 4,500 English documents are manually marked as being in-topic. The remaining documents are marked as off-topic, as they are either only weakly relevant to a topic or irrelevant to all topics in the existing documents. Among the in-topic documents, most are found to have high similarity. A few of the Chinese and English passages are almost translations of each other. Nevertheless, the considerable number of off-topic documents gives rise to a greater variety of sentences in terms of content and structure. Overall, the TDT3 corpus contains 110,000 Chinese sentences and 290,000 English sentences. Some of the bilingual sentences are translations of each other, while others are bilingual paraphrases. Our proposed method is a first approach that can extract bilingual sentence pairs from this type of very-non-parallel corpus.</Paragraph>
    <Paragraph position="12">  3. Comparing bilingual corpora  To quantify the parallel-ness or comparability of bilingual corpora, we propose using a lexical matching score computed from the bilingual word pairs occurring in the bilingual sentence pairs.</Paragraph>
    <Paragraph position="13"> Matching bilingual sentence pairs are extracted from different corpora using existing methods and the proposed method.</Paragraph>
    <Paragraph position="14"> We then identify bilingual word pairs that appear in the matched sentence pairs by using a bilingual lexicon (bilexicon). The lexical matching score is then defined as the sum of the mutual information scores of a known set of word pairs that appear in the corpus:
$$S(W_c, W_e) = \frac{f(W_c, W_e)}{f(W_c)\, f(W_e)}, \qquad S = \sum_{(W_c, W_e)} S(W_c, W_e)$$
where $f(W_c, W_e)$ is the co-occurrence frequency of bilexicon pair $(W_c, W_e)$ in the matched sentence pairs, and $f(W_c)$ and $f(W_e)$ are the occurrence frequencies of Chinese word $W_c$ and English word $W_e$ in the bilingual corpus.</Paragraph>
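The lexical matching score can be illustrated with a minimal Python sketch. It assumes each pair's mutual information score is the plain ratio f(Wc, We) / (f(Wc) f(We)) with no logarithm, and, as a simplification, it counts f(Wc) and f(We) over the supplied sentence pairs only rather than over a whole corpus. The function name `lexical_matching_score` and the input layout (token lists plus a set of lexicon pairs) are hypothetical choices made for illustration, not part of the original work.

```python
from collections import Counter


def lexical_matching_score(sentence_pairs, bilexicon):
    """Sum f(Wc, We) / (f(Wc) * f(We)) over bilexicon pairs seen in matched sentences.

    sentence_pairs: iterable of (chinese_tokens, english_tokens) lists.
    bilexicon: set of (chinese_word, english_word) tuples (assumed layout).
    """
    pair_freq = Counter()  # f(Wc, We): co-occurrences inside matched sentence pairs
    zh_freq = Counter()    # f(Wc): occurrences of each Chinese word
    en_freq = Counter()    # f(We): occurrences of each English word

    for zh_tokens, en_tokens in sentence_pairs:
        zh_freq.update(zh_tokens)
        en_freq.update(en_tokens)
        zh_set, en_set = set(zh_tokens), set(en_tokens)
        # Count a lexicon pair once per sentence pair in which both words occur.
        for wc, we in bilexicon:
            if wc in zh_set and we in en_set:
                pair_freq[(wc, we)] += 1

    return sum(
        count / (zh_freq[wc] * en_freq[we])
        for (wc, we), count in pair_freq.items()
    )


if __name__ == "__main__":
    pairs = [(["mao", "chi"], ["cat", "eats"]), (["gou"], ["dog"])]
    lexicon = {("mao", "cat"), ("gou", "dog")}
    print(lexical_matching_score(pairs, lexicon))
```

Under these assumptions, a word that co-occurs with its translation every time it appears contributes a full point, while a word that often appears without its translation contributes proportionally less.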
    <Paragraph position="15"> Table 1 shows the lexical matching scores of a parallel corpus (Hong Kong Laws), a comparable (noisy parallel) corpus (Hong Kong News), and a very-non-parallel corpus (TDT3). We can see that the more parallel or comparable the corpus, the higher the overall lexical matching score.</Paragraph>
    <Paragraph position="16">  4. Comparing alignment principles  It is well known that existing work on sentence alignment from parallel corpora makes use of one or more of the following principles (Manning and Schütze, 1999, Somers 2001):
* The two sentences of a bilingual pair are similar in length in the two languages;
* Sentences are assumed to correspond to those at roughly the same position in the other language;
* A pair of bilingual sentences that contain more words that are translations of each other tend to be translations themselves. Conversely, the context sentences of translated word pairs are similar.</Paragraph>
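The three principles above can each be read as a simple similarity feature. The following Python sketch is only an illustration under stated assumptions: the function names, the token-count notion of length, and the dictionary-of-sets lexicon layout are all hypothetical, and none of the cited aligners is claimed to compute exactly these quantities.

```python
def length_similarity(src_len, tgt_len):
    """Length principle: ratio of the shorter to the longer sentence length."""
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def position_similarity(src_idx, src_total, tgt_idx, tgt_total):
    """Position principle: 1 minus the gap between relative positions."""
    src_rel = src_idx / max(src_total, 1)
    tgt_rel = tgt_idx / max(tgt_total, 1)
    return 1.0 - abs(src_rel - tgt_rel)


def lexical_overlap(src_tokens, tgt_tokens, bilexicon):
    """Lexical principle: share of source tokens whose translation is in the target.

    bilexicon maps a source word to a set of target translations (assumed layout).
    """
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(
        1 for word in src_tokens
        if not bilexicon.get(word, set()).isdisjoint(tgt_set)
    )
    return hits / len(src_tokens)


if __name__ == "__main__":
    lexicon = {"mao": {"cat"}, "chi": {"eats"}}
    print(length_similarity(4, 8))
    print(position_similarity(5, 10, 5, 20))
    print(lexical_overlap(["mao", "chi"], ["cat", "sleeps"], lexicon))
```

Keeping the features separate makes it easy to drop the length and position terms and rely on the lexical term alone when the input is far from parallel.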
    <Paragraph position="17"> For noisy parallel corpora, sentence alignment is based on embedded content words. The word alignment principles used in previous work are as follows:  . We have also learned that as bilingual corpora become less parallel, it is better to rely on lexical information rather than sentence length and position information.</Paragraph>
    <Paragraph position="18"> For comparable corpora, the alignment principle used in previous work is as follows:
* Parallel sentences only exist in document pairs with high similarity scores - &quot;find-topic-extract-sentence&quot;
We take a step further and propose a new principle for our task:
* Documents that are found to contain at least one pair of parallel sentences are likely to contain more parallel sentences - &quot;find-one-get-more&quot;</Paragraph>
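One way to picture the find-one-get-more principle is a two-threshold scan: a document pair is kept only if at least one sentence pair passes a strict threshold, and once it is kept, it is rescanned with a relaxed threshold to collect more pairs. The Python sketch below is an assumption-laden illustration, not the bootstrapping procedure of Figure 1 (which is not part of this text); the function name `find_one_get_more`, the `sent_score` callback, and both threshold values are hypothetical.

```python
def find_one_get_more(doc_pairs, sent_score, strict=0.8, relaxed=0.5):
    """Collect sentence pairs from document pairs that show at least one strong match.

    doc_pairs: list of (src_sentences, tgt_sentences) tuples.
    sent_score: callable returning a similarity in [0, 1] for two sentences.
    Returns a set of (doc_index, src_index, tgt_index) triples.
    """
    extracted = set()
    for doc_idx, (src_sents, tgt_sents) in enumerate(doc_pairs):
        scored = [
            (sent_score(src, tgt), i, j)
            for i, src in enumerate(src_sents)
            for j, tgt in enumerate(tgt_sents)
        ]
        # "Find one": require at least one pair above the strict threshold.
        if not any(score >= strict for score, _, _ in scored):
            continue
        # "Get more": accept further pairs from the same documents at a relaxed threshold.
        extracted.update(
            (doc_idx, i, j) for score, i, j in scored if score >= relaxed
        )
    return extracted


if __name__ == "__main__":
    def jaccard(a, b):
        left, right = set(a.split()), set(b.split())
        return len(left.intersection(right)) / len(left.union(right))

    docs = [
        (["a b c d", "e f g h"], ["a b c d", "e f x y"]),
        (["p q r s"], ["p q z w"]),
    ]
    print(sorted(find_one_get_more(docs, jaccard, strict=0.8, relaxed=0.3)))
```

In the toy data, the weak match in the second document pair is ignored because nothing there passes the strict threshold, while an equally weak match in the first document pair is accepted because a strong match was already found alongside it.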
  </Section>
</Paper>