<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3208">
  <Title>Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Extracting Bilingual Sentences from
Very-Non-Parallel Corpora
</SectionTitle>
    <Paragraph position="0"> Existing algorithms for extracting parallel or paraphrasing sentences from comparable documents, such as Zhao and Vogel (2002), Barzilay and Elhadad (2003), and Munteanu et al. (2004), are based on the &amp;quot;find-topic-extract-sentence&amp;quot; principle, which looks for document pairs with high similarity and then looks for parallel sentences only within these documents.</Paragraph>
    <Paragraph position="1"> Based on our proposed &amp;quot;find-one-get-more&amp;quot; principle, we suggest that there are other, dissimilar documents that might contain more parallel sentences. We can iterate this whole process for improved results using a Bootstrapping method.</Paragraph>
    <Paragraph position="2"> Figure 2 outlines the algorithm in more detail. In the following sections 5.1-5.5, we describe the document pre-processing step followed by the four subsequent iterative steps of our algorithm.</Paragraph>
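    <Paragraph> As a rough illustration of the iterative structure (a hypothetical sketch, not the authors' code; the four step functions are placeholders for the steps of Sections 5.2-5.5 and are passed in by the caller):

```python
# Hypothetical skeleton of the bootstrapping loop.  The four step
# functions (document matching, sentence matching, lexicon learning,
# document re-matching) are supplied by the caller, so the names here
# are placeholders rather than the paper's actual implementation.

def bootstrap(match_docs, match_sents, learn_lex, rematch_docs,
              lexicon, max_iter=10):
    doc_pairs = match_docs(lexicon)                  # 5.2 initial matching
    sent_pairs = []
    prev_size = -1
    for _ in range(max_iter):
        sent_pairs = match_sents(doc_pairs, lexicon) # 5.3 sentence matching
        lexicon = learn_lex(sent_pairs, lexicon)     # 5.4 EM lexicon update
        doc_pairs = rematch_docs(sent_pairs)         # 5.5 find-one-get-more
        if len(sent_pairs) == prev_size:             # 5.6 convergence
            break
        prev_size = len(sent_pairs)
    return sent_pairs, lexicon
```
</Paragraph>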
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1. Document preprocessing
</SectionTitle>
      <Paragraph position="0"> The documents are word segmented with the Linguistic Data Consortium (LDC) Chinese-English dictionary 2.0. Then the Chinese document is glossed using all the dictionary entries. When a Chinese word has multiple possible translations in English, it is disambiguated by a method extended from (Fung et al., 1999).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2. Initial document matching
</SectionTitle>
      <Paragraph position="0"> This initial step is based on the same &amp;quot;find-topic-extract-sentence&amp;quot; principle as in earlier works. Its aim is to roughly match Chinese-English document pairs that share the same topic, in order to extract parallel sentences from them. As in previous work, comparability is defined by cosine similarity between document vectors.</Paragraph>
      <Paragraph position="4"> Both the glossed Chinese documents and the English documents are represented as word vectors with term weights. We evaluated different term-weighting schemes for each word in the corpus: term frequency (tf); inverse document frequency (idf); tf.idf; and the product of a function of tf and idf. The &amp;quot;documents&amp;quot; here are sentences. We find that using idf alone gives the best sentence pair ranking. This is probably because the frequencies of bilingual word pairs are not comparable in a very-non-parallel corpus.</Paragraph>
      <Paragraph position="5"> Pair-wise similarities are calculated for all possible Chinese-English document pairs, and bilingual documents with similarities above a certain threshold are considered to be comparable. For very-non-parallel corpora, this document-matching step also serves as topic alignment.</Paragraph>
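      <Paragraph> The idf-weighted matching described above can be sketched as follows (a hypothetical illustration; function names, token representation, and the 0.3 threshold are all assumptions, not the paper's code):

```python
import math

# Hypothetical sketch of idf-only document matching; function and
# variable names are illustrative, not from the paper.

def idf_weights(docs):
    """idf per term; each doc is a list of (glossed) tokens."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n / d) for t, d in df.items()}

def idf_vector(doc, idf):
    # idf-only weighting: every term present gets its idf, ignoring tf
    return {t: idf.get(t, 0.0) for t in set(doc)}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def match_documents(glossed_zh, en, idf, threshold=0.3):
    """All pairwise similarities; keep pairs above the threshold."""
    pairs = []
    for i, zh_doc in enumerate(glossed_zh):
        for j, en_doc in enumerate(en):
            sim = cosine(idf_vector(zh_doc, idf), idf_vector(en_doc, idf))
            if sim > threshold:
                pairs.append((i, j, sim))
    return pairs
```
</Paragraph>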
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3. Sentence matching
</SectionTitle>
      <Paragraph position="0"> Again based on the &amp;quot;find-topic-extract-sentence&amp;quot; principle, we extract parallel sentences from the matched English and Chinese documents. Each sentence is again represented as word vectors. For each extracted document pair, pair-wise cosine similarities are calculated for all possible Chinese-English sentence pairs. Sentence pairs above a set threshold are considered parallel and extracted from the documents. Sentence similarity is based on the number of words in the two sentences that are translations of each other. The better our bilingual lexicon is, the more accurate the sentence similarity will be. In the following section, we discuss how to find new word translations.</Paragraph>
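      <Paragraph> A minimal sketch of lexicon-based sentence matching, under the stated idea that similarity depends on the number of mutual translations (all names and the 0.5 threshold are hypothetical; the score here is a Dice-style overlap rather than the paper's exact cosine):

```python
# Hypothetical sketch: score a sentence pair by the proportion of words
# that are translations of each other under the current lexicon.

def sentence_similarity(zh_tokens, en_tokens, lexicon):
    en_set = set(en_tokens)
    matched = 0
    for z in zh_tokens:
        if lexicon.get(z, set()).intersection(en_set):
            matched += 1
    total = len(zh_tokens) + len(en_tokens)
    if total == 0:
        return 0.0
    return 2.0 * matched / total   # Dice-style overlap score

def extract_parallel(doc_pairs, lexicon, threshold=0.5):
    """doc_pairs: list of (zh_sentences, en_sentences) per matched pair."""
    out = []
    for zh_sents, en_sents in doc_pairs:
        for zs in zh_sents:
            for es in en_sents:
                if sentence_similarity(zs, es, lexicon) > threshold:
                    out.append((zs, es))
    return out
```
</Paragraph>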
      <Paragraph position="1"> 5.4. EM lexical learning from matched sentence pairs
This step updates the bilingual lexicon according to the intermediate results of parallel sentence extraction. New bilingual word pairs are learned from the extracted sentence pairs based on an EM learning method. We use the GIZA++ (Och and Ney, 2000) implementation of the IBM statistical translation lexicon Model 4 (Brown et al., 1993) for this purpose.</Paragraph>
      <Paragraph position="2"> This model is based on the conditional probability of a source word being generated by the target word in the other language, based on EM estimation from aligned sentences. Zhao and Vogel (2002) showed that this model lends itself to adaptation and can provide better vocabulary coverage and better sentence alignment probability estimation. In our work, we use this model on the intermediate results of parallel sentence extraction, i.e. on a set of aligned sentence pairs that may or may not truly correspond to each other.</Paragraph>
      <Paragraph position="3"> We found that sentence pairs with high alignment scores are not necessarily more similar than others. This might be because the EM estimation at each intermediate step is unreliable, since only a small number of the aligned sentences are truly parallel. The EM learner is therefore weak when applied to bilingual sentences from a very-non-parallel corpus. We decided to try initializing the EM estimation with parallel corpora, as in Zhao and Vogel (2002). The results are discussed in Section 6.</Paragraph>
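      <Paragraph> The EM update can be illustrated with the much simpler IBM Model 1 (the system itself trains Model 4 via GIZA++); this toy sketch is only meant to show the expected-count and renormalization steps on a set of aligned sentence pairs:

```python
from collections import defaultdict

# Toy EM for IBM Model 1 translation probabilities: a much simpler
# stand-in for the Model 4 training that GIZA++ performs, shown only
# to illustrate the E/M updates on aligned sentence pairs.

def model1_em(sentence_pairs, iterations=5):
    en_vocab = {e for _, es in sentence_pairs for e in es}
    uniform = 1.0 / len(en_vocab)
    t = defaultdict(lambda: uniform)          # t[(e, z)] = P(e | z)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for zs, es in sentence_pairs:
            for e in es:                      # E-step: expected counts
                norm = sum(t[(e, z)] for z in zs)
                for z in zs:
                    frac = t[(e, z)] / norm
                    count[(e, z)] += frac
                    total[z] += frac
        for (e, z), c in count.items():       # M-step: renormalize
            t[(e, z)] = c / total[z]
    return t
```
</Paragraph>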
      <Paragraph position="4"> 5.5. Document re-matching: find-one-get-more
This step augments the earlier matched documents by the &amp;quot;find-one-get-more&amp;quot; principle. From the set of aligned sentence pairs, we look for other documents, judged to be dissimilar in the first step, that contain one or more of these sentence pairs. We further find other documents that are similar to each of the monolingual documents found. This new set of documents is likely to be off-topic, yet contains segments that are on-topic. Following our new alignment principle, we believe that these documents might still contain more parallel sentence candidates for subsequent iterations. The algorithm then iterates to refine document matching and parallel sentence extraction.</Paragraph>
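      <Paragraph> The two-stage expansion just described can be sketched as follows (a hypothetical illustration: documents are lists of sentences, and doc_sim stands in for any monolingual similarity measure such as the cosine of Section 5.2):

```python
# Hypothetical sketch of find-one-get-more re-matching.

def rematch_documents(extracted_pairs, zh_docs, en_docs, doc_sim,
                      sim_threshold=0.3):
    # Step 1: bring back any document, however dissimilar cross-lingually,
    # that contains one of the extracted parallel sentences.
    zh_hits = {i for i, doc in enumerate(zh_docs)
               if any(zs in doc for zs, _ in extracted_pairs)}
    en_hits = {j for j, doc in enumerate(en_docs)
               if any(es in doc for _, es in extracted_pairs)}
    # Step 2: expand each hit with monolingually similar documents.
    for i in list(zh_hits):
        for k in range(len(zh_docs)):
            if k != i and doc_sim(zh_docs[i], zh_docs[k]) > sim_threshold:
                zh_hits.add(k)
    for j in list(en_hits):
        for k in range(len(en_docs)):
            if k != j and doc_sim(en_docs[j], en_docs[k]) > sim_threshold:
                en_hits.add(k)
    # Candidate document pairs for the next sentence-matching pass.
    return [(i, j) for i in sorted(zh_hits) for j in sorted(en_hits)]
```
</Paragraph>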
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.6. Convergence
</SectionTitle>
      <Paragraph position="0"> The IBM model parameters, including sentence alignment scores and word alignment scores, are computed in each iteration. The parameter values eventually stabilize, and the set of extracted bilingual sentence pairs also converges to a fixed size. The system then stops and outputs the final set of bilingual sentence pairs.</Paragraph>
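      <Paragraph> One concrete (hypothetical) reading of this stopping criterion, with an assumed tolerance on score drift:

```python
# Hypothetical convergence test: the extracted pair set is unchanged
# and the alignment scores have stopped moving between iterations.

def has_converged(prev_pairs, new_pairs, prev_scores, new_scores, tol=1e-4):
    if set(new_pairs) != set(prev_pairs):
        return False
    keys = set(prev_scores).union(new_scores)
    if not keys:
        return True
    drift = max(abs(new_scores.get(k, 0.0) - prev_scores.get(k, 0.0))
                for k in keys)
    return tol > drift
```
</Paragraph>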
    </Section>
  </Section>
</Paper>