<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1011">
  <Title>Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora</Title>
  <Section position="4" start_page="81" end_page="84" type="metho">
    <SectionTitle>
2 Finding Parallel Sub-Sentential Fragments in Comparable Corpora
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="81" end_page="82" type="sub_section">
      <SectionTitle>
2.1 Introduction
</SectionTitle>
      <Paragraph position="0"> The high-level architecture of our parallel fragment extraction system is presented in Figure 3.</Paragraph>
      <Paragraph position="1"> The first step of the pipeline identifies document pairs that are similar (and therefore more likely to contain parallel data), using the Lemur information retrieval toolkit1 (Ogilvie and Callan, 2001); each document in the source language is translated word-for-word and turned into a query, which is run against the collection of target language documents. The top 20 results are retrieved and paired with the query document. We then take all sentence pairs from these document pairs and run them through the second step in the pipeline, the candidate selection filter. This step discards pairs which have very few words that are translations of each other. To all remaining sentence pairs we apply the fragment detection method (described in Section 2.3), which produces the output of the system.</Paragraph>
      <Paragraph position="2"> We use two probabilistic lexicons, learned au- null tomatically from the same initial parallel corpus. The first one, GIZA-Lex, is obtained by running the GIZA++2 implementation of the IBM word alignment models (Brown et al., 1993) on the initial parallel corpus. One of the characteristics of this lexicon is that each source word is associated with many possible translations. Although most of its high-probability entries are good translations, there are a lot of entries (of non-negligible probability) where the two words are at most related. As an example, in our GIZA-Lex lexicon, each source word has an average of 12 possible translations.</Paragraph>
      <Paragraph position="3"> This characteristic is useful for the first two stages of the extraction pipeline, which are not intended to be very precise. Their purpose is to accept most of the existing parallel data, and not too much of the non-parallel data; using such a lexicon helps achieve this purpose.</Paragraph>
      <Paragraph position="4"> For the last stage, however, precision is paramount. We found empirically that when using GIZA-Lex, the incorrect correspondences that it contains seriously impact the quality of our results; we therefore need a cleaner lexicon. In addition, since we want to distinguish between source words that have a translation on the target side and words that do not, we also need a measure of the probability that two words are not translations of each other. All these are part of our second lexicon, LLR-Lex, which we present in detail in Section 2.2. Subsequently, in Section 2.3, we present our algorithm for detecting parallel sub-sentential fragments.</Paragraph>
    </Section>
    <Section position="2" start_page="82" end_page="83" type="sub_section">
      <SectionTitle>
2.2 Using Log-Likelihood-Ratios to Estimate
Word Translation Probabilities
</SectionTitle>
      <Paragraph position="0"> Our method for computing the probabilistic translation lexicon LLR-Lex is based on the the Log-</Paragraph>
      <Paragraph position="2"> which has also been used by Moore (2004a; 2004b) and Melamed (2000) as a measure of word association. Generally speaking, this statistic gives a measure of the likelihood that two samples are not independent (i.e. generated by the same probability distribution). We use it to estimate the independence of pairs of words which cooccur in our parallel corpus.</Paragraph>
      <Paragraph position="3"> If source word a0 and target word a1 are independent (i.e. they are not translations of each other), we would expect that a2a4a3a5a1a7a6a0a9a8a11a10 a2a4a3a5a1a7a6a13a12 a0a9a8a14a10 a2a4a3a5a1 a8 , i.e. the distribution of a1 given that a0 is present is the same as the distribution of a1 when a0 is not present. The LLR statistic gives a measure of the likelihood of this hypothesis. The LLR score of a word pair is low when these two distributions are very similar (i.e. the words are independent), and high otherwise (i.e. the words are strongly associated). However, high LLR scores can indicate either a positive association (i.e. a2a4a3a5a1a15a6a0a9a8a17a16 a2a4a3a5a1a7a6a13a12 a0a9a8 ) or a negative one; and we can distinguish between them by checking whether a2a4a3a5a1a19a18 a0a9a8a20a16 a2a4a3a5a1 a8 a2a21a3 a0a9a8 . Thus, we can split the set of cooccurring word pairs into positively and negatively associated pairs, and obtain a measure for each of the two association types. The first type of association will provide us with our (cleaner) lexicon, while the second will allow us to estimate probabilities of words not being translations of each other.</Paragraph>
      <Paragraph position="4"> Before describing our new method more formally, we address the notion of word cooccurrence. In the work of Moore (2004a) and Melamed (2000), two words cooccur if they are present in a pair of aligned sentences in the parallel training corpus. However, most of the words from aligned sentences are actually unrelated; therefore, this is a rather weak notion of cooccurrence. We follow Resnik et. al (2001) and adopt a stronger definition, based not on sentence alignment but on word alignment: two words cooccur if they are linked together in the word-aligned parallel training corpus. We thus make use of the significant amount of knowledge brought in by the word alignment procedure.</Paragraph>
      <Paragraph position="5"> We compute a22a23a22a25a24a26a3a5a1a19a18 a0a9a8 , the LLR score for words a1 and a0 , using the formula presented by Moore (2004b), which we do not repeat here due to lack of space. We then use these values to compute two conditional probability distributions:</Paragraph>
      <Paragraph position="7"> a0a9a8 , the probability that source word a0 trans- null lates into target word a1 , and a27a1a0 a3a5a1a15a6a0a9a8 , the probability that a0 does not translate into a1 . We obtain the distributions by normalizing the LLR scores for each source word.</Paragraph>
      <Paragraph position="8"> The whole procedure follows: a2 Word-align the parallel corpus. Following Och and Ney (2003), we run GIZA++ in both directions, and then symmetrize the alignments using the refined heuristic.</Paragraph>
      <Paragraph position="9">  we reverse the source and target languages and repeat the procedure.</Paragraph>
      <Paragraph position="10"> As we mentioned above, in GIZA-Lex the average number of possible translations for a source word is 12. In LLR-Lex that average is 5, which is a significant decrease.</Paragraph>
    </Section>
    <Section position="3" start_page="83" end_page="84" type="sub_section">
      <SectionTitle>
2.3 Detecting Parallel Sub-Sentential
Fragments
</SectionTitle>
      <Paragraph position="0"> Intuitively speaking, our method tries to distinguish between source fragments that have a translation on the target side, and fragments that do not.</Paragraph>
      <Paragraph position="1"> In Figure 4 we show the sentence pair from Figure 2, in which we have underlined those words of each sentence that have a translation in the other sentence, according to our lexicon LLR-Lex. The phrases &amp;quot;to focus on the past year's achievements, which,&amp;quot; and &amp;quot;sa se concentreze pe succesele anului trecut, care,&amp;quot; are mostly underlined (the lexicon is unaware of the fact that &amp;quot;achievements&amp;quot; and &amp;quot;succesele&amp;quot; are in fact translations of each other, because &amp;quot;succesele&amp;quot; is a morphologically inflected form which does not cooccur with &amp;quot;achievements&amp;quot; in our initial parallel corpus). The rest of the sentences are mostly not underlined, although we do have occasional connections, some correct and some wrong. The best we can do in this case is to infer that these two phrases are parallel, and discard the rest. Doing this gains us some new knowledge: the lexicon entry (achievements, succesele).</Paragraph>
      <Paragraph position="2"> We need to quantify more precisely the notions of &amp;quot;mostly translated&amp;quot; and &amp;quot;mostly not translated&amp;quot;. Our approach is to consider the target sentence as a numeric signal, where translated words correspond to positive values (coming from the a27 a28 distribution described in the previous Section), and the others to negative ones (coming from the a27a8a0 distribution). We want to retain the parts of the sentence where the signal is mostly positive. This can be achieved by applying a smoothing filter to the signal, and selecting those fragments of the sentence for which the corresponding filtered values are positive.</Paragraph>
      <Paragraph position="3"> The details of the procedure are presented below, and also illustrated in Figure 5. Let the Romanian sentence be the source sentence a9 , and the English one be the target, a10 . We compute a word alignment a9a12a11 a10 by greedily linking each English word with its best translation candidate from the Romanian sentence. For each of the linked target words, the corresponding signal value is the probability of the link (there can be at most one link for each target word). Thus, if target word a1 is linked to source word a0 , the signal value corresponding to a1 is a27 a28 a3a5a1a7a6a0a9a8 (the distribution described in Section 2.2), i.e. the probability that a1 is the translation of a0 .</Paragraph>
      <Paragraph position="4"> For the remaining target words, the signal value should reflect the probability that they are not  and target sentence together with their alignment. Above are displayed the initial signal and the filtered signal. The circles indicate which fragments of the target sentence are selected by the procedure. translated; for this, we employ the a27 a0 distribution. Thus, for each non-linked target word a1 , we look for the source word least likely to be its nontranslation: a0a1a0 a10a3a2a5a4a7a6a5a8a10a9a12a11a14a13a16a15a7a17 a27 a0 a3a5a1a7a6a0a9a8 . If a0a1a0 exists, we set the signal value for a1 to a18 a27a1a0 a3a5a1a15a6a0a16a0 a8 ; otherwise, we set it to a18a20a19 . This is the initial signal. We obtain the filtered signal by applying an averaging filter, which sets the value at each point to be the average of several values surrounding it.</Paragraph>
      <Paragraph position="5"> In our experiments, we use the surrounding 5 values, which produced good results on a development set. We then simply retain the &amp;quot;positive fragments&amp;quot; of a10 , i.e. those fragments for which the corresponding filtered signal values are positive.</Paragraph>
      <Paragraph position="6"> However, this approach will often produce short &amp;quot;positive fragments&amp;quot; which are not, in fact, translated in the source sentence. An example of this is the fragment &amp;quot;, reports&amp;quot; from Figure 5, which although corresponds to positive values of the filtered signal, has no translation in Romanian. In an attempt to avoid such errors, we disregard fragments with less than 3 words.</Paragraph>
      <Paragraph position="7"> We repeat the procedure in the other direction</Paragraph>
      <Paragraph position="9"> consider the resulting two text chunks as parallel.</Paragraph>
      <Paragraph position="10"> For the sentence pair from Figure 5, our system will output the pair: people to focus on the past year's achievements, which, he says sa se concentreze pe succesele anului trecut, care, printre</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="84" end_page="86" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> In our experiments, we compare our fragment extraction method (which we call FragmentExtract) with the sentence extraction approach of Munteanu and Marcu (2005) (SentenceExtract).</Paragraph>
    <Paragraph position="1"> All extracted datasets are evaluated by using them as additional MT training data and measuring their impact on the performance of the MT system.</Paragraph>
    <Section position="1" start_page="84" end_page="85" type="sub_section">
      <SectionTitle>
3.1 Corpora
</SectionTitle>
      <Paragraph position="0"> We perform experiments in the context of Romanian to English machine translation. We use two initial parallel corpora. One is the training data for the Romanian-English word alignment task from the Workshop on Building and Using Parallel Corpora3 which has approximately 1M English words. The other contains additional data  from the Romanian translations of the European Union's acquis communautaire which we mined from the Web, and has about 10M English words.</Paragraph>
      <Paragraph position="1"> We downloaded comparable data from three on-line news sites: the BBC, and the Romanian newspapers &amp;quot;Evenimentul Zilei&amp;quot; and &amp;quot;Ziua&amp;quot;. The BBC corpus is precisely the kind of corpus that our method is designed to exploit. It is truly nonparallel; as our example from Figure 1 shows, even closely related documents have few or no parallel sentence pairs. Therefore, we expect that our extraction method should perform best on this corpus. null The other two sources are fairly similar, both in genre and in degree of parallelism, so we group them together and refer to them as the EZZ corpus. This corpus exhibits a higher degree of parallelism than the BBC one; in particular, it contains many article pairs which are literal translations of each other. Therefore, although our sub-sentence extraction method should produce useful data from this corpus, we expect the sentence extraction method to be more successful. Using this second corpus should help highlight the strengths and weaknesses of our approach.</Paragraph>
      <Paragraph position="2"> Table 1 summarizes the relevant information concerning these corpora.</Paragraph>
    </Section>
    <Section position="2" start_page="85" end_page="85" type="sub_section">
      <SectionTitle>
3.2 Extraction Experiments
</SectionTitle>
      <Paragraph position="0"> On each of our comparable corpora, and using each of our initial parallel corpora, we apply both the fragment extraction and the sentence extraction method of Munteanu and Marcu (2005).</Paragraph>
      <Paragraph position="1"> In order to evaluate the importance of the LLR-Lex lexicon, we also performed fragment extraction experiments that do not use this lexicon, but only GIZA-Lex. Thus, for each initial parallel corpus and each comparable corpus, we extract three datasets: FragmentExtract, SentenceExtract, and Fragment-noLLR. The sizes of the extracted datasets, measured in million English tokens, are presented in Table 2.</Paragraph>
    </Section>
    <Section position="3" start_page="85" end_page="86" type="sub_section">
      <SectionTitle>
3.3 SMT Performance Results
</SectionTitle>
      <Paragraph position="0"> We evaluate our extracted corpora by measuring their impact on the performance of an SMT system. We use the initial parallel corpora to train Baseline systems; and then train comparative systems using the initial corpora plus: the FragmentExtract corpora; the SentenceExtract corpora; and the FragmentExtract-noLLR corpora. In order to verify whether the fragment and sentence detection method complement each other, we also train a Fragment+Sentence system, on the initial corpus plus FragmentExtract and SentenceExtract. null All MT systems are trained using a variant of the alignment template model of Och and Ney (2004). All systems use the same 2 language models: one trained on 800 million English tokens, and one trained on the English side of all our parallel and comparable corpora. This ensures that differences in performance are caused only by differences in the parallel training data.</Paragraph>
      <Paragraph position="1"> Our test data consists of news articles from the Time Bank corpus, which were translated into Romanian, and has 1000 sentences. Translation performance is measured using the automatic BLEU (Papineni et al., 2002) metric, on one reference translation. We report BLEU% numbers, i.e. we multiply the original scores by 100. The 95% confidence intervals of our scores, computed by bootstrap resampling (Koehn, 2004), indicate that a score increase of more than 1 BLEU% is statistically significant.</Paragraph>
      <Paragraph position="2"> The scores are presented in Figure 6. On the BBC corpus, the fragment extraction method produces statistically significant improvements over the baseline, while the sentence extraction method does not. Training on both datasets together brings further improvements. This indicates that this corpus has few parallel sentences, and that by going to the sub-sentence level we make better use of it. On the EZZ corpus, although our method brings improvements in the BLEU score, the sen- null since most of the parallel data in this corpus exists at sentence level, the extracted fragments cannot bring much additional knowledge.</Paragraph>
      <Paragraph position="3"> The Fragment-noLLR datasets bring no translation performance improvements; moreover, when the initial corpus is small (1M words) and the comparable corpus is noisy (BBC), the data has a negative impact on the BLEU score. This indicates that LLR-Lex is a higher-quality lexicon than GIZA-Lex, and an important component of our method.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="86" end_page="86" type="metho">
    <SectionTitle>
4 Previous Work
</SectionTitle>
    <Paragraph position="0"> Much of the work involving comparable corpora has focused on extracting word translations (Fung and Yee, 1998; Rapp, 1999; Diab and Finch, 2000; Koehn and Knight, 2000; Gaussier et al., 2004; Shao and Ng, 2004; Shinyama and Sekine, 2004).</Paragraph>
    <Paragraph position="1"> Another related research effort is that of Resnik and Smith (2003), whose system is designed to discover parallel document pairs on the Web.</Paragraph>
    <Paragraph position="2"> Our work lies between these two directions; we attempt to discover parallelism at the level of fragments, which are longer than one word but shorter than a document. Thus, the previous research most relevant to this paper is that aimed at mining comparable corpora for parallel sentences.</Paragraph>
    <Paragraph position="3"> The earliest efforts in this direction are those of Zhao and Vogel (2002) and Utiyama and Isahara (2003). Both methods extend algorithms designed to perform sentence alignment of parallel texts: they use dynamic programming to do sentence alignment of documents hypothesized to be similar. These approaches are only applicable to corpora which are at most &amp;quot;noisy-parallel&amp;quot;, i.e. contain documents which are fairly similar, both in content and in sentence ordering.</Paragraph>
    <Paragraph position="4"> Munteanu and Marcu (2005) analyze sentence pairs in isolation from their context, and classify them as parallel or non-parallel. They match each source document with several target ones, and classify all possible sentence pairs from each document pair. This enables them to find sentences from fairly dissimilar documents, and to handle any amount of reordering, which makes the method applicable to truly comparable corpora.</Paragraph>
    <Paragraph position="5"> The research reported by Fung and Cheung (2004a; 2004b), Cheung and Fung (2004) and Wu and Fung (2005) is aimed explicitly at &amp;quot;very non-parallel corpora&amp;quot;. They also pair each source document with several target ones and examine all possible sentence pairs; but the list of document pairs is not fixed. After one round of sentence extraction, the list is enriched with additional documents, and the system iterates. Thus, they include in the search document pairs which are dissimilar.</Paragraph>
    <Paragraph position="6"> One limitation of all these methods is that they are designed to find only full sentences. Our methodology is the first effort aimed at detecting sub-sentential correspondences. This is a difficult task, requiring the ability to recognize translationally equivalent fragments even in non-parallel sentence pairs.</Paragraph>
    <Paragraph position="7"> The work of Deng et. al (2006) also deals with sub-sentential fragments. However, they obtain parallel fragments from parallel sentence pairs (by chunking them and aligning the chunks appropriately), while we obtain them from comparable or non-parallel sentence pairs.</Paragraph>
    <Paragraph position="8"> Since our approach can extract parallel data from texts which contain few or no parallel sentences, it greatly expands the range of corpora which can be usefully exploited.</Paragraph>
  </Section>
class="xml-element"></Paper>