<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3208">
  <Title>Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM</Title>
  <Section position="4" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6. Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluate our algorithm on a very-non-parallel corpus of TDT3 data, which contains transcriptions of radio broadcasts and TV news reports from 1998-2000 on English and Chinese channels. We compare the results of our proposed method against a baseline method based only on the conventional &quot;find-topic-extract-sentence&quot; principle. We also investigate the performance of the IBM Model 4 EM lexical learner on data from a very-non-parallel corpus and evaluate how our method can boost its performance. The results are described in the following sub-sections.</Paragraph>
    <Paragraph position="1"> 6.1. Baseline method
Since previous work was carried out on different corpora and different language pairs, we cannot directly compare our method against it.</Paragraph>
    <Paragraph position="2"> However, we implement a baseline method that follows the same &quot;find-topic-extract-sentence&quot; principle as earlier work. The baseline method shares the same preprocessing, document matching, and sentence matching steps as our proposed method. However, it does not iterate to update the comparable document set, the parallel sentence set, or the bilingual lexicon.</Paragraph>
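To make the pipeline concrete, here is a minimal sketch of this non-iterative baseline in Python. The helper functions and thresholds (doc_similarity, sentence_similarity, doc_threshold, sent_threshold) are illustrative assumptions, not the paper's actual implementation:

    def baseline_extract(zh_docs, en_docs, lexicon, doc_threshold, sent_threshold):
        """Single-pass baseline: match documents by topic, then match sentences.

        doc_similarity and sentence_similarity are hypothetical lexicon-based
        scoring helpers standing in for the paper's matching features.
        """
        parallel_sentences = []
        # Stage 1: find comparable document pairs on the same topic.
        for zh_doc in zh_docs:
            for en_doc in en_docs:
                if doc_similarity(zh_doc, en_doc, lexicon) >= doc_threshold:
                    # Stage 2: extract candidate sentence pairs from matched documents.
                    for zh_sent in zh_doc.sentences:
                        for en_sent in en_doc.sentences:
                            score = sentence_similarity(zh_sent, en_sent, lexicon)
                            if score >= sent_threshold:
                                parallel_sentences.append((zh_sent, en_sent, score))
        # No iteration: the document set, sentence set, and lexicon are never updated.
        return sorted(parallel_sentences, key=lambda p: p[2], reverse=True)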
    <Paragraph position="3"> Human evaluators manually check whether the matched sentence pairs are indeed parallel. The precision of the parallel sentences extracted is 42.8% for the top 2,500 pairs, ranked by sentence similarity scores.</Paragraph>
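The reported figure is a precision-at-k measurement over the human judgments; a minimal sketch, assuming the judgments are stored in a mapping keyed by sentence pair (names are illustrative):

    def precision_at_k(ranked_pairs, is_parallel, k=2500):
        """Fraction of the top-k ranked pairs judged truly parallel."""
        top_k = ranked_pairs[:k]
        return sum(1 for pair in top_k if is_parallel[pair]) / len(top_k)

    # For example, 1,070 truly parallel pairs among the top 2,500 gives
    # 1070 / 2500 = 0.428, i.e. the 42.8% reported above.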
    <Paragraph position="4"> 6.2. Bootstrapping performs much better
There are 110,000 Chinese sentences and 290,000 English sentences in TDT3, which lead to more than 30 billion possible sentence pairs. Few of the sentence pairs turn out to be exact translations of each other, but many are bilingual paraphrases. For example, in one extracted sentence pair, the English sentence has the extra phrase &quot;under the agreement&quot;, which is missing from the Chinese sentence. Another example of translation versus bilingual paraphrase is as follows:
* Zhong Guo Guo Jia Zhu Xi Jiang Ze Min Di Da Ri Ben Ju Xing Guo Shi Fang Wen (The Chinese president Jiang Zemin arrived in Japan today for a state visit) (Translation) Chinese president Jiang Zemin arrived in Japan today for a landmark state visit.
* Zhe Ye Shi Zhong Guo Guo Jia Shou Nao Shou Ci Fang Wen Ri Ben (This is a first visit by a Chinese head of state to Japan) (Paraphrase) Mr Jiang is the first Chinese head of state to visit the island country.</Paragraph>
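The size of the search space quoted above follows directly from the corpus statistics; a quick arithmetic check:

    # Cross-product of all candidate sentence pairs in TDT3.
    zh_sentences = 110_000
    en_sentences = 290_000
    print(zh_sentences * en_sentences)  # 31,900,000,000 -> "more than 30 billion"

This cross-product is far too large to score exhaustively, which is one reason sentence matching is confined to matched document pairs.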
    <Paragraph position="5"> The precision of parallel sentence extraction is 65.7% for the top 2,500 pairs using our method, a more than 50% relative improvement over the baseline. We also found that the precision of parallel sentence pair extraction increases steadily over the iterations, until convergence.</Paragraph>
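A minimal sketch of the iterative loop this describes, using hypothetical helpers (match_documents and match_sentences, returning a set of pairs, plus an em_learn_lexicon re-training step); convergence is declared when the extracted sentence set stops changing:

    def bootstrap_extract(zh_docs, en_docs, seed_lexicon):
        """Iteratively re-match documents and sentences with a growing lexicon."""
        lexicon, prev_pairs = seed_lexicon, set()
        while True:
            doc_pairs = match_documents(zh_docs, en_docs, lexicon)
            sent_pairs = match_sentences(doc_pairs, lexicon)
            if sent_pairs == prev_pairs:  # converged: no change between iterations
                return sent_pairs, lexicon
            # Re-train the bilingual lexicon (IBM Model 4 EM) on the newly
            # extracted sentence pairs, then iterate with the updated lexicon.
            lexicon = em_learn_lexicon(sent_pairs)
            prev_pairs = sent_pairs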
    <Paragraph position="6"> 6.3. Bootstrapping can boost a weak EM lexical learner
In this section, we discuss experimental results that lead to the claim that our proposed method can boost a weak IBM Model 4 EM lexical learner.</Paragraph>
    <Paragraph position="7"> 6.3.1. EM lexical learning is weak on bilingual sentences from very-non-parallel corpora
We compare the performance of IBM Model 4 EM lexical learning on parallel data (130k sentence pairs from Hong Kong News) and on very-non-parallel data (7,200 sentence pairs from TDT3) by looking at a common set of source words and their top-N translation candidates. We found that IBM Model 4 EM learning performs much worse on the TDT3 data. Figure 3 shows that the EM learner performs about 30% worse on average on the TDT3 data.</Paragraph>
    <Paragraph position="8"> 6.4. Bootstrapping is significantly more useful than new word translations for mining parallel sentences
It is important for us to gauge the effects of the two main ideas in our algorithm, bootstrapping and EM lexicon learning, on the extraction of parallel sentences from very-non-parallel corpora. The baseline experiment shows that without iteration, the performance is 42.8%. We carried out another experiment using bootstrapping where the bilingual lexicon is not updated in each iteration. The bilingual sentence extraction accuracy on the top 2,500 sentence pairs in this case dropped to 65.2%, only about a 1% relative degradation from the full method's 65.7%.</Paragraph>
    <Paragraph position="9"> Based on the above, we conclude that EM lexical learning has little effect on the overall parallel sentence extraction output. This is probably because, although EM does find new word translations (such as Pi Nuo Qie Te/Pinochet), such new words are rare and thus have little effect on the overall glossing of the Chinese documents.</Paragraph>
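A sketch of the ablation behind these numbers: the same hypothetical bootstrapping loop as above, with a flag that disables lexicon re-training so that only the document and sentence sets are updated per iteration:

    def bootstrap_extract_ablation(zh_docs, en_docs, seed_lexicon, update_lexicon):
        """update_lexicon=True is the full method (65.7% precision); False is the
        bootstrapping-only variant (65.2%); no loop at all is the 42.8% baseline.
        """
        lexicon, prev_pairs = seed_lexicon, set()
        while True:
            doc_pairs = match_documents(zh_docs, en_docs, lexicon)
            sent_pairs = match_sentences(doc_pairs, lexicon)
            if sent_pairs == prev_pairs:
                return sent_pairs
            if update_lexicon:
                # Full method: re-train IBM Model 4 on the extracted pairs.
                lexicon = em_learn_lexicon(sent_pairs)
            prev_pairs = sent_pairs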
  </Section>
class="xml-element"></Paper>