<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3208">
<Title>Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM</Title>
<Section position="5" start_page="0" end_page="0" type="concl">
<SectionTitle> 7. Conclusion </SectionTitle>
<Paragraph position="0"> Previous work on extracting bilingual or monolingual sentence pairs from comparable corpora has only been applied to documents that are on the same topic or have very similar publication dates. Previous methods follow a &quot;find-topic-extract-sentence&quot; principle, which claims that parallel or similar sentences can be found only in document pairs with high similarity. We propose a new &quot;find-one-get-more&quot; principle, which claims that document pairs containing at least one pair of matched sentences must contain others, even if the document pairs themselves do not have high similarity scores. Based on this, we propose a novel Bootstrapping method that successfully extracts parallel sentences from a far more disparate, very-non-parallel corpus than reported in previous work. This very-non-parallel corpus, the TDT3 data, includes documents that are off-topic, i.e., documents with no corresponding topic in the other language.</Paragraph>
<Paragraph position="1"> The method is completely unsupervised.</Paragraph>
<Paragraph position="2"> Evaluation results show that our approach achieves 65.7% accuracy, a 50% relative improvement over the baseline, which shows that the proposed method is promising. We also find that the IBM Model 4 lexical learner is weak on data from a very-non-parallel corpus, and that its performance can be boosted by our Multilevel Bootstrapping method, whereas using a parallel corpus for adaptation is not nearly as useful.</Paragraph>
<Paragraph position="3"> [Figure 3: EM lexical learning performance]
6.3.2 Multilevel Bootstrapping is significantly better than adaptation data in boosting the weak EM lexical learner.
Since the IBM model parameters can be better estimated if the input sentences are more parallel, we tried adding parallel sentences to the extracted sentence pairs at each iteration step, as proposed by Zhao and Vogel (2002). However, our experiments showed that adding a parallel corpus gives no improvement in the final output. This is likely because (1) the parallel corpus is not in the same domain as the TDT corpus, and (2) not enough parallel sentences are extracted at each step to estimate the model parameters reliably.</Paragraph>
<Paragraph position="4"> In contrast, Figure 3 shows that when we apply Bootstrapping to the EM lexical learner, bilingual lexicon extraction accuracy improves by 20% on average, evaluated on the top-N translation candidates of the same source words. This shows that our proposed method can boost a weak EM lexical learner even on data from a very-non-parallel corpus.</Paragraph>
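To make the procedure above concrete, here is a minimal Python sketch of the find-one-get-more bootstrapping loop with an EM-style lexicon update. It is an illustration only: the names score_sentence_pair and update_lexicon, the acceptance threshold, and the convergence test are assumptions for this sketch, not the authors' implementation or the IBM Model 4 trainer itself.

    # A minimal sketch of the "find-one-get-more" bootstrapping loop described
    # above, under assumed interfaces. score_sentence_pair and update_lexicon
    # are hypothetical stand-ins for the paper's matching and EM components.
    from itertools import product

    THRESHOLD = 0.5    # assumed sentence-pair acceptance threshold
    MAX_ITERS = 10     # assumed iteration cap

    def bootstrap(docs_src, docs_tgt, lexicon, score_sentence_pair, update_lexicon):
        """docs_src, docs_tgt: lists of documents, each a list of sentences."""
        # Seed with all document pairs; a real system would first filter by
        # date or topic heuristics to keep the search tractable.
        candidate_doc_pairs = set(product(range(len(docs_src)), range(len(docs_tgt))))
        extracted = set()  # tuples (doc_i, doc_j, sent_a, sent_b)

        for _ in range(MAX_ITERS):
            new_pairs = set()
            for i, j in candidate_doc_pairs:
                for a, s in enumerate(docs_src[i]):
                    for b, t in enumerate(docs_tgt[j]):
                        if score_sentence_pair(s, t, lexicon) >= THRESHOLD:
                            new_pairs.add((i, j, a, b))
            if new_pairs <= extracted:   # no new sentence pairs: converged
                break
            extracted |= new_pairs
            # EM step: re-estimate the bilingual lexicon from all pairs so far.
            lexicon = update_lexicon(
                lexicon,
                [(docs_src[i][a], docs_tgt[j][b]) for i, j, a, b in extracted])
            # Find-one-get-more: keep every document pair that contributed at
            # least one matched sentence, regardless of document similarity.
            candidate_doc_pairs = {(i, j) for i, j, _, _ in extracted}
        return extracted, lexicon

The sketch abstracts the paper's sentence matching and lexicon re-estimation behind the two callables; only the control flow of the bootstrapping principle is intended to be faithful.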
<Paragraph position="5"> In addition, we compare and contrast a number of bilingual corpora, ranging from parallel to comparable to very-non-parallel corpora. The parallel-ness of each type of corpus is quantified by a lexical matching score calculated over the bi-lexicon pairs distributed in the aligned bilingual sentence pairs. We show that this score increases as the parallel-ness or comparability of the corpus increases.</Paragraph>
<Paragraph position="6"> Finally, we suggest that Bootstrapping can in the future be used in conjunction with other sentence or word alignment learning methods to provide better mining results. For example, methods for learning a classifier to determine sentence parallel-ness, such as that proposed by Munteanu et al. (2004), can be incorporated into our Bootstrapping framework.</Paragraph>
</Section>
</Paper>