<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3208"> <Title>Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We present a method that extracts parallel sentences from &quot;very-non-parallel corpora&quot;, which are far more disparate than the &quot;comparable corpora&quot; handled by previous methods, by bootstrapping on top of IBM Model 4 EM. Like previous methods, Step 1 of our method first uses similarity measures to find matching documents in a corpus, and then extracts parallel sentences as well as new word translations from these documents. Unlike previous methods, however, we extend this step with an iterative bootstrapping framework based on the principle of &quot;find-one-get-more&quot;, which holds that documents found to contain one pair of parallel sentences must contain others, even if the documents are judged to be of low similarity.</Paragraph> <Paragraph position="1"> We re-match documents based on the extracted sentence pairs, and refine the mining process iteratively until convergence. This novel &quot;find-one-get-more&quot; principle allows us to add more parallel sentences from dissimilar documents to the baseline set. Experimental results show that our proposed method is nearly 50% more effective than the baseline method without iteration. We also show that our method is effective in boosting the performance of the IBM Model 4 EM lexical learner, since the latter, though stronger than the Model 1 used in previous work, does not perform well on data from a very-non-parallel corpus.</Paragraph> <Paragraph position="2"> Figure 1. Parallel sentence and lexicon extraction via Bootstrapping and EM. The most challenging task is to extract bilingual sentences and a lexicon from very-non-parallel data.
Recent work on extracting parallel sentences from comparable data (Munteanu et al., 2004; Zhao and Vogel, 2002), and other work on extracting paraphrasing sentences from monolingual corpora (Barzilay and Elhadad, 2003), are based on the &quot;find-topic-extract-sentence&quot; principle, which holds that parallel sentences exist only in document pairs with high similarity. These methods all use lexical information (e.g., word overlap, cosine similarity) to match documents first, before extracting sentences from these documents.</Paragraph> </Section> </Paper>