<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4003"> <Title>Improving Machine Translation Performance</Title> <Section position="7" start_page="499" end_page="500" type="evalu"> <SectionTitle> 8. Related Work </SectionTitle> <Paragraph position="0"> While there is a large body of work on bilingual comparable corpora, most of it is focused on learning word translations (Fung and Yee 1998; Rapp 1999; Diab and Finch 2000; Koehn and Knight 2000; Gaussier et al. 2004). We are aware of only three previous efforts aimed at discovering parallel sentences. Zhao and Vogel (2002) describe a generative model for discovering parallel sentences in the Xinhua News Chinese-English corpus. Utiyama et. al (2003) use cross-language information retrieval techniques and dynamic programming to extract sentences from an English-Japanese comparable corpus. Fung and Cheung (2004) present an extraction method similar to ours but focus on &quot;very-non-parallel corpora,&quot; aggregations of Chinese and English news stories from different sources and time periods.</Paragraph> <Paragraph position="1"> The first two systems extend algorithms designed to perform sentence alignment of parallel texts. They start by attempting to identify similar article pairs from the two corpora. Then they treat each of those pairs as parallel texts and align their sentences by defining a sentence pair similarity score and use dynamic programming to find the least-cost alignment over the whole document pair.</Paragraph> <Paragraph position="2"> In the article pair selection stage, the researchers try to identify, for an article in one language, the best matching article in the other language. Zhao and Vogel (2002) measure article similarity by defining a generative model in which an English story generates a Chinese story with a given probability. Utiyama et al. (2003) use the BM25 (Robertson and Walker 1994) similarity measure.</Paragraph> <Paragraph position="3"> The two works also differ in the way they define the sentence similarity score.</Paragraph> <Paragraph position="4"> Zhao and Vogel (2002) combine a sentence length model with an IBM Model 1-type translation model. Utiyama et al. (2003) define a score based on word overlap (i.e., number of word pairs from the two sentences that are translations of each other), which also includes the similarity score of the article pair from which the sentence pair originates.</Paragraph> <Paragraph position="5"> The performance of these approaches depends heavily on the ability to reliably find similar document pairs. Moreover, comparable article pairs, even those similar in content, may exhibit great differences at the sentence level (reorderings, additions, etc). Therefore, they pose hard problems for the dynamic programming alignment approach.</Paragraph> <Paragraph position="6"> In contrast, our method is more robust. The document pair selection part plays a minor role; it only acts as a filter. We do not attempt to find the best-matching English document for each foreign one, but rather a set of similar documents. And, most importantly, we are able to reliably judge each sentence pair in isolation, without need for context. On the other hand, the dynamic programming approach enables discovery of many-to-one sentence alignments, whereas our method is limited to finding one-to-one alignments.</Paragraph> <Paragraph position="7"> The approach of Fung and Cheung (2004) is a simpler version of ours. 
<Paragraph position="7"> The approach of Fung and Cheung (2004) is a simpler version of ours. They match each foreign document with a set of English documents, using a threshold on their cosine similarity. Then, from each document pair, they generate all possible sentence pairs, compute their cosine similarity, and apply another threshold in order to select the ones that are parallel. Using the set of extracted sentences, they learn a new dictionary, try to extend their set of matching document pairs (by looking for other documents that contain these sentences), and iterate.</Paragraph>
<Paragraph position="8"> The evaluation methodologies of these previous approaches are less direct than ours. Utiyama et al. (2003) evaluate their sentence pairs manually; they estimate that about 90% of the sentence pairs in their final corpus are parallel. Fung and Cheung (2004) also perform a manual evaluation of the extracted sentences and estimate their precision to be 65.7% after bootstrapping. In addition, they estimate the quality of a lexicon automatically learned from those sentences. Zhao and Vogel (2002) go one step further and show that the sentences extracted with their method improve the accuracy of automatically computed word alignments, to an F-score of 52.56% over a baseline of 46.46%. In a subsequent publication, Vogel (2003) evaluates these sentences in the context of an MT system and shows that they bring improvement under special circumstances (i.e., a language model constructed from reference translations) designed to reduce the noise introduced by the automatically extracted corpus. We go even further and demonstrate that our method can extract data that improves end-to-end MT performance without any special processing. Moreover, we show that our approach works even when only a limited amount of initial parallel data (i.e., a low-coverage dictionary) is available.</Paragraph>
<Paragraph position="9"> The problem of aligning sentences in comparable corpora has also been addressed for monolingual texts. Barzilay and Elhadad (2003) present a method of aligning sentences in two comparable English corpora for the purpose of building a training set of text-to-text rewriting examples. Monolingual parallel sentence detection presents a particular challenge: there are many sentence pairs that have low lexical overlap but are nevertheless parallel. Therefore, pairs cannot be judged in isolation, and context becomes an important factor. Barzilay and Elhadad (2003) make use of contextual information by detecting the topical structure of the articles in the two corpora and aligning them at the paragraph level based on the topic assigned to each paragraph. Afterwards, they align sentences within paragraph pairs using dynamic programming.</Paragraph>
<Paragraph position="10"> Their results show that both the induced topical structure and the paragraph alignment improve the precision of their extraction method.</Paragraph>
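As a rough illustration of this two-stage idea, the following Python sketch pairs paragraphs that carry the same topic label and then looks for sentence pairs only within matched paragraphs. The topic labels are assumed to be supplied by an upstream topic-segmentation step, and the cosine-plus-threshold scorer stands in for Barzilay and Elhadad's dynamic programming alignment; both are illustrative simplifications, not their actual models.

    # Illustrative sketch: restrict sentence pairing to paragraphs that share
    # a precomputed topic label, then score sentence pairs within each match.

    from collections import Counter
    from itertools import product
    from math import sqrt


    def cosine(s1, s2):
        """Cosine similarity of two tokenized sentences over raw word counts."""
        c1, c2 = Counter(s1), Counter(s2)
        dot = sum(c1[w] * c2[w] for w in c1)
        norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0


    def pair_paragraphs(doc_a, doc_b):
        """Match paragraphs across the two documents that share a topic label.

        Each document is a list of (topic_label, [tokenized sentences]) pairs;
        the labels are assumed to come from an upstream topic-segmentation step,
        which this sketch does not model.
        """
        by_topic = {}
        for topic, sents in doc_b:
            by_topic.setdefault(topic, []).append(sents)
        pairs = []
        for topic, sents_a in doc_a:
            for sents_b in by_topic.get(topic, []):
                pairs.append((sents_a, sents_b))
        return pairs


    def align_within_pairs(paragraph_pairs, threshold=0.5):
        """Keep sentence pairs whose similarity clears a threshold.

        Barzilay and Elhadad align sentences within paragraph pairs by dynamic
        programming; a simple per-pair threshold keeps this example short.
        """
        alignments = []
        for sents_a, sents_b in paragraph_pairs:
            for sa, sb in product(sents_a, sents_b):
                if cosine(sa, sb) >= threshold:
                    alignments.append((sa, sb))
        return alignments


    if __name__ == "__main__":
        doc_a = [("weather", [["heavy", "rain", "hit", "the", "coast"]]),
                 ("sports", [["the", "team", "won", "the", "final"]])]
        doc_b = [("weather", [["rain", "hit", "the", "northern", "coast"],
                              ["forecasters", "expect", "more", "storms"]])]
        print(align_within_pairs(pair_paragraphs(doc_a, doc_b)))

The paragraph-pairing stage plays the role of the context that isolated sentence scoring lacks in the monolingual setting: low-overlap sentence pairs are only considered if their surrounding paragraphs were judged to discuss the same topic.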
<Paragraph position="11"> A line of research that is both complementary and related to ours is that of Resnik and Smith (2003). Their STRAND Web-mining system has a purpose that is similar to ours: to identify translational pairs. However, STRAND focuses on extracting pairs of parallel Web pages rather than sentences. Resnik and Smith (2003) show that their approach is able to find large numbers of similar document pairs. Their system is potentially a good way of acquiring comparable corpora from the Web that could then be mined for parallel sentences using our method.</Paragraph>
</Section>
</Paper>