<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1117"> <Title>Cognate Mapping -- A Heuristic Strategy for the Semi-Supervised Acquisition of a Spanish Lexicon from a Portuguese Seed Lexicon</Title> <Section position="5" start_page="0" end_page="0" type="relat"> <SectionTitle> 4 Related Work </SectionTitle> <Paragraph position="0"> The rise of the empirical paradigm in the field of machine translation is, to a large degree, due to the widespread availability of parallel corpora (Brown et al., 1990). Parallel corpora also constitute an important resource for the automated acquisition of translational lexicons (Turcato, 1998). Unfortunately, the limited availability of parallel corpora (e.g., the Canadian Hansard corpus of English and French parliamentary debates) restricts this method to a few language pairs, mostly focused on specific sublanguages (e.g.,</Paragraph> <Paragraph position="1"> politics, legislation, economy). Such parallel corpora exist neither for the medical sublanguage nor for the particular language pair, Spanish and Portuguese, that we focus on in this work.</Paragraph> <Paragraph position="2"> The acquisition of unrelated, albeit comparable, corpora is much easier. Rapp (1999) used unrelated corpora to learn English and German word-to-word translations. His approach is based on similarity measures and context clues, using a seed lexicon of trusted translations. Koehn and Knight (2002) derived such a seed lexicon from German-English cognates, which were selected using string similarity criteria. An additional boost can be achieved by retrieving content-related document pairs using CLIR techniques (Utsuro et al., 2003). An alternative, generative approach is proposed by Barker and Sutcliffe (2000), who created Polish cognate candidates from an English wordlist using a set of string mapping rules.</Paragraph> <Paragraph position="3"> Pirkola et al. (2003) used aligned translation dictionaries as source data. 
From these dictionaries, they created an algorithm that automatically generates transformation rules from five different languages, including Spanish, into English. Applying a two-step technique (translation rules and fuzzy n-gram matching), they achieved an average precision of 81.1% in a Spanish-to-English setting covering biomedical words only. However, their evaluation metrics differed considerably from ours, since they considered multiple hypotheses. Our work differs from these precursors in many ways. First, due to domain and language restrictions, our corpora are much smaller than the commonly used newspaper corpora. Second, for the same reasons, CLIR techniques for retrieving comparable documents are not yet available (on the contrary, the goal of our work is to provide resources for a medical CLIR system). Third, the two languages are so similar that a large number of translations could already be acquired by applying string mapping rules (this approach to cognate mapping has also been discussed by MacWhinney (1995) for second language acquisition by human learners). Finally, rather than acquiring bilateral word translations, our focus lies on assigning subwords to interlingual semantic identifiers.</Paragraph> </Section> </Paper>