<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2003"> <Title>Induction of Cross-Language Affix and Letter Sequence Correspondence</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Studying various relationships between languages is a central task in computational linguistics, with many application areas. In this paper we introduce the problem of induction of form relationships between words in different languages. More specifically, we focus on languages having an alphabetic writing system and affixal morphology, and we construct a model for the cross-language correspondence between letter sequences and between affixes. Since the writing system is alphabetic, letter sequences are highly informative regarding sound sequences as well.</Paragraph> <Paragraph position="1"> Concretely, the model is designed to answer the following question: what are the affixes and letter sequences in one language that correspond frequently to similar entities in another language? Such a model has obvious applications to the construction of learning materials in language education and to statistical machine translation.</Paragraph> <Paragraph position="2"> The input to our algorithm consists of word pairs from two languages, a sizeable fraction of which is assumed to be related graphemically and affixally. The algorithm has three main stages. First, an alignment between the word pairs is computed by an EM algorithm that uses an edit distance metric based on an increasingly refined individual letter correspondence cost function. 
Second, affix pair candidates are discovered and ranked, based on a language-independent abstract model of affixal morphology.</Paragraph> <Paragraph position="3"> Third, letter sequences that correspond productively in the two languages are discovered and ranked by EM iterations that use a cost function based on the discovered affixes and on compatibility of alignments.</Paragraph> <Paragraph position="4"> The affix learning part of the algorithm is fully unsupervised, in that we do not assume knowledge of affixes in any of the individual languages involved. The letter sequence learning part utilizes a simple initial correspondence between individual letters, and the rest of its operation is unsupervised.</Paragraph> <Paragraph position="5"> We believe that this is the first paper that explicitly addresses cross-language morphology, and the first that presents a comprehensive inter-language word form correspondence model that combines morphology and letter sequences.</Paragraph> <Paragraph position="6"> Section 2 motivates the problem and defines it in detail. In Section 3 we discuss relevant previous work. The algorithm is presented in Section 4, and results for English-Spanish are reported in Section 5.</Paragraph> </Section> </Paper>