
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2003">
  <Title>Induction of Cross-Language Affix and Letter Sequence Correspondence</Title>
  <Section position="3" start_page="0" end_page="18" type="metho">
    <SectionTitle>
2 Problem Motivation and Definition
</SectionTitle>
    <Paragraph position="0"> We would like to discover characteristics of word form correspondence between languages.</Paragraph>
    <Paragraph position="1"> In this section we discuss what exactly this means and why it is useful.</Paragraph>
    <Paragraph position="2">  Word form. Word forms have at least three different aspects: sound, writing system, and internal structure, corresponding to the linguistics fields of phonology, orthography and morphology. When the writing system is phonetically based, the written form of a word is highly informative of how the word sounds. Individual writing units are referred to as graphemes.</Paragraph>
    <Paragraph position="3"> Morphology studies the internal structure of words when viewed as comprised of semantics carrying components. Morphological units can be classified into two general classes, stems (or roots) and bound morphemes, which combine to create words using various kinds of operators.</Paragraph>
    <Paragraph position="4"> The linear affixing operator combines stems and bound morphemes (affixes) using linear ordering with possible fusion effects, usually at the seams. Word form correspondence. In this paper we study cross-language word form correspondence.</Paragraph>
    <Paragraph position="5"> We should first ask why there should be any relationship at all between word forms in different languages. There are at least two factors that create such relationships. First, languages may share a common ancestor. Second, languages may borrow words, writing systems and even morphological operators from each other. Note that usage of proper names can be viewed as a kind of borrowing. In both cases form relationships are accompanied by semantic relatedness. Words that possess a degree of similarity of form and meaning are usually termed cognates.</Paragraph>
    <Paragraph position="6"> Our goal in examining word forms in different languages is to identify correspondence phenomena that could be useful for certain applications. These would usually be correspondence similarities that are common to many word pairs.</Paragraph>
    <Paragraph position="7"> Problem statement for the present paper. For reasons of paper length, we focus here on languages having the following two characteristics. First, we assume an alphabetic writing system.</Paragraph>
    <Paragraph position="8"> This implies that grapheme correspondences will be highly informative of sound correspondences as well. From now on we will use the term 'letter' instead of 'grapheme'. Second, we assume linear affixal morphology (prefixing and suffixing), which is an extremely frequent morphological operator in many languages.</Paragraph>
    <Paragraph position="9"> We address the two fundamental word form entities in languages that obey those assumptions: affixes and letter sequences. Our goal is to discover frequent cross-language pairs of those entities and quantify the correspondence. Pairing of letter sequences is expected to be mostly due to regular sound transformations and spelling conventions. Pairing of affixes could be due to morphological principles - predictable relationships between the affixing operators (their form and meaning) - or, again, due to sound transformations and spelling.</Paragraph>
    <Paragraph position="10"> The input to the algorithm consists of a set of ordered pairs of words, one from each language.</Paragraph>
    <Paragraph position="11"> We do not assume that all input word pairs exhibit the correspondence relationships of interest, but obviously the quality of results will depend on the fraction of the pair set that does exhibit them. A particular word may participate in more than a single pair. As explained above, the relationships of interest to us in this paper usually imply semantic affinity between the words; hence, a suitable pair set can be generated by selecting word pairs that are possible translations of each other. Practical ways to obtain such pairs are using a bilingual dictionary or a word aligned parallel corpus. We had used the former, which implies that we addressed only derivational, not inflectional, morphology. Using a dictionary provides a kind of semantic supervision that allows us to focus on the desired form relationships.</Paragraph>
    <Paragraph position="12"> We also assume that the algorithm is provided with a prototypical individual letter mapping as seed. Such a mapping is trivial to obtain in virtually all practical situations, either because both languages utilize the same alphabet or by using a manually prepared, coarse alphabet mapping (e.g., anybody even shallowly familiar with Cyrillic or Semitic scripts can prepare such a mapping in just a few minutes.) We do not assume knowledge of affixes in any of the languages. Our algorithm is thus fully unsupervised in terms of morphology and very weakly seeded in term of orthography.</Paragraph>
    <Paragraph position="13"> Motivating applications. There are two main applications that motivate our research. In second language education, a major challenge for adult learners is the high memory load due to the huge number of lexical items in a language. Item memorization is known to be greatly assisted by tying items with existing knowledge (Matlin02).</Paragraph>
    <Paragraph position="14"> When learning a second language lexicon, it is beneficial to consciously note similarities between new and known words. Discovering and explaining such similarities automatically would help teachers in preparing reliable study materials, and learners in remembering words.</Paragraph>
    <Paragraph position="15"> Recognition of familiar components also helps learners when encountering previously unseen words. For example, suppose an English speaker who learns Spanish and sees the word 'parcial- null mente'. A word form correspondence model would tell her that 'mente' is an affix strongly corresponding to the English 'ly', and that the letter pair 'ci' often corresponds to the English 'ti'. The model thus enables guessing or recalling the English word 'partially'.</Paragraph>
    <Paragraph position="16"> Our model could also warn the learner of cognates that are possibly false, by recognizing similar words that are not paired in the dictionary. A second application area is machine translation. Both cognate identification (Kondrak et al 03) and morphological information in one of the languages (Niessen00) have been proven useful in statistical machine translation.</Paragraph>
  </Section>
  <Section position="4" start_page="18" end_page="18" type="metho">
    <SectionTitle>
3 Previous Work
</SectionTitle>
    <Paragraph position="0"> Cross-language models for phonology and orthography have been developed for back-transliteration in cross-lingual information retrieval (CLIR), mostly from Japanese and Chinese to English. (Knight98) uses a series of weighted finite state transducers, each focusing on a particular mapping. (Lin02) uses minimal edit distance with a 'confusion matrix' that models phonetic similarity. (Li04, Bilac04) generalize using the sequence alignment algorithm presented in (Brill00) for spelling correction. (Bilac04) explicitly separates the phonemic and graphemic models. None of that work addresses morphology and in all of it grapheme and phoneme correspondence is only a transient tool which is not studied on its own. (Mueller05) explicitly models phonological similarities between related languages, but does not address morphology and orthography.</Paragraph>
    <Paragraph position="1"> Cognate identification has been studied in computational historical linguistics. (Covington96, Kondrak03a) use a fixed, manually determined single entity mapping. (Kondrak03b) generalizes to letter sequences based on the algorithm in (Melamed97). The results are good for the historical linguistics application. However, morphology is not addressed, and the sequence correspondence model is less powerful than that employed in the back-transliteration and spelling correction literature. In addition, all effects that occur at word endings, including suffixes, are completely ignored. (Mackay05) presents good results for cognate identification using a word similarity measure based on pair hidden Markov models. Again, morphology was not modeled explicitly.</Paragraph>
    <Paragraph position="2"> A nice application for cross-language morphology is (Schulz04), which acquires a Spanish medical lexicon from a Portuguese seed lexicon using a manually prepared table of 842 Spanish affixes.</Paragraph>
    <Paragraph position="3"> Unsupervised learning of affixal morphology in a single language is a heavily researched problem. (Medina00) studies several methods, including the squares method we use in Section 4.</Paragraph>
    <Paragraph position="4"> (Goldsmith01) presents an impressive system that searches for 'signatures', which can be viewed as generalized squares. (Creutz04) presents a very general method that excels at dealing with highly inflected languages. (Wicentowsky04) deals with inflectional and irregular morphology by using semantic similarity between stem and stem+affix, also addressing stem-affix fusion effects. None of these papers deals with cross-language morphology.</Paragraph>
  </Section>
  <Section position="5" start_page="18" end_page="20" type="metho">
    <SectionTitle>
4 The Algorithm
</SectionTitle>
    <Paragraph position="0"> Overview. Letter sequences and affixes are different entities exhibiting different correspondence phenomena, hence are addressed at separate stages. The result of addressing one will assist us in addressing the other.</Paragraph>
    <Paragraph position="1"> The fundamental tool that we use to discover correspondence effects is alignment of the two words in a pair. Stage 1 of the algorithm creates an alignment using the given coarse individual letter mapping, which is simultaneously improved to a much more accurate one.</Paragraph>
    <Paragraph position="2"> Stage 2 discovers affix pairs using a general language independent affixal morphology model.</Paragraph>
    <Paragraph position="3"> In stage 3 we utilize the improved individual letter relation from stage 1 and the affix pairs discovered in stage 2 to create a general letter sequence mapping, again using word alignments.</Paragraph>
    <Paragraph position="4"> In the following we describe in detail each of these stages.</Paragraph>
    <Paragraph position="5"> Initial alignment. The main goal of stage 1 is to align the letters of each word pair. This is done by a standard minimal edit distance algorithm, efficiently implemented using dynamic programming (Gusfield97, Ristad98). We use the standard edit distance operations of replace, insert and delete. The letter mapping given as input defines a cost matrix where replacement of corresponding letters has a low (0) cost and of all others a high (1) cost. The cost of insert and delete is arbitrarily set to be the same as that of replacing non-identical letters. We use a hash table rather than a matrix, to prepare for later stages of the algorithm.</Paragraph>
    <Paragraph position="6"> When the correspondence between the languages is very high, this initial alignment can  already provide acceptable results for the next stage. However, in order to increase the accuracy of the alignment we now refine the letter cost matrix by employing an EM algorithm that iteratively updates the cost matrix using the current alignment and computes an improved alignment based on the updated cost matrix (Brill00, Lin02, Li04, Bilac04). The cost of mapping a letter K to a letter L is updated to be proportional to the count of this mapping in all of the current alignments divided by the total number of mappings of the letter K.</Paragraph>
    <Paragraph position="7"> Affix pairs. The computed letter alignment assists us in addressing affixes. Recall that we possess no knowledge of affixes; hence, we need to discover not only pairing of affixes, but the participating affixes as well. Our algorithm discovers affixes and their pairing simultaneously. It is inspired by the squares algorithm for affix learning in a single language (Medina00)</Paragraph>
    <Paragraph position="9"> The squares method assumes that affixes generally combine with very many stems, and that stems are generally utilized more than once.</Paragraph>
    <Paragraph position="10"> These assumptions are due to a functional view of affixal morphology as a process whose goal is to create a large number of word forms using fewer parameters. A stem that combines with an affix is quite likely to also appear alone, so the empty affix is allowed.</Paragraph>
    <Paragraph position="11"> We first review the method as it is used in a single language. Given a word W=AB (where A and B are non-empty letter sequences), our task is to measure how likely it is for B to be a suffix (prefix learning is similar.) We refer to AB as a segmentation of W, using a hyphen to show segmentations of concrete words. Define a square to be four words (including W) of the forms W=AB, U=AD, V=CB, and Y=CD (one of the letter sequences C, D is allowed to be empty.) Such a square might attest that B, D are suffixes and that A, C are stems. However, we must be careful: it might also attest that B, D are stems and A, C are prefixes. A square attests for a segmentation, not for a particular labeling of its components.</Paragraph>
    <Paragraph position="12"> As an example, if W is 'talking', a possible square is {talk-ing, hold-ing, talk-s, hold-s} where A=talk, B=ing, C=hold, and D=s. Another possible square is {talk-ing, danc-ing, talk-ed, danc-ed}, where A=talk, B=ing, C=danc, and D=ed. This demonstrates a drawback of the  (Medina00) attributes the algorithm to Joseph Greenberg. method, namely its sensitivity to spelling; C with the empty suffix is written 'dance', not 'danc'. The four words {talking, dancing, talk, dance} do not form a square.</Paragraph>
    <Paragraph position="13"> We now count the number of squares in which B appears. If this number is relatively large (which needs to be precisely defined), we have a strong evidence that B is a suffix or a stem. We can distinguish between these two cases using the number of witnesses - actual words in which B appears.</Paragraph>
    <Paragraph position="14"> We generalize the squares method to the discovery of cross-language affix pairs, as follows. We now use W to denote not a single word but a word pair W1:W2. B does not denote a suffix candidate but a suffix pair candidate, B1:B2, and similarly for D. A and C denote stem pair candidates A1:A2 and C1:C2, respectively.</Paragraph>
    <Paragraph position="15"> We now define a key concept. Given a word pair W=W1:W2 aligned under an alignment T, two segmentations W1=A1B1 and W2=A2B2 are said to be compatible if no alignment line of T connects a subset of A1 to a subset of B2 or a subset of A2 to a subset of B1. This definition is also applicable to alignments between letter sequences. null We now impose our key requirement: for all of the words involved in the cross-lingual square, their segmentations into two parts must be compatible under the alignment computed at stage 1. For example, consider the English-Spanish word pair affirmation : afirmacion. The segmentation affirma-tion : afirma-cion is attested by the</Paragraph>
    <Paragraph position="17"> assuming that the appropriate parts are aligned.</Paragraph>
    <Paragraph position="18"> Note that 'tively' is comprised of two smaller affixes, but the squares method legitimately considers it an affix by itself. Note also that since all of A1, A2, C1 and C2 end with the same letter, that letter can be moved to the beginning of B1, B2, D1, D2 to produce a different square (affirmation : afirm-acion, etc.) from the same four word pairs.</Paragraph>
    <Paragraph position="19"> Since we have no initial reason to favor a particular affix candidate over another, and since the total computational cost is not prohibitive, we  now simply count the number of attesting squares for all possible compatible segmentations of all word pairs, and sort the list according to the number of witnesses. To reduce noise, we remove affix candidates for which the absolute number of witnesses or squares is small (e.g., ten.) Letter sequences. The third and last stage of the algorithm discovers letter sequences that correspond frequently. This is again done by an edit distance algorithm, generalizing that of stage 1 so that sequences longer than a single letter can be replaced, inserted or deleted. In order to reduce noise, prior to that we remove word pairs whose stems are very different. Those are identified by comparing their edit distance costs, which should hence be normalized according to length (of the longer stem in a pair.) Note that accuracy is increased by considering only stems: affix pairs might be very different, thus might increase edit distance cost even when the stems do exhibit good sequence pairing effects.</Paragraph>
    <Paragraph position="20"> When generalizing the edit distance algorithm, we need to specify which letter sequences will be considered, because it does not make sense to consider all possible mappings of all subsets to all possible subsets - the number of different such pairs will be too large to show any meaningful statistics.</Paragraph>
    <Paragraph position="21"> The letter sequences considered were obtained by 'fattening' the lines in alignments yielding minimal edit distances, using an EM algorithm as done in (Brill00, Bilac04, Li04). The details of the algorithm can be found in these papers. The most important step, line fattening, is done as follows. We examine all alignment lines, each connecting two letter sequences (initially, of length 1.) We unite those sequences with adjacent sequences in all ways that are compatible with the alignment, and add the new sequences to the cost function to be used in the next EM iteration.</Paragraph>
    <Paragraph position="22"> If we kept letter sequence pairs that are not frequent in the cost function, they would distort the counts of more frequent letter sequences with which they partially overlap. We thus need to retain only some of the sequence pairs discovered. We have experimented with several ways to do that, all yielding quite similar results. For the results presented in this paper, we used the idea that sequences that clearly map to specific sequences are more important to our model than sequences that 'fuzzily' map to many sequences.</Paragraph>
    <Paragraph position="23"> To quantify this approach, for each language-1 sequence we sorted the corresponding language-2 sequences according to count, and removed pairs in which the language-2 item was responsible for only a small percentage of the total (we used a threshold of 0.05). We further removed sequence pairs whose absolute counts are low.</Paragraph>
    <Paragraph position="24"> Discussion. We deal with affixes before sequences because, as we have seen, identification of affixes helps us in identifying sequences, while the opposite order actually hurts us: sequences sometimes contain letters from both stem and affix, which invalidates squares that are otherwise valid.</Paragraph>
    <Paragraph position="25"> It may be asked why the squares stage is needed at all - perhaps affixes would be discovered anyway as sequences in stage 3. Our assumption was that affixes are best discovered using properties resulting from their very nature. We have experimented with the option of removing stage 2 and discovering affixes as letter sequences in stage 3, and verified that it gives markedly lower quality results. Even the very frequent pair -ly:-mente was not signaled out, because its count was lowered by those of the pairs -ly:-ente, -ly:nte, -y:-te, etc.</Paragraph>
  </Section>
class="xml-element"></Paper>