<?xml version="1.0" standalone="yes"?> <Paper uid="J96-4002"> <Title>An Algorithm to Align Words for Historical Comparison</Title>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. The Problem </SectionTitle>
<Paragraph position="0"> The first step in applying the comparative method to a pair of words suspected of being cognate is to align the segments of each word that appear to correspond. This alignment step is not necessarily trivial. For example, the correct alignment of Latin dO with Greek didOmi is

--dO--
didOmi

and not

dO----    d--O--    ----dO
didOmi    didOmi    didOmi

or numerous other possibilities. The segments of two words may be misaligned because of affixes (living or fossilized), reduplication, and sound changes that alter the number of segments, such as elision or monophthongization. Alignment is a neglected part of the computerization of the comparative method. The computer programs developed by Frantz (1970), Hewson (1974), and Wimbish (1989) require the alignments to be specified in their input. The Reconstruction Engine of Lowe and Mazaudon (1994) requires the linguist to specify hypothetical sound changes and canonical syllable structure. The cognateness tester of Guy (1994) ignores the order of segments, matching any segment in one word with any segment in the other.</Paragraph>
<Paragraph position="1"> This paper presents a guided search algorithm for finding the best alignment of one word with another, where both words are given in a broad phonetic transcription. The algorithm compares surface forms and does not look for sound laws or phonological rules; it is meant to correspond to the linguist's first look at unfamiliar data. A prototype implementation has been built in Prolog and tested on a corpus of 82 known cognate pairs from various languages. Somewhat surprisingly, it needs little or no knowledge of phonology beyond the distinction between vowels, consonants, and glides.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="482" type="metho"> <SectionTitle> 2. Alignments </SectionTitle>
<Paragraph position="0"> If the two words to be aligned are identical, the task of aligning them is trivial. In all other cases, the problem is one of inexact string matching, i.e., finding the alignment that minimizes the difference between the two words. A dynamic programming algorithm for inexact string matching is well known (Sankoff & Kruskal 1983, Ukkonen 1985, Waterman 1995), but I do not use it, for several reasons. First, the strings being aligned are relatively short, so the efficiency of dynamic programming on long strings is not needed. Second, dynamic programming normally gives only one alignment for each pair of strings, but comparative reconstruction may need the n best alternatives, or all that meet some criterion. Third, the tree search algorithm lends itself to modification for special handling of metathesis or assimilation. More about this later; first I need to sketch what the aligner is supposed to accomplish.</Paragraph>
<Paragraph position="1"> An alignment can be viewed as a way of stepping through two words concurrently, consuming all the segments of each. At each step, the aligner can perform either a match or a skip.
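To fix ideas, this step-by-step view can be encoded directly. The following is a minimal illustrative sketch in Prolog (the language of the prototype), not the paper's actual code: it assumes that a word is a list of segment atoms and that an alignment is a list of steps match(X,Y), skip1(X), and skip2(Y); the rows/3 helper is a hypothetical addition that renders a step list in the hyphen notation used in the running text.

% Assumed representation (illustrative only): a word is a list of segment
% atoms; an alignment is a list of steps, where match(X,Y) aligns segment X
% of word 1 with segment Y of word 2, skip1(X) consumes X from word 1 alone,
% and skip2(Y) consumes Y from word 2 alone.
%
% rows(+Steps, -Row1, -Row2) renders an alignment as the two hyphen-padded
% rows used in the text, with '-' marking a skipped position.
rows([], [], []).
rows([match(X,Y)|S], [X|R1], [Y|R2]) :- rows(S, R1, R2).
rows([skip1(X)|S], [X|R1], ['-'|R2]) :- rows(S, R1, R2).
rows([skip2(Y)|S], ['-'|R1], [Y|R2]) :- rows(S, R1, R2).

% Example: the etymologically correct alignment of Latin dO with Greek didOmi
% (long vowels written as capitals, as in the text).
% ?- rows([skip2(d), skip2(i), match(d,d), match('O','O'), skip2(m), skip2(i)], R1, R2).
% R1 = [-, -, d, 'O', -, -],
% R2 = [d, i, d, 'O', m, i].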
A match is what happens when the aligner consumes a segment from each of the two words in a single step, thereby aligning the two segments with each other (whether or not they are phonologically similar). A skip is what happens when it consumes a segment from one word while leaving the other word alone. Thus, the alignment

abc-
-bde

is produced by skipping a, then matching b with b, then matching c with d, then skipping e. Here as elsewhere, hyphens in either string correspond to skipped segments in the other. 1 The aligner is not allowed to perform, in succession, a skip on one string and then a skip on the other, because the result would be equivalent to a match (of possibly dissimilar segments). That is, of the three alignments

ab-c    a-bc    abc
a-dc    ad-c    adc

only the third one is permitted; pursuing all three would waste time because they are equivalent as far as linguistic claims are concerned. (Determining whether b and d actually correspond is a question of historical reconstruction, not of alignment.) I call this restriction the no-alternating-skips rule.</Paragraph>
<Paragraph position="2"> To identify the best alignment, the algorithm must assign a penalty (cost) to every skip or match. The best alignment is the one with the lowest total penalty.

1 Traditionally, the problem is formulated in terms of operations to turn one string into the other. Skips in string 1 and string 2 are called deletions and insertions respectively, and matches of dissimilar segments are called substitutions. This terminology is inappropriate for historical linguistics, since the ultimate goal is to derive the two strings from a common ancestor.</Paragraph>
<Paragraph position="3"> As a first approximation, we can use the following penalties:

0.0 for an exact match
0.5 for a skip
1.0 for a match of two dissimilar segments

[The worked computations applying these penalties to the three candidate alignments are garbled in the source.] The third of these has the lowest penalty (and is the etymologically correct alignment).</Paragraph> </Section>
<Section position="4" start_page="482" end_page="484" type="metho"> <SectionTitle> 3. The Search Space </SectionTitle>
<Paragraph position="0"> Figure 1 shows, in the form of a tree, all of the moves that the aligner might try while attempting to align two three-letter words (English [hæz] and German [hat]). We know that these words correspond segment-by-segment, 2 but the aligner does not. It has to work through numerous alternatives in order to conclude that

hæz
hat

is indeed the best alignment.</Paragraph>
<Paragraph position="1"> The alignment algorithm is simply a depth-first search of this tree, beginning at the top of Figure 1. That is, at each position in the pair of input strings, the aligner tries first a match, then a skip on the first word, then a skip on the second, and computes all the consequences of each. After completing each alignment it backs up to the most recent untried alternative and tries a different one. "Dead ends" in the tree are places where further computation is blocked by the no-alternating-skips rule.</Paragraph>
<Paragraph position="2"> As should be evident, the search tree can be quite large even if the words being aligned are fairly short. Table 1 gives the number of possible alignments for words of various lengths; when both words are of length n, there are about 3^(n-1) alignments, not counting dead ends. Without the no-alternating-skips rule, the number would be about 5^n/2.
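These figures are easy to check by brute force. The following sketch, an illustration rather than the paper's prototype, enumerates every alignment permitted by the no-alternating-skips rule, using the match/skip1/skip2 step representation assumed earlier and trying the moves in the order just described (match, skip on word 1, skip on word 2); counting the alignments of two words of length n = 1, 2, 3, 4 gives 1, 3, 9, 27, in line with the 3^(n-1) estimate.

% Illustrative sketch (not the paper's prototype): enumerate every alignment
% that the no-alternating-skips rule permits.  The extra argument records the
% previous step, so that a skip on one word can never immediately follow a
% skip on the other.
alignment(Word1, Word2, Steps) :-
    alignment(Word1, Word2, none, Steps).

alignment([], [], _, []).
alignment([X|Xs], [Y|Ys], _, [match(X,Y)|S]) :-      % try a match first
    alignment(Xs, Ys, match, S).
alignment([X|Xs], Ys, Prev, [skip1(X)|S]) :-         % then a skip on word 1
    Prev \== skip2,                                  % not right after a skip on word 2
    alignment(Xs, Ys, skip1, S).
alignment(Xs, [Y|Ys], Prev, [skip2(Y)|S]) :-         % then a skip on word 2
    Prev \== skip1,                                  % not right after a skip on word 1
    alignment(Xs, Ys, skip2, S).

% Count the alignments of two words of length N.
count_alignments(N, Count) :-
    length(W1, N), length(W2, N),
    findall(S, alignment(W1, W2, S), All),
    length(All, Count).

% ?- count_alignments(3, C).
% C = 9.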
Exact formulas are given in the appendix.</Paragraph>
<Paragraph position="3"> Fortunately, the aligner can greatly narrow the search by putting the evaluation metric to use as it works. The key idea is to abandon any branch of the search tree as soon as the accumulated penalty exceeds the total penalty of the best alignment found so far.

2 Actually, as an anonymous reviewer points out, the exact correspondence is between German hat and earlier English hath. The current English -s ending may be analogical. This does not affect the validity of the example because /t/ and /s/ are certainly in corresponding positions, regardless of their phonological history.</Paragraph>
<Paragraph position="4"> Figure 2 shows the search tree after pruning according to this principle. The total amount of work is roughly cut in half. With larger trees, the saving can be even greater.</Paragraph>
<Paragraph position="5"> To ensure that a relatively good alignment is found early, it is important, at each stage, to try matches before trying skips. Otherwise the aligner would start by generating a large number of useless displacements of each string relative to the other, all of which have high penalties and do not narrow the search space much. Even so, the algorithm is quite able to skip affixes when appropriate. For example, when asked to align Greek didOmi with Latin dO, it tries only three alignments, of which the best two are:

didOmi    didOmi
d--O--    --dO--

Choosing the right one of these is then a task for the linguist rather than the alignment algorithm. However, it would be easy to modify the algorithm to use a lower penalty for skips at the beginning or end of a word than skips elsewhere; the algorithm would then be more willing to postulate prefixes and suffixes than infixes.</Paragraph> </Section>
<Section position="5" start_page="484" end_page="486" type="metho"> <SectionTitle> 4. The Full Evaluation Metric </SectionTitle>
<Paragraph position="0"> Table 2 shows an evaluation metric developed by trial and error using the 82 cognate pairs shown in the subsequent tables. To avoid floating-point rounding errors, all penalties are integers, and the penalty for a complete mismatch is now 100 rather than 1.0. The principles that emerge are that syllabicity is paramount, consonants matter more than vowels, and affixes tend to be contiguous.</Paragraph>
<Paragraph position="1"> Somewhat surprisingly, it was not necessary to use information about place of articulation in this evaluation metric (although there are a few places where it might have helped). This accords with Anttila's (1989, 230) observation that great phonetic subtlety is not needed to align words; what one wants to do is find the exact matches and align the syllabic peaks, matching segments of comparable syllabicity (vowels with vowels and consonants with consonants).</Paragraph>
<Paragraph position="2"> Table 2

Penalty  Conditions
0        Exact match of consonants or glides
5        Exact match of vowels (reflecting the fact that the aligner should prefer to match consonants rather than vowels if it must choose between the two)
10       Match of two vowels that differ only in length, or i and y, or u and w
30       Match of two dissimilar vowels
60       Match of two dissimilar consonants
100      Match of two segments with no discernible similarity
40       Skip preceded by another skip in the same word (reflecting the fact that affixes tend to be contiguous)
50       Skip not preceded by another skip in the same word

It follows that the input to the aligner should be in broad phonetic transcription, using symbols with closely similar values in both languages.
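To make the metric concrete, here is an illustrative Prolog sketch of Table 2 as a pair of penalty predicates, again not the paper's prototype. The vowel/1 and long/2 facts are assumptions added for the example and would have to be filled in for whatever transcription is actually in use; the step representation is the match/skip1/skip2 encoding assumed earlier, and "preceded by another skip" is read here as immediately preceded.

% Illustrative sketch of the Table 2 metric (not the paper's prototype).
% The vowel/1 and long/2 facts are assumptions, sufficient only for the
% Latin/Greek example; glides such as y and w count as "consonants or
% glides" because they are not listed as vowels.
vowel(a).  vowel(e).  vowel(i).  vowel(o).  vowel(u).
vowel('A'). vowel('E'). vowel('I'). vowel('O'). vowel('U').   % long vowels written as capitals
long('A', a). long('E', e). long('I', i). long('O', o). long('U', u).

% match_penalty(+X, +Y, -Penalty): cost of aligning segment X with segment Y.
match_penalty(X, X, 0)   :- \+ vowel(X), !.              % exact match of consonants or glides
match_penalty(X, X, 5)   :- vowel(X), !.                 % exact match of vowels
match_penalty(X, Y, 10)  :- close_vowels(X, Y), !.       % length difference only, or i/y, u/w
match_penalty(X, Y, 30)  :- vowel(X), vowel(Y), !.       % two dissimilar vowels
match_penalty(X, Y, 60)  :- \+ vowel(X), \+ vowel(Y), !. % two dissimilar consonants
match_penalty(_, _, 100).                                % no discernible similarity

close_vowels(X, Y) :- long(X, Y) ; long(Y, X).
close_vowels(i, y).  close_vowels(y, i).
close_vowels(u, w).  close_vowels(w, u).

% skip_penalty(+PreviousStepInSameWord, -Penalty)
skip_penalty(skip, 40).    % skip preceded by another skip in the same word
skip_penalty(other, 50).   % skip not preceded by another skip in the same word

% Total penalty of an alignment given as a list of match/skip1/skip2 steps.
% Prev1 and Prev2 record whether the immediately preceding step skipped
% word 1 or word 2; a match resets both flags.
alignment_penalty(Steps, Penalty) :-
    alignment_penalty(Steps, other, other, Penalty).

alignment_penalty([], _, _, 0).
alignment_penalty([match(X,Y)|S], _, _, P) :-
    match_penalty(X, Y, P0),
    alignment_penalty(S, other, other, P1),
    P is P0 + P1.
alignment_penalty([skip1(_)|S], Prev1, _, P) :-
    skip_penalty(Prev1, P0),
    alignment_penalty(S, skip, other, P1),
    P is P0 + P1.
alignment_penalty([skip2(_)|S], _, Prev2, P) :-
    skip_penalty(Prev2, P0),
    alignment_penalty(S, other, skip, P1),
    P is P0 + P1.

% Example: the alignment --dO-- / didOmi costs 50+40 for the leading skips,
% 0+5 for the matches of d and O, and 50+40 for the trailing skips.
% ?- alignment_penalty([skip2(d), skip2(i), match(d,d), match('O','O'),
%                       skip2(m), skip2(i)], P).
% P = 185.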
Excessively narrow phonetic transcriptions do not help; they introduce too many subtle mismatches that should have been ignored.</Paragraph>
<Paragraph position="3"> Phonemic transcriptions are acceptable insofar as they are also broad phonetic, but, unlike comparative reconstruction, alignment does not benefit by taking phonemes as the starting point. One reason is that alignment deals with syntagmatic rather than paradigmatic relations between sounds; what counts is the place of the sound in the word, not the place of the sound in the sound system. Another reason is that earlier and later languages are tied together more by the physical nature of the sounds than by the structure of the system. The physical sounds are handed down from earlier generations but the system of contrasts is constructed anew by every child learning to talk.</Paragraph>
<Paragraph position="4"> The aligner's only job is to line up words to maximize phonetic similarity. In the absence of known sound correspondences, it can do no more. Its purpose is to simulate a linguist's first look at unfamiliar data. Linguistic research is a bootstrapping process in which data leads to analysis and analysis leads to more and better-interpreted data. In its present form, the aligner does not participate in this process.</Paragraph> </Section> </Paper>