File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/p98-1043_metho.xml
Size: 8,849 bytes
Last Modified: 2025-10-06 14:14:57
Alignment of Multiple Languages for Historical Comparison

3 Applying an evaluation metric

The phonetic similarity criterion used by Covington (1996) is shown in Table 1. It is obviously just a stand-in for a more sophisticated, perhaps feature-based, system of phonology. The algorithm computes a "badness" or "penalty" for each step (column) in the alignment and sums the values to judge the badness of the whole alignment. The alignment with the lowest total badness is the one with the greatest phonetic similarity.

Note that two separate skips count exactly the same as one complete mismatch, so alignments that differ only in this respect are equally valued. In fact, a "no-alternating-skips rule" prevents the second one from being generated; deciding whether [e] and [I] correspond is left for another, unstated, part of the comparison process. I will explain below why this is not satisfactory.

Naturally, the alignment with the best overall phonetic similarity is not always the etymologically correct one, although it is usually close; we are looking for a good phonetic fit, not necessarily the best one.

4 Generalizing to three or more languages

When a guided search is involved, aligning strings from three or more languages is not simply a matter of finding the best alignment of the first two, then adding a third, then a fourth, and so on. An algorithm that aligns two strings therefore cannot be applied iteratively to align more than two.

The reason is that the best overall alignment of three or more strings is not necessarily the best alignment of any given pair in the set. Fox (1995:68) gives a striking example, originally from Haas (1969). The best alignment of the Choctaw and Cree words for 'squirrel' appears to be:

    Choctaw   f a n i
    Cree      - i ! u

Here the correspondence [a]:[i] is problematic. Add the Koasati word, though, and it becomes clear that the correct alignment pairs the segments differently. Any algorithm that started by finding the best alignment of Choctaw against Cree would miss this solution.

A much better strategy is to evaluate each column of the alignment (I'll call it a "step") before generating the next column: evaluate the first step, then the second step, and so on. At each step, the total badness is computed by comparing each segment to all of the other segments; the total badness of a step containing segments x, y, and z is badness(x,y) + badness(x,z) + badness(y,z). That way, no string gets aligned against another without considering the rest of the strings in the set.

Another detail has to do with skips. Empirically, I found that the badness of a step that aligns [f] and [p] against a skip in a third string comes out too high if computed as badness(f,p) + badness(p,-) + badness(f,-); that is, the algorithm is too reluctant to take skips. The reason, intuitively, is that in such a step there is really only one skip, not two separate skips (one skipping [f] and one skipping [p]). This becomes even more apparent when more than three strings are being aligned.

Accordingly, when computing badness I count each skip only once (assessing it 50 points) and then ignore skips when comparing the remaining segments against each other. I have not implemented the rule from Covington (1996) that gives a reduced penalty for adjacent skips in the same string, which reflects the fact that affixes tend to be contiguous.
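To make the column-evaluation scheme concrete, here is a minimal sketch in Python of how the badness of a single step might be computed as just described: each skip is charged a flat 50 points, and skips are then excluded from the pairwise comparisons. The function and variable names are mine, and pair_badness is only a placeholder standing in for the penalty table of Covington (1996), not the published values.

    SKIP = "-"          # symbol for a skip in an alignment column
    SKIP_PENALTY = 50   # flat charge per skip, as described above

    def pair_badness(a, b):
        """Placeholder for the phonetic-similarity penalty (Table 1).
        Any function giving 0 to identical segments and larger values
        to worse matches could be substituted here."""
        return 0 if a == b else 100

    def step_badness(column):
        """Badness of one step (one column, one symbol per string).
        Each skip is counted once; skips are then ignored when the
        remaining segments are compared pairwise."""
        total = SKIP_PENALTY * sum(1 for seg in column if seg == SKIP)
        segments = [seg for seg in column if seg != SKIP]
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                total += pair_badness(segments[i], segments[j])
        return total

    def alignment_badness(columns):
        """Total badness of an alignment: the sum over its steps."""
        return sum(step_badness(col) for col in columns)

Under this scheme, step_badness(('f', 'p', '-')) charges one 50-point skip plus the [f]:[p] comparison, rather than two separate skips, which is the behaviour argued for above.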
5 Searching the set of alignments

The standard way to find the best alignment of two strings is a matrix-based technique known as dynamic programming (Ukkonen 1985, Waterman 1995). However, dynamic programming cannot accommodate rules that look ahead along the string to recognize assimilation or metathesis, a possibility that needs to be left open when implementing comparative reconstruction. Additionally, the generalization of dynamic programming to multiple strings does not appear to be an entirely solved problem (cf. Kececioglu 1993).

Accordingly, I follow Covington (1996) in recasting the problem as a tree search. Consider the problem of aligning [el] with [le]. Covington (1996) treats this as a process that steps through both strings and, at each step, performs either a "match" (accepting a character from both strings), a "skip-1" (skipping a character in the first string), or a "skip-2" (skipping a character in the second string). That results in the search tree shown in Fig. 1 (ignoring Covington's "no-alternating-skips rule").

The search tree can be generalized to multiple strings by breaking up each step into a series of operations, one on each string, as shown in Fig. 2. Instead of three choices (match, skip-1, and skip-2), there are really 2 × 2 = 4: accept or skip on string 1, and then accept or skip on string 2. One of the four combinations is disallowed: you can't have a step in which no characters are accepted from any string.

Similarly, if there were three strings, there would be three two-way decisions, leading to eight (= 2³) states, one of which would be disallowed. Using search trees of this type, the decisions necessary to align any number of strings can be strung together in a satisfactory way.
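As an illustration of the generalized search step, the following Python sketch (the names are mine) enumerates the per-step decisions available when n strings are being aligned: every combination of accept-or-skip choices except the one that accepts no character from any string, i.e. 2^n - 1 possibilities per step.

    from itertools import product

    def step_choices(n_strings):
        """All accept/skip combinations for one step of an alignment of
        n strings.  Each choice is a tuple of booleans, one per string:
        True means 'accept the next character of that string', False
        means 'that string contributes a skip at this step'.  The
        all-False combination is disallowed, because some character must
        be accepted at every step, leaving 2**n - 1 choices."""
        return [choice for choice in product((True, False), repeat=n_strings)
                if any(choice)]

    print(len(step_choices(2)))  # 3 choices, as in the two-string case
    print(len(step_choices(3)))  # 7, i.e. 2**3 - 1, as described above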
When [el] is aligned with [le], the first segments can either be aligned with each other ([e] with [l]) or skipped in tandem. Covington (1996) treats the two resulting alignments as equivalent and generates only the first of them, leaving it to some later step in the comparison process to decide whether [e] and [l] really correspond. The rule is:

NO-ALTERNATING-SKIPS RULE: If there is a skip in one string, there cannot be a skip in the other string at the next step.

Although this tactic narrows the search space, I do not think it is linguistically satisfactory; after all, aligning [e] with [l] and skipping them in tandem are quite different linguistic claims. Consider, for example, the final segments of Spanish [dos] and Italian [due] 'two': it is correct to skip the [s] and the [e] in tandem because they come from different Latin endings, and it is not historically correct to pair [s] with [e] in a correspondence set.

Also, the no-alternating-skips rule does not generalize easily to multiple strings. I therefore replace it with a different restriction:

ORDERED-ALTERNATING-SKIPS RULE: A skip can be taken in strings i and j in successive steps only if i ≤ j.

That lets us generate only one of each pair of undeniably equivalent skip orders. It also ensures that there is only one way of skipping several consecutive segments; we get

    - - - a b c
    d e f - - -

but not

    - a - b - c
    d - e - f -

or numerous other equivalent combinations of skips.

7 Pruning the search

The goal of the algorithm is, of course, to generate not the whole search tree but only the parts of it likely to contain the best alignments, thereby narrowing the intractably large search space into something manageable. Following Covington (1996), I implemented a very simple pruning strategy. The program keeps track of the badness of the best complete alignment found so far, and every branch of the search tree is abandoned as soon as its total badness exceeds that value. Thus, bad alignments are abandoned when they have only partly been generated.

A second part of the strategy is that the computer always tries matches before it tries skips. As a result, if not much material needs to be skipped, a good alignment is found very quickly. For example, with three four-character strings, the program finds a good alignment after completing only ten other alignments, although it also pursues several hundred branches of the tree part of the way. (In that alignment the match of [s] with [y] is problematic, but the computer can't know that; it also finds a number of alternative alignments.)
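To show how the pruning strategy might be realized, here is a sketch of the depth-first search in Python. It is an illustration under my own assumptions, not the author's program: pair_badness and step_badness are the placeholders from the earlier sketch, the bound is the badness of the best complete alignment found so far, and choices with fewer skips (matches) are tried first; the ordered-alternating-skips restriction is omitted for brevity.

    from itertools import product

    SKIP = "-"
    SKIP_PENALTY = 50

    def pair_badness(a, b):
        # Placeholder penalty: 0 for identical segments, 100 otherwise.
        return 0 if a == b else 100

    def step_badness(column):
        # One flat charge per skip; skips excluded from pairwise comparison.
        total = SKIP_PENALTY * sum(1 for seg in column if seg == SKIP)
        segments = [seg for seg in column if seg != SKIP]
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                total += pair_badness(segments[i], segments[j])
        return total

    def align(strings):
        """Branch-and-bound search for the lowest-badness alignment."""
        n = len(strings)
        best = {"badness": float("inf"), "alignment": None}

        # Accept/skip decisions for one step, fewest skips (matches) first.
        choices = sorted((c for c in product((True, False), repeat=n) if any(c)),
                         key=lambda c: sum(not take for take in c))

        def search(positions, columns, badness_so_far):
            if badness_so_far >= best["badness"]:
                return                              # abandon this branch
            if all(p == len(s) for p, s in zip(positions, strings)):
                best["badness"] = badness_so_far    # new best complete alignment
                best["alignment"] = list(columns)
                return
            for choice in choices:
                # A string can only contribute a character if any are left.
                if any(take and positions[k] == len(strings[k])
                       for k, take in enumerate(choice)):
                    continue
                column = tuple(strings[k][positions[k]] if take else SKIP
                               for k, take in enumerate(choice))
                new_positions = tuple(p + 1 if take else p
                                      for p, take in zip(positions, choice))
                search(new_positions, columns + [column],
                       badness_so_far + step_badness(column))

        search(tuple(0 for _ in strings), [], 0)
        return best["alignment"], best["badness"]

Because matches are tried first, a complete low-badness alignment tends to be found early, and its badness then prunes most of the remaining branches.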