File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/84/p84-1091_metho.xml
Size: 11,594 bytes
Last Modified: 2025-10-06 14:11:44
<?xml version="1.0" standalone="yes"?> <Paper uid="P84-1091"> <Title>AN ALGORITHM FOR IDENTIFYING COGNATES BETWEEN RELATED LANGUAGES</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> AN ALGORITHM FOR IDENTIFYING COGNATES BETWEEN RELATED LANGUAGES </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The algorithm takes as its only input a list of words, preferably but not necessarily in phonemic transcription, in any two putatively related languages, and sorts it into decreasing order of probable cognation. Processing a 250-item bilingual list takes about five seconds of CPU time on a DEC KL1091 and requires 56 pages of core memory. The algorithm is given no information whatsoever about the phonemic transcription used, and even though cognate identification is carried out on the basis of a context-free one-for-one matching of individual characters, its cognation decisions are bettered by a trained linguist using more information only in cases of wordlists sharing less than 40% cognates and involving complex, multiple sound correspondences.</Paragraph> </Section> <Section position="3" start_page="0" end_page="449" type="metho"> <SectionTitle> I FUNDAMENTAL PROCEDURES </SectionTitle> <Paragraph position="0">
Gloss            Titia   Sese
&quot;sea&quot;            tasi    sah
&quot;father&quot;         tama    san
&quot;mother&quot;         mama    nan
&quot;tongue&quot;         mimi    nen
&quot;shellfish&quot;      sisi    hehe
&quot;bad&quot;            sati    has
&quot;to stand&quot;       ti      se
&quot;to come&quot;        ma      na
&quot;with&quot;           mi      ne
&quot;not&quot;            sa      ha
Take the first word pair, mata/nas. We have no information about the phonetic values of their constituent characters, and we do not know whether the same system of transcription was used in both wordlists: for all we know, &quot;a&quot; might denote a high back rounded vowel in Titia and a uvular trill in Sese.
The only assumption allowed is that in each word list the same characters represent, more or less, the same sounds. Under this assumption, the possibility that any one character of a member of a word pair may correspond to any character of the other member cannot be discarded. Thus in the pair mata/nas Titia &quot;m&quot; may correspond to Sese &quot;n&quot;, &quot;a&quot;, or &quot;s&quot;, and so may Titia &quot;a&quot;, &quot;t&quot;, and &quot;a&quot;. We summarize the evidence for these possible correspondences in a T x S matrix, where T is the number of different characters found in the Titia wordlist and S that in the Sese wordlist. Thus the evidence afforded by the first pair, mata/nas: [Matrix: Sese characters a, e, h, n, s as columns, followed by the sums of rows]</Paragraph> <Paragraph position="2"> If character correspondences between the Titia and Sese word pairs were random, the expected frequency e[i,j] of recorded possible correspondences between the ith character of the Titia alphabet and the jth of the Sese alphabet would be: $$e[i,j] = \frac{\text{sum of } i\text{th row} \times \text{sum of } j\text{th column}}{\text{sum of cells}}$$ giving a matrix of expected frequencies of possible sound correspondences: [...] Note how the six character correspondences with the greatest differences between observed and expected frequencies give the simple substitution code used for generating Sese words from pseudo-[...] Call the difference between the observed and the expected frequency of a character correspondence its weight (a much less primitive definition of weight is used in the actual implementation).</Paragraph> <Paragraph position="3"> Take the first word pair (mata/nas) and enter into a 4x3 matrix W the weights of its 12 possible character correspondences: [...] Call potential of a character correspondence the sum of its weight and of the highest potential of all possible character correspondences to its right, i.e.</Paragraph> <Paragraph position="5"> giving the matrix of potentials P for word pair [...] The character correspondence with the highest
potential is here m/n (P[1,1]=11.6). Of its possible successors, that with the highest potential is a/a (P[2,2]=5.76), itself followed by t/s (P[3,3]=2.48), which has no possible successor. Thus we have: [...] The same procedure applied to the rest of the wordlist gives the proper matches, Titia finals in polysyllabic words having been deleted when deriving the corresponding Sese words.</Paragraph> <Paragraph position="6"> C. A Relative Measure of Cognation Call index of cognation the maximum potential of a word pair divided by its number of correspondences, including null correspondences. Thus in the fictitious case of Titia and Sese the index of cognation of the pair mata/nas is 2.9 (its maximum potential, 11.60, divided by the number of correspondences, 4). Word pairs with high cognation indices are found to be more often genetically related than pairs with low cognation indices. II CURRENT IMPLEMENTATION A. Weights.</Paragraph> <Paragraph position="7"> The difference between observed and expected frequencies does not provide a satisfactory measurement of the weight of a possible character correspondence. Several alternative measurements were tested, out of which standardized scores were retained: the weight of a character correspondence was redefined as the probability of the discrepancy between its observed and expected frequencies of occurrence not being due to chance, expressed as a z score. Where absolute frequencies of 20 and less are involved, the exact probability is calculated and translated into a z score using a polynomial approximation (Abramowitz and Stegun 1970).</Paragraph> <Paragraph position="8"> B. Vowel/Consonant Correspondences Disallowing correspondences between vowels and consonants vastly improved the performance of the algorithm. No human intervention is needed to distinguish vowels from consonants, an improved version of an algorithm described in Suhotin 1962 being used to identify characters which represent vowel sounds.
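As an illustration only, the basic form of Suhotin's procedure can be sketched in Python as follows; this is not the improved version referred to above, whose details are not given. The function name `sukhotin_vowels` and the sample words are assumptions made for this sketch. Adjacency counts between different characters are summed per character, the character with the highest positive sum is classed as a vowel, and twice its adjacency count is subtracted from the sums of the remaining characters, until no positive sum is left.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Basic Suhotin procedure: return the set of characters classed as vowels."""
    # Symmetric matrix of adjacency counts between different characters.
    adjacency = defaultdict(lambda: defaultdict(int))
    for word in words:
        for left, right in zip(word, word[1:]):
            if left != right:  # the diagonal of the matrix stays at zero
                adjacency[left][right] += 1
                adjacency[right][left] += 1

    # Row sums; every character starts out classed as a consonant.
    sums = {ch: sum(adjacency[ch].values()) for ch in adjacency}
    vowels = set()
    while True:
        candidates = {ch: s for ch, s in sums.items() if ch not in vowels}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        if not candidates[best] > 0:
            break  # no remaining character has a positive sum
        vowels.add(best)
        # Subtract twice the adjacency count with the new vowel.
        for ch in candidates:
            if ch != best:
                sums[ch] -= 2 * adjacency[ch][best]
    return vowels


# Hypothetical input: a few Titia-like words from the fictitious example.
sample = ["mata", "tasi", "tama", "mama", "mimi", "sisi", "sati"]
print(sorted(sukhotin_vowels(sample)))  # ['a', 'i']
```

Characters that never stand next to a different character do not enter the adjacency matrix and are therefore left unclassified by this sketch.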
Whether consonants should be allowed to correspond to vowels is left as an option in the current implementation.</Paragraph> <Paragraph position="9"> C. Iterations Performance is again improved when word pairs showing individual character matches as computed from matrices of potentials (section IB above) are reprocessed. The weights of possible character correspondences are recomputed. This time, however, only characters in the same positions in the two words are scored as possible correspondences. Thus, for instance, the first pass of the algorithm having matched the &quot;m&quot; of &quot;mata&quot; to the &quot;n&quot; of &quot;nas&quot;, Titia &quot;m&quot; is scored in the second pass as corresponding possibly only to Sese &quot;n&quot;. Sequences of alternate null correspondences are collapsed so as not to preclude the identification of correspondences which might have been missed in the first pass, e.g. a pair mat/mot matched in the first pass as</Paragraph> <Paragraph position="11"> Weights of possible character correspondences having thus been recomputed, a new matrix of potentials and a new cognation index are computed for each word pair. Further iterations were found to yield negligible improvements to the results obtained.</Paragraph> <Paragraph position="12"> D. Improved Weights and Cognation Indices Frequent character correspondences often yield very high z scores (up to 1@.2). The presence of even one such high score in a word pair often invalidates the character-matching procedure. A number of alternative alterations to the definition of weight were tried, out of which the simplest proved best: weights beyond an arbitrary value are set to that value. Practice showed a maximum value of 3.0 to 4.0 to give the best results.
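The capping rule can be sketched in Python together with the matrix of potentials and the index of cognation of sections IB and IC. This is a minimal sketch under stated assumptions, not the Simula implementation: the function names, the weight values, and the ceiling of 3.5 are hypothetical; a successor of a correspondence is taken to be any cell lying in a later row and a later column; only weights above the ceiling are changed; and unmatched characters are counted as null correspondences.

```python
def cap_weights(weights, ceiling=3.5):
    """Set every weight above `ceiling` to `ceiling`; other weights are unchanged."""
    return [[min(w, ceiling) for w in row] for row in weights]


def potentials(weights):
    """P[i][j] = W[i][j] + highest P[k][l] with k after i and l after j (0 if none)."""
    rows, cols = len(weights), len(weights[0])
    pot = [[0.0] * cols for _ in range(rows)]
    for i in reversed(range(rows)):
        for j in reversed(range(cols)):
            successors = [pot[k][l]
                          for k in range(i + 1, rows)
                          for l in range(j + 1, cols)]
            pot[i][j] = weights[i][j] + (max(successors) if successors else 0.0)
    return pot


def best_chain(pot):
    """Start at the highest potential, then keep taking the best successor."""
    rows, cols = len(pot), len(pot[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    chain = []
    while cells:
        i, j = max(cells, key=lambda c: pot[c[0]][c[1]])
        chain.append((i, j))
        cells = [(k, l) for k in range(i + 1, rows) for l in range(j + 1, cols)]
    return chain


def cognation_index(pot, chain, len_a, len_b):
    """Maximum potential divided by the number of correspondences, nulls included."""
    correspondences = len_a + len_b - len(chain)  # unmatched characters pair with null
    i, j = chain[0]
    return pot[i][j] / correspondences


# Hypothetical z-score weights for mata/nas (rows m, a, t, a; columns n, a, s).
W = [[9.0, -1.0, -2.0],
     [-1.0, 5.0, 0.5],
     [-2.0, 0.0, 6.0],
     [-1.0, 5.0, 0.5]]
P = potentials(cap_weights(W, ceiling=3.5))
chain = best_chain(P)
print(chain)                            # [(0, 0), (1, 1), (2, 2)]
print(cognation_index(P, chain, 4, 3))  # 2.625
```

With these made-up weights the chain is m/n, a/a, t/s, leaving the final Titia &quot;a&quot; as a null correspondence, so the maximum potential is divided by 4. The ceiling of 3.5 is simply one value inside the 3.0 to 4.0 range found to work best.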
This is not surprising, since there is no significant difference in the degrees of certainty corresponding to z scores of 4 and beyond.</Paragraph> <Paragraph position="13"> The last improvement in the performance of the algorithm to date was brought by a redefinition of the cognation index. Once the individual character matches of a word pair have been identified from its matrix of potentials, their weights are adjusted as follows: 1) Positive weights less than 1.28 (corresponding to a 90% significance level) are set to zero; negative weights and weights greater than 1.28 are left unchanged.</Paragraph> <Paragraph position="14"> 2) Positive weights of character-to-zero matches are set to zero; negative weights are left unchanged.</Paragraph> <Paragraph position="15"> The cognation index is then defined as the sum of the adjusted weights divided by the number of matches, e.g. (an actual example from two languages of Vanuatu):</Paragraph> </Section> <Section position="4" start_page="449" end_page="450" type="metho"> <SectionTitle> III PERFORMANCE OF THE ALGORITHM </SectionTitle> <Paragraph position="0"> The algorithm as described has been implemented in Simula 67 on a DEC KL1091 and applied to a corpus of some 300 words in 75 languages and dialects of Vanuatu. Results are excellent for languages sharing 40% or more cognates, even when sound correspondences are complex. They deteriorate rapidly when lesser proportions of cognates and complex sound correspondences are involved, but remain excellent when mainly one-to-one correspondences are present. Thus, for instance, Sakao and Tolomako (Espiritu Santo, Vanuatu) were given as sharing 38.91% cognates (cut-off cognation index: 1.28), as against a human estimate of 41% backed by a full knowledge of their diachronic phonologies and comparisons with other related languages. Out of the 50 word pairs with the highest cognation indices, only two (the 38th and the 45th) were definitely not cognate and one (the 36th) was doubtful.
Yet, Sakao has undergone extremely complex phonological changes, viz.:</Paragraph> </Section> <Section position="5" start_page="450" end_page="450" type="metho"> <SectionTitle> IV FURTHER IMPROVEMENTS </SectionTitle> <Paragraph position="0"> The identification of environment-conditioned phonological correspondences is the next, most obvious stage in further improving the algorithm. This problem has of course been, and is being, investigated. Difficulties arise from the fact that frequencies of possible correspondences in any given environment become too low to be handled by statistical tests. Other approaches -- inspired by chess-playing programs -- have been tried, but have proved too expensive in computer time so far. A further, highly desirable improvement is the identification of rules of metathesis. The solution to this problem appears to be subordinated to that of the discovery of context-sensitive rules.</Paragraph> <Paragraph position="1"> V PURPOSE OF THE ALGORITHM A bilingual wordlist is conceptually equivalent to a bilingual text: words of a list correspond to sentences of a text, phonemes of a word to morphemes of a sentence, cognate pairs to segments of the same meaning, and non-cognates to segments of different meanings. The algorithm described is the present state of an attempted solution to the following, much more general problem: given two texts of approximately equal lengths in two different languages, determine whether one is the translation of the other -- or both are translations of a text in a third language -- wholly or in part, and if so, establish the rules for translating one into the other.</Paragraph> </Section> </Paper>