<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1035"> <Title>Measuring Language Divergence by Intra-Lexical Comparison</Title> <Section position="4" start_page="273" end_page="275" type="metho"> <SectionTitle> 2 Lexical Metric </SectionTitle> <Paragraph position="0"> The first question posed by the distance-based approach to genetic language taxonomy is: what should we compare? In some approaches (Kondrak, 2002; McMahon et al., 2005; Heggarty et al., 2005; Nerbonne and Heeringa, 1997), the answer to this question is that we should compare the phonetic or phonological realisations of a particular set of meanings across the range of languages being studied. There are a number of problems with using lexical forms in this way.</Paragraph> <Paragraph position="1"> Firstly, in order to compare forms from different languages, we need to embed them in a common phonetic space. This phonetic space provides granularity, marking two phones as identical or distinct, and, where a graded measure of phonetic distinction is available, it quantifies that difference.</Paragraph> <Paragraph position="2"> There is growing doubt in the field of phonology and phonetics about the meaningfulness of assuming a common phonetic space. Port and Leary (2005) argue convincingly that this assumption, while having played a fundamental role in much recent linguistic theorising, is nevertheless unfounded. The degree of difference between sounds, and consequently the degree of phonetic difference between words, can only be ascertained within the context of a single language.</Paragraph> <Paragraph position="3"> It may be argued that a common phonetic space can be found in either acoustics or the degrees of freedom of the speech articulators. Language-specific categorisation of sound, however, often restructures this space, sometimes with distinct sounds being treated as homophones. One example of this is the realisation of orthographic rr in European Portuguese: it is indifferently realised with an apical or a uvular trill, different sounds made at distinct points of articulation.</Paragraph> <Paragraph position="4"> If there is no language-independent, common phonetic space with an equally common similarity measure, there can be no principled approach to comparing forms in one language with those of another.</Paragraph> <Paragraph position="5"> In contrast, language-specific word similarity is well-founded. A number of psycholinguistic models of spoken word recognition (Luce et al., 1990) are based on the idea of lexical neighbourhoods.</Paragraph> <Paragraph position="6"> When a word is accessed during processing, other words that are phonemically or orthographically similar are also activated. This effect can be detected using experimental paradigms such as priming.</Paragraph> <Paragraph position="7"> Our approach, therefore, is to abandon the cross-linguistic comparison of phonetic realisations in favour of language-internal comparison of forms. (See also work by Shillcock et al. (2001) and Tamariz (2005).)</Paragraph> <Section position="1" start_page="273" end_page="273" type="sub_section"> <SectionTitle> 2.1 Confusion probabilities </SectionTitle> <Paragraph position="0"> One psychologically well-grounded way of describing the similarity of words is in terms of their confusion probabilities. Two words have a high confusion probability if it is likely that one word could be produced or understood when the other was intended.
This type of confusion can be measured experimentally by giving subjects words in noisy environments and measuring what they apprehend. A less pathological way in which confusion probability is realised is in coactivation. If a person hears a word, then they more easily and more quickly recognise similar words. This coactivation occurs because the phonological realisation of words is not completely separate in the mind.</Paragraph> <Paragraph position="1"> Instead, realisations are interdependent with the realisations of similar words.</Paragraph> <Paragraph position="2"> We propose that confusion probabilities are ideal information to constitute the lexical metric. They are language-specific, psychologically grounded, can be determined by experiment, and integrate with existing psycholinguistic models of word recognition.</Paragraph> </Section> <Section position="2" start_page="273" end_page="274" type="sub_section"> <SectionTitle> 2.2 NAM and beyond </SectionTitle> <Paragraph position="0"> Unfortunately, experimentally determined confusion probabilities for a large number of languages are not available. Fortunately, models of spoken word recognition allow us to predict these probabilities from easily computable measures of word similarity.</Paragraph> <Paragraph position="1"> For example, the neighbourhood activation model (NAM) (Luce et al., 1990; Luce and Pisoni, 1998) predicts confusion probabilities from the relative frequency of words in the neighbourhood of the target. Words are in the neighbourhood of the target if their Levenstein (1965) edit distance from the target is one. The more frequent a word is, the greater its likelihood of replacing the target. Bailey and Hahn (2001) argue, however, that the all-or-nothing nature of the lexical neighbourhood is insufficient. Instead, word similarity is the more complex function of frequency and phonetic similarity shown in equation (1). Here A, B, C and D are constants of the model, u and v are words, and d is a phonetic similarity measure.</Paragraph> <Paragraph position="3"> We have adapted this model slightly, in line with NAM, taking the similarity s to be the probability of confusing stimulus v with form u. Also, as our data usually offers no frequency information, we have adopted the maximum entropy assumption, namely that all relative frequencies are equal. Consequently, the probability of confusion of two words depends solely on their similarity distance.</Paragraph> <Paragraph position="4"> While this assumption degrades the psychological reality of the model, it does not render it useless, as the similarity measure continues to provide important distinctions in neighbourhood confusability.</Paragraph> <Paragraph position="5"> We also assume, for simplicity, that the constant D has the value 1.</Paragraph> <Paragraph position="6"> With these simplifications, equation (2) shows the probability of apprehending word w, out of a set W of possible alternatives, given a stimulus word w_s.</Paragraph> <Paragraph position="7"> P(w | w_s) = e^{-d(w, w_s)} / N(w_s) (2) </Paragraph> <Paragraph position="8"> The normalising constant N(w_s) is the sum of the non-normalised values e^{-d(w', w_s)} for all words w' in W:</Paragraph> <Paragraph position="10"> N(w_s) = \sum_{w' \in W} e^{-d(w', w_s)} </Paragraph> </Section>
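As an illustration of equation (2) under these simplifications (D = 1, uniform frequencies), the following sketch computes a table of confusion probabilities from an arbitrary word-distance function. The function and variable names are ours rather than the paper's, and the distance argument would in practice be the scaled edit distance introduced in the next subsection.

    import math

    def confusion_probabilities(words, distance):
        # Equation (2): P(w | w_s) = exp(-d(w, w_s)) / N(w_s), where the
        # normalising constant N(w_s) sums exp(-d(w', w_s)) over all words in W.
        probs = {}
        for ws in words:  # stimulus word w_s
            weights = {w: math.exp(-distance(w, ws)) for w in words}
            norm = sum(weights.values())  # N(w_s)
            for w in words:
                probs[(w, ws)] = weights[w] / norm
        return probs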
<Section position="3" start_page="274" end_page="274" type="sub_section"> <SectionTitle> 2.3 Scaled edit distances </SectionTitle> <Paragraph position="0"> Kidd and Watson (1992) have shown that the discriminability of frequency and of duration of tones in a tone sequence depends on its length as a proportion of the length of the sequence. Kapatsinski (2006) uses this, with other evidence, to argue that word recognition edit distances must be scaled by word length.</Paragraph> <Paragraph position="1"> There are other reasons for coming to the same conclusion. The simple Levenstein distance exaggerates the disparity between long words in comparison with short words. A word consisting of 10 symbols, purely by virtue of its length, will on average be marked as more different from other words than a word of length two. For example, the Levenstein distance between interested and rest is six, the same as the distance between rest and by, even though the latter two have nothing in common. As a consequence, close phonetic transcriptions, which by their very nature are likely to involve more symbols per word, will result in larger edit distances than broad phonemic transcriptions of the same data.</Paragraph> <Paragraph position="2"> To alleviate this problem, we define a new edit distance function d_2 which scales Levenstein distances by the average length of the words being compared (see equation 3). Now the distance between interested and rest is 0.86, while that between rest and by is 2.0, reflecting the greater relative difference in the second pair.</Paragraph> <Paragraph position="3"> d_2(u, v) = d(u, v) / ((|u| + |v|) / 2) (3) </Paragraph> <Paragraph position="4"> Note that by scaling the raw edit distance with the average length of the words, we preserve the symmetry of the distance measure. There are other methods of comparing strings, for example string kernels (Shawe-Taylor and Cristianini, 2004), but using Levenstein distance keeps us consistent with the psycholinguistic accounts of word similarity.</Paragraph> </Section> <Section position="4" start_page="274" end_page="275" type="sub_section"> <SectionTitle> 2.4 Lexical Metric </SectionTitle> <Paragraph position="0"> Bringing this all together, we can define the lexical metric.</Paragraph> <Paragraph position="1"> A lexicon L is a mapping from a set of meanings M, such as &quot;DOG&quot;, &quot;TO RUN&quot;, &quot;GREEN&quot;, etc., onto a set F of forms such as /pies/, /biec/, /zielony/.</Paragraph> <Paragraph position="2"> The confusion probability P of m1 for m2 in lexicon L is the normalised negative exponential of the scaled edit distance of the corresponding forms (equation 4). It is worth noting that when frequencies are assumed to follow the maximum entropy distribution, this connection between confusion probabilities and distances is the same as that proposed by Shepard (1987).</Paragraph> <Paragraph position="3"> P(m1 | m2; L) = e^{-d_2(L(m1), L(m2))} / N(m2; L) (4) </Paragraph> <Paragraph position="4"> A lexical metric of L is the mapping LM(L) : M^2 -> [0, 1] which assigns to each pair of meanings m1, m2 the probability of confusing m1 for m2, scaled by the frequency of m2.</Paragraph> <Paragraph position="5"> LM(L)(m1, m2) = e^{-d_2(L(m1), L(m2))} / (|M| N(m2; L)) </Paragraph> <Paragraph position="6"> where N(m2; L) is the normalising function defined in equation (5).</Paragraph> <Paragraph position="7"> N(m2; L) = \sum_{m \in M} e^{-d_2(L(m), L(m2))} (5) </Paragraph> <Paragraph position="8"> Table 1 shows a minimal lexicon consisting only of the numbers one to five, and a corresponding lexical metric. (Table 1: a lexicon and lexical metric for the numbers one to five; the rows and columns of the metric are labelled one, two, three, four, five.) The values in the lexical metric are the inferred word confusion probabilities. The matrix is normalised so that the sum of each row is 0.2, i.e. one-fifth for each of the five words, so the total of the matrix is one. Note that the diagonal values vary because the off-diagonal values in each row vary, and consequently, so does the normalisation for the row.</Paragraph> </Section> </Section>
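To make the construction concrete, here is a small sketch of equations (3)-(5) that builds a toy lexical metric in the spirit of Table 1. It is illustrative only: the English forms, the function names, and the substitution cost of 2 (our reading of the worked example, in which both quoted raw distances equal six) are assumptions, not taken from the paper.

    import math

    def edit_distance(u, v, sub_cost=2):
        # Dynamic-programming edit distance. With sub_cost=2 (a substitution
        # counted as a deletion plus an insertion) the figures quoted in the
        # text are reproduced: d('interested','rest') = 6 and d('rest','by') = 6.
        m, n = len(u), len(v)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if u[i - 1] == v[j - 1] else sub_cost
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub)
        return dp[m][n]

    def scaled_distance(u, v):
        # Equation (3): raw edit distance scaled by the average word length.
        return edit_distance(u, v) / ((len(u) + len(v)) / 2)

    def lexical_metric(lexicon):
        # Equations (4)-(5): confusion probabilities over meaning pairs, with
        # each row (fixed m2) scaled by the uniform frequency 1/|M|, so the
        # whole matrix sums to one.
        meanings = sorted(lexicon)
        M = len(meanings)
        metric = {}
        for m2 in meanings:
            weights = {m1: math.exp(-scaled_distance(lexicon[m1], lexicon[m2]))
                       for m1 in meanings}
            norm = sum(weights.values())  # N(m2; L), equation (5)
            for m1 in meanings:
                metric[(m1, m2)] = weights[m1] / (M * norm)
        return metric

    # Toy example in the spirit of Table 1 (English forms, for illustration only):
    english = {"ONE": "one", "TWO": "two", "THREE": "three",
               "FOUR": "four", "FIVE": "five"}
    lm = lexical_metric(english)
    assert abs(sum(lm.values()) - 1.0) < 1e-9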
<Section position="5" start_page="275" end_page="276" type="metho"> <SectionTitle> 3 Language-Language Distance </SectionTitle> <Paragraph position="0"> In the previous section, we introduced the lexical metric as the key measurable for comparing languages. Since lexical metrics are probability distributions, comparing metrics means measuring the difference between probability distributions. To do this, we use two measures: the symmetric Kullback-Leibler divergence (Jeffreys, 1946) and the Rao distance (Rao, 1949; Atkinson and Mitchell, 1981; Micchelli and Noakes, 2005) based on Fisher information (Fisher, 1959). Both can be defined in terms of the geometric path from one distribution to another.</Paragraph> <Section position="1" start_page="275" end_page="275" type="sub_section"> <SectionTitle> 3.1 Geometric paths </SectionTitle> <Paragraph position="0"> The geometric path between two distributions P and Q is a conditional distribution R with a continuous parameter a such that at a = 0 the distribution is P, and at a = 1 it is Q. This conditional distribution is called the geometric path because it consists of normalised weighted geometric means of the two defining distributions (equation 6).</Paragraph> <Paragraph position="1"> R(x; a) = P(x)^{1-a} Q(x)^{a} / k(a; P, Q) (6) </Paragraph> <Paragraph position="2"> The function k(a; P, Q) is a normaliser for the conditional distribution, being the sum of the weighted geometric means of values from P and Q (equation 7). This value is known as the Chernoff coefficient or Hellinger path (Basseville, 1989). For brevity, the P, Q arguments to k will be treated as implicit and not expressed in equations.</Paragraph> <Paragraph position="4"> k(a) = \sum_{x} P(x)^{1-a} Q(x)^{a} (7) </Paragraph> </Section> <Section position="2" start_page="275" end_page="275" type="sub_section"> <SectionTitle> 3.2 Kullback-Leibler distance </SectionTitle> <Paragraph position="0"> The first-order differential of the normaliser with regard to a (equation 8) is of particular interest.</Paragraph> <Paragraph position="1"> k'(a) = \sum_{x} P(x)^{1-a} Q(x)^{a} \log (Q(x)/P(x)) (8) </Paragraph> <Paragraph position="2"> At a = 0, this is the negative of the Kullback-Leibler distance KL(P|Q) of Q with regard to P (Basseville, 1989). At a = 1, it is the Kullback-Leibler distance KL(Q|P) of P with regard to Q. Jeffreys' (1946) measure is a symmetrisation of the KL distance, obtained by averaging the two directions (equations 9, 10).</Paragraph> <Paragraph position="4"> J(P, Q) = (KL(P|Q) + KL(Q|P)) / 2 = (k'(1) - k'(0)) / 2 </Paragraph> </Section> <Section position="3" start_page="275" end_page="276" type="sub_section"> <SectionTitle> 3.3 Rao distance </SectionTitle> <Paragraph position="0"> The Rao distance depends on the second-order differential of the normaliser with regard to a (equation 11).</Paragraph> <Paragraph position="1"> k''(a) = \sum_{x} P(x)^{1-a} Q(x)^{a} (\log (Q(x)/P(x)))^{2} (11) </Paragraph> <Paragraph position="2"> Equation (13) expresses the Fisher information along the path R from P to Q at point a using k and its first two derivatives.</Paragraph> <Paragraph position="4"> FI(a) = k''(a)/k(a) - (k'(a)/k(a))^{2} (13) </Paragraph> <Paragraph position="5"> The Rao distance r(P, Q) along R can be approximated by the square root of the Fisher information at the midpoint of the path (equation 14).</Paragraph> <Paragraph position="6"> r(P, Q) ~ \sqrt{FI(0.5)} (14) </Paragraph> </Section> <Section position="4" start_page="276" end_page="276" type="sub_section"> <SectionTitle> 3.4 The PHILOLOGICON algorithm </SectionTitle> <Paragraph position="0"> Bringing these pieces together, the PHILOLOGICON algorithm for measuring the divergence between two languages has the following steps: 1. determine their joint confusion probability matrices, P and Q; 2. substitute these into equation (7), equation (8) and equation (11) to calculate k(0), k(0.5), k(1), k'(0.5), and k''(0.5); 3. put these into equation (10) and equation (14) to calculate the KL and Rao distances between the languages.</Paragraph> </Section> </Section>
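The following sketch puts these steps together for two lexical metrics represented as dictionaries mapping outcomes (meaning pairs) to probabilities. It is a minimal illustration of Sections 3.1-3.4, not the PHILOLOGICON implementation itself: the names are ours, and for the symmetrised KL divergence we evaluate k'(0) and k'(1) directly rather than only the midpoint quantities listed in step 2.

    import math

    def chernoff_terms(P, Q, a):
        # Weighted geometric means P(x)^(1-a) * Q(x)^a over the shared support
        # (the numerator of equation (6)).
        return {x: (P[x] ** (1 - a)) * (Q[x] ** a) for x in P if x in Q}

    def k_and_derivatives(P, Q, a):
        # k(a) (equation 7) and its first two derivatives with respect to a
        # (equations 8 and 11), using d/da of P^(1-a) Q^a = P^(1-a) Q^a log(Q/P).
        k = k1 = k2 = 0.0
        for x, g in chernoff_terms(P, Q, a).items():
            log_ratio = math.log(Q[x] / P[x])
            k += g
            k1 += g * log_ratio
            k2 += g * log_ratio ** 2
        return k, k1, k2

    def divergences(P, Q):
        # Symmetrised KL divergence and approximate Rao distance between two
        # lexical metrics, following the steps of the algorithm above.
        _, k1_at_0, _ = k_and_derivatives(P, Q, 0.0)
        _, k1_at_1, _ = k_and_derivatives(P, Q, 1.0)
        # k'(0) = -KL(P|Q) and k'(1) = KL(Q|P), so averaging the two directions:
        sym_kl = 0.5 * (k1_at_1 - k1_at_0)
        k, k1, k2 = k_and_derivatives(P, Q, 0.5)
        fisher = k2 / k - (k1 / k) ** 2   # Fisher information at the midpoint (13)
        rao = math.sqrt(fisher)           # midpoint approximation to the Rao distance (14)
        return sym_kl, rao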
<Section position="6" start_page="276" end_page="278" type="metho"> <SectionTitle> 4 Indo-European </SectionTitle> <Paragraph position="0"> The ideal data for reconstructing Indo-European would be an accurate phonemic transcription of words used to express specifically defined meanings. Sadly, this kind of data is not readily available. However, as a stop-gap measure, we can adopt the data that Dyen et al. collected to construct an Indo-European taxonomy using the cognate method.</Paragraph> <Section position="1" start_page="276" end_page="276" type="sub_section"> <SectionTitle> 4.1 Dyen et al.'s data </SectionTitle> <Paragraph position="0"> Dyen et al. (1992) collected 95 data sets, each pairing a meaning from a Swadesh (1952)-like 200-word list with its expression in the corresponding language. The compilers annotated the data with cognacy relations as part of their own taxonomic analysis of Indo-European.</Paragraph> <Paragraph position="1"> There are problems with using Dyen's data for the purposes of the current paper. Firstly, the word forms collected are not phonetic, phonological or even full orthographic representations. As the authors state, the forms are expressed in sufficient detail to allow an interested reader acquainted with the language in question to identify which word is being expressed.</Paragraph> <Paragraph position="2"> Secondly, many meanings offer alternative forms, presumably corresponding to synonyms.</Paragraph> <Paragraph position="3"> For a human analyst using the cognate approach, this means that a language can participate in two (or more) word-derivation systems. In preparing this data for processing, we have consistently chosen the first of any alternatives.</Paragraph> <Paragraph position="4"> A further difficulty lies in the fact that many languages are not represented by the full 200 meanings. Consequently, in comparing lexical metrics from two data sets, we frequently need to restrict the metrics to only those meanings expressed in both sets. This means that the KL divergence or Rao distance between two languages was measured on lexical metrics cropped and rescaled to the meanings common to both data sets. In most cases, this was still more than 190 words.</Paragraph> <Paragraph position="5"> Despite these mismatches between Dyen et al.'s data and our needs, it provides a testbed for the PHILOLOGICON algorithm. Our reasoning is that if the method is successful with this data, it is reasonably reliable. Data was extracted to language-specific files and preprocessed to clean up problems such as those described above. An additional data set containing random data was added to act as an outlier to root the tree.</Paragraph> </Section> <Section position="2" start_page="276" end_page="276" type="sub_section"> <SectionTitle> 4.2 Processing the data </SectionTitle> <Paragraph position="0"> The PHILOLOGICON software was then used to calculate the lexical metrics corresponding to the individual data files and to measure the KL divergences and Rao distances between them. The program NEIGHBOR from the PHYLIP package was used to construct trees from the results.</Paragraph> </Section>
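As a sketch of the preprocessing just described, the fragment below keeps the first of any alternative forms, crops two lexica to their shared meanings, and rebuilds the lexical metrics on that common set before measuring the divergences. It reuses the lexical_metric and divergences functions from the earlier sketches; the comma separator for alternatives is an assumption rather than the actual format of Dyen et al.'s files. The resulting pairwise distance table is what would then be handed to a tree-building program such as NEIGHBOR.

    def first_alternative(field):
        # Keep only the first listed form for a meaning (separator assumed).
        return field.split(",")[0].strip()

    def language_distances(lex_a, lex_b):
        # Crop both lexica to the meanings they share, rebuild the metrics on
        # that common meaning set, and return (symmetrised KL, Rao distance).
        shared = sorted(set(lex_a) & set(lex_b))
        P = lexical_metric({m: lex_a[m] for m in shared})
        Q = lexical_metric({m: lex_b[m] for m in shared})
        return divergences(P, Q)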
<Section position="3" start_page="276" end_page="278" type="sub_section"> <SectionTitle> 4.3 The results </SectionTitle> <Paragraph position="0"> The tree based on Rao distances is shown in figure 1. The discussion follows this tree, except in the few cases where differences in the KL tree are mentioned.</Paragraph> <Paragraph position="1"> The standard against which we measure the success of our trees is the conservative traditional taxonomy to be found in the Ethnologue (Grimes and Grimes, 2000). The fit with this taxonomy was so good that we have labelled the major branches with their traditional names: Celtic, Germanic, etc. In fact, in most cases the branch-internal divisions -- e.g. Brythonic/Goidelic in Celtic, Western/Northern in Germanic -- also accord.</Paragraph> <Paragraph position="2"> Note that PHILOLOGICON even groups Baltic and Slavic together into a super-branch, Balto-Slavic.</Paragraph> <Paragraph position="3"> Where languages are clearly out of place in comparison to the traditional taxonomy, these are highlighted: visually in the tree, and verbally in the following text. In almost every case, there are obvious contact phenomena which explain the deviation from the standard taxonomy.</Paragraph> <Paragraph position="4"> Armenian was grouped with the Indo-Iranian languages. Interestingly, Armenian was at first thought to be an Iranian language, as it shares much vocabulary with these languages. The common vocabulary is now thought to be the result of borrowing, rather than common genetic origin.</Paragraph> <Paragraph position="5"> In the KL tree, Armenian is placed outside of the Indo-Iranian languages, except for Gypsy. On the other hand, in this tree, Ossetic is placed as an outlier of the Indian group, while its traditional classification (and the Rao distance tree) puts it among the Iranian languages. Gypsy is an Indian language, related to Hindi. It has, however, been surrounded by European languages for some centuries. The effect of this influence is the likely cause of its classification as an outlier in the Indo-Iranian family. A similar situation exists for Slavic: one of the two lists that Dyen et al. offer for Slovenian is classed as an outlier in Slavic, rather than being classified with the Southern Slavic languages. The other Slovenian list is classified correctly with Serbocroatian. It is possible that the significant impact of Italian on Slovenian has made it an outlier. In Germanic, it is English that is the outlier. This may be due to the impact of the English creole, Takitaki, on the hierarchy. This language is closest to English, but is very distinct from the rest of the Germanic languages. Another misclassification is also the result of contact phenomena. According to the Ethnologue, Sardinian is Southern Romance, a separate branch from Italian and from Spanish. However, its constant contact with Italian has influenced the language such that it is classified here with Italian. We can offer no explanation for why Wakhi ends up as an outlier to all the groups.</Paragraph> <Paragraph position="6"> In conclusion, despite the noisy state of Dyen et al.'s data (for our purposes), PHILOLOGICON generates a taxonomy close to that constructed using the traditional methods of historical linguistics. Where it deviates, the deviation usually points to identifiable contact between languages.</Paragraph> </Section> </Section> </Paper>