<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1035">
<Title>Measuring Language Divergence by Intra-Lexical Comparison</Title>
<Section position="7" start_page="278" end_page="278" type="evalu">
<SectionTitle>5 Reconstruction and Cognacy</SectionTitle>
<Paragraph position="0">Subsection 3.1 described the construction of geometric paths from one lexical metric to another.</Paragraph>
<Paragraph position="1">This section describes how the synthetic lexical metric at the midpoint of the path can indicate which words are cognate between the two languages. The synthetic lexical metric (equation 15) applies the geometric-path formula (equation 6) to the lexical metrics (equation 5) of the two languages being compared, at the midpoint a = 0.5.</Paragraph>
<Paragraph position="2">R(m_1, m_2) = \frac{\sqrt{P(m_1, m_2)\, Q(m_1, m_2)}}{\sum_{m_1', m_2'} \sqrt{P(m_1', m_2')\, Q(m_1', m_2')}} \qquad (15)</Paragraph>
<Paragraph position="3">If the words for m1 and m2 in both languages have common origins in a parent language, then it is reasonable to expect that their confusion probabilities in the two languages will be similar. Of course different cognate pairs m1, m2 will have differing values for R, but because the confusion probabilities in P and Q will be similar, they reinforce the variance.</Paragraph>
<Paragraph position="4">If either m1 or m2, or both, is non-cognate, that is, has been replaced by another arbitrary form at some point in the history of either language, then the P and Q values for this pair will vary independently. Consequently, the geometric mean of these values is likely to lie closer to the average than in the purely cognate case: for example, matched probabilities of 0.3 in P and Q give a geometric mean of 0.3, whereas independent values of 0.3 and 0.01 give √(0.3 × 0.01) ≈ 0.055, much nearer to typical magnitudes.</Paragraph>
<Paragraph position="5">Thus rows in the lexical metric with wider dynamic ranges are likely to correspond to cognate words, and rows corresponding to non-cognates are likely to have smaller dynamic ranges. The dynamic range can be measured by taking the Shannon information of the probabilities in the row.</Paragraph>
<Paragraph position="6">Table 2 shows the lowest- and highest-information rows for English and Swedish (Dyen et al.'s (1992) data). At the extremes of low and high information, the words are invariably cognate and non-cognate, respectively. Between these extremes, the division is not so clear-cut, due to chance effects in the data.</Paragraph>
</Section>
</Paper>
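
The midpoint metric of equation 15 is straightforward to compute. Below is a minimal sketch in Python with NumPy, assuming the two lexical metrics are held as same-shaped arrays of confusion probabilities; the function name midpoint_metric and the matrix representation are illustrative, not from the paper.

import numpy as np

def midpoint_metric(P, Q):
    # P, Q: confusion-probability matrices over the same meaning pairs,
    # one per language, each summing to 1 (the lexical metrics of eq. 5).
    # The entrywise geometric mean is the geometric path (eq. 6) at a = 0.5.
    R = np.sqrt(P * Q)
    # Renormalise so the synthetic metric (eq. 15) is again a distribution.
    return R / R.sum()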
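The row-wise Shannon information used to separate cognates from non-cognates can be sketched in the same style. Renormalising each row to a distribution before taking its entropy is one reading of "the Shannon information of the probabilities in the row", and is an assumption here; under it, a peaked row (wide dynamic range) scores low and a flat row scores high.

def row_information(R):
    # Renormalise each row of the synthetic metric to a distribution.
    rows = R / R.sum(axis=1, keepdims=True)
    # Shannon entropy per row, in bits; 0 * log 0 is taken as 0.
    logs = np.zeros_like(rows)
    np.log2(rows, out=logs, where=rows > 0)
    return -(rows * logs).sum(axis=1)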
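A toy run, with synthetic numbers chosen only to illustrate the contrast (none of these values come from the paper): the two metrics agree on meaning 0, as cognates would, and disagree elsewhere, so row 0 should come out with the lowest information, i.e. as the best cognate candidate.

P = np.array([[0.70, 0.05, 0.05],
              [0.05, 0.05, 0.02],
              [0.02, 0.03, 0.03]])
Q = np.array([[0.72, 0.04, 0.04],
              [0.02, 0.03, 0.05],
              [0.05, 0.02, 0.03]])
R = midpoint_metric(P, Q)
info = row_information(R)
print(info)              # row 0 has markedly lower entropy than rows 1 and 2
print(np.argsort(info))  # meanings ranked, best cognate candidates first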