<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1035"> <Title>Measuring Language Divergence by Intra-Lexical Comparison</Title> <Section position="3" start_page="0" end_page="273" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Recently, there has been burgeoning interest in the computational construction of genetic language taxonomies (Dyen et al., 1992; Nerbonne and Heeringa, 1997; Kondrak, 2002; Ringe et al., 2002; Benedetto et al., 2002; McMahon and McMahon, 2003; Gray and Atkinson, 2003; Nakleh et al., 2005).</Paragraph> <Paragraph position="1"> One common approach to building language taxonomies is to ascribe language-language distances, and then use a generic algorithm to construct a tree which explains these distances as closely as possible. Two questions arise with this approach. The first asks which aspects of languages are important in measuring inter-language distance. The second asks how to measure distance given these aspects.</Paragraph> <Paragraph position="2"> A more traditional approach to building language taxonomies (Dyen et al., 1992) answers these questions in terms of cognates. A word in language A is said to be cognate with a word in language B if the forms shared a common ancestor in the parent language of A and B. In the cognate-counting method, inter-language distance depends on the lexical forms of the languages. The distance between two languages is a function of the number or fraction of these forms which are cognate between the two languages.1 
This approach to building language taxonomies is hard to implement in toto because constructing ancestor forms is not easily automatable.</Paragraph> <Paragraph position="3"> More recent approaches, such as Kondrak's (2002) and Heggarty et al.'s (2005) work on dialect comparison, take the synchronic word forms themselves as the language aspect to be compared.</Paragraph> <Paragraph position="4"> Variations on edit distance (see Kessler (2005) for a survey) are then used to evaluate differences between languages for each word, and these differences are aggregated to give a distance between languages or dialects as a whole. This approach is largely automatable, although some methods still require human intervention.</Paragraph> <Paragraph position="5"> In this paper, we present novel answers to both questions. The features of language we compare are not sets of words or phonological forms. Instead, we compare the similarities between forms, expressed as confusion probabilities. The distribution of confusion probabilities in one language is called a lexical metric. Section 2 presents the definition of lexical metrics and some arguments for their being good language representatives for the purposes of comparison.</Paragraph> <Paragraph position="6"> The distance between two languages is the divergence between their lexical metrics. In section 3, we detail two methods for measuring this divergence: Kullback-Leibler (hereafter KL) divergence and Rao distance. The subsequent section (4) describes the application of our approach to automatically constructing a taxonomy of Indo-European languages from the Dyen et al. (1992) data.</Paragraph> <Paragraph position="7"> Section 5 suggests how lexical metrics can help identify cognates. 
The final section (6) presents our conclusions, and discusses possible future directions for this work.</Paragraph> <Paragraph position="8"> Versions of the software and data files described in the paper will be made available to coincide with its publication.</Paragraph> </Section> </Paper>