<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1217">
  <Title>I</Title>
  <Section position="4" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Taxonomy
</SectionTitle>
    <Paragraph position="0"> Given the broad agreement on the taxonomic relationships among languages [for example, see the introductory textbooks by (Gleason, 1955; Crystal, 1987; Finegan and Besnier, 1987), or the more authoritative (Bright, 1992; Asher and Simpson, 1994; Warnow, 1997)], the classifications and relationships of figure 1 can be described as uncontroversial. For example, Dutch and German are rather self-evidently similar; they are also closely linked in terms of history, culture, and linguistic borrowing, and this similarity is one of the sources of evidence for such linkages. Meanwhile, there is little or no evidence that the Germans and the Maori were ever in significant day-to-day contact, a judgement borne out by the apparent dissimilarity of their languages. The most controversial point of the diagram, as a matter of fact, may be its tree-like structure, as will be discussed later.</Paragraph>
    <Paragraph position="1"> The usual method for generating such trees (or other representational structures) is to painstakingly compare representative samples of language, usually lists of lexical items, and identify similar or isomorphic changes among the lists (taking into account historical and archeological evidence as appropriate). Swadesh (1955), for example, has identified a hundred basic concepts that are, in theory, part of the basic vocabulary of a language and thus resistant to borrowing and replacement and subject only to the slow &quot;evolutionary&quot; pressures of linguistic change. By comparing how two languages express these concepts as lexical items and measuring the degree of change between the two, one can determine the amount by which the languages have &quot;drifted.&quot; Summarizing the results of these and similar studies, (Finegan and Besnier, 1987) identify no fewer than eleven subgroups within the Indo-European family. In addition to the well-known groups like Germanic, Italic, and &quot;Slavonic&quot; (described here), they list Albanian, Anatolian, Armenian, Baltic, Celtic, Greek, Indo-Iranian, and Tocharian. (Crystal, 1987) groups Baltic and Slavic but otherwise agrees with Finegan and Besnier, as does (Gleason, 1955). This shows both the power of this technique and the degree to which it requires subjective evaluation; the overall relationships are generally agreed upon, but &quot;the devil is in the details,&quot; and opinions about exactly which changes are similar remain to a certain extent educated guesses.</Paragraph>
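As a rough illustration of measuring "drift" between aligned concept lists, the following minimal Python sketch scores two lists by mean normalized edit distance. This is an assumption made for illustration only: the source does not specify a scoring formula, and traditional lexicostatistics relies on expert cognate judgements rather than string distance. The function names and the three-concept sample (water, hand, name) are hypothetical and far smaller than a real hundred-item list.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lexical_drift(list_a, list_b):
    """Mean normalized edit distance over aligned, non-empty word lists.

    Returns 0.0 for identical lists; larger values mean more dissimilar forms.
    """
    pairs = list(zip(list_a, list_b))
    total = sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)


# Illustrative sample only; concepts are water, hand, name (in that order).
DUTCH = ["water", "hand", "naam"]
GERMAN = ["wasser", "hand", "name"]
MAORI = ["wai", "ringa", "ingoa"]

if __name__ == "__main__":
    print("Dutch vs German:", lexical_drift(DUTCH, GERMAN))
    print("Dutch vs Maori: ", lexical_drift(DUTCH, MAORI))
```

On this tiny sample the Dutch and German lists score lower (closer) than the Dutch and Maori lists, in line with the relationships described above. A raw string distance cannot distinguish inherited cognates from borrowings or chance resemblance, which is one reason the judgements involved remain partly subjective.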
    <Paragraph position="2"> Other minor problems with this technique exist; for example, Swadesh's vocabulary list is completely insensitive to other aspects of language such as morphology and syntax. Because of its focus on specific, basic words, it can be trapped (or tricked) by lexical drift (for example, &quot;meat&quot; is no longer the English word for &quot;any foodstuff&quot;) or by lexical holes, where a clear cognate is not necessarily the most common or most frequent lexeme; (Forster et al., in press), for example, have found that some of their Alpine languages have no lexeme for &quot;to sit.&quot; Similar problems exist with regard to lexical borrowing: being resistant to borrowing is not the same as being proof against borrowing. Finally, this focus on very basic terms and on the evaluation of language as a whole may, to a certain extent, preclude the analysis of the paths of borrowing and of the degree to which linguistic change is confined to or driven by particular fields, social strata, and so forth. By confining oneself to pre-set lists of specific concepts, one runs the risk of picking the wrong concepts, especially for specific sub-fields (which can be as finely subdivided as one likes; is this paper an example of &quot;science,&quot; of &quot;computer science,&quot; of &quot;computational linguistics,&quot; or of &quot;information-theoretic approaches to corpus-based computational linguistics&quot;?). As a simple example, the phrase for &quot;TCP/IP protocol&quot; in most languages of the world is recognizably a borrowing from English, while much of the jargon in the martial arts community shows a strong Japanese influence, even when the martial art itself derives from other countries or cultures.</Paragraph>
    <Paragraph position="3"> This suggests that there is a place for other measures, both of language-in-use and of smaller samples, as a supplement to traditional typological and taxonomic measures. The claim made here is that cross-entropy (or Kullback-Leibler divergence) can be the basis for such a measurement.</Paragraph>
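To make the quantities named above concrete, here is a minimal Python sketch of cross-entropy and Kullback-Leibler divergence between two text samples. It rests on assumptions that are not taken from the source: the samples are modelled as character unigram distributions over a fixed alphabet, additive smoothing keeps every probability non-zero, and the function names are hypothetical. The estimator actually used in the paper is not described in this section. The two quantities are related by D(p || q) = H(p, q) - H(p), where H(p) is the entropy of p.

```python
import math
from collections import Counter

# Assumed alphabet for this sketch: lowercase ASCII letters plus space.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz ")


def char_distribution(text, alphabet=ALPHABET, alpha=1.0):
    """Additively smoothed character distribution over a fixed alphabet."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values()) + alpha * len(alphabet)
    return {ch: (counts[ch] + alpha) / total for ch in alphabet}


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits per symbol."""
    return -sum(p[x] * math.log2(q[x]) for x in p)


def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p, p); zero when p equals q."""
    return cross_entropy(p, q) - cross_entropy(p, p)


if __name__ == "__main__":
    # Tiny illustrative samples; real use would need far more text.
    dutch = char_distribution("het water is koud")
    german = char_distribution("das wasser ist kalt")
    print("D(dutch || german):", kl_divergence(dutch, german))
    print("D(german || dutch):", kl_divergence(german, dutch))
```

Note that the divergence is not symmetric, so the two directions printed above generally differ; any use of it as a distance between languages has to decide how to handle that asymmetry.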
  </Section>
</Paper>