<?xml version="1.0" standalone="yes"?> <Paper uid="J94-3004"> <Title>The Reconstruction Engine: A Computer Implementation of the Comparative Method</Title> <Section position="2" start_page="0" end_page="383" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The essential step in historical reconstruction is the arrangement of related words in different languages into sets of cognates and the specification of the regular phonological correspondences that support that arrangement; the well-known means for carrying out this arrangement and specification is the comparative method (see, for example, Meillet 1966; Hoenigswald 1950, 1960; Watkins 1989; Baldi 1990). Words that are not demonstrably related (via regular sound change) are explained by reference to other diachronic processes that are beyond the scope of the comparative method and of this paper. Sound change is first to be explained as a rule-governed process, and other explanations (which invoke more sporadic and less predictable processes) are offered when it is clear that nonphonological forces are at work, as illustrated in Figure 1. * Department of Linguistics, University of California, Berkeley, CA 94720. E-mail:</Paragraph> <Paragraph position="1"> The complete lexicon. Regular sound change (modeled by RE proper).</Paragraph> <Paragraph position="2"> Regular, &quot;expected&quot; reflexes of the ancestor forms.</Paragraph> <Paragraph position="3"> Domain of &quot;protovariation,&quot; perhaps due to morphological/derivational processes; handled by RE with &quot;fuzzy&quot; constituents.</Paragraph> <Paragraph position="4"> Sub-regularities elicited through relaxed constraints (word families, allofams,1 etc.). Sociolinguistic explanation. Domain of lexical diffusion and other sporadic processes. Borrowings, analogized forms, hypercorrections, prestige pronunciations, etc.
The &quot;mystery pile&quot;: counterexamples and other troublesome words.</Paragraph> <Paragraph position="5"> Figure 1. The &quot;sieve&quot; of explanation in historical linguistics.</Paragraph> <Paragraph position="6"> There will always be a number of lexical items for which no scientific explanation can be advanced: not all words are entitled to an etymology (Meillet 1966). This paper discusses problems and solutions associated with automating research into diachronic processes acting in (B) in Figure 1 above. Our solutions are implemented in a program we call the Reconstruction Engine, hereinafter RE (earlier versions are described in Lowe and Mazaudon [1989] and Mazaudon and Lowe [1991]).2 RE is a prototype computational tool that automates a crucial portion of the comparative method: the process of creating cognate sets and proposing reconstructions on the basis of observed correspondences between modern languages. 1 The term 'allofamy,' due to Matisoff (1978), refers to relationship 'among the various individual members of the same word-family.' English royal and regal, borrowed from French and Latin, respectively, are both ultimately traceable to the same PIE root *reg-, and so are co-allofams in Modern English (Matisoff 1978:16-18, Matisoff 1992:160). A word family might contain both native words and words borrowed from related languages; the borrowings may be recent or ancient. [Lowe and Mazaudon The Reconstruction Engine]</Paragraph> <Paragraph position="7"> RE treats those words of the lexicon that fall into pile C in Figure 1 above (and to a lesser extent those that fall into pile E). It must be emphasized that the relative sizes of the piles in Figure 1 are completely arbitrary. It would not be unusual for the list of problems (H) to be the largest. 
Especially in cases where languages are in close contact or are only distantly related, the regular component of the lexicon may be expected to be quite small. RE functions as a &quot;checker&quot; of hypotheses proposed by the linguist. It has no inferential component in the sense usually used in describing expert systems (Charniak &amp; McDonald 1985). Our aim is to verify the internal consistency of a set of phonological correspondences, created beforehand by the linguist, against the lexicons of an ensemble of putatively related languages, and to gauge the extent to which those data are consistent with the given phonological and phonotactic descriptions (i.e., correspondences and syllable canon).</Paragraph> <Paragraph position="8"> RE has several features that represent a significant advance in the automated handling of diachronic data. First, it provides exhaustive treatment of the data in several dimensions: * It processes complete lexicons of modern languages. Every modern form is evaluated by the program in a consistent and complete way.</Paragraph> <Paragraph position="9"> * Each form is completely analyzed. Modern forms that are only partially regular are not included in cognate sets.</Paragraph> <Paragraph position="10"> * The correspondences and syllable canon form a complete and unified statement of the diachronic phonology of the languages treated.</Paragraph> <Paragraph position="11"> Second, RE contains a number of features that make it flexible in handling the kinds of data realistically encountered in historical research.</Paragraph> <Paragraph position="12"> * Provisions exist for allowing several different transcriptions to be used in representing the data.</Paragraph> <Paragraph position="13"> * There are no requirements that the data be organized beforehand by gloss, semantic field, phonological shape, or other criteria.</Paragraph> <Paragraph position="14"> * The size and type of constituents used in the analysis are not limited by the program. 
There is no requirement, for example, that a segmental analysis be used (as opposed to the initial-plus-rhyme-plus-tone analysis commonly used for many Asian languages). However, the program does not provide for nonlinear representations or discontinuous constituents: the &quot;absolute slicing hypothesis&quot; is assumed. Also, the linearization of constituents must be the same for all the language data used by the program. For example, the tone numbers used in the languages cited in this paper, which might equally well be ordered before as after the segmental strings to which they apply, are uniformly written at the beginning.</Paragraph> <Paragraph position="15"> * Several competing analyses of the same data can be managed and compared simultaneously.</Paragraph> <Paragraph position="16"> [Computational Linguistics Volume 20, Number 3] The rest of the paper is structured as follows: Section 2 introduces some terminology, explains some particulars of the group of Tibeto-Burman languages used in examples, and describes RE in broad strokes to motivate and provide context for subsequent discussion. Section 3 reviews some of the past work in the area of computational historical linguistics, especially as it relates to the current effort. Section 4 details the algorithms and data structures used in RE. Section 5 discusses the results obtained using RE and comments on practical and methodological limitations to this approach. Section 6 discusses extensions to the &quot;core&quot; functions of RE: the handling of imprecise data, the treatment of variation due to diachronic and synchronic processes, the ad hoc semantic system for disambiguating homophones at both the modern and proto levels, and semi-automatic methods for generalizing over sets of phonological rules. 
Section 7, the conclusion, offers some caveats about computer applications in the area of historical linguistics and invites collaboration on more comprehensive software of this type.</Paragraph> </Section> </Paper>