<?xml version="1.0" standalone="yes"?>
<Paper uid="P84-1010">
  <Title>DENORMALIZATION AND CROSS REFERENCING IN THEORETICAL LEXICOGRAPHY</Title>
  <Section position="4" start_page="38" end_page="38" type="metho">
    <SectionTitle>
II RELATIONS
</SectionTitle>
    <Paragraph position="0"> Relations were proposed by Codd and elaborated on by Fagin, Ullman, and many others. They are unordered sets of tuples, each of which contains an ordered set of fields. Each field has a value taken from a domain -- semantically, from a particular kind of information. In lexicography the tuples correspond, not to entries in a dictionary, but to subentries, each with a particular sense. Each tuple contains fields for various aspects of the form, meaning, meaning-to-form mapping, and use of that sense.</Paragraph>
    <Paragraph position="1"> For the update and retrieval operations defined on relations to work right, the information stored in a relation is normalized. Each field is restricted to an atomic value~ it says only one thing, not a series of different things. No field appears more than once in a tuple. Beyond these formal constraints are conceptual constraints based on the fact that the information in some fields determines what can be in other fields; Ullman spells out the main kinds of such dependency.</Paragraph>
    <Paragraph position="2"> It is possible, as Shu and associates show, to normalize nearly any information structure by partitioning it into a set of normal form relations.</Paragraph>
    <Paragraph position="3"> It can be presented to the user, however, in a view that draws on all these relations but is not itself in normal form.</Paragraph>
    <Paragraph position="4"> Reconstituting a subentry from normal form tuples was beyond the capacity of the equipment that could be used in the field; it would have been cripplingly slow. Before sealed Winchester disks came out, floppies were unreliable in tropical humidity where the work was to be done, and only small digital tape cartridges were thoroughly reliable. So the organization had to be managed by sequential merges across a series of small (.25M) tapes without random access.</Paragraph>
    <Paragraph position="5"> The requirements of normal form came to be an issue in three areas. First, the prosaic matter of examples violates normal form. Nearly any field in a dictionary can take any number of illustrative examples.</Paragraph>
    <Paragraph position="6"> Second, the actants or arguments at the level of semantic representation that corresponds to the definition are in a theoretical status that is not yet clear. Mel'chnk (1981) simply numbers the actants in a way that allows them to map to grammatical relations in as general a way as possible. Others, ~'self included, find recurring components of definitions on the order of Fillmore's cases (1968) that are at least as consistently motivated as are the lexical functions, and that map as sets of actants to sets of grammatical relations. Rather than load the dice at this uncertain stage by designating either numbered or labeled actants as distinct field types, it furthers discussion to be able to have Actant as a single field type that is repeatable, and whose value in each instance is a link between an actant number, a prcposed case, and even possibly a conceptual dependency category for comparison (Schank and Abelson, 1977.11-17).</Paragraph>
    <Paragraph position="7"> Third, lexical correlates are inherently manyto-one. For example, Huichol ~u~i 'house' in its sense labeled 1.1 'where a person lives' has sever= taa. cuaa al antonyms: Ant (~u~i 1.1) + 'space in .. ~ o front of a house', ~ull.ru'aa 'space behlnd a the house', tel.cuarle 'space outside the fence', and J an adverbial use of taa.cuaa 'outdoors' (Grimes, 1981.88).</Paragraph>
    <Paragraph position="8"> One could normalize the cases of all three types. But both lexicographers and users expect the information to be in nonnormal form. Furthermore, we can make a realistic assumption that relational operations on a field are satisfied when there is one instance of that field that satisfies them.</Paragraph>
    <Paragraph position="9"> This is probably fatal for Joins like &amp;quot;get me the Huichol word for 'travel', then merge its definition with the definitions of all other words whose agent and patient are inherently coreferential and involve motion'. But that kind of capability is beyond a small implementation anyway; the lexicographer who makes that kind of pass needs a large scale, fully normalized system. The kinds of selections one usually does can be aimed at any instance of a field, and projections can produce all instances of a field, quite happily for most work, and at an order of magnitude lower cost.</Paragraph>
    <Paragraph position="10"> The important thing is to denormalize systematically so that normal form can be recovered when it is needed. Actants denormalize to fields repeated in a specified order. Examples denormalize to strings of examples appended to whatever field they illustrate. Lexical correlates denormalize to strings of values of particular functions, as in the antonym example Just given. The functions themselves are ordered by a conventional list that groups similar functions together (Grimes 1981.288291). null</Paragraph>
  </Section>
  <Section position="5" start_page="38" end_page="39" type="metho">
    <SectionTitle>
III CROSS REFERENCING
</SectionTitle>
    <Paragraph position="0"> To build a dictionary consistently along the lines chosen, a computational tool needs to incorporate cross referencing. This means that for each field that is built, dummy entries are created for all or most of the words in the field.</Paragraph>
    <Paragraph position="1"> For example, the definition for 'opossum', y~uxu, includes clauses like ca +u.~u+urime Ucu~'aa w 'eats things that are not green' and pUcu~i.m~es_~e 'its tail is bare'. From these notes are generated that guarantee that each word used in the definition will ultimately either get defined itself or will be tagged yuun~itG mep~im~ate 'everybody knows it' to identify it as a zero level form that is undefinable. Each note tells what subentry its own head word is taken out of, and what field; this information is merged into a repeatable Notes field in the new entry. Under the stem~ruuri B 'be  alive, grow' appears the note d (y~uxu) * i cayuu.yuu* J o rMne pUcua'aa 'eats thlngs that are not green'. This is a reminder to the lexicographer, first that there needs to be an entry for yuuri in sense B, and second that it needs to account at the very least for the way that stem is used in the definition (d) field of the entry for yeuxu.</Paragraph>
    <Paragraph position="2"> Cross referencing to guarantee full coverage of all words that are used in definitions backs up a theoretical claim about definitional closure: the state where no matter how many words are added to the dictionary, all the words used to define them are themselves already defined, back to a finite set of zero level defining vocabulary. There is no clai, r that such a set is the only one possible; only that at least one such set is l~Ossible. To reach closure even on a single set is such an ~--,ense task -- I spent eight months full time on Huichol lexicography and didn't get even a twentieth of the everyday vocabulary defined -- that it can be approached only by some such systematic means.</Paragraph>
    <Paragraph position="3"> There are sets of conformable definitions that share most parts of their definitions, yet are not synonyms. Related species and groups of als~mals and plants have conformable definitions that are largely identical, but have differentiating parts as well (Grimes 1980). The same is true of sets of verbs llke ca/tel 'be sitting somewhere', ve/'u 'he standing somewhere', ma/mane 'be spread out somewhere', and caa/hee 'be laid out straight somewhere' (the slash separategunitary and multiple reference stems), which all share as part of their * . * , J * . deflnltlons ee.p~reu.teevl X-s~e cayupatatU* xa~.s~e 'spend an extended time at X without changing to another location', but differ regarding the spatial orientation of what is at X. Cross referencing of words in definitions helps identify these cases.</Paragraph>
    <Paragraph position="4"> Values of lexical functions are not always completely specified by the lexical function and the head word, so they are always cross referenced to create the opportunity for saying more about them.</Paragraph>
    <Paragraph position="5"> Qu~i 1.1 'house' in the sense of 'habitation of humans'--~ersus 'stable' or 'lair' or 'hangar' 1.2 and 'ranch' 1.3) is pretty well defined by the function S_, substantive of the second actant, plus the head v~rb ca/tel 1.2 'live in a house' (versus 'be sitting somewhere', 1,1 and 'live in a locality' 1.3). Nevertheless it ha~ fifteen lexical functions of its own, includin@ the antonym set given earlier, and only one of those functions matches one of the nine that are associated with ca/tel 1.2: S. (ca/tei 1.2) = S 2 (~u~i 1.1) = ~u~ 'inhabitant, householder'.</Paragraph>
    <Paragraph position="6"> Stepping outside the theoretical constraints of lexicography proper, the same cross referencing mechanism helps set up bilingual dictionaries. Definitions are always in the language of the entries, but it is useful in many situations to gloss the definitions in some language of scientific discourse or trade, then cross reference on the glosses by adding a tag that puts the notes from them into a separate section. I have done this both for Spanish, the national language of the country where Huichol is spoken, and for Latin, the language of the Linnean names of life forms. What results is not really a bilingual dictionary, because it explains nothing at all about the second or third language -- no definitions, no mapping between grammatical relations and actants, no lexical functions for that language. It simply gives examples of counterparts of glosses. As such, however, it is no less useful than some bilingual dictionaries. To be consistent, the entries on the second language side would have to be as full as the first language entries, and some mechanism would have to be introduced for distinguishing translation equivalents rather than Just senses in each language. As it is, cross referencing the glosses gives what is properly called an indexed unilingual dictionary as a handy intermediate stage.</Paragraph>
  </Section>
  <Section position="6" start_page="39" end_page="40" type="metho">
    <SectionTitle>
IV IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> Because of the field situation far which the computational tool was required, it was implemented first in 1979 on an 8080 microcomputer with 32/( of memor~and two 130K sequentially accessible tape cartridges as an experimental package, later moved to an LSI-11/2 under RT-11 with .25M tapes. The language used was Simons's PTP (198h), designed for perspicuous handling of linguistic data. Data management was done record by record to maintain integrity, but the normal form constraints on atomicity and singularity of fields were dropped.</Paragraph>
    <Paragraph position="1"> Functions were implemented as subtypes of a single field type, ordered with reference to a special list.</Paragraph>
    <Paragraph position="2"> Because dictionary users expect ordered records, that constraint was added, with provision for mapping non-ASCII sort sequences to an ASCII sort key that controlled merging.</Paragraph>
    <Paragraph position="3"> Data entry and merging both put new instances of fields after existing instances of the same field, but this order of inclusion could be modified by the editor. Furthermore, multiple instances of a field could be collapsed into a single non-atomic value with separator symbols in it, or such a string value could be returned to multiple instances, both by the editor. Transformations between repeated fields, strings of atomic values, and various normal forms were worked out with Gary Simons but not implemented.</Paragraph>
    <Paragraph position="4"> Cross referencing was done in two ways: automatically for values of lexical functions, and by means of tags written in while editing for any field. Tags directed the processor to build a cross reference note for a full word, prefix, stem, or suffix, and to file it in the first, second, or third language part. In every case the lexicographer had opportunity to edit in order to remove irrelevant material and to associate the correct name form.</Paragraph>
    <Paragraph position="5"> Besides the major project in Huichol, the system was used by students for original lexicographic work in Dinka of the Sudan, Korean, and Isnag of the Philippines. If I were to rebuild the system now, I would probably use the University of California at Davis's CP/M version of Mumps on a portable Winchester machine in order to have total  random access in portable form. The strategy of data management, however, would remain the same, as it fits the application area well. I suspect, but have not proved, that full normalization capability provided by random access would still turn out unacceptably slow on a small machine.</Paragraph>
    <Paragraph position="6"> V DISCUSSION Investigation of a language centers around four collections of information that computationally are like data bases: field notes, text collection with glosses and translations, grammar, and dictionary. The first two fit the relational paradigm easily, and are especially useful when supplemented with functions that display glosses interlinearly. null The grammar and dictionary, however, require denormalization in order to handle multiple examples, and dictionaries require the other kinds of denormalization that are presented here. Ideally those examples come out of the field notes and texts, where they are discovered by an automatic parsing component of the grammar that is used by the selection algorithm, and they are attached to the appropriate spots in the grammar and dictionary by relational join operations. ~-</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML