DENORMALIZATION AND CROSS REFERENCING IN THEORETICAL LEXICOGRAPHY 
Joseph E. Grimes 
DMLL, Morrill Hall, Cornell University 
Ithaca NY 14853 USA
Summer Institute of Linguistics 
7500 West Camp Wisdom Road 
Dallas TX 75236 USA 
ABSTRACT 
A computational vehicle for lexicography was 
designed to keep to the constraints of meaning- 
text theory: sets of lexical correlates, limits on 
the form of definitions, and argument relations 
similar to lexical-functional grammar.
Relational data bases look like a natural frame- 
work for this. But linguists operate with a non- 
normalized view. Mappings between semantic actants 
and grammatical relations do not fit actant fields 
uniquely. Lexical correlates and examples are poly- 
valent, hence denormalized. 
Cross referencing routines help the lexicogra- 
pher work toward a closure state in which every 
term of a definition traces back to zero level 
terms defined extralinguistically or circularly. 
Dummy entries produced from defining terms ensure 
no trace is overlooked. Values of lexical corre- 
lates lead to other word senses. Cross references 
for glosses produce an indexed unilingual diction- 
ary, the start of a fully bilingual one. 
To assist field work a small structured editor 
for a systematically denormalized data base was 
implemented in PTP under RT-11; Mumps would now be 
easier to implement on small machines. It allowed 
fields to be repeated and nonatomic strings includ- 
ed, and produced cross reference entries. It 
served for a monograph on a language of Mexico
and for student projects from Africa and Asia.
I LEXICOGRAPHY 
Natural language dictionaries seem like obvious 
candidates for information management in data base 
form, at least until you try to do one. Then it ap- 
pears as if the better the dictionary in terms of 
lexicographic theory, the more awkward it is to 
fit relational constraints. Vest pocket tourist 
dictionaries are a snap; Webster's Collegiate and 
parser dictionaries require careful thought; the 
Mel'chuk style of explanatory-combinatory diction- 
ary forces us out of the strategies that work on 
ordinary data bases. 
In designing a tool to manage lexicographic 
field work under the constraints of Mel'chuk's 
meaning-text model, the most fully specified one 
available for detailed lexicography, I laid down 
specifications in four areas. First, it must han- 
dle all lexical correlates of the head word. Lex- 
ical correlates relate to the head in ways that 
have numerous parallels within the language. In 
English, for example, we have nouns that denote 
the doer of an action. Some, such as driver, writ- 
er, builder, are morphologically transparent. 
Others like pilot (from fly) and cook (from cook) 
are not; yet they relate to the corresponding verbs 
in the same way as the transparent ones do. Mel'- 
chuk and associates have identified about fifty 
such types, or lexical functions, of which S1, the
habitual first substantive just illustrated, is
one. 
These types appear to have analogous meanings in
different languages, though not all types are
necessarily used in every language, and the relative
popularity of each differs from one language to
another, as does the extent to which each is
grammaticalized. For example, English has a rich
vocabulary of values for a relation called Magn (from
Latin magnus) that denotes the superlative degree
of its argument: Magn (sit) = tight, Magn (black)
= jet, pitch, coal, Magn (left) = hard, Magn (play)
= for all you're worth, and on and on. On the other
hand Huichol, a Uto-Aztecan language of Mexico I
have been working on since 1952, has no such
vocabulary; it uses the simple intensives yeme and
va~c~a for all this (see note 2), and picks up its
lexical richness in other areas.
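A lexical function can be pictured as a named relation from a head word to a set of values. The following is a minimal sketch in Python that uses only the English examples above, with the habitual first substantive written S1. The table layout and the helper name values_of are assumptions made for illustration, not part of the tool described in this paper.

```python
# Illustrative sketch only: lexical functions as a mapping from
# (function name, head word) to a list of values. The layout and the
# helper name values_of are assumptions made for this example.
LEXICAL_FUNCTIONS = {
    ("S1", "drive"): ["driver"],
    ("S1", "write"): ["writer"],
    ("S1", "build"): ["builder"],
    ("S1", "fly"): ["pilot"],
    ("S1", "cook"): ["cook"],
    ("Magn", "sit"): ["tight"],
    ("Magn", "black"): ["jet", "pitch", "coal"],
    ("Magn", "left"): ["hard"],
}


def values_of(function, head):
    """Return every recorded value of a lexical function for a head word."""
    return list(LEXICAL_FUNCTIONS.get((function, head), []))
```

A function with several values, such as Magn (black), is already a repeated field in this picture, which is the property that later conflicts with normal form.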
Second, a theoretically sound definition uses 
words that are themselves defined through as long 
a chain as possible back to zero level words that 
can be defined only in one of two ways: by accept- 
ing that some definitions -- as few as possible -- 
may be circular, or by defining the zero level via 
extralinguistic experiences. Some dictionaries de- 
fine sweet circularly in terms of sugar and vice 
versa; but one could also begin by passing the sug- 
ar bowl and thus break the circularity. The tool 
must help trace the use of defining words. 
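Tracing defining words can be sketched as a walk over the words used in each definition. The Python fragment below is an illustration under stated assumptions: the toy data extend the sweet and sugar example with two invented defining words (taste, substance), and the function name trace is hypothetical.

```python
# Illustrative sketch: follow defining words back from a head word and
# report which ones end in a circular definition and which have no
# entry at all (candidates for extralinguistic, zero level definition).
# The toy DEFINITIONS data and the function name trace are assumptions.
DEFINITIONS = {
    "sweet": ["taste", "sugar"],
    "sugar": ["sweet", "substance"],
}


def trace(word, definitions):
    """Return (circular, undefined) sets of words reachable from word."""
    circular, undefined = set(), set()

    def visit(term, path):
        if term in path:              # term is defined in terms of itself
            circular.add(term)
            return
        if term not in definitions:   # no entry yet: zero level or a gap
            undefined.add(term)
            return
        for used in definitions[term]:
            visit(used, path + [term])

    visit(word, [])
    return circular, undefined
```

The walk revisits shared defining words, so it suits small examples only; a practical version would remember terms already resolved.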
Third, the arguments in the semantic
representation of a word have to relate explicitly to
grammatical elements like subjects and objects and
possessors: his projection of the budget and
please turn out the light each involve two
arguments to the main operative word (him and budget,
you and light), but the relationship is handled in
different grammatical frames.
1 NSF grant BNS-7906041 funded some of this work.
2 Huichol transcription follows Spanish except
high back unrounded, ' glottal stop, • high tone,
W long syllable, ~ rhythm break, ~ voiced
retroflex alveopalatal fricative, ~ retroflex flap, cuV
labiovelar stop.
Finally, the tool must run on the smallest, 
most portable machine available, if necessary trad- 
ing processing time for memory and external space. 
II RELATIONS 
Relations were proposed by Codd and elaborated 
on by Fagin, Ullman, and many others. They are un- 
ordered sets of tuples, each of which contains an 
ordered set of fields. Each field has a value tak- 
en from a domain -- semantically, from a particu- 
lar kind of information. In lexicography the tuples 
correspond, not to entries in a dictionary, but to 
subentries, each with a particular sense. Each 
tuple contains fields for various aspects of the 
form, meaning, meaning-to-form mapping, and use of 
that sense. 
For the update and retrieval operations defined 
on relations to work right, the information stored 
in a relation is normalized. Each field is
restricted to an atomic value: it says only one thing, not
a series of different things. No field appears more 
than once in a tuple. Beyond these formal con- 
straints are conceptual constraints based on the 
fact that the information in some fields determines 
what can be in other fields; Ullman spells out the 
main kinds of such dependency. 
It is possible, as Shu and associates show, to 
normalize nearly any information structure by par- 
titioning it into a set of normal form relations. 
It can be presented to the user, however, in a view 
that draws on all these relations but is not itself 
in normal form. 
Reconstituting a subentry from normal form 
tuples was beyond the capacity of the equipment 
that could be used in the field; it would have been 
cripplingly slow. Before sealed Winchester disks 
came out, floppies were unreliable in tropical hu- 
midity where the work was to be done, and only 
small digital tape cartridges were thoroughly reli- 
able. So the organization had to be managed by se- 
quential merges across a series of small (.25M) 
tapes without random access. 
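A sequential merge of this kind needs only a forward pass over inputs that are already sorted. The sketch below shows the idea in Python with the standard heapq.merge function; it is a modern analogy, not the original tape routines, and the record layout of (sort key, text) pairs is an assumption.

```python
import heapq


def merge_sorted_batches(*batches):
    """Merge already-sorted iterables of (sort_key, record) pairs in one
    forward pass, never seeking backward in any input."""
    return heapq.merge(*batches, key=lambda pair: pair[0])
```

Because heapq.merge is lazy, each input is read one record at a time, which matches a medium without random access.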
The requirements of normal form came to be an 
issue in three areas. First, the prosaic matter of 
examples violates normal form. Nearly any field in 
a dictionary can take any number of illustrative 
examples. 
Second, the actants or arguments at the level of
semantic representation that corresponds to the
definition have a theoretical status that is not
yet clear. Mel'chuk (1981) simply numbers the
actants in a way that allows them to map to
grammatical relations in as general a way as possible.
Others, myself included, find recurring components
of definitions on the order of Fillmore's cases
(1968) that are at least as consistently motivated
as are the lexical functions, and that map as sets
of actants to sets of grammatical relations. Rather
than load the dice at this uncertain stage by
designating either numbered or labeled actants as
distinct field types, it furthers discussion to be
able to have Actant as a single field type that is
repeatable, and whose value in each instance is a
link between an actant number, a proposed case, and
even possibly a conceptual dependency category for
comparison (Schank and Abelson, 1977.11-17).
Third, lexical correlates are inherently
many-to-one. For example, Huichol ~u~i 'house' in its
sense labeled 1.1 'where a person lives' has several
antonyms: Ant (~u~i 1.1) = taa.cuaa 'space in
front of a house', ~ull.ru'aa 'space behind the
house', tel.cuarle 'space outside the fence', and
an adverbial use of taa.cuaa 'outdoors' (Grimes,
1981.88).
One could normalize the cases of all three 
types. But both lexicographers and users expect the 
information to be in nonnormal form. Furthermore, 
we can make a realistic assumption that relational 
operations on a field are satisfied when there is 
one instance of that field that satisfies them. 
This is probably fatal for joins like "get me the
Huichol word for 'travel', then merge its
definition with the definitions of all other words whose
agent and patient are inherently coreferential and
involve motion". But that kind of capability is
beyond a small implementation anyway; the
lexicographer who makes that kind of pass needs a large
scale, fully normalized system. The kinds of
selections one usually does can be aimed at any instance
of a field, and projections can produce all
instances of a field, quite happily for most work,
and at an order of magnitude lower cost.
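The assumption that a selection succeeds when any one instance of a field satisfies it, while a projection returns every instance, can be sketched as follows. This is a Python illustration only: representing a subentry as a list of (field, value) pairs, and the helper names select and project, are assumptions.

```python
# Illustrative sketch: a subentry is a list of (field, value) pairs so
# that a field may repeat. select and project are hypothetical helpers.
SUBENTRIES = [
    [("head", "sit"), ("Magn", "tight")],
    [("head", "black"), ("Magn", "jet"), ("Magn", "pitch"), ("Magn", "coal")],
]


def select(subentries, field, predicate):
    """Keep subentries in which at least one instance of field
    satisfies predicate."""
    return [s for s in subentries
            if any(f == field and predicate(v) for f, v in s)]


def project(subentry, field):
    """Return every instance of field in a subentry, in stored order."""
    return [v for f, v in subentry if f == field]
```

For instance, selecting on Magn equal to pitch finds the subentry for black, and projecting Magn from it yields all three values.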
The important thing is to denormalize systemat- 
ically so that normal form can be recovered when 
it is needed. Actants denormalize to fields repeat- 
ed in a specified order. Examples denormalize to 
strings of examples appended to whatever field 
they illustrate. Lexical correlates denormalize to 
strings of values of particular functions, as in 
the antonym example just given. The functions
themselves are ordered by a conventional list that
groups similar functions together (Grimes 1981.288- 
291). 
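Systematic denormalization of actants, with normal form still recoverable, might look like the Python sketch below. The type name Actant follows the field name used above, but the record layout, the head word, and the case labels Agent and Patient are illustrative assumptions.

```python
from typing import NamedTuple, Optional


class Actant(NamedTuple):
    number: int                         # numbered actant
    case: Optional[str] = None          # proposed case label, if any
    cd_category: Optional[str] = None   # conceptual dependency category, if any


# One subentry with Actant as a single repeatable field, kept in order.
# The head word and case labels are assumptions made for illustration.
subentry = {
    "head": "project",
    "sense": "1",
    "actants": [Actant(1, "Agent"), Actant(2, "Patient")],
}


def actant_rows(entry):
    """Recover normal form: one (head, sense, number, case) row per actant."""
    return [(entry["head"], entry["sense"], a.number, a.case)
            for a in sorted(entry["actants"], key=lambda a: a.number)]
```

Each row produced by actant_rows is atomic and nonrepeating, so the repeated field loses no information that a normalized relation would hold.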
III CROSS REFERENCING 
To build a dictionary consistently along the 
lines chosen, a computational tool needs to incor- 
porate cross referencing. This means that for each 
field that is built, dummy entries are created for 
all or most of the words in the field. 
For example, the definition for 'opossum',
yeuxu, includes clauses like cayuu.yuurime pUcua'aa
'eats things that are not green' and pUcu~i.m~e-s_~e
'its tail is bare'. From these clauses, notes are
generated that guarantee that each word used in the
definition will ultimately either get defined itself
or be tagged yuun~itG mep~im~ate 'everybody
knows it' to identify it as a zero level form that
is undefinable. Each note tells what subentry its
own head word is taken out of, and what field;
this information is merged into a repeatable Notes
field in the new entry. Under the stem yuuri B 'be
alive, grow' appears the note d (yeuxu) cayuu.yuurime
pUcua'aa 'eats things that are not green'.
This is a reminder to the lexicographer, first that
there needs to be an entry for yuuri in sense B,
and second that it needs to account at the very
least for the way that stem is used in the
definition (d) field of the entry for yeuxu.
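The generation of notes and dummy entries can be sketched in Python as below. This is an illustration under assumptions: the entry layout, the "notes" key, and the function name are hypothetical, and English glosses stand in for the Huichol forms.

```python
# Illustrative sketch of cross referencing. Entry layout, field codes,
# and the function name are assumptions made for this example.
def add_cross_references(dictionary, head, field, text, words):
    """File a note under every word used in a field of head's entry,
    creating a dummy entry when the word has no entry yet."""
    for word in words:
        entry = dictionary.setdefault(word, {})         # dummy entry if new
        entry.setdefault("notes", []).append((field, head, text))
    return dictionary
```

Each note records the field and the head word it came from, so the lexicographer can see which use of the word the new entry must account for.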
Cross referencing to guarantee full coverage of 
all words that are used in definitions backs up a 
theoretical claim about definitional closure: the 
state where no matter how many words are added to 
the dictionary, all the words used to define them 
are themselves already defined, back to a finite 
set of zero level defining vocabulary. There is no 
claim that such a set is the only one possible;
only that at least one such set is possible. To reach
closure even on a single set is such an immense
task -- I spent eight months full time on Huichol 
lexicography and didn't get even a twentieth of the 
everyday vocabulary defined -- that it can be ap- 
proached only by some such systematic means. 
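The closure condition itself reduces to a set difference. The following Python sketch assumes that each definition is available as a list of its defining words and that zero level words are kept in a separate set; both assumptions, and the function name, are mine.

```python
def unresolved_terms(definitions, zero_level):
    """Return the words used in some definition that are neither defined
    themselves nor marked zero level. Closure holds when this is empty."""
    used = {word for words in definitions.values() for word in words}
    return used - set(definitions) - set(zero_level)
```

A dictionary in progress would report a shrinking set of unresolved terms as dummy entries are filled in.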
There are sets of conformable definitions that 
share most parts of their definitions, yet are not 
synonyms. Related species and groups of animals and
plants have conformable definitions that are large- 
ly identical, but have differentiating parts as 
well (Grimes 1980). The same is true of sets of 
verbs like ca/tei 'be sitting somewhere', ve/'u 'be
standing somewhere', ma/mane 'be spread out
somewhere', and caa/hee 'be laid out straight
somewhere' (the slash separates unitary and multiple
reference stems), which all share as part of their
definitions ee.p~reu.teevl X-s~e cayupatatU xa~.-
s~e 'spend an extended time at X without changing
to another location', but differ regarding the 
spatial orientation of what is at X. Cross refer- 
encing of words in definitions helps identify 
these cases. 
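One way to surface conformable definitions is to compare the clauses of each pair of definitions. The Python sketch below is an assumption-laden illustration: definitions are reduced to sets of clause glosses, the orientation labels are invented, and the function name is hypothetical.

```python
from itertools import combinations

# Toy data: each definition as a set of clause glosses (an assumption).
DEFINITIONS = {
    "be sitting": {"spend an extended time at X", "sitting"},
    "be standing": {"spend an extended time at X", "standing"},
}


def conformable_pairs(definitions, min_shared=1):
    """Yield (head_a, head_b, shared, differing) for entries whose
    definitions share at least min_shared clauses but are not identical."""
    for a, b in combinations(sorted(definitions), 2):
        shared = definitions[a] & definitions[b]
        differing = definitions[a] ^ definitions[b]
        if len(shared) >= min_shared and differing:
            yield a, b, shared, differing
```

The differing clauses are the ones the lexicographer would check for a systematic contrast such as spatial orientation.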
Values of lexical functions are not always com- 
pletely specified by the lexical function and the 
head word, so they are always cross referenced to 
create the opportunity for saying more about them. 
Qu~i 1.1 'house' in the sense of 'habitation of
humans' (versus 'stable' or 'lair' or 'hangar' 1.2
and 'ranch' 1.3) is pretty well defined by the
function S2, substantive of the second actant, plus
the head verb ca/tei 1.2 'live in a house' (versus
'be sitting somewhere' 1.1 and 'live in a locality'
1.3). Nevertheless it has fifteen lexical functions
of its own, including the antonym set given
earlier, and only one of those functions matches one
of the nine that are associated with ca/tei 1.2:
S1 (ca/tei 1.2) = S2 (~u~i 1.1) = ~u~ 'inhabitant,
householder'.
Stepping outside the theoretical constraints of 
lexicography proper, the same cross referencing 
mechanism helps set up bilingual dictionaries. Def- 
initions are always in the language of the entries, 
but it is useful in many situations to gloss the 
definitions in some language of scientific dis- 
course or trade, then cross reference on the glos- 
ses by adding a tag that puts the notes from them 
into a separate section. I have done this both for 
Spanish, the national language of the country where 
Huichol is spoken, and for Latin, the language of 
the Linnean names of life forms. What results is 
not really a bilingual dictionary, because it ex- 
plains nothing at all about the second or third 
language -- no definitions, no mapping between 
grammatical relations and actants, no lexical func- 
tions for that language. It simply gives examples 
of counterparts of glosses. As such, however, it is 
no less useful than some bilingual dictionaries. To 
be consistent, the entries on the second language 
side would have to be as full as the first language 
entries, and some mechanism would have to be intro- 
duced for distinguishing translation equivalents 
rather than just senses in each language. As it is,
cross referencing the glosses gives what is prop- 
erly called an indexed unilingual dictionary as a 
handy intermediate stage. 
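Cross referencing on glosses amounts to building an inverted index from each gloss to the head words that carry it. A minimal Python sketch follows; the data layout is an assumption, and English glosses taken from the examples in this paper stand in for the Spanish and Latin glosses.

```python
from collections import defaultdict


def gloss_index(entries):
    """Build a mapping from each gloss to the sorted list of head words
    that carry it, with the glosses themselves in sorted order."""
    index = defaultdict(set)
    for head, glosses in entries.items():
        for gloss in glosses:
            index[gloss].add(head)
    return {gloss: sorted(heads) for gloss, heads in sorted(index.items())}
```

The result lists counterparts only; it says nothing about the grammar or lexical functions of the glossing language, which is why it is an index and not a bilingual dictionary.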
IV IMPLEMENTATION 
Because of the field situation for which the
computational tool was required, it was
implemented first in 1979 on an 8080 microcomputer with 32K
of memory and two 130K sequentially accessible tape
cartridges as an experimental package, and later moved
to an LSI-11/2 under RT-11 with .25M tapes. The
language used was Simons's PTP (1984), designed
for perspicuous handling of linguistic data. Data 
management was done record by record to maintain 
integrity, but the normal form constraints on at- 
omicity and singularity of fields were dropped. 
Functions were implemented as subtypes of a single 
field type, ordered with reference to a special 
list. 
Because dictionary users expect ordered records, 
that constraint was added, with provision for map- 
ping non-ASCII sort sequences to an ASCII sort key 
that controlled merging. 
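Mapping a non-ASCII sort sequence to an ASCII key can be sketched as a translation of each collating unit into a letter whose ASCII order matches the desired order. In the Python fragment below the list of units is hypothetical and is not the actual Huichol sequence; it treats the digraph cu as a single unit that sorts after c.

```python
# Hypothetical collating sequence, for illustration only.
UNITS = ["'", "a", "c", "cu", "e", "i", "m", "u", "y"]


def sort_key(word, units=UNITS):
    """Translate word into an ASCII key whose order follows units."""
    by_length = sorted(units, key=len, reverse=True)   # longest match first
    key, i = [], 0
    while i < len(word):
        for unit in by_length:
            if word.startswith(unit, i):
                key.append(chr(ord("A") + units.index(unit)))
                i += len(unit)
                break
        else:
            raise ValueError(f"no collation unit for {word[i]!r}")
    return "".join(key)
```

With this sequence cya sorts before cua, the reverse of plain ASCII order, because cu counts as one unit that follows c.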
Data entry and merging both put new instances 
of fields after existing instances of the same 
field, but this order of inclusion could be modi- 
fied by the editor. Furthermore, multiple instances 
of a field could be collapsed into a single non- 
atomic value with separator symbols in it, or such 
a string value could be returned to multiple in- 
stances, both by the editor. Transformations be- 
tween repeated fields, strings of atomic values, 
and various normal forms were worked out with Gary 
Simons but not implemented. 
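Collapsing repeated instances of a field into one nonatomic value, and returning such a value to repeated instances, can be sketched as a pair of inverse operations. The Python below assumes a subentry is a list of (field, value) pairs and that "; " is the separator symbol; both are assumptions, as are the function names.

```python
SEP = "; "   # assumed separator symbol


def collapse(subentry, field, sep=SEP):
    """Replace all instances of field with one nonatomic value placed
    at the position of the first instance."""
    values = [v for f, v in subentry if f == field]
    out, done = [], False
    for f, v in subentry:
        if f != field:
            out.append((f, v))
        elif not done:
            out.append((field, sep.join(values)))
            done = True
    return out


def expand(subentry, field, sep=SEP):
    """Split each nonatomic value of field back into repeated instances."""
    out = []
    for f, v in subentry:
        if f == field:
            out.extend((f, part) for part in v.split(sep))
        else:
            out.append((f, v))
    return out
```

The round trip is lossless only if the separator never occurs inside a value, which is one reason for choosing separator symbols by convention.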
Cross referencing was done in two ways: automat- 
ically for values of lexical functions, and by 
means of tags written in while editing for any 
field. Tags directed the processor to build a cross 
reference note for a full word, prefix, stem, or 
suffix, and to file it in the first, second, or 
third language part. In every case the lexicogra- 
pher had opportunity to edit in order to remove ir- 
relevant material and to associate the correct name 
form. 
Besides the major project in Huichol, the system 
was used by students for original lexicographic 
work in Dinka of the Sudan, Korean, and Isnag of 
the Philippines. If I were to rebuild the system 
now, I would probably use the University of Cali- 
fornia at Davis's CP/M version of Mumps on a port- 
able Winchester machine in order to have total 
random access in portable form. The strategy of da- 
ta management, however, would remain the same, as 
it fits the application area well. I suspect, but 
have not proved, that full normalization capability 
provided by random access would still turn out un- 
acceptably slow on a small machine. 
V DISCUSSION 
Investigation of a language centers around four 
collections of information that computationally 
are like data bases: field notes, text collection 
with glosses and translations, grammar, and dic- 
tionary. The first two fit the relational para- 
digm easily, and are especially useful when sup- 
plemented with functions that display glosses in- 
terlinearly. 
The grammar and dictionary, however, require de- 
normalization in order to handle multiple examples, 
and dictionaries require the other kinds of denorm- 
alization that are presented here. Ideally those 
examples come out of the field notes and texts, 
where they are discovered by an automatic parsing 
component of the grammar that is used by the selec- 
tion algorithm, and they are attached to the ap- 
propriate spots in the grammar and dictionary by 
relational join operations.
VI REFERENCES 
Codd, E. F. 1970. A relational model for large 
shared data banks. Communications of the ACM 
13:6.377-387. 
Fagin, R. 1979. A normal form for relational
databases that is based on domains and keys. IBM
Research Report RJ 2520. 
Fillmore, Charles J. 1968. The case for case. In 
Emmon Bach and Robert T. Harms, eds.,
Universals in linguistic theory, New York: Holt,
Rinehart and Winston, 1-88.
Grimes, Joseph E. 1980. Huichol life form
classification I: Animals. Anthropological
Linguistics 22:5.187-200. II: Plants. Anthropological
Linguistics 22:6.264-274.
-----. 1981. El huichol: apuntes sobre el léxico
[Huichol: notes on the lexicon], with P. de la
Cruz, J. Carrillo, F. Díaz, R. Díaz, and A. de
la Rosa. ERIC document ED 210 901, microfiche.
Kaplan, Ronald M. and Joan Bresnan. 1982. Lexical- 
functional grammar: a formal system for gram- 
matical representation. In Joan Bresnan, ed. 
The mental representation of grammatical rela- 
tions, Cambridge: The MIT Press, 173-281. 
Mel'chuk, Igor A. 1981. Meaning-text models: a 
recent trend in Soviet linguistics. Annual Re- 
view of Anthropology 10:27-62. 
-----, A. K. Zholkovsky, and Ju. D. Apresyan. in
press. Tolkovo-kombinatornyj slovar' russkogo
jazyka (with English introduction). Vienna:
Wiener Slawistischer Almanach.
Schank, Roger C. and Robert P. Abelson. 1977. 
Scripts, plans, goals and understanding: an
inquiry into human knowledge structures. Hillsdale
NJ: Lawrence Erlbaum Associates. 
Simons, Gary F. 1984. Powerful ideas for text
processing. Dallas: Summer Institute of
Linguistics.
Ullman, Jeffrey D. 1980. Principles of database 
systems. Rockville MD: Computer Science Press. 
Wong, H. K. T. and N. C. Shu. 1980. An approach to 
relational data base scheme design. IBM Computer 
Science Research Report RJ 2688. 
