American Journal of Computational Linguistics
Nicholas V. Findler and Heino Viil
Department of Computer Science
State University of New York
Buffalo
Microfiche 4
© 1974 by the Association for Computational Linguistics
We describe a branch of dictionary science, and recommend the
term lexicometry for it, that deals with the mathematical and
statistical aspects of dictionaries. It is related to both
lexicography and lexicology, the former denoting the description
of lexical material and the latter its analysis and study.
Many problems in computational linguistics require the use of
a stored dictionary easily accessible to a computer program. In
the course of an investigation, such a dictionary may have to be
expanded, reduced, rearranged, or modified in various ways. Also,
several nonlinguistic disciplines using the computer, such as
psychology, biology, medicine, and sociology, often need a large
data base in the form of a dictionary. The relevant structural
properties of a dictionary, however, have not yet been
sufficiently and systematically investigated. Research in this
area is needed in order to optimize the construction of stored
dictionaries and to manipulate them in efficient ways.
1 A considerably extended version of this paper was submitted
to the State University of New York at Buffalo in partial
satisfaction of the requirements for the degree of Master of
Science of Heino Viil. The project represents the continuation
of an earlier work by Nicholas V. Findler. Many ideas and all
the programming effort are due to Heino Viil. The write-up is a
joint effort.
2 The work reported here was supported by National Science
Foundation Grant GJ-658.
First, we review critically the problems of meaning and its
representation, the questions relating to lexical definitions, to
polysemy, homonymy, semantic depletion, synonymy, and
lexicography and lexicology in general. We also discuss the
concept of lexical valence and elaborate a novel idea, coverage,
which is of both theoretical and practical importance. In this
context, relationships are established among three variables:
the size of the covered set, the size of the covering set, and
the maximum definition length. Both the size of the covering
set and the maximum definition length should be small for
economic reasons, but decreasing one will increase the other.
It is therefore important to establish these relationships
empirically. The knowledge so gained will constitute a basis for
optimizing the structure of a dictionary for a specified size of
the covered set and a specified machine.
The present pilot project in this virgin field has the
objective of verifying some conjectures. It establishes some
principles of constructing, formatting, and storing a large data
base in dictionary form. It develops programs for displaying,
handling, and modifying such a data base. The paper offers an
example of how a conceptually continuous operation on large
amounts of data can be reduced to operating on a fraction of the
whole data base at a time, in successive small increments of
time. We finally demonstrate the feasibility of solving
lexicometric problems on the computer and, at the same time, show
the cost involved in doing such work in terms of both human
effort and machine time.
We describe the program that accomplishes the above tasks,
and the results that were obtained in using an existing
dictionary of computer terminology of more than 1,800 entries.
The effort required was considerable: six man-months of work and
about 14 hours of CDC 6400 computer time. Programming was done
in SLIP/AMPPL-II, a list processing and associative memory plus
parallel processing language package embedded in FORTRAN IV.
TABLE OF CONTENTS

Some problems of lexical relatedness . . . . . . . . . . . 11
  1. Polysemy and homonymy . . . . . . . . . . . . . . . . 11
  2. Synonymy . . . . . . . . . . . . . . . . . . . . . . 13
  3. Definitions . . . . . . . . . . . . . . . . . . . . . 13
Aspects of the science of dictionary . . . . . . . . . . . 15
  1. General concepts . . . . . . . . . . . . . . . . . . 15
  2. The problem of coverage . . . . . . . . . . . . . . . 20
On lexicometric relationships among the size of defining
  set, the size of the defined set and the maximum length
  of definitions . . . . . . . . . . . . . . . . . . . . . 26
  1. Some measures of coverage . . . . . . . . . . . . . . 26
  2. Construction of the data base . . . . . . . . . . . . 29
  3. The results of the computations . . . . . . . . . . . 42
Acknowledgement . . . . . . . . . . . . . . . . . . . . . 51
References . . . . . . . . . . . . . . . . . . . . . . . . 51
Appendix I
  Program Development . . . . . . . . . . . . . . . . . . 54
Appendix II
  Some ideas for the program to investigate the relationship
  covering set size versus maximum definition length . . . 67
INTRODUCTION 
Since the early days of electronic computing, two kinds of
associations have existed between computers and dictionaries:
either the computer uses, for various purposes, a stored
dictionary of some sort (lexicon, vocabulary, glossary,
thesaurus), or the computer is employed for constructing and
analyzing a dictionary. The latter activity was given a strong
impetus in the late 1950's by the formation of the Centre
d'Etudes du Vocabulaire Francais and its publication, the
Cahiers de Lexicologie. Thus lexicography was among the first
non-mathematical disciplines to make use of the symbol
manipulating capability of computers.
While formal theories of syntax have been successful in
describing the rules of grammatical acceptability of natural
language utterances, the study of meaning, usually called
semantics, has not yet produced a theory of the semantic
structure of languages based on observation and analysis. It is
beyond the scope of this paper to discuss, even superficially,
the various viewpoints concerned with the concept of meaning.
One of us, Viil (1974), has, however, compiled a reasonably
exhaustive critical survey of the relevant literature.
For the purposes of this work, it suffices to present the
following categories of meaning, as set out by Longyear (1971):
1. Logical meaning applies to such attempts to deal with
meaning as symbolic logic and mathematics. The meanings with
which the signals of such systems correlate are unique
outside-world referents or unique meanings within the logical
system that eventually have outside-world referents.
2. General-semantic meanings are also unique in their
reference to the outside world, but the semanticists are less
stringent in scope than the logicians. Nevertheless, their
scope is an idealized language, much more limited than
ordinary language.
3. Communication-theory meaning is equivalent to the amount
of information that can be transmitted per unit time in a
communication system.
4. Lexicographical meaning is that of "words," and the
outside-world reference is what we ordinarily call "meaning."
5. Psychological meaning has so great a scope that the part
involving ordinary language becomes nearly trivial. It
encompasses overt or covert behavior of any organism as
responses to stimuli.
6. Word-mind meaning has the scope equivalent to that of
ordinary language. The "words" here are linguistic
structures, but the "meanings" are ideas, mental states, and
conceptual categories. To ordinary meanings (in the lexical
sense) here correspond signals by which mental states are
ascertained.
7. Linguistic meaning refers to signals as the pieces out
of which language is made, i.e. microlinguistic,
phonological, and syntactic signals.
In the framework of our particular topic we shall be mainly
concerned with categories 4 and 7.
According to Weinreich (1966), unilingual defining
dictionaries appear to be based on a model that assumes a
distinction between meaning proper (signification, comprehension,
intension) and the thing meant by a sign (denotation, reference,
extension). On the basis of what is meant by a sign, Osgood,
Suci, and Tannenbaum (1957) distinguish three kinds of meaning.
1. Pragmatical (sociological) meaning: the relation of
signs to situations and behaviors.
2. Syntactical (linguistic) meaning: the relation of signs
to other signs.
3. Semantical meaning: the relation of signs to their
significates.
It is easy to see that these classes are in
correspondence with Longyear's three layers in category 7.
Homing in on our primary target, we may now restrict our
interests somewhat further and concentrate on the last two
classes of meaning, known under various designations but, by the
majority of writers, distinguished as structural meaning and
lexical meaning.
Mackey (1965) finds structural meanings in (1) structure
words, (2) inflectional forms, and (3) types of word order.
Examples of structure words are articles and prepositions, and
these, he insists, although often called meaningless or empty,
may have a large number of meanings. Similarly, the inflectional
forms, such as the genitive case and present tense, may have a
number of meanings, and so may some types of word order. Lexical
meanings, on the other hand, refer to the meanings of the content
words, in which the differences in meaning are most easily seen.
In Russell's view (1967) the structure words, such as "than,"
"or," "however," have meaning only in a suitable verbal context
and cannot stand alone. The content words, which he calls object
words, such as proper names, class names of animals, and names of
colors, do not presuppose other words and can be used in
isolation. Their meaning is learnt by confrontation with objects
that are what they mean or instances of what they mean. As soon
as the association between an object word and what it means has
been established by the learner's hearing it frequently
pronounced in the presence of the object, the word is understood
also in the absence of the object. This explanation, of course,
excludes words that denote abstract entities, which are not
object-like and usually cannot have a "presence." It also denies
that every structure word inherently denotes one or a few
definite relationships even in isolation. If this were not so,
one could not understand what kind of relationship such a word
designates when used in a context.
Lyons (1969), quite sensibly, distinguishes between three
different kinds of structural, or grammatical, meaning.
1. The meaning of grammatical items, such as prepositions
and conjunctions.
2. The meaning of grammatical functions, such as subject
and object, i.e. syntactical relations.
3. The meaning associated with notions such as declarative,
interrogative, imperative, i.e. syntactical types.
He further rightly observes that grammatical items belong to
closed sets, which have a fixed, small membership, e.g. personal
pronouns. Lexical items, on the other hand, belong to open sets,
which have an unrestricted, large membership, e.g. nouns.
Moreover, lexical items have both lexical (material) and
grammatical meaning, whereas grammatical items have only
grammatical meaning.
In our work, the distinction between structure words and
content words is essential. This fact is clearly seen in the
preparation of the dictionary used for our experiments.
SOME PROBLEMS OF LEXICAL RELATEDNESS
1. Polysemy and Homonymy 
While the problem of meaning is complex in itself, the
difficulty increases by another order of magnitude if one has to
deal with words of many meanings, or with different words of
different meanings that have identical spellings or
pronunciations. And the decision as to whether a given case
represents one polysemous word or two (or more) homonyms is far
from being well defined.
The separation can be based on morphological criteria. First
of all, two graphematically identical word forms with different
meanings are regarded as homographs and separated if they display
a phonematic difference or if they belong to different word
classes. They are also homographs if they belong to the
same word class but possess different inflection systems.
Otherwise, they represent the same word. More than one meaning
of one word constitutes a case of polysemy. In contrast with
such diversified meanings of one word, we talk about homonymy, in
which case two words have by chance acquired the same external
appearance. A distinction between the two can only be made, if
at all, on the basis of the historical origin of the words
involved. Direct, transferred, and specialized senses of a word
can be listed along one dimension of meaning; dominant and basic
senses represent certain measures along another dimension.
Another concept is semantic depletion, in which case the word
occurs in scores of expressions. Here, the verbal or situational
context adds substantially to the meaning of the word in
question. With polysemy, however, the context eliminates those
senses of the word that do not apply and thereby disambiguates
the polysemous word. It is, therefore, important from the
lexicographical point of view to distinguish between the degrees
of interaction between the context and the meaning of individual
words:
(a) in case of weak influence, we talk about autosemantic or
semantically autonomous words;
(b) a strong influence performs a disambiguation of
polysemous or homonymous words;
(c) the context defines the meaning of synsemantic or
semantically depleted words.
Needless to say, the above decisions, like innumerable others,
must often be based on subjective criteria. Finally, it could be
noted that, in exceptional cases, even the immediate context
cannot resolve the ambiguity, and two or more interpretations
are acceptable.

2. Synonymy
It is clear even to the casual observer that total
interchangeability in all contexts, and identity in both
cognitive and emotive senses, of two lexical units (words, in the
simplest case) are not possible in general. The semantic
relationship between synonyms is based on and measured by a level
of similarity.
Rather than distinguishing between the "meaning" and the
"usage" of a word, one should assume the view that the former is
the sum total of the possibilities of the latter. This is
basically what justifies the existence of any monolingual (and,
possibly, bilingual) dictionary.
The entries in the dictionaries we are concerned with are
both words (the interpretation and definition of which units are
less than clear-cut) and multi-word lexical units. The two are
of the same standing and function, and they will be treated
identically.
3. Definitions 
Definition is the most fundamental concept associated with
dictionaries. We shall be concerned with both classical
Aristotelian definitions, based on "class" and "characteristics",
and operational definitions, which use sentential, generative
terms. In fact, it is often difficult or impossible to separate
equivalence or paraphrase definitions, on one hand, from those
that are process-oriented reproductions, on the other.
In general, the lexical meaning can be rendered by four basic
instruments and their various combinations:
(a) the lexicographic definition enumerates the most
important features of the lexical unit being defined, in the
simplest possible terms;
(b) qualified synonyms provide a system of semantically most
related words;
(c) exemplification puts the defined unit in functional
combination with other units;
(d) a gloss is an explanatory or descriptive comment related
to the dictionary entry; it may also state similarities to
and differences from other entries.
ASPECTS OF THE SCIENCE OF DICTIONARY
1. General Concepts
Though definitions abound, a reasonable distinction seems
to be to say that the semantic description of individual terms,
the inventory of words, is the customary province of lexicography,
whereas lexicology refers to the study of the lexical material,
of the recurrent patterns of semantic relationships, and of any
formal devices, such as phonological and grammatical systems,
that generate the latter.
To construct a dictionary of a given size, one could choose
the entries on the basis of their frequency of occurrence or by
relying on some measure of utility that is vaguely tied to the
semantic generality of the candidates. No solution is perfect or
even uniformly useful over the whole dictionary.
Even the arrangement of meanings of a given entry is moot.
We talk about logical, historical, and empirical orders. (The
latter starts with the common and current usage, followed by
obsolete, colloquial, provincial, slang, and technical meanings.)
We can differentiate between encyclopedic and linguistic
dictionaries. The latter are primarily concerned with
the lexical units of the language and all their linguistic
properties. The former, on the other hand, give information
about some component of the extralinguistic world. Our work
derives its data base from an encyclopedic dictionary. It should
be noted that the highly polysemous nature of the entries in a
linguistic dictionary would have constituted an additional
complication in this pilot project, which has now been avoided
without affecting the general validity of the results.
We propose to introduce the term lexicometry to designate the
discipline which investigates and analyzes the quantitative
aspects of dictionaries, the vocabulary of a language, and various
subsets of the latter. Lexicometry would count, weigh, and
measure, and express the results in statistical and mathematical
terms. Many such studies are widely known. Such is the one
reported by Guiraud (1959):
The most frequent words are:
(a) the shortest,
(b) the oldest,
(c) the morphologically simplest,
(d) the semantically most extended, i.e. possessing the
greatest number of meanings.
As to the measure of frequency,
the first   100 words cover 60%   of an average text,
the first 1,000 words cover 85%,
the first 4,000 words cover 97.5%.
Thus the remaining X (?) thousand words cover only 2.5% of the
text. However, from an information theoretic point of view,
the first   100 words comprise 30% of the information,
the first 1,000 words comprise 50%,
the first 4,000 words comprise 70%.
Consequently, rare words convey a great deal of information. We
could say that a frequent word is most useful in the aggregate,
and a rare word in a particular case.
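Coverage figures of this kind can be reproduced for any text sample. The following sketch is our own illustration (the toy word list is not Guiraud's material); it counts what fraction of all running words the n most frequent word types account for:

```python
from collections import Counter

def coverage_of_top_words(tokens, n):
    """Fraction of all token occurrences accounted for
    by the n most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(n)
    return sum(c for _, c in top) / total

# Toy corpus; real measurements would use a large text sample.
text = ("the cat sat on the mat and the dog sat on the rug "
        "because the cat and the dog like the mat").split()
print(round(coverage_of_top_words(text, 3), 2))  # 0.5
```

Run over a genuinely large corpus, the same function reproduces the steeply diminishing returns quoted above: each additional block of frequent words adds ever less coverage.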
Other studies in glottochronology concern themselves with the
rate of change in language and in basic vocabulary. Further, the
distribution of the frequencies of occurrence, with or without
reference to any particular vocabulary, has also been studied.
Finding relations of the above kind is not just an academic
exercise to satisfy the curiosity of a few linguists; these
relationships may have various practical applications. For
example, Maas (1972) asserts that knowledge of a functional
relation between the length of a text and the size of the
vocabulary used in it would be desirable in order to estimate the
effort needed for the extension of a machine dictionary, or in
the comparison of the vocabulary contents of texts of different
lengths. In the latter case, one can standardize or normalize
the texts under investigation by reducing them to a common
minimal length through computational methods and then compare the
resulting vocabulary volumes.
Let V be the number of elements (words) in a text and N the
length of the text. Then we surmise, says Maas, a functional
relationship to exist between N and V:

    V = f(N).

Muller (1964) reported a relation between V and N such that
the ratio of their logarithms is constant:

    log N / log V = a,  or  V^a = N,

or, if we set 1/a = k,

    V = N^k.
Since the vocabulary of a language, however, is supposed to
be restricted, so argues Maas, the existence of a limiting value
is to be postulated:

    V_inf = lim f(N)  as  N -> infinity.

As the derivative of f at a given value of N represents the
relative increase in V, it is to be stated that f'(N) approaches
0 with increasing N. The derivative of f at the point 1 is
assumed to be 1, because a text of length 1 has a vocabulary
consisting of one word; hence f'(1) = 1. Therefore f' is a
function that decreases monotonically from 1 to 0.
As a consequence of the above speculations, in the expression
V = N^k, k cannot be constant. Statistical investigations of the
dramas by Corneille have resulted in the relationship

    log (1/k) = 0.0137 * (log N)^(1/3).

Thus, if N is given, k can be determined, and V can be calculated
from V = N^k.
Another noteworthy concept is that of the repetition factor,

    R = N / V,

which shows how often a word has occurred in a text on the
average. The following relationship has been determined:

    log R = (0.179 log N + 0.026)^2,

which displays a very good agreement with reality.
No single empirical law seems to exist between N and V for
all N.
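The quantities discussed above are straightforward to measure on an actual text. The sketch below is our own illustration (the token list is a toy, not Maas's or Muller's material); for each prefix of length N it records the vocabulary size V, the exponent k = log V / log N of the relation V = N^k, and the repetition factor R = N/V:

```python
import math

def vocab_growth(tokens):
    """For each prefix length N, record (N, V, k, R), where
    V is the vocabulary size, k = log V / log N (so V = N**k),
    and R = N / V is the repetition factor."""
    seen, rows = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        v = len(seen)
        k = math.log(v) / math.log(n) if n > 1 else 1.0
        rows.append((n, v, k, n / v))
    return rows

tokens = "a b a c a b d a".split()
n, v, k, r = vocab_growth(tokens)[-1]
print(n, v, r)  # 8 4 2.0
```

On real texts the measured k drifts downward as N grows, which is exactly the observation above that k cannot be constant.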
2. The Problem of Coverage
We are now coming close to the core subject matter of this
paper. Mackey (1965) states that
"The coverage or covering capacity of an item is the
number of things one can say with it. It can be measured by
the number of other items which it can displace."
According to him, words can displace other words by four
means: (1) inclusion, (2) extension, (3) combination, and (4)
definition.
1. A word that already includes the meaning of other words
can be used instead of these (e.g., seat includes chair,
bench, stool, and place).
2. Words the meanings of which are easily extended
metaphorically can be used to eliminate others (e.g.,
tributary of a river can be covered by branch or arm).
3. Certain simple words can displace others by combining
either together or with simple word endings (e.g., news +
paper + man = journalist; hand + book = manual).
4. Certain words can be replaced by simple definitions
(e.g., breakfast can be defined as morning meal; pony as
small horse).
As an example of the application of the above principle, in
the derivation of Basic English (by definition), the language was
first reduced to 7,500 words and, by redefinition, cut down to
1,500. These were further reduced to the eventual 850 by a
technique of "panoptic" definition (eliminate each word on the
grounds that it is some sort of modification of other words, e.g.
a modification in time, number, or size).
Basic English, which was founded essentially on the principle
of coverage, was a conscious reaction against the
over-application of the principle of frequency in selection. For
Ogden (1933), it was not the frequency of a word which made it
useful; it was its usefulness which made it frequent.
In the following part of this section, we attempt to present
some of the salient points of Savard (1970).
The vocabulary indices most widely known today are those of
frequency, of distribution, and of availability. But these are
not sufficient to select words for a restricted vocabulary for
the purpose of teaching a foreign language, such as French, to
beginners.
An objective criterion is lexical valence. It would allow one
1. to obtain a novel principle of vocabulary selection,
2. to assist the investigators in setting up a base
vocabulary for French,
3. to provide a usable definition, combination, inclusion,
and extension vocabulary,
4. to correct all the already existing scales of French
vocabulary,
5. to provide a valid working tool for the analysis of
teaching material.
The valence problem is a problem of verbal economy. What
Savard calls valence is the fundamental capability of a word to
be substituted for another word. It is Mackey's coverage that he
renders as valence.
Like Mackey (1965), he maintains that the substitution of one
word for another can be made by virtue of four criteria: (1)
definition, (2) inclusion, (3) combination, (4) extension.
Definition has already been discussed previously.
Linguists do not talk specifically about inclusion; rather,
they deal with synonymy or lexical parallelism. Synonyms are
words that have nearly the same meaning, e.g. lieu and endroit.
For Savard, the basic criterion that permits one to establish a
series of synonyms is the possibility of substituting one term
for another.
One of the simplest among all the procedures of vocabulary
enrichment consists of joining two words in order to make
compound words. The principle of combination appears as another
phenomenon common to all languages.
It is not necessary that the number of simple words be
unbounded, because almost all verbs have a potential of
undetermined sense, and so do the adjectives. A word is said to
have more or less extension according to whether it can "cover" a
larger or smaller number of fully or partially different
notions.
Polysemy is the exact opposite of synonymy. Polysemy becomes
complicated due to the phenomenon of homonymy. Polysemy and
homonymy constitute two very rich sources of lexical economy.
Together they form Savard's last criterion of lexical
valence: semantic extension.
Although valence itself has never been mathematically
measured, and although there exists no scientific means of showing
its existence, it has nevertheless been proven that four formal
procedures of lexical economy permit the replacement of certain
words by other words, and that is what Savard calls lexical
valence.
The postulated existence of lexical valence leads to the
calculation of a global index of valence for every word. To
evaluate the power of definition of a word, one inspects, in the
dictionary, each element of the general list and counts how many
times a word enters into the definition of another.
To measure the power of combination of a lexical unit, one
inspects in the dictionary all the compound words joined by a
hyphen, all the Gallicisms (in English, these would be
Anglicisms) and, in general, all the word groups.
With a view to appraising the power of inclusion, one
inspects the units of the general list in two synonym
dictionaries and takes the higher number. The number of synonyms
that a word possesses constitutes a measure of the number of
words for which it can be substituted.
To measure the power of semantic extension, one inspects each
of the elements of the general list in the dictionary and counts
the number of meanings given by the author to such a word in the
list. The number of meanings of a word is considered as a
measure of its power of semantic extension.
The global index of lexical valence is the sum of the four
normalized counts. The two criteria having the highest
correlation are definition and combination.
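A minimal sketch of such a global index follows. The counts here are illustrative inventions of ours, not Savard's data; each criterion is normalized by the largest count observed for it before the four are summed:

```python
def valence_index(counts):
    """Global index of lexical valence: the sum of the four
    per-criterion counts, each normalized by the largest count
    observed for that criterion across the word list."""
    criteria = ("definition", "combination", "inclusion", "extension")
    maxima = {c: max(row[c] for row in counts.values()) or 1
              for c in criteria}
    return {word: sum(row[c] / maxima[c] for c in criteria)
            for word, row in counts.items()}

# Illustrative counts only -- not Savard's data.
counts = {
    "make": {"definition": 40, "combination": 30,
             "inclusion": 10, "extension": 20},
    "lieu": {"definition": 10, "combination": 3,
             "inclusion": 5, "extension": 4},
}
idx = valence_index(counts)
print(round(idx["make"], 2), round(idx["lieu"], 2))  # 4.0 1.05
```

Normalization by the per-criterion maximum keeps the four counts, which live on very different scales, from dominating one another in the sum.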
At the beginning of the study, it was assumed that the four
variables were entirely independent of each other. The results
of a factor analysis indicate that they are not completely so. A
factor rotation shows, however, that the variables are
sufficiently independent to make it necessary to retain the four
criteria of lexical valence.
A comparison of the ranks of the first 40 content words on the
valence scale with the same words on the frequency list allows
one to frame the hypothesis that the correlation between valence
and frequency is rather weak. A more complete study would show
without doubt that we have here two very different selection
principles.
In conclusion, it can be stated with confidence that the
measure of valence is no less valid than those of frequency,
distribution, and availability. These concepts will eventually
lead to more efficient dictionaries with respect to precision,
compactness, and lexical economy.
ON LEXICOMETRIC RELATIONSHIPS AMONG THE SIZE OF DEFINING SET,
THE SIZE OF DEFINED SET AND THE MAXIMUM LENGTH OF DEFINITIONS
1. Some Measures of Coverage 
A dictionary may be considered efficient and economical if it
uses a reasonably small set of words to define a relatively large
set of entries. We have, however, only a very vague idea about
what size vocabulary is needed to cover a given number of
dictionary entries. (The related problem of circular definitions
seems to have to wait for a computer solution.)
It is known, for example, that Basic English, Ogden (1933),
involves a list of 850 English words and 50 international words,
which were eventually used to define the 20,000 English words of
the Basic English Dictionary. This gives a ratio of the number of
covering words to that of defined words of 0.045.
West studied the problem of what constitutes a simple
definition and established a minimum defining vocabulary of 1,490
words. The meaning of some 18,000 words and 6,000 idioms, i.e.
about 24,000 expressions, was explained exclusively by these
1,490 words, which were not defined themselves. The results were
published in 1961 as The New Method English Dictionary by Michael
West and J. G. Endicott. The corresponding size ratio here is
0.062.
The above roughly indicates that a set of about 1,000 words
can define a set of about 20 times that size, but in general the
behavior of these variables has not been investigated and is not
known in any detail.
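The two size ratios quoted above follow directly from the counts; a minimal check:

```python
def size_ratio(covering, covered):
    """Ratio of covering-vocabulary size to defined-entry count."""
    return round(covering / covered, 3)

print(size_ratio(850 + 50, 20_000))  # Basic English: 0.045
print(size_ratio(1_490, 24_000))     # West and Endicott: 0.062
```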
One of us, in Findler (1970), has formulated the problem in
definite terms. Three variables were considered: (1) the
covered set S of size v_S, (2) the covering set R of size v_R,
and (3) the maximum definition length N, such that each word in S
can be defined by at most N ordered words.
The task was to find:
(a) v_R as a function of v_S at different values of N as a
parameter, and
(b) v_R as a function of N at different values of v_S as a
parameter.
Using the terminology of increment ratio for Δv_R/Δv_S and size
ratio for v_R/v_S, it was postulated for case (a) that
* the increment ratio is, in general, less than one,
* the increment ratio, in general, decreases as v_S increases,
* for large values of v_S, v_R asymptotically approaches a
limiting value as v_S increases,
* the increment ratio will never exceed the size ratio.
An exception to this rule would occur in a dictionary system
which does not treat homonyms as individual entries, every time a
new word with many homonyms is introduced into the covered set.
It was further assumed that for N = 1 the covering set and the
covered set are of the same size, i.e. both the increment ratio
and the size ratio equal one. We must now correct this statement
because not every word is defined by itself only. If a new word
is introduced that already has a synonym in the covering set, it
will be defined by that synonym. Then the increment ratio is 0
and the size ratio becomes less than 1.
For the second case, (b), it is postulated that
* v_R monotonically decreases as N increases,
* for any fixed v_S value, v_R asymptotically approaches a
lower limit as N increases without bound.
It was finally pointed out that v_R should be small to
minimize storage requirements, and N should be small to minimize
processing time and output volume. A compromise on these
conflicting requirements is needed. The ultimate question is:
"What are the optimum v_R and N values for a given v_S for
certain computer applications on a machine with a given cost
structure?"
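The bookkeeping behind the increment and size ratios can be illustrated with a toy dictionary. The greedy scheme below is our own sketch, not Findler's program: the covered set grows one entry at a time, the covering set is the union of all words used in definitions so far, and at each step the increment ratio Δv_R/Δv_S and the size ratio v_R/v_S are recorded:

```python
def coverage_ratios(definitions):
    """Grow the covered set one entry at a time; the covering set
    is the union of all words used in definitions so far.
    Yields (v_S, v_R, increment, size ratio); Δv_S is 1 per step."""
    covering = set()
    prev_r = 0
    for i, (word, defn) in enumerate(definitions, start=1):
        covering |= set(defn)
        v_r = len(covering)
        yield i, v_r, v_r - prev_r, v_r / i
        prev_r = v_r

# Toy dictionary: each entry defined by at most N = 2 words.
entries = [
    ("pony",      ["small", "horse"]),
    ("breakfast", ["morning", "meal"]),
    ("foal",      ["young", "horse"]),   # reuses "horse"
]
for v_s, v_r, inc, ratio in coverage_ratios(entries):
    print(v_s, v_r, inc, round(ratio, 2))
```

In this toy the ratios start above one because the covering words are not themselves entries; the point is only the trend: as soon as defining words are reused, the increment ratio drops, which is the decrease postulated for case (a).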
It is reasonable to assume that the behavior of the three
variables, and therefore the answer to the last question, will
largely depend on the semantic index of the elements of the
covered set and on the lexical valence of the elements of the
covering set. The latter implies that, for an efficient and
economical dictionary, the elements of the covering set must be
chosen from the available vocabulary on the basis of a careful
analysis. As research aimed at these goals is practically
nonexistent, it is safe to assume that most of the existing
dictionaries are suboptimal. Work in this area will be useful,
challenging, and rewarding, but the investigators must be
prepared to spend a considerable amount of time and effort on it.
All the more so as the entire problem complex outlined in the
preceding parts will directly or indirectly enter into such
investigations.
The project described here is only a small beginning. It was originally intended to complete the investigation of both cases, (a) and (b), defined above. In view of the effort needed, in terms of human and machine time, only the first part is accomplished at the time of writing this report. Appendix II contains the design of the program for case (b).
2. Construction of the Data Base 
The data base was not derived from a text but was based on an existing dictionary of computer terminology, Chandor (1970). A derivation from a text, if used, should be automatic and would constitute a large-scale programming project in its own right. In creating the data base, it was attempted to keep its structure simple and uniform without sacrificing its general validity. Problems that would introduce distracting complications, from both theoretical and practical points of view, into the subsequent operations were avoided. All this led to the selection and construction principles outlined below.
Terms with excessively long definitions were avoided, i.e. definitions were held reasonably short. It was found that limiting the maximum definition length to 22 lexical units did not unduly restrict the selection. In some cases, overly long definitions were shortened by leaving out redundant words, glosses, or explanatory notes.
Every element of the covered set was considered a lexical item, regardless of whether the original dictionary entry consisted of one, two, or more words. For programming convenience, every word was coded as a string of no more than 10 symbols. Thus accumulator was represented as ACUMULATOR, absolute address appeared as ABSADDRESS, and absolute value computer as ABSVALCOMP.
Polysemous terms were avoided. If such a term was used, only its dominant meaning was recorded. In the data-base dictionary, then, each entry (element of the covered set) has only one meaning and one definition.
Terms used in the definitions (elements of the covering set) were also considered to be lexical items, i.e. original multiword terms appear as a single element, and every element is represented as a string of no more than 10 symbols.
All terms occurring in the definitions are themselves defined, i.e. each element of the covering set appears also in the covered set. This principle implies that there is a set of words each element of which is defined by itself. Such a set may be called the basic vocabulary, consisting of words the meanings of which the user of the dictionary is supposed to know in order to use the dictionary. In this particular case, the dictionary is one of computer terms, and the basic vocabulary contains the nontechnical words used in the definitions of the technical terms.
In the definitions, a definite distinction was made between content words and function words, also called operators. The latter were not included in the covering set, nor were they counted in determining the definition length. Hence, the covering set consists only of content words.
The set of function words is defined rather broadly. It contains a wide variety of expressions that do not directly contribute anything to the content of the definition but only indicate grammatical and logical relationships between the words that form the content. It includes:
1) prepositions, e.g. of, in, to;
2) conjunctions, e.g. and, or, if;
3) the relative pronoun which;
4) combinations of preposition and relative pronoun, e.g. in which, to which, by which;
5) present participles equivalent to a preposition, e.g. using, containing, representing;
6) combinations of participle and preposition, e.g. consisting of, opposed to, applied to;
7) combinations of adjective and preposition, e.g. capable of, exclusive of, equal to;
8) combinations of noun and preposition, e.g. part of, set of, number of;
9) combinations of preposition, noun, and preposition, e.g. in terms of, by means of, in the form of;
10) prepositional phrases associated with a following infinitive, e.g. used to, necessary to, in order to;
11) other frequently used purely functional expressions, e.g. for example, namely, known as.
Actually, the function words were replaced by code numbers in the dictionary. The code numbers were assigned consecutively as the function words were needed during the construction of the data base, so that the order is purely random. A complete list of the 121 function words used, together with their code numbers, is given in Table I.
INSERT TABLE I ABOUT HERE
is equivalent to, of, in, in terms of, using, and, which, in which, between, to, or, from, used to, necessary to, part of, consisting of, containing, capable of, by means of, opposed to, when, on, so that, in order to, exclusive of, for, pertaining to, if, among, namely, related to, concerned with, based on, constituting, resulting from, set of, including, followed by, provided by, developed by, assigned to, referred to, on which, used as, in the form of, from which, into which, number of, less, defining, known as, performing, performed by, under, as, such as, equal to, into, with, according to, applied to, depending on, to which, whose, obtained by, inherent in, through, during, where, during which, out of, at, by which, used in, without, caused by, over, not, but, extended to, independent of, chosen by, for which, at which, whether, used by, about, before, per, having, formed by, around, after, since, against, until, whereupon, except, determined by, over which, in relation to, belonging to, corresponding to, due to, required for, type of, across, so as to, for example, represented by, along which, representing, against which, similar to, because, designed to, indicating, produced by, outside, towards
TABLE I 
List of Function Words 
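The coding scheme just described can be sketched as follows. The consecutive assignment of codes and the sample stylized definition follow the text; the helper name, the code values, and the small function-word sample are illustrative, not the actual Table I assignments.

```python
# Sketch (assumed scheme): codes are assigned consecutively, in the order
# the function words are first needed during data-base construction.
FUNCTION_WORDS = {"in", "of", "which", "in which"}  # small sample of the 121

codes = {}  # function word -> integer code

def code_for(word):
    """Return the code for a function word, assigning a new one on first use."""
    if word not in codes:
        codes[word] = len(codes) + 1
    return codes[word]

definition = ["DEFECT", "in", "SYSTEM", "of", "ELECTRONIC", "LENS",
              "of", "CATHRAYTUB"]
encoded = [code_for(w) if w in FUNCTION_WORDS else w for w in definition]
# the definition length counts content words only, as in the text
length = sum(1 for w in definition if w not in FUNCTION_WORDS)
print(encoded)   # content words unchanged, function words replaced by codes
print(length)    # 5
```

Note that a repeated function word ("of") receives the same code both times, while content words pass through untouched.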
The original definitions were somewhat simplified and standardized. In this process, articles were omitted (many languages do very well without them). On the other hand, implicit relationships were made explicit. A few examples serve as illustrations, with the function words (in parentheses) inserted explicitly instead of their code numbers.
Original dictionary entry:
aberration A defect in the electronic lens system of a cathode ray tube.
Definition in the data base:
DEFECT (in) SYSTEM (of) ELECTRONIC LENS (of) CATHRAYTUB
Note that "electronic lens system" (it should be: electronic-lens system) means "system of electronic lens" (as opposed to "electronic system of lens"), and this relationship is made explicit. Note also that "cathode ray tube" is a single lexical item.
Nouns are represented in the singular, thus avoiding another dictionary entry for the plural or, what would be worse, programming a "grammar." Likewise, finite verb forms are represented in the third person plural present indicative active. Avoiding the third person singular eliminates another dictionary entry, and avoiding the passive voice eliminates a great many participles, which otherwise would have had to be entered. Of course, present and past participles (the former identical to the gerund in form) could not always be avoided and had to be entered in the dictionary where needed. Auxiliary verbs were automatically eliminated by avoiding compound tenses and the passive voice. Finally, "to do" associated with negation was simply omitted.
Original:
absolute coding Program instructions which have been written in absolute code, and do not require further processing before being intelligible to the computer.
Data-base entry: ABSOCODING
Definition:
PROGRAM INSTRUCTIO (which) ONE WRITE (in) ABSOLUCODE (and which not) REQUIRE FURTHER PROCESSING (before) INTELIGIBL (to) COMPUTER
Note that the first predicate in the relative clause, third person plural perfect indicative passive, is represented by the singular indefinite pronoun "one" as subject, followed by the standard plural active verb. The auxiliary "do" has been omitted, and the negation is represented by a function word. The virtually redundant "being" has also been left out. In general, the copula is omitted (some languages do very well without it).
Original:
analytical function generator A function generator in which the function is a physical law. Also known as natural law function generator, natural function generator.
Data-base entry: ANLYTICGEN
Definition:
FUNCGENRTR (in which) FUNCTION PHYSICAL LAW
Note also the omission of the gloss "Also known as ...".
The stylized definitions are easily understandable even to human readers, as the printout of the dictionary demonstrates.
The data base was constructed by selecting the first entry, then entering all the lexical items in its definition, subsequently entering all the lexical items in the definitions of these, etc. Words that were not defined in the original dictionary were entered and defined by themselves; they constitute the basic vocabulary. This procedure was continued until everything was defined, i.e. until all the terms in the covering set were also in the covered set. Then the next entry was selected from the dictionary, and the above process was repeated.
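The construction procedure above amounts to a transitive closure over definitions. A minimal sketch, with a hypothetical source dictionary (the entry names are not from the actual data base):

```python
# Sketch of the data-base construction: starting from a seed entry, pull in
# every lexical item used in a definition until the covering set is wholly
# contained in the covered set.
source = {   # hypothetical source: entry -> content words of its definition
    "accumulator": ["register", "store", "result"],
    "register": ["device", "store", "data"],
    "store": ["store"],       # nontechnical word: defined by itself
    "result": ["result"],
    "device": ["device"],
    "data": ["data"],
}

def build_data_base(seed, source):
    data_base = {}
    waiting = [seed]
    while waiting:
        word = waiting.pop()
        if word in data_base:
            continue
        # words absent from the source are defined by themselves
        # (they join the basic vocabulary)
        definition = source.get(word, [word])
        data_base[word] = definition
        waiting.extend(definition)   # covering words must themselves be covered
    return data_base

db = build_data_base("accumulator", source)
assert all(w in db for d in db.values() for w in d)   # covering within covered
```

The final assertion is exactly the stopping condition of the text: every term in the covering set is also in the covered set.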
It had been tentatively intended to compile a covered set of about 1,000 lexical items. When this number was reached, a rough pencil-and-paper check indicated that the size ratio was about 0.91 at that point. It was then decided that the data base should be somewhat larger to show the relationships under investigation more perceptibly, and more words were added.
When the size ratio had decreased to about 0.79, the construction of the data base was concluded, as processing difficulties were anticipated with too large a data volume. At that point the data-base dictionary had precisely 1,856 entries (as was later verified by the program). This was considered to be a satisfactory compromise.
The dictionary was arranged in the form of a SLIP list, Findler et al. (1971). Every entry (element of the covered set) occupies four cells in this list: (1) entry word (in A10 format), (2) definition length (an integer), (3) type of entry (an integer), (4) sublist name.
Three types of entries were distinguished for programming convenience:
1) code 0 indicates that the entry itself is not used in any definition, i.e. it occurs only in the covered set and not in the covering set;
2) code 1 indicates that the entry occurs in both sets and is not an element of the basic vocabulary;
3) code 2 indicates that the entry is defined by itself, i.e. it belongs to the basic vocabulary.
The sublist, the name of which is in the fourth cell of every entry in the main list, contains the definition. This arrangement conveniently separates the entry words from those in the definitions.
A cell in this second level contains either a word (in A10 format), i.e. an element of the covering set, or a sublist name. The codes for function words (integers) are contained in the cells in the third level. This arrangement is convenient for bypassing the function words in processing when they are not needed. A typical dictionary entry is illustrated in Figure 1.
INSERT FIGURE 1 ABOUT HERE
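The four-cell arrangement can be sketched with nested Python lists standing in for SLIP cells and sublists. The entry is adapted from the aberration example above; the helper name and the use of plain lists (instead of named SLIP sublists) are illustrative.

```python
# Sketch of a main-list entry: entry word (A10 string), definition length,
# entry type (0 = covered only, 1 = in both sets, 2 = basic vocabulary),
# and the definition sublist. Third-level sublists hold function-word codes,
# so they can be bypassed when not needed; the code values here are arbitrary.
entry = ["ABERRATION", 5, 1,
         ["DEFECT", [2], "SYSTEM", [1], "ELECTRONIC", "LENS", [1], "CATHRAYTUB"]]

def content_words(definition):
    """Walk a definition sublist, bypassing function-code sublists."""
    return [cell for cell in definition if not isinstance(cell, list)]

word, length, etype, sublist = entry
print(content_words(sublist))
# ['DEFECT', 'SYSTEM', 'ELECTRONIC', 'LENS', 'CATHRAYTUB']
```

Note how the definition length (5) counts exactly the content words at the second level, never the function-code sublists.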
The fact that every dictionary entry owns a sublist is practical in another respect: useful information about the entry can be collected and deposited in a description list associated with the sublist. For example, if it were desired to evaluate the definition component of the lexical valence of each lexical item, a program could be developed that counts how many times a particular item occurs in the definitions of other items and stores this information in the description list created for that item. Investigations of this nature will be done at a future date.
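Such a valence count could be sketched as follows. The data-base fragment is hypothetical, and this illustrates only the tally itself, not the SLIP description-list machinery.

```python
# Sketch: tally how many definitions each lexical item occurs in. This is
# the kind of information that could be deposited in the description list
# associated with each item's sublist.
from collections import Counter

data_base = {   # hypothetical fragment: entry -> content words of definition
    "accumulator": ["register", "store", "result"],
    "register": ["device", "store", "data"],
    "index": ["register", "store"],
}

# set() so that a word repeated within one definition is counted once per
# definition, matching "how many definitions it occurs in"
valence = Counter(w for definition in data_base.values()
                    for w in set(definition))
print(valence["store"])     # occurs in 3 definitions
print(valence["register"])  # occurs in 2 definitions
```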
The program developed for processing all the necessary information is rather complex. Since many of its organizational characteristics may be of fairly general interest to those who wish to engage in lexicometric studies, a brief description is given in Appendix I.

Fig. 1. A representative entry in the data-base dictionary. [Diagram: a main-list entry of four cells (entry word, definition length, entry type, sublist name) points to a definition sublist of words and sublist names; the third-level sublists hold function-word codes.]
3. The Results of the Computations
The relationships between the size of the covering set vR and that of the covered set vS are summarized in Table II. The table lists the size of both sets, the size ratio, the increment of either set, and the increment ratio for four values of N. Figure 2 presents vR as a function of vS, with N as a parameter, in graphical form.
INSERT TABLE II AND FIGURE 2 ABOUT HERE
The table shows that, in general, the increment ratio is less than 1, except for one case, to which we shall return below. In the meantime note that, for the full dictionary, the table definitely verifies the assumption that the increment ratio decreases with increasing vS. This, however, does not seem to be true for the reduced dictionary. In fact, for all three cases of the latter, the ratio tends to increase with increasing vS. Therefore the single occurrence of the value 1 is plainly a random event, as the ratio is very close to 1 at the largest vS value also in the two other cases. The sequence of values is evidently approaching unity.
TABLE II
Covered-Covering Relationships

This somewhat unexpected, though not particularly surprising, phenomenon is due to the combination of a number of circumstances. We are dealing with a specific technical dictionary. In such a dictionary, nontechnical, i.e. ordinary-language, words are not defined. However, a sizeable set of nontechnical words is necessary to define the technical terms. All the former, in our case, belong to the set of basic vocabulary and are defined by themselves. The result is an inordinate proportion of the set of basic words even in the full dictionary. A rough pencil check during the construction of the data base showed that the basic vocabulary forms about 0.55 of the entire covered set.
We recall that, in anticipation of this kind of difficulty, the function words were eliminated from the covering set to begin with. If this had not been done, the situation would have been aggravated by an order of magnitude. To eliminate, or at least to alleviate, this bias, a considerably larger data base should be used, which, as explained before, would have been beyond the scope of this pilot project.
Another, and more important, factor that contributes to the problem in question is the fact that our data-base dictionary was not derived from a text but constructed from another dictionary. This was done, as described earlier, by selecting entries starting from the beginning of the dictionary and stopping when the data base was of satisfactory size. As a result, while the basic vocabulary may be assumed to be uniformly distributed over the dictionary, the important content words, with longer definitions, are not. The selection of entries, in fact, was stopped at the letter H. Words beyond that point are there only because they happened to occur in definitions. Thus, at least the words that occur only in the covered set (and not in the covering set) are crowded toward the beginning of the dictionary.
What happened when the dictionary was reduced is now obvious. The weighty words with long definitions were eliminated but the entire basic vocabulary remained. This, of course, is quite appropriate and consistent with our principles. If, for example, the dictionary had been reduced to N = 1, virtually only the basic vocabulary would have been retained, and we should have obtained the postulated linear one-to-one relationship between vR and vS. Nevertheless, this procedure enhances the proportion of the basic vocabulary, and the bias increases. As the technical words are relatively scarce in the last third of the dictionary to begin with, the situation gets worse, with the reduction, toward the end of the dictionary. This accounts for the increasing increment ratio. The last increment with N = 16 must have consisted entirely of basic words, hence the ratio of unity.
It is suggested that, for further investigation, a more complicated dictionary-reduction program be developed, which would compare all the basic words with all the remaining definitions and eliminate those that do not occur in any definition. Thus a basic word would occur in the dictionary only if it is needed in a definition, which was the case in the unreduced dictionary. This way a more natural proportion between the basic words and the others would be restored.
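The suggested pruning step might look like this in outline. The reduced dictionary shown is hypothetical; a basic word is recognized here simply by being defined by itself.

```python
# Sketch of the suggested refinement: after reduction, keep a basic word
# only if it still occurs in some remaining definition.
reduced = {   # hypothetical reduced dictionary: entry -> definition words
    "store": ["store"],          # basic word, still used below -> keep
    "device": ["device"],        # basic word, no longer used anywhere -> drop
    "register": ["store", "data"],
    "data": ["data"],            # basic word, used in register's definition
}

def prune_basic(dictionary):
    # words that occur in some definition other than their own
    used = {w for entry, d in dictionary.items() for w in d if w != entry}
    # drop entries that are self-defined (basic) and not used anywhere
    return {entry: d for entry, d in dictionary.items()
            if d != [entry] or entry in used}

print(sorted(prune_basic(reduced)))   # ['data', 'register', 'store']
```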
It is the same set of circumstances that also explains the fact that, in the reduced dictionary, the increment ratio almost consistently exceeds the size ratio. This, however, is not the case for the full dictionary, which definitely verifies the respective assumption in Findler (1970).
To demonstrate that vR approaches an upper limit with increasing vS for large N, a much larger dictionary would be needed. However, the curve in Figure 2 for N = 22 unmistakably shows a tendency in this direction.
There is, of course, another way of varying N: instead of reducing it, it could be increased, and certain words in the definitions could be replaced by their definitions. This would be a complicated procedure and difficult to control. If few such replacements are made, vR will not change appreciably. If many are made, some replacements tend to reintroduce precisely the words others try to eliminate. In any case, the result would be a set of awkward and unnatural definitions of erratic lengths. In order to use such a procedure, an efficient dictionary should first be compiled, with short definitions and a well-controlled covering set. The concept of lexical valence should be utilized, but this entails more research in this area. It would also get the researcher involved in the problems discussed in the preceding parts.
The curves for N = 16, N = 8, and N = 4 in Figure 2 all display the basic-vocabulary bias of the reduced dictionary. The last one very nearly approximates a one-to-one ratio. We must appreciate the fact that the 1,047 entries of the respective reduced dictionary contain about 1,000 basic words.
It is also to be noted that the full dictionary, with N = 22, in the region of vS = 600 requires a larger covering set than any
of the reduced versions. This is understandable as we realize that the routine that computes the data points actually simulates, rather artificially, the construction of a dictionary from a source text. The full dictionary at that stage is close to encompassing the whole source, where complex technical terms are being defined, whereas the reduced versions, at the same vS value, are already in the area in which the basic vocabulary dominates.
The project has been informative in another respect, which is not unimportant: it has given an indication of the effort involved in this type of work. It has taken a total of about 14 hours of computer time. The development of the dictionary-display program and obtaining the printout is a matter of about 7 minutes and is therefore negligible. Of the 14 hours, about 3 were spent on dictionary reduction (three series of runs) and 11 on the analysis. Although some debugging had to be done, this was generally insignificant as compared to the total effort, so that nearly all the 14 hours was useful running time.
It is also interesting that the time seems to be very dependent on the volume of data being handled. Of the 11 hours, more than 9 were spent on running the full dictionary (N = 22) and about 1 hour on the reduced version of N = 16. Completing the running of the last two series (N = 8 and N = 4) took together less than an hour of machine time.
In terms of human effort, the accomplishment of the project required about six man-months' work.
Finally, Appendix II contains a brief description of a planned program that would investigate the relationship between the size of the covering set and the maximum definition length for fixed values of the covered set size.
We wish to express our gratitude to the management of Penguin Books Ltd. for permission to use their publication A Dictionary of Computers by A. Chandor as the source for generating the data base of this project.

APPENDIX I 
Program Development
The entire data base was first punched on cards to be input as a single list structure, with the dictionary entries alphabetically ordered. It was soon established that this arrangement by far exceeded run-time storage limitations (using a field length of 100,000). Only about one fifth of the material could be accommodated at one time without exhausting the available space. Therefore the dictionary was split into five individual list structures, and the corresponding card images were stored on disk as five separate files. These were brought in, one at a time, for processing as needed. Because of space limitations, processed data and intermediate results also had to be put in external storage during run time and, of course, between runs; therefore more files had to be created, as described later. Thus, a great deal of programming effort went into file manipulation.
The purpose of the first program, designated ANALEX, was simply to display the dictionary. It first reads the function words from cards and stores them in the form of a 121x2 array. (The width of the array is 2 because many function words are longer than 10 characters.)
Using a function READLS, the program reads the dictionary and stores it in the form of a list structure as described above. On this occasion, it also measures the space required for the dictionary. It was found that a field length of more than 235,680 locations would be needed to accommodate the entire data base.
A subroutine called RITELS prints out the dictionary, specifying each entry by its definition in the form of at most 10 words to the line. The routine also checks the operator code numbers in the third-level sublists and replaces these in the printout by the appropriate function words from the array.
The dictionary was printed out in four separate runs, as the dictionary was initially divided into four lists. Since the ANALEX program does no further processing and accumulates no new lists, no storage problems arose. It was not until later that it was established that a division into five parts was necessary to perform subsequent operations in the space available.
The first printouts were carefully examined for punching errors and omissions. Detected errors were corrected and the files were updated accordingly.
The actual working program is named COVSET. If the entire data base were one single list and if time were available indefinitely, this program would do the complete work in a single run. In this case, it would print a table of corresponding vS and vR values for a given value of N, would reduce the value of N and print out another table, etc., and repeat this for all desired values of N.
This, of course, could not be done because, in the first place, only one of the five parts of the dictionary could be worked on at a time and, in the second place, the program had to be run in time increments of 600 s or less, which was the set time limit.
The principal routine in COVSET is called COVRNG, which computes the values of vR for given values of vS. Its simplified flow diagram is given in Figure 3.
INSERT FIGURE 3 ABOUT HERE
As the inherently continuous program cannot be run continuously, a few control variables are needed to provide criteria for interruption and to transfer information from one run to the next. These are read from cards at the beginning of the routine.
A reference value LSTREF is used to control the spacing of the recordings of vS and vR, because too close spacing would introduce random irregularities into the otherwise smoothly changing tendency. The reference is automatically updated after every printout of the vS and vR values. During the analysis of
the full dictionary, the reference was incremented by 200; later, in the processing of the reduced dictionary, it was incremented by 100.

Fig. 3. Flow diagram of COVRNG. [Diagram: the routine takes the next word from the dictionary or from the waiting list, lists it in S and in R as appropriate (incrementing the counts), puts its definientia on the waiting list, and prints vS and vR at intervals and at the end.]
A criterion is needed for interrupting the program before it exceeds the time limit. An estimated increase in vS was initially used for this purpose. A value MAXLEN was input, and vS was compared with it every time a new word was added to the set. When the count reached the reference value, the program was discontinued. On the average, about 15 words per run could be added to the covered set.
Later it was found that better control could be exercised by counting the number of times that a new section of the dictionary was brought in for processing. A value MAXSWP was read in, and when the above counter, starting from 0, reached this value, the run was interrupted.
The variables KNTCVD and KNTCNG are counters for vS and vR, respectively. Their current values are transferred from one run to the next. The value of KNTPRT indicates the section of the dictionary currently under investigation.
The variable ICONT is set to 0 for the very first run for each N value. This tells the routine to set up new lists for the Covered List, the Covering List, and a so-called Waiting List. In all successive runs its value is 1, indicating that the program must bring these lists in from the external file.
The routine examines the current section of the dictionary, entry by entry. In the first series of runs, it deals with one of the five sections, stored in one of the five files, in the form of the original card images. A sixth file was created for storing all the lists generated by the program. When the dictionary was later reduced (for reduced values of N), the corresponding sections of the reduced dictionary were also stored in that sixth file.
If the current entry is an element of the basic vocabulary (type 2), the routine bypasses it and takes the next entry. This can be done in the processing of the full dictionary because all these words occur in the definitions and will certainly be caught later. This is no longer so in processing the reduced dictionary, because the words in the definitions of which they occur may have been eliminated. In the latter case, therefore, this type of word is immediately added to both the Covered List and the Covering List (it always covers itself).
If the current entry is a word that does not occur in any definition (type 0), it is being encountered for the first time, and we are sure that it is not already on the Covered List; hence, this question need not be asked.
Otherwise the routine tests if the word is already on the Covered List, which may well be the case because the word may have occurred earlier in the definition of another word. If so, the routine proceeds to the next word in the dictionary.
If the word is not found on the Covered List, it is put there, and KNTCVD is incremented. Then all the words in the definition of the word in question are put on the Waiting List, which is subsequently processed. This is necessary because of the adopted principle that all the covering words must themselves be covered. An entry in the vS versus vR table is meaningful only if this condition is satisfied.
The current dictionary entry itself is recorded as the value of the variable DREF, which passes on, from one run to the next, the information where in the dictionary the program is currently in action.
The routine then examines the Waiting List, word by word. If the current word is already on the Covered List (it may have occurred earlier in the dictionary), the routine checks if it is also on the Covering List (it may not be because it has not yet occurred in the definition of another word). If not, it is put there, and KNTCNG is incremented. All words on the Waiting List come from definitions and must therefore be added to the Covering List. After a word has been processed, it is deleted from the Waiting List.
If the current word is not on the Covered List, it must obviously be put there. First, however, the routine tests if the word occurs in the section of the dictionary currently in store by checking whether its numerical value is between those of the first and the last word of the section. If the word is not there, the routine postpones its processing and takes the next word from the Waiting List, because it is more economical to process first all the words available in the dictionary section present than to read in other sections of the dictionary as the words dictate it (memory swapping is expensive).
Should the word be in that section, the routine adds it to the Covered List, increments KNTCVD, and actually looks for the word in the dictionary. If it does not find it, it gives an error message, prints out the questionable word, and terminates the run. This way the remaining punching errors in the data base were detected, and a few words were found missing (due to human error during the construction of the data base, when it was forgotten to enter words that actually occurred in definitions). The files were updated accordingly.
If the word is found, the routine adds all the words in its 
definition to the Waiting List, then investigates its presence on 
the Fovering List, and proceeds as described before. When the 
bottom of the Waiting ~1st is reached and the list is not empty, 
the words remaining on it must be in other sections of the 
dictionary. The section present is then erased and the next 
section is brought in (if the current one is section 5, section 1 
is read in). The processing - of the Waiting List now starts from 
the beyinning and continues as described above. 
If the Waiting List is finally empty, and K?ITCVD equals or 
exceeds LSTREF, the routine increments LSTREF by the prescriber? 
amount, and prints the values of KT!TTCVD and I3VTCTTG. If the couflt 
is less than the reference value, the routine simnlv proceg'ds. 
In any case, it tests if the proper section of the dictipnarv - 
happens to he in the store (it knows that hv - the value of 
KNTPRT) . If it does not, the section present' is erased an3 the 
right section is read in. 
Next the routine looks for the word at which it had 
previously stopped tracing the dictionary (it knows that by the 
contents of DREF). An error message has been provided for the 
case in which it does not find the reference for some reason. 
Fortunately, the program never made use of this message. After 
finding the reference, the routine takes the next word from the 
dictionary and proceeds as already described. 
When the routine reaches the bottom of the dictionary, it 
tests if it is the last section. If not, the next section is 
processed as described. At the end of the last section the 
routine prints the final values of vS and vR, and with this the 
processing is finished for a given value of N. 
The above smooth description involves countless runs. 
Interruption criteria are tested at appropriate places, and the 
processing is discontinued accordingly. Whenever a run is 
terminated, the three compiled lists are saved by storing them in 
the external file (we shall call it File 9 for the sake of 
convenience). The control parameters and reference variables are 
printed out. The data cards are changed accordingly, for input 
to the next run. 
The first series of runs was performed with the full 
dictionary, for which the maximum definition length N is 22. In 
the following series of runs N was gradually decreased. It was 
then also necessary to reduce the dictionary by eliminating all 
words with definition length greater than the current N, then 
eliminating all words containing them in their definitions, 
subsequently eliminating all words the definitions of which 
contain the latter, etc. 
The program calls another major subroutine, named DICRED, to 
carry out this operation. The routine is basically simple; what 
makes it appear complicated is the manipulation of the files. It 
was found to be most convenient to search one section of the 
dictionary per run. 
From the data cards, the routine reads a reference parameter 
called KNTSCT, which indicates the highest consecutive section 
number that has been searched. The control variable IDRP has 
value 0 at input; the routine changes it to 1 if any words were 
removed from the section currently being searched, otherwise it 
remains 0 at output. The variable KNTRPT shows the number of the 
section currently being searched. The parameter INDFIL is set to 
0 every time a new section is searched the first time. This 
tells the routine to bring in the section indicated by KNTSCT. 
If its value is 1, the section to be read is indicated by KNTRPT. 
The reduced sections are stored in File 9 consecutively. If 
KNTRPT is less than KNTSCT, the sections following the one 
currently searched are stored on a temporary file because the 
length of the one being searched may decrease. Not until the 
search has ended and the current section has been stored back at 
its proper place are the following sections transferred back to 
File 9. For example, if KNTRPT = 1 and KNTSCT = 5, then sections 
2, 3, 4, and 5 are stored away. 
In the very first run for a given N value, i.e. if KNTSCT 
equals 1, the routine creates an empty list for the so-called 
Removal List. In the subsequent runs the routine reads in the 
Removal List from the file. 
The routine examines the definition lengths of the entries in 
the current section, item by item. The entries the definition 
length of which is greater than the set N value are put on the 
Removal List and deleted from the dictionary. The value of IDRP 
is set to 1 if such entries are found. The removed words are 
printed out for reference. 
Then the dictionary is searched and all definitions are 
checked against the items on the Removal List. If a definition 
containing a removed word is found, the respective entry itself 
is added to the Removal List and subsequently deleted from the 
dictionary. If a search results in any new additions to the 
Removal List, the search is repeated. This is continued until no 
new deletions occur. 
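Ignoring the file and section bookkeeping, the fixpoint deletion that DICRED performs might be sketched as follows. This is an illustrative reconstruction, with invented names; the dictionary is modeled as a single in-memory mapping from each word to its definition (a list of words).

```python
def reduce_dictionary(dictionary, max_len):
    """Delete entries whose definition length exceeds max_len, then
    entries whose definitions mention any removed word, and repeat
    until no further deletions occur (a sketch of the DICRED fixpoint,
    not the original FORTRAN).  Returns the Removal List as a set."""
    # first pass: remove over-long definitions
    removal = {w for w, d in dictionary.items() if len(d) > max_len}
    for w in removal:
        del dictionary[w]
    changed = bool(removal)
    # repeated passes: remove entries defined in terms of removed words
    while changed:
        hit = {w for w, d in dictionary.items()
               if any(t in removal for t in d)}
        changed = bool(hit)
        removal |= hit
        for w in hit:
            del dictionary[w]
    return removal
```

The outer loop terminates because each pass either deletes at least one entry from a finite dictionary or makes no change.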
After the n-th section has been processed the first time and 
if deletions have occurred, KNTRPT is set to 1, 2, ..., n, 
respectively, in n succeeding runs. If any one of these produces 
deletions (IDRP set to 1), the sequence is repeated. This is 
continued until IDRP remains 0 in all n runs. 
At the end of every run, after the temporarily saved 
dictionary sections have been restored, the Removal List is 
stored as the last in File 9. Then the values of the key 
variables are printed out. The data cards are changed 
accordingly for the next run. After the sequence of runs with 
KNTSCT = 5 has been completed, the operation is finished. 
The reduction was carried out with values of N equal to 16, 
8, and 4. The value 10 was tried after 16, but the resulting 
reduction was too slight, so that series was discarded and the 
value 8 was used instead. At N = 4 the size ratio was already 
so close to unity that a further reduction to 2 would no longer 
have been very informative. 
All sections of all the successively reduced dictionaries 
have been preserved on File 9. Presently File 9 has 15 lists, 
each ending with an EOF. The 16th contains the Covered List, 
the Covering List, and the Waiting List from the last run. These 
three are not separated by EOFs, as there was no necessity for 
separating them. This list collection has no particular 
importance. 
The remaining subroutines in the program are short auxiliary 
routines for aiding the principal routines where needed. The 
function INPUTL reads in a list structure from the card images on 
file, without printing out the list as does the original SLIP 
routine. It constructs erasable local sublists. It is virtually 
the same routine as READLS used by ANA-LEX. 
RESTOR is equivalent to the SLIP subroutine of the same name 
except that it does not leave a SLIP cell with a list name as 
datum floating in the available space. (The latter tends to 
cause program termination with an error message to the effect 
that a list was required but not found.) 
The subroutine SKIP is needed for convenient accessing of the 
various lists in File 9. Finally, the function DLTLST is the 
most effective means so far tried for deleting list structures 
built by the SLIP routine BUINPL. (It does not completely 
destroy them, however, and if BUINPL is used repeatedly, the 
store is still gradually filled with residues that make available 
space unavailable.) 
APPENDIX II 
Some Ideas for the Program to Investigate the Relationship 
Covering Set Size versus Maximum Definition Length 
The second proposed problem, viz. finding vR as a function 
of N for fixed values of vS, is discussed now. This will be a 
task of proportions no less than the present, except for 
construction of the data base. The following procedure, 
represented by a simplified flow chart in Figure 4, is suggested 
for carrying out this task. 
INSERT FIGURE 4 ABOUT HERE 
The program starts with known values of N and vS (in this 
case 22 and 1,464, respectively). It first replaces words in R 
having a definition length of 1 (except, of course, those defined 
by themselves) by their definition in all definitions. Then the 
program looks for words of short definition length in R (x = 
2, 3, 4, etc.). It substitutes their definition for them in all 
definitions and counts them out from vR. Simultaneously, it 
keeps track of possible increase in N due to this process and 
records the value. The process is repeated with the reduced 
dictionaries, which have different vS values. 

Fig. 4.--Flow diagram for establishing the N-vR relations 
(flow-chart boxes omitted here; see Figure 4) 

As pointed out earlier, it is not suggested that definitions 
so created are usable or acceptable to the speaker of a natural 
language. The procedure, however, will produce the numerical 
relationships desired. 
The existing data base, together with its reduced versions, 
has been stored on magnetic tape and is ready to be used as input 
into the proposed procedure. 
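One pass of the proposed substitution step might be sketched as follows. This is an illustrative reconstruction of the procedure described above, not an implementation from the flow chart; all names are invented, and the bookkeeping that counts substituted words out of vR is reduced to the returned count.

```python
def substitute_short_definitions(dictionary, x):
    """One pass of the proposed procedure: every word defined by exactly
    x words (and not by itself) has its definition substituted for it
    in all other definitions.  Returns the number of words substituted
    away (to be counted out of vR) and the new maximum definition
    length N.  Illustrative sketch only."""
    # words of definition length x, excluding the self-defined ones
    short = {w: d for w, d in dictionary.items()
             if len(d) == x and d != [w]}
    for w, d in dictionary.items():
        if w in short:
            continue
        new = []
        for t in d:
            new.extend(short.get(t, [t]))   # splice in the definition
        dictionary[w] = new                 # definitions may grow: N can rise
    n = max(len(d) for d in dictionary.values())
    return len(short), n
```

Repeating this for x = 1, 2, 3, ... on each reduced dictionary, and recording n after every pass, yields the desired vR-versus-N relationship.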

References
Chandor, A. (1970), A Dictionary of Computers, Penguin Books, 
Harmondsworth, England. 

Findler, N.V. (1970), Some conjectures in computational 
linguistics, Linguistics, 10, 64, 5. 

Findler, N.V., J.L. Pfaltz and H.J. Bernstein (1972), Four High 
Level Extensions of FORTRAN IV: SLIP, AMPPL-II, TREETRAN and 
SYMBOLANG, Spartan Books, New York. 

Guiraud, P. (1959), Problèmes et méthodes de la statistique 
linguistique, Reidel, Dordrecht. 

Longyear, Linguistically Determined Categories of Meanings, 
Janua Linguarum, Series Practica, 92, Mouton, The 
Hague, Holland. 

Lyons, J. (1969), Introduction to Theoretical Linguistics, 
Cambridge University Press, Cambridge, England. 

Maas, H.D. (1972), Über den Zusammenhang zwischen 
Wortschatzumfang und Länge eines Textes, Zeitschrift für 
Literaturwissenschaft und Linguistik, 2, No. 8. 

Mackey, W.F. (1965), Language Teaching Analysis, Indiana 
University Press, Bloomington. 

Muller, C. (1964), Essai de statistique lexicale, Librairie 
Klincksieck, Paris. 

Ogden, C.K. (1933), Basic English: An Introduction with Rules 
and Grammar, 4th ed., Kegan Paul, Trench, Trubner & Co., London, 
England. 

Osgood, C.E., G.J. Suci and P.H. Tannenbaum (1957), The 
Measurement of Meaning, University of Illinois Press, Urbana, 
Illinois. 

Russell, B. (1967), An Inquiry into Meaning and Truth, Penguin 
Books, Baltimore, Maryland. 

Savard, J.G. (1970), La valence lexicale, Didier, Paris. 

Viil, H. (1974), Some Lexicometric Properties of a Dictionary, 
unpublished M.S. project, State University of New York at 
Buffalo. 

Weinreich, U. (1966), Explorations in semantic theory, in 
Current Trends in Linguistics, Vol. III: Theoretical 
Foundations (T.A. Sebeok, Ed.), pp. 395-477, Mouton, The 
Hague, Holland. 