<?xml version="1.0" standalone="yes"?>
<Paper uid="J79-1004">
  <Title>American Journal of Computational Linguistics</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
SOME PROBLEMS OF LEXICAL RELATEDNESS
1. Polysemy and Homonymy
</SectionTitle>
    <Paragraph position="0"> While the problem of meaning is complex in itself, the difficulty increases by another order of magnitude if one has to deal with words of many n~eanings or different words with different meanings thak have identical spellings or pronoupciations. And the decision as to whether a given case represents one polysemous word or two (or more) homonyms is far from being well defined.</Paragraph>
    <Paragraph position="1"> The separation can be based on morphological criteria. First of all, two graphematically identical word forms with different meanings are regarded as homqraphs and separated if they display a phonematic difference or if they belong to different word classes. They are also homographs even if they belong to the same word class but possess different inflection systems.</Paragraph>
    <Paragraph position="2"> otherwise, they represent the same word. More than one meaning of one word constitutes a case of polysemy. In contrast with such diversified meanings of one word, we talk about hm, in which case two words have by chance acquired the same external appearance. A distinction between the two can only be made, if at all, on the basis of the historical origin of the words invo lved. Direct, transferred and specialized senses of a word can be listed along ope dimension of meaning, dominant and basic senses represent certain measures along another dimension.</Paragraph>
    <Paragraph position="3"> Another concept is semantic depletion, in which case the word occurs in scores of expressions. Mere, the verbal or situational context - adds substantially to the meaning of the word in question. With polysemy, however, the context eliminates those senses of the word that do not apply and thereby disambiguates the polysemous word. It is, therefore, important from the lexicographical point of view to distinguish between the degrees of interaction between the context and the meaning of individual  (a) in case of weak inPS luence, we talk about autosemantic or semantically autonomous words ; (b) a strong influence performs a disambiguation of polysemous or homonymous words; (c) the context defines the 'meaning of synsemantic or  semantically depleted words.</Paragraph>
    <Paragraph position="4"> Needless to say that the above, as innumerable other, decisions must often be based on subjective criteria. Finally, it could be noted that, in exceptional cases, even the inmediate context cannot resolve the ambiguity4 and two or more interpretations are acceptable. This phenpenon is the It is clear even to the casual observer that total interchangeability in all contexts, and identity in both cognitive and emotive senses, of two lexical units (words, in the simplest case] are not possible in general. The semantic relationshir; between synonymy is based on and measured by a level of similarity.</Paragraph>
    <Paragraph position="5"> Rather than distinguishing between the &amp;quot;meaning&amp;quot; and the &amp;quot;usage&amp;quot; of a word, one should assume the view that the former is the sum total of the possibilit!ies of the latter. This is basically what justifies the existence of any monolingual (and, possibly, bilingual) dictionary.</Paragraph>
    <Paragraph position="6"> The entries in the dictionaries we are concerned with are both words (the interpretation and definition of which units are less than clear-cut) and multi-word lexical units. The two are of the same standing and function, and they will be treated identically.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3. Definitions
</SectionTitle>
    <Paragraph position="0"> Definition is the most fuhdamental concept associated with dictionaries. We shall be concerned with both classical Aristotelian definitions, based on &amp;quot;class&amp;quot; and &amp;quot;characteristics&amp;quot;, and operational definitions which use sententialw generative terms. In fact, it is often difficult or impossible to separate equivalence or paraphrase ciefinitions , on one hand, and those that are process-oriented reproductions, on the other, In general., *he lexical meaning can be rendered by four basic instruments and their various combinations : (a) the lexicpgraphic definition enumerates the most important features of the lexical unit being defined, in the simplest possible terms;  (b) qualified synonyms provide a system of semantically most related words; (c) exemplification puts the defined unit in functional combination with other units; (d) a gloss is an explanator or descriptive comnent related to the dictionary entry; it may also skate similarities to and dif firences from other entries.</Paragraph>
    <Paragraph position="1"> - 15 -AsPECrS. OF THE SCIENCE OF DICPIONARY 1 . General, Concepts  Uaough definitions abound, a reasonable distinction seems to be to say that the semantic description of individual terms, the inventory of words is the customary province of Lexicoqraphy whereas le&amp;coloqy refers to the study of the lexical material, of the recurrent patterns of semantic relationships, and of any formal devices, such as phonological and granmatical $ystems, that generate the latter.</Paragraph>
    <Paragraph position="2"> To construct a dictionary of a given size,. one could choose the entries on the basis of their frequency of occurrence or in relying on some measure of *utility that is vaguely tied to the semantic generality of the candidates. No solution is perfect or even uniformly useful over the whole dictionary. Even the arrangement of meanings of a given entry is moot. we talk about logical, historical and empirical orders. (The latter starts with the comon and current usage followed by obsolete, colloquial, provincial, slang and technical meanings. ) We can dif ferenkiate between engyclopedi c and linguistic dicstionaries.</Paragraph>
    <Paragraph position="3"> he latter are primarily concerned with the lexical units of the language and all their linguistic properties . The former, on the other hand, give information about sane samponent of the extralihguis tic world. Our work derives its data base from an encyclopedic dictionary. It ehould be noted that the highly polysemous nature of the entries in a linguistic dictionary would have constituted an addi t iona 1 complication in this pilot project, which has now been avoided without affecting the general validity of the resu Its. We propose to introduce the tern lexicometry to designate the, discipline which investigates and analyzes the quantitative aspects of dictionaries, the vocabulary of a language and various subsets of the lattet. Lexicometry would count, weigh and . .</Paragraph>
    <Paragraph position="4"> measure, and express the results in statistical and mathmatical terms. Many such studies are widely known. Such is the one reported by GU~ raud (1 959 : The most frequent words are:  (a) the shortest, b the oldest, (c) the morphologically simplest, (d) the semanti-caf ly most extended, i .e.</Paragraph>
    <Paragraph position="5"> greatest number of meanings.</Paragraph>
    <Paragraph position="6">  possessing the As to the measure of frequency, n the first 100 words cover 608 of an averagen text,</Paragraph>
    <Paragraph position="8"> Thus the remaining X (?) thousand words cover only 2.5% of the text. HoweVer, from an information theoretic point of view, the first 100 words comprise 30% of the information,  Consequently, rare words konvey a great deal of information. We could say that a frequent word is most useful in the aggregate, and a rare word in a particular case.</Paragraph>
    <Paragraph position="9"> Other studies in glottochronology mhcern thanselves with the rate of change in Language and in basic vocabulary. Further, distribution of the frequencies of occurrence with or without reference to any particular vocabulary has also been studied. Finding relations of the above kind is not just an academic exercise to satisfy the curiosity of a few linguists, but these relationships may have various practical applications. For example, Maas (1972) asserts that the knowledge of a functional relation between the length of a text and the size of the vocabulary used in it would be desirable in order to estimate the effort needed for extension of a machine dictionary or in comparison of vocabulary contents of texts of dif ferent lengths. In the latter case, one can standardize or normalize the texts under investigation by reducing them to a common minimal length through computational methods and then compare the resulting vocabulary volumes.</Paragraph>
    <Paragraph position="10"> Let V be the number of elements (words) in a text and N the</Paragraph>
    <Paragraph position="12"> length of the text. Then we surmise, says Maas, a functional relatiurnship to exist between N - and V:  Since the vocabulary of a language, however, is supposed to be restricted, so argues Maas, the existence of a limiting value is to be postulated: V,= lim f (N) N+m As the derivative of f at a given value of N represents the</Paragraph>
    <Paragraph position="14"> The derivative of a f at the point 1 is assumed to be 1 because a text of length 1 has a vocabulary consisting of one word, hence Therefore - f' is a function that decreases mnotonically from 1 to As a consequence of the above speculations, in the expression</Paragraph>
    <Paragraph position="16"> statistical investigations of the dramas by Corneille have resulted in the relationship</Paragraph>
    <Paragraph position="18"> ~hus, if N I is given, k IC can be determined, and V - can be calculated from Another noteworthy concept is that of repetition factor : which shows how of ten word has occurred in a text on the averaqe.</Paragraph>
    <Paragraph position="19"> The following relationship has been determined: log R = (0.179 log N + 0.026)~, which displays a very good agreement with reality. NO single empirical law sews to exist between N and V for</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2. The Problem of Coverage
</SectionTitle>
    <Paragraph position="0"> We are now coming close to the core subject matter of this paper. Mackey (1965) states that he coverage or covering capacity of an item is the number of things one can say with it. It can be measured by the number of other items which it can displace. )I According to him, words can displace other words by Eour means: (1 ) inclusion, (2) extension, (3) combination, and (4! definition, 1, A word that already includes the meaning of other words   can be used instead of these (e.g., seat includes chair L- -P bench, stool, and place) , - lLlCI 2. Words the meanings of which are easily extended me'kaphorically can be used to eliminate others (e.g., tributary of a river can be covered by branch or arm). - 3 . Certain simple words can displace others by combining either together or with simple word endings (em g. , news +</Paragraph>
    <Paragraph position="2"> (erg., breakfast can be defined as morning meal; pony as small horse).</Paragraph>
    <Paragraph position="3"> As an example of the application of the above principle, in the derivation of Basic English (by definition), the language was first reduced to 7500 words, and, by redefinition, cut down to 1500. These were further reduced to the eventual 850 by a technique of &amp;quot;panoptic&amp;quot; definition (eliminate each word on the grounds that it is some sort of modification of other words, e. g. a modification in time, numbe-r, or size) .</Paragraph>
    <Paragraph position="4"> Basic English, which was founded essentially on the principle of cove rage, was a conscious reaction against the over-application of the principle of frequency in selection. For Ogden (1 933) , it was not the frequency of a word which makes it useful, it was its usefulness which makes it frequent.</Paragraph>
    <Paragraph position="5"> In the following part of this section, we attempt to present some of the salierit points of Savard (1 970).</Paragraph>
    <Paragraph position="6"> The vocabulary indices most widely known today are those of frequency, of distribution, and of availability. But these are not sufficient to select words for a restricted vocabulary for the purpose of teaching a foreign language, such as Wench, to beginners.</Paragraph>
    <Paragraph position="7"> An objective criterion is lexical valence. It would allow  1 . to obtain a novel principle of vocabulary selection, 2 . to assist the investigators in setting up a base vocabulary for French, 3. to provide a usable definition, combination, inclusion, and extension vocabulary, 4 a to correct all the already existing scales of French vocabulary, 5. to provide a valid working tool for the analysis of teaching material.</Paragraph>
    <Paragraph position="8">  The valence problem is a problem of verbal economy. \that he calls valence is the fundamental capability of a word to be substituted for another word. It is Mackey's coveraqe that he renders as valence, Like Mackey (1965), he maintains that the substitution of one word for another can he made by virtue of four criteria: (1) definition, (2) inclusion, (3) combinatiori, (4) extension. ~efinition has already been discussed previously.</Paragraph>
    <Paragraph position="9"> Linguists do not talk specifically about inclusion; rather, they deal with synonymy or lexical parallelism. Synonyms are. words that have nearly the same meaning, e.g. - lieu and endroit. For Savard, the basic criterion that permits to establish a series of the possibility of substituting one term for another.</Paragraph>
    <Paragraph position="10"> One of the simplest amng all the procedures of vocabulary enrichment consists of joining two words order to make compound words. The principle 0-f combination appears as another phenomenon common t~ all langrlages.</Paragraph>
    <Paragraph position="11"> It is not necessary that the number of simple words be unbounded because almost all verbs have a potential of undetermined sense, and so do the adjectives. A word is said to have more or less extension according to wheaer it can &amp;quot;cover&amp;quot; a more or less great number of fully or p~rtially different notions.</Paragraph>
    <Paragraph position="12"> Polysemy is the exact opposite of synonymy.</Paragraph>
    <Paragraph position="13"> Polysemy becomes complicated dw to the phenomenon of homonymy. Polysemy and homonymy constitute two very rich sources of lexical economy. Togethel: they form Savard' s last criterion of lexical valence--the semantic extension, Although the valence itself has rfever been mathematically measured and although there exis- no scientific means of showing its existence, it has neverthe less, been proven that four formal proceaures of lexical economy permt to replace certain words by other words, and that is what Sav~d calls lexical valence. The postulated existence hypothesis of lexical valence leads to the calculation of a global index of valence for ekdry word. To evaluate the power of of a word, one inspects, in the dictionary, each element of the general 139t and counts how many times a word enters into the definition of another. To measure the power of combination of a lexical unit, one inspects in the dictionary all the compound words joined by a hyphen, all the Gallicisms (in English, these would be Anglicisms) and, in general, all the word groups.</Paragraph>
    <Paragraph position="14"> With a view of appraising the power of inclusion, one inspects me units of the general list in two synonym dictionaries and takes the higher number. The numbei of synonyms that possess a word constitutes a measure of the nunber of words for which can substituted.</Paragraph>
    <Paragraph position="15"> To measure the power of semantid extension, one inspects each of the elements of the general list in the dictionary and counts the number of meanings given by the author to such a word in the list. The number of meanings of a word is considered as a masure of its power of semantic extension.</Paragraph>
    <Paragraph position="16"> The global index of lexical valence is the sum of the four normalized counts. The two critedia having the highest correlation are definition and combination.</Paragraph>
    <Paragraph position="17"> In the beginning of the study, it was assumed that thk four variables were entirely independent of each other. The results of a facto~analysis indicate that they are not completely so. A factor rotation shows, however , that the variables are sufficiently independent to make it necessary to retain the four criteria of lexical valence.</Paragraph>
    <Paragraph position="18"> A comparison of the rank of the first 40 content words on the valence scale with the same words on the frequency list allows to frame a hypothesis that the correlation between valence and frequency would be rather weak. A more complete study would show without doubt that: we have there two very different selection prhciples.</Paragraph>
    <Paragraph position="19"> In conclusion, i eSan be stated with confidence that the measure of valence is no less valid than that of frequency, distribution . and availability. These concepts will eventually lead to more efficient dictionaries with respect to precision, compactness and lexical economy.</Paragraph>
  </Section>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
ON LEXICOMETRIC RELATIONSHIPS AMONG THE SIZE OF DEFINING SET,
THE SIZE OF DEFINED SET AND THE MAXIMUM LENGTH OF DEFINITIONS
1. Some Measures of Coverage
</SectionTitle>
    <Paragraph position="0"> A dictionary may be considered efficient and economical if it uses a reasonably small set of words to define 9 relatively large set of entries. We have, however, a very vague idea about what size vocabulary is needed to cover a given number of dictionary entries. (The related problem of ci~cular definitions seems to have to wait for a camputer soluti~n.) It is known, for example, that Basic English, Ogden (1 933) , involves a list of 850 English words and 50 international words, which were eventually used to define the 20,000 English words of Basic Ynglish Dictionary. This gives a ratio of the number of covering. words to that of defined words of 0.045.</Paragraph>
    <Paragraph position="1"> West studied the problem of what constitutes a simple definition and established a minimum defining vocabulary of 1 ,4 90 words. The meaning of sane 18,000 words and 6,OQ'u idioms, i.e. about 2 4,000 expressions, was explained exclusively by these 1,490 words, which were not defined themselves. The results were published in 1961 as The 14ew Method English Dictionary bf Hopm West and J. G. Endicott. The corresponding size ratio here is 0.062, The above roughly indicates that a set of about 1.0 00 words can define d set of about 20 times mat size, but in general the behavior of these variables has not been investigated and is not known in any detail.</Paragraph>
    <Paragraph position="2"> One of us, in Findler (1370), has formulated the problem in de.Ein ; ie terms, Three variables were considered : (1) the covered set S of size %, (2) the Coverinq set R of size %, and  An exception to this rule would occur in a dictionary system, which does not treat liomon~ms as individual entries, A every time a new word with many homonyms is introduced into the Covered Set.</Paragraph>
    <Paragraph position="3"> It was further asSumed that for B=l , the coverincr set and the covered set are of the. same size, i .o. both the increment ratio and the size ratio equal one. We 'must now correct this statement becallse not every word is defined by itself only. If a new word is Lntroduced that already has a synonym in the covering set, it will be defined by that synonym. Then the inc'rement ratio is 0 and the size ratio become less than 1.</Paragraph>
    <Paragraph position="4"> For tke second case, (b) , it is nostulated that * v monotonically decreases as I) N increases, * for any fixed v value, v asymptotically approaches a  storage requirements,, and - N should he small to mlnlmize processing time and output volume. A . compromise on these mnf licting requirements is needed. The ultimate question is :given null &amp;quot;What are the optimum yl and - 11 values for a v for certain  computer applications on a machine with a given cost structure?&amp;quot; Xt is reasooable to assume that the behavior of the three variales and therefore the answer to the last question will larqely denend on the semaptic index of the elements of the covered set: and on the lexical valence of the elements of the coverins set, The latter implies that, for An efficient and economical di,ktionary, the emmknts of the covering set must be chosen fro* the available vocabulary on the hasis of a careful analysis. As research aimed at these goals is pratically nonexistent, it is safe to assume that most of the existing dictionaries are suboptimal. Work in this area will be useful, challenging, and rewarding, but the investigators must be prepared to spend a considerable amount of time and effort on it. So much the more as the entire problem complex outlined in tFle preceding parts will directly or indirectly enter into such investigations.</Paragraph>
    <Paragraph position="5"> ThR project described here is only a small beginning. It vas originally intended to complete the investigation of both cases, (a) and (b) , defined above. In view of the effort needed, in terms of human and machine time, only the first part i-s accomplished at the time of writing this report. Appendix. I1 contains the design of the program for case (b) .</Paragraph>
    <Paragraph position="6"> 2. Construction of the Data Base The data base was not derived from a text but was based on an existing dictionary of computer terminology, Chandor ( 1970) . A derivation from a text, if used, should be automatic and woulcl constitute a large-scale programming project in its own rigslt: In creating the data base, it was attempted to keep its structure simple and uniform without sacrificing its general validity. It was tried to avoid problems that would introduce distracting complications, from both theoretical and practical noint of view, into the subsgquent operations. All this led to the selection - 30 and construction principles outlined below.</Paragraph>
    <Paragraph position="7"> Terms with excessively long definitions were avoided, i.e. definitions were held reasonably short. It was found that lexical units limiting bhe maximum definition length to 22,,did not undulv d restxict the selection. In some cases too long definitions were shortened by leaving out redundant words, glolses, r nxplanatorv notes.</Paragraph>
    <Paragraph position="8"> Every element of the covered set was considered a lexical item, regardless of whether the oriqinal dictionary entry consisted o fa one, two, or more vrords. For programing convenience every word was coded as a string of no more than 10 symbols. Thus accumulator was represented as ACUMULATOR, absolute address appeared as ABSADDMSS, and absolute value computer as ABSIrALCOnIP.</Paragraph>
    <Paragraph position="9"> Polvsemous terms were avoided. 1f such a term was used, onlv its dominant meaning was recorded. In the data-base dictionary, then, each entry (element of thr covered set) has only one meaning and one definition.</Paragraph>
    <Paragraph position="10"> Tefms used in the definitions (elements of theacopering set) were also considered t% be lexical items, i.e. oriqi nal multiword terms appear as a single element, and every element is represented as a string of no more than 1q symbols.</Paragraph>
    <Paragraph position="11"> All terms occurring in the definitions are themselves defined, i.e.</Paragraph>
    <Paragraph position="12"> each element of th,e covering set appears also in the covered set. This principle implies t\at there is a Set of words each element of which is defined by itself. Such a sat may be called the basic vocabulary, consisting of vrords the meanings of which the user of the dictionary is suprmsed to know in order to use the dicti~na~zy. As in this particular case, the dictionary is one of computer terms and the hasic voca5nlary contains the nontechnical words used in the definitions of the technical terms.</Paragraph>
    <Paragraph position="13"> In the definit'ions, a definite distinction was made Sctween content words and function words, also called operators. The latter were not included in the covering set nor were they counted in determining the definition length. Hence, thcs covering set ~onsists only of content words.</Paragraph>
    <Paragraph position="14"> The set of function words is defined rather broadly. I't contains a wide variety of expressions that do not directly contribute anything to the content of the definition hut only  indicate grammatical and logical rel.ationships between the worgs that form the content. It includ'es: 1) prepositions, e.g. of, in to; - f .2) conjunctions, e.g. - and, or - if; 3) the relative pronoun which;  4) combinations of preposition and relative pronoun, e. go in which, to which, by which; 5) present participles equivalent to a preposition, e .g. ULI, containing, representins; combinations participle and preposition, consisting of, oppo  -sed to, applied to; 7) corhbinations of adjective and prewsition, e.g. capable of, exclusive of, equal' to; LII 8) comhinaeions of noun and preposition, e. g. part of, set  of, null  number of : 9) combinations of preposition, poun, and preposition, e.g. in terms of, by means of, in the form of; - - null prepositional phrases associated with- fol lowing infinitive, e,g. used to, necessary to, in order -to:</Paragraph>
    <Paragraph position="16"> other frequenfly used purely functional expressions, e. g.</Paragraph>
    <Paragraph position="17"> for example, namely, kno,wn as.</Paragraph>
    <Paragraph position="18"> Actually, the function words were repl-aced by code numbers in the dictionary. The code numbers were assigned consecutively a!; the function words were neeeed during the construction of the data base so that the order is purely random. A complete list of the 121 function words used, together with their code numbers, is given in Table I.</Paragraph>
    <Paragraph position="19">  The original definitions were somewhat simplified and standardized. In this process, articles were omitted (many languages do very well without them).. On the other hand, implicit relationships were made explicit. A few examples shall serve as illustratfions, with the function words (in parentheses) inserted explicitly instead of their code numbers.</Paragraph>
    <Paragraph position="20"> Original dictionary entry: aberration A defect in the electronic lens svstem of a cathode ray tube.</Paragraph>
    <Paragraph position="21"> Definition in the data base: DEFECT (in) SYSTEM (of) ELECTRO?JIC LENS (of) CATHPAYTUB Note that electronic lens systemn (should he : electronic-lens system) means &amp;quot;system of electronic le-n?ns&amp;quot; (as opposed to &amp;quot;electronic system of lens&amp;quot;), and this relationship is made explicit. Yote also that &amp;quot;cathode ray tuben is a single lexical item, Nouns are represented in singular, thus avoiding anothcr dictianary entry for plural or, what would he worse, proqraminq a &amp;quot;grammar.&amp;quot; Likewise, finite verb forms are represented in third person plural present indicative active, Avoiding the third person singular ~Mminates another dictionary entrv, and avoidinq the passive voice eliminates a great *any participles, which otherwise would have had to he entered. Of course, present and past participles (the former identical to gerund in form) could not always be avoided and had to be entered in the dictionary where needed. Auxiliary verbs were automatically eliminated by avoiding compound tenses and the passive voice. Finally, &amp;quot;to don associated 7~3th negation was simply omitted.</Paragraph>
    <Paragraph position="22"> Original : absolute coding Program instructions which have been written in absolute code, and do not require further processing before being intelligible to the computer.</Paragraph>
    <Paragraph position="23"> Data-base entry: ABSOCODING Definition: PROGRAM INSTRUCT10 (which) ONE WRITE (in) ABSOLUCODE (and whic5 not) REQUIRE FURTHER PROCESSING (before) INTELIGIBL (to) COMPUTER Note that the first predicate in the relative clause, third person plural perfect indicative passive, is represented by the singnlar indefinite pronoun &amp;quot;one&amp;quot; as subject, followed by the  standard plural active verb. The auxiliary &amp;quot;do&amp;quot; has been omitted and the negation is represented by a function word. The virtually redundant &amp;quot;being&amp;quot; has also been left out. In general, the copula is omitted (some languages do very well without it).</Paragraph>
    <Paragraph position="24"> Original : analytical function generator A function generator in which the Function is a physical law. Also lcnown as naturAl laiu function</Paragraph>
    <Paragraph position="26"> IJote also the omission of the gloss &amp;quot;Also known as The stylized definitions are easily understandable even to human readers as the printout of the dictionary demonst~ates.</Paragraph>
    <Paragraph position="27"> The data base was constructed by selecting the first entry, then entering all the lexical items in its definition, subsequentlv A entering all the lexical items in the definitions of these etc. Words that were not defined in the original dictionary were entered and defined by themselves; they constitute the basic vocabulary. - This procedure was continued until everything was defined, i.e, until all the terms in the covering set were also in the covered set. Then the next entry was selected from the dictionary, and the above process was repeated.</Paragraph>
    <Paragraph position="28"> It had been tentatively intended to compile a covered set of about 1 ,000 lexical items. When this number was reached, a rough pencil-and-paper check indicated that the size ratio was about 0.91 at that point. It was then decided that the data base sould he somewhat larger to show the relationships under investigation more perceptibly, and more words were added.</Paragraph>
    <Paragraph position="29"> \fien the size ratio had decreased to about 3.79, the construction of. the data base was concluded as proc&amp;ssing difficulties were anticipated with too large a data volume. At that point the data-base dictionary had precisely 1,856 entries (as was later verified by the program). This was considered to be a satisfactory compromise.</Paragraph>
    <Paragraph position="30"> The dictionqry was arranged in the form of a SLIP list, Pindler et al. (1371). Everv - entry (element of the covered set) occupies four cells in this list: (1) entry word (in A10 format), (2) definition length (an integer) , (3) type of entry (an integer) , (4) suhlist name.</Paragraph>
    <Paragraph position="31"> Three types of entries t were distinguished for programming convenience : 1) code 0 indicates that the entry itself is not used in.</Paragraph>
    <Paragraph position="32"> any definition, i.e. it occurs only in the covered set and not in the covering set; 2 code 1 indicates that the entry occurs in both sets and is not an element of the hasic vocahulary; 3 code 2 indicates that the entrs is defined by itself, i.e. it belongs to the basic vocabularv. The sublist, the name of which is in the fourth cell for every entry in the main list, contains the definition. This arrangement conveqiently separates the entry words from those In the definitions.</Paragraph>
    <Paragraph position="33"> A cell in this second level contains either a word (in Alr) format), i.e. an element of the covering set, or a sublist name. The codes for function words (integers) are contained in the cells in the third level. This arrangement is convenient for bypassing the function words in processing when they are not needed. A typical dictionarv - entry is illustrated in Figure 1.  I--~~--,-.-,~~-~o-.II-L-~-------.I)-----.---.-----~----------.--~--.-INSERT FIGURE 1 ABOUT E1CW The fact that every dictionary entry owns a sublist is practical in another respect: useful information about the entry can be collected and deposited in a description list associated with the sublist. For example, if it were desired to evaluate the definition component of the lexica&amp; valence of each lexical item, a prbgram could e developed that counts how many times a particular item occurs in the definition of other items and stores this information in the description list created for that item. Investigations of this nature will he done at a- future date.</Paragraph>
    <Paragraph position="34"> The program developed for processing all the necessary information is rather complex. Since many of iks organizational characteristics may he of fairly general interest to those who wish to engage in lexicometric studies, a brief description is</Paragraph>
  </Section>
  <Section position="8" start_page="1" end_page="4" type="metho">
    <SectionTitle>
3. The Results of the Computations.
</SectionTitle>
    <Paragraph position="0"> The relationships between the size of the covering set vR and  that of the covered set vS are summarized in Table 11. The table II lists the size of both sets, the size ratio, the increment of either set, and the increment ratio for Wur values of N. Figure  INSERT TABLE 11 PJD FIGURE 2 ABOUT HERE The table shows that, iq general, the increment ratio is less than T, except for one case, to which we shall return helow. In the meantime note that, for full, dictionary, the table definitely verifies the assumption that the increment ratio decreases with increasing vS. This, however, does not seem to he  true for the reduced dictionary. In fact, for all three cases of  the latter, the ratio tends to increase with increasing vs.  Therefore the single occurrence of the value 1 is plainly a random event as the ratio is very close to 1 at the largest as  value also in the two other cases.. The sequence of value3 is evidently approaching unity.</Paragraph>
    <Paragraph position="1"> This somewhat unexpected, though not particularly surprising,</Paragraph>
  </Section>
  <Section position="9" start_page="4" end_page="4" type="metho">
    <SectionTitle>
TABLE II
Covered-Covering Relationships
</SectionTitle>
    <Paragraph position="0"> phenomenon is due to the combination of a number of circumstances. We are dealing with a specific technical dictionary. Ih such a dictionary, nontechnical, i,e, ordinary-language, words are not defined. However, a sizeable set of nontechnical words is necessary to define the technical terms. All the former, in our case, belong to the set of basic vocabulary and are defined by themselves. The result is an inordinate proportion of the set of basic words even in the full dictionary. A rough pencil check during the construction of the data base shoved that the basic vo~ahulary forms about 0.55 of the entire covered set, We recall that, in anticipation of this kind of difficulty, the function words were eliminated from the covering set, to begin with. If this had not been done, the situation would have been aggravated by an order of magnitude. To eliminate, or at least to alleviate this bias, a 'Eonsiderahly larger data base should be used, which, as explained before, would have heen beyond the scow - of this pilot project.</Paragraph>
    <Paragraph position="1"> Another, and more important, factor that contributes to the problem in question is the fact that our data-hase dictionary was not derived from a text hut constructed from another dictionary. This was done, as described earlier, by selecting entries starting from the beginning of the dictionary and stopping when the data base was of satisfactory size. As a result, while the basic vocabulary may be assumed to be uniformly distributed over the dictionary, the important content words, with lonqer definitions, are not. The selection of entries, in fact, was stopped at the letter H. Words beyond that point are there only because they happened to occur in definitions. Thus, at least the words that occur only in the covered set (and not in the covering set) are crowded toward the beginning of the dictionary. What happened when the dictionary was reduced is now obvious.</Paragraph>
    <Paragraph position="2"> The weighty words with long definitions were eliminated hut the entire basic vocabulary remained. This, of course, is quite appropriate and consistent with our principles. If, for example, the dictianary had been reduced to N = 1, virtually only the basic vocabulary would have been retained, and we should have obtained tho postulated linear one-to-one relationship between vF  and v  Nevertheless, this procedure enhances the proportion of the basic vocabulary, and the bias increases. As the technical words are relatively scarce in the last third of the dictionary to begin with, the situation gets worse, with the reduction, toward the end of the dictionary. This accounts for th5 increasing increment ratio. The last increment with ?J = 16 must have consisted entirely of basic words, therefore the ratio of unity.</Paragraph>
    <Paragraph position="3"> It is suggested that, for further investiqation, a more complicated dictionary-reduction program be. developed, vh-ich would comnare all the basic words with all the remaining definitions and eliminate those that do not occur in any definition. Thus a basic wosd would occur in the dictionary onlv if it is needed.in a definition, which was the case in the unreduced dictionary This :?av .I a more natural proportior. hctveen khe basic words and others would he restored, It is the same set of circumstances that also explains the fact that, in the reduced dictionary, -. the increment ratio almost consistently exceeds the size ratio. This, however, is not the case for the full dictionary, which definitely verifies the respective assumption in Findler (1 97q) .</Paragraph>
    <Paragraph position="4"> To demonstrate that v approaches an upper limit wit11 R increasing vS for large N, - a much larger dictionary vould be  needed, Ilovever, the curve in Figure.~Z for ?1 = 22 unmistaIra52y shows a tendency in this direction.</Paragraph>
    <Paragraph position="5"> There is, of course, another way of varying N: - instead .of reducing it, it could he increased, and certain words in the definitions could be replaced hy their definitions. This would be a complicated procedure and difficult to control. If few such xeplacements are made, v will not change appreciably. If many R  are made, some replacements tend to reintroduce precilsely the words others try to eliminate. In any case, the result would he a set of awkward and unnatural definitions of erratic lengfhs. In order to use such a procedure, an efficient dicFlonarp should first be compiled, with short definitions and well controlled covering set. The concept of lexical valence should he utilized, but this entails more research in this area. It ~vould also gat the researcher involved in the problem discussed in the preceding parts.</Paragraph>
    <Paragraph position="6"> The curves for N = 16, N = 8, and N = 4 in Figure 2 a1.l display the basic-vocabulary bias of the reduced dictionary. The last one very nearly approximates a one-to-one ratio. e must appreciate the fact that the 1,047 entries of the respective reduced dictionary contain about 1,000 5asic words.</Paragraph>
    <Paragraph position="7"> It is also to he noted that the full dictionary, with 1\1 = 22, in the region of v = 600 requires a larger covering set t5an any S  of the reduced versions. This is understandable as we rellize that the routine that computes the data points actually simulates, rather artificially, the construction of a dictionary from a source text. The full dictionarv at that stage is close to encompassing the whole source, where complex technical terms are being defined, whereas the reduced versions, at t same Wlue, are already in the area in which the basic yocabulary  dominates-The project has been informative in another respect, vhich ,is not unimportant: it has given an indication of t'lc effort involved in this type of work. It h~s taken ,3 total of about 711 hours of cofiputor time. T\c dsvelogment of ttl~ dictionary-display program and 013taining the printout 15 i3 a matter of about 7 minutes and is therefore neqlisihle. Of the .I11 hours, ahout 3 were spent on dictionarv reduction (thr~c seri9s of runs) and 11 on the analjrsis. Although some d~buqginq +had to be done, this was generally insignificant as corn7are-l to the total effort, so that nearly all the 14 hours 9a.; hcer, us~f.111 running time.</Paragraph>
    <Paragraph position="8"> It is also interesting that time wems to 11e very dcpenlent on the volume of data Reing +andled. 13f the 11 hours, more thn 9 were spent on running the full dictionary (N = 32) and aborlt 1 hour on the reduced version of .J = 1G. Completinq the running of the last two series (I1 = R ancl :I = 4) tool togetFlrr less t11an an hour of machine time.</Paragraph>
    <Paragraph position="9"> In terms of human effort, the accomplis'ling of th4 qroject required ahout six man-months' .#.or!:.</Paragraph>
    <Paragraph position="10"> Finally, Appendix I1 contains a brief (.lescri?tion of 2 plannned program that v~ould investiqate the relationship +?t\t~mn the size of tk covering set an? the maximum definition lengtll for fixed values of the covered set size.</Paragraph>
    <Paragraph position="11"> rie wish to express our gratitude to the manaqement of Penuuin Rooks Ltd. for the permission to use their puSl icatinn A ~ictionarv of Computers hy A.</Paragraph>
    <Paragraph position="12"> Chandor as source for gexerating the data base of this oroject.</Paragraph>
  </Section>
  <Section position="10" start_page="4" end_page="10" type="metho">
    <SectionTitle>
APPENDIX I
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="10" type="sub_section">
      <SectionTitle>
Proqram Development
</SectionTitle>
      <Paragraph position="0"> The entire data base was first punched on cards to be inputted as a single list structure, with the dictionary entries alphabetically ordered. It was soon established that this arrangement by far exceeded run-time storage limit-ations (using a field length of 100,000, ) . Only ahout one fifth of the material could be accomfnodated at one time without exhausting the available space. Therefore the dictionary was split into five individual List skructures, and the correspofidbng card imaqes were stored on disk as five separate f-iles. These were brought in, one at a time, for processing ss needed. Recause of space limitations, also processed data and intermediate results had to &amp;quot;I be put in external storage during run tine and, of course, between runs, therefore more files had to he created as de.;crihzd later. Thus, a great deal of programming effort went into file manipulation.</Paragraph>
      <Paragraph position="1"> The purpose of the first program, designated AMALEX, was simply to display the dictionary. It Yirst reads the function words from the cards and stores them in the form of a 121x2 array. (The width of the array is 2 because many function words are longer than 10 charaeters.) Using a function READLS, the program reads the dictionary and stores 15: in the form of a list structure as described above. On this occasion, it also measures the space required for the dictionary. It was found that a field length of more than 235,680, locations ~muld he needed to accommodate the entire data base.</Paragraph>
      <Paragraph position="2"> A subroutine called RITELS prints out the dictionary, specifying each entry by the definition in the form of at most  words to the line. The routine also checks the operator code numbers in the third'-level sublists and replaces these in t3e printout by the appropriate function tmrds from the array. The dictionary was printed out in four separate rvns as the dictionary was initially divided into four lists. Since the ANALEX program does no further processing and accumulates no new lists, no storage problems arose. It was not until later that it was established that a division into five parts was necessary to perform subsequent operations in the space - available.</Paragraph>
      <Paragraph position="3"> The first printouts were carefully examined for punching errors and omissions. Detected errors ware corrected and the files were updated accordingly.</Paragraph>
      <Paragraph position="4"> The actuaI working program is named COVSET. If the entire data base were one single list and if time were available indefinitelv, A this pro-gram would do the complete work in a single run. In this case, it would print a fable of corresponding v and % values for a given value of N, would reduce the value of</Paragraph>
      <Paragraph position="6"> N and print out another table, etch , and repeat this for all  desired values of N.</Paragraph>
      <Paragraph position="7">  -This, of course, could not be done because, in the first place, only one of the five parts of the dictionary could he worked on at a time and, in thse second place, the program had to be run in time increments of 600 s or less, which was the set time limit, The principal routine in COVSET is ca-lled COVRYG, which computes the values of v for given values of v</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="10" end_page="10" type="metho">
    <SectionTitle>
2 A*
</SectionTitle>
    <Paragraph position="0"> Its simpliqied flow diagram is given in Frgu~e 3. INS-ERT FIGURl2 3 ABOUT HERE wom~o~~~r~r~o~mwmw~o~~-~~~~)I-om~~~.~~~-~~~~~~~~~~~~~~~~~ As the inherently continuous orogram cannot klc3 run continuuously, a few control variables are needed to provide criteria for interruption and to transfer information from one run to the next. These are read from cards in the beginning of the routine.</Paragraph>
    <Paragraph position="1"> A reference value LSTRCF is used to control t5e spBcing of the recordings of v and because too close spacing would</Paragraph>
    <Paragraph position="3"> introduce random irregularities into the otherwise smoothly cllanging tendency. The reference is automatically updated after very printout of the. 5 and # values. During the analysis of  the full dictionary, the reference was incremehteit hy 200; later,, in the processing of the reduced dictionarv, -. it :~ss incremented hy 100.</Paragraph>
    <Paragraph position="4"> A criterion is needed for interrupting the program hef~re it exceeds the time limit. An estimated increase in vs WC?S  initially used for this purcpose. A value '3AXLCFJ vas innut and 5  compared rith it every time a new word was acldcd to t&amp;quot;l set. \%en the coun-t reached the reference value, the program :Jas discontinued. 0 the average, about 15 words per ptun could hc added to the covered set.</Paragraph>
    <Paragraph position="5"> Later it was found that better control could he exercised Ilv counting the number of times that a new section of the dictionary was brought in for processing. A Value !WXW,P was read Ln and when the above counter, starting from O, reached this va'lue, the run was interrupted.</Paragraph>
    <Paragraph position="6"> The variables KNTCVD and KNTCNG are counters for v and vR , - respectively. Their current values are transferred from one rur. to the other. The value of KNTPRT indicates the sectiofi of the dictionary currently under investigation.</Paragraph>
    <Paragraph position="7"> The variable IrtCONT is set to 0 for t'le very first run far each N value. This tells the routine to set up new lista fox  Covered List, Covering List, and a so-called Waiting List. In all successive runs its value is 1, indicating that the program must bring these lists in from the external file.</Paragraph>
    <Paragraph position="8"> The routine exmines the current. section of the dictionary, entrymby entry. In the first series of runs, it deals with one of the five sections, stored in one of the five files, in the form of the original card images. A sixth file was created for storing all the lists generated by the program. Idhen the dictionary was later reduced (for reduced values of nr , tne C corresponalng sections of the reduced dictionary were also stored in that sixth file.</Paragraph>
    <Paragraph position="9"> If the current entry is an element of the Sas.ic vocabulary (type 2) , the routine bypasses it and takes the next entry. Tflis can be dane in the processing of the full dictionary because all these words occur in the definitions and will certainly he caught later. This is no longer so in processing the reduced dictionary because the words in the definitions of which they occur may have been eliminated. In the latter case, therefore, this tvpe 0.f a word is immediately added to both the Covered List and the Covering List (it always covers itself) .</Paragraph>
    <Paragraph position="10"> If the current entry is a word that does not occur in any definition (type O), it is being encountered the firsttime, and we are sure that it is not alteady on the Covered- List; hence, this question need not be asked.</Paragraph>
    <Paragraph position="11"> Otherwise the routine tests if the word is already on the Covered List, which may well he the case hecause the word may have occurred earlier in the definition of another word. If so, the routine proceeds to the next ward in t'le dictionary. If the word is not found on the Covered List, it is put there, and KNTCVD is incremented. Then all the words in tho definition of the word in question are put on the Waiting List, which is subsequently processedd This is necessary hecause of the adopted principke that all the covering :vords nust t'lemselveq be covered. An entry in the r3 versvR # tahle is meaningful - only if this condition is satisfied.</Paragraph>
    <Paragraph position="12"> The current dictionary entry itself. is recorded as t'le valug of the variable DREF, which passes the information on, from one run to the next, where in the dictionarv the program is currently in action.</Paragraph>
    <Paragraph position="13"> The routine then examines the Waiting List, word by vord. If the current vmrd is already on the Covered List (it mav have Occurred earlier in the dictionarv), the ro~kine c3ecks if it is also on t$e Covering List (it may not he hecause it has not vst occurred in the definition of another swrd) . If not, it is putthere, and KNTCNG is incremented. A11 words on the Waiting List come from definitions and must therefore he added to the Cowring List. After a word has be&amp; processed, it is deleted from the Waiting List.</Paragraph>
    <Paragraph position="14"> ~f the current ptord is not on the Covered List, it must obviously be put sthere.</Paragraph>
    <Paragraph position="15"> First, however, the routine tests if the word occurs in the section of the dictionarv currenflv in store by checking whether its numerical value is between those of the first and the last word of the section. If the word is not there, the routine postpones its processing and takes the next word from the Waiting List because it is more econoznical to process first all the words available in the dictionarv - sedtion present than to read in other sections of the dictionarv as the words dictate it (memory swapping is* cxpensive) .</Paragraph>
    <Paragraph position="16"> Should the word be in that section, the routine adds it to the Covered List, increments KNTCVD, and actually looks for the word in the dictionary. If it does nok find it, it gives an error medage, prints out the questionable word, and terminattbs the run. This way t'he remaining punching errors in the datn hase were detected, and a ferv words were found missing (due to human error quring the construction of the data base when it was forgotten to enter words that acutally occured in definitions) . The files were updated accordingly.</Paragraph>
    <Paragraph position="17"> If the word is found, the routine adds all the words in its definition to the Waiting List, then investigates its presence on the Fovering List, and proceeds as described before. When the bottom of the Waiting ~1st is reached and the list is not empty, the words remaining on it must be in other sections of the dictionary. The section present is then erased and the next section is brought in (if the current one is section 5, section 1 is read in). The processing - of the Waiting List now starts from the beyinning and continues as described above.</Paragraph>
    <Paragraph position="18"> If the Waiting List is finally empty, and K?ITCVD equals or exceeds LSTREF, the routine increments LSTREF by the prescriber? amount, and prints the values of KT!TTCVD and I3VTCTTG. If the couflt is less than the reference value, the routine simnlv proceg'ds. In any case, it tests if the proper section of the dictipnarv happens to he in the store (it knows that hv - the value of KNTPRT) . If it does not, the section present' is erased an3 the right section is read in.</Paragraph>
    <Paragraph position="19"> Next the routine looks for the word 3t v11ic5 it had wviously stopped tracing tha dictionary (it knows that l?y t\e contents of DREF). An error message has been rrovided - for the case in which it does not find the reference for some re~son. FortUnately, the program never made use of this message. After finding the reference, the routine takes the next v~ord from the dictionary and proceeds as alread described.</Paragraph>
    <Paragraph position="20"> When the routine reaches the bottom of the dictionary, it tests if it is the last section. If not, the next section is processed as described. At the end of the last section the routine prints the final values of v and v and vith this the</Paragraph>
    <Paragraph position="22"> prbcessing is finished for a given value of I) N.</Paragraph>
    <Paragraph position="23"> The ahove srriooth description involves countless runs.</Paragraph>
    <Paragraph position="24"> Interruption criteria are tested at appropriate places, and the processing is discontinued accorcl~ngly. Whenever a run is terminated, the three compiled lists are saved ??y storing t'lnn in t!e external file (we shall call it File 9 for t\e sake of convenience) . The control parameters and reference variables are printed out,. The data cards are changed accordinsly, for input to the next run.</Paragraph>
    <Paragraph position="25"> The first series of runs was perZomed with tho ell 11 dictionary, for which the maximum definition length ir is 22. In  dhe following qeries of runs N was gradually decreased.</Paragraph>
    <Paragraph position="26"> I) It was then also necessasv to reduce the dictionarv hv eliminating all &amp; words with definition length greater than the 'curre3t hl, t!~en  eliminating 1 words containing them in their definitions, subsequently eliminating all words the definitions of vthic\ contain the latter, etc.</Paragraph>
    <Paragraph position="27"> The program calls another qajor subroutine, named DICRED, to carry out this operation. The routine is hasicallv simple; what makes it appear complicated is the manipulation of t\e files. It was found to be most convenient to search one section of the dictionary per run.</Paragraph>
    <Paragraph position="28"> From the data cards, the routi'ne reads a reference parameter called KYTSCT, which indicates the !lBighcst consecutive section number that has been seuched. The control variable IDRPhas valne 0 at input; the routine. changes it to 1 if any w,qrds were removed from the section currently Seing searched, ot3erwise it ternsins 0 at output. The variable KNTRPT shows the nurnh~r of the section currently being searched. #The parameter INDFIL is set to 0 every time a new section is searched the first time. This tells the routine to bring rn khe section indicated hv KVTSCT. If its value is 1, the seccion to be read is indicated hy KMTPPT. The reduced sections are stored if Tile consecutively. If KNTRPT is less than KNTSCT, the sections follo~dng the one currently searched are stored on a temporary file because the length of the one being searched mav decrease. Not until the search has ended and the current section has been stored back at its proper place are the following gections transferre2 back to File 9.. For example, if KNTRPT = 1 and IZNTSCT = 5, then sections 2, 3, 4, and 5 are stored away.</Paragraph>
    <Paragraph position="29"> In the very f irgt . run for a given - ?I value, i .e. if K'7TSCT equals 1, the routine creates an nmnty list for the so-zalled Removal List. In the quhsequent runs the routine read3 in thz Removal. List from the file.</Paragraph>
    <Paragraph position="30"> The routine examines the definition lengths of the entries in the current section. itemhy item. The entries the definition length of which is greater than the set N - value are put o~ t'ls Removal List dfid deleted from t!le dictionarv. T5e value of IPRD is .set tn 4 if such entries are found. The remove2 words are printed out for reference.</Paragraph>
    <Paragraph position="31"> Then the dictionary is searched artd all definitions are checked against the items on the Removal List. If a clefinitipn containing a removed word is found, the respectiv~ entry itself is added to the Removal List and sul\sequently deleted from tho dictionary. If a search results in any new additions to the Removal List, the search is repeated. This is continued until no new deletions occur.</Paragraph>
    <Paragraph position="32"> After the ILL n-th section has been processed the first time and if deletions have occurred, KNTRPT is set to 1 ,2, . .</Paragraph>
    <Paragraph position="33"> , ;r respectively., in n - succeeding runs. If anv one of these produce., deletions (IDW set to I), the sequence is repeated. This is continued until IDRP remains in all n runs.</Paragraph>
  </Section>
class="xml-element"></Paper>