TOWARDS A CORE VOCAIIULARY FOIl A NATURAL I.,ANGUAGE 
SYSTEM 
I lubcrt l.chnmnn 
IBM l)cutschhmd Gmbll, Scientific ('enter 
institute for Knowledge Based Systems 
Wilckensstr. la 
I)-69(10 11cidelbcrg, (;ci'many 
lhnaih !,1'~!! at I)! ll)iBM I.BITNI'71' 
ABSTRACT 
The desire to construct robust and portable na- 
tural language systems has led to research on 
how a core vocabulary for such systems can be 
defined. Stalistical methods and semantic criteria 
for doing this arc discussed and compared. Cur- 
rcnlly it docs not seem possible to precisely de- 
fine the notion of core vocabulary, but it is 
argued that workable criteria can nevertheless be 
\['o1.1110. l:inally it is emplmsized that the imple- 
mentation of a core vt~cabulary must be seen as 
a long-range research prt~gram rather than as a 
short-term goal. 
Motiva!ion 
Rcasearch on natural language processing sys- 
tems today strives for the construction of robust 
and portable systems) A system is robust, if it 
can handle a large variety of user inputs without 
giving up or producing unexpected results. A 
system is portable in the sense intended here, if 
it is not geared to a single subject domain, but 
can be ported with a reasonable effort to a vail- 
ely of subjccl domains. It is common under- 
standing that there cxisls a t:t:llhal \[i'aglnt'.tlt of a 
language which I. is required for dealing with 
virtually any subject d(~main, anti 2. is invariant 
with respect to meaning and use accross subject 
donmins, it is of course a non-trivial empirical 
question whether such a cen{ral fragment really 
exists, and if so, to say what it is, but a number 
of researchers scenl tO share the asstllnption that 
it does (ef. e.g. Alshawi ct al. (19R8)). Any ro- 
bust and portablc system would then have to 
handle this core fi'agmcnl. 
in this paper I am concerned with a second 
- related - assumption, namely that there ex- 
ists a core vocabulary which is needed for handl- 
ing any subject domain. 'llfis assumpti(m is also 
shared by many researchers, and it tmdcrlies the 
production of basic vocabularies for language 
learning such as Ochlcr (1980). l.Jsually the att- 
thors claim that their word lists are I)ased on 
statistical investigations, but they also emphasize 
that. they did not slavishly stick to the statistics 
but used additional crilcria such as "usage 
value", "availability", "familiarity", or 
"lcamability" without ever saying how these are 
eslablished.2 
I will address the following questions: 
I. llow can the intuilivc notion of core vocab- 
ulary be properly dciincd? 
2. llow can statistical methods be employed 
to define a core vocabtdaD, and how do they 
relate to semantic criteria? 
3. What semantic criteria can be found to de- 
fine a core vocabulary? 
Definitions of a core vocabulary 
"l:here are several ways to define core 
vocab,lary, I can think of the following three: 
1. The core vocabulary consists of the n most 
frequent words of a hmguage. 
2. The core w~cabulary is that vocabulary 
which is common to all nativc speakers of 
a language. 
3. The sema,tic core vocabulary consists of 
lhose words which suffice to dcfinc all of the 
remaining vocabuh~ry of a language. 
The first two definitions call Re" statistical 
methods which shall be discussed in the next 
gcction, and the third onc obviously requires a 
.~cmanlic approach which shall bc discusscd in 
~cclion "Soma,tic crilcria". 
Statistical methods 
l:rcqucncy counts have well established the basic 
propcrtics of the frequency distribution for text 
corpora. Thus in Kuccra and l;rancis (1967) we 
get coverage ligures like lifts lbr their complete 
corl~US of about I million tokens: 
10 most frequent words: 
100 most frequent words: 
1000 most frequent words: 
24.26 % 
47.43 % 
68.86 % 
The research deso'ibed here has been comlucted i. the eonh;xt of the I .I I,O(; project (I lerzog et al., 1986). 
II has profiled fi'om inlensive discussio.s with R. Maye,'. Mnch of the underlying statistical work on text 
corpora is dec Io U, Bandara and (L Walch from Ihe speech recognition project SPRIN(\] (Wothke et at., 1989). 
Our investigations ;ire based (.I (~erman, but for ease of refere.ce also some l!nglish examples are given. 
- 303 - 
These figt,rcs vary only slightly with corlms size, 
and also for German similar values are observed. 
! lowevcr, while coverage figures are rather stable 
with respect to the n most frequent words of a 
corpus, what are tile n most frequent words may 
vary widely with corpora or subcorpora. Two 
parameters rcsponsible for this variation are ob- 
vious: 
I. Subject matter and 
2. ' Communicative function. 
Thus in the "Kultur" section of a newspaper 
which we have analyzed we see that words like 
Musik, Theater, Regisseur, etc. occur with a 
drastically higher freqt, ency than in the other 
sections, which of course can be attributed to 
subject matter. 13ul personal pronouns, in par- 
ticular 1st and 2nd person pronouns, also show 
a much higher frequency, and this can hardly be 
attrihutcd to subject matter, rather to different 
communicative functi(ms of feuillet(mistie writ- 
ing and say economic news. 
All of tiffs relates of course to tile much dis- 
cussed issue of what c,mstitutes a representatitve 
corpus for statistical linguistic analysis. Since 
specific subject matters and communicative 
functions vary in importance for different speak- 
ers of a language, it will be difficult if not ina- 
possible to eliminate arbitrariness. Rather, a 
definition of representative corpus must take into 
account tile research goals pursued. 
For a natural language system which is sup- 
posed to analyze and generate texts, to engage in 
dialogues with users, and which is to acquire 
knowledge fi'om the analysis of definiti(ms and 
rules formulaled in natural language, one needs 
a corpus of texts where all these aspects are suf- 
ficicntly rcprcscntcd. We were able to draw upon 
a wtriety of corpora none of w\[fich would sh()w 
all the featt, res rcquircd, but the combination of 
them seems to be quite reasonable. 
We conlpared Ihe fi)llowing five word lists: 
I. Oehlcr (1980): (;rundwortschatz consisting 
of 2247 words, 
2. Erk (1972): scientific texts from 34 disci- 
plines, 1283 words with fl'cquency > 20, 
3. Pregel/Rickhcit (1987): texts by primary 
school children, 593 words with frequency 
> 20, 
4. SPRING-corpus of newspaper texts, 2733 
most frequent words, 
5. I)UI)EN (1989): definitions for words be- 
ginning with a, 2693 words with frequency 
>4. 
Frona these, word lists IL were formed consisting 
of those words occurring in at least n of the ori- 
ginal word lists (I < n < 5). The lengths of these 
lisls are B~: 5409, !12 : 2248, l.h: 1215, B4: 565, and 
B5 : 116. 
The size of /Is shows thai a really common 
core of a varicty of texts may be extremely small, 
the successive losening of rcstrictimls used here 
allows for a balanced extension of this very smaU 
core. The list//3 was chosen as the statistical core 
vocabulary serving as a base for applying se- 
mantic criteria, becat, se the overall core vocabu- 
lary was envisaged to have a size of approx. 1500 
words. Inspection shows that many intuitively 
basic words and very few idiosyncratic words are 
contained due to the method of intersecting the 
word lists, l lence, !1~ seems quite reasonable. 
Semanlic criteria 
If one takes tim n most frequent words of any 
frequency count one will no doubt discover that 
these words will not exhibit a linguistic closure 
in the sense that natural scntcnces can be formed 
with all and only the words in the set. l:urther 
one will see that semantic relations will be in- 
complete. Thus one tinds in Oehler (1980) which 
is based on the old Kacding count that weiblich 
(female) occurs but not its antonym re&milch 
(male). For a core vocabulary to bc set up for a 
natural language system, 1 think, tmc must strive 
for lingt, istic closure, since otherwise, one cnds 
up with words one cannot use. This means that 
you cannot base the core w~cabulary on fre- 
quency counts alone. 
l~urthermore, one cannot expect that one will 
imve just the vocabulary needed to formulate 
delinitions for the words in the list chosen. To 
avoid circularity, one will have to accept that 
certai,i words cannot bc defined wilhin the vo- 
cabulary, but one will also have to accept that 
for some words less than complete definitions 
can be given. Because of this lack of delinability, 
a sere'retie core wwal,llary can only be under- 
stood as an approximativc notion geared towards 
"the best cmc can do". What one can hope to 
do, is to define 
1. taxonomic rdations, 
2. "selcctional restrictions" or constraints on 
seunauflic compatibility, and 
3. meaning rules of arbitrary complexity (in- 
cluding classical definitions). 
1 propose to formulate all of these typcs of rules 
in natural language for B3 trying to stay within 
at least tile vocabulary of B, , to add lhe words 
used in the fommlations to the original set, and 
continue until one cannot think of further rules. 
I claim that one can achieve a fixed point from 
where on no new words are added to tim set, and 
that at this moment one has reached a rather 
good approximation to a semantic core vocabu- 
lary. 
There is undoubtedly a relationship between 
frequency and semantic relevance: since 
taxonomic relations are often exemplified by 
anaphoric references, since semantic compatibil- 
ity constraints lead to tile co-occurrence of ap- 
- 304 - 
propriate words, and since other more complex 
semantic relationslfips arc bound to be exhibited 
in the various threads of discourse, one has all 
reason to expect a certain amount of congruence 
between frequency counts and the semantic core 
vocabulary as defined above. 
The work on fimnulating taxonomic re- 
lati(,ns, semantic constraints and other meaning 
rules is underway, and since it will inw,lve all of 
the w)cabulary, linguistic closure will be achieved 
at the same time. 
As an example, take a taxonomic rule for 
Arm which is in Bs 
Jeder(B3) Arm ist Tei1(B4) eines KiSrpers(B3) 
(Every arm is part of a body.) 
The word Kreperteil (body part) is only avail- 
able in Bj and was therefore not used, or instead 
o1' 7"eil one could also have used Glied(B3, 
member), bu! then the rule would not have 
covered arms of machines or rivers. This high- 
lights a big problem in the natural language for- 
mt, lation of meaning rules: how is ambiguity 
dealt with? Space does not permit a full dis- 
cussion here, therefore suffice it to say that it is 
one of our research goals to formulate meaning 
rules which specify criteria for disambiguation. 
Linguistic description 
The preceding discussion has concentrated on 
how to establish a core vocabulary. Now a few 
brief remarks shall follow on how the words of 
the core vocabulary can bc linguistically de- 
scribed. 
'l'he morphology of I:mguagcs such as 
(\]crman is well understood and has been coded 
for an extendcd vocabulary in Ihc lexical data- 
base of the IJ:,X project (llamett ct al., 1986). 
This database also conlains dctailcd syntactic 
inforuaation, in pa~l.icular on government pat- 
terus. 
It is the description tat' lhe semanlie (and 
pragmatic) properties of many words one wouhl 
obviously wanl h) include in a core vocabulary 
lhat will confront us with huge unsoivcd theore- 
tical problems. Be it modal verbs or proposi- 
tional attitudes, sentence adverbs or "abstract" 
nouns of various kinds, hwestigations on some 
individual words havc generated heaps of litera- 
ture, for others it seems that people have not 
even dared to look at thcm. l)oes this make the 
enterprise of implementing a core w)cabulary a 
futile one? I think not. I think the implementa- 
tion of a core w)cabulary should be seen as a 
long-range research goal for both computational 
and theoretical linguistics, and filrthermore that 
natural language systems provide a good envi- 
ronmcnt for doing experiments in semantics, be- 
cause they encourage an integrated treatment of 
linguistic phenomena. 
Conclusions 
()ur research on establishing a core vocabulary 
tor German in the framework of the I,I1,OG 
project Ires revealed that currently no absolute 
definition can be given, but ways have been 
shown how to arrive at a working dclinition with 
respect to the objectives of natural language sys- 
tems. It has been shown that both statistical 
mclhods and semanlic criteria can, and I think, 
have to contribute to the establishment of a core 
vocabulary. 
The linguistic description and thus the im- 
plcmentation of a core vocabulary depends 
heavily on progress in theoretical linguistics, in 
particular in semantics and pragmatics, but 1 
want to stress that h)cussing on a core w~cabu- 
lary is a fruitful way to direct linguistic research, 
which can be supported by the need for inte- 
graled treatments in natural language systems. 
References 
Alshawi, !I., D. M. Carter, .I. van F, ijck, R. C. 
Moore, I). B. Moran, ! ;. C. N. l'crcira, and A. 
G. Smith (1988): "Research Programme in Na- 
tural language Processing - Annual Report", 
Nattie Project Document NA-16, ('ambridge: 
SR! International. 
Barnett, B., 11. l,chmann, M. Zocppritz 
(1986): "A word database for natural language 
processing", Proceedings I lth lnterm~tional Con- 
.ference OlZ Complllalional Lingui.t'tic,~ COI.ING86 
Auyttst 25th to 29th, 1986, Bonn. l,'e&,ral Repub- 
lit: of (;ermany. 435-440. 
I;,rk, !!. (1972): Zur Lexik wisscnschafificher 
I,'achtexte, Mi'mchen: I h, eber. 
Ilcrzog, O. et al. (1986): "I.II.OG -- l,in- 
guistic and Ix,gic Mcthods Ibr the ('omputa- 
tional Undcrshmding of (;crvnan", 
IJLOG-Report Ib, Stuttgart: IBM I)eutschhmd. 
Kucera, 1 I., W. N. Francis (1967): Computa- 
tional ,4nalysis tf Present-Day /Irnerican F, nglish. 
Providence, Rh Brown University Press. 
Oehlcr, II. (1981)): KLETT Grund- und 
Auflmmvortschatz Deutsch. Stuttgart: Klett. 
l'regel, D., G. Rickheit (1987): Der 
Wortschatz im Grumlschulalter. I iildeshcim: 
()Ires. 
Wothke, K., U. B~mdara, J. Kempf, E. 
Keppel, K. Mohr, G. Walch (1989): "The 
SI~RING Speech Recognitk)n System for 
German", in: Proceedings of Eurospeech "89. 
VoL 2, 9-12. 
- 305 - 
