A STOCHASTIC PROCESS FOR WORD FREQUENCY 
DISTRIBUTIONS 
Harald Baayen* 
Max-Planck-Institut für Psycholinguistik
Wundtlaan 1, NL-6525 XD Nijmegen 
Internet: baayen@mpi.nl 
ABSTRACT 
A stochastic model based on insights of Man- 
delbrot (1953) and Simon (1955) is discussed 
against the background of new criteria of ade- 
quacy that have become available recently as a 
result of studies of the similarity relations be- 
tween words as found in large computerized text 
corpora. 
FREQUENCY DISTRIBUTIONS 
Various models for word frequency distributions 
have been developed since Zipf (1935) applied 
the zeta distribution to describe a wide range of 
lexical data. Mandelbrot (1953, 1962) extended Zipf's distribution 'law'

    $f_i = K / i^{\gamma}$,    (1)

where $f_i$ is the sample frequency of the $i$-th type in a ranking according to decreasing frequency, with the parameter $B$,

    $f_i = K / (B + i)^{\gamma}$,    (2)
by means of which fits are obtained that are more 
accurate with respect to the higher frequency 
words. Simon (1955, 1960) developed a stochas- 
tic process which has the Yule distribution 
    $f_i = A\,B(i, \rho + 1)$,    (3)

with the parameter $A$ and $B(i, \rho + 1)$ the Beta function in $(i, \rho + 1)$, as its stationary solutions. For $i \to \infty$, (3) can be written as

    $f_i \approx \Gamma(\rho + 1)\, i^{-(\rho + 1)}$,

in other words, (3) approximates Zipf's law with
respect to the lower frequency words, the tail of 
*I am indebted to Klaas van Ham, Richard Gill, Bert Hoeks and Erik Schils for stimulating discussions on the statistical analysis of lexical similarity relations.
the distribution. Other models, such as Good 
(1953), Waring-Herdan (Herdan 1960, Muller 
1979) and Sichel (1975), have been put forward, 
all of which have Zipf's law as some special or 
limiting form. Unrelated to Zipf's law is the 
lognormal hypothesis, advanced for word fre- 
quency distributions by Carroll (1967, 1969), 
which gives rise to reasonable fits and is widely 
used in psycholinguistic research on word fre- 
quency effects in mental processing. 
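To make the functional forms concrete, the following minimal Python sketch evaluates (1)-(3) side by side. The parameter values (K, gamma, B, A, rho) are arbitrary illustrative choices, not estimates fitted to any of the data sets discussed here.

from math import lgamma, exp

def zipf(i, K=1000.0, gamma=1.0):
    # equation (1): f_i = K / i^gamma
    return K / i ** gamma

def zipf_mandelbrot(i, K=1000.0, B=2.5, gamma=1.0):
    # equation (2): f_i = K / (B + i)^gamma
    return K / (B + i) ** gamma

def yule(i, A=1000.0, rho=1.0):
    # equation (3): f_i = A * Beta(i, rho + 1); the Beta function is
    # computed via log-gamma to avoid overflow for large i
    return A * exp(lgamma(i) + lgamma(rho + 1.0) - lgamma(i + rho + 1.0))

for i in (1, 10, 100, 1000):
    print(i, round(zipf(i), 2), round(zipf_mandelbrot(i), 2), round(yule(i), 2))

For large i the Yule values approach Gamma(rho + 1) * i^(-(rho + 1)), the asymptotic form given above.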
A problem that immediately arises in the con- 
text of the study of word frequency distribu- 
tions concerns the fact that these distributions 
have two important characteristics which they 
share with other so-called large number of rare 
events (LNRE) distributions (Orlov and Chi- 
tashvili 1983, Chitashvili and Khmaladze 1989),
namely that on the one hand a huge number of 
different word types appears, and that on the 
other hand it is observed that while some events 
have reasonably stable frequencies, others occur 
only once, twice, etc. Crucially, these rare events 
occupy a significant portion of the list of all 
types observed. The presence of such large num- 
bers of very low frequency types introduces a significant bias between the rank-probability distribution and the rank-frequency distribution, in conflict with the conditions of the law of large numbers, so that expressions concerning frequencies cannot be taken to approximate expressions concerning probabilities.
The fact that for LNRE distributions the rank- 
probability distributions cannot be reliably esti- 
mated on the basis of rank-frequency distribu- 
tions is one source of the lack of goodness-of-fit 
often observed when various distribution 'laws' 
are applied to empirical data. Better results are 
obtained with Zipfian models when Orlov and 
Chitashvili's (1983) extended generalized Zipf's 
law is used. 
A second problem which arises when the ap- 
propriateness of the various lexical models is 
considered, the central issue of the present dis- 
cussion, concerns the similarity relations among 
words in lexical distributions. These empirical 
similarity relations, as observed for large corpora 
of words, impose additional criteria on the ad- 
equacy of models for word frequency distribu- 
tions. 
SIMILARITY RELATIONS 
There is a growing consensus in psycholinguis- 
tic research that word recognition depends not 
only on properties of the target word (e.g. its 
length and frequency), but also upon the number 
and nature of its lexical competitors or neigh- 
bors. The first to study similarity relations 
among lexical competitors in the lexicon in re- 
lation to lexical frequency were Landauer and 
Streeter (1973). Let a neighbor be a word that
differs in exactly one phoneme (or letter) from 
a given target string, and let the neighborhood 
be the set of all neighbors, i.e. the set of all 
words at Hamming distance 1 from the target. 
Landauer and Streeter observed that (1) high- 
frequency words have more neighbors than low- 
frequency words (the neighborhood density ef- 
fect), and that (2) high-frequency words have 
higher-frequency neighbors than low-frequency 
words (the neighborhood frequency effect). In 
order to facilitate statistical analysis, it is con- 
venient to restate the neighborhood frequency 
effect as a correlation between the target's num- 
ber of neighbors and the frequencies of these 
neighbors, rather than as a relation between 
the target's frequency and the frequencies of its 
neighbors -- targets with many neighbors having 
higher frequency neighbors, and hence a higher 
mean neighborhood frequency $\bar{f}_n$ than targets
with few neighbors. In fact, both the neighbor- 
hood density and the neighborhood frequency 
effect are descriptions of a single property of 
lexical space, namely that its dense similarity 
regions are populated by the higher frequency 
types. A crucial property of word frequency dis- 
tributions is that the lexical similarity effects oc- 
cur not only across but also within word lengths. 
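For concreteness, the sketch below computes the two measures used in what follows, the number of neighbors #n and the mean frequency of those neighbors, for a toy frequency dictionary. The dictionary and alphabet are illustrative assumptions, not the CELEX material analysed here.

def neighbors(word, lexicon, alphabet):
    # all words in 'lexicon' at Hamming distance 1 from 'word'
    found = set()
    for pos in range(len(word)):
        for seg in alphabet:
            if seg != word[pos]:
                candidate = word[:pos] + seg + word[pos + 1:]
                if candidate in lexicon:
                    found.add(candidate)
    return found

freq = {"kat": 120, "kap": 30, "kam": 55, "rat": 80, "rot": 14}   # toy data
alphabet = set("".join(freq))

for target, f in sorted(freq.items(), key=lambda kv: -kv[1]):
    ns = neighbors(target, freq, alphabet)
    mean_fn = sum(freq[n] for n in ns) / len(ns) if ns else 0.0
    print(target, f, len(ns), round(mean_fn, 1))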
Figure 1A displays the rank-frequency distri- 
bution of Dutch monomorphemic phonologically 
represented stems, function words excluded, and 
charts the lexical similarity effects of the subset 
of words with length 4 by means of boxplots. 
These show the mean (dotted line), the median, 
the upper and lower quartiles, the most extreme 
data points within 1.5 times the interquartile 
range, and remaining outliers for the number of 
neighbors (#n) against target frequency (neigh- 
borhood density), and for the mean frequency of 
the neighbors of a target ($\bar{f}_n$) against the num-
Table 1: Spearman rank correlation analysis of the neighborhood density and frequency effects for empirical and theoretical words of length 4.

                    Dutch    Mand.    Mand.-Simon
  dens.   r_s        0.24     0.65       0.31
          r_s^2      0.06     0.42       0.10
          t          9.16    68.58      11.97
          df         1340     6423       1348
  freq.   r_s        0.51     0.62       0.61
          r_s^2      0.26     0.38       0.37
          t         21.65    63.02      28.22
          df         1340     6423       1348
ber of neighbors of the target (neighborhood fre- 
quency), for targets grouped into frequency and 
density classes respectively. Observe that the 
rank-frequency distribution of monomorphemic 
Dutch words does not show up as a straight line in a double logarithmic plot, and that there is a small neighborhood density effect and a somewhat more pronounced neighborhood frequency effect. A Spearman rank correlation analysis reveals that the lexical similarity effects of figure 1A are statistically highly significant trends ($p \ll 0.001$), even though the correlations themselves are quite weak (see table 1, column 1): in the case of lexical density only 6% of the variance is explained.¹
STOCHASTIC MODELLING 
By themselves, models of the kind proposed 
by Zipf, Herdan and Muller or Sichel, even 
though they may yield reasonable fits to partic- 
ular word frequency distributions, have no bear- 
ing on the similarity relations in the lexicon. 
The only model that is promising in this respect 
is that of Mandelbrot (1953, 1962). Mandel- 
brot derived his modification of Zipf's law (2) 
on the basis of a Markovian model for generat-
ing words as strings of letters, in combination 
with some assumptions concerning the cost of 
transmitting the words generated in some op- 
timal code, giving a precise interpretation to 
Zipf's 'law of abbreviation'. Miller (1957), wish-
ing to avoid a teleological explanation, showed 
that the Zipf-Mandelbrot law can also be de- 
rived under slightly different assumptions. Interestingly, Nusbaum (1985), on the basis of sim-
ulation results with a slightly different neighbor 
definition, reports that the neighborhood density 
and neighborhood frequency effects occur within 
¹Note that the larger value of $r_s$ for the neighborhood frequency effect is a direct consequence of the fact that the frequencies of the neighbors of each target are averaged before they enter into the calculations, masking much of the variance.
[Figure 1 appears here; only the captions are recoverable from the plot residue.]

A: Dutch monomorphemic stems in the CELEX database, standardized at 1,000,000. For the total distribution, N = 224567, V = 4455. For strings of length 4, N = 64854, V = 1342.

B: Simulated Dutch monomorphemic stems, as generated by a Markov process. For the total distribution, N = 224567, V = 58300. For strings of length 4, N = 74618, V = 6425.

C: Simulated Dutch monomorphemic stems, as generated by the Mandelbrot-Simon model ($\alpha$ = 0.01, $V_c$ = 2000). For the total distribution, N = 291944, V = 4848. For strings of length 4, N = 123317, V = 1350.

Figure 1: Rank-frequency and lexical similarity characteristics of the empirical and two simulated distributions of Dutch phonological stems. From left to right: double logarithmic plot of rank $i$ versus frequency $f_i$, boxplot of frequency class FC (1:1; 2:2-4; 3:5-12; 4:13-33; 5:34-90; 6:91-244; 7:245+) versus number of neighbors #n (length 4), and boxplot of density class DC (1:1-3; 2:4-6; 3:7-9; 4:10-12; 5:13-15; 6:16-19; 7:20+) versus mean frequency of neighbors $\bar{f}_n$ (length 4). (Note that not all axes are scaled equally across the three distributions.) N: number of tokens, V: number of types.
a given word length when the transition proba- 
bilities are not uniformly distributed. Unfortu- 
nately, he leaves unexplained why these effects 
occur, and to what extent his simulation is a 
realistic model of lexical items as used in real 
speech. 
In order to come to a more precise understand- 
ing of the source and nature of the lexical simi- 
larity effects in natural language we studied two 
stochastic models by means of computer simu- 
lations. We first discuss the Markovian model 
figuring in Mandelbrot's derivation of (2). 
Consider a first-order Markov process. Let $A = \{0, 1, \ldots, k\}$ be the set of phonemes of the language, with 0 representing the terminating character space, and let $\mathcal{P} = (p_{ij})_{i,j \in A}$ with $p_{00} = 0$. If $X_n$ is the letter in the $n$-th position of a string, we define $P(X_0 = i) = p_{0i}$, $i \in A$. Let $y$ be a finite string $(i_0, i_1, \ldots, i_{m-1})$ for $m \in \mathbb{N}$ and define $X^{(m)} := (X_0, X_1, \ldots, X_{m-1})$; then

    $p_y := P(X^{(m)} = y) = p_{0 i_0}\, p_{i_0 i_1} \cdots p_{i_{m-2} i_{m-1}}$.    (4)

The string types of varying length $m$, terminating with the space and without any intervening space characters, constitute the words of the theoretical vocabulary

    $S_m := \{(i_0, i_1, \ldots, i_{m-2}, 0) : i_j \in A \setminus \{0\},\ j = 0, 1, \ldots, m-2,\ m \in \mathbb{N}\}$.
With $N_y$ the token frequency of type $y$ and $V$ the number of different types, the vector $(N_{y_1}, N_{y_2}, \ldots, N_{y_V})$ is multinomially distributed. Focussing on the neighborhood density effect, and defining the neighborhood of a target string $y_t$ for fixed length $m$ as

    $C_t := \{y \in S_m : \exists!\, i \in \{0, 1, \ldots, m-2\}$ such that $y_i \neq y_{t,i}\}$,

we have that the expected number of neighbors of $y_t$ equals

    $E[V(C_t)] = \sum_{y \in C_t} \{1 - (1 - p_y)^N\}$,    (5)
with $N$ denoting the number of trials (i.e. the number of tokens sampled). Note that when the transition matrix $\mathcal{P}$ defines a uniform distribution (all $p_{ij}$ equal), we immediately have that the expected neighborhood density for length $m_1$ is identical for all targets $y_t$, while for length $m_2 > m_1$ the expected density will be less than that at length $m_1$, since $p_y^{(m_2)} < p_y^{(m_1)}$ given (4). With $E[N_y] = N p_y$, we find that the neighborhood density effect does occur across word lengths, even though the transition probabilities are uniformly distributed.
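A minimal Python sketch of this first-order scheme is given below; it evaluates (4) and the expected neighborhood density (5) for all strings of a fixed length, using an assumed toy transition matrix rather than the Dutch matrix discussed next.

import itertools

ALPH = (1, 2, 3)                                  # phonemes; 0 is the terminating space
P0 = {1: 0.5, 2: 0.3, 3: 0.2}                     # P(X_0 = i)
P = {1: {1: 0.1, 2: 0.5, 3: 0.2, 0: 0.2},         # transition probabilities p_ij
     2: {1: 0.4, 2: 0.1, 3: 0.3, 0: 0.2},
     3: {1: 0.3, 2: 0.3, 3: 0.2, 0: 0.2}}

def p_word(segs):
    # equation (4): probability of the segments 'segs' followed by the space
    p = P0[segs[0]]
    for a, b in zip(segs, segs[1:]):
        p *= P[a][b]
    return p * P[segs[-1]][0]

def expected_neighbors(target, N):
    # equation (5): expected number of neighbor types observed after N tokens
    total = 0.0
    for pos in range(len(target)):
        for seg in ALPH:
            if seg != target[pos]:
                y = target[:pos] + (seg,) + target[pos + 1:]
                total += 1.0 - (1.0 - p_word(y)) ** N
    return total

for target in itertools.product(ALPH, repeat=4):
    print(target, round(p_word(target), 5), round(expected_neighbors(target, 50000), 2))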
In order to obtain a realistic, non-trivial the- 
oretical word distribution comparable with the 
empirical data of figure 1A, the transition matrix 
$\mathcal{P}$ was constructed such that it generated a subset of phonotactically legal (possible) monomorphemic strings of Dutch by conditioning consonant $C_k$ in the string $X_i X_j C_k$ on $X_j$ and the segmental nature (C or V) of $X_i$, while vowels were conditioned on the preceding segment only. This procedure allowed us to differentiate between e.g. phonotactically legal word-initial kn and illegal word-final kn sequences, at the same
time avoiding full conditioning on two preced- 
ing segments, which, for four-letter words, would 
come uncomfortably close to building the prob- 
abilities of the individual words in the database 
into the model. 
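The conditioning scheme can be sketched as follows, with hypothetical helper names (the actual matrix was estimated from the CELEX strings, and the segment classes below are placeholders): a consonant is predicted from the preceding segment plus the C/V class of the segment before that, while a vowel is predicted from the preceding segment only.

VOWELS = set("aeiouy")       # illustrative segment classes, not the CELEX phoneme inventory

def history(prev2, prev1, next_seg):
    # context on which the probability of 'next_seg' is conditioned
    if next_seg in VOWELS:
        return (prev1,)                            # vowel: preceding segment only
    cv = "V" if prev2 in VOWELS else "C"
    return (cv, prev1)                             # consonant: C/V class of prev2, plus prev1

print(history("k", "a", "t"))   # ('C', 'a'): consonant 't' after k-a
print(history("a", "k", "o"))   # ('k',): vowel 'o' after a-k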
The rank-frequency distribution of 58300 
types and 224567 tokens (disregarding strings of 
length 1) obtained by means of this (second or- 
der) Markov process shows up in a double logarithmic plot as roughly linear (figure 1B). Al-
though the curve has the general Zipfian shape, 
the deviations at head and tail are present by ne- 
cessity in the light of Rouault (1978). A compar- 
ison with figure 1A reveals that the large surplus 
of very low frequency types is highly unsatisfac- 
tory. The model (given the present transition 
matrix) fails to replicate the high rate of use of 
the relatively limited set of words of natural lan- 
guage. 
The lexical similarity effects as they emerge for the simulated strings of length 4 are displayed in the boxplots of figure 1B. A very pronounced
neighborhood density effect is found, in combi- 
nation with a subdued neighborhood frequency 
effect (see table 1, column 2). 
The appearance of the neighborhood density 
effect within a fixed string length in the Markovian scheme with non-uniformly distributed $p_{ij}$ can be readily understood in the simple case of the first order Markov model outlined above. Since neighbors are obtained by substitution of a single element of the phoneme inventory $A$, two consecutive transition probabilities of (4) have to be replaced. For increasing target probability $p_{y_t}$, the constituting transition probabilities $p_{ij}$ must increase, so that, especially for non-trivial $m$, the neighbors $y \in C_t$ will generally be protected against low probabilities $p_y$.
Consequently, by (5), for fixed length m, higher 
frequency words will have more neighbors than 
lower frequency words for non-uniformly dis- 
tributed transition probabilities. 
The fact that the lexical similarity effects 
emerge for target strings of the same length is 
a strong point in favour of a Markovian source 
for word frequency distributions. Unfortunately, 
comparing the results of figure 1B with those 
of figure 1A, it appears that the effects are of 
the wrong order of magnitude: the neighborhood 
density effect is far too strong, the neighborhood 
frequency effect somewhat too weak. The source 
of this distortion can be traced to the extremely 
large number of types generated (6425) for a 
number of tokens (74618) for which the empirical 
data (64854 tokens) allow only 1342 types. This 
large surplus of types gives rise to an inflated 
neighborhood density effect, with the concomi- 
tant effect that neighborhood frequency is scaled 
down. Rather than attempting to address this 
issue by changing the transition matrix by using 
a more constrained but less realistic data set, 
another option is explored here, namely the idea 
to supplement the Markovian stochastic process 
with a second stochastic process developed by 
Simon (1955), by means of which the intensive 
use can be modelled to which the word types of 
natural language are put. 
Consider the frequency distribution of e.g. a 
corpus that is being compiled, and assume that 
at some stage of compilation N word tokens have 
been observed. Let $n_r^{(N)}$ be the number of word types that have occurred exactly $r$ times in these first $N$ words. If we allow for the possibilities that both new types can be sampled, and old types can be re-used, Simon's model in its simplest form is obtained under the three assumptions that (1) the probability that the $(N+1)$-st word is a type that has appeared exactly $r$ times is proportional to $r\, n_r^{(N)}$, the summed token frequencies of all types with token frequency $r$ at stage $N$, that (2) there is a constant probability $\alpha$ that the $(N+1)$-st word represents a new type, and that (3) all frequencies grow proportionally with $N$, so that

    $\frac{n_r^{(N+1)}}{n_r^{(N)}} = \frac{N + 1}{N}$    for all $r$, $N$.
Simon (1955) shows that the Yule-distribution 
(3) follows from these assumptions. When the 
third assumption is replaced by the assumptions 
that word types are dropped with a probabil- 
ity proportional to their token frequency, and 
that old words are dropped at the same rate at 
which new word types are introduced so that 
the total number of tokens in the distribution is 
a constant, the Yule-distribution is again found 
to follow (Simon 1960). 
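A minimal simulation sketch of the process in its simplest form is given below; the parameter values are arbitrary, and the proportionality of assumption 1 is implemented by drawing one of the previously sampled tokens uniformly at random, which re-uses a type with probability proportional to its current frequency.

import random

def simon(n_tokens, alpha=0.01, seed=1):
    # Simon (1955), simplest form: with probability alpha the next token is a
    # new type; otherwise an old type is re-used with probability proportional
    # to its current token frequency.
    random.seed(seed)
    tokens = [0]                 # type identifiers of the tokens sampled so far
    next_type = 1
    while len(tokens) < n_tokens:
        if random.random() < alpha:
            tokens.append(next_type)
            next_type += 1
        else:
            tokens.append(random.choice(tokens))
    return tokens

sample = simon(50000)
freqs = {}
for t in sample:
    freqs[t] = freqs.get(t, 0) + 1
print(len(freqs), "types in", len(sample), "tokens")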
By itself, this stochastic process has no ex- 
planatory value with respect to the similarity 
relations between words. It specifies use and re- 
use of word types, without any reference to seg- 
mental constituency or length. However, when a 
Markovian process is fitted as a front end to Si- 
mon's stochastic process, a hybrid model results 
that has the desired properties, since the latter 
process can be used to force the required high 
intensity of use on the types of its input distri- 
bution. The Markovian front end of the model 
can be thought of as defining a probability dis- 
tribution that reflects the ease with which words 
can be pronounced by the human vocal tract, 
an implementation of phonotaxis. The second 
component of the model can be viewed as simu- 
lating interfering factors pertaining to language 
use. Extralinguistic factors codetermine the ex- 
tent to which words are put to use, indepen- 
dently of the slot occupied by these words in the network of similarity relations,² and may effect a substantial reduction of the lexical similarity effects.
Qualitatively satisfying results were obtained 
with this 'Mandelbrot-Simon' stochastic model, 
using the transition matrix of figure 1B for the Markovian front end and fixing Simon's birth rate $\alpha$ at 0.01.³ An additional parameter, $V_c$,
the critical number of types for which the switch 
from the front end to what we will refer to as 
the component of use is made, was fixed at 2000. 
Figure 1C shows that both the general shape of 
the rank-frequency curve in a double logarith- 
mic grid, as well as the lexical similarity effects 
(table 1, column 3) are highly similar to the em- 
pirical observations (figure 1A). Moreover, the 
overall number of types (4848) and the number 
of types of length 4 (1350) closely approximate 
the empirical numbers of types (4455 and 1342 
respectively), and the same holds for the overall 
numbers of tokens (291944 and 224567) respec- 
tively. Only the number of tokens of length 4 
is overestimated by a factor 2. Nevertheless, the 
type-token ratio is far more balanced than in the 
original Markovian scheme. Given that the tran- 
sition matrix models only part of the phonotaxis 
of Dutch, a perfect match between the theoret- 
ical and empirical distributions is not to be ex- 
pected. 
The present results were obtained by imple- 
menting Simon's stochastic model in a slightly 
modified form, however. Simon's derivation of 
the Yule-distribution builds on the assumption that each $n_r$ grows proportionally with $N$, an as-

²For instance, the Dutch word kuip, 'barrel', is a low-frequency type in the present-day language, due to the fact that its denotatum has almost completely dropped out of use. Nevertheless, it was a high-frequency word in earlier centuries, to which the high frequency of the surname Kuiper bears witness.

³The new types entering the distribution at rate $\alpha$ were generated by means of the transition matrix of figure 1B.
sumption that does not lend itself to implemen- 
tation in a stochastic process. Without this as- 
sumption, rank-frequency distributions are gen- 
erated that depart significantly from the empir- 
ical rank-frequency curve, the highest frequency 
words attracting a very large proportion of all 
tokens. By replacing Simon's assumptions 1 and 
3 by the 'rule of usage' that 
the probability that the $(N+1)$-st word is a type that has appeared exactly $r$ times is proportional to

    $H_r := p(r) \log \frac{1}{p(r)}$,    (6)

theoretical rank-frequency distributions of the desired form can be obtained. Writing

    $p(r) := \frac{r\, n_r^{(N)}}{N}$

for the probability of re-using any type that has been used $r$ times before, $H_r$ can be interpreted as the contribution of all types with frequency $r$ to the total entropy $H$ of the distribution of ranks $r$, i.e. to the average amount of information

    $H = \sum_r p(r) \log \frac{1}{p(r)}$.
Selection of ranks according to (6) rather than proportional to $r\, n_r$ (Simon's assumption 1) ensures that the highest ranks $r$ have lowered prob-
abilities of being sampled, at the same time 
slightly raising the probabilities of the inter- 
mediate ranks r. For instance, the 58 highest 
ranks of the distribution of figure 1C have some- 
what raised, the complementary 212 ranks some- 
what lowered probability of being sampled. The 
advantage of using (6) is that unnatural rank- 
frequency distributions in which a small number 
of types assume exceedingly high token frequen- 
cies are avoided. 
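A sketch of selection by the rule of usage, in Python with hypothetical helper names, is given below. How a type is chosen within the selected frequency class is not specified above, so the uniform choice made here is an assumption.

import math, random

def select_old_type(freq_of_type):
    # choose a frequency class r with probability proportional to
    # H_r = p(r) log(1/p(r)), equation (6), with p(r) = r * n_r / N,
    # then re-use a type (here: chosen uniformly) from that class
    N = sum(freq_of_type.values())
    n_r = {}
    for f in freq_of_type.values():
        n_r[f] = n_r.get(f, 0) + 1
    classes, weights = [], []
    for r, count in n_r.items():
        p_r = r * count / N
        classes.append(r)
        weights.append(max(p_r * math.log(1.0 / p_r), 1e-12))  # guard against p_r == 1
    r = random.choices(classes, weights=weights)[0]
    return random.choice([w for w, f in freq_of_type.items() if f == r])

print(select_old_type({"de": 40, "kat": 5, "kuip": 1, "rat": 1}))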
The proposed rule of usage can be viewed as a 
means to obtain a better trade-off in the distribution between maximization of information transmission and optimization of the cost of coding the information. To see this, consider an individual word type $y$. In order to minimize the cost of coding $C(y) = -\log(\Pr(y))$, high-frequency words should be re-used. Unfortunately, these high-frequency words have the lowest information content. However, it can be shown that maximization of information transmission requires the re-use of the lowest frequency types ($H_r$ is maximal for uniformly distributed $p(r)$). Thus we have two opposing requirements, which balance out in favor of a more intensive use of the lower and intermediate frequency ranges when selection of ranks is proportional to (6).
The 'rule of usage' (6) implies that higher 
frequency words contribute less to the average 
amount of information than might be expected 
on the basis of their relative sample frequen- 
cies. Interestingly, there is independent evidence 
for this prediction. It is well known that the 
higher-frequency types have more (shades of) 
meaning(s) than lower-frequency words (see e.g. 
Reder, Anderson and Bjork 1974, Paivio, Yuille 
and Madigan 1968). A larger number of mean- 
ings is correlated with increased contextual de- 
pendency for interpretation. Hence the amount 
of information contributed by such types out of 
context (under conditions of statistical indepen- 
dence) is less than what their relative sample 
frequencies suggest, exactly as modelled by our 
rule of usage. 
Note that this semantic motivation for selection proportional to $H_r$ makes it possible to avoid invoking external principles such as 'least effort' or 'optimal coding' in the mathematical definition of the model, principles that have been criticized as straining one's credulity (Miller 1957).⁴
FUNCTION WORDS 
Up till now, we have focused on the modelling 
of monomorphemic Dutch words, to the exclu- 
sion of function words and morphologically com- 
plex words. One of the reasons for this ap- 
proach concerns the way in which the shape of 
the rank-frequency curves differs substantially 
depending on which kinds of words are included 
in the distribution. As shown in figure 2, the 
curve of monomorphemic words without func- 
tion words is highly convex. When function 
words are added, the head of the tail is straight- 
ened out, while the addition of complex words 
brings the tail of the distribution (more or less) 
in line with Zipf's law. Depending on what kind 
of distribution is being modelled, different crite- 
ria of adequacy have to be met. 
Interestingly, function words, -- articles, pro- 
nouns, conjunctions and prepositions, the so- 
called closed classes, among which we have also 
reckoned the auxiliary verbs -- typically show up 
as the shortest and most frequent (Zipf) words in 
frequency distributions. In fact, they are found 
with raised frequencies in the empirical rank-
frequency distribution when compared with the 
curve of content words only, as shown in the first 
⁴In this respect, Miller's (1957) alternative derivation of (2) in terms of random spacing is unconvincing in the light of the phonotactic constraints on word structure.
[Figure 2 appears here; only the caption is recoverable from the plot residue.]

Figure 2: Rank-frequency plots for Dutch phonological stems. From left to right: monomorphemic words without function words, monomorphemic words and function words, complete distribution.
two graphs of figure 2. Miller, Newman & Fried- 
man (1958), discussing the finding that the fre- 
quential characteristics of function words differ 
markedly from those of content words, argued 
that (1958:385) 
Inasmuch as the division into two 
classes of words was independent of the 
frequencies of the words, we might have 
expected it to simply divide the sam- 
ple in half, each half retaining the sta- 
tistical properties of the whole. Since 
this is clearly not the case, it is ob- 
vious that Mandelbrot's approach is 
incomplete. The general trends for 
all words combined seem to follow a 
stochastic pattern, but when we look 
at syntactic patterns, differences begin 
to appear which will require linguistic, 
rather than mere statistical, explana- 
tions. 
In the Mandelbrot-Simon model developed here, neither the Markovian front end nor the proposed rule of usage is able to model the extremely high intensity of use of these function words correctly without unwished-for side effects on the distribution of content words. However, given that the semantics of function words are not subject to the loss of specificity that characterizes high-frequency content words, function words are not subject to selection proportional to $H_r$. Instead, some form of selection proportional to $r\, n_r$ probably is more appropriate here.
MORPHOLOGY 
The Mandelbrot-Simon model has a single parameter $\alpha$ that allows new words to enter the dis-
tribution. Since the present theory is of a phono- 
logical rather than a morphological nature, this 
parameter models the (occasional) appearance 
of new simplex words in the language only, and 
cannot be used to model the influx of morpho- 
logically complex words. 
First, morphological word formation processes 
may give rise to consonant clusters that are per- 
mitted when they span morpheme boundaries, 
but that are inadmissible within single mor- 
phemes. This difference in phonotactic pattern- 
ing within and across morphemes already reveals that morphologically complex words have a different source than monomorphemic words. Second, each word formation process, whether compounding or affixation of suffixes like -ness and -ity, is characterized by its own degree of productivity. Quantitatively, differences in the degree of productivity amount to differences in the birth rates at which complex words appear in the vocabulary. Typically, such birth rates, which can be expressed as $E[n_1]/N$, where $n_1$ and $N$ denote the number of types occurring once only and the number of tokens of the frequency distributions of the corresponding morphological categories (Baayen 1989), assume values that are significantly higher than the birth rate $\alpha$ of
monomorphemic words. Hence it is impossible 
to model the complete lexical distribution with- 
out a worked-out morphological component that 
specifies the word formation processes of the lan- 
guage and their degrees of productivity. 
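Such a category-conditioned birth rate can be estimated directly from a frequency list of the types in one morphological category; a tiny sketch with toy counts is given here.

def birth_rate(counts):
    # hapax legomena n_1 divided by the token count N of the category
    n1 = sum(1 for c in counts if c == 1)
    N = sum(counts)
    return n1 / N

print(birth_rate([1, 1, 1, 2, 5, 40]))    # 3 hapaxes / 50 tokens = 0.06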
While actual modelling of the complete distri- 
bution is beyond the scope of the present paper, 
we may note that the addition of birth rates for 
word formation processes to the model, neces- 
sitated by the additional large numbers of rare 
words that appear in the complete distribution, 
ties in nicely with the fact that the frequency 
distributions of productive morphological cate- 
gories are prototypical LNRE distributions, for 
which the large values for the numbers of types 
occurring once or twice only are characteristic. 
With respect to the effect of morphological 
structure on the lexical similarity effects, we fi- 
nally note that in the empirical data the longer 
word lengths show up with sharply diminished 
neighborhood density. However, it appears that 
those longer words which do have neighbors are 
morphologically complex. Morphological struc- 
ture raises lexical density where the phonotaxis 
fails to do so: for long monomorphemic words 
the huge space of possible word types is sampled 
too sparsely for the lexical similarity effects to
emerge. 

REFERENCES 
Baayen, R.H. 1989. A Corpus-Based Approach 
to Morphological Productivity. Statistical Anal- 
ysis and Psycholinguistic Interpretation. Diss. 
Vrije Universiteit, Amsterdam. 
Carroll, J.B. 1967. On Sampling from a Lognormal Model of Word Frequency Distribution. In: H. Kučera & W.N. Francis 1967, 406-424.
Carroll, J.B. 1969. A Rationale for an Asymptotic Lognormal Form of Word Frequency Distributions. Research Bulletin, Educational Testing Service, Princeton, November 1969.
Chitashvili, R.J. & Khmaladze, E.V. 1989. Statistical Analysis of Large Number of Rare Events and Related Problems. Transactions of the Tbilisi Mathematical Institute.
Good, I.J. 1953. The population frequencies of 
species and the estimation of population param- 
eters, Biometrika 43, 45-63. 
Herdan, G. 1960. Type-token Mathematics, The Hague, Mouton.
Kučera, H. & Francis, W.N. 1967. Computational Analysis of Present-Day American English. Providence: Brown University Press.
Landauer, T.K. & Streeter, L.A. 1973. Struc- 
tural differences between common and rare 
words: failure of equivalence assumptions for 
theories of word recognition, Journal of Verbal 
Learning and Verbal Behavior 12, 119-131. 
Mandelbrot, B. 1953. An informational the- 
ory of the statistical structure of language, in: 
W.Jackson (ed.), Communication Theory, But- 
terworths. 
Mandelbrot, B. 1962. On the theory of word 
frequencies and on related Markovian models 
of discourse, in: R.Jakobson, Structure of Lan- 
guage and its Mathematical Aspects. Proceedings 
of Symposia in Applied Mathematics Vol XII, 
Providence, Rhode Island, American Mathemat-
ical Society, 190-219. 
Miller, G.A. 1954. Communication, Annual 
Review of Psychology 5, 401-420. 
Miller, G.A. 1957. Some effects of intermittent 
silence, The American Journal of Psychology 52,
311-314. 
Miller, G.A., Newman, E.B. & Friedman, E.A. 
1958. Length-Frequency Statistics for Written 
English, Information and Control 1, 370-389.
Muller, Ch. 1979. Du nouveau sur les distri- 
butions lexicales: la formule de Waring-Herdan. 
In: Ch. Muller, Langue Française et Linguistique Quantitative. Genève: Slatkine, 177-195.
Nusbaum, H.C. 1985. A stochastic account 
of the relationship between lexical density and 
word frequency, Research on Speech Perception Report #11, Indiana University.
Orlov, J.K. & Chitashvili, R.Y. 1983. Gener- 
alized Z-distribution generating the well-known 
'rank-distributions', Bulletin of the Academy of 
Sciences, Georgia 110.2, 269-272. 
Paivio, A., Yuille, J.C. & Madigan, S. 1968. 
Concreteness, Imagery and Meaningfulness Val- 
ues for 925 Nouns. Journal of Experimental Psychology Monograph 76, 1, Pt. 2.
Reder, L.M., Anderson, J.R. & Bjork, R.A. 
1974. A Semantic Interpretation of Encoding 
Specificity. Journal of Experimental Psychology
102: 648-656. 
Rouault, A. 1978. Loi de Zipf et sources markoviennes, Ann. Inst. H. Poincaré 14, 169-188.
Sichel, H.S. 1975. On a Distribution Law for 
Word Frequencies. Journal of the American Statistical Association 70, 542-547.
Simon, H.A. 1955. On a class of skew distri- 
bution functions, Biometrika 42, 435-440. 
Simon, H.A. 1960. Some further notes on a 
class of skew distribution functions, Information 
and Control 3, 80-88. 
Zipf, G.K. 1935. The Psycho-Biology of Language, Boston, Houghton Mifflin.
