A MATHEMATICAL MODEL OF THE VOCABULARY-TEXT RELATION 
JuhanTuldava 
Tartu, Estonia, USSR 
A new method for calculating vo- 
cabulary size as a function of text 
length is discussed. The vocabulary 
growth is treated as a probabilistic 
process governed by the principle of 
"the restriction of variety" of lex- 
ics. Proceeding from the basic model 
of the vocabulary-text relation a 
formula with good descriptive power 
is constructed. The statistical fit 
and the possibilities of extrapola- 
tion beyond the limits of observable 
data are illustrated on the material 
of several languages belonging to dif- 
ferent typological groups. 
by deducing the relation between V and 
N from some other important quantita- 
tive characteristics of text such as 
Zipf's law and Yule's distribution 
(Kalinin, Orlov) 3. The author under- 
lines the importance of these concep- 
tions for the theory of quantitative 
linguistics on the whole, but points 
out their insufficiency in solving 
some practical linguo-statistical 
problems where greater exactness and 
reliability are needed (style-statis- 
tical analysis, text attribution, ex- 
trapolation beyond the limits of ob- 
servable data, etc.). 
i. There are a great number of at- 
tempts to construct an appropriate 
mathematical model which would ex- 
press the dependence of the size of 
the size of vocabulary (V) on the size 
of text (N). This is not only of 
practical importance for the resolu- 
tion of a series of problems in the 
automatic processing of texts, but it 
is also connected with the theoreti- 
cal explanation of some important as- 
pects of text generation. In practice 
one often makes use of various empir- 
ical formulae which describe the 
growth of vocabulary with sufficient 
precision in the case of concrete 
texts and languages l, though such for- 
mulae do not have any general signif- 
icance. Of special interest are some 
"complex" models derived from theo- 
retical considerations, e.g., by bas- 
ing one*s considerations on the hypo- 
thesis about the lognormal distribu- 
tion of words in a text (Carroll) 2 or 
2. Instead of the "complex" mode/s 
a "direct" method is proposed where 
the relation between V and N is re- 
garded as the primary component with 
its own immanent properties in the 
statistical organization of text. The 
relation between V and N has to be 
analyzed ua the background of some es- 
sential inner factors of text genera- 
tion. The dynamics of vocabulary 
growth is considered as the result of 
the interaction of several linguistic 
and extra, linguistic factors which in 
an integral way are governed by the 
principle of "the restriction of va- 
riety" of lexics (an analogue of the 
principle of the decrease of entropy 
in self-regulating systems). The con- 
cept of the variety of lexics is de- 
fined as the relation between the size 
of vocabulary and the size of text in 
the form of V/N (type-token ratio, or 
coefficient of variety) or N/V (aver- 
age frequency of word occurrences). 
--600-- 
The coefficient of variety is sup- 
posed to be correlated with the prob- 
abilistic process of choosing "new" 
(unused) and "old" (already used in 
the text) words at each stage of text 
generation. The steady decrease of the 
degree of variety V/N = p is attended 
by the increase of its counterpart: 
(N-V)/N- I-V/N= q (p, q- 1), 
which can be interpreted as the "pres- 
sure of recurrency" of words in real 
texts (analogous to the concept of re- 
dundancy in the theory of informa- 
tion) : 
~ II 
N- V 
3. The formulae of the relation between 
V and N are constructed from the basic 
models: V = Np or V = N(1 - q). 
For this purpose the quantitative 
changes of V/N = p depending on the 
size of text are analyzed. According 
to the initial hypothesis the rela- 
tion between V/N and N is approxi- 
mated by the power function of the 
type: V/N = aN B (a and B are con- 
stants; B < O), which leads to the 
well-known formula of G. Herdan@: 
V = aN b (where b = B + 1). A verifi- 
cation shows good agreement with em- 
pirical data in the initial stages 
of text formation (in the limits of 
about ~,OOr~ - 5,000 tokens which cor- 
respond to a short communication). 
Later on the rate of the diminishing 
of the degree of variety (V/N) grad- 
ually slows down (due to the rise of 
new themes in the course of text gen- 
eration). Accordingly the initial 
formula has to be modified and this 
can be done by logarithmization of 
the variables. The first attempt gives 
us in (V/N) = aN B, which leads to 
some variants of the Weibull distri- 
bution. This kind of distribution 
shows good agreement with the empir- 
ical data within the boundaries of a 
text of medium length, but it is not 
good for extrapolation. Only after 
balancing the initial formula by the 
logarithmization of both variables we 
obtain in (V/N) = a(ln N) B and the 
corresponding formula for expressing 
the relation between V and N: 
V = Ne -a(ln N) B 
or V = N I - a(ln N)b (where b = 
= B - l) p, which turns out to be the 
most adequate formula for solving our 
problems. The constants a and B 
(which, of course, are not identical 
with those of the previously mention- 
ed formulae) may be determined on the 
basis of linearization: lnln (N/V) = 
= A + B luln N, where A = In a, using 
the method of least squares. In prin- 
ciple it would be sufficient to have 
two empirical points for the calcula- 
tion of the values of the constants 
but for greater reliability more 
points are needed. 
4. The good descriptive power of 
the given function and the possibill- 
601 
ties of extrapolation in both direc- 
tions (from the beginning up to a text 
of about N = lO 7) has been verified on 
the basis of experimental material 
taken from several languages belonging 
to different typological groups (Esto- 
nian, Kazakh, Latvian, Russian, Po- 
lish, Czech, Rumanian, English). The 
function may be applied to the analy- 
sis of individual texts as well as 
composite homogeneous (similar) texts 
and the size of vocabulary (V) may be 
determined by counting either word 
forms of lexemes. (See Tables 1 and 2.) 
This seems to corroborate the assump- 
tion about the existence of a univer- 
sal law (presumably of phylogenetic 
origin) which governs the process of 
text formation on the quantitative 
level. 
Table i 
The empirical size (V) and the teoreti- 
cal size (V') of vocabulary plotted against 
the length of the text (N). The formula: 
V" = Ne -a(ln N) B 
a) Latvian newspapers (lexemes) 6 
N V V" 
50000 7065 7025 
lOOOOO 98~+ 9919 2ooooo 13389 1351o 
300000 16103 15912 
lO 6 - 24000 
lO 7 - 37000 
(a = 0.003736, B = 2.6304) 
b) Czech texts of technical sci- 
ences (word forms) 7 
N V V' 
25000 4829 4827 
75000 9605 9626 
125000 13056 13050 175000 15858 15853 
lO 6 - 40000 
l07 - ll~O00 
(a = 0.01123, B = 2.1539) 
c) Kazakh newspapers (word forms) 8 
N V V, 
25000 9088 9161 
50000 15047 1@875 
I000OO 23895 23523 
150000 29785 50578 
106 - 87000 
lO 7 - 23OOOO 
(a = 0.001372, B = 2.8488) 
d) Polish belles-lettres 
(word forms) 9 
N V V ° 
12172 545@ 3458 
29787 6146 604@ 
@8255 8026 7998 
64510 9250 9398 
106 - 33000 
107 - 6OOO0 
(a = 0.00364, B = 2.6081) 
602 
e) English texts on electronics 
(word forms) l0 
N V V" 
50000 5599 5457 
I00000 7855 7728 
150000 9361 9371 200000 10582 10682 
106 - 20000 
lO 7 - 58000 
(a = 0.009152, B = 2.5057) 
g) Russian texts on electronics 
(word forms) 12 
N V V" 
50000 c~+64 9588 
lO0000 14062 14168 
150000 17265 17805 200000 21468 20818 
106 - 45000 
i07 - 94000 
(a = 0.00@284, B = 2.5058) 
f) Rumanian texts on electronics 
(word forms) II 
N V V" 
50000 6785 6841 
lO0000 10281 10070 
150000 12477 12479 
200000 14292 14454 
106 - 50000 
107 - 68000 
(a = 0.008148, B = 2.5086) 
Table 2 
Prediction on the basis of two empi- 
rical points (marked with an asterisk) 
a) English: literary texts 13 
(word forms) 
N V V ° 
10051 5009~ 5009 
101566 15706 ~ 15709 
Prediction: 
io - 9 
i00 - 78 
i000 - 554 
2000 700-1000 917 
50721 8749 8905 
255558 25655 25447 101~252 50406 49280 
l07 - 140000 
(a = 0.007879, B = 2.2652) 
c) Russian: A. S. Pushkin's "Queen 
of Spades" (lexemes) 15 
N V V" 
lOOO ~62 K _ 462 
2000 787 ~ 787 
Prediction: 
5000 1067 1068 
40OO 1348 1321 
5000 1541 1556 
6000 1752 1776 
6861 1928 1957 
(the whole book) 
(a = 0.01699, B = 1.9747) 
b) Estonian: A. H. Tammsaare's novell4 
"Truth and Justice" I (lexemes) 
N V V" 
i0000 2114 ~ 2114 
20000 5124 m 3124 
Prediction: 
114124 7548 7207 
(the whole book) 
(a = 0.006714, B = 2.4521) 
603 

References 

Kuraszkiewicz, W., Statystyczne 
badanie slownictwa polskich tekst6wIVI 
wieku. In: Z Polskich Studi6w Slawis- 
tycz~ch. Warszawa 1958. 

Gulraud, P., Probl~mes et m~thodes 
de la statistique linguistique. Dor- 
drecht 1959. 

Somers, H. H., Analyse math@matique 
de langa~e. Louvain 1959. 

Miller, W., Wortschatzumfang und 
Textl~uge. In: Muttersprache, 81. Jg., 
Nr. 4. Mannheim-Z~rich 1971. 

Ne~itoj V. V., Dlina teksta i ob'em 
slovarja. In: Metody izu~enijaleksikl. 
Minsk 1975. 

Carroll, J. B., On Sampling From a 
Lognormal Model of Frequency Distri- 
bution. In: Computational Analysis of 
l~esent-Day American English. Provi- 
dence, R. I., 1967. 

Kalinin, V. M., Nekotorye statisti- 
~eskie zakony matemati~eskoj lingvis- 
tiki. In: Problemy kibernetiki, ll, 
1964. 

Orlov, Yu. K., Obob~ennyj zakon 
Zlpfa-Mandelbreta i ~astotnye struk- 
tury informacionnykh edinic razli~nykh 
urovnej. In: Vy~islitelnaja lingvisti- 
ka. Moscow 1976. 

@° Herdan, G., Quantitative Linguis- 
tics. London 1964, 

Tuldava, J., Quantitative Relations 
between the Size of Text and the Size 
of Vocabulary. In: Journal of Linguis- 
tic Calculus. SMIL Quarterly 1977:4. 

Latvie~u valodas bie~uma v~Yrdnica. 
II:l. Ed. T. Jakubaite. Riga 1969. 

Be~ka, J. V., La structure lexicale 
des textes t@chniques en teh~que, In: 
Philologica Pragensia, 1972, vol. 15, 
No. 1. 

Akhabayev, A., Statisti~eskij ana- 
liz leksiko-morfologi~eskoj struktury 
jazyka kazakhskojpublicistiki. Alma- 
Ata 1971. 

Sambor, J., Analiza stosunku "type- 
token", czyli objeto~ci (W) i d~ugo~ci 
tekstu (N). In: Prace filologiczne. 
Tom XX. Warszawa 1970. 

Alekseev, P.M., Leksi~eskaja i 
morfologi6eskaja statistika anglijsko- 
go pod'jazyka elektroniki. In: Statis- 
tika re~i. Leningrad 1968. 

ELan, L. J., Opyt statisti~eskogo 
opisanija nau~no-tekhni~eskogo stilja 
rumynskogo jazyka. Leningrad 1966. 

Kalinina, E. A., Castotnyj slovar" 
russkogo pod'jazyka elektronikl. In: 
Statistika re~i. Leningrad 1968. 

Ku~era H., and Francis, W. N. 
(ed.), ComputationalAnalysis of Pre- 
sent-Day American English. Providence, 
R. I. 1967. 

Villup, A., A. H. Tammsaare romaa- 
ni "T~de ja ~igus" I k6ite autori- ja 
tegelask~ne sageduss~nastik. In: Acta 
et Commentationes Universitatis Tartu- 
ensis. Vol. 446. Tartu 1978. 

0rlov, Yu. K., ~odel" ~astotnoj 
struktury leksiki. In: Issledovanija v 
oblasti vy~islitelSnoj lingvistiki i 
lingvostatistiki. Moscow 1978. 
