Relating Turing's Formula and Zipf's Law 
Christer Samuelsson 
Universität des Saarlandes, FR 8.7, Computerlinguistik
Postfach 1150, D-66041 Saarbrücken, Germany
christer@coli.uni-sb.de
Abstract 
An asymptote is derived from Turing's local reestimation formula for population
frequencies, and a local reestimation formula is derived from Zipf's law for the
asymptotic behavior of population frequencies. The two are shown to be qualitatively
different asymptotically, but nevertheless to be instances of a common class
of reestimation-formula-asymptote pairs, in which they constitute the upper and
lower bounds of the convergence region of the cumulative of the frequency function,
as rank tends to infinity. The results demonstrate that Turing's formula is
qualitatively different from the various extensions to Zipf's law, and suggest that it
smooths the frequency estimates towards a geometric distribution.
1 Introduction 
Turing's formula \[Good 1953\] and Zipf's law \[Zipf 1935\] indicate how population frequencies in 
general tend to behave. Turing's formula estimates locally what the frequency count of a species 
that occurred r times in a sample really would have been, had the sample accurately reflected 
the underlying population distribution. Zipf's law prescribes the asymptotic behavior of the 
relative frequencies of species as a function of their rank. The ranking scheme in question 
orders the species by frequency, with the most common species ranked first. The reason 
that these formulas are of interest in computational linguistics is that they can be used to 
improve probability estimates from relative frequencies, and to predict the frequencies of unseen 
phenomena, e.g., the frequency of previously unseen words encountered in running text. 
Due to limitations in the amount of available training data, the so-called sparse-data
problem, estimating probabilities directly from observed relative frequencies may not always be
very accurate. For this reason, Turing's formula, in the incarnation of Katz's back-off scheme
\[Katz 1987\], has become a standard technique for improving parameter estimates for
probabilistic language models used by speech recognizers. A more theoretical treatment of Turing's
formula itself can be found in \[Nádas 1985\].
Zipf's law is commonly regarded as an empirically accurate description of a wide variety
of (linguistic) phenomena, but too general to be of any direct use. For a bit of historic
controversy on Zipf's law, we refer to \[Simon 1955\], \[Mandelbrot 1959\], and subsequent articles in
Information and Control. The model presented there for the stochastic source generating the
various Zipfian distributions is however linguistically highly dubious: a version of the
monkey-with-typewriter scenario.
The remainder of this article is organized as follows. In Section 2, we induce a recurrence
equation from Turing's local reestimation formula and from this derive the asymptotic behavior
of the relative frequency as a function of rank, using a continuum approximation. The
resulting probability distribution is then examined, and we rederive the recurrence equation
from it. In Section 3, we start with the asymptotic behavior stipulated by Zipf's law, and
derive a recurrence equation similar to that associated with Turing's formula, and from this
induce a corresponding reestimation formula. We then rederive the Zipfian asymptote from
the established recurrence equation. In Section 4, similar techniques are used to establish the
asymptotic behavior inherent in a general class of recurrence equations, parameterized by a
real-valued parameter, and then to rederive the recurrence equations from their asymptotes.
The convergence region of this parameter for the cumulative of the frequency function, as rank
approaches infinity, is also investigated. In Section 5, we summarize the results, discuss how
they might be used practically, and compare them with related work.
2 An Asymptote for Turing's Formula 
Turing's formula reestimates population frequencies locally: 
$$x^* = (x+1) \cdot \frac{N_{x+1}}{N_x} \tag{1}$$
Here $N_x$ is the number of species with frequency count $x$, and $x^*$ is the improved estimate of
$x$. Let $N$ be the size of the entire population and note that
$$N = \sum_{x=1}^{X} x \cdot N_x \quad \text{and} \quad f_x = \frac{x}{N}$$
where $X$ is the count of the most populous species and $f_x$ is the relative frequency of any
species with frequency count $x$.
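As a minimal sketch of how Eq. (1) is applied to data, the following Python function tabulates the counts-of-counts and reestimates each observed frequency count. The function name `turing_reestimate` and the sample counts are illustrative assumptions, not part of the original article.

```python
from collections import Counter

def turing_reestimate(counts):
    """Map each observed count x to x* = (x + 1) * N_{x+1} / N_x (Eq. 1)."""
    # freq_of_freq[x] is N_x, the number of species observed exactly x times.
    freq_of_freq = Counter(counts)
    reestimated = {}
    for x, n_x in sorted(freq_of_freq.items()):
        n_next = freq_of_freq.get(x + 1, 0)  # N_{x+1}; zero if no species has count x + 1
        reestimated[x] = (x + 1) * n_next / n_x
    return reestimated

# Hypothetical sample: one frequency count per species.
sample_counts = [1] * 6 + [2] * 3 + [3] * 2 + [4]
print(turing_reestimate(sample_counts))
```

With these counts, the raw formula leaves the counts 1 and 2 unchanged, lowers 3 to 2, and assigns 0 to the largest count, since no species has count 5. The sketch applies Eq. (1) literally and does not smooth the counts-of-counts.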
Let $r(x)$ be the rank of the last species with frequency count $x$. This means that quite in
general
$$r(x) = \sum_{k=x}^{X} N_k$$
$$N_x = \sum_{k=x}^{X} N_k - \sum_{k=x+1}^{X} N_k = r(x) - r(x+1) \tag{2}$$
2.1 A continuum approximation 
We first make a continuum approximation by extending $N_x$ from the integer points $x = 1, 2, \ldots$
to a continuous function $N(x)$ on $[1, \infty)$. This means that
$$r(x) = \sum_{k=x}^{X} N_k \approx \int_x^X N(y)\,dy$$
Differentiating this w.r.t. $x$, the lower bound of the integral, yields
$$\frac{dr(x)}{dx} = \frac{d}{dx} \int_x^X N(y)\,dy = -N(x)$$
and using the chain rule for differentiation yields
$$\frac{dr}{df} = \frac{dr}{dx} \cdot \frac{dx}{df} = -N(x) \cdot N \tag{3}$$
Continuum approximations are useful techniques for establishing the dependence of a sum
on its bounds, to the leading term, and for determining convergence. For example, if we wish to
study the sum $\sum_{k=1}^{n} k^2$, we note that the corresponding integral
$\int_1^n x^2\,dx = \frac{n^3 - 1}{3}$ and conclude that the sum behaves like $n^3$. The exact
formula is $\frac{2n^3 + 3n^2 + n}{6}$, so we in fact even got the leading coefficient right.
Likewise, we can establish for what values of $\alpha$ the sum $\sum_{k=1}^{\infty} k^{\alpha}$
converges by explicitly calculating
$$\int_1^{\infty} x^{\alpha}\,dx = \left[ \frac{x^{\alpha+1}}{\alpha+1} \right]_1^{\infty} \quad \text{for } \alpha \neq -1$$
indicating that the integral, and thus the sum, converge for $\alpha < -1$ and diverge for $\alpha > -1$.
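The sum-of-squares example can be checked numerically. The short Python sketch below compares the exact sum with the integral for growing n; the helper names are illustrative assumptions.

```python
def sum_of_squares(n):
    """Exact value of 1^2 + 2^2 + ... + n^2."""
    return sum(k * k for k in range(1, n + 1))

def integral_of_squares(n):
    """Closed form of the integral of x^2 from 1 to n, i.e. (n^3 - 1) / 3."""
    return (n ** 3 - 1) / 3

for n in (10, 100, 1000):
    exact = sum_of_squares(n)
    approx = integral_of_squares(n)
    # The ratio approaches 1, so the integral captures the leading term n^3 / 3.
    print(n, exact, approx, exact / approx)
```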
We have to be a bit careful with the transition to the continuous case. We will first let $N$
become large and then establish what happens for small, but non-zero, values of $f = \frac{x}{N}$. So
although $x$ will be small compared to $N$, it will be large compared to any constant $C$. This
means that
$$f = \lim_{N \to \infty} \frac{x}{N} = \lim_{N \to \infty} \frac{x + C}{N}$$
for any additive constant $C$, and we may approximate $x + C$ with $x$, motivating
$$\frac{1}{x+1} \approx \frac{1}{x}$$
and similar approximations in the following.
2.2 The asymptotic distribution 
For an ideal Turing population, we would have $x = x^*$. This gives us the recurrence equation
$$N_{x+1} = \frac{x}{x+1} \cdot N_x \tag{4}$$
implying that there are equally many inhabitants for frequency count $x$ as for frequency count
$x + 1$. This introduces several additional constraints, namely
$$x \cdot N_x = 1 \cdot N_1 \quad \text{and thus} \quad N_x = \frac{N_1}{x}$$
$$N = X \cdot N_1 \quad \text{and thus} \quad f_X = \frac{X}{N} = \frac{1}{N_1} \tag{5}$$
We are now prepared to derive the asymptotic behavior of the relative frequency $f(r)$ of
species as a function of their rank $r$ implicit in Eq. (4). Combining Eq. (5) with Eq. (3) yields
$$\frac{dr}{df} = -N(x) \cdot N = -\frac{N_1}{x} \cdot N = -\frac{N_1}{f}$$
This determines the rank $r(f)$ as a function of the relative frequency $f$:
$$r(f) = C - N_1 \ln f \tag{6}$$
Inverting this gives us the sought-for function $f(r)$:
$$f(r) = e^{\frac{C - r}{N_1}} = C' e^{-\frac{r}{N_1}}$$
Utilizing the fact that the relative frequencies should be normalized to one, we find that
$$1 = \int_1^{\infty} f(r)\,dr = C' \cdot N_1 e^{-\frac{1}{N_1}}$$
and that thus "Turing's asymptotic law" is
$$f(r) = \frac{1}{N_1} e^{-\frac{r-1}{N_1}} \tag{7}$$
Note that, reassuringly, the relative frequency of the most populous species, $f_X$, is preserved:
$$f(1) = \frac{1}{N_1} = \frac{X}{N} = f_X$$
Upon examining the frequency function $\frac{1}{N_1} e^{-\frac{r-1}{N_1}}$, we realize that we have an
exponential distribution with intensity parameter $\frac{1}{N_1}$, the probability of the most common
species. This distribution was created by approximating our original discrete distribution with a
continuous one. The discrete counterpart of an exponential distribution is a geometric distribution
$$P(r) = p \cdot (1 - p)^{r-1} \qquad r = 1, 2, \ldots$$
parameterized by p, the probability of some outcome occurring in one trial. P(r) can then 
be interpreted as the probability of waiting r trials for the first occurrence of the outcome. 
Thus, Turing's formula seems to be smoothing the frequency estimates towards a geometric 
distribution. 
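To illustrate how closely the exponential law of Eq. (7) and its geometric counterpart agree, the following Python sketch evaluates both with p set to 1/N_1. The value chosen for N_1 and the function names are assumptions made for illustration only.

```python
import math

def exponential_law(r, n1):
    """Eq. (7): f(r) = (1 / N_1) * exp(-(r - 1) / N_1)."""
    return math.exp(-(r - 1) / n1) / n1

def geometric_law(r, p):
    """Geometric distribution: P(r) = p * (1 - p)**(r - 1) for r = 1, 2, ..."""
    return p * (1 - p) ** (r - 1)

n1 = 100.0   # hypothetical number of species observed once
p = 1 / n1   # probability of the most common species
for r in (1, 10, 100, 500):
    print(r, exponential_law(r, n1), geometric_law(r, p))
```

The two agree at rank 1 and stay close as long as (r - 1) * p^2 is small, because ln(1 - p) is approximately -p for small p.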
2.3 Rederiving Turing's formula 
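To test our derivation of the asymptotic equation (7) from the recurrence equation (4), we will
attempt to rederive Eq. (4) from Eq. (7). Since Eq. (7) implies Eq. (6), we start from the latter
and establish that
$$r(x) = C - N_1 \ln \frac{x}{N}$$
Inserting this into Eq. (2) yields
$$N_x = r(x) - r(x+1) = -N_1 \ln \frac{x}{N} + N_1 \ln \frac{x+1}{N} = N_1 \ln \frac{x+1}{x} = N_1 \ln \left( 1 + \frac{1}{x} \right)$$
This means that
$$\frac{N_{x+1}}{N_x} - \frac{x}{x+1} = \frac{\ln \left( 1 + \frac{1}{x+1} \right)}{\ln \left( 1 + \frac{1}{x} \right)} - \frac{x}{x+1} = \frac{(x+1) \ln \left( 1 + \frac{1}{x+1} \right) - x \ln \left( 1 + \frac{1}{x} \right)}{(x+1) \ln \left( 1 + \frac{1}{x} \right)}$$
We first note that
$$\frac{1}{x+1} \leq \ln \left( 1 + \frac{1}{x} \right) \leq \frac{1}{x}$$
We also note that the numerator can be written as $g(x+1) - g(x)$ for $g(y) = y \ln \left( 1 + \frac{1}{y} \right)$,
which in turn can be written as $\int_x^{x+1} g'(y)\,dy$, i.e., as
$\int_x^{x+1} \left( \ln \left( 1 + \frac{1}{y} \right) - \frac{1}{1+y} \right) dy$. We
further note that if $A \leq h(y) \leq B$ on $(a, b)$, then $A(b - a) \leq \int_a^b h(y)\,dy \leq B(b - a)$. Hence
$$0 \leq \frac{(x+1) \ln \left( 1 + \frac{1}{x+1} \right) - x \ln \left( 1 + \frac{1}{x} \right)}{(x+1) \ln \left( 1 + \frac{1}{x} \right)} = \frac{\int_x^{x+1} \left( \ln \left( 1 + \frac{1}{y} \right) - \frac{1}{1+y} \right) dy}{(x+1) \ln \left( 1 + \frac{1}{x} \right)}$$
$$\leq \int_x^{x+1} \left( \frac{1}{y} - \frac{1}{1+y} \right) dy = \int_x^{x+1} \frac{1}{y(1+y)}\,dy \leq \int_x^{x+1} \frac{1}{y^2}\,dy \leq \frac{1}{x^2}$$
We have thus proved that $\left| \frac{N_{x+1}}{N_x} - \frac{x}{x+1} \right| \leq \frac{1}{x^2}$, and since we assume that $x \gg 1$, this
reestablishes Eq. (4) (to the second power of $\frac{1}{x}$).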
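The bound can also be checked numerically. The Python sketch below is a sanity check rather than a proof: it evaluates the ratio N_{x+1}/N_x implied by N_x = N_1 ln(1 + 1/x) and compares its distance from x/(x+1) with 1/x^2. The function names are illustrative assumptions.

```python
import math

def n_ratio(x):
    """N_{x+1} / N_x for N_x = N_1 * ln(1 + 1/x); the factor N_1 cancels."""
    return math.log1p(1 / (x + 1)) / math.log1p(1 / x)

def gap(x):
    """Difference between the implied ratio and the Turing recurrence x / (x + 1)."""
    return n_ratio(x) - x / (x + 1)

for x in (1, 10, 100, 1000):
    # The gap should lie between 0 and 1 / x^2.
    print(x, gap(x), 1 / x ** 2)
```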
3 A Reestimation Formula for Zipf's Law 
Zipf's law concerns the asymptotic behavior of the relative frequencies f(r) of a population as a 
function of rank r. It states that, asymptotically, the relative frequency is inversely proportional 
to rank: 
$$f(r) = \frac{A}{B + r} \tag{8}$$
This implies a finite total population, since the cumulative (i.e., the sum or integral) of the
relative frequency over rank does not converge as rank approaches infinity:
$$F(r) = \begin{cases} \sum_{k=1}^{r} f(k) & \text{in the discrete case} \\ \int_1^r f(\rho)\,d\rho & \text{in the continuous case} \end{cases}$$
$$\lim_{r \to \infty} F(r) = \lim_{r \to \infty} A \ln(B + r) = \infty$$
To localize Zipf's law, we utilize Eq. (2) and observe that
$$r(x) = \frac{A}{f(x)} - B = \frac{A'}{x} - B$$
$$\frac{N_{x+1}}{N_x} = \frac{r(x+1) - r(x+2)}{r(x) - r(x+1)} = \frac{\frac{A'}{x+1} - \frac{A'}{x+2}}{\frac{A'}{x} - \frac{A'}{x+1}} = \frac{\frac{1}{(x+1) \cdot (x+2)}}{\frac{1}{x \cdot (x+1)}} = \frac{x}{x+2} \tag{9}$$
This suggests "Zipf's local reestimation formula"
$$x^* = (x+2) \cdot \frac{N_{x+1}}{N_x} \tag{10}$$
which is deceptively similar to Turing's formula, Eq. (1), the only difference being that it
assigns $\frac{x+2}{x+1}$ more relative-frequency mass to frequency count $x$.
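The localization step can be reproduced with exact rational arithmetic. In the Python sketch below, the rank function r(x) = A'/x - B is differenced as in Eq. (2), and the resulting ratio and reestimate are printed. The constants `A_PRIME` and `B` and all function names are illustrative assumptions.

```python
from fractions import Fraction

# Illustrative constants (assumptions, not values from the article).
A_PRIME = Fraction(1000)
B = Fraction(3)

def zipf_rank(x):
    """r(x) = A'/x - B: rank of the last species with frequency count x."""
    return A_PRIME / x - B

def species_with_count(x):
    """N_x = r(x) - r(x + 1), as in Eq. (2)."""
    return zipf_rank(x) - zipf_rank(x + 1)

def zipf_reestimate(x):
    """x* = (x + 2) * N_{x+1} / N_x, Zipf's local reestimation formula."""
    return (x + 2) * species_with_count(x + 1) / species_with_count(x)

for x in (1, 2, 5, 10):
    ratio = species_with_count(x + 1) / species_with_count(x)
    print(x, ratio, zipf_reestimate(x))
```

For an ideal Zipfian population the ratio is exactly x/(x+2), so the reestimate returns x itself, mirroring the fixed-point property x = x* of the ideal Turing population.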
3.1 Rederiving Zipf's law 
If we rederive the asymptotic behavior, we again obtain Zipf's law. Assuming the recurrence
equation
$$N_{x+1} = \frac{x}{x+2} \cdot N_x$$
we have that
$$N_{x+1} = \frac{x}{x+2} \cdot N_x = \frac{x \cdot \ldots \cdot 1}{(x+2) \cdot \ldots \cdot 3} \cdot N_1 = \frac{2}{(x+2) \cdot (x+1)} \cdot N_1 \approx \frac{C}{(x+1)^2}$$
We again use the equation for the derivative of the rank, Eq. (3), but now
$$\frac{dr}{df} = -N(x) \cdot N \approx -\frac{C}{x^2} \cdot N = -\frac{C'}{f^2}$$
Integration yields $r = \frac{C'}{f} + C''$ and function inversion
$$f(r) = \frac{C'}{r - C''} \tag{11}$$
Identifying $C'$ with $A$ and $C''$ with $-B$ recovers Eq. (8).
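The telescoping product in this rederivation is easy to verify exactly. The Python sketch below unrolls the recurrence N_{k+1} = k/(k+2) N_k from k = 1 and compares the result with 2/((x+2)(x+1)). The function name is an illustrative assumption.

```python
from fractions import Fraction

def unrolled_ratio(x):
    """N_{x+1} / N_1 obtained by applying N_{k+1} = k / (k + 2) * N_k for k = 1..x."""
    ratio = Fraction(1)
    for k in range(1, x + 1):
        ratio *= Fraction(k, k + 2)
    return ratio

for x in (1, 2, 10, 100):
    closed_form = Fraction(2, (x + 2) * (x + 1))
    print(x, unrolled_ratio(x), closed_form)
```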
4 A General Correspondence 
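If we generalize the rederivation of Zipf's law in Eq. (11) to $p = 2, 3, \ldots, x$, we find that
$$N_{x+1} = \frac{x}{x+p} \cdot N_x = \frac{x \cdot \ldots \cdot 1}{(x+p) \cdot \ldots \cdot (1+p)} \cdot N_1 = \frac{x!}{\prod_{k=1}^{x} (k+p)} \cdot N_1 \approx \frac{C}{(x+1)^p}$$
We integrate $\frac{C}{f^p}$ to get $r(f) = \frac{C'}{f^{p-1}} + C''$, yielding an $r^{-\frac{1}{p-1}}$ asymptote for $f(r)$.
Although a nontrivial generalization, it is in fact the case that for real-valued $\theta$ with $1 \neq \theta < x$,
$$N_{x+1} = \frac{x}{x+\theta} \cdot N_x \tag{12}$$
results in the asymptote¹
$$f(r) = C r^{-\frac{1}{\theta-1}} \tag{13}$$
The key observation here is that also for real-valued $\theta < x$ in general,
$$\frac{x!}{\prod_{k=1}^{x} (k + \theta)} \approx \frac{C}{(x+1)^{\theta}}$$
This means that we have a single reestimation equation
$$x^* = (x + \theta) \cdot \frac{N_{x+1}}{N_x} \tag{14}$$
parameterized by the real-valued parameter $\theta$, with the asymptotic behavior
$$f(r) = \begin{cases} C r^{-\frac{1}{\theta-1}} & \theta \neq 1 \\ C e^{-\frac{r}{N_1}} & \theta = 1 \end{cases} \tag{15}$$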
Although this correspondence was derived with the requirement that $\theta < x$, we can in view
of the discussion in Section 2.1 assume that $x$ is not only considerably larger than 1, but also
greater than any fixed value of $\theta$. The extension to the negative real numbers is
straightforward, although perhaps not very sensible. In fact, the convergence region for the cumulative
of the frequency function as rank goes to infinity,
$$\sum_{r=1}^{\infty} f(r) \quad \text{or} \quad \int_1^{\infty} f(r)\,dr$$
is $\theta \in [1, 2)$, establishing Turing's formula and Zipf's law as the two extremes of this
reestimation formula, in terms of resulting in a proper probability distribution for infinite populations;
while the former does so, the latter does not.
¹ If $\theta = 1$, we have the Turing case with an exponentially declining asymptote, cf. Eq. (7).
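The key approximation, that x! divided by the product of (k + θ) for k = 1..x decays like (x+1)^(-θ), can be examined numerically. The Python sketch below rewrites the product with the gamma function, using the standard identity that the product equals Γ(x+θ+1)/Γ(θ+1), and rescales by (x+1)^θ. The function names and the choice θ = 1.5 are illustrative assumptions.

```python
import math

def unrolled_ratio(x, theta):
    """x! / prod_{k=1..x}(k + theta), i.e. N_{x+1} / N_1, computed via log-gamma."""
    log_ratio = (math.lgamma(x + 1) + math.lgamma(theta + 1)
                 - math.lgamma(x + theta + 1))
    return math.exp(log_ratio)

def scaled_ratio(x, theta):
    """The ratio multiplied by (x + 1)**theta; it levels off if the approximation holds."""
    return unrolled_ratio(x, theta) * (x + 1) ** theta

theta = 1.5  # hypothetical value inside the convergence region [1, 2)
for x in (10, 100, 1000, 100000):
    print(x, scaled_ratio(x, theta))
```

Under this identity the rescaled value settles near Γ(θ+1), which plays the role of the constant C in units of N_1.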
4.1 Reversing the directions 
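Finally, assuming the asymptotic behavior of Eq. (13), we rederive the recurrence equation (12).
The mathematics are very similar to those used to rederive Turing's formula in Section 2.3.
Inverting the asymptotic behavior $f(r) = C r^{-\frac{1}{\theta-1}}$ gives us $r(f) = \frac{C'}{f^{\theta-1}}$, which in turn yields
$$r(x) = \frac{C''}{x^{\theta-1}}$$
For notational convenience, let $\alpha$ denote $\theta - 1$, and assume that $0 < \theta \neq 1$, i.e., $-1 < \alpha \neq 0$.
$$\left| \frac{x}{x+\alpha+1} - \frac{N_{x+1}}{N_x} \right| = \left| \frac{x}{x+\alpha+1} - \frac{r(x+1) - r(x+2)}{r(x) - r(x+1)} \right| = \left| \frac{x}{x+\alpha+1} - \frac{\frac{1}{(x+1)^{\alpha}} - \frac{1}{(x+2)^{\alpha}}}{\frac{1}{x^{\alpha}} - \frac{1}{(x+1)^{\alpha}}} \right|$$
$$= \frac{\left| \left( \frac{x+1+\alpha}{(x+2)^{\alpha}} - \frac{x+1}{(x+1)^{\alpha}} \right) - \left( \frac{x+\alpha}{(x+1)^{\alpha}} - \frac{x}{x^{\alpha}} \right) \right|}{(x+\alpha+1) \left| \frac{1}{x^{\alpha}} - \frac{1}{(x+1)^{\alpha}} \right|}$$
As before, the numerator can be written as $g(x+1) - g(x)$, now for $g(y) = \frac{y+\alpha}{(y+1)^{\alpha}} - \frac{1}{y^{\alpha-1}}$:
$$= \frac{\left| \int_x^{x+1} \left( \frac{\alpha-1}{y^{\alpha}} - \frac{(\alpha-1)y + (\alpha^2-1)}{(y+1)^{\alpha+1}} \right) dy \right|}{(x+\alpha+1) \left| \frac{1}{x^{\alpha}} - \frac{1}{(x+1)^{\alpha}} \right|} = \frac{|\alpha-1| \left| \int_x^{x+1} \left( \frac{1}{y^{\alpha}} - \frac{1}{(y+1)^{\alpha}} - \frac{\alpha}{(y+1)^{\alpha+1}} \right) dy \right|}{(x+\alpha+1) \left| \frac{1}{x^{\alpha}} - \frac{1}{(x+1)^{\alpha}} \right|}$$
$$= \frac{|\alpha-1| \left| \int_x^{x+1} \int_y^{y+1} \left( \frac{\alpha}{z^{\alpha+1}} - \frac{\alpha}{(y+1)^{\alpha+1}} \right) dz\,dy \right|}{(x+\alpha+1) \left| \int_x^{x+1} \frac{\alpha}{y^{\alpha+1}}\,dy \right|} \leq \frac{|\alpha-1| \int_x^{x+1} \left( \frac{1}{y^{\alpha+1}} - \frac{1}{(y+1)^{\alpha+1}} \right) dy}{(x+\alpha+1) \int_x^{x+1} \frac{1}{y^{\alpha+1}}\,dy}$$
$$= \frac{|\alpha^2-1| \int_x^{x+1} \left( \int_y^{y+1} \frac{1}{z^{\alpha+2}}\,dz \right) dy}{(x+\alpha+1) \int_x^{x+1} \frac{1}{y^{\alpha+1}}\,dy} \leq \frac{|\alpha^2-1|}{x+\alpha+1} \cdot \frac{\int_x^{x+1} \frac{1}{y^{\alpha+2}}\,dy}{\int_x^{x+1} \frac{1}{y^{\alpha+1}}\,dy} \leq \frac{|\alpha^2-1|}{x+\alpha+1} \cdot \frac{1}{x} \leq \frac{|\alpha^2-1|}{x^2}$$
This recaptures Eq. (12). Note that the derivation of Zipf's recurrence equation in Eq. (9) of
Section 3 corresponds to the special case where $\alpha = 1$, i.e., where $\theta = 2$.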
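As a numerical sanity check of this bound (not a substitute for the derivation), the Python sketch below writes α for θ - 1, evaluates N_{x+1}/N_x for r(x) proportional to x^(-α), and compares its distance from x/(x+α+1) with |α² - 1|/x². The function names and the sampled values of α are illustrative assumptions.

```python
def n_ratio(x, alpha):
    """N_{x+1} / N_x for r(x) = C'' / x**alpha, using N_x = r(x) - r(x + 1)."""
    numerator = (x + 1) ** -alpha - (x + 2) ** -alpha
    denominator = x ** -alpha - (x + 1) ** -alpha
    return numerator / denominator

def gap(x, alpha):
    """Distance between the implied ratio and the recurrence x / (x + alpha + 1)."""
    return abs(x / (x + alpha + 1) - n_ratio(x, alpha))

def bound(x, alpha):
    """The bound |alpha^2 - 1| / x^2 from the derivation."""
    return abs(alpha ** 2 - 1) / x ** 2

for alpha in (-0.5, 0.5, 1.0, 2.0):
    for x in (1, 10, 100):
        print(alpha, x, gap(x, alpha), bound(x, alpha))
```

For α = 1 the gap vanishes up to rounding error, matching the exact Zipfian ratio x/(x+2).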
5 Conclusions 
The relationship between Turing's formula and Zipf's law, which both concern population
frequencies, was explored in the present article. The asymptotic behavior of the relative frequency
as a function of rank implicit in one interpretation of Turing's local reestimation formula was
derived and compared with Zipf's law. While the latter relates the rank and relative frequency
as asymptotically inversely proportional, the former states that the frequency declines
exponentially with rank. This means that while Zipf's law implies a finite total population, Turing's
formula yields a proper probability distribution also for infinite populations.
In fact, it is tempting to interpret Turing's formula as smoothing the relative-frequency
estimates towards a geometric distribution. This could potentially be used to improve
sparse-data estimates by assuming a geometric distribution (tail), and introducing a ranking based
on direct frequency counts, frequency counts when backing off to more general conditionings,
order of appearance in the training data, or, to break any remaining ties, lexicographical order.
Conversely, a local reestimation formula in the vein of Turing's formula was derived from
Zipf's law. Although the two equations are similar, Turing's formula shifts the frequency mass
towards more frequent species. The two cases were generalized to a single spectrum of
reestimation formulas and corresponding asymptotes, parameterized by one real-valued parameter.
Furthermore, the two cases correspond to the upper and lower bounds of this parameter for
which the cumulative of the frequency function converges as rank tends to infinity.
These results are in sharp contrast to common belief in the field; in \[Baayen 1991\], for
example, we read: "Other models, such as Good (1953) ... have been put forward, all of
which have Zipf's law as some special or limiting form." All of the Zipf-Simon-Mandelbrot
distributions exhibit the same basic asymptotic behavior,
$$f(r) = \frac{C}{r^{\beta}}$$
parameterized by the positive real-valued parameter $\beta$. Comparing this with Eq. (15), we find
that $\beta = \frac{1}{\theta-1} > 0$ and thus $\theta = 1 + \frac{1}{\beta} > 1$. In view of the established exponentially declining
asymptote of the ideal Turing distribution, corresponding to $\theta = 1$, we can conclude that the
latter is qualitatively different.
Acknowledgements 
This article originated from inspiring discussions with David Milward and Slava Katz. Many
thanks! Most of the work was done while the author was visiting IRCS at the University of
Pennsylvania at the invitation of Aravind Joshi, and a number of New York pubs at the
invitation of Jussi Karlgren, both of which were very much appreciated. I wish to thank Mark
Lauer for helpful comments and suggestions for improvement, Seif Haridi for constituting the
entire audience at a seminar on this work and focusing the question session on the convergence
region of the parameter $\theta$, and Åke Samuelsson for providing a bit of mathematical elegance.
I also gratefully acknowledge Rens Bod's encouraging comments and useful pointers to related
work. Special credit is due to Mark Liberman for sharing his insights about Zipf's law, for
drawing my attention to the Simon-Mandelbrot controversy, and for supplying various background
material. This article, like others, has benefited greatly from comments by Khalil Sima'an.

References 
\[Baayen 1991\] Harald Baayen. "A Stochastic Process for Word Frequency Distributions".
In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics,
pp. 271-278, ACL, 1991.
\[Good 1953\] I. J. Good. "The Population Frequencies of Species and the Estimation of
Population Parameters". In Biometrika 40(3-4), pp. 237-264, 1953.
\[Katz 1987\] Slava M. Katz. "Estimation of Probabilities from Sparse Data for the Language
Model Component of a Speech Recognizer". In IEEE Transactions on Acoustics, Speech,
and Signal Processing 35(3), pp. 400-401, 1987.
\[Nádas 1985\] Arthur Nádas. "On Turing's Formula for Word Probabilities". In IEEE
Transactions on Acoustics, Speech, and Signal Processing 33(6), pp. 1414-1416, 1985.
\[Zipf 1935\] G. K. Zipf. The Psychobiology of Language. Houghton Mifflin, Boston, 1935.
The Simon-Mandelbrot Dispute
\[Simon 1955\] Herbert A. Simon. "On a Class of Skew Distribution Functions". In Biometrika
42, pp. 425-440(?), 1955.
\[Mandelbrot 1959\] Benoit Mandelbrot. "A Note on a Class of Skew Distribution Functions:
Analysis and Critique of a Paper by H. A. Simon". In Information and Control 2, pp. 90-99,
1959.
\[Simon 1960\] Herbert A. Simon. "Some Further Notes on a Class of Skew Distribution
Functions". In Information and Control 3, pp. 80-88, 1960.
\[Mandelbrot 1961\] Benoit Mandelbrot. "Final Note on a Class of Skew Distribution Functions:
Analysis and Critique of a Model due to H. A. Simon". In Information and Control 4,
pp. 198-???, 1961.
\[Simon 1961\] Herbert A. Simon. "Reply to 'Final Note' by Benoit Mandelbrot". In Information
and Control 4, pp. 217-223, 1961.
\[Mandelbrot 1961\] Benoit Mandelbrot. "Post Scriptum to 'Final Note'". In Information and
Control 4, pp. 300-304(?), 1961.
\[Simon 1961\] Herbert A. Simon. "Reply to Dr. Mandelbrot's Post Scriptum". In Information
and Control 4, pp. 305-308, 1961.
