<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0106">
  <Title>Relating Turing's Formula and Zipf's Law</Title>
  <Section position="4" start_page="770" end_page="770" type="metho">
    <SectionTitle>
2 An Asymptote for Turing's Formula
</SectionTitle>
    <Paragraph position="0"> Turing's formula reestimates population frequencies locally:</Paragraph>
    <Paragraph position="2"> where X is the count of the most populous species and $f_x$ is the relative frequency of any species with frequency count x.</Paragraph>
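    <Paragraph> The local reestimation can be sketched in a few lines of Python. The displayed equation is missing from this extraction, so the sketch assumes the standard form of Turing's formula, $x^* = (x + 1) N_{x+1} / N_x$ with $f_x = x^* / N$, where $N_x$ is the number of species with count x and N is the total number of observations. The function name and the toy counts are hypothetical.

```python
from collections import Counter

def turing_reestimate(species_counts):
    """Return {x: (x_star, f_x)} for each count x whose successor count x + 1 is observed.

    Assumes x* = (x + 1) * N_{x+1} / N_x and f_x = x* / N.
    """
    n_total = sum(species_counts)              # N: total number of observations
    count_of_counts = Counter(species_counts)  # N_x: number of species seen x times
    result = {}
    for x in sorted(count_of_counts):
        n_next = count_of_counts.get(x + 1, 0)
        if n_next == 0:
            continue  # no species with count x + 1, so the estimate is skipped
        x_star = (x + 1) * n_next / count_of_counts[x]
        result[x] = (x_star, x_star / n_total)
    return result

# Toy population: four species seen once, two seen twice, one seen three times.
counts = [1, 1, 1, 1, 2, 2, 3]
estimates = turing_reestimate(counts)
```

The largest observed count receives no estimate here because $N_{x+1}$ is zero for it; practical implementations usually smooth the $N_x$ values first, which this sketch does not attempt.</Paragraph>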
    <Paragraph position="3"> Let r(x) be the rank of the last species with frequency count x. This means that quite in</Paragraph>
    <Paragraph position="5"/>
    <Section position="1" start_page="770" end_page="770" type="sub_section">
      <SectionTitle>
2.1 A continuum approximation
</SectionTitle>
      <Paragraph position="0"> We first make a continuum approximation by extending $N_x$ from the integer points x = 1, 2, ...</Paragraph>
      <Paragraph position="1"> to a continuous function N(x) on $[1, \infty)$. This means that</Paragraph>
      <Paragraph position="3"> Differentiating this w.r.t. $x$, the lower bound of the integral, yields $\frac{dr(x)}{dx} = \frac{d}{dx} \int_x^{\infty} N(y)\,dy = -N(x)$, and using the chain rule for differentiation yields</Paragraph>
      <Paragraph position="5"> Continuum approximations are useful techniques for establishing the dependence of a sum on its bounds, to the leading term, and for determining convergence. For example, if we wish to study the sum $\sum_{k=1}^{n} k^2$, we note that the corresponding integral $\int_1^n x^2\,dx = \frac{n^3 - 1}{3}$ and conclude</Paragraph>
      <Paragraph position="7"> the leading coefficient right. Likewise, we can establish for what values of $\alpha$ the sum $\sum_{k=1}^{\infty} k^{\alpha}$ converges by explicitly calculating $\int_1^{\infty} x^{\alpha}\,dx = \left[\frac{x^{\alpha+1}}{\alpha+1}\right]_1^{\infty}$ for $\alpha \neq -1$, indicating that the integral, and thus the sum, converge for $\alpha &lt; -1$ and diverge for $\alpha &gt; -1$. We have to be a bit careful with the transition to the continuous case. We will first let N x become large and then establish what happens for small, but non-zero, values of $f = \frac{x}{N}$. So although x will be small compared to N, it will be large compared to any constant C. This means that</Paragraph>
      <Paragraph position="9"> for any additive constant C, and we may approximate x + C with x, motivating and similar approximations in the following.</Paragraph>
      <Paragraph position="11"/>
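      <Paragraph> The two uses of the continuum approximation described in this subsection can be checked numerically. The following is a minimal sketch; the function names and the chosen values of n and $\alpha$ are illustrative assumptions, not part of the original text.

```python
def power_sum(n, alpha):
    """Sum of k**alpha for k = 1..n."""
    return sum(k ** alpha for k in range(1, n + 1))

def power_integral(n, alpha):
    """Integral of x**alpha over [1, n]; valid for alpha != -1."""
    return (n ** (alpha + 1) - 1) / (alpha + 1)

# Leading-term check: the sum of squares and its integral agree to leading order.
n = 1000
ratio = power_sum(n, 2) / power_integral(n, 2)

# Convergence check: compare how much the partial sums still grow
# between n = 10000 and n = 20000 on either side of alpha = -1.
tail_convergent = power_sum(20000, -2.0) - power_sum(10000, -2.0)
tail_divergent = power_sum(20000, -0.5) - power_sum(10000, -0.5)
```

The ratio approaches 1 as n grows, while the tail for the exponent below -1 shrinks towards zero and the tail for the exponent above -1 keeps growing.</Paragraph>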
    </Section>
    <Section position="2" start_page="770" end_page="770" type="sub_section">
      <SectionTitle>
2.2 The asymptotic distribution
</SectionTitle>
      <Paragraph position="0"> For an ideal Turing population, we would have $x = x^*$. This gives us the recurrence equation</Paragraph>
      <Paragraph position="2"> implying that there are equally many inhabitants for frequency count x as for frequency count x + 1. This introduces several additional constraints, namely</Paragraph>
      <Paragraph position="4"> We are now prepared to derive the asymptotic behavior of the relative frequency f(r) of species as a function of their rank r implicit in Eq. (4). Combining Eq. (5) with Eq. (3) yields $\frac{dr}{df} = -N(x) \cdot N = -\frac{N_1}{x} \cdot N = -\frac{N_1}{f}$. This determines the rank r(f) as a function of the relative frequency f:</Paragraph>
      <Paragraph position="6"> Inverting this gives us the sought-for function f(r):</Paragraph>
      <Paragraph position="8"> Utilizing the fact that the relative frequencies should be normalized to one, we find that</Paragraph>
      <Paragraph position="10"> and that thus "Turing's asymptotic law" is</Paragraph>
      <Paragraph position="12"> Upon examining the frequency function $\frac{1}{N_1} e^{-\frac{r-1}{N_1}}$, we realize that we have an exponential distribution with intensity parameter $\frac{1}{N_1}$, the probability of the most common species. This distribution was created by approximating our original discrete distribution with a continuous one. The discrete counterpart of an exponential distribution is a geometric distribution</Paragraph>
      <Paragraph position="14"> parameterized by p, the probability of some outcome occurring in one trial. P(r) can then be interpreted as the probability of waiting r trials for the first occurrence of the outcome.</Paragraph>
      <Paragraph position="15"> Thus, Turing's formula seems to be smoothing the frequency estimates towards a geometric distribution.</Paragraph>
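      <Paragraph> The relation between the exponential asymptote and the geometric distribution can be illustrated numerically. This sketch assumes the reading $f(r) = \frac{1}{N_1} e^{-(r-1)/N_1}$ for the frequency function and the usual geometric form $P(r) = p(1 - p)^{r-1}$; the value of n1 and the choice $p = 1 - e^{-1/N_1}$, which matches the decay ratio of the exponential, are assumptions made for illustration.

```python
import math

def turing_asymptote(r, n1):
    """Assumed reading of the asymptote: f(r) = exp(-(r - 1) / n1) / n1."""
    return math.exp(-(r - 1) / n1) / n1

def geometric(r, p):
    """Geometric distribution: P(r) = p * (1 - p)**(r - 1)."""
    return p * (1 - p) ** (r - 1)

n1 = 50.0                      # hypothetical value of N_1
p = 1 - math.exp(-1 / n1)      # gives 1 - p = exp(-1 / n1), the same decay per rank

# The two functions differ only by a rank-independent factor.
ratios = [turing_asymptote(r, n1) / geometric(r, p) for r in (1, 10, 100)]

# The geometric probabilities sum to one over the ranks.
mass = sum(geometric(r, p) for r in range(1, 5000))
```

Because the ratio is the same at every rank, the continuous asymptote and its discrete geometric counterpart have the same shape and differ only in normalization, and that difference vanishes as $N_1$ grows.</Paragraph>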
    </Section>
    <Section position="3" start_page="770" end_page="770" type="sub_section">
      <SectionTitle>
2.3 Rederiving Turing's formula
</SectionTitle>
      <Paragraph position="0"> To test our derivation of the asymptotic equation (7) from the recurrence equation (4), we will attempt to rederive Eq. (4) from Eq. (7). Since Eq. (7) implies Eq. (6), we start from the latter and establish that  further note that if $A \leq h(y) \leq B$ on (a, b), then $A(b - a) \leq \int_a^b h(y)\,dy \leq B(b - a)$. Hence</Paragraph>
      <Paragraph position="2"> We have thus proved that \]Nx+l x \[ 1 Nx x+l &lt; ~ and since we assume that x &gt;&gt; 1, this</Paragraph>
      <Paragraph position="4"/>
    </Section>
  </Section>
  <Section position="5" start_page="770" end_page="770" type="metho">
    <SectionTitle>
3 A Reestimation Formula for Zipf's Law
</SectionTitle>
    <Paragraph position="0"> Zipf's law concerns the asymptotic behavior of the relative frequencies f(r) of a population as a function of rank r. It states that, asymptotically, the relative frequency is inversely proportional to rank: A</Paragraph>
    <Paragraph position="2"> This implies a finite total population, since the cumulative (i.e., the sum or integral) of the relative frequency over rank does not converge as rank approaches infinity:</Paragraph>
    <Paragraph position="4"> To localize Zipf's law, we utilize Eq. (2) and observe that r(x) = in the continuous case</Paragraph>
    <Paragraph position="6"> which is deceptively similar to Turing's formula, Eq. (1), the only difference being that it assigns a factor of $\frac{x+2}{x+1}$ more relative-frequency mass to frequency count x.</Paragraph>
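    <Paragraph> A small exact-arithmetic check makes the localized form concrete. The displayed equations are missing from this extraction, so the sketch rests on stated assumptions: Zipf's law is taken as f(r) = A / r, so that with f = x / N the rank of the last species with count x is r(x) = A N / x; the number of species with count exactly x is taken as r(x) - r(x + 1); and the reestimate is read as $(x + 2) N_{x+1} / N_x$. The function names and the scale parameter, which stands for the product A N, are hypothetical.

```python
from fractions import Fraction

def zipf_rank(x, scale=1):
    """Rank of the last species with count x under Zipf's law: r(x) = scale / x."""
    return Fraction(scale, x)

def zipf_count_of_counts(x, scale=1):
    """N_x = r(x) - r(x + 1): number of species with count exactly x."""
    return zipf_rank(x, scale) - zipf_rank(x + 1, scale)

def zipf_reestimate(x, scale=1):
    """(x + 2) * N_{x+1} / N_x, the assumed Zipf analogue of Turing's x*."""
    return (x + 2) * zipf_count_of_counts(x + 1, scale) / zipf_count_of_counts(x, scale)

# For a population that follows Zipf's law exactly, the reestimate returns x itself.
checks = [zipf_reestimate(x) == x for x in range(1, 50)]
```

Under these assumptions every entry of checks is true, which mirrors the role that x = x* plays for an ideal Turing population.</Paragraph>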
    <Section position="1" start_page="770" end_page="770" type="sub_section">
      <SectionTitle>
3.1 Rederiving Zipf's law
</SectionTitle>
      <Paragraph position="0"> If we rederive the asymptotic behavior, we again obtain Zipf's law. Assuming the recurrence</Paragraph>
      <Paragraph position="2"> We again use the equation for the derivative of the rank, Eq. (3), but now</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="770" end_page="770" type="metho">
    <SectionTitle>
4 A General Correspondence
</SectionTitle>
    <Paragraph position="0"> If we generalize the rederivation of Zipf's law in Eq. (11) to p = 2, 3,..., x, we find that</Paragraph>
    <Paragraph position="2"> results in the asymptote 1</Paragraph>
    <Paragraph position="4"> The key observation here is that also for real-valued $\theta &lt; x$ in general, $\frac{x!}{\prod_{k=1}^{x}(k + \theta)} \approx \frac{C}{(x + 1)^{\theta}}$. This means that we have a single reestimation equation</Paragraph>
    <Paragraph position="6"> parameterized by the real-valued parameter $\theta$, with the asymptotic behavior $f(r) = C r^{-\frac{1}{\theta - 1}}$ for $\theta \neq 1$ and $f(r) = C e^{-\frac{r}{N_1}}$ for $\theta = 1$ (15). Although this correspondence was derived with the requirement that $\theta &lt; x$, we can in view of the discussion in Section 2.1 assume that x is not only considerably larger than 1, but also greater than any fixed value of $\theta$. The extension to the negative real numbers is straightforward, although perhaps not very sensible. In fact, the convergence region for the cumulative of the frequency function as rank goes to infinity, $\sum_{r=1}^{\infty} f(r)$ or $\int_1^{\infty} f(r)\,dr$, is $\theta \in [1, 2)$, establishing Turing's formula and Zipf's law as the two extremes of this reestimation formula, in terms of resulting in a proper probability distribution for infinite populations; while the former does so, the latter does not.</Paragraph>
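    <Paragraph> The key observation above can be probed numerically. This sketch assumes the garbled expression reads as x! divided by the product of (k + theta) for k = 1..x, and that it behaves like a constant over (x + 1) to the power theta; the product is evaluated through the identity prod_{k=1..x} (k + theta) = Gamma(x + 1 + theta) / Gamma(1 + theta). The function names and the value of theta are illustrative assumptions.

```python
import math

def count_profile(x, theta):
    """x! / prod_{k=1..x} (k + theta), computed via log-gamma for numerical stability."""
    return math.exp(
        math.lgamma(x + 1) + math.lgamma(1 + theta) - math.lgamma(x + 1 + theta)
    )

def scaled_profile(x, theta):
    """count_profile times (x + 1)**theta; levels off if the power-law form holds."""
    return count_profile(x, theta) * (x + 1) ** theta

theta = 1.5   # hypothetical value between the Turing and Zipf cases
values = [scaled_profile(x, theta) for x in (10, 100, 1000, 10000)]
limit = math.gamma(1 + theta)   # constant that the scaled values approach
```

The scaled values settle towards Gamma(1 + theta) as x grows, so under this reading the constant C is Gamma(1 + theta) and the approximation improves for large x, in line with the assumption that x exceeds any fixed value of the parameter.</Paragraph>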
    <Section position="1" start_page="770" end_page="770" type="sub_section">
      <SectionTitle>
4.1 Reversing the directions
</SectionTitle>
      <Paragraph position="0"> Finally, assuming the asymptotic behavior of Eq. (13), we rederive the recurrence equation (12).</Paragraph>
      <Paragraph position="1"> The mathematics are very similar to those used to rederive Turing's formula in Section 2.3.</Paragraph>
      <Paragraph position="3"> This recaptures Eq. (12). Note that the derivation of Zipf's recurrence equation in Eq. (9) of Section 3 corresponds to the special case where $\alpha = 1$, i.e., where $\theta = 2$.</Paragraph>
    </Section>
  </Section>
</Paper>