<?xml version="1.0" standalone="yes"?>
<Paper uid="C80-1091">
  <Title>A MATHEMATICAL MODEL OF THE VOCABULARY-TEXT RELATION</Title>
  <Section position="1" start_page="0" end_page="603" type="abstr">
    <SectionTitle>
A MATHEMATICAL MODEL OF THE VOCABULARY-TEXT RELATION
Juhan Tuldava
</SectionTitle>
    <Paragraph position="0"> Tartu, Estonia, USSR. A new method for calculating vocabulary size as a function of text length is discussed. Vocabulary growth is treated as a probabilistic process governed by the principle of &quot;the restriction of variety&quot; of lexics. Proceeding from the basic model of the vocabulary-text relation, a formula with good descriptive power is constructed. The statistical fit and the possibilities of extrapolation beyond the limits of observable data are illustrated on material from several languages belonging to different typological groups.</Paragraph>
    <Paragraph position="1"> 1. There have been a great number of attempts to construct an appropriate mathematical model expressing the dependence of the size of vocabulary (V) on the size of text (N). This is not only of practical importance for resolving a series of problems in the automatic processing of texts, but is also connected with the theoretical explanation of some important aspects of text generation. In practice one often makes use of various empirical formulae which describe the growth of vocabulary with sufficient precision for particular texts and languages [1], though such formulae do not have any general significance. Of special interest are some &quot;complex&quot; models derived from theoretical considerations, e.g., from the hypothesis of the lognormal distribution of words in a text (Carroll) [2], or by deducing the relation between V and N from other important quantitative characteristics of text such as Zipf's law and Yule's distribution (Kalinin, Orlov) [3]. The author underlines the importance of these conceptions for the theory of quantitative linguistics as a whole, but points out their insufficiency in solving some practical linguo-statistical problems where greater exactness and reliability are needed (style-statistical analysis, text attribution, extrapolation beyond the limits of observable data, etc.).</Paragraph>
    <Paragraph position="2"> 2. Instead of the &quot;complex&quot; models a &quot;direct&quot; method is proposed, in which the relation between V and N is regarded as the primary component with its own immanent properties in the statistical organization of text. The relation between V and N has to be analyzed against the background of some essential inner factors of text generation. The dynamics of vocabulary growth is considered as the result of the interaction of several linguistic and extralinguistic factors which, in an integral way, are governed by the principle of &quot;the restriction of variety&quot; of lexics (an analogue of the principle of the decrease of entropy in self-regulating systems). The concept of the variety of lexics is defined as the relation between the size of vocabulary and the size of text in the form of V/N (type-token ratio, or coefficient of variety) or N/V (average frequency of word occurrences).</Paragraph>
    <Paragraph position="3"> The coefficient of variety is supposed to be correlated with the probabilistic process of choosing &quot;new&quot; (unused) and &quot;old&quot; (already used in the text) words at each stage of text generation. The steady decrease of the degree of variety V/N = p is accompanied by the increase of its counterpart (N - V)/N = 1 - V/N = q (p + q = 1), which can be interpreted as the &quot;pressure of recurrency&quot; of words in real texts (analogous to the concept of redundancy in the theory of information).</Paragraph>
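The quantities p = V/N and q = 1 - p defined above can be traced token by token. The following is a minimal sketch, not the author's procedure: the function name variety_profile and the toy sentence are invented for illustration, and tokens are taken to be whitespace-separated word forms.

```python
def variety_profile(tokens):
    """Return (N, V, p, q) after each token, with p = V/N and q = 1 - p."""
    seen = set()          # word types encountered so far
    profile = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)   # an "old" word leaves the set unchanged
        v = len(seen)
        p = v / n         # coefficient of variety (type-token ratio)
        profile.append((n, v, p, 1.0 - p))
    return profile


# Toy token sequence, invented for illustration only (not data from the paper).
toy_tokens = "the cat saw the dog and the dog saw the cat".split()
for n, v, p, q in variety_profile(toy_tokens):
    print(f"N={n:2d}  V={v}  p={p:.3f}  q={q:.3f}")
```

Each repeated word lowers p and raises q, which is the "pressure of recurrency" in miniature.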
    <Paragraph position="5"> 3. The formulae of the relation between V and N are constructed from the basic models: V = Np or V = N(1 - q).</Paragraph>
    <Paragraph position="6"> For this purpose the quantitative changes of V/N = p depending on the size of text are analyzed. According to the initial hypothesis, the relation between V/N and N is approximated by a power function of the type V/N = aN^B (a and B are constants; B &lt; 0), which leads to the well-known formula of G. Herdan [4]: V = aN^b (where b = B + 1). A verification shows good agreement with empirical data in the initial stages of text formation (up to about 5,000 tokens, which corresponds to a short communication).</Paragraph>
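One common way to estimate the constants of the power formula V = aN^b is to regress ln V on ln N. The sketch below assumes that approach (the text does not say how the constants were obtained at this step); the helper names fit_line and fit_herdan are hypothetical, and the data points are synthetic, generated from assumed constants a = 2.0 and b = 0.8 rather than taken from the paper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_herdan(ns, vs):
    """Fit V = a * N**b via ln V = ln a + b * ln N; returns (a, b)."""
    log_n = [math.log(n) for n in ns]
    log_v = [math.log(v) for v in vs]
    intercept, slope = fit_line(log_n, log_v)
    return math.exp(intercept), slope


# Synthetic points from assumed constants (illustration only, not paper data).
ns = [100, 500, 1000, 2000, 5000]
vs = [2.0 * n ** 0.8 for n in ns]
a, b = fit_herdan(ns, vs)
print(f"a = {a:.4f}, b = {b:.4f}, B = b - 1 = {b - 1:.4f}")
```

The recovered exponent of V/N is B = b - 1, which is negative whenever vocabulary grows more slowly than the text.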
    <Paragraph position="7"> Later on, the rate at which the degree of variety (V/N) diminishes gradually slows down (due to the rise of new themes in the course of text generation). Accordingly, the initial formula has to be modified, and this can be done by logarithmization of the variables. The first attempt gives us ln(V/N) = aN^B, which leads to some variants of the Weibull distribution. This kind of distribution shows good agreement with the empirical data within the boundaries of a text of medium length, but it is not good for extrapolation. Only after balancing the initial formula by logarithmizing both variables do we obtain ln(V/N) = a(ln N)^B and the corresponding formula for expressing the relation between V and N:</Paragraph>
    <Paragraph position="9"> V = N e^{a(ln N)^B}, which is regarded as the most adequate formula for solving our problems. The constants a and B (which, of course, are not identical with those of the previously mentioned formulae) may be determined on the basis of linearization:</Paragraph>
    <Paragraph position="11"> ln ln(N/V) = ln(-a) + B ln ln N (a is negative, since V &lt; N), with the constants estimated by the method of least squares. In principle two empirical points would suffice to calculate the values of the constants, but for greater reliability more points are needed.</Paragraph>
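The linearization can be sketched in code. Taking ln(V/N) = a(ln N)^B as written, a must be negative because V is smaller than N, so ln ln(N/V) = ln(-a) + B ln ln N is a straight line in ln ln N. The names fit_line, fit_tuldava and predict_vocabulary are hypothetical, and the points are synthetic, generated from assumed constants a = -0.01 and B = 2.0; none of this reproduces the paper's tables.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_vocabulary(n, a, big_b):
    """V = N * exp(a * (ln N)**B), with a negative."""
    return n * math.exp(a * math.log(n) ** big_b)


def fit_tuldava(ns, vs):
    """Estimate (a, B) from ln ln(N/V) = ln(-a) + B * ln ln N.

    Requires N greater than e and V strictly smaller than N, so that
    both double logarithms are defined.
    """
    xs = [math.log(math.log(n)) for n in ns]
    ys = [math.log(math.log(n / v)) for n, v in zip(ns, vs)]
    intercept, slope = fit_line(xs, ys)
    return -math.exp(intercept), slope


# Synthetic points from assumed constants (illustration only, not paper data).
ns = [1_000, 10_000, 100_000, 1_000_000]
vs = [predict_vocabulary(n, -0.01, 2.0) for n in ns]
a_fit, b_fit = fit_tuldava(ns, vs)
print(f"a = {a_fit:.5f}, B = {b_fit:.5f}")
print(f"extrapolated V at N = 10^7: {predict_vocabulary(10**7, a_fit, b_fit):.0f}")
```

With exact synthetic input the fit recovers the assumed constants; with real counts the residuals of the straight-line fit indicate how well the formula describes the text.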
    <Paragraph position="12"> 4. The good descriptive power of the given function and the possibilities of extrapolation in both directions (from the beginning up to a text of about N = 10^7) have been verified on the basis of experimental material taken from several languages belonging to different typological groups (Estonian, Kazakh, Latvian, Russian, Polish, Czech, Rumanian, English). The function may be applied to the analysis of individual texts as well as composite homogeneous (similar) texts, and the size of vocabulary (V) may be determined by counting either word forms or lexemes. (See Tables 1 and 2.) This seems to corroborate the assumption about the existence of a universal law (presumably of phylogenetic origin) which governs the process of text formation on the quantitative level.</Paragraph>
    <Paragraph position="13"> Table 1. The empirical size (V) and the theoretical size (V') of vocabulary plotted against the length of the text (N). The formula:</Paragraph>
  </Section>
</Paper>