<?xml version="1.0" standalone="yes"?> <Paper uid="J96-4001"> <Title>The Effects of Lexical Specialization on the Growth Curve of the Vocabulary</Title> <Section position="2" start_page="0" end_page="456" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> When reading through a text, word token by word token, the number of different word types encountered increases, quickly at first, and ever more slowly as one progresses through the text. The number of different word types encountered after reading N tokens, the vocabulary size V(N), is a function of N. Analytical expressions for V(N) based on the urn model are available. A classic problem in word frequency studies is, however, that these analytical expressions tend to overestimate the observed vocabulary size, irrespective of whether these expressions are nonparametric (Good 1953; Good and Toulmin 1956; Muller 1979; Brunet 1978) or parametric (Sichel 1986; Khmaladze and Chitashvili 1989; Chitashvili and Baayen 1993) in nature.</Paragraph> <Paragraph position="1"> Although the theoretical or expected vocabulary size E[V(N)] is generally of the same order of magnitude as the observed vocabulary size, the lack of precision one observes time and again casts serious doubt on the reliability of a number of measures in word frequency statistics. For instance, Baayen (1989, 1992) and Baayen and Renouf (1996) exploit the Good-Turing estimate for the probability of sampling unseen types (Good 1953; see the sketch at the end of this section) to develop measures for the degree of productivity of affixes; Baayen and Sproat (to appear) apply this Good-Turing estimate to obtain enhanced estimates of lexical priors for unseen words; and the Good-Turing estimates also play an important role in estimating population probabilities (Church and Gale 1991). If a simple random variable such as the vocabulary size reveals consistent and significant deviation from its expectation, the accuracy of the Good-Turing estimates is also called into question. The aim of this paper is to understand why this deviation between theory and observation arises in word frequency distributions, and in this light to evaluate applications of the Good-Turing results.</Paragraph> <Paragraph position="2"> The remainder of this paper is structured as follows. In Section 2, I introduce some basic notation and the expressions for the growth curve of the vocabulary with which we will be concerned throughout, including a model proposed by Hubert and Labbe (1988), which, by introducing a smoothing parameter, leads to much-improved fits. Unfortunately, this model is based on a series of unrealistic simplifications, and cannot serve as an explanation for the divergence between the observed and expected vocabulary size. In Section 3, therefore, I consider a number of possible sources for the misfit in greater detail: nonrandomness at the sentence level due to syntactic structure, nonrandomness due to the discourse structure of the text as a whole, and nonrandomness due to thematic cohesion in restricted sequences of sentences (paragraphs). Section 4 traces the implications of the results obtained for distributions of units derived from words, such as syllables and digrams, and examines the accuracy of the Good-Turing frequency estimates. A list of symbols is provided at the end of the paper.</Paragraph>
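The Good-Turing estimate referred to above can be read directly off the frequency spectrum: the probability mass assigned to unseen types is estimated as the number of hapax legomena divided by the sample size, V(N,1)/N. The following Python sketch is not part of the original paper; the function names and the toy text are assumptions for illustration only.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency f to V(N, f), the number of types occurring exactly f times."""
    type_frequencies = Counter(tokens)          # type -> frequency
    return Counter(type_frequencies.values())   # frequency -> number of types

def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the probability of sampling an unseen type:
    the number of hapax legomena V(N, 1) divided by the sample size N."""
    spectrum = frequency_spectrum(tokens)
    return spectrum.get(1, 0) / len(tokens)

# Toy illustration (hypothetical input; any tokenized text will do):
tokens = "the cat sat on the mat and the dog sat by the door".split()
print(good_turing_unseen_mass(tokens))
```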
<Paragraph position="3"> 2. The Growth Curve of the Vocabulary
Let N be the size of a text in word tokens, and let V denote the total number of different word types observed among the N word tokens. Roughly half of the word types occur only once, the so-called hapax legomena; others occur with higher frequencies.1 Let V(N, 1) denote the number of once-occurring types among N tokens, and, similarly, let V(N, f) denote the number of types occurring f times after sampling N tokens. The expected number of different types E[V(M)] for M < N conditional on the frequency spectrum {V(N, f)}, f = 1, 2, 3, ..., can be estimated by (1).</Paragraph> <Paragraph position="5"> A proof for (1) is presented in the appendix.</Paragraph> <Paragraph position="6"> Figure 1 illustrates the problems that arise when (1) is applied to three texts: Alice in Wonderland by Lewis Carroll (upper panels), Moby Dick by Herman Melville (middle panels), and Max Havelaar by Multatuli (the pseudonym of Eduard Douwes Dekker, bottom panels).2 All panels show the sample size N on the horizontal axis. Thus the horizontal axis can be viewed as displaying the "text time" measured in word tokens. The vertical axis of the left-hand panels shows the number of observed word types (dotted line) and the number of types predicted by the model (solid line) obtained using (1). These panels reveal that the expected vocabulary size overestimates the observed vocabulary size for almost all of the 40 equidistant measurement points. To the eye, the overestimation seems fairly small. Nevertheless, in absolute terms the expectation may be several hundred types too high, and may run up to 5% of the total vocabulary size.</Paragraph> <Paragraph position="7"> 1 The type definition I have used throughout is based on the orthographic word form: house and houses are counted as two different types, houses and houses as two tokens of the same type. No lemmatization has been attempted, first, because the probabilistic aspects of the problem considered here are not affected by whether or not lemmatization is carried out, and second, because it is of interest to ascertain how much information can be extracted from texts with minimal preprocessing.</Paragraph> <Paragraph position="8"> 2 These texts were obtained by anonymous ftp from Project Gutenberg at obi.std.com. The header of the electronic version of Moby Dick requires mention of E.F. Tray at the University of Colorado, Boulder, who prepared the text on the basis of the Hendricks House Edition.</Paragraph> <Paragraph position="9"> Figure 1: The growth curve of the vocabulary. Observed vocabulary size V(N) (dotted lines) and expected vocabulary size E[V(N)] (solid lines) for three novels (left-hand panels), and the corresponding overestimation errors E[V(N)] - V(N) (dotted lines) and their sentence-randomized versions ("+"-lines, see Section 3.1) (right-hand panels).</Paragraph>
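Before turning to the right-hand panels in detail, the comparison can be made concrete: the expected growth curve is computable from the frequency spectrum alone and can be compared with the observed curve at equidistant measurement points, as in Figure 1. The sketch below is not taken from the paper; it uses the standard binomial interpolation formula E[V(M)] = sum over f of V(N, f) * (1 - (1 - M/N)^f) as an assumed stand-in for the paper's expression (1), and the function names and measurement grid are illustrative.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Frequency spectrum {V(N, f)}: number of types occurring exactly f times."""
    return Counter(Counter(tokens).values())

def expected_vocabulary(spectrum, n, m):
    """Expected vocabulary size E[V(m)] for a subsample of m <= n tokens.
    Uses the binomial interpolation formula (an assumed stand-in for (1)):
        E[V(m)] = sum_f V(n, f) * (1 - (1 - m/n) ** f)
    """
    return sum(v_f * (1.0 - (1.0 - m / n) ** f) for f, v_f in spectrum.items())

def observed_vocabulary(tokens, m):
    """Observed number of distinct types V(m) among the first m tokens."""
    return len(set(tokens[:m]))

def overestimation_profile(tokens, points=40):
    """E[V(m)] - V(m) at `points` equidistant measurement points along the text,
    mirroring the dotted error curves in the right-hand panels of Figure 1."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    measurement_sizes = [k * n // points for k in range(1, points + 1)]
    return [expected_vocabulary(spectrum, n, m) - observed_vocabulary(tokens, m)
            for m in measurement_sizes]

# Hypothetical usage on any tokenized novel (e.g. a Project Gutenberg text):
# tokens = open("alice_in_wonderland.txt", encoding="utf-8").read().lower().split()
# print(overestimation_profile(tokens))
```

On a text with thematically clustered vocabulary this profile should come out largely positive, matching the overestimation pattern discussed above, whereas on a randomly permuted version of the same tokens it should fluctuate around zero.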
<Paragraph position="10"> The right-hand panels of Figure 1 show the overestimation error functions E[V(N)] - V(N) corresponding to the left-hand panels using dotted lines. For the first 20 measurement points, the instances for which E[V(N)] diverges significantly from V(N) are shown in bold.3 Clearly, the divergence is significant for almost all of the first 20 measurement points. This suggests informally that the discrepancy between E[V(N)] and V(N) is significant over a wide range of sample sizes.</Paragraph> <Section position="1" start_page="456" end_page="456" type="sub_section"> <SectionTitle> 2.1 The Model Proposed by Hubert and Labbe </SectionTitle> <Paragraph position="0"> The problem of the systematic estimation error of E[V(N)] has been pointed out by Muller (1979) and Brunet (1978), who hypothesize that lexical specialization is at issue.</Paragraph> <Paragraph position="1"> In any text, there are words whose use is mainly or even exclusively restricted to a given subsection of that text. Such locally concentrated clusters of words are at odds with the randomness assumption underlying the derivation of (1), and may be the cause of the divergence illustrated in Figure 1. Following this line of reasoning, Hubert and Labbe (1988) propose a model according to which (1) should be modified.</Paragraph> </Section> </Section> </Paper>