The Effects of Lexical Specialization on the Growth Curve of the Vocabulary

Footnote 3: Since the expression for an estimate of the variance of V(N) figuring in the Z-scores used here requires knowledge of E[V(2N)], the significance of the divergence for the second 20 measurement points is not available. For technical details, see Chitashvili and Baayen (1993).

The Hubert-Labbé adjusted expectation of the vocabulary size can be written as follows (see the appendix for further details):

E_{HL}[V(M)] = p\,\frac{M}{N}\,V + (1-p)\,V - (1-p)\sum_f V(N,f)\left(1-\frac{M}{N}\right)^f \qquad (2)

Hubert and Labbé's model contains one free parameter, the coefficient of vocabulary partition p, an estimate of the proportion of specialized words in the vocabulary. Given K different text sizes M_k for which the observed and expected vocabulary sizes are known, p can be estimated by minimizing the mean squared error (MSE)

\mathrm{MSE}(p) = \frac{1}{K}\sum_{k=1}^{K}\left(E_{HL}[V(M_k)] - V(M_k)\right)^2 \qquad (4)

(conveniently ignoring that the variance of V(M) increases with M; see Chitashvili and Baayen [1993]). For Alice in Wonderland, minimization of (4) for K = 40 leads to p = 0.16, and according to this rough estimate of goodness-of-fit the revised model fits the data very well indeed (χ²(39) = 3.58, p > 0.5). For Moby Dick, however, the chi-squared statistic suggests a significant difference between the observed and expected vocabulary sizes (χ²(39) = 172.93, p < 0.001), even though the value of the p parameter (0.12) leads to a fit that is much improved with respect to the unadjusted growth curve (χ²(39) = 730.47). Closer inspection of the error pattern of the adjusted estimate reveals the source of the misfit: for the first 12 measurement points, the observed vocabulary size is consistently overestimated. From the 14th observation onwards, the Hubert-Labbé model consistently underestimates the real vocabulary size. Apparently, the development of the vocabulary in Moby Dick can be modeled globally, but local fluctuations introducing additional points of inflection into the growth curve are outside its scope--a more detailed study of the development of lexical specialization in the narrative is required if the appearance of these points of inflection is to be understood.

In spite of this deficiency, the Hubert-Labbé curve appears to be an optimal smoother, and this suggests that the value obtained for the coefficient of vocabulary partition p is a fairly reliable estimate of the extent to which a text is characterized by lexical specialization. In this light, the evaluation by Holmes (1994), who suggests that p might be a useful discriminant for authorship attribution studies, is understandable.
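The estimation in (4) is straightforward to reproduce in outline. The following is a minimal sketch, not the original implementation: it assumes the frequency spectrum of the full text is available as a dictionary mapping each frequency f to V(N,f), that the observed growth curve is given as (M_k, V(M_k)) pairs, and that scipy is available; the helper names e_v, e_hl, and estimate_p are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def e_v(m, n, spectrum):
    # Urn-model expectation: E[V(m)] = V - sum_f V(N,f) * (1 - m/N)^f
    v = sum(spectrum.values())
    return v - sum(vf * (1.0 - m / n) ** f for f, vf in spectrum.items())

def e_hl(m, n, spectrum, p):
    # Hubert-Labbe adjustment, eq. (2): a proportion p of the vocabulary is
    # treated as specialized and grows linearly with the sample; the rest
    # follows the urn model.
    v = sum(spectrum.values())
    return p * (m / n) * v + (1.0 - p) * e_v(m, n, spectrum)

def estimate_p(growth, n, spectrum):
    # growth: list of (M_k, V(M_k)) pairs for the K measurement points.
    def mse(p):
        return np.mean([(e_hl(m, n, spectrum, p) - v) ** 2 for m, v in growth])
    return minimize_scalar(mse, bounds=(0.0, 1.0), method="bounded").x
```

Since the MSE is a smooth function of a single bounded parameter, a bounded scalar minimizer suffices; a simple grid search over p in [0, 1] would serve equally well here.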
Unfortunately, the assumptions underlying (2) are overly simplistic. They seriously call into question the reliability of p as a measure of lexical specialization, and the same holds for the explanatory value of this model for the inaccuracy of E[V(N)].

2.2 Problems with the Hubert and Labbé Model

One highly questionable simplification underlying the derivation of (2) spelled out in the appendix is that specialized words are assumed to occur in a single text slice only. Consider Figure 2, which plots the number of times Ahab appears in 40 successive, equally sized text slices that jointly constitute the full text of Moby Dick. The dotted line reveals the main developmental pattern (time-series smoothing using running medians).

Figure 2: Nonrandom word usage illustrated for Ahab in Moby Dick. The horizontal axis plots the 40 equally sized text slices, the vertical axis the frequency of Ahab in these text slices. The dotted line represents a time-series smoother using running medians (Tukey 1977).

Even though Ahab is one of the main characters in Moby Dick, and even though his name certainly belongs to the specialized vocabulary of the novel, Ahab is not mentioned by name in one text slice only, as the Hubert-Labbé model would have it. What we find is that he is not mentioned at all in the first five text slices. Following this we observe a series of text slices in which he appears frequently. These are in turn succeeded by slices in which Ahab is hardly mentioned, but he reappears in the last part of the book, and as the book draws to its dramatic close, the frequency of Ahab increases to its maximum. This is an illustration of what Indefrey and Baayen (1994) refer to as inter-textual cohesion: the word Ahab enjoys specialized use, but it occurs in a series of subtexts within the novel as a whole, contributing to its overall cohesion.

Within text slices where Ahab is frequently mentioned, the intra-textual cohesion may similarly be strengthened. For instance, Ahab appears to be a specialized word in text slice 23, but he is mentioned only in passing in text slice 25. His appearance in the two text slices strengthens the inter-textual cohesion of the whole novel, but it is only the intra-textual cohesion of slice 23 that is raised. The presence of inter-textual cohesion in addition to intra-textual cohesion, and the concomitant phenomenon of global lexical specialization, suggest that in order to understand the discrepancy between V(N) and its expectation, a more fine-grained approach is required.
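A distribution like the one in Figure 2 is easy to inspect programmatically. The sketch below, under the same assumptions as before (a tokenized text as a list of strings), counts a word's occurrences per slice and smooths the series with running medians of width three. Note that this single pass is only a crude stand-in for Tukey's (1977) smoothers, which involve repeated medians and further refinements; the helper names are illustrative.

```python
import statistics

def slice_frequencies(tokens, word, k=40):
    # Frequency of `word` in each of k equally sized, successive text slices.
    n = len(tokens)
    return [tokens[j * n // k:(j + 1) * n // k].count(word) for j in range(k)]

def running_medians(xs, width=3):
    # One pass of running medians; windows shrink at the series boundaries.
    half = width // 2
    return [statistics.median(xs[max(0, i - half):i + half + 1])
            for i in range(len(xs))]
```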
A second question concerns how lexical specialization affects the empirical growth curve of the vocabulary. Inspection of plots such as those presented in Figure 1 for Alice in Wonderland suggests that the effects of lexical specialization appear in the central sections of the text, as it is there that the largest differences between the expected and the observed vocabulary are to be observed--differences that are heavily penalized by the MSE and chi-squared techniques used to estimate the proportion of specialized words in the vocabulary. Unfortunately, the central sections are not necessarily the ones characterized by the highest degree of lexical specialization. To see this, consider Figure 3, which plots the differences between the expected number of new types according to (1) and the observed number of new types for the successive text slices of Alice in Wonderland. More precisely, for each text slice k, k = 1, ..., 40, we calculate the progressive difference error scores D(k):

D(k) = \left(E[V(M_k)] - V(M_k)\right) - \left(E[V(M_{k-1})] - V(M_{k-1})\right), \qquad (5)

with E[V(M_0)] = V(M_0) = 0.

Figure 3: Error scores for the influx of new types in Alice in Wonderland. The k = 1, 2, ..., 40 text slices are displayed on the horizontal axis, the progressive difference scores D(k) are shown on the vertical axis. The dashed line represents a nonparametric scatterplot smoother (Cleveland 1979), the dotted line a least squares regression line (the negative slope is significant).

Note that in addition to positive difference scores, which should be present given that E[V(M_k)] > V(M_k) for most, or, as in Alice in Wonderland, for all values of k, we also have negative difference scores. Text slices containing more types than expected under chance conditions are necessarily present given the existence of text slices k for which E[V(M_k)] - V(M_k) > 0: the total number of types accumulated over the 40 text slices has to sum up to V(N). Figure 3 shows that the expected numbers of new word types are overestimated for the initial part of the novel, that the theoretical estimates are fairly reliable for the middle section of the novel, while the final chapters show a slightly greater increase in the number of new types than expected under chance conditions.

If lexical specialization affects the influx of new types, its effects appear not in the central sections of the novel as suggested by Figure 1, but rather in the beginning and perhaps at the end. This finding seriously questions the appropriateness of using the growth curve of the vocabulary for deriving a measure of lexical specialization.
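The scores in (5) can be computed directly from a tokenized text. The following minimal sketch uses the standard binomial interpolation formula for the urn-model expectation (equation (1) of the paper) and the 40-slice design used throughout; the function name is illustrative.

```python
from collections import Counter

def progressive_difference_scores(tokens, k=40):
    # D(k), eq. (5): the per-slice error in the predicted influx of new types.
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())   # {f: V(N, f)}
    def e_v(m):
        # Urn-model expectation E[V(m)] = sum_f V(N,f) * (1 - (1 - m/n)^f)
        return sum(vf * (1 - (1 - m / n) ** f) for f, vf in spectrum.items())
    seen, errors = set(), [0.0]
    for j in range(1, k + 1):
        m = j * n // k
        seen.update(tokens[(j - 1) * n // k:m])    # V(M_j) = len(seen)
        errors.append(e_v(m) - len(seen))
    return [errors[j] - errors[j - 1] for j in range(1, k + 1)]
```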
A third question arises with respect to how one's measure of lexical concentration is affected by the number of text slices K. In Hubert and Labbé's model, the optimal value of the p parameter is independent of the number of text slices K for not-too-small K (K > 10). Since the expected growth curve and the observed growth curve are completely fixed and independent of K--the former is fully determined by the frequency spectrum of the complete text, the latter is determined by the text itself--the choice of K influences only the number of points at which the divergence between the two curves is measured. Increasing the number of measurement points increases the degrees of freedom along with the deviance, and the optimal value of the p parameter remains virtually unchanged. But is this a desirable property for a measure of lexical specialization?

Even without taking the effects of inter-textual cohesion into account, and concentrating solely on local specialization and intra-textual cohesion, formulating lexical specialization in terms of concentration at a particular point in the text is unrealistic: it is absurd to assume that all tokens of a specialized word appear in one chunk without any other intervening words. A more realistic definition of (local) lexical specialization is the concentration of the tokens of a given word within a particular text slice. In such an approach, however, the size of the text slice is of crucial importance. A word appearing only in the first half of a book enjoys some specialized use, but to a far lesser extent than a word with the same frequency that occurs in the first half of the first chapter only. In other words, an approach to lexical specialization in terms of concentration of use is incomplete without a specification of the unit of concentration itself.

3. Sources of Nonrandomness

To avoid these problems, I will now sketch a somewhat more fine-grained approach to understanding why V(N) and its expectation diverge, adopting Hubert and Labbé's central insight that lexical specialization can be modeled in terms of local concentration. Consider again the potential sources for violation of the randomness assumption underlying the derivation of E[V(N)]. At least three possibilities suggest themselves: syntactic constraints on word usage within sentences, global discourse organization, and local repetition. I will consider these possibilities in turn.

3.1 Syntactic Constraints

Syntactic constraints at the level of the sentence introduce many restrictions on the occurrence of words. For instance, in normal written English, following the determiner the the appearance of a second instance of the same determiner (as in this sentence) is extremely unlikely. According to the urn model, however, such a sequence is likely to occur once every 278 words (the relative frequency of the in English is approximately 0.06), say once every two pages. This is not what we normally find. Clearly, syntax imposes severe constraints on the occurrence of words. Does this imply that the urn model is wrong? For individual sentences, the answer is undoubtedly yes. But for more global textual properties such as vocabulary size, a motivated answer is less easy to give. According to Herdan (1960, 40), reacting to Halle's criticism of the urn model as a realistic model for language, there is no problem, since statistics is concerned with form, not content (see footnote 4). Whatever the force of this argument may be, Figure 1 demonstrates clearly that the urn model lacks precision for our data.

Footnote 4: M. Halle, "In defence of the number two," in Studies Presented to J. Whatmough, The Hague, 1957; quoted in Herdan (1960, 40).

In order to ascertain the potential relevance of the syntactic constraints referred to by Halle, we may proceed as follows: if sentence-level syntax underlies the misfit between the observed and the expected vocabulary size, then this misfit should remain visible for randomized versions of the text in which the sentences have been left unchanged, but in which the order of the sentences has been permuted.
If the misfit disappears, we know that constraints whose domain is restricted to the sentence can be ruled out.

The results of this randomization test applied to Alice in Wonderland, Moby Dick, and Max Havelaar are shown in the right-hand panels of Figure 1 by means of "+" symbols. What we find is that following sentence randomization, all traces of a significant divergence between the observed and expected vocabulary size disappear. The differences between E[V(N)] and V(N) are substantially reduced and may remain slightly negative, as in Alice in Wonderland, or slightly positive, as for Moby Dick, or they may fluctuate around zero in an unpredictable way, as in Max Havelaar. Since we are left with variation that is probably to be attributed to the particularities of the individual randomization orders, we may conclude that at the global level of the text as an (unordered) aggregate of sentences, the randomness assumption remains reasonable. The nonrandomness at the level of sentence structure does not influence the expected vocabulary size. As a global text characteristic, it is probably insensitive to the strictly local constraints imposed by syntax. Apparently, it is the sequential order in which sentences actually appear that crucially determines the bias of our theoretical estimates. There are at least two domains where this sequential order might be relevant: the global domain of the discourse structure of the text as a whole, and the more local domain of relatively small sequences of sentences sharing a particular topic.
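A minimal sketch of this sentence-level randomization test is given below, assuming the text is available as a list of tokenized sentences; the helper names are illustrative. Note that permuting whole sentences leaves the frequency spectrum, and hence E[V(M)], unchanged, so only the observed growth curve can move.

```python
import random
from collections import Counter

def growth_errors(tokens, k=40):
    # E[V(M_j)] - V(M_j) at j = 1..k equally spaced measurement points.
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())   # {f: V(N, f)}
    seen, errors = set(), []
    for j in range(1, k + 1):
        m = j * n // k
        seen.update(tokens[(j - 1) * n // k:m])
        e_v = sum(vf * (1 - (1 - m / n) ** f) for f, vf in spectrum.items())
        errors.append(e_v - len(seen))
    return errors

def sentence_randomized_errors(sentences, k=40, seed=0):
    # Permute whole sentences, keeping each sentence internally intact, and
    # recompute the error scores. If sentence-internal syntax caused the
    # misfit, the divergence should survive this permutation.
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return growth_errors([w for s in shuffled for w in s], k)
```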
To explore these two potential explanatory domains in detail, we need a method for linking topical discourse structure and local topic continuity with word usage. Lexical specialization, informally defined as topic-linked concentrated word usage, and formalized in terms of underdispersion, provides us with the required tool.

3.2 Lexical Specialization

Recall that the word Ahab is unevenly distributed in Moby Dick. Given its high frequency (510), one would expect it to occur in all 40 text slices, but it does not. In fact, there are 11 text slices where Ahab is not mentioned at all. Technically speaking, Ahab is underdispersed. If there are many such words, and if these underdispersed words cluster together, the resulting deviations from randomness may be substantial enough to become visible as a divergence between the observed and theoretical growth curves of the vocabulary.

In order to explore this intuition, we need a reliable way to ascertain whether a word is underdispersed. Let the dispersion d_i of a word ω_i be the number of different text slices in which ω_i appears. Analytical expressions for E[d_i] and VAR[d_i] are available (Johnson and Kotz 1977, 113-114), so that in principle Z-scores can be calculated. These Z-scores can then be used to ascertain which words are significantly underdispersed in that they occur in significantly too few text slices given the urn model (cf. Baayen, 1996). Unfortunately, dispersions deviate substantially from normality, so that Z-scores remain somewhat impressionistic (see footnote 5). I have therefore used a randomization test to ascertain which words are significantly underdispersed.

The randomization test proceeded as follows: the sequence of words of a text was randomized 1,000 times. For each permutation, the dispersion of each word type in that particular permutation was obtained. For each word, we calculated the proportion of permutations for which the dispersion was lower than or equal to the empirical dispersion. For Ahab, all 1,000 permutations revealed full dispersion (d = 40), which suggests that the probability that the low empirical dispersion of Ahab (d = 28) is due to chance is (much) less than .001.

Footnote 5: I am indebted to an anonymous referee for pointing out to me that Z-scores are imprecise. I am similarly indebted to Fiona Tweedie, who suggested the use of the randomization test. Comparison of the results based on Z-scores (see Baayen, to appear) and the results based on the randomization test, however, reveals only minor differences that leave the main patterns in the data unaffected.

The content words singled out as being significantly underdispersed at the 1% level (the significance level I will use throughout this study for determining underdispersion) reveal a strong tendency to be key words. For instance, for Moby Dick, the ten most frequent underdispersed content words are Ahab, boat, captain, said, white, Stubb, whales, men, sperm, and Queequeg. The five most frequent underdispersed function words are you, ye, such, her, and any (see footnote 6). The number of chunks in which an underdispersed word appears, and the frequencies with which such a word appears in the various chunks, cannot be predicted on the basis of the urn model. (Instead of the binomial or Poisson models, the negative binomial has been found to be a good model for such words; see, e.g., Church and Gale [1995].) Before studying how these words appear in texts and how they affect the growth curve of the vocabulary, it is useful to further refine our definition of underdispersion.

Footnote 6: The present method of finding underdispersed words appears to be fairly robust with respect to the number of text slices K. For different numbers of text chunks, virtually the same high-frequency words appear to be underdispersed. The number of text chunks exploited in this paper, 40, has been chosen to allow patterns in "sampling time" to become visible without leading to overly small text slices for the smaller texts.

Consider again the distribution of the word Ahab in Figure 2. In text slice 25, Ahab occurs only once. Although this single occurrence contributes to the inter-textual cohesion of the novel as a whole, it can hardly be said to be a key word within text slice 25. In order to eliminate such spurious instances of key words, it is useful to set a frequency threshold. The threshold used here is that the frequency of the word in a given text slice should be at least equal to the mean frequency of the word calculated over the text slices in which the word appears. More formally, let f_{i,k} be the frequency of the i-th word type in the k-th text slice, and define the indicator variable d_{i,k} as follows:

d_{i,k} = \begin{cases} 1 & \text{if } \omega_i \text{ is significantly underdispersed and } f_{i,k} \ge \frac{1}{d_i}\sum_{k'} f_{i,k'}, \\ 0 & \text{otherwise.} \end{cases} \qquad (6)

The number of underdispersed types in text slice k, VU(k), and the corresponding number of underdispersed tokens, NU(k), can now be defined as

VU(k) = \sum_i d_{i,k} \qquad (7)

NU(k) = \sum_i d_{i,k}\, f_{i,k} \qquad (8)
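The dispersion measure, the randomization test, and the counts in (7) and (8) can be sketched as follows, under the assumptions already noted (tokenized text, 40 slices); the function names are illustrative. For the p-value, permuting the whole text and re-measuring one word's dispersion is equivalent to drawing that word's token positions at random without replacement, so the sketch samples positions directly instead of shuffling the full token sequence.

```python
import random
from collections import Counter, defaultdict

def slice_of(pos, n, k):
    # Index of the text slice (0..k-1) containing token position pos.
    return pos * k // n

def dispersions(tokens, k=40):
    # d_i for every word type: the number of different slices it appears in.
    n = len(tokens)
    slices = defaultdict(set)
    for pos, w in enumerate(tokens):
        slices[w].add(slice_of(pos, n, k))
    return {w: len(s) for w, s in slices.items()}

def underdispersion_p(tokens, word, k=40, runs=1000, seed=0):
    # Proportion of random placements whose dispersion is <= the empirical one.
    rng = random.Random(seed)
    n, f = len(tokens), tokens.count(word)
    observed = dispersions(tokens, k)[word]
    hits = sum(
        len({slice_of(p, n, k) for p in rng.sample(range(n), f)}) <= observed
        for _ in range(runs))
    return hits / runs

def vu_nu(tokens, underdispersed, k=40):
    # VU(k) and NU(k), eqs. (7)-(8): per-slice counts of underdispersed types
    # and tokens, applying the mean-frequency threshold of eq. (6).
    n = len(tokens)
    freq = defaultdict(Counter)                   # freq[w][slice] = f_{i,k}
    for pos, w in enumerate(tokens):
        freq[w][slice_of(pos, n, k)] += 1
    vu, nu = [0] * k, [0] * k
    for w in underdispersed:
        mean = sum(freq[w].values()) / len(freq[w])   # f_i / d_i
        for j, f_ik in freq[w].items():
            if f_ik >= mean:
                vu[j] += 1
                nu[j] += f_ik
    return vu, nu
```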
3.3 Lexical Specialization and Discourse Structure

We are now in a position to investigate where underdispersed words appear and how they influence the observed growth curve of the vocabulary. First consider Figure 4, which summarizes a number of diagnostic functions for Alice in Wonderland.

Figure 4: Diagnostic functions for Alice in Wonderland. VU(k) and NU(k): numbers of underdispersed types and tokens in text slice k; ACF: auto-correlation function; Pr(U, type) and Pr(U, token): proportions of underdispersed types and tokens; D(k) and DU(k): progressive difference scores for the overall vocabulary and the underdispersed words.

The upper panels plot VU(k) (left) and NU(k) (right), the numbers of underdispersed types and tokens appearing in the successive text chunks. Over sampling time, we observe a slight increase in both the numbers of tokens and the numbers of types. Both trends are significant according to least squares regressions, represented by dotted lines (F(1,38) = 6.591, p < .02 for VU(k); F(1,38) = 16.58, p < .001 for NU(k)). A time-series smoother using running medians (Tukey 1977), represented by solid lines, suggests a slightly oscillating pattern. At least for a time lag of 1, this finds some support in the autocorrelation functions, shown in the second row of panels of Figure 4. Clearly, key words are not uniformly distributed in Alice in Wonderland. Not only does the use of key words in one text slice appear to influence the intensity with which key words are used in the immediately neighboring text slices, but as the novel proceeds key words appear with increasing frequency.

How does this nonrandom organization of key words in the discourse as a whole influence V(N)? To answer this question, it is convenient to investigate the nature of the new types that arrive with the successive text slices. Let ΔVU(k) denote the number of new underdispersed types for text slice k, and ΔV(k) the total number of new types in that slice. The proportion of new underdispersed types among the new types of text slice k, Pr(U, type, k), is given by

\Pr(U, \mathrm{type}, k) = \frac{\Delta VU(k)}{\Delta V(k)} \qquad (11)

The plot of Pr(U, type, k) is shown on the third row of Figure 4 (left-hand panel). According to a least squares regression (dotted line), there is a significant increase in the proportion of underdispersed new types as k increases (F(1, 38) = 5.804, p < .05). The right-hand counterpart shows a similar trend for the word tokens, likewise supported by a least squares regression (F(1, 38) = 5.681, p < .05).
Here, the proportion of new underdispersed tokens among the total number of new tokens is defined analogously as

\Pr(U, \mathrm{token}, k) = \frac{\Delta NU(k)}{\Delta N(k)} \qquad (12)

The increase in the proportions of new underdispersed types and tokens shows that the pattern observed for the absolute numbers of types and tokens in the top panels of Figure 4 persists with respect to the new types and tokens.

We can now test to what extent the underdispersed types are responsible for the divergence of V(N) and its expectation by comparing the progressive difference scores D(k) defined in (5) with the progressive difference scores for the subset of the underdispersed words, DU(k), defined as

DU(k) = \left(E[V_U(M_k)] - V_U(M_k)\right) - \left(E[V_U(M_{k-1})] - V_U(M_{k-1})\right), \qquad (13)

where V_U(M) denotes the number of underdispersed types among the first M tokens. The two progressive difference score functions are shown in the bottom left panel of Figure 4, and the residuals D(k) - DU(k) are plotted in the bottom right-hand panel. The residuals do not reveal any significant trend (F(1, 38) < 1), which suggests that the underdispersed vocabulary is indeed responsible for the main trend in the progressive difference scores D(k) of the vocabulary, and hence for the divergence between E[V(N)] and V(N). In the next section, I will argue that intra-textual cohesion is in large part responsible for the general downward curvature of DU(k). In what follows, I will first present an attempt to understand the differences in the error scores E[V(N)] - V(N) shown in Figure 1 as a function of differences in the use of key words at the discourse level.

In Alice in Wonderland, key words are relatively rare in the initial text slices. As a result, these text slices reveal fewer types than expected under chance conditions. Consequently, V(N) is smaller than E[V(N)]. For increasing k, as shown in the upper right panel of Figure 1, the divergence between V(N) and its expectation first increases--the initial text slices contain the lowest numbers of underdispersed types and tokens--and then decreases as more and more underdispersed words appear. Thus the semi-circular shape of the error scores E[V(N)] - V(N) shown in Figure 1 is a direct consequence of the topical structure at discourse level of Alice in Wonderland.

The error scores E[V(N)] - V(N) for Moby Dick and Max Havelaar shown in Figure 1 reveal a different developmental profile. In these novels, the maximal divergence appears early on in the text, after which the divergence decreases until, just before the end, V(N) becomes even slightly larger than its expectation. Is it possible to understand this qualitatively different pattern in terms of the discourse structure of these novels? First, consider Moby Dick. A series of diagnostic plots is shown in Figure 5.

Figure 5: Diagnostic functions for Moby Dick. VU(k) and NU(k): numbers of underdispersed types and tokens in text slice k; Pr(U, type) and Pr(U, token): proportions of underdispersed types and tokens; D(k) and DU(k): progressive difference scores for the overall vocabulary and the underdispersed words; f[Ahab](k): frequency of Ahab in text slice k.

The numbers of underdispersed types and tokens VU(k) and NU(k) reveal some variation, but unlike in Alice in Wonderland, there is only a nonsignificant trend for underdispersion to occur more often as the novel progresses.
The absence of a trend is supported by the proportions of underdispersed types and tokens, shown in the second row of panels (F < 1 for both types and tokens). In the last text slices, underdispersed words are even underrepresented. The bottom panels show that the progressive difference scores DU(k) for the underdispersed words capture the main trend in the progressive difference scores of the total vocabulary D(k) quite well: the residuals D(k) - DU(k) do not reveal a significant trend (F(1, 38) = 1.08, p > .3).

Interestingly, the use of underdispersed words in Moby Dick is to some extent correlated with the frequency of the word Ahab, with respect to both the types VU(k) and the tokens NU(k). The panels on the third row of Figure 5 show the frequencies of Ahab (left) and VU(k) as a function of the frequency of Ahab (right). A nonparametric time series smoother (solid line) supports the least squares regression line (dotted line). In other words, the key figure of Moby Dick induces a somewhat more intensive use of the key words of the novel.

The nonuniform distribution of Ahab sheds some light on the details of the shape of the difference function E[V(N)] - V(N) shown in Figure 1. The initial sections do not mention Ahab; it is here that D(k) reveals its highest values, and here too we find the largest discrepancies between E[V(N)] and V(N). By text slice 20, Ahab has been firmly established as a principal character in the novel, and the main key words have appeared. The overestimation of the vocabulary is substantially reduced. As the novel draws to its dramatic end, the frequency of Ahab increases to its maximum. The plots on the first row of Figure 5 suggest that underdispersed types and tokens are also used more intensively in the last text slices. However, the proportions plots on the second row show a final dip, suggesting that at the very end of the novel, a more than average number of normally dispersed new types appears. Considered together, this may explain why at the very end of the novel the expected vocabulary slightly underestimates the observed vocabulary size, as shown in Figure 1.

Finally, consider the diagnostic plots for Max Havelaar, shown in Figure 6.

Figure 6: Diagnostic functions for Max Havelaar. VU(k) and NU(k): numbers of underdispersed types and tokens in text slice k; ACF: auto-correlation function; Pr(U, type) and Pr(U, token): proportions of underdispersed types and tokens; D(k) and DU(k): progressive difference scores for the overall vocabulary and the underdispersed words.

The time series smoother (solid line) for the absolute numbers of underdispersed types (VU(k)) and tokens (NU(k)) suggests an oscillating use of key words without any increase in the use of key words over time (the dotted lines represent the least squares regression lines, neither of which is significant: F < 1 in both cases). This oscillatory structure receives some support from the autocorrelation functions shown in the second row of panels. Especially in the token analysis, there is some evidence for positive autocorrelation at lag 1, and for a negative polarity at time lags 8 and 9. No trend emerges from the proportions of new underdispersed types and tokens (third row, F < 1 in both analyses).
A comparison of the progressive difference scores D(k) and DU(k) (bottom row) shows that the underdispersed words are again largely responsible for the large values of D(k) for small k. No significant trend remains in the residuals D(k) - DU(k) (F(1, 38) = 1.848, p > .15).

Figure 1 revealed that E[V(N)] - V(N) is largest around text slices 3 to 7, but becomes negative for roughly the last third of the novel. This pattern may be due to the oscillating use of key words in Max Havelaar. Although there is a fair number of key words in the first few text chunks, the intensity of key words drops quickly, only to rise again around chunk 20. Thus, key words are slightly underrepresented in the first part of the novel, allowing the largest divergence between the expected and observed vocabulary size to emerge there.

3.4 The Paragraph as the Domain of Topic Continuity

The preceding analyses all revealed violations of the randomness assumption underlying the urn model that originate in the topical structure of the narrative as a whole. I have argued that a detailed analysis of the distribution of key word tokens and types may shed some light on why the theoretical vocabulary size sometimes overestimates and sometimes underestimates the observed vocabulary size. We are left with the question of to what extent repeated use of words within relatively short sequences of sentences, henceforth for ease of reference paragraphs, affects the accuracy of E[V(N)]. I therefore carried out two additional analyses, one using five issues of the Dutch newspaper Trouw, and one using the random samples of the Dutch newspaper De Telegraaf available in the Uit den Boogaart (1975) corpus. For both texts, no overall topical discourse structure is at issue, so that we can obtain a better view of the effects of intra-textual cohesion by itself.

For each newspaper, the available texts were brought together in one large corpus, preserving chronological order. Each corpus was divided into 40 equally large text slices. The upper left panel of Figure 7 shows that in the consecutive issues of Trouw (March 1994) the expected vocabulary size differs significantly from the observed vocabulary size for all of the first 20 measurement points, the domain for which significance can be ascertained (see footnote 3). The upper right panel reveals that for the chronologically ordered series of samples from De Telegraaf in the Uit den Boogaart corpus (268 randomly sampled text fragments with on average 75 word tokens) only 3 text chunks reveal a significant difference between E[V(N)] and V(N). The bottom panels of Figure 7 show the corresponding plots of the progressive difference scores for the complete vocabulary (D(k), ".") and underdispersed words (DU(k), "+"). The least squares regression lines (dotted) for D(k), supported by nonparametric scatterplot smoothers (solid lines), reveal a significant negative slope (F(1, 38) = 6.89, p < .02 for Trouw; F(1, 38) = 10.99, p < .001 for De Telegraaf). The residuals D(k) - DU(k) do not reveal any significant trends (F < 1 for both newspapers). Note that for De Telegraaf, DU(k) does not capture the downward curvature of D(k) as well as it should for large k.
This may be due to the relatively small number of words that emerge as significantly underdispersed for this corpus.

Figure 7: Diagnostic plots for two Dutch newspapers. The difference between the expected and observed vocabulary size for the Trouw data (five issues from March 1994) and the random samples of De Telegraaf in the Uit den Boogaart corpus (upper panels; significant differences are highlighted for the first 20 measurement points). The bottom panels show the progressive difference error scores for the total vocabulary (D(k)) and for the subset of underdispersed words (DU(k)). The dotted line is a least squares regression, the solid line a nonparametric scatterplot smoother.

Evidently, intra-textual cohesion by itself can give rise to substantial deviation between E[V(N)] and V(N) in texts with no overall discourse organization. Within successive issues of a newspaper, in which a given topic is often discussed on several pages within the same newspaper, and in which a topic may reappear in subsequent issues, strands of inter-textual cohesion may still contribute significantly to the large divergence between the observed and expected vocabulary size. It is only by randomly sampling short text fragments, as for the data from the Uit den Boogaart corpus, which contains samples evenly spread out over a period of one year, that a substantial reduction in overestimation is obtained. Note, however, that even for the corpus data we again find that the expectation of V(N) is consistently too high. Within paragraphs, words tend to be reused more often than expected under chance conditions. This reuse pre-empts the use of other word tokens, among which tokens of types that have not been observed among the preceding tokens, and leads to a decrease in type richness. Since intra-textual cohesion is also present in the texts of novels, we may conclude that the overestimation bias in novels is determined by a combination of intra-textual and inter-textual cohesion.

4. Implications

We have seen that intra-textual and inter-textual cohesion lead to a significant difference between the expected and observed vocabulary size for a wide range of sample sizes. This section addresses two additional questions. First, to what extent does the nonrandomness of word occurrences affect distributions of units selected or derived from words? Second, how does cohesive word usage affect the Good-Turing frequency estimates?

4.1 Word-derived Units

First consider the effect of nonrandomness on the frequency distributions of morphological categories.
The upper panels of Figure 8 plot the difference between the expected and observed vocabulary size for the morphological category of words with the Dutch suffix -heid, which, like -ness in English, is used to coin abstract nouns from adjectives (e.g., snelheid, 'speed', from snel, 'quick').

Figure 8: Diagnostic plots for affixes, syllables, and digraphs. The difference between the expected and observed vocabulary size for the morphological category of words with the Dutch suffix -heid '-ness' in Max Havelaar (upper left) and in Trouw (upper right), for syllables in Trouw (lower left), and for digraphs in Alice in Wonderland. Significant differences are shown in bold for the first half of the tokens.

The plots are based on samples consisting of all and only those words occurring in Max Havelaar (upper left) and Trouw (upper right) that belong to the morphological category of -heid, ignoring all other words, and preserving their order of appearance in the original texts. The sample of -heid words in Max Havelaar consisted of 640 tokens representing 260 types, of which 146 are hapax legomena. From Trouw, 1,145 tokens representing 394 types were extracted, among which 246 are hapax legomena.

In Max Havelaar, a number of words in -heid, such as waarheid 'truth' and vrijheid 'freedom', are underdispersed key words. Not surprisingly, this affects the growth curve of -heid itself. For small values of k, we observe a significant divergence between E[V(N)] and V(N). In the newspaper Trouw, where -heid words do not play a central role in an overall discourse, no significant divergence emerges. Nevertheless, we again observe a consistent trend for the expected vocabulary size to overestimate the actual vocabulary size. For the syllables in Trouw and the digraphs in Alice in Wonderland (lower panels), Figure 8 reveals significant deviation in the first half of both texts. This suggests that the nonrandomness observed for words carries over to word-based units such as digraphs and syllables.

4.2 Accuracy of Good-Turing Estimates

Samples of words generally contain--often small--subsets of all the different types available in the population. The probability mass of the unseen types is generally large enough to significantly bias population probabilities estimated from sample relative frequencies. Good (1953) introduced an adjusted frequency estimate (which he credits to Turing) to correct this bias. Instead of estimating the probability of a word with frequency f by its sample relative frequency,

p_i = \frac{f}{N}, \qquad (16)

Good suggests the use of the adjusted estimate

p_i^* = \frac{f+1}{N}\,\frac{E[V(N,f+1)]}{E[V(N,f)]}. \qquad (17)

A closely related statistic is the probability P(N) of sampling a new, unseen type after N word tokens have been sampled:

P(N) = \frac{V(N,1)}{N}. \qquad (18)

These estimates are in wide use (see, e.g., Church and Gale [1991] for application to bigrams, Bod [1995] for application to syntax, and Baayen [1992] and Baayen and Sproat [1996] for application to morphology). Hence, it is useful to consider in some detail how their accuracy is affected by inter-textual and intra-textual cohesion.
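Before turning to the experiments, here is a minimal sketch of the estimators in (16)-(18). It is an illustration under stated assumptions, not the paper's implementation: the raw frequency spectrum is used in place of its expectation in (17) (the paper itself works with a smoothed spectrum, as discussed below), and the function names are illustrative.

```python
from collections import Counter

def good_turing_probs(tokens):
    # Adjusted probability per eq. (17): a type with sample frequency f gets
    # p* = ((f+1)/N) * V(N, f+1) / V(N, f), with the raw spectrum standing in
    # for its expectation.
    n = len(tokens)
    freqs = Counter(tokens)
    spectrum = Counter(freqs.values())             # {f: V(N, f)}
    probs = {}
    for w, f in freqs.items():
        if spectrum.get(f + 1):
            probs[w] = (f + 1) * spectrum[f + 1] / (spectrum[f] * n)
        else:
            probs[w] = f / n                       # fall back to eq. (16)
    return probs

def prob_unseen(tokens):
    # Eq. (18): estimated probability that the next token is a new type.
    freqs = Counter(tokens)
    v1 = sum(1 for f in freqs.values() if f == 1)  # hapax legomena, V(N,1)
    return v1 / len(tokens)
```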
To this end, I carried out a short series of experiments of the following kind. Assume that the Trouw data used in the previous section constitute a population of N = 265,360 word tokens from which we sample the first N/2 = 132,680 words. For the Trouw data, this is a matter of stipulation, but for texts such as Moby Dick or Alice in Wonderland, an argument can be made that the novel is the true population rather than a sample from a population. For the present purposes, the crucial point is that we have now defined a population for which we know exactly what the population probabilities--the relative frequencies in the complete texts--are.

First consider how accurately we can estimate the vocabulary size of the population from the sample. The expression for E[V(N)] given in (1) that we have used thus far does not allow us to extrapolate to larger sample sizes. However, analytical expressions that allow both interpolation (in the sense of estimating V(M) on the basis of the frequency spectrum for sample sizes M < N) and extrapolation (in the sense of estimating V(M) for M > N) are available (for a review, see Chitashvili and Baayen [1993]). Here, I will make use of a smoother developed by Sichel (1986). The three parameters of this smoother are estimated by requiring that E[V(N)] = V(N) and E[V(N,1)] = V(N,1), and by minimizing the chi-squared statistic for a given span of frequency ranks.

The upper left panel of Figure 9 shows that it was possible to select the parameters of Sichel's model such that the observed frequencies of the first 20 frequency ranks (V(N,f), f = 1, ..., 20) do not differ significantly from their model-dependent expectations E_S[V(N,f)] (see footnote 7).

Figure 9: Interpolation and extrapolation from sample (the first half of the Trouw data) to population (the complete Trouw data). E[V(N,f)] and V(N,f): expected and observed frequency spectrum; E[V(N)] and V(N): expected and observed numbers of types; M_p(f): population probability mass of the types with frequency f in the sample; M_GT(f): Good-Turing estimate of M_p(f); M_s(f): unadjusted sample estimate of M_p(f).

The upper right panel shows that interpolation on the basis of Sichel's model (dashed line) is virtually indistinguishable from interpolation using (1) (dotted line). The observed vocabulary sizes are represented by large dots. As expected, both (1) and the parametric smoother reveal the characteristic overestimation pattern.

The center panels of Figure 9 show that the overestimation characteristic for interpolation is reversed when extrapolating to larger samples. For extrapolation, underestimation is typical. The dotted line in the left-hand panel represents the observed vocabulary size of the complete Trouw text, the solid line shows the result from interpolation and extrapolation from N = 132,680. The right-hand panel highlights the corresponding difference scores. For N = 265,360, the error is large: 5.5% of the actual vocabulary size.

Having established that E[V(N)] underestimates V(N) when extrapolating, the question is how well the Good-Turing estimates perform. To determine this, I will consider the probability mass of the frequency classes V(M,f) for f = 1, ..., 40. Let

M_{GT}(f, M) = V(M,f)\, p_f^*(M) \qquad (19)
be the joint Good-Turing probability mass of all types with frequency f in the sample of M = 132,680 tokens, where p_f^*(M) is the adjusted estimate of (17), and let M_p(f) be the joint probability mass of exactly the same word types, but now in the population (N = 265,360 tokens):

M_p(f) = \sum_{i:\, f(i,M) = f} \frac{f(i,N)}{N}, \qquad (20)

with f(i, X) the frequency of the i-th type in a sample of X tokens.

Footnote 7: The fit (χ²(18) = 9.93, p > .9) was obtained for the parameter values α = 0.291, γ = -0.7, and b = 0.011.

The bottom left panel of Figure 9 shows that for the first frequency ranks f, the Good-Turing estimate M_GT(f, M) underestimates the probability mass of the frequency class in the population. For the higher-frequency ranks, the estimates are fairly reliable. The bottom right panel of Figure 9 plots the corresponding errors for the unadjusted sample probability mass

M_s(f, M) = V(M,f)\,\frac{f}{M}, \qquad (21)

which overestimates the population values. Surprisingly, the unadjusted estimates overestimate the population values to roughly the same extent that the adjusted estimates lead to underestimation. A heuristic estimate,

M_h(f, M) = \frac{M_s(f, M) + M_{GT}(f, M)}{2}, \qquad (22)

the mean of M_s(f, M) and M_GT(f, M), appears to approximate the population relative class frequencies M_p(f) reasonably well, as shown in Table 1 for the Trouw data as well as for Alice in Wonderland, Moby Dick, and Max Havelaar. For f > 5, as shown in Figure 10, the heuristic estimate remains a reasonable compromise.

We have seen that both E[V(N)] and the Good-Turing estimates M_GT(f, M) (especially for f < 5) lead to underestimation of population values. Interestingly, P(M) overestimates the probability mass of unseen types. For the Trouw data, at M = 132,680 we count 11,363 hapax legomena, hence P(M) = 0.0856. However, the probability mass of the types that do not appear among the first 132,680 tokens, M_p(0), is much smaller: 0.0609. Table 1 shows that P(M) similarly leads to overestimation for Alice in Wonderland, Moby Dick, and Max Havelaar. To judge from Table 1, the Good-Turing estimate M_GT(1, M) is an approximate lower bound and the unadjusted estimate M_s(1, M) a strict upper bound for M_p(0).
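The comparison itself is easy to replicate in outline. The sketch below, under the same assumptions as the earlier sketches (a tokenized text, raw rather than smoothed spectra), treats the first half of the text as the sample and the whole text as the population, and tabulates M_GT(f, M) (which, with raw spectra, simplifies to (f+1) V(M,f+1)/M), M_s(f, M), the heuristic mean, and the true population mass M_p(f); the function name is illustrative.

```python
from collections import Counter

def mass_comparison(tokens, f_max=40):
    # Compare estimates of the population probability mass of the types that
    # have frequency f in the sample (the first half of the text).
    n = len(tokens)
    m = n // 2
    sample, population = Counter(tokens[:m]), Counter(tokens)
    spectrum = Counter(sample.values())               # {f: V(M, f)}
    rows = []
    for f in range(1, f_max + 1):
        types = [w for w, c in sample.items() if c == f]
        m_s = f * len(types) / m                      # eq. (21)
        m_gt = (f + 1) * spectrum.get(f + 1, 0) / m   # eq. (19), raw spectrum
        m_h = (m_s + m_gt) / 2                        # eq. (22)
        m_p = sum(population[w] for w in types) / n   # eq. (20)
        rows.append((f, m_gt, m_s, m_h, m_p))
    return rows
```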
It is easy to see why P(N) is an upper bound for coherent text by focusing on its interpretation. Given the urn model, the probability that the first token sampled represents a type that will not be represented by any other token equals V(N,1)/N. By symmetry, this probability is identical to the probability that the very last token sampled will represent an unseen type. This probability approximates the probability that, after N tokens have been sampled, the next token sampled will be a new type. However, this interpretation hinges on the random selection of word tokens, and this paper presents ample evidence that once a word has been used it is much more likely to be used again than the urn model predicts. Hence, the probability that after sampling N tokens the next token represents an unseen type is less than V(N,1)/N. Due to intra-textual and inter-textual cohesion, the V(N) - V(N,1) types that have already been observed have a slightly higher probability of appearing than expected under chance conditions, and consequently the unseen types have a lower probability.

Summing up, the Good-Turing frequency estimates are severely affected by the cohesive use of words in normal text. In the absence of probabilistic models that take cohesive word usage into account, estimates of (relative) frequencies remain heuristic in nature. For the frequencies of types occurring at least once in the sample, the average of the sample and Good-Turing adjusted frequencies is a useful heuristic. For estimates of the probability of unseen types, the sample and Good-Turing estimates provide approximate upper and lower bounds.

Table 1: Comparison of probability mass estimates for frequencies f = 1, ..., 5 using the smoother E_S[V(N,f)] of Sichel (1986). The probability mass of unseen types, M_p(0), is also tabulated. Notation: M_GT(f, M): Good-Turing estimate; M_s(f, M): sample estimate; M_h(f, M): heuristic estimate; M_p(f): population mass. For Max Havelaar, a sample comprising the first third of the novel was used; for the other texts, a sample consisting of the first half of the tokens was selected.