<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1701">
<Title>Web-based frequency dictionaries for medium density languages</Title>
<Section position="2" start_page="0" end_page="2" type="intro">
<SectionTitle> 0 Introduction </SectionTitle>
<Paragraph position="0"> In theoretical linguistics, introspective grammaticality judgments are often seen as having methodological primacy over conclusions based on what is empirically found in corpora. No doubt the main reason for this is that linguistics often studies phenomena that are not well exemplified in data.</Paragraph>
<Paragraph position="1"> For example, in the entire corpus of written English there seems to be only one attested example, not coming from semantics papers, of Bach-Peters sentences, yet the grammaticality (and the preferred reading) of these constructions seems beyond reproach. But from the point of view of the theoretician who claims that quantifier meanings can be computed by repeated substitution, even this one example is one too many, since no such theory can account for the clearly relevant (though barely attested) facts.</Paragraph>
<Paragraph position="2"> In this paper we argue that ordinary corpus size has grown to the point that in some areas of theoretical linguistics, in particular for issues of inflectional morphology, the dichotomy between introspective judgments and empirical observations need no longer be maintained: in this area at least, it is now nearly possible to make the leap from zero observed frequency to zero theoretical probability, i.e. ungrammaticality.</Paragraph>
<Paragraph position="3"> In many other areas, most notably syntax, this is still untrue, and here we argue that facts of derivational morphology are not yet entirely within the reach of empirical methods. Both for inflectional and derivational morphology we base our conclusions on recent work with a gigaword web-based corpus of Hungarian (Halácsy et al. 2004), which goes some way towards fulfilling the goals of the WaCky project (http://wacky.sslmit.unibo.it, see also Lüdeling et al. 2005) inasmuch as the infrastructure used in creating it is applicable to other medium-density languages as well. Section 1 describes the creation of the WFDH (Web-based Frequency Dictionary of Hungarian) from the raw corpus. The critical disambiguation step required for lemmatization is discussed in Section 2, and the theoretical implications are presented in Section 3. The rest of this Introduction is devoted to some terminological clarification and the presentation of the elementary probabilistic model used for psycholinguistic experiment design.</Paragraph>
<Section position="1" start_page="0" end_page="2" type="sub_section">
<SectionTitle> 0.1 The range of data </SectionTitle>
<Paragraph position="0"> Here we will distinguish three kinds of corpora: small-, medium-, and large-range, based on the internal coherence of the component parts. A small-range corpus is one that is stylistically homogeneous, generally the work of a single author. The largest corpora that we could consider small-range are thus the oeuvres of the most prolific writers, rarely above 1m, and never above 10m words. A medium-range corpus is one that remains within the confines of a few text types, even if the authorship of individual documents can be discerned, e.g. by detailed study of word usage. The LDC gigaword corpora, composed almost entirely of news (journalistic prose), are from this perspective medium range.
Finally, a large-range corpus is one that displays a variety of text types, genres, and styles that approximates that of overall language usage: the Brown corpus at 1m words has considerably larger range than, e.g., the Reuters corpus at 100m words.</Paragraph>
<Paragraph position="1"> The fact that psycholinguistic experiments need to control for word frequency has been known at least since Thorndike (1941), and frequency effects also play a key role in grammaticization (Bybee, 2003). Since the principal source of variability in word (n-gram) frequencies is the choice of topic, we can subsume overall considerations of genre under the selection of topics, especially as the former typically dictates the latter: for example, we rarely see literary prose or poetry dealing with undersea sedimentation rates. We assume a fixed inventory of topics T_1, T_2, ..., T_k, with k on the order of 10^4, similar in granularity to the Northern Light topic hierarchy (Kornai et al. 2003), and reserve T_0 for topicless texts or "General Language". Assuming that these topics appear in the language with frequencies q_1, q_2, ..., q_k, summing to 1 - q_0 ≤ 1, the "average" topic is expected to have frequency about 1/k (and clearly, q_0 is on the same order, as it is very hard to find entirely topicless texts).</Paragraph>
<Paragraph position="2"> As is well known, the salience of different nouns and noun phrases appearing in the same structural position is greatly impacted not just by frequency (generally, less frequent words are more memorable) but also by stylistic value. For example, taboo words are more salient than neutral words of the same overall frequency. But style is also closely associated with topic, and if we match frequency profiles across topics we are therefore controlling for genre and style as well. Presenting psycholinguistic experiments is beyond the scope of this paper: here we put the emphasis on creating the computational resource, the frequency dictionary, that allows for detailed matching of frequency profiles.</Paragraph>
<Paragraph position="3"> Defining the range r of a corpus C simply as the sum of the q_j taken over all topics T_j touched by documents in C (see the illustrative sketch below), single-author corpora typically have r < 0.1 even for encyclopedic writers, and web corpora have r > 0.9. Note that r just measures the range; it does not measure how representative a corpus is of some language community. Here we discuss results concerning all three ranges. For small range, we use the Hungarian translation of Orwell's 1984 (98k words including punctuation tokens; Dimitrova et al., 1998). For mid-range, we consider four topically segregated subcorpora of the Hungarian side of our Hungarian-English parallel corpus (34m words; Varga et al., 2005). For large range, we use our webcorpus (700m words; Halácsy et al., 2004).</Paragraph>
<Paragraph position="4"> 1 Collecting and presenting the data
Hungarian lags behind "high density" languages like English and German but is hugely ahead of minority languages that have no significant machine-readable material.
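Returning to the range measure r defined in Section 0.1, the following is a minimal Python sketch, not from the paper, that sums the global topic frequencies q_j over all topics touched by the documents of a corpus. The topic identifiers, the q_j values, and the per-document topic assignments are hypothetical toy inputs, and the function name corpus_range is our own.

# Illustrative sketch (not from the paper): the range r of a corpus C is the
# sum of the global topic frequencies q_j over all topics touched by C.
def corpus_range(doc_topics, topic_freq):
    """doc_topics: one set of topic ids per document in C.
    topic_freq: dict mapping a topic id to its global frequency q_j.
    Returns r, the sum of q_j over all topics touched by C."""
    touched = set()
    for topics in doc_topics:
        touched.update(topics)
    return sum(topic_freq[t] for t in touched)

# Toy q_j values: a narrow, single-author-like corpus touches few topics and
# gets a small r, while a broad web-like corpus touches most topics.
q = {"T1": 0.3, "T2": 0.25, "T3": 0.25, "T4": 0.2}
print(round(corpus_range([{"T1"}, {"T1", "T2"}], q), 2))           # 0.55
print(round(corpus_range([{"T1", "T2"}, {"T3", "T4"}], q), 2))     # 1.0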
Varga et al. (2005) estimated there to be about 500 languages that fit in the same "medium density" category, together accounting for over 55% of the world's speakers.</Paragraph>
<Paragraph position="5"> Halácsy et al. (2004) described how a set of open source tools can be exploited to rapidly clean the results of web crawls to yield high-quality monolingual corpora; the main steps are summarized below.</Paragraph>
<Paragraph position="6"> Raw data, preprocessing: The raw dataset comes from crawling the top-level domain, e.g. .hu, .cz, .hr, .pl, etc. Pages that contain no usable text are filtered out, and all text is converted to a uniform character encoding. Identical texts are dropped by checksum comparison of page bodies (a method that can handle near-identical pages, usually automatically generated, which differ only in their headers, datelines, menus, etc.).</Paragraph>
<Paragraph position="7"> Stratification: A spellchecker is used to stratify pages by recognition error rates. For each page we measure the proportion of unrecognized (either incorrectly spelled or out of the vocabulary of the spellchecker) words. To filter out non-Hungarian (non-Czech, non-Croatian, non-Polish, etc.) documents, the threshold is set at 40%. If we lower the threshold to 8%, we also filter out flat native texts that employ Latin (7-bit) characters to denote their accented (8-bit) variants (these are still quite common due to the ubiquity of US keyboards). Finally, below the 4% threshold, webpages typically contain fewer typos than average printed documents, making the results comparable to older frequency counts based on traditional (printed) materials (see the illustrative sketch below).</Paragraph>
<Paragraph position="8"> Lemmatization: To turn a given stratum of the corpus into a frequency dictionary, one needs to collect the wordforms into lemmas based on the same stem: we follow the usual lexicographic practice of treating inflected, but not derived, forms of a stem as belonging to the same lemma.</Paragraph>
<Paragraph position="9"> Inflectional stems are computed by a morphological analyzer (MA); the choice between alternative morphological analyses is resolved using the output of a POS tagger (see Section 2 below). When there are several analyses that match the output of the tagger, we choose the one with the fewest identified morphemes. For now, words outside the vocabulary of the MA are not lemmatized at all; this decision will be revisited once the planned extension of the MA to a morphological guesser is complete.</Paragraph>
<Paragraph position="10"> Topic classification: Kornai et al. (2003) presented a fully automated system for the classification of webpages according to topic. Combining this method with the methods described above enables the automatic creation of topic-specific frequency dictionaries and, further, the creation of a per-topic frequency distribution for each lemma.</Paragraph>
<Paragraph position="11"> This enables much finer control of word selection in psycholinguistic experiments than was hitherto possible.</Paragraph>
</Section>
<Section position="2" start_page="2" end_page="2" type="sub_section">
<SectionTitle> 1.1 How to present the data? </SectionTitle>
<Paragraph position="0"> For Hungarian, the highest quality (4% threshold) stratum of the corpus contains 1.22m unique pages for a total of 699m tokens, already exceeding the 500m predicted in (Kilgarriff and Grefenstette, 2003).
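The stratification step described above can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' tooling: the is_known predicate is a hypothetical stand-in for a real spellchecker lookup (e.g. a hunspell wrapper), and the stratum labels are our own shorthand for the 40%, 8% and 4% thresholds.

# Illustrative sketch: stratify crawled pages by the proportion of tokens the
# spellchecker does not recognize, using the thresholds discussed above.
def error_rate(tokens, is_known):
    """Proportion of tokens not recognized by the spellchecker."""
    if not tokens:
        return 1.0
    unknown = sum(1 for t in tokens if not is_known(t))
    return unknown / len(tokens)

def stratum(tokens, is_known):
    """Assign a page to a quality stratum by its recognition error rate."""
    r = error_rate(tokens, is_known)
    if r > 0.40:
        return "reject"   # most likely not in the target language
    if r > 0.08:
        return "low"      # may still include unaccented ("flat") native text
    if r > 0.04:
        return "medium"
    return "high"         # cleaner than average printed documents

# Toy usage with a four-word lexicon standing in for the spellchecker:
lexicon = {"web", "corpus", "frequency", "dictionary"}
print(stratum("web corpus frequency dictionary".split(), lambda w: w in lexicon))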
Since the web has grown considerably since the crawl (which took place in 2003), their estimate was clearly on the conservative side.</Paragraph>
<Paragraph position="1"> Of the 699m tokens, some 4.95m were outside the vocabulary of the MA (7% OOV in this mode, but less than 3% if numerals are excluded and the analysis of compounds is turned on). The remaining 649.7m tokens fall into 195k lemmas, with an average of 54 form types per lemma. If all stems are considered, the ratio is considerably lower, 33.6, but the average entropy of the inflectional distributions goes down only from 1.70 to 1.58 bits.</Paragraph>
<Paragraph position="2"> The summary frequency list (which is less than a megabyte compressed) can be published trivially. Clearly, the availability of large-range gigaword corpora is in the best interest of all workers in language technology, and equally clearly, only open (freely downloadable) materials allow for replicability of experiments. While it is possible to exploit search engine queries for various NLP tasks (Lapata and Keller, 2004), for applications which use corpora as unsupervised training material, downloadable base data is essential.</Paragraph>
<Paragraph position="3"> Therefore, a compiled webcorpus should contain actual texts. We believe all "cover your behind" efforts such as publishing only URLs to be fundamentally misguided. First, URLs age very rapidly: in any given year more than 10% become stale (Cho and Garcia-Molina, 2000), which makes any experiment conducted on such a basis effectively irreproducible. Second, by presenting a quality-filtered and character-set-normalized corpus, the collectors actually perform a service to those who are less interested in such mundane issues. If everybody has to start their work from the ground up, many projects will exhaust their funding resources and allotted time before anything interesting could be done with the data. In contrast, the Free and Open Source Software (FOSS) model actively encourages researchers to reuse data.</Paragraph>
<Paragraph position="4"> In this regard, it is worth mentioning that during the crawls we always respected robots.txt, and in the two years since the publication of the gigaword Hungarian web corpus there has not been a single request by copyright holders to remove material. We do not advocate piracy: on the contrary, it is our intended policy to comply with removal requests from copyright holders, analogous to Google cache removal requests. Finally, even with copyrighted material, there are easy methods for preserving interesting linguistic data (say, unigram and bigram models) without violating the interests of businesses involved in selling the running texts.</Paragraph>
</Section>
</Section>
</Paper>