<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1028"> <Title>A Figure of Merit for the Evaluation of Web-Corpus Randomness</Title> <Section position="3" start_page="0" end_page="217" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The Web is a very rich source of linguistic data, and in the last few years it has been used intensively by linguists and language technologists for many tasks (Kilgarriff and Grefenstette, 2003).</Paragraph> <Paragraph position="1"> Among other uses, the Web allows fast and inexpensive construction of &quot;general purpose&quot; corpora, i.e., corpora that are not meant to represent a specific sub-language, but a language as a whole. There are several recent studies on the extent to which Web-derived corpora are comparable, in terms of variety of topics and styles, to traditional &quot;balanced&quot; corpora (Fletcher, 2004; Sharoff, 2006). Our contribution, in this paper, is to present an automated, quantitative method to evaluate the &quot;variety&quot; or &quot;randomness&quot; (with respect to a number of non-random partitions) of a Web corpus. The more random/less biased towards specific partitions a corpus is, the more suitable it should be as a general purpose corpus.</Paragraph> <Paragraph position="2"> We are not proposing a method to evaluate whether a sample of Web pages is a random sample of the Web, although this is a related issue (Bharat and Broder, 1998; Henzinger et al., 2000).</Paragraph> <Paragraph position="3"> Instead, we propose a method, based on simple distributional properties, to evaluate whether a sample of Web pages in a certain language is reasonably varied in terms of the topics (and, perhaps, textual types) it contains. This is independent of whether they are actually proportionally representing what is out there on the Web or not. 
For example, although computer-related technical language is probably much more common on the Web than, say, the language of literary criticism, one might prefer a biased retrieval method that fetches documents representing these and other sub-languages in comparable amounts, to an unbiased method that leads to a corpus composed mostly of computer jargon. This is a new area of investigation: with traditional corpora, one knows their composition a priori. As the Web plays an increasingly central role as a data source in NLP, we believe that methods to efficiently characterize the nature of automatically retrieved data are becoming of central importance to the discipline.</Paragraph> <Paragraph position="4"> In the empirical evaluation of the method, we focus on general purpose corpora built by issuing automated queries to a search engine and retrieving the corresponding pages, which has been shown to be an easy and effective way to build Web-based corpora (Ghani et al., 2001; Ueyama and Baroni, 2005; Sharoff, 2006). It is natural to ask which kinds of query terms, henceforth seeds, are more appropriate to build a corpus comparable, in terms of variety, to traditional balanced corpora such as the British National Corpus, henceforth BNC (Aston and Burnard, 1998). We test our procedure to assess Web-corpus randomness on corpora built using seeds chosen according to different strategies.</Paragraph> <Paragraph position="5"> However, the method per se can also be used to assess the randomness of corpora built in other ways, e.g., by crawling the Web.</Paragraph> <Paragraph position="6"> Our method is based on the comparison of the word frequency distribution of the target corpus to word frequency distributions constructed using search engine queries for deliberately biased seeds. As such, it is nearly resource-free, as it only requires lists of words belonging to specific domains that can be used as biased seeds. 
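The distributional comparison described above can be illustrated with a short sketch. Note that the dissimilarity measure used here (Jensen-Shannon divergence between relative word frequencies) is only one plausible choice for comparing two corpora, not necessarily the figure of merit developed in this paper.

```python
from collections import Counter
import math

def relative_freqs(tokens):
    """Map each word in a token list to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def distribution_distance(tokens_a, tokens_b):
    """Jensen-Shannon divergence between the word frequency
    distributions of two token lists.
    Ranges from 0 (identical) to log 2 (disjoint vocabularies)."""
    p, q = relative_freqs(tokens_a), relative_freqs(tokens_b)
    vocab = set(p) | set(q)
    # Mixture distribution; strictly positive on the joint vocabulary.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(x):
        # Kullback-Leibler divergence KL(x || m), skipping zero terms.
        return sum(x[w] * math.log(x[w] / m[w]) for w in x if x[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

A target corpus whose distance to every biased corpus is uniformly large (and roughly equal) would, on this sketch's logic, count as less biased towards any particular partition.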
In our experiments we used Google as the search engine of choice, but other search engines could be used as well, as could other ways of obtaining collections of biased documents, e.g., a directory of pre-categorized Web pages.</Paragraph> </Section> </Paper>