<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1028">
<Title>A Figure of Merit for the Evaluation of Web-Corpus Randomness</Title>
<Section position="7" start_page="221" end_page="222" type="evalu">
<SectionTitle>5 Results</SectionTitle>
<Paragraph position="0">Table 3 summarizes the results of the experiments with Google. Each column represents one experiment involving a specific, supposedly unbiased, category. The category with the best (lowest) d score is highlighted in bold. The unbiased sample is always ranked higher than all biased samples.</Paragraph>
<Paragraph position="1">The best results are achieved with Brown corpus seeds. The bootstrapped error estimate shows that the unbiased Brown samples are significantly more random than the biased samples and, orthogonally, than the BNC and 3esl samples. In particular, medium-frequency terms seem to produce the best results, although the differences among the three Brown categories are not significant. Thus, while more testing is needed, our data provide some support for the choice of medium-frequency words as the best seeds.</Paragraph>
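The bootstrapped error estimate is not spelled out in this section. The sketch below shows one way such an estimate could be computed, assuming documents are represented as token lists and using KL divergence between smoothed unigram distributions as a stand-in for the paper's d score; d_score, bootstrap_d, and the add-one smoothing are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from collections import Counter

def unigram_dist(docs, vocab):
    """Relative frequencies over a fixed vocabulary, with add-one smoothing
    so the KL divergence below is always defined."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[w] + 1 for w in vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def d_score(sample_docs, reference_docs):
    """Illustrative stand-in for the paper's figure of merit d: KL divergence
    between the unigram distribution of a sample and that of a reference
    collection (lower = closer to the reference)."""
    vocab = {tok for doc in sample_docs for tok in doc} | \
            {tok for doc in reference_docs for tok in doc}
    p = unigram_dist(sample_docs, vocab)
    q = unigram_dist(reference_docs, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

def bootstrap_d(sample_docs, reference_docs, n_boot=1000, seed=0):
    """Bootstrap estimate of d: resample documents with replacement,
    recompute the score each time, and return its mean and standard error."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        resample = [rng.choice(sample_docs) for _ in sample_docs]
        scores.append(d_score(resample, reference_docs))
    mean = sum(scores) / n_boot
    var = sum((s - mean) ** 2 for s in scores) / (n_boot - 1)
    return mean, math.sqrt(var)

# Toy usage: two query-set samples compared against a reference collection;
# non-overlapping error intervals suggest a significant difference in d.
unbiased = [["the", "cat", "sat"], ["on", "the", "mat"], ["dogs", "bark"]]
biased = [["law", "court", "judge"], ["legal", "court", "law"], ["judge", "law"]]
reference = [["the", "dog", "ran"], ["a", "cat", "sat"], ["on", "a", "mat"]]

for name, docs in [("unbiased", unbiased), ("biased", biased)]:
    m, se = bootstrap_d(docs, reference, n_boot=200)
    print(f"{name}: d = {m:.3f} +/- {se:.3f}")
```

Resampling whole documents, rather than tokens, matches the document-level sampling unit of the corpora being compared.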
<Paragraph position="2">Terms extracted from the BNC are less effective than terms from the Brown corpus. One possible explanation is that the Web is likely to contain much larger portions of American than British English, and thus the BNC queries are overall more biased than the Brown queries. Alternatively, this might be due to the smaller, more controlled nature of the Brown corpus, where even medium- and low-frequency words tend to be relatively common terms.</Paragraph>
<Paragraph position="3">The internal ranking of the BNC categories, although not statistically significant, also seems to suggest that medium-frequency words (BNC.mf) are better than low-frequency words. In this case, the all/low-frequency set (BNC.af) tends to contain very infrequent words; thus, its poor performance is likely due to data-sparseness issues, as also indicated by the relatively smaller quantity of data retrieved (Table 2 above). We take the comparatively lower rank of BNC.demog as further support for the validity of our method, given that the corresponding set, being entirely composed of words from spoken English, should be more biased than the other unbiased sets. This latter finding is particularly encouraging because the way in which this set is biased, i.e., in terms of mode of communication, is completely different from the topic-based bias of the WordNet sets. Finally, the queries extracted from the 3esl set are the most biased.</Paragraph>
<Paragraph position="4">This unexpected result might relate to the fact that, on a quick inspection, many words in this set, far from being what we would intuitively consider "core" vocabulary, are rather cultivated, often technical terms (aesthetics, octopi, misjudgment, hydroplane); thus, they might show a register-based bias that we do not find in lists extracted from balanced corpora. We randomly selected 100 documents from the corpus constructed with the "best" unbiased set (Brown.mf) and 100 documents from the 3esl corpus, and we classified them in terms of genre, topic, and other categories (in random order, so that the source of the rated documents was not known). This preliminary analysis did not highlight dramatic differences between the two corpora, except for the fact that 6 out of 100 documents in the 3esl sub-corpus pertained to the rather narrow domain of aviation and space travel, while no comparably narrow topic had such a large share of the distribution in the Brown.mf sub-corpus. More research is needed into the qualitative differences that correlate with our figure of merit. Finally, although different query sets retrieve different numbers of documents, and lead to the construction of corpora of different lengths, there is no sign that these differences affect our figure of merit in a systematic way; e.g., some of the larger collections, in terms of number of documents and token size, are found both at the top (most unbiased samples) and at the bottom of the ranks (law, sociology).</Paragraph>
<Paragraph position="5">On Web data we observed the same effect we saw with the BNC data, where we could directly sample from the whole collection and from its biased partitions. This supports the hypothesis that our measure can be used to evaluate how unbiased a corpus is, and that issuing unbiased/biased queries to a search engine is a viable, nearly knowledge-free way to create unbiased corpora, and biased corpora to compare them against.</Paragraph>
</Section>
</Paper>