File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/w98-1101_abstr.xml
Size: 7,143 bytes
Last Modified: 2025-10-06 13:49:32
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1101"> <Title>Bayesian Stratified Sampling to Assess Corpus Utility</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper describes a method for asking statistical questions about a large text corpus. We exemplify the method by addressing the question, &quot;What percentage of Federal Register documents are real documents, of possible interest to a text researcher or analyst?&quot; We estimate an answer to this question by evaluating 200 documents selected from a corpus of 45,820 Federal Register documents.</Paragraph> <Paragraph position="1"> Stratified sampling is used to reduce the sampling uncertainty of the estimate from over 3100 documents to fewer than 11300. The stratification is based on observed characteristics of real documents, while the sampling procedure incorporates a Bayesian version of Neyman allocation. A possible application of the method is to establish baseline statistics used to estimate recall rates for information retrieval systems.</Paragraph> <Paragraph position="2"> Introduction The traditional task in information retrieval is to find documents from a large corpus that are relevant to a query. In this paper we address a related task: answering statistical questions about a corpus. Instead of finding the documents that match a query, we quantify the percentage of documents that match it.</Paragraph> <Paragraph position="3"> The method is designed to address statistical questions that are: * subjective: that is, informed readers may disagree about which documents match the query, and the same reader may make different judgments at different times.
This characteristic describes most queries of real interest to text researchers.</Paragraph> <Paragraph position="4"> * difficult: that is, one cannot define an algorithm to reliably assess individual documents, and thus the corpus as a whole.</Paragraph> <Paragraph position="5"> This characteristic follows naturally from the first. It may be compounded by an insufficient understanding of a corpus, or a shortcoming in one's tools for analyzing it.</Paragraph> <Paragraph position="6"> Statistical questions asked of small corpora can be answered exhaustively, by reading and scoring every document in the corpus. Such answers will be subjective, since judgments about the individual documents are subjective. For a large corpus, it is not feasible to read every document. Instead, one must sample a subset of documents, then extrapolate the results of the sample to the corpus as a whole. The conclusions that one draws from such a sampling will have two components: the estimated answer to the question, and a confidence interval around the estimate.</Paragraph> <Paragraph position="7"> The method described in this paper combines traditional statistical sampling techniques (Cochran (1963), Kalton (1983)) with Bayesian analysis (Bayes (1763), Berger (1980)) to reduce this sampling uncertainty. The method is well-grounded in statistical theory, but its application to textual queries is novel. One begins by stratifying the data using objective tests designed to yield relatively homogeneous strata, within which most documents either match or do not match the query. Then one samples randomly within each stratum, with the number of documents sampled per stratum determined through the analysis of a presample. A reader scores each selected document, and the results of the different strata are combined.
If the strata are well constructed, the resulting estimate about the corpus will have a much smaller credibility interval (the Bayesian version of a confidence interval) than one based on a sample of the corpus as a whole.</Paragraph> <Paragraph position="8"> The method is well suited for subjective queries because it brings a human reader's subjective judgments to bear on individual documents. The Bayesian approach that we apply to this problem allows a second opportunity for the reader to influence the results of the sampling. The reader can construct a probability density that summarizes his or her prior expectations about each stratum. These prior expectations are combined with presampling results to determine the makeup of the final sample. When the final sample is analyzed, the prior expectations are again factored in, influencing the estimated mean and the size of the credibility interval. Thus different readers' prior expectations, and their judgments of individual documents, can lead to substantially different results, which is consistent with the subjective probability paradigm.</Paragraph> <Paragraph position="9"> In earlier work we used this method to analyze medical records, asking, &quot;What percentage of the patients are female?&quot; (Thomas et al. (1995)). The lack of a required gender field in the record format made this a subjective question, especially for records that did not specify the patient's gender at all, or gave conflicting clues. We stratified the corpus into probable male and female records based on linguistic tests such as the number of female versus male pronouns in a record, then sampled within each stratum. Stratification reduced the sampling uncertainty for the question from fourteen percentage points (based on an overall sample of 200 records) to five (based on a stratified sample of the same size).
In this paper, we update the method and apply it to a new corpus, the Federal Register.</Paragraph> <Paragraph position="10"> The main change from Thomas et al. (1995) is a greater focus on numerical methods as opposed to parametric and formulaic calculations. For example, we use a non-parametric prior density instead of a beta density, and combine posterior densities between strata using a Monte Carlo simulation rather than weighted means and variances. Other differences, such as a Bayesian technique for allocating samples between strata, and a new method for determining the size of the credibility interval, are noted in the text.</Paragraph> <Paragraph position="11"> The Federal Register corpus is of general interest because it is part of the TIPSTER collection. The question we addressed is likewise of general interest: what percentage of documents are of possible interest to a researcher, or to an analyst querying the corpus? Anyone who has worked with large text corpora will recognize that not all documents are created equal; identifying and filtering uninteresting documents can be a nuisance. Estimating the percentage of uninteresting documents in a corpus therefore helps determine its utility.</Paragraph> <Paragraph position="12"> The paper begins by describing the Federal Register corpus and the corpus utility query. It then describes two steps in finding a statistical answer to the query: first through an overall sample of 200 documents from the corpus, then through a stratified sample of 200, then 400 documents. The Conclusion takes up the question of possible application domains and implementation issues for the method.</Paragraph> </Section></Paper>
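The procedure outlined in the introduction (stratify, presample, allocate the final sample, score, and combine the strata by Monte Carlo simulation) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It uses a beta prior for simplicity, whereas the paper states that it uses a non-parametric prior. The allocation rule, which weights each stratum by its size times a standard deviation derived from the presample posterior mean, is only one plausible reading of "a Bayesian version of Neyman allocation". All function names, strata sizes, and counts are hypothetical; only the corpus total of 45,820 documents comes from the text.

```python
import math
import random


def posterior_sd(matches, sampled, prior=(1.0, 1.0)):
    """Per-document SD sqrt(p * (1 - p)), with p the beta posterior mean.

    Assumption: a Beta(a, b) prior stands in for the paper's
    non-parametric prior.
    """
    a, b = prior
    p = (a + matches) / (a + b + sampled)
    return math.sqrt(p * (1.0 - p))


def neyman_allocation(sizes, sds, total):
    """Split `total` sample slots across strata in proportion to N_h * S_h."""
    weights = [n * s for n, s in zip(sizes, sds)]
    weight_sum = sum(weights)
    if weight_sum == 0:
        # Every stratum looks perfectly homogeneous: allocate by size instead.
        weights, weight_sum = list(sizes), sum(sizes)
    raw = [total * w / weight_sum for w in weights]
    alloc = [int(r) for r in raw]
    # Hand out the slots lost to rounding down, largest remainder first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[: total - sum(alloc)]:
        alloc[i] += 1
    return alloc


def combined_posterior(sizes, matches, sampled, prior=(1.0, 1.0),
                       draws=20000, seed=0):
    """Monte Carlo draws of the corpus-wide proportion of matching documents."""
    rng = random.Random(seed)
    corpus = sum(sizes)
    a, b = prior
    results = []
    for _ in range(draws):
        # Draw one proportion per stratum, then weight by stratum size.
        expected = sum(
            n * rng.betavariate(a + k, b + m - k)
            for n, k, m in zip(sizes, matches, sampled)
        )
        results.append(expected / corpus)
    return results


def credible_interval(draws, mass=0.95):
    """Central interval containing roughly `mass` of the simulated draws."""
    ordered = sorted(draws)
    low = ordered[int((1 - mass) / 2 * len(ordered))]
    high = ordered[int((1 + mass) / 2 * len(ordered)) - 1]
    return low, high


# Hypothetical strata (sizes sum to 45,820) and presample scores.
sizes = [30000, 10000, 5820]
presample = [(9, 10), (5, 10), (1, 10)]  # (matches, documents scored)

sds = [posterior_sd(k, m) for k, m in presample]
allocation = neyman_allocation(sizes, sds, 170)
print("final-sample allocation per stratum:", allocation)

# For brevity the presample scores stand in for the full scored sample.
draws = combined_posterior(sizes, [k for k, _ in presample],
                           [m for _, m in presample])
print("95% credibility interval:", credible_interval(draws))
```

Strata that are both large and internally mixed receive the most sample slots, which is why homogeneous strata shrink the credibility interval of the combined estimate.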