<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1101">
  <Title>Bayesian Stratified Sampling to Assess Corpus Utility</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Overall estimation and the Bayesian approach
</SectionTitle>
    <Paragraph position="0"> Bayesian approach We will illustrate the Bayesian approach in the context of a straw man effort to estimate the percentage of real documents without stratifying the corpus. From the entire 45,820 document set, we sampled 200 documents at random. Sampling was done with replacement (i.e., we did not remove sampled documents from the population); however, no documents were observed to be selected twice. One of our researchers then reviewed the documents and judged them as real documents versus pseudodocuments. He did this by reading the first fifty lines of each document.</Paragraph>
    <Paragraph position="1"> Of the 200 documents sampled, 187, or 0.935, were judged to be real documents; this served as our initial estimate for the overall percentage of real documents in the corpus.</Paragraph>
    <Paragraph position="2"> We then used Bayesian techniques to modify this estimate based on our prior expectations about the population. This was a three-step process. First, we calculated the binomial likelihood function corresponding to the sampling results. Second, we encoded our prior expectations in a likelihood function. Third, we combined the binomial and prior likelihood functions to create a posterior probability density. This posterior served as the basis for the final estimate and credibility interval.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Binomial likelihood function
</SectionTitle>
      <Paragraph position="0"> The standard binomial likelihood function associated with the sampling result (187 real documents out of 200), 200! x 187 (l-x) 13, (1) f(x)- 187!13! is graphed in Figure 1. It shows, given each possible true percentage of real documents, the likelihood that one would find 187 real documents out of 200 sampled. We evaluated the likelihood function at a high degree of granularity -- at x intervals corresponding to five significant digits -- so that we would later be able to map percentages of documents onto exact numbers of documents.</Paragraph>
      <Paragraph position="2"> 187 real documents out of 200 sampled</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Prior
</SectionTitle>
      <Paragraph position="0"> We chose a prior by inputting a personal likelihood: one researcher's subjective opinion about the population based on a first look at the corpus. The researcher's input consisted of eleven likelihood values, at intervals of 0.1 on the x axis, as shown in Figure 2. These points were then splined to obtain a likelihood function (Fig. 2; see Press et al. (1988)) and normalized to obtain a probability density. The resulting density was discretized at five significant digits to match the granularity of the binomial likelihood function.</Paragraph>
      <Paragraph position="1">  proportion of real documents An alternative to the above procedure is to choose a prior from a parametric family such as beta densities. This approach simplifies.</Paragraph>
      <Paragraph position="2"> later calculations, as shown in Thomas et al.</Paragraph>
      <Paragraph position="3"> (1995). However, the non-parametric prior allows the researcher more freedom to choose a probability density that expresses his or her best understanding of a population.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Posterior
</SectionTitle>
      <Paragraph position="0"> Once the prior density was established, we applied Bayes' theorem to calculate a posterior probability density for the population. We did this by multiplying binomial likelihood function (Fig. 1) by the prior density (Fig. 2), then normalizing. The non-zero portion of the resuiting posterior is graphed in Figure 3.</Paragraph>
      <Paragraph position="1"> Figure 3 contrasts this posterior density with the binomial likelihood function from Figure 1, also normalized. From a Bayesian perspective, the latter density implicitly factors in the standard non-informative prior in which each possible percentage of real documents has an equal probability: The informative prior shifted the density slightly to the left.</Paragraph>
      <Paragraph position="2"> We used the posterior density to revise our estimate of the percentage of real documents in the population, and to quantify the uncertainty of this estimate. The revised estimate was the mean It, of the density, defined as l /,~kf(xk), where l is the number of points evaluated for the function (1,000,001). This evaluated to 0.9257. To quantifythe uncertainty of this estimate, we found the 95% credibility interval surrounding it -- that is, the range on the x axis that contained 95% of the area under the posterior density.</Paragraph>
      <Paragraph position="3">  proportion of real documents The traditional way to find this interval is to assume a normal distribution, compute the variance c 2 of the posterior, defined as l f(xk)(Xk-l.t) 2, and set the credibility interval k=l at .It 5:1.96 a. This yielded a credibility interval between 0.8908 and 0.9606. As an alternative, we calculated the credibility interval exactly, in a numerical fashion that yielded the tightest possible interval and thus somewhat reduced the final uncertainty of the estimate. To do so we moved outward from the peak of the density, summing under the curve until we reached a total probability of 0.95. At each step outwards from the peak we considered probability values to the fight and left and chose the larger of the two. This method also finds a tighter interval than the numerical method used in Thomas et al. (1995), which was based on finding the left and right tails that each contained 0.025 of the density.</Paragraph>
      <Paragraph position="4"> The credibility interval found for the posterior probability density using the exact method is summarized in Table 1, in percentages of real documents and in numbers of real documents. The document range was calculated by multiplying the percentage range by the number of documents in the corpus (45,820). For comparison's sake the table includes the parallel results obtained using a  led to a slightly smaller credibility interval than the informative prior, implying that the latter was poorly chosen. But regardless of the prior used, the size of the credibility interval, expressed in numbers of documents, was over  3100 documents. This was a lot of uncertainty -- enough to taint any decision about the usage of documents in the on-line Federal Register.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Reducing uncertainty using stratified sampling
</SectionTitle>
    <Paragraph position="0"> stratified sampling using We performed a stratified sampling to reduce the uncertainty displayed in Table 1. This process involved dividing the data into two relatively homogeneous strata, one containing mostly real documents, the other mostly pseudo-documents, and combining sampling results from the two strata.</Paragraph>
    <Paragraph position="1"> This approach is advantageous because the variance of a binomial density, ~ (where n is the number sampled, and p the percentage of &amp;quot;yes&amp;quot; answers), shrinks dramatically for extreme values of p. Therefore, one can generally reduce sampling uncertainty by combining results from several homogeneous strata, rather than doing an overall sample from a heterogeneous population.</Paragraph>
    <Paragraph position="2"> As with our overall sample, we performed the stratified sampling within the Bayesian framework. The steps described in Section 3 for the overall sample were repeated for each stratum (with an additional step to allocate samples to the strata), and the posteriors from the strata were combined for the final estimate.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Defining strata and allocating the
samples
</SectionTitle>
      <Paragraph position="0"> We divided the documents into two strata: apparent real documents, and apparent pseudodocuments. The basis for the division was the observation that most pseudo-documents were of the following types:  1. Part dividers (title pages for subparts of an issue) 2. Title pages 3. Tables of contents 4. Reader Aids Sections 5. Instructions to insert illustrations not present in the electronic version 6. Null documents (no text material between &lt;TEXT&gt; and &lt;/TEXT&gt; markers) 7. Other defective documents, such as titles of  presidential proclamations that were separated from the proclamation itself. We wrote a short Per1 script that recognized pseudo-document types 1-4 using key phrases (e.g., /&gt;Part \[IVXM\]/ for Part dividers), and types 5-7 by their short length. This test stratified the data into 3444 apparent pseudo-documents and 42,376 apparent real documents. Exploration of the strata showed that this stratification was not perfect -- indeed, if it were, we could no longer call this query difficult! Some real documents were misclassified as pseudo-documents because they accidentally triggered the key phrase detectors. An erratum document correcting the incomprehensible Register-ese error &amp;quot;&lt;ITAG tagnum=68&gt;BILLING</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="6" type="metho">
    <SectionTitle>
CODE 1505-01-D &lt;/ITAG&gt;&amp;quot;
</SectionTitle>
    <Paragraph position="0"> was misclassified as a real document. However, we will see that the stratification sufficed to sharply reduce the credibility interval.</Paragraph>
    <Paragraph position="1"> Before doing the stratified sampling, we had to decide how many documents to sample from each stratum. In a departure from Thomas et al. (1995), we used a Bayesian modification of Neyman allocation to do this.</Paragraph>
    <Paragraph position="2"> Traditional Neyman allocation requires a presampling, of each stratum to determine its heterogeneity; heterogeneous strata are then sampled more intensively. In Newbold's Bayesian modification (1971), prior expectations for each stratum are combined with pre- null sample results to create a posterior densiry for each stratum. These posteriors are then used to determine the allocation.</Paragraph>
    <Paragraph position="3"> This technique therefore required creating posterior densities for each stratum that blended a prior density and a presample. Accordingly, we devised priors for the two strata -- apparent pseudo-documents, and apparent real documents -- based on our exploratory analysis of the strata. As in the overall analysis (Section 3.2), we splined the priors to five significant digits. The original (unsplined) priors are graphed in Figure 4.</Paragraph>
    <Paragraph position="4">  For the presample, we randomly chose ten documents from each stratum (with replacement) and read and scored them. The presample results were perfect -- all apparent pseudo-documents were pseudo-documents, and all apparent real documents were real. We applied Bayes' theorem to calculate the posterior density for each stratum, multiplying the binomial likelihood function associated with the stratum's presample by the relevant prior density, and normalizing.</Paragraph>
    <Paragraph position="5"> With these posteriors in hand, we were ready to determine the optimum allocation among the strata. Newbold (1971) gives the fraction q/allocated to each stratum i by</Paragraph>
    <Paragraph position="7"> where k is the number of strata, Ci is the cost of sampling a stratum (assumed here to be 1), n i  is the number of documents in the presample for the stratum, and Ai is Ai Hi 2 Pi (1-Pi) = (ni+2) (3) where Hi is the fraction of the overall population that comes from the ith stratum, and Pi is the population mean for the posterior density in the ith stratum. The outcome of this procedure was an allocation of 15 apparent pseudo-documents and 185 apparent real documents.</Paragraph>
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.2 Posteriors for each stratum
</SectionTitle>
      <Paragraph position="0"> Having already sampled ten documents from each stratum, we now sampled an additional 5 apparent pseudo-documents and 175 apparent real documents to make up the full sample.</Paragraph>
      <Paragraph position="1"> We chose documents randomly with replacement and judged each document subjectively as above. To our surprise (knowing that the stratification was error-prone), this sampling again gave perfect results: all apparent pseudo-documents were pseudo-documents, and all apparent real documents were real.</Paragraph>
      <Paragraph position="2"> We applied Bayes' theorem a final time to derive a new posterior probability density for each stratum based on the results of the full sample. For each stratum, we multiplied the binomial likelihood function corresponding to the full sampling results (0/15 and 185/185) by the prior probability density for each stratum (i.e., the posterior density from the presample), then normalized.</Paragraph>
      <Paragraph position="3"> 4~3 Combining the results: Monte Carlo simulation The final step was to combine the two postedors to obtain an estimate and credibility interval for the population as a whole. The traditional approach would be to find the mean and variance for each stratum's posterior and combine these according to each stratum's weight in the population. Newbold (1971) k 'bi gives the weighted mean as i=~l~i n'-i' where bi is the number of real documents found in stratum i out of ni sampled. As an altemative technique, we used a Monte Carlo simulation (Shreider (1966)) to compute the density of k the fraction of real documents p = ~ Hi Pi. i=1 We then used this density to provide a final estimate and a corresponding credibility interval. The Monte Carlo simulation combined the two posteriors in proportion to the incidence of real and pseudo-documents in the Federal Register corpus. Real documents constituted 0.925 of the corpus, and pseudo-documents the remaining 0.075. To perform the simulation, we randomly sampled both posterior densities a million times. For each pair of points picked, we determined the weighted average of the two points, and incremented the value of the corresponding point on the overall density by 10 -6 , or one millionth. For example, if we picked 0.2 from the posterior for apparent pseudo-documents and 0.9 from the posterior for apparent real documents, then we would increment the value of 0.8475 (0.2*0.075 + 0.9*0.925) in the overall density by 10 -6 . At the end of the simulation, the total area of the density was 1.0.</Paragraph>
      <Paragraph position="4"> The resulting overall density is graphed in Figure 5 along with the posteriors. Since the corpus mostly contained apparent real documents, the combined density was closer to that straatum's density.</Paragraph>
      <Paragraph position="5"> Using the same method as in section 3.3, we then found the exact 95% credibility interval for the combined density. The results, summarized in Table 2, show a better than 3:1 reduction from the overall sample, from 3138 to 919 documents. Table 2 also shows the results obtained using a non-informative prior -that is, based on the sampled results alone, without any specific prior expectations. Here we clearly see the benefit of vigorously applying the Bayesian approach, as the prior knowledge helps reduce the credibility interval by seven-tenths of a percent, or 325 documents.</Paragraph>
      <Paragraph position="6">  By sampling 200 documents, stratified according to observed document characteristics with a Bayesian version of Neyman allocation, we have addressed the question of how many Federal Register documents are useful documents that reflect the activities of the Federal  government. The answer was a credibility interval between 91% and 93%, or between 41,768 and 42,687 documents. This was a substantially tighter estimate than could be obtained using either an overall sample, or a stratified sample without prior expectations.</Paragraph>
      <Paragraph position="7"> This estimate was probably tight enough to be useful in applications such as comparing the utility of different corpora. If higher precision were called for, the simplest way to further narrow the credibility interval would be to increase the sample size. In a follow-on experiment, it took less than a half hour to read an additional 200 documents (this turned up two incorrectly stratified documents, confirming our expectations from exploratory analysis).</Paragraph>
      <Paragraph position="8"> The new data sharpened the posteriors, reducing the combined credibility interval to 624 documents, or 1.3 percentage points. Further reductions could be obtained as desired.</Paragraph>
      <Paragraph position="9"> A final topic to address is When and how our technique may be used. What types of questions are likely to be addressed, and what are the implementation issues involved?  We see two likely types of questions. A question may be asked for its own sake, as in this paper or Thomas et al. (1995). Looking further at the Federal Register corpus, other feasible questions using our method come to mind, such as: * Has the amount of attention paid to the environment by the Federal government increased? * What proportion of Federal affairs involve the state of New Mexico? Users of other corpora could likewise pose questions relevant to their own interests.</Paragraph>
      <Paragraph position="10"> A question could also be asked not for its own sake, but to establish a baseline statistic for information retrieval OR) recall rates. Recall is the percentage of relevant documents fiar a query that an IR system actually finds. To establish recall, one must know how many relevant documents exist. The standard technique for estimating this number is &amp;quot;pooling&amp;quot;: identifying relevant documents from among those returned by all IR systems involved in a comparison. This method is used by the TREC program (Voorhees and Harman (1997)). Our method is a principled alternative to this method that is well-grounded in statistical theory, and, unlike pooling, is independent of any biases present in current IR systems.</Paragraph>
      <Paragraph position="11"> Applying the method to a new question, whether for its own sake or to determine recall, involves developing a stratification test, constructing a prior density for each stratum, performing the presample and full samples, and combining the results. Of these steps, stratification is the most important in reducing the credibility interval. In our work to date we have achieved good results with stratification tests that are conceptually and computationally simple. We suspect that when asking multiple questions of the same corpus, it may even be possible to automate the construction of stratification scripts. Priors are easiest to construct if the strata are clean and well-understood.</Paragraph>
      <Paragraph position="12"> The appropriate amount of time to invest in refining a stratification test and the associated priors depends on the cost of evaluating documents and the importance of a small credibility interval. If documents are easy to evaluate, one might choose to put less time into stratification and priors construction, and reduce the credibility interval by increasing sample size. If one is restricted to a small sample, then accurate stratification and good priors are more important. If one requires an extremely tight confidence interval, then careful stratification and prior construction, and a generous sample, are all recommended.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>