<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0908">
  <Title>Using the Distribution of Performance for Studying Statistical NLP Systems and Corpora</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Bootstrap Method
</SectionTitle>
    <Paragraph position="0"> The bootstrap is a re-sampling technique designed for obtaining empirical distributions of estimators. It can be thought of as a smoothed version of k-fold cross-validation (CV). The method has been applied to decision tree and bayesian classi ers by Kohavi (1995) and to neural networks by, e.g., LeBaron and Weigend (1998).</Paragraph>
    <Paragraph position="1"> In this paper, we use the bootstrap method to obtain the distribution of performance of a system which learns to identify non-recursive noun-phrases (base-NPs). While there are a few re nements of the method, the intention of this paper is to present the bene ts of obtaining distributions, rather than optimising bias or variance. We do not aim to study the properties of bootstrap estimation.</Paragraph>
    <Paragraph position="2"> Let a statistic S = S(x1;:::;xn) be a function of the independent observations fxigni=1 of a statistical variable X. The bootstrap method constructs the distribution function of S by successively re-sampling x with replacements. null After B samples, we have a set of bootstrap samples fxb1;:::;xbngBb=1, each of which yields an estimate ^Sb for S. The distribution of ^S is the bootstrap estimate for the distribution of S. That distribution is mostly used for estimating the standard deviation, bias, or con dence interval of S.</Paragraph>
    <Paragraph position="3"> In the present work, xi are the base-NP instances in a given corpus, and the statistic S is the recall on a test set.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> The aim of our experiments is to test whether the recall distribution can be helpful in answering the questions Q1{Q3 mentioned in the introduction of this paper.</Paragraph>
    <Paragraph position="1"> The data and learning algorithms are presented in Sections 4.1 and 4.2. Section 4.3 describes the sampling method in detail. Section 4.4 motivates the use of recall and describes the experiments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Data
</SectionTitle>
      <Paragraph position="0"> We used Penn-Treebank (Marcus et al., 1993) data, presented in Table 1. Wall-Street Journal (WSJ) Sections 15-18 and 20 were used by Ramshaw and Marcus (1995) as training and test data respectively for evaluating their base-NP chunker. These data have since become a standard for evaluating base-NP systems. null The WSJ texts are economic newspaper reports, which often include elaborated sentences containing about six base-NPs on the  average.</Paragraph>
      <Paragraph position="1"> The ATIS data, on the other hand, are a collection of customer requests related to ight schedules. These typically include short sentences which contain only three base-NPs on the average. For example: I have a friend living in Denver that would like to visit me here in Washington DC .</Paragraph>
      <Paragraph position="2"> The structure of sentences in the ATIS data di ers signi cantly from that in the WSJ data. We expect this di erence to be re ected in the recall of systems tested on both data sets.</Paragraph>
      <Paragraph position="3"> The small size of the ATIS data can in uence the results as well. To distinguish the size e ect from the structural di erences, we drew two equally small samples from WSJ Section 20. These samples, WSJ20a and WSJ20b, consist of the rst 100 and the following 93 sentences respectively. There is a slight di erence in size because sentences were kept complete, as explained Section 4.3.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Learning Algorithms
</SectionTitle>
      <Paragraph position="0"> We evaluated base-NP learning systems based on two algorithms: MBSL (Argamon et al., 1999) and SNoW (Mu~noz et al., 1999).</Paragraph>
      <Paragraph position="1"> MBSL is a memory-based system which records, for each POS sequence containing a border (left, right, or both) of a base-NP, the number of times it appears with that border vs. the number of times it appears without it. It is possible to set an upper limit on the length of the POS sequences.</Paragraph>
      <Paragraph position="2"> Given a sentence, represented by a sequence of POS tags, the system examines each sub-sequence for being a base-NP. This is done by attempting to tile it using POS sequences that appeared in the training data with the base-NP borders at the same locations.</Paragraph>
      <Paragraph position="3"> For the purpose of the present work, su ce it to mention that one of the parameters is the context size (c). It denotes the maximal number of words considered before or after a base-NP when recording sub-sequences containing a border.</Paragraph>
      <Paragraph position="4"> SNoW (Roth, 1998, \Sparse Network of Winnow&amp;quot;) is a network architecture of Winnow classi ers (Littlestone, 1988). Winnow is a mistake-driven algorithm for learning a linear separator, in which feature weights are updated by multiplication. The Winnow algorithm is known for being able to learn well even in the presence of many noisy features.</Paragraph>
      <Paragraph position="5"> The features consist of one to four consecutive POSs in a 3-word window around each POS. Each word is classi ed as a beginning of a base-NP, as an end, or neither.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Sampling Method
</SectionTitle>
      <Paragraph position="0"> In generating the training samples we sampled complete sentences. In MBSL, an un-marked boundary may be counted as a negative example for the POS-subsequences which contains it. Therefore, sampling only part of the base-NPs in a sentence will generate negative examples.</Paragraph>
      <Paragraph position="1"> For SNoW, each word is an example, but most of the words are neither a beginning nor an end of a base-NP. Random sampling of words might generate a sample with an improper balance between the three classes.</Paragraph>
      <Paragraph position="2"> To avoid these problems, we sampled full sentences instead of words or instances. Within a good approximation, it can be assumed that base-NP patterns in a sentence do not correlate. The base-NP instances drawn from the sampled sentences can therefore be regarded as independent.</Paragraph>
      <Paragraph position="3"> As described at the end of Sec. 4.1, the WSJ20a and WSJ20b data were created so that they contain 613 instances, like the ATIS data. In practice, the number of instances exceeds 613 slightly due to the full-sentence constraint. For the purpose of this work, it is enough that their size is very close to the size  unique sentences in the training data.</Paragraph>
      <Paragraph position="4"> We used the WSJ15-18 dataset for training. This dataset contains n0 = 54760 base-NP instances. The number of instances in a bootstrap sample depends on the number of instances in the last sampled sentence. As Table 2 shows, it is slightly more than n0.</Paragraph>
      <Paragraph position="5"> For k-CV sampling, the data were divided into k random distinct parts, each containing  k 2 instances. Table 3 shows the number ofrecall samples in each experiment (MBSL and SNoW experiments were carried out seperately). null</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Experiments
</SectionTitle>
      <Paragraph position="0"> We trained SNoW and MBSL; the latter using context sizes of c=1 and c=3. Data sets WSJ20, ATIS, WSJ20a, and WSJ20b were used for testing. MBSL runs with the two c values were conducted on the same training samples, therefore it is possible to compare their results directly.</Paragraph>
      <Paragraph position="1"> Each run yielded recall and precision. Recall may be viewed as the expected 0-1 lossfunction on the given test sample of instances. Precision, on the other hand, may be viewed as the expected 0-1 loss on the sample of instances detected by the learning system. Care should be taken when discussing the distribution of precision values because this sample varies from run to run. We will therefore only analyse the distribution of recall in this work. In the following, r1 and r3 denote recall samples of MBSL with c = 1 and c = 3, with standard deviations 1 and 3. 13 denotes the cross-correlation between r1 and r3. SNoW recall and standard deviation will be denoted by rSN and SN.</Paragraph>
      <Paragraph position="2"> To approach the questions raised in the introduction we made the following measurements: null Q1: System comparison was addressed by comparing r1 and r3 on the same test data.</Paragraph>
      <Paragraph position="3"> With samples at hand, we obtained an estimate of P(r3 &gt;r1).</Paragraph>
      <Paragraph position="4"> Q2: We studied training and test adequacy through the e ect of more speci c features on recall, and on its standard deviation.</Paragraph>
      <Paragraph position="5"> Setting c = 3 takes into account sequences with context of two and three words in addition to those with c = 1. Sequences with larger context are more speci c, and an improvement in recall implies that they are informative in the test data as well.</Paragraph>
      <Paragraph position="6"> For particular choices of parameters and test data, the recall spread yields an estimate of the training sampling noise. On inadequate data, where the statistics di er significantly from those in the training data, even small changes in the model can lead to a noticeable di erence in recall. This is because the model relies on statistics which appear relatively rarely in the test data. Not only do these statistics provide little information about the problem, but even small di erences in weighting them are relatively in uential.</Paragraph>
      <Paragraph position="7"> Therefore, the more training and test data di er from each other, the more spread we can expect in results.</Paragraph>
      <Paragraph position="8"> Q3: For comparing test data sets with a system, we used cross-correlations betweenr1, r3, orrSN samples obtained on these data sets. We know that WSJ data are di erent from ATIS data, and so expect the results on WSJ to correlate with ATIS results less than with other WSJ results.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML