<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0908">
  <Title>Using the Distribution of Performance for Studying Statistical NLP Systems and Corpora</Title>
  <Section position="7" start_page="0" end_page="0" type="concl">
    <SectionTitle>
6 Summary and Further Research
</SectionTitle>
    <Paragraph position="0"> In this work, we used the distribution of recall to address questions concerning base-NP learning systems and corpora. Two of these questions, of training and test adequacy, and of comparing data sets using NLP systems, were not addressed before.</Paragraph>
    <Paragraph position="1"> The recall distributions were obtained using CV and bootstrap resampling.</Paragraph>
    <Paragraph position="2"> We found di erences between algorithms with similar recall, related to the features they use.</Paragraph>
    <Paragraph position="3"> We demonstrated that using an inadequate test set may lead to noisy performance results. This e ect was observed with two di erent learning algorithms. We also reported a case when changing a parameter of a learning algorithm improved results on one dataset but degraded results on another.</Paragraph>
    <Paragraph position="4"> We used classi ers as \similarity rulers&amp;quot;, for producing a similarity measure between datasets. Classi ers may have various properties as similarity rulers, even when their recalls are similar. Each classi er should be scaled di erently according to its noise level. This demonstrates the way we can use classi ers to study data, as well as use data to study classi ers.</Paragraph>
    <Paragraph position="5">  CV. Correlations of r1 capture dataset similarity in the best way. By using MBSL with di erent context sizes, our results provide insights into the relation between training and test data sets, in terms of general and speci c features. That issue becomes important when one plans to use a system trained on certain data set for analysing an arbitrary text. Another approach to this topic, examining the e ect of using lexical bigram information, which is very corpusspeci c, appears in (Gildea, 2001).</Paragraph>
    <Paragraph position="6"> In our experiments with systems trained on WSJ data, there was a clear di erence between their behaviour on other WSJ data and on the ATIS data set, in which the structure of base-NPs is di erent. That di erence was observed with correlations and standard deviations. This shows that resampling the training data is essential for noticing these structure di erences.</Paragraph>
    <Paragraph position="7"> To control the e ect of small size of the ATIS dataset, we provided two equally-small WSJ data sets. The e ect of di erent genres was stronger than that of the small-size.</Paragraph>
    <Paragraph position="8"> In future study, it would be helpful to study the distribution of recall using training and test data from a few genres, across genres, and on combinations (e.g. \known-similarity corpora&amp;quot; (Kilgarri and Rose, 1998)). This will provide a measure of the transferability of a model.</Paragraph>
    <Paragraph position="9"> We would like to study whether there is a relation between bootstrap and 2 or 3-CV results. The average number of unique base-NPs in a random bootstrap training sample is about 63% of the total training instances (Table 2). That corresponds roughly to the size of a 3-CV training sample. More work is required to see whether this relation between bootstrap and low-fold CV is meaningful.</Paragraph>
    <Paragraph position="10"> We also plan to study the distribution of precision. As mentioned in Sec. 4.4, the precisions of di erent runs are now taken from di erent sample spaces. This makes the bootstrap estimator unsuitable, and more study is required to overcome this problem.</Paragraph>
  </Section>
class="xml-element"></Paper>