File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/01/w01-0908_evalu.xml
Size: 3,674 bytes
Last Modified: 2025-10-06 13:58:46
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0908"> <Title>Using the Distribution of Performance for Studying Statistical NLP Systems and Corpora</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results and Discussion </SectionTitle> <Paragraph position="0"> For each of the five test datasets, Table 4 reports averages and standard deviations of r1, r3, and rSN obtained by 3, 5, 10, and 20-fold cross-validation, and by bootstrap. ρ13 and P(r3 > r1) are reported as well.</Paragraph> <Paragraph position="1"> We discuss our results by considering to what extent they provide information for answering the three questions:</Paragraph> <Paragraph position="3"> For the WSJ data sets, the difference between r3 and r1 was well above their standard deviations, and r3 > r1 nearly always. For ATIS, the standard deviation of the difference (with variance $\sigma^2_{r_3 - r_1} = \sigma_1^2 + \sigma_3^2 - 2\sigma_1\sigma_3\rho_{13}$) was small due to the high ρ13, and r1 > r3 nearly always.</Paragraph> <Paragraph position="4"> Q2 – The adequacy of training and test sets: It is clear that adding more specific features, by increasing the context, improved recall on the WSJ test data and degraded it on the ATIS data. This is likely to be an indication of the difference in syntactic structure between ATIS and WSJ texts.</Paragraph> <Paragraph position="5"> Further evidence of structural difference comes from the standard deviations. The spread of the ATIS results always exceeded that of the WSJ results, in all three experiments.</Paragraph> <Paragraph position="6"> That difference cannot be solely attributed to the small size of ATIS, since WSJ20a and WSJ20b results displayed a much smaller spread. Indeed, these results had a wider standard deviation than WSJ20, probably due to the smaller size, but not as wide as ATIS. 
This indicates that base-NPs in ATIS text have different characteristics than those in WSJ texts.</Paragraph> <Paragraph position="7"> Q3 – Comparing datasets by a system: Table 5 reports, for each pair of datasets, the correlation between the 5-fold CV recall samples of each experiment on these datasets. The correlations change with the number of CV folds; the 5-fold results were chosen because they represent intermediate values.</Paragraph> <Paragraph position="8"> Both MBSL experiments yielded negligible correlations of ATIS results with any WSJ data set, whether large or small. These correlations were always weaker than with WSJ20a and WSJ20b, which are about the same size.</Paragraph> <Paragraph position="9"> This is due to ATIS being a different kind of text. The correlation between WSJ20a and WSJ20b results was also weak. This may be due to their small sizes; these texts might not share enough features to produce a significant correlation.</Paragraph> <Paragraph position="10"> SNoW results were highly correlated for all pairs. That behaviour is markedly different from the MBSL results, and indicates a high level of noise in the SNoW features. Indeed, Winnow is able to learn well in the presence of noise, but that noise causes the high correlations observed here.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Further Observations </SectionTitle> <Paragraph position="0"> The decrease of ρ13 with CV fold number is related to stabilization of the system. As the folds become larger, training samples become more similar to each other, and the spread of results decreases. This effect was not visible in the SNoW data, most likely due to the high level of noise in the features. This noise also contributes to the higher standard deviation of SNoW results.</Paragraph> </Section> </Section> </Paper>