<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0409">
  <Title>Exceptionality and Natural Language Learning</Title>
  <Section position="4" start_page="2" end_page="5" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> In Section 5.1 we reproduce the editing and comparison experiments from the previous study to see whether their results generalize to our tasks. In Section 5.2 we move to our next goal: characterizing the learners' performance using exceptionality measures. Both learners were run with default parameter settings. (We performed parameter tuning experiments for both predictors: for every fold of a 10-fold cross validation, part of the training set was held out as a validation set for tuning. The tuned parameters depended on the fold used, and tuning brought no clear gain in accuracy; in some cases it even hurt accuracy. Integrating tuned parameters with our leave-one-out experiments also raises additional problems.)</Paragraph>
    <Paragraph position="2"> 5.1 Natural language learning and memory-based learning
First, we performed the editing experiments from the previous study. The purpose of those experiments was to assess the impact of editing exceptional and typical instances on the accuracy of the memory-based learner. Because our datasets are small, we performed the editing experiment on all partitions of a 10-fold cross validation, unlike the previous study, which performed editing only on the first train-test partition. For every fold, we edited 0, 1, 2, 5, 10, 20, 30, 40 and 50% of the training set based on extreme values of each of our exceptionality criteria. The accuracy after editing a given percentage was averaged over all folds (accuracies differ significantly among folds, but every fold exhibits the same trend as the average).</Paragraph>
    <Paragraph position="3"> Figure 2 shows our results for the ISCORR dataset using six types of editing (editing based on low and high values of each of the three criteria). In contrast with the previous study, where even the smallest amount of editing led to significant accuracy decreases for all tasks, for our task there was no clear decrease in performance. Moreover, for some criteria (such as low local typicality) we even see an initial increase in performance. Only after editing half of the training set is there a clear decrease in performance for all editing criteria on this task.</Paragraph>
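    <Paragraph> The editing procedure can be sketched as follows. This is a minimal illustration only: it assumes a user-supplied exceptionality scoring function (standing in for any of the three criteria) and uses scikit-learn's 1-nearest-neighbor classifier as a stand-in for the IB1-IG memory-based learner actually used in our experiments; editing on high values is obtained by reversing the ranking.</Paragraph>
    <Paragraph>
# Sketch of the editing experiment: in every fold of a 10-fold cross
# validation, remove the p% of training instances with the most extreme
# (here: lowest) exceptionality scores, retrain, and average test accuracy.
# `exceptionality(X, y, i)` is a placeholder for one of the three criteria.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def edited_accuracy(X, y, exceptionality, percents=(0, 1, 2, 5, 10, 20, 30, 40, 50)):
    results = {p: [] for p in percents}
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        scores = np.array([exceptionality(X_tr, y_tr, i) for i in range(len(y_tr))])
        order = np.argsort(scores)                    # lowest scores first
        for p in percents:
            keep = order[int(len(order) * p / 100):]  # edit out the lowest p%
            clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr[keep], y_tr[keep])
            results[p].append(clf.score(X_te, y_te))
    return {p: float(np.mean(acc)) for p, acc in results.items()}
    </Paragraph>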
    <Paragraph position="4"> Editing experiments for the other dataset-feature set combinations yield similar results.</Paragraph>
    <Paragraph position="5"> Next, we compared the memory-based learner with our abstraction-based learner on all tasks. Since the datasets are relatively small, we performed leave-one-out cross validations. Table 1 summarizes our results; the baseline used is the majority-class baseline. First, we ran the predictors on all tasks using all features. In contrast with the previous study, which favored the memory-based learner for almost all of its tasks, our results favor IB1-IG for only two of the four tasks (ISCORR and STATUS). In Section 4 we mentioned that the typicality range for our tasks is very small compared with the previous study. Contrary to what we expected, the tasks where IB1-IG performed better were the ones with the smaller typicality range. To investigate the impact of the typicality range on our predictors, we tried to make our datasets more similar to the datasets from the previous study by changing the feature set. We eliminated all numeric features (since the tasks from the previous study had none) and repeated the experiments on the tasks with the smaller typicality range (again, ISCORR and STATUS). Even though the typicality range increased and there were no numeric features, IB1-IG again performed worse than Ripper; moreover, its error rate on both tasks increased when using only non-numeric features compared with the error rate when using all features. This observation led us to assume that, at least for IB1-IG, some of the features relevant for classification were numeric ones that were no longer present in the feature set. Thus, we selected two sets of features (First9 and First15) based on the features' relevance and repeated the experiments on the ISCORR dataset. [Table: error rate on some of our dataset-feature set combinations.] We observe that as the number of relevant features increases, the error rate of both predictors and the typicality range decrease, and IB1-IG takes the lead when the First15 feature set is used. Our results indicate that which predictor performs better depends on the task, the number of features and the type of features used.</Paragraph>
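    <Paragraph> The comparison setup can be sketched as follows. This is only an illustration: scikit-learn's 1-nearest-neighbor and decision-tree classifiers are used as stand-ins for the IB1-IG and Ripper learners actually compared in Table 1, and the majority-class baseline is obtained with a dummy classifier.</Paragraph>
    <Paragraph>
# Sketch of the leave-one-out comparison against a majority-class baseline.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

def loo_error_rates(X, y):
    learners = {
        "baseline (majority class)": DummyClassifier(strategy="most_frequent"),
        "memory-based (IB1-IG stand-in)": KNeighborsClassifier(n_neighbors=1),
        "rule-based (Ripper stand-in)": DecisionTreeClassifier(random_state=0),
    }
    # Error rate = 1 - mean leave-one-out accuracy.
    return {name: 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
            for name, clf in learners.items()}
    </Paragraph>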
    <Paragraph position="6"> To explore why the previous study's results do not generalize to our case, we plan to replicate these experiments on the dialog-act tagging task on the Switchboard corpus, a task more similar to the previous study's tasks in size and feature types, but still in the area of spoken dialog systems (see Shriberg et al. (1998)).</Paragraph>
    <Section position="1" start_page="2" end_page="5" type="sub_section">
      <SectionTitle>
5.2 Characterizing learners' performance using exceptionality measures
</SectionTitle>
      <Paragraph position="0"> The next goal of our study was to see whether we can characterize the performance of our predictors on the various classes of instances defined by our exceptionality criteria. In other words, we wanted to answer questions like: is IB1-IG better than Ripper at predicting exceptional instances? How about typical instances? Can we combine the two learners and select between them based on instance exceptionality? To answer these questions, we performed the leave-one-out experiments described above and recorded, for every instance, whether each predictor classified it correctly. Next, we computed the exceptionality of every instance using all three measures. Figure 3 shows the exceptionality distribution using the typicality measure for the ISCORR dataset with all features; the distributions for all instances in the ISCORR dataset, for the instances correctly predicted by IB1-IG, and for the instances correctly predicted by Ripper are plotted in the figure. The graph shows that for this dataset there are many boundary instances, very few exceptional instances and few typical instances. The typicality range for all our datasets (usually between 0.85 and 1.15) is far narrower than the one from the previous study (0.43 up to 10 or even 3500). According to Zhang (1992), hard concepts are often characterized by a small typicality spread; moreover, a small typicality spread is associated with low prediction accuracy.</Paragraph>
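      <Paragraph> To make the typicality measure concrete, the sketch below computes a Zhang-style typicality for every instance, assuming the ratio form (average similarity to same-class instances divided by average similarity to other-class instances) with similarity taken as one minus a normalized Euclidean distance. The exact distance function defined earlier in the paper is not reproduced here, so this is only an approximation of our actual criterion.</Paragraph>
      <Paragraph>
# Sketch: ratio-style typicality per instance (values near 1 = boundary,
# well below 1 = exceptional, well above 1 = typical). Euclidean distance
# normalized to [0, 1] is a placeholder for the paper's own metric.
import numpy as np
from scipy.spatial.distance import cdist

def typicality(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    D = cdist(X, X)
    S = 1.0 - D / D.max()                 # pairwise similarities in [0, 1]
    t = np.empty(len(y))
    for i in range(len(y)):
        same = (y == y[i])
        same[i] = False                   # exclude the instance itself
        t[i] = S[i, same].mean() / S[i, y != y[i]].mean()
    return t
      </Paragraph>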
      <Paragraph position="1"> Figure 4 plots the percentage of the instances with typicality within a given interval that were correctly classified by each of the predictors. [Figure 4: IB1-IG and Ripper accuracy based on instance typicality (ISCORR dataset with all features).] We can observe that the accuracy of both predictors increases with typicality. That is, the more typical the instance, the more reliable the prediction; the more exceptional the instance, the less reliable the prediction. This observation holds for all our dataset-feature set combinations. For the ISCORR dataset it is not clear whether one predictor is better than the other based on typicality. But for the CABIN and WERBIN datasets, where IB1-IG overall did worse than Ripper, the same kind of graph (see Figure 5) shows that IB1-IG's accuracy is worse than Ripper's when predicting low-typicality instances. (It was not our goal to investigate the statistical significance of this trend; as we will see later, the trend is strong enough to yield interesting results when combining the predictors based on exceptionality measures.) Given the problems with typicality when the concepts we want to learn are clustered, we decided to investigate whether this observation holds for the other exceptionality measures.</Paragraph>
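      <Paragraph> Curves like those in Figures 4-7 can be produced by binning instances on an exceptionality score and computing the per-bin accuracy of each predictor, as in the sketch below. The correctness flags are the ones recorded during the leave-one-out runs; the quantile-based bin edges are illustrative, not the exact binning used for our figures.</Paragraph>
      <Paragraph>
# Sketch: per-bin accuracy of two predictors as a function of an
# exceptionality score (typicality, CPS or local typicality).
# `correct_a` / `correct_b` are boolean arrays from the leave-one-out runs.
import numpy as np

def accuracy_by_exceptionality(scores, correct_a, correct_b, n_bins=10):
    scores = np.asarray(scores, dtype=float)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(scores, edges[1:-1])          # bin index 0 .. n_bins-1
    curves = []
    for b in range(n_bins):
        mask = (bins == b)
        if mask.any():
            curves.append((edges[b], correct_a[mask].mean(), correct_b[mask].mean()))
    return curves   # (bin lower edge, accuracy of predictor A, accuracy of predictor B)
      </Paragraph>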
      <Paragraph position="3"> We continued the experiments with the other exceptionality measures, hoping to get more insight into the trend observed for typicality. Indeed, Figure 6 (the same as Figure 4 but using CPS instead of typicality) shows the same trend: IB1-IG is worse than Ripper when predicting exceptional instances and better when predicting typical instances. The accuracy curves of the two predictors seem to cross at a CPS value of 0.5, which corresponds to boundary instances. Undefined CPS values (0/0) are assigned a value above 1 (the rightmost point on the graph); Ripper offered higher accuracy in predicting instances with undefined CPS values for almost all datasets (although not in Figure 6). The result holds for all our dataset-feature set combinations.</Paragraph>
      <Paragraph position="4"> The experiments with local typicality yield the same results: Ripper consistently outperforms IB1-IG on exceptional instances, and they switch places on typical instances (see Figure 7). Again, the accuracy curves cross at boundary instances (a local typicality value of 1), and the same observation holds for all dataset-feature set combinations.</Paragraph>
      <Paragraph position="5"> [Figure 7: IB1-IG and Ripper accuracy based on instance local typicality (ISCORR dataset with all features).] Abrupt movements in the curves are caused by the small number of instances in that class; we expect that a larger dataset would smooth our graphs.</Paragraph>
      <Paragraph position="6"> We computed the reduction in error rate that could be obtained if we employed both predictors and decided between them based on an instance exceptionality measure. In other words, Ripper's prediction was used for exceptional instances and for the left-hand-side boundary instances (CPS less than 0.5; typicality less than 1; local typicality less than 1); otherwise IB1-IG's prediction was used. The best possible reduction is obtained when we know perfectly which of the predictors offers the correct prediction (in other words, the error rate is then the number of instances for which both learners make wrong predictions). Figure 8 plots the reduction in error rate achieved when deciding between the predictors based on typicality, CPS, local typicality and perfect discrimination; the reduction is relative to the best performer on that task. While discriminating based on typicality offered no improvement relative to the best performer, CPS consistently achieved an improvement, and local typicality improved the result in six out of eight cases. CPS decreased the error rate of the best performer by 1.33% to 3.18% (absolute). In the cases where it did improve accuracy, local typicality offered a larger improvement than CPS, decreasing the error rate by up to 4.94% (absolute). A possible explanation of this difference is that local typicality captures much more information than CPS (vicinity-level information compared with information very close to the instance).</Paragraph>
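      <Paragraph> A minimal sketch of this combination scheme and of the perfect-discrimination bound is given below. It assumes the predictions and the gold labels are already available from the leave-one-out runs; the threshold is 0.5 for CPS and 1.0 for typicality and local typicality.</Paragraph>
      <Paragraph>
# Sketch: use Ripper's prediction for exceptional / left-of-boundary
# instances (score below the threshold), IB1-IG's prediction otherwise,
# and compute the oracle error rate (instances both learners get wrong).
import numpy as np

def combined_error_rate(scores, pred_ripper, pred_ib1ig, gold, threshold):
    scores, gold = np.asarray(scores), np.asarray(gold)
    pred_ripper, pred_ib1ig = np.asarray(pred_ripper), np.asarray(pred_ib1ig)
    combined = np.where(scores < threshold, pred_ripper, pred_ib1ig)
    error = np.mean(combined != gold)
    oracle_error = np.mean((pred_ripper != gold) & (pred_ib1ig != gold))
    return error, oracle_error
      </Paragraph>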
      <Paragraph position="7"> In summary, all our exceptionality measures show the same trend in predictive ability: Ripper performs better than IB1-IG on exceptional instances, while IB1-IG performs better than Ripper on typical instances. While the fact that IB1-IG does better on typical instances may be linked to its ability to handle subregularities, we have no interpretation for the fact that Ripper does better on exceptional instances. We plan to address this in future work by looking at the distance between exceptional instances and the instances that generated the rule that made the correct prediction for those exceptional instances.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
5.3 Current directions
</SectionTitle>
      <Paragraph position="0"> The previous section showed that we can improve the overall accuracy on our datasets if we combine the predictions generated by our learners based on the exceptionality measure of the new instance. Unfortunately, all our exceptionality measures require the class of the instance to be known. Moreover, for binary classification tasks, since all the exceptionality criteria are ratios, changing the instance's class turns an exceptional instance into a typical one.</Paragraph>
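      <Paragraph> As a concrete (made-up) illustration of this last point, assuming the ratio form of the criteria: if a two-class instance has an average similarity of 0.6 to its labeled class and 0.9 to the other class, its typicality is 0.6 / 0.9, roughly 0.67, i.e. exceptional; relabeling it with the other class simply swaps numerator and denominator, giving 0.9 / 0.6 = 1.5, i.e. typical. The offline measures therefore cannot be applied directly to an instance whose class is unknown.</Paragraph>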
      <Paragraph position="1"> To move our results from offline to online, we considered interpolating the exceptionality value of an instance from the exceptionality values of its neighbors in the training set. We performed a very simple interpolation, using the exceptionality value of the closest neighbor according to the distance in equation (1).</Paragraph>
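      <Paragraph> A minimal sketch of this nearest-neighbor interpolation is given below; plain Euclidean distance stands in for the distance of equation (1), and the training-set exceptionality values are assumed to have been computed offline.</Paragraph>
      <Paragraph>
# Sketch: a new instance inherits the exceptionality value of its closest
# training neighbor (Euclidean distance as a stand-in for equation (1)).
import numpy as np
from scipy.spatial.distance import cdist

def interpolate_exceptionality(X_train, train_scores, X_new):
    nearest = cdist(np.asarray(X_new, dtype=float),
                    np.asarray(X_train, dtype=float)).argmin(axis=1)
    return np.asarray(train_scores)[nearest]
      </Paragraph>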
      <Paragraph position="2"> While the previous observations are no longer obvious in the online graphs (there is no clear crossing at boundary instances), there is a small improvement over the best predictor. Figure 9 shows that, even for this simple interpolation, there is in almost all cases a small reduction in error rate relative to the best performer. We are currently investigating more complicated interpolation strategies, such as learning a model from the training set that predicts the exceptionality value of an instance from its closest neighbors.</Paragraph>
    </Section>
  </Section>
</Paper>