<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0409">
  <Title>Exceptionality and Natural Language Learning</Title>
  <Section position="3" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Language learning tasks
</SectionTitle>
    <Paragraph position="0"> The tasks we will be using in our study come from the area of spoken dialog systems (SDS). They were all designed as methods for potentially improving the dialog manager of a SDS system called TOOT (Litman and Pan, 2002). This system provides access to train information from the web via telephone and it was developed for the purpose of comparing differences in dialog strategy. null Our tasks are: (1) Identifying user corrections (ISCORR), (2) Identifying correction-aware sites (STATUS), (3) Identifying concept-level speech recognition errors (CABIN) and (4) Identifying word-level speech recognition errors (WERBIN). The first task is a binary classification task that labels each user turn as to whether or not it is an attempt from the user to correct a prior system recognition failure. The second task is a 4way classification task that extends the previous one with whether or not the user is aware the system made a recognition error. The four classes are: normal user turn, user only tries to correct the system, user is only aware of a system recognition error, and user is both aware of and tries to correct the system error. The third and the fourth tasks are binary classification tasks that try to predict the system speech recognition accuracy when recognizing a user turn. CABIN measures a binary version of the Concept Accuracy (percent of semantic concepts recognized correctly) while WERBIN measures a binary version of the Word Error Rate (percent of words recognized incorrectly).</Paragraph>
    <Paragraph position="1"> Data for our tasks was gathered from a corpus of 2,328 user turns from 152 dialogues between human subjects and TOOT. The features used to represent each user turn include prosodic information, information from the automatic speech recognizer, system conditions and dialog history. Then, each user turn was labeled with respect to every classification task. Even though our classification tasks share the same data, there are clear differences between them. ISCORR and STATUS both deal with user corrections which is quite different from predicting speech recognition errors (handled in WERBIN and CABIN). Moreover, one will expect very little noise or no noise at all when manually annotating WERBIN and CABIN. For more information on our tasks and features, see (Litman et al., 2000; Hirschberg et al., 2001; Litman et al., 2001).</Paragraph>
    <Paragraph position="2"> There are a number of dimensions where our tasks differ from the tasks from the previous study. First of all our datasets are smaller (2,328 instances compared with at least 23,898). Second, the number of features used is much bigger than the previous study (141 compared with 4-11). Moreover, many features from our datasets are numeric while the previous study had none. These differences will also reflect on our exceptionality measures values. For example, the smallest range for typicality in the previous study was between 0.43 and 10.57 while for our tasks it is between 0.9 and 1.1. To explore these differences we varied the feature set used. Instead of using all the available features (this feature set is called All), we restricted the feature set by using only non-numeric features (Nonnum - 22 features). The typicality range increased when using this feature set (0.771.45), but the number of features used was still larger than the previous study. For this reason, we next devised two set of features with only 9 (First9) and 15 features (First15). The features were selected based on their information gain (see section 2.1).</Paragraph>
    <Paragraph position="3"> Before proceeding with our results, there is one more thing we want to mention. At least half of our instances have one or more missing values and while the Ripper implementation offered a way to handle them, there was no default handling of missing values in the IB1-IG implementation. Thus, we decided to replace missing values ourselves before presenting the datasets to our learners. In particular there are two types of missing values: genuine missing values (no value was provided; we will refer to them as missing values) and undefined values. Undefined values come from features that are not defined in that user turn (for example, in the first user turn, most of the dialog history features were undefined because there was no previous user turn).</Paragraph>
    <Paragraph position="4"> For symbolic features, we replaced missing and undefined values with a given string for missing values and another one for undefined values. For numeric features, the problem was more complicated since the distance metric uses the difference between two numeric values and thus, the values used to fix the problem can influence the distance between instances. We experimented with different replacement values: to the left and right of the interval boundaries for that features, both replacement values on one side of the interval or very far from the interval boundaries. All experiments with the values provided comparable results. For our experiments, missing values were replaced with a value to the right of the interval for that feature and undefined values were replaced with a value to the left of that interval. null</Paragraph>
  </Section>
class="xml-element"></Paper>