<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3028">
  <Title>Co-training for Predicting Emotions with Spoken Dialogue Data</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Machine Learning Techniques
</SectionTitle>
    <Paragraph position="0"> In this section, we will briefly describe the machine learning techniques used by our system.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Co-training
</SectionTitle>
      <Paragraph position="0"> To address the challenge of training classifiers when only a small set of labeled examples is available, Blum and Mitchell (1998) proposed Co-training as a way to bootstrap classifiers from a large set of unlabeled data. Under this framework, two (or more) learners are trained iteratively in tandem. In each iteration, the learners classify more unlabeled data to increase the training data for each other. In theory, the learners must have distinct views of the data (i.e., their features are conditionally independent given the label example), but some studies suggest that Co-training can still be helpful even when the independence assumption does not hold (Goldman, 2000).</Paragraph>
      <Paragraph position="1"> To apply Co-training to our task, we develop two high-precision learners: Emotional and Non-Emotional. The learners use different features because each is maximizing the precision of its label (possibly with low recall). While we have not proved these two learners are conditionally independent, this division of expertise ensures that the learners are different. The algorithm for our Co-training system is shown in Figure 1. Each learner selects the examples whose predicted labeled corresponds to its expertise class with the highest confidence. The maximum number of iterations and the number of examples added per iteration are parameters of the system.</Paragraph>
      <Paragraph position="2"> While iteration &lt; MAXITERATION</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Wrapper Approach with Forward
Selection
</SectionTitle>
      <Paragraph position="0"> As described in Section 2, 449 features have been currently extracted from each utterance of the ITSPOKE corpus (where an utterance is a student's turn in a dialogue). Unfortunately, high dimensionality, i.e. large amount of input features, may lead to a large variance of estimates, noise, overfitting, and in general, higher complexity and inefficiencies in the learners. Different approaches have been proposed to address this problem. In this work, we have used the Wrapper Approach with Forward Selection.</Paragraph>
      <Paragraph position="1"> The Wrapper Approach, introduced by John et al. (1994) and refined later by Kohavi and John (1997), is a method that searches for a good subset of relevant features using an induction algorithm as part of the evaluation function. We can apply different search algorithms to find this set of features.</Paragraph>
      <Paragraph position="2"> Forward Selection is a greedy search algorithm that begins with an empty set of features, and greedily adds features to the set. Figure 2 shows our algorithm implemented for the forward wrapper approach.</Paragraph>
      <Paragraph position="3">  the ones whose parameters we have changed in order to test and improve the performance.</Paragraph>
      <Paragraph position="4"> We can use different criteria to select the feature to add, depending on the object of optimization. Earlier, we have explained the basis of the Co-training system. When developing an expert learner in one class, we want it to be correct most of the time when it guesses that class. That is, we want the classifier to have high precision (possibly at the cost of lower overall accuracy). Therefore, we are interested in finding the best set of features for precision in each class. In this case, we are focusing on Emotional and Non-Emotional classifiers.</Paragraph>
      <Paragraph position="5"> Figure 3 shows the formulas used for the optimization criterion on each class. For the Emotional Class, our optimization criterion was to maximize the PPV (Positive Predictive Value), and for the Non-Emotional Class our optimization criterion was to maximize the NPV (Negative</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> For the following experiments, we fixed the size of our training set to 175 examples (50%), and the size of our test set to 140 examples (40%). The remaining 10% has been saved for later experiments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Selecting the features
</SectionTitle>
      <Paragraph position="0"> The first task was to reduce the dimensionality and find the best set of features for maximizing the PPV for Emotional class and NPV for Non-Emotional class. We applied the Wrapper Approach with Forward Selection as described in section 3.2, using Naive Bayes to evaluate each subset of features.</Paragraph>
      <Paragraph position="1"> We have used 175 examples for the training set (used to select the best features) and 140 for the test set (used to measure the performance). The training set is randomly divided into two sets in each iteration of the algorithm: One for training and the other for development (65% and 35% respectively). We train the learners with the training set and we evaluate the performance to pick the best feature with the development set.</Paragraph>
      <Paragraph position="2">  and 3 best features for PPV using Naive Bayes (used for Feature Selection) and AdaBoost-j48 Decision Trees (used for Co-training) The selected features that gave the best PPV for Emotional Class are 2 lexical features and one acoustic-prosodic feature. By using them we increased the precision of Naive Bayes from 74.5% (using all 449 features) to 92.9%, and of  features) to 100% just by using one lexical feature, and the NPV of AdaBoost-j48 Decision Trees from 90.7% to 100%. This precision remained the same with the set of 3 best features, one lexical and two non-acoustic prosodic features (see Table 2).</Paragraph>
      <Paragraph position="3"> These two set of features for each learner are disjoint.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Co-training experiments
</SectionTitle>
      <Paragraph position="0"> The two learners are initialized with only 6 labeled examples in the training set. The Co-training system added examples from the 140 &amp;quot;pseudo-labeled&amp;quot; examples1 in the Prediction Set. The size of the training set increased in each iteration by adding the 2 best examples (those with the highest confidence scores) labeled by the two learners. The Emotional learner and the Non-Emotional learner were set to work with the set of features selected by the wrapper approach to optimize the precision (PPV and NPV) as described in section 4.1.</Paragraph>
      <Paragraph position="1"> We have applied Weka's (Witten and Frank, 2000) AdaBoost's version of j48 decision trees (as used in Forbes-Riley and Litman, 2004) to the 140 unseen examples of the test set for generating the learning curve shown in figure 4.</Paragraph>
      <Paragraph position="2"> Figure 4 illustrates the learning curve of the accuracy on the test set, taking the union of the set of features selected to label the examples. We used the 3 best features for PPV for the Emotional Learner and the best feature for NPV for the Non-Emotional Learner (see Section 4.1). The x-axis shows the number of training examples added; the y-axis shows the accuracy of the classifier on test instances. We compare the learning curve from Co-training with a baseline of majority class and an upper-bound, in which the classifiers are trained on human-annotated data. Post-hoc analyses reveal that four incorrectly labeled examples were added to the training set: example numbers 21, 22, 45, and 51 (see the x-axis). Shortly after the inclusion of example 21, the Co-training learning curve diverges from the upper-bound. All of them correspond to Non-Emotional examples that were labeled as Emotional by the Emotional learner with the highest confidence.</Paragraph>
      <Paragraph position="3"> The Co-training system stopped after adding 58 examples to the initial 6 in the training set because the remaining data cannot be labeled by the learners with high precision. However, as we can see, the training set generated by the Co-training technique can perform almost as well as the upperbound, even if incorrectly labeled examples are included in the training set.</Paragraph>
      <Paragraph position="4"> 1 This means that although the example has been labeled, the label remains unseen to the learners.</Paragraph>
    </Section>
  </Section>
</Paper>