<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2014">
  <Title>Agreement/Disagreement Classification: Exploiting Unlabeled Data using Contrast Classifiers</Title>
  <Section position="3" start_page="53" end_page="54" type="metho">
    <SectionTitle>
2 Contrast Classifier
</SectionTitle>
    <Paragraph position="0"> The contrast classi er approach was developed by Peng et al and successfully applied to the problem of identifying protein disorder in a protein structure database (outlier detection) and to nding articles about them (single-class detection) (Peng et al., 2003). A contrast classi er discriminates between the labeled and unlabeled data, and can be used to approximate the posterior class probability of a given data instance as follows. Taking a Bayesian approach, a contrast classi er for the j-th class is de ned as:</Paragraph>
    <Paragraph position="2"> where hj(x) is the likelihood of x generated by class j in the labeled data, g(x) is the distribution of unlabeled data, and rj is the relative proportion of unlabeled data compared to the labeled data for class j. This discriminates the class j in the labeled data from the unlabeled data. Here, we constrain rj = 0.5 for all j, using resampling to address class distribution skew, as described below. Rewriting equation 1, hj(x) can be expressed in terms of</Paragraph>
    <Paragraph position="4"> Then, the posterior probability of an input x for class j, p(j|x), can be approximated as:</Paragraph>
    <Paragraph position="6"> where qj is the prior class probability which can be approximated by the fraction of instances in the class j among the labeled data. By substituting eq. 2 into eq. 3, we obtain:</Paragraph>
    <Paragraph position="8"> Notice that we do not have to explicitly estimate g(x). Eq. 4 can be used to construct the MAP classi er:</Paragraph>
    <Paragraph position="10"> To approximate the class-speci c contrast classi er, ccj(x), we can choose any classi er that outputs a probability, such as a neural net, logistic regression, or an SVM with outputs calibrated to produce a reasonable probability.</Paragraph>
    <Paragraph position="11"> Typically a lot more unlabeled data are available than labeled data, which causes class imbalance when training a contrast classi er. In a supervised setting, a resampling technique is often used to reduce the effect of imbalanced data. Here, we use a committee of classi ers, each of which is trained on a balanced training set sampled from each class. To compute the nal output of the classi er, we implemented four different strategies.</Paragraph>
    <Paragraph position="12"> * For each class, average the outputs of the contrast classi ers in the committee, and use the average as ccj(x) in eq. 5.</Paragraph>
    <Paragraph position="13"> * Average only the outputs of contrast classi ers smaller than their corresponding threshold, and the fraction of the included classi ers is used as the strength of the probability output for the class.</Paragraph>
    <Paragraph position="14"> * Use a meta classi er whose inputs are the outputs of the contrast classi ers in the committee for a class, and whose output is modeled by training it from a separate, randomly sampled data set. The output of the meta classi er is used as ccj(x).</Paragraph>
    <Paragraph position="15"> * Classify an input as the majority class only when the outputs of the meta classi ers for the other classes are all larger than their corresponding thresholds.</Paragraph>
    <Paragraph position="16"> Another bene t of the contrast classi er approach is that it is less affected by imbalanced data. When training the contrast classi er for each class, it uses the instances in only one class in the labeled data, and implicitly models the data distribution within that class independently of other classes. That is, given a data instance, the distribution within a class, hj(x), determines the output of the contrast classier for the class (eq. 1), which in turn determines the posterior probability (eq. 4). Thus it will not be as highly biased toward the majority class as a classi er trained with a collection of data from imbalanced classes. Our experimental results presented in the next section con rm this bene t.</Paragraph>
  </Section>
  <Section position="4" start_page="54" end_page="55" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We conducted experiments to answer the following questions. First, is the contrast classi er approach applicable to language processing problems, which often involve large amounts of unlabeled data? Second, does it outperform other semi-supervised learning methods on a skewed data set?</Paragraph>
    <Section position="1" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
3.1 Features and data sets
</SectionTitle>
      <Paragraph position="0"> The data set used consists of seven transcripts out of 75 meeting transcripts included in the ICSI meeting corpus (Janin et al., 2003). For the study, 7 meetings were segmented into spurts, de ned as a chunk of speech of a speaker containing no longer than 0.5 second pause. The rst 450 spurts in each of four meetings were hand-labeled as either positive (agreement, 9%), negative (disagreement, 6%), backchannel (23%) or other (62%).</Paragraph>
      <Paragraph position="1"> To approximate ccj(x) we use a Support Vector Machine (SVM) that outputs the probability of the positive class given an instance (Lin et al., 2003).</Paragraph>
      <Paragraph position="2"> We use only word-based features similar to those used in (Hillard et al., 2003), which include the number of words in a spurt, the number of keywords associated with the positive and negative classes, and classi cation based on keywords. We also obtain word and class-based bigram language models for each class from the training data, and compute such language model features as the perplexity of a spurt, probability of the spurt, and the probability of the rst two words in a spurt, using each language model. We also include the most likely class by the language models as features.</Paragraph>
    </Section>
    <Section position="2" start_page="54" end_page="55" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> First, we performed the same experiment as in (Hillard et al., 2003) and (Galley et al., 2004), using the contrast classi er (CC) method . Among the four meetings, the data from one meeting was set aside for testing. Table 1 compares the 3-class accuracy of the contrast classi er with previous results, merging positive and backchannel class together into one class as in the other work. When only lexical features are used (the rst three entries), the SVM-based contrast classi er using meta-classi ers gives the best performance, outperforming the decision tree in (Hillard et al., 2003) and the maximum en- null tropy model in (Galley et al., 2004). It also outperformed the SVM trained using the labeled data only. The contrast classi er is also competitive with the best case result in (Galley et al., 2004) (last entry), which adds speaker change, segment duration, and adjacency pair sequence dependency features using a dynamic Bayesian network.</Paragraph>
      <Paragraph position="1"> In table 2, we report the performance of the four classi cation strategies described in section 2. For comparison, we include a result from Hillard, obtained by training a decision tree on the labels produced by their unsupervised clustering technique.</Paragraph>
      <Paragraph position="2"> Meta classi ers usually obtained higher accuracy, but averaging often achieved higher recovery of agreement/disagreement (A/D) spurts. The use of thresholds increases A/D recovery, with a decrease in accuracy. We obtained the best accuracy using both meta classi ers and thresholds together here, but we more often obtained higher accuracy using meta classi ers only.</Paragraph>
      <Paragraph position="3"> Next, we performed experiments on the entire ICSI meeting data. Only 1,318 spurts were labeled, and 62,944 spurts were unlabeled. Again, one of the labeled meeting transcripts was set aside as a test set. We compared the SVM trained only on labeled data  SVM with additional data labeled with con dence by the previously trained SVM. For the co-training, each of an SVM and a multilayer backpropagation network was trained on the labeled data and the unlabeled data classi ed with high con dence (99%) by one classi er were used as labeled data for further training the other classi er. We used two different classi ers, instead of two independent view of the input features as in (Goldman and Zhou, 2000).</Paragraph>
      <Paragraph position="4"> Table 3 shows that the SVM obtained high accuracy, but the F measure and the recall of the smallest class, negative, is quite low. The bias toward the majority class propagates through each iteration in selftraining, so that only 5% of the negative tokens were detected after 30 iterations. We observed the same pattern in co-training; its accuracy peaked after two iterations (85.1%) and then performance degraded drastically (68% after ve iterations) due in part to an increase in mislabeled data in the training set (as previously observed in (Pierce and Cardie, 2001)) and in part because the data skew is not controlled for. The contrast classi er performs better than the others in both F measure and negative class recall, retaining reasonably good accuracy.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>