<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0417">
  <Title>Training a Naive Bayes Classifier via the EM Algorithm with a Class Distribution Constraint</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Naive Bayes Classifier
</SectionTitle>
    <Paragraph position="0"> The naive Bayes classifier is a simple but effective classifier which has been used in numerous applications of information processing such as image recognition, natural language processing, information retrieval, etc. (Escudero et al., 2000; Lewis, 1998; Nigam and Ghani, 2000; Pedersen, 2000).</Paragraph>
    <Paragraph position="1"> In this section, we briefly review the naive Bayes classifier and the EM algorithm that is used for making use of unlabeled data.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Naive Bayes Model
</SectionTitle>
      <Paragraph position="0"> Let vectorx be a vector we want to classify, and c k be a possible class. What we want to know is the probability that the vector vectorx belongs to the class c</Paragraph>
      <Paragraph position="2"> ) can be estimated from training data. However, direct estimation of P(c k |vectorx) is impossible in most cases because of the sparseness of training data.</Paragraph>
      <Paragraph position="3"> By assuming the conditional independence of the elements of a vector, P(vectorx|c</Paragraph>
      <Paragraph position="5"> With this equation, we can calculate P(c k |vectorx) and classify vectorx into the class with the highest P(c k |vectorx).</Paragraph>
      <Paragraph position="6"> Note that the naive Bayes classifier assumes the conditional independence of features. This assumption however does not hold in most cases. For example, word occurrence is a commonly used feature for text classification. However, obvious strong dependencies exist among word occurrences. Despite this apparent violation of the assumption, the naive Bayes classifier exhibits good performance for various natural language processing tasks. There are some implementation variants of the naive Bayes classifier depending on their event models (Mc-Callum and Nigam, 1998). In this paper, we adopt the multi-variate Bernoulli event model. Smoothing was done by replacing zero-probability with a very small constant (1.0x10 [?]4 ).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 EM Algorithm
</SectionTitle>
      <Paragraph position="0"> The Expectation Maximization (EM) algorithm (Dempster et al., 1977) is a general framework for estimating the parameters of a probability model when the data has missing values. This algorithm can be applied to minimally supervised learning, in which the missing values correspond to missing labels of the examples.</Paragraph>
      <Paragraph position="1"> The EM algorithm consists of the E-step in which the expected values of the missing sufficient statistics given the observed data and the current parameter estimates are computed, and the M-step in which the expected values of the sufficient statistics computed in the E-step are used to compute complete data maximum likelihood estimates of the parameters (Dempster et al., 1977).</Paragraph>
      <Paragraph position="2"> In our implementation of the EM algorithm with the naive Bayes classifier, the learning process using unla- null beled data proceeds as follows: 1. Train the classifier using only labeled data.</Paragraph>
      <Paragraph position="3"> 2. Classify unlabeled examples, assigning probabilistic labels to them.</Paragraph>
      <Paragraph position="4"> 3. Update the parameters of the model. Each probabilistically labeled example is counted as its probability instead of one.</Paragraph>
      <Paragraph position="5"> 4. Go back to (2) until convergence.</Paragraph>
      <Paragraph position="6"> 3 Class Distribution Constraint</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Motivation
</SectionTitle>
      <Paragraph position="0"> As described in the previous section, the naive Bayes classifier can be easily extended to exploit unlabeled data by using the EM algorithm. However, the use of unlabeled data for actual tasks exhibits mixed results. The performance is improved for some cases, but not in all cases. In our preliminary experiments, using unlabeled data by means of the EM algorithm often caused significant deterioration of classification performance.</Paragraph>
      <Paragraph position="1"> To investigate the cause of this, we observed the change of class distribution of unlabeled data occuring in the process of the EM algorithm. What we found is that sometimes the class distribution of unlabeled data greatly diverges from that of the labeled data. For example, when the proportion of class A examples in labeled data was about 0.9, the EM algorithm would sometimes converge into states where the proportion of class A is about 0.7.</Paragraph>
      <Paragraph position="2"> This divergence of class distribution clearly indicated the EM algorithm converged into an undesirable state.</Paragraph>
      <Paragraph position="3"> One of the possible remedies for this phenomenon is that of forcing class distribution of unlabeled data not to diverge from the class distribution estimated from labeled data. In this work, we introduce a class distribution constraint (CDC) into the training process of the EM algorithm. This constraint keeps the class distribution of unlabeled data consistent with that of labeled data.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Calibrating Probabilistic Labels
</SectionTitle>
      <Paragraph position="0"> We implement class distribution constraints by calibrating probabilistic labels assigned to unlabeled data in the process of the EM algorithm. In this work, we consider only binary classification: classes A and B.</Paragraph>
      <Paragraph position="1"> Let p i be the probabilistic label of the ith example representing the probability that this example belongs to class A.</Paragraph>
      <Paragraph position="2"> Let th be the proportion of class A examples in the labeled data L. If the proportion of the class A examples (the proportion of the examples whose p i is greater than 0.5) in unlabeled data U is different from th, we consider that the values of the probabilistic labels should be calibrated. null The basic idea of the calibration is to shift all the probability values of unlabeled data to the extent that the class distribution of unlabeled data becomes identical to that of labeled data. In order for the shifting of the probability values not to cause the values to go outside of the range from 0 to 1, we transform the probability values by an inverse sigmoid function in advance. After the shifting, the values are returned to probability values by a sigmoid function.</Paragraph>
      <Paragraph position="3"> The whole calibration process is given below:  1. Transform the probabilistic labels p  in descending order. Then, pick up the  value q border that is located at the position of proportion th in these n values.</Paragraph>
      <Paragraph position="4"> 3. Since q border is located at the border between the  examples of label A and those of label B, the value should be close to zero (= probability is 0.5). Thus</Paragraph>
      <Paragraph position="6"> by a sigmoid function back into probability values.</Paragraph>
      <Paragraph position="7"> This calibration process is conducted between the E-step and the M-step in the EM algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="75" type="metho">
    <SectionTitle>
4 Confusion Set Disambiguation
</SectionTitle>
    <Paragraph position="0"> We applied the naive Bayes classifier with the EM algorithm to confusion set disambiguation. Confusion set disambiguation is defined as the problem of choosing the correct word from a set of words that are commonly confused. For example, quite may easily be mistyped as quiet. An automatic proofreading system would need to judge which is the correct use given the context surrounding the target. Example confusion sets include: {principle, principal}, {then, than}, and {weather, whether}.</Paragraph>
    <Paragraph position="1"> Until now, many methods have been proposed for this problem including winnow-based algorithms (Golding and Roth, 1999), differential grammars (Powers, 1998), transformation based learning (Mangu and Brill, 1997), decision lists (Yarowsky, 1994).</Paragraph>
    <Paragraph position="2"> Confusion set disambiguation has very similar characteristics to a word sense disambiguation problem in which the system has to identify the meaning of a polysemous word given the surrounding context. The merit of using confusion set disambiguation as a test-bed for a learning algorithm is that since one does not need to annotate the examples to make labeled data, one can conduct experiments using an arbitrary amount of labeled data.</Paragraph>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Features
</SectionTitle>
      <Paragraph position="0"> As the input of the classifier, the context of the target must be represented in the form of a vector. We use a binary feature vector which contains only the values of 0 or 1 for each element.</Paragraph>
      <Paragraph position="1"> In this work, we use the local context surrounding the target as the feature of an example. The features of a target are the two preceding words and the two following words. For example, if the disambiguation target is quiet and the system is given the following sentence &amp;quot;...between busy and quiet periods and it...&amp;quot; the contexts of this example are represented as follows:  In the input vector, only the elements corresponding to these features are set to 1, while all the other elements are set to 0.</Paragraph>
      <Paragraph position="2">  that is currently one of the largest corpora available. The corpus contains roughly one hundred million words collected from various sources.</Paragraph>
      <Paragraph position="3"> The confusion sets used in our experiments are the same as in Golding's experiment (1999). Since our algorithm requires the classification to be binary, we decomposed three-class confusion sets into pairwise binary classifications. Table 1 shows the resulting confusion sets used in the following experiments. The baseline performances, achieved by simply selecting the majority class, are shown in the second column. The number of unlabeled data are shown in the rightmost column.</Paragraph>
      <Paragraph position="4"> The 1,000 test sets were randomly selected from the corpus for each confusion set. They do not overlap the labeled data or the unlabeled data used in the learning process.</Paragraph>
      <Paragraph position="5">  maybe, may be 91.1 77.6 92.9 passed, past 77.9 70.2 82.0 peace, piece 78.4 81.5 82.1 principal, principle 72.8 88.7 79.4 quiet, quite 85.3 75.9 83.5 raise, rise 83.7 86.1 81.0 sight, site 67.7 68.7 67.9 site, cite 96.2 93.3 92.8 than, then 74.7 84.0 85.3 their, there 88.4 91.4 90.2 there, they're 96.4 96.4 89.1 they're, their 96.9 96.9 96.9 weather, whether 90.6 92.3 93.7 your, you're 87.8 81.8 90.3 AVERAGE 83.8 82.9 85.4 The results are shown in Table 2 through Table 5. These four tables correspond to the cases in which the number of labeled examples is 32, 64, 128 and 256 as indicated by the table captions. The first column shows the confusion sets. The second column shows the classification performance of the naive Bayes classifier with which only labeled data was used for training. The third column shows the performance of the naive Bayes classifier with which unlabeled data was used via the basic EM algorithm. The rightmost column shows the performance of the EM algorithm that was extended with our proposed calibration process.</Paragraph>
      <Paragraph position="6"> Notice that the effect of unlabeled data were very different for each confusion set. As shown in Table 2, the precision was significantly improved for some confusion sets including {I, me}, {accept, except} and {affect, effect} . However, disastrous performance deterioration can be observed, especially that of the basic EM algorithm, in some confusion sets including {among, between}, {country, county}, and {site, cite}.</Paragraph>
      <Paragraph position="7"> On average, precision was degraded by the use of un- null labeled data via the basic EM algorithm (from 83.3% to 82.9%). On the other hand, the EM algorithm with the class distribution constraint improved average classification performance (from 83.3% to 85.4%). This improved precision nearly reached the performance achieved by twice the size of labeled data without unlabeled data (see the average precision of NB in Table 3). This performance gain indicates that the use of unlabeled data effectively doubles the labeled training data.</Paragraph>
      <Paragraph position="8"> In Table 3, the tendency of performance improvement (or degradation) in the use of unlabeled data is almost the same as in Table 2. The basic EM algorithm degraded the performance on average, while our method improved average performance (from 85.7% to 87.7%). This performance gain effectively doubled the size of labeled data. The results with 128 labeled examples are shown in Table 4. Although the use of unlabeled examples by means of our proposed method still improved average performance (from 87.6% to 88.6%), the gain is smaller than that for a smaller amount of labeled data.</Paragraph>
      <Paragraph position="9"> With 256 labeled examples (Table 5), the average per- null accept, except 85.7 90.7 89.4 affect, effect 91.9 93.1 93.3 among, between 80.0 76.3 80.1 amount, number 78.2 68.9 69.3 begin, being 94.4 88.1 95.0 cite, sight 96.9 96.9 98.1 country, county 81.3 75.1 75.7 fewer, less 89.9 74.9 89.4 its, it's 88.6 93.2 95.2 lead, led 80.5 82.5 82.2 maybe, may be 94.5 80.9 94.4 passed, past 81.8 74.1 85.5 peace, piece 84.1 81.3 82.5 principal, principle 79.8 89.8 89.5 quiet, quite 86.5 82.7 90.1 raise, rise 85.2 86.4 87.7 sight, site 75.6 70.3 70.5 site, cite 96.1 95.8 97.0 than, then 81.7 84.2 84.5 their, there 91.8 91.5 91.2 there, they're 95.9 83.4 91.3 they're, their 96.9 96.9 96.7 weather, whether 92.0 92.6 95.1 your, you're 88.9 84.1 94.5 AVERAGE 87.6 85.2 88.6 formance gain was negligible (from 89.2% to 89.3%). Figure 1 summarizes the average precisions for different number of labeled examples. Average peformance was improved by the use of unlabeled data with our proposed method when the amount of labeled data was small (from 32 to 256) as shown in Table 2 through Table 5. However, when the number of labeled examples was large (more than 512), the use of unlabeled data degraded average performance.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="75" type="sub_section">
      <SectionTitle>
5.1 Effect of the amount of unlabeled data
</SectionTitle>
      <Paragraph position="0"> When the use of unlabeled data improves classification performance, the question of how much unlabeled data are needed becomes very important. Although unlabeled data are generally much more obtainable than labeled data, acquiring more than several-thousand unlabeled examples is not always an easy task. As for confusion set disambiguation, Table 1 indicates that it is sometimes impossible to collect tens of thousands examples even in a very large corpus.</Paragraph>
      <Paragraph position="1"> In order to investigate the effect of the amount of un- null accept, except 89.7 90.3 91.2 affect, effect 93.4 93.5 93.9 among, between 79.6 75.1 80.4 amount, number 81.4 68.9 69.2 begin, being 94.6 89.9 96.6 cite, sight 97.6 97.9 98.4 country, county 84.2 76.5 77.5 fewer, less 90.8 83.0 89.2 its, it's 90.2 93.3 94.5 lead, led 82.9 79.8 82.6 maybe, may be 96.0 87.1 94.7 passed, past 83.5 74.6 86.3 peace, piece 84.6 81.4 85.7 principal, principle 83.4 90.5 90.5 quiet, quite 88.6 86.8 91.2 raise, rise 88.0 87.1 88.4 sight, site 79.2 71.7 73.2 site, cite 97.3 97.6 97.4 than, then 82.3 85.5 85.9 their, there 93.6 92.1 92.0 there, they're 96.5 83.0 91.1 they're, their 96.8 90.8 97.3 weather, whether 93.8 91.9 94.7 your, you're 89.7 83.8 94.6 AVERAGE 89.2 85.9 89.3 labeled data, we conducted experiments by varying the amount of unlabeled data for some confusion sets that exhibited significant performance gain by using unlabeled data.</Paragraph>
      <Paragraph position="2"> Figure 2 shows the relationship between the classification performance and the amount of unlabeled data for three confusion sets: {I, me}, {principal, principle}, and {passed, past}. The number of labeled examples in all cases was 64.</Paragraph>
      <Paragraph position="3"> Note that performance continued to improve even when the number of unlabeled data reached more than ten thousands. This suggests that we can further improve the performance for some confusion sets by using a very large corpus containing more than one hundred million words.</Paragraph>
      <Paragraph position="4"> Figure 2 also indicates that the use of unlabeled data was not effective when the amount of unlabeled data was smaller than one thousand. It is often the case with minor words that the number of occurrences does not reach one thousand even in a one-hundred-million word corpus. Thus, constructing a very very large corpus (containing</Paragraph>
    </Section>
    <Section position="3" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
Amount of Unlabeled Data
</SectionTitle>
      <Paragraph position="0"> more than billions of words) appears to be beneficial for infrequent words.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>