<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0417"> <Title>Training a Naive Bayes Classifier via the EM Algorithm with a Class Distribution Constraint</Title> <Section position="5" start_page="75" end_page="75" type="relat"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Nigam et al. (2000) reported that the accuracy of text classification can be improved with a large pool of unlabeled documents using a naive Bayes classifier and the EM algorithm. They presented two extensions to the basic EM algorithm. One is a weighting factor that modulates the contribution of the unlabeled data. The other is the use of multiple mixture components per class. With these extensions, they reported that the use of unlabeled data reduces classification error by up to 30%.</Paragraph> <Paragraph position="1"> Pedersen et al. (1998) employed the EM algorithm and Gibbs sampling for word sense disambiguation with a naive Bayes classifier. Although Gibbs sampling yielded a small improvement over the EM algorithm, the results for verbs and adjectives did not reach baseline performance on average. The amount of unlabeled data used in their experiments was relatively small (from several hundred to a few thousand).</Paragraph> <Paragraph position="2"> Yarowsky (1995) presented an approach that significantly reduces the amount of labeled data needed for word sense disambiguation, achieving accuracies of more than 90% for two-sense polysemous words.</Paragraph> <Paragraph position="3"> This success was likely due to the use of the &quot;one sense per discourse&quot; characteristic of polysemous words.</Paragraph> <Paragraph position="4"> Yarowsky's approach can be viewed in the context of co-training (Blum and Mitchell, 1998), in which the features can be split into two independent sets. For word sense disambiguation, the two sets correspond to the local contexts of the target word and the &quot;one sense per discourse&quot; characteristic. 
Confusion sets, however, do not have the latter characteristic.</Paragraph> <Paragraph position="5"> The effect of a huge amount of unlabeled data on confusion set disambiguation is discussed in (Banko and Brill, 2001). Banko and Brill conducted experiments on committee-based unsupervised learning for two confusion sets. Their results showed a slight improvement from using a certain amount of unlabeled data; however, test set accuracy began to decline as additional data were harvested.</Paragraph> <Paragraph position="6"> As for the performance of confusion set disambiguation, Golding (1999) achieved over 96% accuracy with a winnow-based approach. Although our results are not directly comparable with those results, since the data sets are different, ours do not reach state-of-the-art performance. Because the performance of a naive Bayes classifier is significantly affected by the smoothing method used for parameter estimation, there is a chance to improve our performance by using a more sophisticated smoothing technique.</Paragraph> </Section></Paper>