<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0908">
<Title>Text Classification by Bootstrapping with Keywords, EM and Shrinkage</Title>
<Section position="2" start_page="0" end_page="52" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> When provided with enough labeled training examples, a variety of text classification algorithms can learn reasonably accurate classifiers (Lewis, 1998; Joachims, 1998; Yang, 1999; Cohen and Singer, 1996). However, when applied to complex domains with many classes, these algorithms often require extremely large training sets to reach useful classification accuracy. Creating such sets of labeled data is tedious and expensive, since the labels must typically be assigned by a person. This leads us to consider learning algorithms that do not require such large amounts of labeled data.</Paragraph>
<Paragraph position="1"> While labeled data is difficult to obtain, unlabeled data is readily available and plentiful. Castelli and Cover (1996) show in a theoretical framework that unlabeled data can indeed be used to improve classification, although it is exponentially less valuable than labeled data. Fortunately, unlabeled data can often be obtained by completely automated methods. Consider the problem of classifying news articles: a short Perl script and a night of automated Internet downloads can fill a hard disk with unlabeled examples of news articles. In contrast, it might take several days of human effort and tedium to label even one thousand of these.</Paragraph>
<Paragraph position="2"> Previous work (Nigam et al., 1999) has shown that with just a small number of labeled documents, text classification error can be reduced by up to 30% when the labeled documents are augmented with a large collection of unlabeled documents.</Paragraph>
<Paragraph position="3"> This paper considers the task of learning text classifiers with no labeled documents at all. Knowledge about the classes of interest is provided in the form of a few keywords per class and a class hierarchy. Keywords are typically generated more quickly and easily than even a small number of labeled documents, and many classification problems naturally come with hierarchically-organized classes. Our algorithm proceeds by using the keywords to generate preliminary labels for some documents by term-matching. These labels, the hierarchy, and all the unlabeled documents then become the input to a bootstrapping algorithm that produces a naive Bayes classifier.</Paragraph>
<Paragraph position="4"> The bootstrapping algorithm used in this paper combines hierarchical shrinkage and Expectation-Maximization (EM) with unlabeled data. EM is an iterative algorithm for maximum likelihood estimation in parametric estimation problems with missing data. In our scenario, the class labels of the documents are treated as missing data. Here, EM works by first training a classifier with only the documents preliminarily labeled by the keywords, and then using that classifier to re-assign probabilistically-weighted class labels to all the documents by calculating the expectation of the missing class labels. It then trains a new classifier using all the documents and iterates.</Paragraph>
[Figure: A subset of the computer science hierarchy, with Computer Science at the root and children such as Software Engineering, Programming, OS, Artificial Intelligence, Hardware & Architecture, HCI, and Information Retrieval; below these appear classes such as Semantics, Garbage Collection, Compiler Design, NLP, Machine Learning, Planning, Knowledge Representation, Interface Design, Cooperative, and Multimedia. Each node shows its title and its most probable words (e.g., Computer Science: computer, university, science, system, paper), as calculated by naive Bayes and shrinkage with vertical word redistribution (Hofmann and Puzicha, 1998). Words among the initial keywords for that class are indicated in plain font; others are in italics.]
<Paragraph position="5"> We further improve classification by incorporating shrinkage, a statistical technique for improving parameter estimation in the face of sparse data. When classes are provided in a hierarchical relationship, shrinkage estimates new parameters as a weighted average of the specific (but unreliable) local class estimates and the more general (but more reliable) estimates of the class's ancestors in the hierarchy. The optimal weights in the average are calculated by an EM process that runs simultaneously with the EM that is re-estimating the class labels.</Paragraph>
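To make the procedure above concrete, the following is a minimal Python sketch of the bootstrapping loop: keyword matching supplies preliminary labels, naive Bayes is trained on them, and EM repeatedly re-assigns probabilistically-weighted labels to every document and retrains. It is an illustration under simplifying assumptions, not the authors' implementation: shrinkage is reduced to fixed mixture weights over a class's own counts, its parent's pooled counts, and a uniform distribution (the paper instead learns one weight per ancestor with a second EM that runs alongside the label-estimation EM), and all function and variable names (keyword_label, estimate_parameters, bootstrap, parent) are hypothetical.

# Minimal sketch, not the authors' implementation; see the caveats above.
import math
from collections import Counter, defaultdict


def keyword_label(tokens, keywords):
    """Preliminary label by term-matching; returns None if no keyword occurs."""
    hits = {c: sum(tokens.count(w) for w in ws) for c, ws in keywords.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def estimate_parameters(docs, label_dists, parent, vocab, weights=(0.6, 0.3, 0.1)):
    """Naive Bayes parameters from probabilistically-weighted labels, with a crude
    shrinkage mix of the class's own counts, its parent's pooled counts, and a
    uniform distribution (fixed weights instead of the paper's learned ones)."""
    priors = Counter()
    counts = defaultdict(Counter)
    for tokens, dist in zip(docs, label_dists):
        for c, w in dist.items():
            priors[c] += w
            for t in tokens:
                counts[c][t] += w

    pooled = defaultdict(Counter)      # parent node -> summed counts of its children
    for c, cnt in counts.items():
        pooled[parent.get(c)].update(cnt)

    w_loc, w_par, w_uni = weights
    v = len(vocab)
    word_probs = {}
    for c, cnt in counts.items():
        n_loc = sum(cnt.values()) or 1.0
        par_cnt = pooled[parent.get(c)]
        n_par = sum(par_cnt.values()) or 1.0
        word_probs[c] = {t: w_loc * cnt[t] / n_loc
                            + w_par * par_cnt[t] / n_par
                            + w_uni / v
                         for t in vocab}
    total = sum(priors.values())
    return {c: priors[c] / total for c in word_probs}, word_probs


def posterior(tokens, priors, word_probs):
    """P(class | document) under naive Bayes, normalized in log space."""
    log_p = {c: math.log(priors[c]) + sum(math.log(word_probs[c][t]) for t in tokens)
             for c in priors}
    m = max(log_p.values())
    unnorm = {c: math.exp(s - m) for c, s in log_p.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


def bootstrap(docs, keywords, parent, iterations=5):
    """docs: token lists; keywords: class -> keyword list; parent: class -> parent."""
    vocab = {t for tokens in docs for t in tokens}
    # Preliminary labels from keyword matching; unmatched documents start unlabeled.
    label_dists = []
    for tokens in docs:
        c = keyword_label(tokens, keywords)
        label_dists.append({c: 1.0} if c is not None else {})
    for _ in range(iterations):
        # M-step: train naive Bayes from the current weighted labels.
        priors, word_probs = estimate_parameters(docs, label_dists, parent, vocab)
        # E-step: probabilistically re-label every document, labeled or not.
        label_dists = [posterior(tokens, priors, word_probs) for tokens in docs]
    return priors, word_probs

A call such as bootstrap(docs, keywords, parent) would take tokenized papers, the per-class keyword lists, and a map from each class to its parent in the hierarchy; the method described in this paper additionally re-estimates the shrinkage weights at every iteration rather than fixing them.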
<Paragraph position="6"> Experimental evaluation of this bootstrapping approach is performed on a data set of thirty thousand computer science research papers. A 70-leaf hierarchy of computer science and a few keywords for each class are provided as input. Keyword matching alone provides 45% accuracy. Our bootstrapping algorithm uses this as input and outputs a naive Bayes text classifier that achieves 66% accuracy. Interestingly, this accuracy approaches the estimated human agreement level of 72%.</Paragraph>
<Paragraph position="7"> The experimental domain in this paper originates as part of the Ra research project, an effort to build domain-specific search engines on the Web with machine learning techniques. Our demonstration system, Cora, is a search engine over computer science research papers (McCallum et al., 1999). The bootstrapping classification algorithm described in this paper is used in Cora to place research papers into a Yahoo-like hierarchy specific to computer science. The search engine, including this hierarchy, is publicly available at www.cora.justresearch.com.</Paragraph>
</Section>
</Paper>