<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0908"> <Title>Text Classification by Bootstrapping with Keywords, EM and Shrinkage</Title> <Section position="6" start_page="55" end_page="56" type="evalu"> <SectionTitle> 5 Experimental Results </SectionTitle>
<Paragraph position="0"> In this section, we provide empirical evidence that bootstrapping a text classifier from unlabeled data can produce a high-accuracy text classifier. As a test domain, we use computer science research papers.</Paragraph>
<Paragraph position="1"> We have created a 70-leaf hierarchy of computer science topics, part of which is shown in Figure 1. Creating the hierarchy took about 60 minutes, during which we examined conference proceedings and explored computer science sites on the Web. Selecting a few keywords associated with each node took about 90 minutes. A test set was created by expert hand-labeling of a random sample of 625 research papers from the 30,682 papers in the Cora archive at the time we began these experiments. Of these, 225 (about one-third) did not fit into any category and were discarded, resulting in a 400-document test set. Labeling these 400 documents took about six hours. Some of these papers were outside the area of computer science (e.g. astrophysics papers), but most were computer science papers that a more complete hierarchy would have covered. The class frequencies of the data are not too skewed; on the test set, the most populous class accounted for only 7% of the documents.</Paragraph>
<Paragraph position="2"> Each research paper is represented as the words of the title, author, institution, references, and abstract. A detailed description of how these segments are automatically extracted is provided elsewhere (McCallum et al., 1999; Seymore et al., 1999). [Table 2 caption: Classification results for keyword matching, human agreement, naive Bayes (NB), naive Bayes combined with hierarchical shrinkage (S), and EM. Shown are classification accuracy (Acc) and the number of labeled (Lab), keyword-matched preliminarily-labeled (P-Lab), and unlabeled (Unlab) documents used by each method.]</Paragraph>
<Paragraph position="3"> Words occurring in fewer than five documents and words on a standard stoplist were discarded. No stemming was used. Bootstrapping was performed using the algorithm outlined in Table 1.</Paragraph>
<Paragraph position="4"> Table 2 shows classification results for the different classification techniques. The rule-list classifier based on the keywords alone provides 45% accuracy. (The 43% of documents in the test set containing no keywords cannot be assigned a class by the rule-list classifier, and are counted as incorrect.) As an interesting time comparison, about 100 documents could have been labeled in the time it took to generate the keyword lists. Naive Bayes accuracy with 100 labeled documents is only 30%. With 399 labeled documents (using our test set in a leave-one-out fashion), naive Bayes reaches 47%. When running the bootstrapping algorithm, 12,657 documents are given preliminary labels by keyword matching. EM and shrinkage incorporate the remaining 18,025 documents, &quot;fix&quot; the preliminary labels, and leverage the hierarchy; the resulting accuracy is 66%. As an interesting comparison, agreement on the test set between two human experts was 72%.</Paragraph>
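To make the procedure concrete, here is a minimal sketch of the bootstrapping loop outlined in Table 1: keyword matching assigns preliminary labels, a naive Bayes model is trained from them, and EM then relabels all documents, including those keyword matching left unlabeled. This is a sketch under stated assumptions, not the paper's implementation: documents are assumed to be token lists and the vocabulary a set of words, all helper names are hypothetical, and hierarchical shrinkage is omitted for brevity.

```python
import math
from collections import defaultdict

def keyword_label(doc, keywords_by_class):
    """Rule-list preliminary labeling: return the first class whose
    keyword list matches a word in the document, else None."""
    for cls, kws in keywords_by_class.items():
        if any(kw in doc for kw in kws):
            return cls
    return None

def m_step(docs, posteriors, classes, vocab):
    """Train multinomial naive Bayes (Laplace smoothing) from documents
    weighted by their class posteriors; documents with empty posteriors
    (unlabeled) contribute nothing."""
    prior = {c: 1.0 for c in classes}
    counts = {c: defaultdict(lambda: 1.0) for c in classes}
    total = {c: float(len(vocab)) for c in classes}
    for doc, post in zip(docs, posteriors):
        for c, p in post.items():
            prior[c] += p
            for w in doc:
                if w in vocab:
                    counts[c][w] += p
                    total[c] += p
    z = sum(prior.values())
    log_prior = {c: math.log(prior[c] / z) for c in classes}
    log_cond = {c: {w: math.log(counts[c][w] / total[c]) for w in vocab}
                for c in classes}
    return log_prior, log_cond

def e_step(doc, log_prior, log_cond, classes):
    """Posterior class distribution for one document (softmax over
    log joint scores; out-of-vocabulary words are ignored)."""
    score = {c: log_prior[c] + sum(log_cond[c].get(w, 0.0) for w in doc)
             for c in classes}
    m = max(score.values())
    exp = {c: math.exp(s - m) for c, s in score.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def bootstrap(docs, keywords_by_class, vocab, iters=5):
    classes = list(keywords_by_class)
    # Step 1: preliminary labels by keyword matching; unmatched
    # documents start out unlabeled.
    post = []
    for d in docs:
        c = keyword_label(d, keywords_by_class)
        post.append({c: 1.0} if c is not None else {})
    # Steps 2-3: alternate training the classifier and relabeling ALL
    # documents, "fixing" the preliminary labels and drawing in the
    # documents that keyword matching left unlabeled.
    for _ in range(iters):
        log_prior, log_cond = m_step(docs, post, classes, vocab)
        post = [e_step(d, log_prior, log_cond, classes) for d in docs]
    return log_prior, log_cond
```

Note that the first M-step in this sketch is trained purely from the hard preliminary labels, which is exactly the supervised-from-preliminary-labels condition discussed next.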
<Paragraph position="5"> A few further experiments reveal some of the inner workings of bootstrapping. If we build a naive Bayes classifier in the standard supervised way from the 12,657 preliminarily labeled documents, the classifier gets 47% accuracy. This corresponds to the performance of the first iteration of bootstrapping.</Paragraph>
<Paragraph position="6"> Note that this matches the accuracy of traditional naive Bayes with 399 labeled training documents, but that it requires less than a quarter of the human labeling effort. If we run bootstrapping without the 18,025 documents left unlabeled by keyword matching, accuracy reaches 63%. This indicates that shrinkage and EM on the preliminarily labeled documents provide substantially more benefit than the remaining unlabeled documents.</Paragraph>
<Paragraph position="7"> One explanation for the small impact of the 18,025 documents left unlabeled by keyword matching is that many of them do not fall naturally into the hierarchy. Remember that about one-third of the 30,000 documents fall outside the hierarchy. Most of these will not be given preliminary labels by keyword matching. The presence of these outlier documents skews EM parameter estimation. A more inclusive computer science hierarchy would allow the unlabeled documents to benefit classification more.</Paragraph>
<Paragraph position="8"> However, even without a complete hierarchy, we could use these documents if we could identify the outliers. Some techniques for robust estimation with EM are discussed by McLachlan and Basford (1988).</Paragraph>
<Paragraph position="9"> One specific technique for these text hierarchies is to add extra leaf nodes containing uniform word distributions to each interior node of the hierarchy, in order to capture documents that do not belong in any of the predefined topic leaves. This should allow EM to perform well even when a large percentage of the documents do not fall into the given classification hierarchy. A similar approach is also planned for research in topic detection and tracking (TDT) (Baker et al., 1999). Experimentation with these techniques is an area of ongoing research.</Paragraph>
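As a hypothetical illustration of the outlier-leaf idea, continuing the model representation from the earlier sketch: one extra class is added whose word distribution is fixed and uniform over the vocabulary, so that during EM documents fitting no topic leaf gravitate toward it instead of skewing topic parameters. The single root-level leaf, its name, and the prior mass are illustrative simplifications; the paper proposes one such leaf per interior node.

```python
import math

def add_uniform_leaf(log_prior, log_cond, vocab,
                     leaf="OUTLIER", prior_mass=0.1):
    """Add an extra leaf with a uniform word distribution over the
    vocabulary. In practice this distribution would be held fixed
    (excluded from the M-step) across EM iterations."""
    # Reserve a little prior mass for the outlier leaf and renormalize.
    log_prior = {c: lp + math.log(1.0 - prior_mass)
                 for c, lp in log_prior.items()}
    log_prior[leaf] = math.log(prior_mass)
    # Uniform word distribution: every vocabulary word equally likely.
    log_cond = dict(log_cond)
    log_cond[leaf] = {w: -math.log(len(vocab)) for w in vocab}
    return log_prior, log_cond
```
</Section> </Paper>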