<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1033">
  <Title>Learning with Unlabeled Data for Text Categorization Using Bootstrapping and Feature Projection Techniques</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Text categorization is the task of classifying documents into a fixed number of pre-defined categories. Many supervised learning algorithms have been applied to this area, and they are now reasonably successful when provided with enough labeled or annotated training examples. Examples include Naive Bayes (McCallum and Nigam, 1998), Rocchio (Lewis et al., 1996), k-Nearest Neighbor (kNN) (Yang et al., 2002), TCFP (Ko and Seo, 2002), and Support Vector Machines (SVM) (Joachims, 1998).</Paragraph>
    <Paragraph position="1"> However, the supervised learning approach has some difficulties. One key difficulty is that it requires a large, often prohibitively large, amount of labeled training data for accurate learning. Since labeling must be done manually, it is a painfully time-consuming process. Furthermore, as the application areas of text categorization have diversified from newswire articles and web pages to E-mails and newsgroup postings, creating training data for each application area has also become difficult (Nigam et al., 1998). In this light, we consider learning algorithms that do not require such a large amount of labeled data.</Paragraph>
    <Paragraph position="2"> While labeled data are difficult to obtain, unlabeled data are readily available and plentiful.</Paragraph>
    <Paragraph position="3"> Therefore, this paper advocates using a bootstrapping framework and a feature projection technique with just unlabeled data for text categorization. The input to the bootstrapping process is a large amount of unlabeled data and a small amount of seed information to tell the learner about the specific task. In this paper, we consider seed information in the form of title words associated with categories. In general, since unlabeled data are much less expensive and easier to collect than labeled data, our method is useful for text categorization tasks including online data sources such as web pages, E-mails, and newsgroup postings.</Paragraph>
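The input described above can be illustrated concretely as follows (a minimal sketch; the variable names and example words are hypothetical, not taken from the paper):

```python
# Hypothetical shape of the bootstrapping input: a large pool of
# unlabeled documents plus one or more seed title words per category.
unlabeled_docs = [
    "the new sedan has a six speed transmission and a roomy trunk",
    "the pitcher threw a fastball in the ninth inning of the game",
    # ... in practice, many more unlabeled documents
]

# Seed information: a representative title word for each pre-defined
# category (the categories and words here are illustrative only).
title_words = {
    "Autos": ["automobile"],
    "Baseball": ["baseball"],
}
```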
    <Paragraph position="4"> To automatically build a text classifier from unlabeled data, we must solve two problems: how to automatically generate labeled training documents (machine-labeled data) from only title words, and how to handle incorrectly labeled documents in the machine-labeled data. This paper provides solutions for both problems. For the first problem, we employ the bootstrapping framework.</Paragraph>
    <Paragraph position="5"> For the second, we use the TCFP classifier with robustness from noisy data (Ko and Seo, 2004).</Paragraph>
    <Paragraph position="6"> How can labeled training data be automatically created from unlabeled data and title words? At first glance, unlabeled data seem to offer no information for building a text classifier, because they lack the most important information: their category. We must therefore assign a class to each document in order to use supervised learning approaches. Since text categorization is a task based on pre-defined categories, the categories for classifying documents are known in advance. Knowing the categories means that we can choose at least one representative title word for each category. This is the starting point of our proposed method: by carrying out a bootstrapping task from these title words, we can finally obtain labeled training data.</Paragraph>
    <Paragraph position="7"> Suppose, for example, that we are interested in classifying newsgroup postings into the 'Autos' category. First, we can select 'automobile' as a title word, and automatically extract keywords ('car', 'gear', 'transmission', 'sedan', and so on) using co-occurrence information. In our method, we use a context (a sequence of 60 words) as the unit of meaning for bootstrapping from title words; its size generally lies between that of a sentence and that of a document. We then extract core contexts that include at least one of the title words or keywords. We call them centroid-contexts because they are regarded as contexts carrying the core meaning of each category. From the centroid-contexts, we can gather many words that contextually co-occur with the title words and keywords: 'driver', 'clutch', 'trunk', and so on. These are words in first-order co-occurrence with the title words and keywords. To gather more vocabulary, we extract contexts that are similar to centroid-contexts according to a similarity measure; they contain words in second-order co-occurrence with the title words and keywords. We finally construct the context-cluster of each category as the combination of its centroid-contexts and the contexts selected by the similarity measure. Using the context-clusters as labeled training data, a Naive Bayes classifier can be built. Since this Naive Bayes classifier can label all unlabeled documents with their categories, we can finally obtain labeled training data (machine-labeled data).</Paragraph>
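The pipeline in this paragraph can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: the co-occurrence scoring, the 60-word windowing, and the similarity measure are reduced to minimal stand-ins (raw co-occurrence counts and bag-of-words cosine similarity), and all function names and thresholds are hypothetical:

```python
from collections import Counter
from math import sqrt

def contexts(text, size=60):
    """Slice a document into fixed-size word windows (the paper's 'contexts')."""
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]

def extract_keywords(docs, title_word, top_n=5):
    """Pick words that most often co-occur with the title word in a context
    (a crude stand-in for the paper's co-occurrence measure)."""
    counts = Counter()
    for doc in docs:
        for ctx in contexts(doc):
            if title_word in ctx:
                counts.update(w for w in ctx if w != title_word)
    return [w for w, _ in counts.most_common(top_n)]

def centroid_contexts(docs, seeds):
    """Contexts containing at least one title word or keyword."""
    return [ctx for doc in docs for ctx in contexts(doc)
            if any(s in ctx for s in seeds)]

def cosine(a, b):
    """Cosine similarity between two bag-of-words contexts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_cluster(docs, title_word, threshold=0.2):
    """Centroid-contexts plus contexts that are similar to any of them
    (the second-order co-occurrence expansion)."""
    seeds = [title_word] + extract_keywords(docs, title_word)
    centers = centroid_contexts(docs, seeds)
    cluster = list(centers)
    for doc in docs:
        for ctx in contexts(doc):
            if ctx not in centers and any(cosine(ctx, c) >= threshold for c in centers):
                cluster.append(ctx)
    return cluster
```

The resulting context-clusters, one per category, would then serve as the labeled training data for a Naive Bayes classifier that labels the full document pool.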
    <Paragraph position="8"> When the machine-labeled data are used to learn a text classifier, another difficulty arises: they contain more incorrectly labeled documents than manually labeled data. We therefore develop and employ the TCFP classifier, which is robust to noisy data.</Paragraph>
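TCFP itself is the feature-projection classifier of Ko and Seo (2004) and is not reproduced here. As a generic, hedged illustration of why label noise in machine-labeled data must be handled, a common mitigation (explicitly not the paper's method) is to keep only documents whose assigned label a classifier confirms with high confidence; `predict_proba` below is a hypothetical stand-in for any probabilistic classifier:

```python
def filter_noisy(machine_labeled, predict_proba, min_conf=0.8):
    """Keep only machine-labeled documents whose predicted label agrees
    with the assigned label at high confidence. This is an illustrative
    generic noise filter, NOT the TCFP method of Ko and Seo (2004)."""
    kept = []
    for doc, label in machine_labeled:
        probs = predict_proba(doc)          # hypothetical: label -> probability
        best = max(probs, key=probs.get)
        if best == label and probs[best] >= min_conf:
            kept.append((doc, label))
    return kept
```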
    <Paragraph position="9"> The rest of this paper is organized as follows.</Paragraph>
    <Paragraph position="10"> Section 2 reviews previous work. Sections 3 and 4 explain the proposed method in detail.</Paragraph>
    <Paragraph position="11"> Section 5 is devoted to the analysis of the empirical results. The final section presents conclusions and future work.</Paragraph>
  </Section>
</Paper>