<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1033">
  <Title>Learning with Unlabeled Data for Text Categorization Using Bootstrapping and Feature Projection Techniques</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> In general, related approaches for using unlabeled data in text categorization have two directions; One builds classifiers from a combination of labeled and unlabeled data (Nigam, 2001; Bennett and Demiriz, 1999), and the other employs clustering algorithms for text categorization (Slonim et al., 2002).</Paragraph>
    <Paragraph position="1"> Nigam studied an Expected Maximization (EM) technique for combining labeled and unlabeled data for text categorization in his dissertation. He showed that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training data with a large pool of unlabeled data.</Paragraph>
    <Paragraph position="2"> Bennet and Demiriz achieved small improvements on some UCI data sets using SVM.</Paragraph>
    <Paragraph position="3"> It seems that SVMs assume that decision boundaries lie between classes in low-density regions of instance space, and the unlabeled examples help find these areas.</Paragraph>
    <Paragraph position="4"> Slonim suggested clustering techniques for unsupervised document classification. Given a collection of unlabeled data, he attempted to find clusters that are highly correlated with the true topics of documents by unsupervised clustering methods. In his paper, Slonim proposed a new clustering method, the sequential Information Bottleneck (sIB) algorithm.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Bootstrapping Algorithm for Creating
Machine-labeled Data
</SectionTitle>
    <Paragraph position="0"> The bootstrapping framework described in this paper consists of the following steps. Each module is described in the following sections in detail.</Paragraph>
    <Paragraph position="1">  1. Preprocessing: Contexts are separated from unlabeled documents and content words are extracted from them.</Paragraph>
    <Paragraph position="2"> 2. Constructing context-clusters for training: - Keywords of each category are created - Centroid-contexts are extracted and verified - Context-clusters are created by a similarity measure 3. Learning Classifier: Naive Bayes classifier are learned by using the context-clusters</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> The preprocessing module has two main roles: extracting content words and reconstructing the collected documents into contexts. We use the Brill POS tagger to extract content words (Brill, 1995).</Paragraph>
      <Paragraph position="1"> Generally, the supervised learning approach with labeled data regards a document as a unit of meaning. But since we can use only the title words and unlabeled data, we define context as a unit of meaning and we employ it as the meaning unit to bootstrap the meaning of each category. In our system, we regard a sequence of 60 content words within a document as a context. To extract contexts from a document, we use sliding window techniques (Maarek et al., 1991). The window is a slide from the first word of the document to the last in the size of the window (60 words) and the interval of each window (30 words). Therefore, the final output of preprocessing is a set of context vectors that are represented as content words of each context.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Constructing Context-Clusters for
Training
</SectionTitle>
      <Paragraph position="0"> At first, we automatically create keywords from a title word for each category using co-occurrence information. Then centroid-contexts are extracted using the title word and keywords. They contain at least one of the title and keywords. Finally, we can gain more information of each category by assigning remaining contexts to each context-cluster using a similarity measure technique; the remaining contexts do not contain any keywords or title words.</Paragraph>
      <Paragraph position="1">  The starting point of our method is that we have title words and collected documents. A title word can present the main meaning of each category but it could be insufficient in representing any category for text categorization. Thus we need to find words that are semantically related to a title word, and we define them as keywords of each category.</Paragraph>
      <Paragraph position="2"> The score of semantic similarity between a title word, T, and a word, W, is calculated by the cosine metric as follows:</Paragraph>
      <Paragraph position="4"> represent the occurrence (binary value: 0 or 1) of words T and W in i-th document respectively, and n is the total number of documents in the collected documents. This method calculates the similarity score between words based on the degree of their co-occurrence in the same document.</Paragraph>
      <Paragraph position="5"> Since the keywords for text categorization must have the power to discriminate categories as well as similarity with the title words, we assign a word to the keyword list of a category with the maximum similarity score and recalculate the score of the word in the category using the following formula:</Paragraph>
      <Paragraph position="7"> is the title word with the maximum similarity score with a word W, c max is the category of the title word T  max , and T secondmax is other title word with the second high similarity score with the word W.</Paragraph>
      <Paragraph position="8"> This formula means that a word with high  ranking in a category has a high similarity score with the title word of the category and a high similarity score difference with other title words. We sort out words assigned to each category according to the calculated score in descending order. We then choose top m words as keywords in the category. Table 1 shows the list of keywords (top 5) for each category in the WebKB data set.  course course assignments, hours, instructor, class, fall faculty professor associate, ph.d, fax, interests, publications project project system, systems, research, software, information student student graduate, computer, science, page, university  We choose contexts with a keyword or a title word of a category as centroid-contexts. Among centroid-contexts, some contexts could not have good features of a category even though they include the keywords of the category. To rank the importance of centroid-contexts, we compute the importance score of each centroid-context. First of all, weights (W</Paragraph>
      <Paragraph position="10"> calculated using Term Frequency (TF) within a category and Inverse Category Frequency (ICF) (Cho and Kim, 1997) as follows:</Paragraph>
      <Paragraph position="12"> and M is the total number of categories.</Paragraph>
      <Paragraph position="13"> Using word weights (W ij ) calculated by formula 3, the score of a centroid-context (S  where N is the number of words in the centroidcontext. null As a result, we obtain a set of words in first-order co-occurrence from centroid-contexts of each category.</Paragraph>
      <Paragraph position="14">  We gather the second-order co-occurrence information by assigning remaining contexts to the context-cluster of each category. For the assigning criterion, we calculate similarity between remaining contexts and centroid-contexts of each category. Thus we employ the similarity measure technique by Karov and Edelman (1998). In our method, a part of this technique is reformed for our purpose and remaining contexts are assigned to each context-cluster by that revised technique. 1) Measurement of word and context similarities As similar words tend to appear in similar contexts, we can compute the similarity by using contextual information. Words and contexts play complementary roles. Contexts are similar to the extent that they contain similar words, and words are similar to the extent that they appear in similar contexts (Karov and Edelman, 1998). This definition is circular. Thus it is applied iteratively using two matrices, WSM and CSM.</Paragraph>
      <Paragraph position="15"> Each category has a word similarity matrix</Paragraph>
      <Paragraph position="17"> and a context similarity matrix CSM n . In each iteration n, we update WSM n , whose rows and columns are labeled by all content words encountered in the centroid-contexts of each category and input remaining contexts. In that matrix, the cell (i,j) holds a value between 0 and 1, indicating the extent to which the i-th word is contextually similar to the j-th word. Also, we keep and update a CSM n , which holds similarities among contexts. The rows of CSM n correspond to the remaining contexts and the columns to the centroid-contexts. In this paper, the number of input contexts of row and column in CSM is limited to 200, considering execution time and memory allocation, and the number of iterations is set as 3.</Paragraph>
      <Paragraph position="18"> To compute the similarities, we initialize WSM n to the identity matrix. The following steps are iterated until the changes in the similarity values are small enough.</Paragraph>
      <Paragraph position="19">  To simplify the symmetric iterative treatment of similarity between words and contexts, we define an auxiliary relation between words and contexts as affinity.</Paragraph>
      <Paragraph position="20"> Affinity formulae are defined as follows (Karov and Edelman, 1998):</Paragraph>
      <Paragraph position="22"> In the above formulae, n denotes the iteration number, and the similarity values are defined by</Paragraph>
      <Paragraph position="24"> . Every word has some affinity to the context, and the context can be represented by a vector indicating the affinity of each word to it.  The weights in formula 7 are computed as reflecting global frequency, log-likelihood factors, and part of speech as used in (Karov and Edelman, 1998). The sum of weights in formula 8, which is a reciprocal number of contexts that contain W  Each remaining context is assigned to a category which has a maximum similarity value. But there may exist noisy remaining contexts which do not belong to any category. To remove these noisy remaining contexts, we set up a dropping threshold using normal distribution of similarity values as follows (Ko and Seo, 2000):</Paragraph>
      <Paragraph position="26"> where i) X is a remaining context, ii) u is an average of similarity values , iii) s is a standard deviation of similarity values, and iv) th is a numerical value corresponding to the threshold (%) in normal distribution table.</Paragraph>
      <Paragraph position="28"> Finally, a remaining context is assigned to the context-cluster of any category when the category has a maximum similarity above the dropping threshold value. In this paper, we empirically use a 15% threshold value from an experiment using a validation set.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Learning the Naive Bayes Classifier Using
Context-Clusters
</SectionTitle>
      <Paragraph position="0"> In above section, we obtained labeled training data: context-clusters. Since training data are labeled as the context unit, we employ a Naive Bayes classifier because it can be built by estimating the word probability in a category, but not in a document. That is, the Naive Bayes classifier does not require labeled data with the unit of documents unlike other classifiers.</Paragraph>
      <Paragraph position="1"> We use the Naive Bayes classifier with minor modifications based on Kullback-Leibler Divergence (Craven et al., 2000). We classify a document d i according to the following formula:</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Handling Noisy Data in Machine-labeled
Data
</SectionTitle>
      <Paragraph position="0"> We finally obtained labeled data of a documents unit, machine-labeled data. Now we can learn text classifiers using them. But since the machine-labeled data are created by our method, they generally include far more incorrectly labeled documents than the human-labeled data. Thus we employ a feature projection technique for our method. By the property of the feature projection technique, a classifier (the TCFP classifier) can have robustness from noisy data (Ko and Seo, 2004). As seen in our experiment results, TCFP showed the highest performance among conventional classifiers in using machine-labeled data.</Paragraph>
      <Paragraph position="1"> The TCFP classifier with robustness from noisy data Here, we simply describe the TCFP classifier using the feature projection technique (Ko and Seo, 2002; 2004). In this approach, the classification knowledge is represented as sets of projections of training data on each feature dimension. The classification of a test document is based on the voting of each feature of that test document. That is, the final prediction score is calculated by accumulating the voting scores of all features.</Paragraph>
      <Paragraph position="2"> First of all, we must calculate the voting ratio of each category for all features. Since elements with a high TF-IDF value in projections of a feature must become more useful classification criteria for the feature, we use only elements with TF-IDF values above the average TF-IDF value for voting. And the selected elements participate in proportional voting with the same importance as the TF-IDF value of each element. The voting ratio of each category c j in a feature t  denotes a set of elements selected for voting and is a function; if the category for an element t is equal to c , the output value is 1. Otherwise, the output value is 0.  Next, since each feature separately votes on feature projections, contextual information is missing. Thus we calculate co-occurrence frequency of features in the training data and modify TF-IDF values of two terms t i and t j in a test document by co-occurrence frequency between them; terms with a high co-occurrence frequency value have higher term weights.</Paragraph>
      <Paragraph position="3"> Finally, the voting score of each category c in the m-th feature t j m of a test document d is calculated by the following formula:</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML