<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1066"> <Title>Automatic Text Categorization by Unsupervised Learning</Title> <Section position="1" start_page="0" end_page="455" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The goal of text categorization is to classify documents into a certain number of pre-defined categories. Previous work in this area has used a large number of labeled training documents for supervised learning. One problem is that it is difficult to create the labeled training documents. While it is easy to collect unlabeled documents, it is not so easy to manually categorize them for creating training documents. In this paper, we propose an unsupervised learning method to overcome these difficulties. The proposed method divides the documents into sentences, and categorizes each sentence using keyword lists of each category and a sentence similarity measure. It then uses the categorized sentences for training.</Paragraph> <Paragraph position="1"> The proposed method shows a similar degree of performance compared with traditional supervised learning methods.</Paragraph> <Paragraph position="2"> Therefore, this method can be used in areas where low-cost text categorization is needed.</Paragraph> <Paragraph position="3"> It can also be used for creating training documents.</Paragraph> <Paragraph position="4"> Introduction With the rapid growth of the internet, the availability of on-line text information has increased considerably. As a result, text categorization has become one of the key techniques for handling and organizing text data. Automatic text categorization in previous work is a supervised learning task, defined as assigning pre-defined category labels to text documents based on the likelihood suggested by a training set of labeled documents. However, the previous learning algorithms have some problems. One of them is that they require a large, often prohibitive, number of labeled training documents for accurate learning.</Paragraph> <Paragraph position="5"> Since the application area of automatic text categorization has diversified from newswire articles and web pages to electronic mails and newsgroup postings, it is a difficult task to create training data for each application area (Nigam K. et al., 1998).</Paragraph> <Paragraph position="6"> In this paper, we propose a new automatic text categorization method based on unsupervised learning. Without creating training documents by hand, it automatically creates training sentence sets using keyword lists of each category. It then uses them for training and classifies text documents. The proposed method can provide basic data for creating training documents from collected documents, and can be used in an application area to classify text documents at low cost. We use the χ² statistic (Yang Y. et al., 1998) as a feature selection method and the naive Bayes classifier (McCallum A. et al., 1998) as a statistical text classifier. The naive Bayes classifier is one of the statistical text classifiers that use word frequencies as features. Other examples include k-nearest-neighbor (Yang Y. et al., 1994), TFIDF/Rocchio (Lewis D.D. et al., 1996), support vector machines (Joachims T. et al., 1998) and decision trees (Lewis D.D. et al., 1994).</Paragraph>
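As a concrete illustration of this kind of pipeline (χ² feature selection feeding a naive Bayes classifier over word frequencies), the following is a minimal sketch assuming scikit-learn; the toy corpus, the category labels, and the choice of k are illustrative placeholders, not values from the paper.

# Minimal sketch: chi-square feature selection + naive Bayes text classifier.
# Assumes scikit-learn; the toy corpus and feature count are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["stocks fell sharply in early trading", "the team won the final match"]
labels = ["economy", "sports"]

clf = Pipeline([
    ("vec", CountVectorizer()),       # word-frequency features
    ("sel", SelectKBest(chi2, k=2)),  # keep the top features by chi-square score
    ("nb", MultinomialNB()),          # statistical classifier over word counts
])
clf.fit(docs, labels)
print(clf.predict(["the match ended in a draw"]))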
<Paragraph position="7"> 1 Proposal: A text categorization scheme The proposed system consists of three modules, as shown in Figure 1: a module to preprocess collected documents, a module to create training sentence sets, and a module to extract features and classify text documents.</Paragraph> <Paragraph position="9"> Figure 1: Architecture of the proposed system</Paragraph> <Section position="1" start_page="453" end_page="453" type="sub_section"> <SectionTitle> 1.1 Preprocessing </SectionTitle> <Paragraph position="0"> First, the HTML tags and special characters in the collected documents are removed. Then, the contents of the documents are segmented into sentences. We extract content words for each sentence using only nouns. In Korean, there are active-predicative common nouns which become verbs when they are combined with verb-derivational suffixes (e.g., ha-ta 'do', toy-ta 'become', etc.). There are also stative-predicative common nouns which become adjectives when they are combined with adjective-derivational suffixes such as ha. These derived verbs and adjectives are productive in Korean, and they are classified as nouns by the Korean POS tagger. Other verbs and adjectives are not informative in many cases.</Paragraph> </Section> <Section position="2" start_page="453" end_page="455" type="sub_section"> <SectionTitle> 1.2 Creating training sentence sets </SectionTitle> <Paragraph position="0"> Because the proposed system does not have training documents, training sentence sets for each category, corresponding to the training documents, have to be created. We define keywords for each category by hand, chosen to capture the special features of each category sufficiently. To choose these keywords, we first regard category names and their synonyms as keywords. We then include several words that have a definite meaning for each category. The average number of keywords per category is 3 (141 keywords in total for 47 categories). Table 1 lists examples of keywords for each category.</Paragraph> <Paragraph position="2"> Next, the sentences which contain the pre-defined keywords of a category among their content words are chosen as the initial representative sentences. The remaining sentences are called unclassified sentences. We scale up the representative sentence sets by assigning the unclassified sentences to their related categories. This assignment is done by measuring the similarities of the unclassified sentences to the representative sentences. We elaborate on this process in the next two subsections.</Paragraph> <Paragraph position="3"> Removing error sentences in the representative sentences We define a representative sentence as one that contains pre-defined keywords of the category among its content words. But there exist error sentences among the representative sentences: they do not have the special features of a category even though they contain the keywords of the category. To remove such error sentences, we rank the representative sentences by computing the weight of each sentence as follows: 1) Word weights are computed using Term Frequency (TF) and Inverse Category Frequency (ICF) (Cho K. et al., 1997).</Paragraph> <Paragraph position="4"> ① The within-category word frequency ($TF_{ij}$): $TF_{ij}$ = the number of times word $t_i$ occurs in the $j$th category. (1) ② In Information Retrieval, Inverse Document Frequency (IDF) is generally used. But a sentence is the processing unit in the proposed method, so the document frequency cannot be counted. Also, since ICF was defined by Cho K. et al. (1997) and its efficiency was verified, we use it in the proposed method. ICF is computed as follows:</Paragraph> <Paragraph position="5"> $ICF_i = \log\left(\frac{M}{CF_i}\right)$ (2)</Paragraph> <Paragraph position="6"> where $CF_i$ is the number of categories that contain $t_i$, and $M$ is the total number of categories.</Paragraph> <Paragraph position="7"> The combination (TFICF) of ① and ② above, i.e., the weight $w_{ij}$ of word $t_i$ in the $j$th category, is computed as follows:</Paragraph> <Paragraph position="8"> $w_{ij} = TF_{ij} \times ICF_i$ (3)</Paragraph> <Paragraph position="9"> 2) Using the word weights ($w_{ij}$) computed in 1), the sentence weight $W_j$ in the $j$th category is computed as follows: $W_j = \frac{w_{1j} + w_{2j} + \cdots + w_{Nj}}{N}$ (4) where $N$ is the total number of words in the sentence.</Paragraph> <Paragraph position="10"> 3) The representative sentences of each category are sorted in decreasing order of the weight computed in 2). Then, the top 70% of the representative sentences are selected and used in our experiment; this threshold was decided empirically.</Paragraph>
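A minimal sketch of this TF-ICF ranking step, written directly from equations (1)-(4), follows; the function names, the dictionary-based data layout, and the usage example are illustrative assumptions, not the authors' code.

# TF-ICF word weights and sentence ranking, following equations (1)-(4).
import math
from collections import Counter

def tficf_weights(category_tokens):
    """category_tokens: dict mapping category name to the list of all content
    words in that category's representative sentences."""
    M = len(category_tokens)                          # total number of categories
    tf = {c: Counter(toks) for c, toks in category_tokens.items()}   # TF_ij, eq. (1)
    cf = Counter()                                    # CF_i: categories containing t_i
    for counts in tf.values():
        for word in counts:
            cf[word] += 1
    # w_ij = TF_ij * ICF_i with ICF_i = log(M / CF_i), equations (2)-(3)
    return {c: {w: n * math.log(M / cf[w]) for w, n in counts.items()}
            for c, counts in tf.items()}

def sentence_weight(sentence_words, weights_j):
    # W_j = average weight of the words in the sentence, equation (4)
    return sum(weights_j.get(w, 0.0) for w in sentence_words) / len(sentence_words)

def top_70_percent(sentences, weights_j):
    # Keep the top 70% of representative sentences by weight (empirical cutoff).
    ranked = sorted(sentences, key=lambda s: sentence_weight(s, weights_j), reverse=True)
    return ranked[:int(len(ranked) * 0.7)]

# Toy usage (hypothetical data):
# w = tficf_weights({"economy": ["stock", "price", "market"], "sports": ["team", "match"]})
# kept = top_70_percent([["stock", "market"], ["team", "stock"]], w["economy"])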
<Paragraph position="11"> To extend the representative sentence sets, the unclassified sentences are classified into their related categories by measuring the similarities of the unclassified sentences to the representative sentences.</Paragraph> <Paragraph position="12"> (1) Measurement of word and sentence similarities As similar words tend to appear in similar contexts, we compute the similarity using contextual information (Kim H. et al., 1999; Karov Y. et al., 1999). In this paper, words and sentences play complementary roles: a sentence is represented by the set of words it contains, and a word by the set of sentences in which it appears. Sentences are similar to the extent that they contain similar words, and words are similar to the extent that they appear in similar sentences. This definition is circular. Thus, it is applied iteratively using two matrices, as shown in Figure 2. In this paper, we set the number of iterations to 3, as recommended by Karov Y. et al. (1999).</Paragraph> <Paragraph position="13"> Figure 2: Word and sentence similarity matrices. In Figure 2, each category has a word similarity matrix WSM_n and a sentence similarity matrix SSM_n. In each iteration n, we update WSM_n, whose rows and columns are labeled by all the content words encountered in the representative sentences of each category and in the input unclassified sentences. In that matrix, the cell (i, j) holds a value between 0 and 1, indicating the extent to which the i-th word is contextually similar to the j-th word. We also keep and update SSM_n, which holds the similarities among sentences. The rows of SSM_n correspond to the unclassified sentences and the columns to the representative sentences. In this paper, the number of input sentences in the rows and columns of the SSM is limited to 200, considering execution time and memory allocation.</Paragraph> <Paragraph position="14"> To compute the similarities, we initialize WSM_0 to the identity matrix. That is, each word is fully similar (1) to itself and completely dissimilar (0) to other words. The following steps are iterated until the changes in the similarity values are small enough:</Paragraph> <Paragraph position="15"> 1. Update the sentence similarity matrix SSM_n, using the word similarity matrix WSM_{n-1}.</Paragraph> <Paragraph position="16"> 2. Update the word similarity matrix WSM_n, using the sentence similarity matrix SSM_n.</Paragraph>
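The alternating update can be sketched as follows. This is a simplified toy version under stated assumptions: uniform word weights stand in for the weighted sums of Karov Y. et al. (1999), similarities are computed over all sentence pairs rather than only unclassified-versus-representative ones, a fixed 3 iterations replace the convergence test, and the names (iterate_similarities, etc.) are illustrative.

# Toy alternating update of the word and sentence similarity matrices.
import numpy as np

def iterate_similarities(sentences, vocab_size, n_iter=3):
    """sentences: list of lists of word ids (content words only)."""
    S = len(sentences)
    wsm = np.eye(vocab_size)      # WSM_0: each word similar only to itself
    ssm = np.zeros((S, S))
    for _ in range(n_iter):       # 3 iterations, as in the paper
        # Step 1: sentence similarity from word similarity.
        # aff(W, S2) = max similarity of W to any word of S2; average over S1.
        for a, sent_a in enumerate(sentences):
            for b, sent_b in enumerate(sentences):
                ssm[a, b] = np.mean([wsm[w, sent_b].max() for w in sent_a])
        # Step 2: word similarity from sentence similarity.
        # aff(S, W2) = max similarity of S to any sentence containing W2.
        occ = [[i for i, s in enumerate(sentences) if w in s]
               for w in range(vocab_size)]
        new_wsm = np.eye(vocab_size)
        for w1 in range(vocab_size):
            for w2 in range(vocab_size):
                if w1 == w2 or not occ[w1] or not occ[w2]:
                    continue
                new_wsm[w1, w2] = np.mean([ssm[i, occ[w2]].max() for i in occ[w1]])
        wsm = new_wsm
    return wsm, ssm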
<Paragraph position="17"> (2) Affinity formulae To simplify the symmetric iterative treatment of similarity between words and sentences, we define an auxiliary relation between words and sentences, called affinity. A word W is assumed to have a certain affinity to every sentence, which is a real number between 0 and 1. It reflects the contextual relationships between W and the words of the sentence. If W belongs to a sentence S, its affinity to S is 1. If W is totally unrelated to S, the affinity is close to 0. If W is contextually similar to the words of S, its affinity to S is between 0 and 1. In a similar manner, a sentence S has some affinity to every word, reflecting the similarity of S to the sentences involving that word.</Paragraph> <Paragraph position="18"> The affinity formulae are defined as follows (Karov Y. et al., 1999). In these formulae, $W \in S$ means that the word belongs to the sentence: $\text{aff}_n(W, S) = \max_{W_i \in S} \text{sim}_n(W, W_i)$ (5) $\text{aff}_n(S, W) = \max_{S_j \ni W} \text{sim}_n(S, S_j)$ (6)</Paragraph> <Paragraph position="19"> In the above formulae, n denotes the iteration number, and the similarity values are defined by WSM_n and SSM_n. Every word has some affinity to a sentence, and the sentence can be represented by a vector indicating the affinity of each word to it.</Paragraph> </Section> </Section> </Paper>