<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1087"> <Title>Selforganizing classification on the Reuters news corpus</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Dimensionality Reduction </SectionTitle> <Paragraph position="0"> The VSM (Vector Space Model) is a basic technique for transforming text documents into numeric vectors. Neural network approaches to text classification, including the SOM model, often apply the VSM in their pre-processing stage. The SOM does not reduce the length of input vectors; it only represents high-dimensional input vectors by prearranged units on a low-dimensional map. Dealing with a huge text collection therefore means dealing with a huge dimensionality, which needs to be reduced for neural approaches such as the SOM (Berry et al. 1999).</Paragraph> <Paragraph position="1"> In the field of linear algebra, PCA (Principal Component Analysis), SVD (Singular Value Decomposition) and random projection are effective for dimensionality reduction, but they suffer from two main side effects: the results are difficult to interpret, and accuracy is reduced.</Paragraph> <Paragraph position="2"> Rather than introducing hierarchies from the SOM, we want to exploit existing semantic knowledge, in particular from WordNet. WordNet (Miller, 1985) is a network of semantic relationships between English words. Sets of synonyms compose synsets, which are the basic relations in WordNet; words in the same synset express the same or a similar concept. In addition to synonymy, there are several other types of semantic relations, such as antonymy, hyponymy, meronymy, troponymy and entailment, within each syntactic category, i.e. nouns, verbs, adjectives and adverbs. This semantic dictionary is useful for extracting the real concept of a word, a query or a document in the field of text mining (Richardson 1994; Richardson and Smeaton 1995; Voorhees 1993; Voorhees 1998; Scott and Matwin 1998; Gonzalo et al. 1998; Moldovan and Mihalcea 1998; Moldovan and Mihalcea 2000). Using these semantic relations in WordNet, one index word may represent its many synonyms, siblings or other related words. Therefore, by mapping words to more general concepts, WordNet can be used to reduce the dimensionality.</Paragraph> <Paragraph position="3"> Instead of using these approaches to reduce multi-dimensional vectors, we apply significance vectors to represent the importance of words in each semantic category and use the pre-assigned topics as the axes of the multi-dimensional space. Thus a news article can be represented by an n-dimensional vector, where n is the number of pre-assigned topics. This method offers a way to escape the curse of dimensionality. A more detailed description is given in Section 3.2.</Paragraph> </Section>
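To make the WordNet relations used in the rest of the paper concrete, the following is a minimal sketch using NLTK's WordNet interface (an assumption made for illustration; the work itself used the WordNet database and its own tools). It prints the synsets of a word together with their glosses and the hypernym found two levels up, the relation exploited in Section 3.4.

```python
# Minimal sketch of the WordNet relations used later (NLTK interface assumed).
from nltk.corpus import wordnet as wn

word = "recovery"

# Only nouns and verbs carry the hypernym relation in WordNet.
for pos in (wn.NOUN, wn.VERB):
    for synset in wn.synsets(word, pos=pos):
        print(synset.name(), "-", synset.definition())  # the synset's gloss
        # Walk two levels up the hypernym hierarchy towards a more general concept.
        level1 = synset.hypernyms()
        level2 = level1[0].hypernyms() if level1 else []
        if level2:
            print("  2-level hypernym:", level2[0].name())
```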
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Selforganizing classification on the new Reuters corpus using WordNet </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The New Version of Reuters Corpus </SectionTitle> <Paragraph position="0"> We work with the new version of the Reuters corpus (Reuters 2000). This corpus consists of 984 Mbytes of newspaper articles in compressed format from issues of Reuters between 20 August 1996 and 19 August 1997. The total number of news articles is 806,791; they contain 9,822,391 paragraphs, 11,522,874 sentences and about 200 million word occurrences. Each document is stored in a standard XML format and is pre-classified with three different category codes: industry, region and topic codes. We are currently interested in the topic codes only. 126 topics are defined in this new corpus, but 23 of them contain no articles. All articles except 10,186 are classified under at least one topic.</Paragraph> <Paragraph position="1"> In our first experiments we concentrate on 8 major topics (Table 1). In order to compare the performance with and without the use of WordNet, and to examine the relation between headlines and full-text news articles, a series of experiments has been performed. First, we use the first 100,000 news headlines for training and another 100,000 news headlines for testing. The second experiment is exactly the same as the first one, but we use full-text articles instead of headlines. In the third experiment, we use 100,000 full-text news articles for training and their headlines for testing. The fourth experiment is the reverse of the third. An integration of the SOM and WordNet is presented in the last two experiments.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="11" type="metho"> <SectionTitle> Table 1 (fragment): 8 GDIP international relations </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="11" type="sub_section"> <SectionTitle> 3.2 Presenting Text Documents by Significance Vectors </SectionTitle> <Paragraph position="0"> We use the pre-assigned topics as the axes of a multi-dimensional space and apply significance vectors to represent the importance of words in each semantic category, following Wermter (2000).</Paragraph> <Paragraph position="1"> Significance vectors are defined by the frequency of a word in different topics. A word $w$ is represented by a significance vector with topic elements $(v(w,t_1), \ldots, v(w,t_m))$, where each element $v(w,t_j)$ represents the normalised frequency of the word in the $j$-th semantic category: $$v(w,t_j) = \frac{freq(w,t_j)}{\sum_{k=1}^{m} freq(w,t_k)}$$ Thus a document $x$ containing the words $w_1, \ldots, w_n$ is represented by the summation of the significance vectors of its words: $$x_j = \sum_{i=1}^{n} v(w_i,t_j), \quad j = 1, \ldots, m \qquad (1)$$ where $n$ is the number of words and $m$ is the number of topics. We refer to this summation of significance vectors as Method 1.</Paragraph> <Paragraph position="2"> Method 1 can be susceptible to the number of news documents observed in each topic. An alternative Method 2 of vector representation can alleviate such skewed distributions by normalising each word frequency by the overall frequency of the topic. Thus a document $x$ is represented by $$x_j = \sum_{i=1}^{n} \frac{freq(w_i,t_j)/freq(t_j)}{\sum_{k=1}^{m} freq(w_i,t_k)/freq(t_k)}, \quad j = 1, \ldots, m \qquad (2)$$ where $freq(t_j)$ is the total frequency observed for topic $t_j$ in the training set.</Paragraph> <Paragraph position="3"> Because only nouns and verbs have the hypernym relation in WordNet, and because nouns and verbs convey enough information about document concepts, we remove all words except the nouns and verbs found in WordNet in our experiments. We also benefit from a WordNet function, morphword, as a simple stemming tool. After the above pre-processing, our 100,000-article training set contains a total of 8,920,287 (381,871) word occurrences and 22,848 (10,185) distinct words in the full-text and headline experiments respectively. An example of these vector representation methods is shown in Table 2. Note that the representation of &quot;to&quot; is the zero vector, since it does not appear in the noun and verb collections of WordNet.</Paragraph> <Paragraph position="4"> Table 2. Examples of rounded significance vectors from the news headline experiment. Topic codes are numbered 1 to 8 (Table 1).</Paragraph> </Section>
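As an illustration of this representation, the sketch below computes Method 1 and Method 2 document vectors for a toy corpus. The topic codes, the word-frequency counts and the normalisation by topic size are illustrative assumptions, not the counts or the exact implementation of the Reuters experiments.

```python
from collections import Counter, defaultdict

# Toy training data: (list of words, list of pre-assigned topic codes).
# Topics and articles are purely illustrative.
TOPICS = ["CCAT", "ECAT", "GCAT", "MCAT"]
train = [
    (["bank", "profit", "rise"], ["ECAT", "MCAT"]),
    (["election", "vote"],       ["GCAT"]),
    (["bank", "merger"],         ["CCAT", "ECAT"]),
]

# freq[word][topic] = how often the word occurs in articles of that topic.
freq = defaultdict(Counter)
topic_size = Counter()          # total word occurrences observed per topic
for words, topics in train:
    for t in topics:
        topic_size[t] += len(words)
        for w in words:
            freq[w][t] += 1

def significance(word, normalise_by_topic=False):
    """Significance vector of a word over all topics.
    Method 1: raw frequencies; Method 2: frequencies divided by topic size."""
    raw = [freq[word][t] / topic_size[t] if normalise_by_topic else freq[word][t]
           for t in TOPICS]
    total = sum(raw)
    # Words unseen in training (or absent from WordNet) yield the zero vector.
    return [r / total if total else 0.0 for r in raw]

def doc_vector(words, normalise_by_topic=False):
    """Document vector = summation of the significance vectors of its words."""
    vecs = [significance(w, normalise_by_topic) for w in words]
    return [sum(col) for col in zip(*vecs)]

print(doc_vector(["bank", "profit"]))                           # Method 1
print(doc_vector(["bank", "profit"], normalise_by_topic=True))  # Method 2
```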
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.3 Classification and Presentation using SOM </SectionTitle> <Paragraph position="0"> Our work is based on the SOM algorithm (Vesanto et al. 1999). We give each news article a topic label, which is determined by the most significant topic weight in its input vector based on one of the above methods. The input vectors are then normalised. After the training process, each map unit is assigned a label according to the most frequent label among the articles mapped to it. For example, if 3 news articles of ECAT and 10 news articles of CCAT are mapped to unit 1, then unit 1 will be labelled CCAT. Therefore, all units present their preferred news article labels. We adopt a semi-supervised SOM concept and add an extra semantic vector, x_s, with a small number, 0.2, as its highest value to represent the desired class. In our case x_s has 8 elements, as has x. That is, the document vector d is represented as the concatenation d = [x, x_s], giving 16 elements in total. This approach makes the borders between SOM units more prominent and can also be used to verify the performance of the text classification. A SOM map with 225 output units, based on classifying these 16-element document vectors, is shown in Fig. 1. Other architectures (e.g. 25 x 25) have been tested and show similarly clear results.</Paragraph> </Section>
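A minimal sketch of how such 16-element semi-supervised document vectors might be built and how trained map units might be labelled by majority vote is given below. MiniSom is used here as a stand-in SOM implementation, and the topic codes and input data are illustrative assumptions; the paper itself relies on the SOM toolbox of Vesanto et al. (1999).

```python
# Semi-supervised document vectors and majority-vote unit labelling (sketch).
# MiniSom is a stand-in SOM library (an assumption); topic codes and data
# are illustrative, not the actual Reuters significance vectors.
import numpy as np
from collections import Counter, defaultdict
from minisom import MiniSom

TOPICS = ["CCAT", "C15", "ECAT", "E21", "GCAT", "GDIP", "GPOL", "MCAT"]  # 8 topics

def document_vector(x, label_index, label_value=0.2):
    """Concatenate the 8-element significance vector x with an 8-element
    semantic vector x_s whose highest value (0.2) marks the desired class,
    then normalise the resulting 16-element vector."""
    x_s = np.zeros(len(TOPICS))
    x_s[label_index] = label_value
    d = np.concatenate([x, x_s])
    return d / np.linalg.norm(d)

# Illustrative random data: (significance vector, topic index) per article.
rng = np.random.default_rng(0)
docs = [(rng.random(8), int(rng.integers(8))) for _ in range(200)]
data = np.array([document_vector(x, t) for x, t in docs])

som = MiniSom(15, 15, 16, sigma=1.0, learning_rate=0.5, random_seed=0)  # 225 units
som.train_random(data, 5000)

# Label each map unit with the most frequent topic among the articles mapped to it.
votes = defaultdict(Counter)
for d, (_, t) in zip(data, docs):
    votes[som.winner(d)][TOPICS[t]] += 1
unit_labels = {unit: c.most_common(1)[0][0] for unit, c in votes.items()}
print(unit_labels)
```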
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.4 Composing Semantic Clusters from WordNet </SectionTitle> <Paragraph position="0"> WordNet organises its database according to syntactic categories and semantic relations among synsets. In our work, we use the hypernym-hyponym relation. A hypernym of a term is a more general term, whereas a hyponym is more specific. For example, an apple is a kind of edible fruit, so edible fruit is a hypernym of apple and apple is a hyponym of edible fruit. We use the hypernym relation because the concept behind this relation is similar to the idea of news classification.</Paragraph> <Paragraph position="1"> The concept of a news category is more general than each distinct news article. News articles with a similar concept are grouped in the same class, and each group member, i.e. each distinct news article, still has its own specific meaning. We use a 2-level hypernym replacement, i.e. each word in a news article is replaced with the hypernym term two levels above it, in order to obtain a more general concept of the original word. Only nouns and verbs in WordNet have this hypernym relation. Polysemous and synonymous terms can be represented in several synsets, and each synset may lie in a different hypernym hierarchy. It is therefore difficult to decide the concept of a document that contains several ambiguous terms. Salton and Lesk give an example that offers a useful approach (Salton and Lesk 1971): the nouns base, bat, glove and hit each have several different senses, but taken together they clearly refer to the game of baseball. We use this idea and take advantage of the synsets' glosses, which explain the meaning of each concept. The correct concept of a term is then decided by comparing the similarity of each gloss with the semantic term-topic database of Reuters. For example, the first news article is pre-assigned to topic ECAT. The first term of its headline is recovery, which has 3 senses as a noun and 0 senses as a verb; thus there are 3 glosses for this word. We count the co-occurrences of terms that appear both in each gloss and in the pre-assigned term-topic database, and average the significance of these terms by dividing by the total number of terms in the gloss. The gloss with the highest significance then indicates the most probable sense. Finally, every term is replaced by its 2-level hypernym. This approach successfully reduces the total number of distinct words in the training set by 83.15% and 72.84% in the full-text and headline experiments respectively (Table 3). Furthermore, this approach also offers a simple way to extract a reasonably correct sense for an ambiguous word. We present our results in the following sections.</Paragraph> </Section>
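A hedged sketch of this gloss-based sense selection followed by 2-level hypernym replacement is shown below, again through NLTK's WordNet interface. The term-topic significance values are a made-up stand-in for the Reuters term-topic database, and the gloss scoring is a simplified reading of the procedure described above rather than the exact implementation.

```python
# Gloss-based sense selection and 2-level hypernym replacement (sketch).
# NLTK's WordNet interface is assumed; `significance` is an illustrative
# stand-in for the pre-assigned Reuters term-topic database.
from nltk.corpus import wordnet as wn

significance = {"economy": {"ECAT": 0.7}, "growth": {"ECAT": 0.5},
                "return":  {"ECAT": 0.4}, "gradual": {"ECAT": 0.2}}

def best_sense(word, topic):
    """Choose the synset whose gloss shares the most topic-significant terms,
    averaged over the number of terms in the gloss."""
    best, best_score = None, -1.0
    for synset in wn.synsets(word, pos=wn.NOUN) + wn.synsets(word, pos=wn.VERB):
        gloss_terms = synset.definition().lower().split()
        score = sum(significance.get(t, {}).get(topic, 0.0) for t in gloss_terms)
        score /= max(len(gloss_terms), 1)
        if score > best_score:
            best, best_score = synset, score
    return best

def two_level_hypernym(word, topic):
    """Replace a word by the hypernym two levels above its selected sense;
    fall back to the word itself if no sense or hypernym is available."""
    synset = best_sense(word, topic)
    for _ in range(2):
        if synset is None or not synset.hypernyms():
            break
        synset = synset.hypernyms()[0]
    return synset.lemmas()[0].name() if synset is not None else word

print(two_level_hypernym("recovery", "ECAT"))
```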
<Section position="4" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.5 Evaluation Method </SectionTitle> <Paragraph position="0"> The label shown on a trained SOM is a preference, and it is possible that several different labels are assigned to the same SOM unit. Every input vector that is mapped to a unit is therefore reassigned the unit label in place of its original label. In our example above, those 3 news articles lose their label of ECAT and receive the unit label CCAT. Kohonen et al. (2000) define the classification error as &quot;all documents that represented a minority newsgroup at any grid point were counted as classification errors.&quot; Our classification accuracy is very similar to Kohonen's, but we use the corpus itself to verify the performance. If the reassigned input vector label matches ONE of the original labels assigned by Reuters, it is counted as a correct mapping. The accuracy is calculated as the proportion of correct mappings among the input news articles. Some news articles have the label 0 because, after pre-processing, these articles are zero vectors.</Paragraph> </Section> <Section position="5" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.6 Results of Experiments </SectionTitle> <Paragraph position="0"> 3.6.1 Selforganizing classification based on news headlines and full-text. The first 100,000 news articles are used for training and the following 100,000 news articles are used for testing the generality. The SOM represents the original distribution of the source data, so it is important to describe the distribution of the data sets (Table 4). Because a news article can be classified under several topics, the distribution over the chosen topics is inevitably not even.</Paragraph> <Paragraph position="1"> We have four experiments in this subsection. In the first experiment, the first 100,000 news titles are used for training and 100,000 successive news titles are used for testing. The second experiment is the same as the first one, but full-text news articles are used instead of headlines only. In the third experiment, we use the SOM trained on full-text news to test the coherence of the news title sentences. The fourth experiment is the reverse of the third. The results are shown in Tables 5-8 respectively. First, we find that our significance vector representation methods achieve high accuracy. Second, even though full-text news articles contain more information than headlines, there is no big difference in accuracy for this text classification task. Third, a SOM trained on news headlines or on full-text news can be highly generalised. However, the former is more general than the latter. Although the new version of the Reuters news corpus is used in this work, this result is similar to the conclusion of Rodriguez et al. (1997), who use the old version of Reuters, and confirms that the topic headings in the Reuters corpus tend to consist of frequent words from the news document itself, which helps the task of news classification.</Paragraph> <Paragraph position="2"> 3.6.2 Selforganizing classification with and without the help of WordNet. Our results using the 2-level hypernym relation are significant for several reasons. First, we successfully reduce the total number of distinct words from 10,185 to 2,766 (22,848 to 3,851) in our training sets based on news headlines and full-text news respectively (Table 3). Second, with the use of WordNet, this hybrid neural technique successfully improves the accuracy of news classification without any loss. In the past there have been no consistent conclusions about the value of WordNet for information retrieval tasks (Mihalcea and Moldovan 2000). Experiments performed using different methodologies led to various, sometimes contradictory results (Voorhees 1998). This is probably because extracting the concept of a word is heavily dependent on other, unambiguous words. Text classification maps documents with similar concepts to a cluster with a more general concept.</Paragraph> <Paragraph position="3"> If a vector label matches ONE of the original labels assigned by Reuters, it is considered a correct mapping. Another test could be to consider each multi-topic combination a NEW topic, which adds many more classes and topics. In this case, we obtain 54.29% and 80.51% accuracy on 100,000 full-text news articles without and with the help of WordNet respectively, demonstrating the merit of using WordNet even more.</Paragraph> <Paragraph position="4"> We have demonstrated that it is suitable to use the hypernym relation from WordNet for text classification. We successfully used this relation and improved the text classification performance substantially. By merging statistical neural methods and semantic symbolic relations, our hybrid neural learning technique is robust in classifying real-world text documents and allows us to learn to classify above 98% of 100,000 documents to a correct topic.</Paragraph> </Section> </Section> </Paper>