<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2409"> <Title>A Comparison of Manual and Automatic Constructions of Category Hierarchy for Classifying Large Corpora</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text classification has an important role to play, especially with the recent explosion of readily available on-line documents. Much of the previous work on text classification uses statistical and machine learning techniques.</Paragraph> <Paragraph position="1"> However, the increasing number of documents and categories often hampers the development of practical classification systems, mainly because of statistical, computational, and representational problems (Dietterich, 2000). One strategy for solving these problems is to use category hierarchies. The idea behind this is that when humans organize extensive data sets into fine-grained categories, category hierarchies are often employed to make the large collection of categories more manageable.</Paragraph> <Paragraph position="2"> McCallum et al. presented a method called 'shrinkage' to improve parameter estimates by taking advantage of the hierarchy (McCallum, 1999). They tested their method using three different real-world datasets: 20,000 articles from UseNet, 6,440 web pages from the Industry Sector, and 14,831 pages from Yahoo, and showed improved performance. Dumais et al. also described a method for hierarchical classification of Web content, using 50,078 Web pages for training and 10,024 for testing, with promising results (Dumais and Chen, 2000). Both use manually constructed hierarchies. Constructing such hierarchies requires costly human intervention, since the number of categories and the size of the target corpora are usually very large.
Further, manually constructed hierarchies are kept very general in order to accommodate the large number of text sources that will become accessible, and they are sometimes constructed by relying on human intuition. Therefore, it is difficult to keep them consistent, which makes them problematic for classifying text automatically.</Paragraph> <Paragraph position="3"> In this paper, we address the problem of dealing with a large collection of data, and propose a method for generating a category hierarchy for text classification. Our method uses two well-known techniques, a partitioning clustering method called k-means and a loss function, to create a hierarchical structure. k-means partitions a set of given categories into k clusters, locally minimizing the average squared distance between the data points and the cluster centers. The algorithm iterates over the data that the system is permitted to classify during each iteration, and constructs a category hierarchy. To select an appropriate value of k during each iteration, we use a loss function which measures the degree of our disappointment in any differences between the true distribution over inputs and the learner's prediction. Another focus of this paper is whether or not a large collection of data, the 1996 Reuters corpus, helps to generate a category hierarchy that is used to classify documents.</Paragraph> <Paragraph position="4"> The rest of the paper is organized as follows. The next section presents a brief review of the earlier work. We then explain the basic framework for constructing a category hierarchy, and describe hierarchical classification. Finally, we report some experiments using the 1996 Reuters corpus with a discussion of evaluation.</Paragraph> </Section> </Paper>