
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2409">
  <Title>A Comparison of Manual and Automatic Constructions of Category Hierarchy for Classifying Large Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Automatically generating hierarchies is not a new goal for NLP and their application systems, and there have been several attempts to create various types of hierarchies(Koller and Sahami, 1997), (Nevill-Manning et al., 1999), (Sanderson and Croft, 1999). One attempt is Crouch(Crouch, 1988), which automatically generates thesauri. Cutting et al. proposed a method called Scatter/Gather in which clustering is used to create document hierarchies(Cutting et al., 1992). Lawrie et al. proposed a method to create domain specific hierarchies that can be used for browsing a document set and locating relevant documents(Lawrie and Croft, 2000).</Paragraph>
    <Paragraph position="1"> At about the same time, several researchers have investigated the use of automatically generating hierarchies for a particular application, text classification. Iwayama et al. presented a probabilistic clustering algorithm called Hierarchical Bayesian Clustering(HBC) to construct a set of clusters for text classification(Iwayama and Tokunaga, 1995). The searching platform they focused on is the probabilistic model of text categorisation that searches the most likely clusters to which an unseen document is classified. They tested their method using two data sets: Japanese dictionary data called 'Gendai yogo no kisotisiki' which contains 18,476 word entries, and a collection of English news stories from the Wall Street Journal which consists of 12,380 articles. The HBC model showed 2AO3% improvements in breakeven point over the non-hierarchical model.</Paragraph>
    <Paragraph position="2"> Weigend et al. proposed a method to generate hierarchies using a probabilistic approach(Weigend et al., 1999). They used an exploratory cluster analysis to create hierarchies, and this was then verified by human assignments. They used the Reuters-22173 and defined two-level categories: 5 top-level categories (agriculture, energy, foreign exchange, metals and miscellaneous category) called meta-topic, and other category groups assigned to its meta-topic. Their method is based on a probabilistic approach that frames the learning problem as one of function approximation for the posterior probability of the topic vector given the input vector. They used a neural net architecture and explored several input representations. Information from each level of the hierarchy is combined in a multiplicative fashion, so no hard decision have to be made except at the leaf nodes. They found a 5% advantage in average precision for the hierarchical representation when using words.</Paragraph>
    <Paragraph position="3"> All of these mentioned above perform well, while the collection they tested is small compared with many realistic applications. In this paper, we investigate that a large collection of data helps to generate a hierarchy, i.e. it is statistically significant better than the results which utilize hierarchical structure by hand, that has not previously been explored in the context of hierarchical classification except for the improvements of hierarchical model over the flat model.</Paragraph>
  </Section>
class="xml-element"></Paper>