<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2409">
  <Title>A Comparison of Manual and Automatic Constructions of Category Hierarchy for Classifying Large Corpora</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Generating Hierarchical Structure
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Document Representation
</SectionTitle>
      <Paragraph position="0"> To generate hierarchies, we need to address the question of how to represent texts(Cutting et al., 1992), (Lawrie and Croft, 2000). The total number of words we focus on is too large and it is computationally very expensive.</Paragraph>
      <Paragraph position="1"> We use two statistical techniques to reduce the number of inputs. The first is to use CRCPD8CTCVD3D6DD DACTCRD8D3D6 instead of CSD3CRD9D1CTD2D8 DACTCRD8D3D6. The number of input vectors is not the number of the training documents but equals to the number of different categories. This allows to make the large collection of data more manageable. The second is a well-known technique, i.e. mutual information measure between a word and a category. We use it as the value in each dimension of the vector(Cover and Thomas, 1991).</Paragraph>
      <Paragraph position="2"> More formally, each category in the training set is represented using a vector of weighted words. We call it CRCPD8CTCVD3D6DD DACTCRD8D3D6. Category vectors are used for representing as points in Euclidean space in CZ-means clustering algorithm. Let CR</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="1" type="metho">
    <SectionTitle>
CY
</SectionTitle>
    <Paragraph position="0"> be one of the categories CR</Paragraph>
    <Paragraph position="2"> and a vector assigned to CR  . We select the 1,000 words with the largest mutual information for each category.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Clustering
</SectionTitle>
      <Paragraph position="0"> Clustering has long been used to group data with many applications(Jain and Dubes, 1988). We use a simple clustering technique, CZ-means to group categories and construct a category hierarchy(Duda and Hart, 1973). CZ-means is based on iterative relocation that partitions a dataset into CZ clusters. The algorithm keeps track of the centroids, i.e. seed points, of the subsets, and proceeds in iterations. In each iteration, the following is performed: (i) for each point DC, find the seed point which is closest to DC. Associate DC with this seed point, (ii) re-estimate each seed point locations by taking the center of mass of points associated with it. Before the first iteration the seed points are initialized to random values. However, a bad choice of initial centers can have a great impact on performance, since CZ-means is fully deterministic, given the starting seed points. We note that by utilizing hierarchical structure, the classification problem can be decomposed into a set of smaller problems corresponding to hierarchical splits in the tree. This indicates that one first learns rough distinctions among classes at the top level, then lower level distinctions are learned only within the appropriate top level of the tree, and lead to more specialized classifiers. We thus selected the top CZ frequent categories as initial seed points. Figure 1 illustrates a sample hierarchy obtained by CZ-means. The input is a set of category vectors. Seed points assigned to each cluster are underlined in Figure 1.</Paragraph>
      <Paragraph position="2"> In general, the number of CZ is not given beforehand.</Paragraph>
      <Paragraph position="3"> We thus use a D0D3D7D7CUD9D2CRD8CXD3D2which is derived from Naive Bayes(NB) classifiers to evaluate the goodness of CZ.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.3 NB
</SectionTitle>
      <Paragraph position="0"> Naive Bayes(NB) probabilistic classifiers are commonly studied in machine learning(Mitchell, 1996). The basic idea in NB approaches is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document. The NB assumption is that all the words in a text are conditionally independent given the value of a classification variable. There are several versions of the NB classifiers. Recent studies on a Naive Bayes classifier which is proposed by McCallum et al. reported high performance over some other commonly used versions of NB on several data collections(McCallum, 1999). We use the model of NB by McCallum et al.</Paragraph>
      <Paragraph position="1"> which is shown in formula (2).</Paragraph>
      <Paragraph position="2">  and D4D6D3D4D3D6D8CXD3D2CPD0 CPD7D7CXCVD2D1CTD2D8 strategies(Lewis, 1992). We use probability threshold(PT) strategy where each document is assigned to the categories above a threshold AI  . The threshold AI can be set to control precision and recall. Increasing AI, results in fewer test items meeting the criterion, and this usually increases precision but decreases recall. Conversely, decreasing AI typically decreases precision but increases recall. In a flat non-hierarchical model, we chose AI for each category, so as to optimize performance on the F measure on a training samples and development test samples. In a manual and automatic construction of hierarchy, we chose AI at each level of a hierarchy using training samples and development test samples.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.4 Estimating Error Reduction
</SectionTitle>
      <Paragraph position="0"> Let C8B4DD CY DCB5 be an unknown conditional distribution over inputs, DC, and output classes, DD BECUDD</Paragraph>
      <Paragraph position="2"> CV, and let C8B4DCB5 be the marginal 'input' distribution. The learner is given a labeled training set BW, and estimates a classification function that, given an input DC, produces an esti-</Paragraph>
      <Paragraph position="4"> where C4 is some loss function that measures the degree of our disappointment in any differences between the true  We tested these three assignment strategies in the experiment, and obtained a better result with probability threshold than with other strategies.</Paragraph>
      <Paragraph position="5">  of the result obtained by CZ-means algorithm, and be a set of seed points(categories) labeled training samples. The learner aims to select the result of BW</Paragraph>
      <Paragraph position="7"> has lower error rate than any other sets.</Paragraph>
      <Paragraph position="8">  . We note that the true output distribution C8B4DD CY DCB5 in formula (6) is unknown for each sample DC.Roy et al.(Roy and McCallum, 2001) proposed a method of CPCRD8CXDACT D0CTCPD6D2CXD2CV that directly optimizes expected future error by log-loss, using the entropy of the posterior class distribution on a sample of the unlabeled examples. We applied their technique to estimate it using the current learner. More precisely, from the development training samples BW, a different training set is created. The learner then creates a new classifier from the set. This procedure is repeated D1 times, and the final class posterior for an instance is taken to be the average of the class posteriori for each of the classifiers.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.5 Generating Category Hierarchy
</SectionTitle>
      <Paragraph position="0"> The algorithm for generating category hierarchy is as follows: null  1. Create category vectors from the given training samples. null 2. Apply CZ-means up to D2-1 times (2 AK CZ AK D2), where D2 is the number of different categories. 3. Apply a loss function (6) to each result. 4. Select the CX-th result using formula (5), i.e. the result such that the learner trained on the CX-th set has lower error rate than any other results.</Paragraph>
      <Paragraph position="1"> 5. Assign every seed point(category) of the clusters in  the CX-th result to each node of the tree. For each cluster of sub-branches, eliminates the seed point, and the procedure 2 AO 5 is repeated, i.e. run a local CZ-means for each cluster of children, until the number of categories in each cluster is less than two.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Hierarchical Classification
</SectionTitle>
    <Paragraph position="0"> Like Dumais's approach(Dumais and Chen, 2000), we classify test data using the hierarchy. We select the 1,000 features with the largest mutual information for each category, and use them for testing. The selected features are used as input to the NB classifiers.</Paragraph>
    <Paragraph position="1"> We employ the hierarchy by learning separate classifiers at each internal node of the tree. Then using these classifiers, we assign categories to each test sample using probability threshold strategy where each sample is assigned to categories above a threshold AI. The process is repeated by greedily selecting sub-branches until it reaches a leaf.</Paragraph>
  </Section>
class="xml-element"></Paper>