<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2013">
  <Title>WordNet-based Text Document Clustering</Title>
  <Section position="3" start_page="2" end_page="2" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> This work is most closely related to the recently published research of Hotho et al. (2003b), and can be seen as a logical continuation of their experiments. While these authors have analysed the benefits of using WordNet synonyms and up to five levels of hypernyms for document clustering (using the bisecting k-means algorithm), this work describes the impact of tagging the documents with PoS tags and/or adding all hypernyms to the information available for each document.</Paragraph>
    <Paragraph position="1">  Here we use the vector space model, as described in the work of Salton et al. (1975), in which a document is represented as a vector or 'bag of words', i.e., by the words it contains and their frequency, regardless of their order.</Paragraph>
    <Paragraph position="2"> A number of fairly standard techniques have been used to preprocess the data. In addition, a combination of standard and custom software tools has been used to add PoS tags and WordNet categories to the data set. These will be described briefly below to allow for the experiments to be repeated. The first preprocessing step is to PoS tag the corpus. The PoS tagger relies on the text structure and morphological differences to determine the appropriate part-of-speech. For this reason, if it is required, PoS tagging is the first step to be carried out. After this, stopword removal is performed, followed by stemming. This order is chosen to reduce the number of words to be stemmed. The stemmed words are then looked up in WordNet and their corresponding synonyms and hypernyms are added to the bag-of-words. Once the document vectors are completed in this way, the frequency of each word across the corpus can be counted and every word occurring less often than the pre-specified threshold is pruned. Finally, after the pruning step, the term weights are converted to tfidf as described below.</Paragraph>
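The per-document part of the order described above can be sketched as follows. The tag set, stem table and hypernym lookup are toy stand-ins for Brill's tagger, the WordNet morphology function and the WordNet hierarchy; all names here are illustrative, not the authors' code:

```python
from collections import Counter

# Toy resources standing in for the real tagger/WordNet components.
KEEP_TAGS = {"NN", "VB", "JJ"}                       # nouns, verbs, adjectives
STEMS = {"clusters": "cluster", "clustered": "cluster"}
HYPERNYMS = {"cluster": ["group"]}

def preprocess(tagged_doc):
    """tagged_doc: list of (word, PoS-tag) pairs, i.e. already PoS-tagged."""
    # Stopword removal: keep only nouns, verbs and adjectives.
    tokens = [w.lower() for w, tag in tagged_doc if tag[:2] in KEEP_TAGS]
    # Stemming via lookup table.
    stems = [STEMS.get(t, t) for t in tokens]
    # WordNet step: add hypernyms to the bag-of-words.
    bag = Counter(stems)
    for s in stems:
        for h in HYPERNYMS.get(s, []):
            bag[h] += 1
    return bag
```

Pruning and tfidf weighting then operate on the bags of the whole corpus, as described below.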
    <Paragraph position="3"> Stemming, stopword removal and pruning all aim to improve clustering quality by removing noise, i.e. meaningless data. They all lead to a reduction in the number of dimensions in the term-space. Weighting is concerned with the estimation of the importance of individual terms.</Paragraph>
    <Paragraph position="4"> All of these have been used extensively and are considered the baseline for comparison in this work. However, the two techniques under investigation both add data to the representation. PoS tagging adds syntactic information and WordNet is used to add synonyms and hypernyms. The rest of this section discusses preprocessing, clustering and evaluation in more detail.</Paragraph>
    <Paragraph position="5"> PoS Tagging PoS tags are assigned to the corpus using Brill's PoS tagger. As PoS tagging requires the words to be in their original order, this is done before any other modifications to the corpora.</Paragraph>
    <Paragraph position="6"> Stopword Removal Stopwords, i.e. words thought not to convey any meaning, are removed from the text. The approach taken in this work does not compile a static list of stopwords, as is usually done. Instead, PoS information is exploited and all tokens that are not nouns, verbs or adjectives are removed.</Paragraph>
    <Paragraph position="7"> Stemming Words with the same meaning appear in various morphological forms. To capture their similarity they are normalised into a common root-form, the stem. The morphology function provided with WordNet is used for stemming, because it only yields stems that are contained in the WordNet dictionary.</Paragraph>
    <Paragraph position="8"> WordNet Categories WordNet, the lexical database developed by Miller et al., is used to include background information on each word.</Paragraph>
    <Paragraph position="9"> Depending on the experiment setup, words are replaced with their synset IDs, which represent their different possible senses, and different levels of hypernyms, i.e. more general terms for a word, are added.</Paragraph>
    <Paragraph position="10"> Pruning Words appearing with low frequency throughout the corpus are unlikely to appear in more than a handful of documents and would therefore, even if they contributed any discriminating power, be likely to cause distinctions too fine-grained to be useful, i.e. clusters containing only one or two documents.</Paragraph>
    <Paragraph position="11"> Therefore all words (or synset IDs) that appear less often than a pre-specified threshold are pruned.</Paragraph>
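A minimal sketch of this pruning step, assuming the bags of words are `Counter` objects; the function name and threshold semantics (keep terms whose corpus frequency reaches the threshold) are illustrative:

```python
from collections import Counter

def prune(doc_bags, threshold):
    """Drop every term whose corpus-wide frequency is below threshold.

    doc_bags: list of Counter, one bag of words (or synset IDs) per document.
    """
    corpus = Counter()
    for bag in doc_bags:
        corpus.update(bag)
    kept = {t for t, n in corpus.items() if n >= threshold}
    return [Counter({t: n for t, n in bag.items() if t in kept})
            for bag in doc_bags]
```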
    <Paragraph position="12"> Weighting Weights are assigned to give an indication of the importance of a word. The most trivial weight is the word frequency. However, more sophisticated methods can provide better results. Throughout this work, tfidf (term frequency × inverse document frequency), as described by Salton et al. (1975), is used.</Paragraph>
    <Paragraph position="13"> One problem with term frequency is that the lengths of the documents are not taken into account. The straightforward solution to this problem is to divide the term frequency by the total number of terms in the document, the document length. Effectively, this approach is equivalent to normalising each document vector to length one and is called relative term frequency. However, for this research a more sophisticated measure is used: the product of term frequency and inverse document frequency, tfidf.</Paragraph>
    <Paragraph position="14"> Salton et al. define the inverse document frequency idf of a term t as idf(t) = log(n / df(t)),</Paragraph>
    <Paragraph position="16"> where df(t) is the number of documents in which term t appears and n the total number of documents. Consequently, the tfidf measure is calculated as tfidf(t, d) = tf(t, d) × idf(t),</Paragraph>
    <Paragraph position="18"> simply the multiplication of tf and idf. This means that larger weights are assigned to terms that appear relatively rarely throughout the corpus, but very frequently in individual documents. Salton et al. (1975) measure a 14% improvement in recall and precision for tfidf in comparison to the standard term frequency tf.</Paragraph>
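The two formulas above can be combined into a short corpus-level weighting sketch (natural logarithm assumed; the function name is illustrative):

```python
import math
from collections import Counter

def tfidf(doc_bags):
    """Weight each bag of words by tf * log(n / df).

    doc_bags: list of Counter, one bag of words per document.
    """
    n = len(doc_bags)
    df = Counter()
    for bag in doc_bags:
        df.update(set(bag))          # each document counts once per term
    return [{t: tf * math.log(n / df[t]) for t, tf in bag.items()}
            for bag in doc_bags]
```

A term occurring in every document gets weight 0, while a term frequent in one document but absent elsewhere is weighted up, as the text describes.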
    <Paragraph position="19"> Clustering is done with the bisecting k-means algorithm as described by Steinbach et al. (2000). In their comparison of different algorithms they conclude that bisecting k-means is the current state of the art for document clustering. Bisecting k-means combines the strengths of partitional and hierarchical clustering methods by iteratively splitting the biggest cluster using the basic k-means algorithm. Basic k-means is a partitional clustering algorithm based on the vector space model. At the heart of such algorithms is a similarity measure. We choose the cosine distance, which measures the similarity of two documents by calculating the cosine of the angle between them.</Paragraph>
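The bisecting scheme can be sketched compactly; this is an illustrative toy version (fixed iteration count, random initial centroids), not Steinbach et al.'s implementation:

```python
import random

def normalise(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_means(vectors, iters=20, seed=0):
    # Basic k-means with k = 2 on unit-length vectors; on such vectors
    # cosine similarity reduces to the dot product.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, 2)
    halves = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for v in vectors:
            # Assign v to the centroid it is more similar to.
            halves[dot(v, centroids[1]) > dot(v, centroids[0])].append(v)
        if not halves[0] or not halves[1]:
            break
        centroids = [normalise([sum(xs) for xs in zip(*h)]) for h in halves]
    return halves

def bisecting_kmeans(vectors, k):
    # Iteratively split the biggest cluster with basic 2-means.
    clusters = [list(vectors)]
    while k > len(clusters):
        clusters.sort(key=len)
        biggest = clusters.pop()
        clusters.extend(two_means(biggest))
    return clusters
```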
    <Paragraph position="20"> The cosine distance is defined as follows: cos(d1, d2) = (d1 · d2) / (|d1| |d2|),</Paragraph>
    <Paragraph position="22"> where d1 · d2 is the dot-product of the two vectors. When the lengths of the vectors are normalised, the cosine distance is equivalent to the dot-product of the vectors, i.e. cos(d1, d2) = d1 · d2.</Paragraph>
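A small numeric check of this equivalence, on two arbitrary example vectors:

```python
def norm(v):
    return sum(x * x for x in v) ** 0.5

def cosine(a, b):
    # cos(a, b) = (a . b) / (|a| |b|)
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

d1 = [3.0, 4.0]
d2 = [4.0, 3.0]
# Normalise both vectors to length one.
u1 = [x / norm(d1) for x in d1]
u2 = [x / norm(d2) for x in d2]
# On unit vectors, the plain dot product equals the cosine.
dot_u = sum(x * y for x, y in zip(u1, u2))
```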
    <Paragraph position="24"> Evaluation Three different evaluation measures are used in this work, namely purity, entropy and overall similarity. Purity and entropy are both based on precision, P(C, L) = |C ∩ L| / |C|,</Paragraph>
    <Paragraph position="26"> where each cluster C from a clustering C of the set of documents D is compared with the manually assigned category labels L from the manual categorisation L, which requires a category-labeled corpus. Precision is the probability of a document in cluster C being labeled L.</Paragraph>
    <Paragraph position="27"> Purity is the percentage of correctly clustered documents and can be calculated as:</Paragraph>
    <Paragraph position="29"> purity = (1 / |D|) × the sum over all clusters C of |C| · max_L P(C, L), yielding values in the range between 0 and 1.</Paragraph>
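Since |C| · max_L P(C, L) is simply the count of the majority label in cluster C, purity reduces to counting majority-label documents. A minimal sketch with illustrative names:

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of category labels, one list per cluster."""
    total = sum(len(c) for c in clusters)
    # For each cluster, count the documents carrying its majority label.
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return correct / total
```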
    <Paragraph position="30"> The intra-cluster entropy (ice) of a cluster C, as described by Steinbach et al. (2000), considers the dispersion of documents in a cluster, and is defined as: ice(C) = − the sum over all labels L of P(C, L) log P(C, L).</Paragraph>
    <Paragraph position="32"> Based on the intra-cluster entropy of all clusters, the average, weighted by the cluster size, is calculated. This results in the following formula, which is based on the one used by Steinbach et al. (2000): entropy = the sum over all clusters C of (|C| / |D|) · ice(C).</Paragraph>
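The two definitions above can be sketched directly; the natural logarithm is assumed here (the base only rescales the values):

```python
import math
from collections import Counter

def ice(cluster):
    """Intra-cluster entropy of one cluster (list of category labels)."""
    n = len(cluster)
    return -sum((c / n) * math.log(c / n) for c in Counter(cluster).values())

def weighted_entropy(clusters):
    """Size-weighted average of the intra-cluster entropies."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * ice(c) for c in clusters)
```

A pure cluster has entropy 0; an evenly mixed two-label cluster has entropy log 2.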
    <Paragraph position="34"> Overall similarity is independent of pre-annotation. Instead the intra-cluster similarities are calculated, giving an idea of the cohesiveness of a cluster. This is the average similarity between each pair of documents in a cluster, including the similarity of a document with itself. Steinbach et al. (2000) show that this is equivalent to the squared length of the cluster centroid, i.e. |c|². Similarity is expressed as a percentage, therefore the possible values for overall similarity range from 0 to 1.</Paragraph>
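The stated equivalence can be checked numerically for unit-length document vectors; both functions below are illustrative:

```python
def overall_similarity(unit_vectors):
    """Average cosine similarity over all ordered pairs, self-pairs included.

    For unit vectors, cosine similarity is the dot product.
    """
    n = len(unit_vectors)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(dot(a, b) for a in unit_vectors for b in unit_vectors) / n**2

def centroid_sq_length(unit_vectors):
    """Squared length |c|^2 of the cluster centroid."""
    n = len(unit_vectors)
    centroid = [sum(xs) / n for xs in zip(*unit_vectors)]
    return sum(x * x for x in centroid)
```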
  </Section>
</Paper>