<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1710">
  <Title>Web Corpus Mining by instance of Wikipedia</Title>
  <Section position="5" start_page="71" end_page="72" type="evalu">
    <SectionTitle>
3.4 Results and discussion
</SectionTitle>
    <Paragraph position="0"> Tables 3 and 5 show the results for corpus variant I (using all HTML tags) and variant II (using only structure-relevant HTML tags; see section 3.2). The general picture is that hierarchical clustering performs significantly better than k-means. However, this holds only for an unrestricted number of clusters: if we restrict the number of clusters for hierarchical clustering to the number of categories, the differences become much less apparent (see tables 4 and 6). The only exception is the tree edit distance: the best F-measure of 0.863 is achieved using 58 clusters, and even when the number of clusters is restricted to 4, tree edit distance still reaches an F-measure of 0.710, which is significantly higher than the best k-means result of 0.599.</Paragraph>
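The restriction of hierarchical clustering to a fixed number of clusters can be sketched as follows: a minimal single-linkage agglomerative clusterer that merges until exactly k clusters remain. This is a hedged illustration on a hypothetical toy distance matrix, not the paper's document distances or its exact clustering implementation.

```python
# Minimal single-linkage agglomerative clustering, cut at k clusters
# (a sketch; the distance matrix below is a hypothetical toy example).

def single_linkage(dist, k):
    """dist: symmetric distance matrix (list of lists); returns a list
    of clusters (sets of item indices), merged until only k remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

dist = [[0, 1, 4, 5],
        [1, 0, 4, 5],
        [4, 4, 0, 1],
        [5, 5, 1, 0]]
print(single_linkage(dist, 2))  # two tight pairs: {0, 1} and {2, 3}
```

Cutting the merge loop at k = number of gold categories corresponds to the restricted setting discussed above; leaving it unrestricted amounts to choosing the cut that maximizes the evaluation measure.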
    <Paragraph position="1"> As one would intuitively expect, the results achieved by the tree edit distance are much better than those of the tree linearization variants: the edit distance operates on trees, whereas the other algorithms are bound to less informative sequences.</Paragraph>
    <Paragraph position="2"> Interestingly, the differences become much less apparent for corpus variant II, which consists of the simplified DOM trees (see section 3.2). We can assume that the advantage of the tree edit distance over the linearization-based approaches diminishes as the trees to be compared become smaller.</Paragraph>
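The linearization variants compared above can be sketched as traversals of a DOM tree. The `Node` class and the toy tree below are hypothetical illustrations; the names follow the paper's variants (pre-order, post-order, breadth-first), but the paper's exact tree-level scheme is not spelled out here, so only the plain traversals are shown.

```python
# Sketch of DOM-tree linearizations: each traversal flattens the tree
# into a tag sequence on which sequence distances can then be computed.
from collections import deque

class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def pre_order(n):
    # node before its subtrees
    return [n.tag] + [t for c in n.children for t in pre_order(c)]

def post_order(n):
    # subtrees before the node
    return [t for c in n.children for t in post_order(c)] + [n.tag]

def bfs(root):
    # breadth-first (level by level) linearization
    out, q = [], deque([root])
    while q:
        n = q.popleft()
        out.append(n.tag)
        q.extend(n.children)
    return out

# hypothetical toy DOM tree
dom = Node("html", [Node("head", [Node("title")]),
                    Node("body", [Node("p"), Node("p")])])
print(pre_order(dom))   # ['html', 'head', 'title', 'body', 'p', 'p']
print(post_order(dom))  # ['title', 'head', 'p', 'p', 'body', 'html']
print(bfs(dom))         # ['html', 'head', 'body', 'title', 'p', 'p']
```

The traversals discard part of the bracketing structure, which is exactly the information the tree edit distance retains; the smaller the tree, the less structure there is to lose, consistent with the observation above.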
    <Paragraph position="3"> The performance of the different tree linearization variants varies significantly only in the case of unrestricted hierarchical clustering (see tables 3 and 5). For k-means, as well as for hierarchical clustering restricted to exactly 4 clusters, the performances are about equal.</Paragraph>
    <Paragraph position="4"> In order to provide a baseline against which to rate the clustering results, we perform random clustering.</Paragraph>
    <Paragraph position="5"> This leads to an F-measure of 0.311 (averaged over 1,000 runs). For comparison, content-based categorization experiments using the bag-of-words model have reported F-measures of about 0.86 (Yang, 1999).</Paragraph>
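A random-clustering baseline of this kind can be sketched as below. The F-measure used is the common set-based cluster F-measure (best F over all clusters per gold class, weighted by class size); the paper does not spell out its exact variant, and the gold labels and cluster count here are hypothetical.

```python
# Sketch of a random-clustering baseline, averaged over many runs
# (assumed set-based cluster F-measure; toy gold labels, 4 clusters).
import random

def f_measure(labels, clusters):
    """For each gold class, take the best F score over all clusters,
    weighted by relative class size."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        members = {i for i, l in enumerate(labels) if l == c}
        best = 0.0
        for k in set(clusters):
            kmem = {i for i, x in enumerate(clusters) if x == k}
            tp = len(members & kmem)
            if tp == 0:
                continue
            p, r = tp / len(kmem), tp / len(members)
            best = max(best, 2 * p * r / (p + r))
        total += len(members) / n * best
    return total

random.seed(0)
labels = [i % 4 for i in range(100)]  # hypothetical gold categories
runs = [f_measure(labels, [random.randrange(4) for _ in labels])
        for _ in range(1000)]
baseline = sum(runs) / len(runs)  # well below a perfect score of 1.0
```

Averaging over many random assignments, as done for the 0.311 figure above, smooths out the variance of any single random partition.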
    <Paragraph position="6"> The baseline for the different variants of linearization is given by random linearizations: we perform 16 random linearizations, run the different variants of distance measurement as well as clustering, and compute the median of the best F-measure values achieved. These medians are 0.591 for corpus variant I and 0.581 for the simplified variant II. These results are in fact surprising, because they are only slightly worse than those of the other linearization techniques. This may indicate that, in the present scenario, the linearization-based approaches to tree distance measurement are unsuitable because of their loss of structural information. More specifically, this raises the following antithesis: either the sequence-oriented models of measuring structural similarity considered here are insensitive to the structuring of web documents, or this structuring matters only with respect to the degrees of nodes and their labels, irrespective of their order. As the tree-oriented method performs better, we view this as an argument against linearization-oriented methods, at least with respect to the present evaluation scenario, in which only DOM trees, and not more general graph structures, serve as input.

[Table excerpt (cf. tables 3-6): method | clustering | clusters | F-measure | further scores | measure | linkage
post-order linearization | hierarchical | 13 | 0.775 | 0.809 | 0.775 | spearman | single linkage
pre-order linearization | hierarchical | 19 | 0.741 | 0.817 | 0.706 | spearman | single linkage
tree level linearization | hierarchical | 36 | 0.702 | 0.882 | 0.603 | spearman | single linkage
bfs linearization | hierarchical | 13 | 0.696 | 0.698 | 0.786 | spearman | single linkage
tree edit distance | k-means | 4 | 0.599 | 0.618 | 0.641 | - | cosine distance
pre-order linearization | k-means | 4 | 0.595 | 0.615 | 0.649 | - | cosine distance
post-order linearization | k-means | 4 | 0.593 | 0.615 | 0.656 | - | cosine distance
tree level linearization | k-means | 4 | 0.593 | 0.603 | 0.649 | - | cosine distance
random lin. (medians only) | - | - | 0.591 | 0.563 | 0.795 | - |
bfs linearization | k-means | 4 | 0.580 | 0.595 | 0.656 | - | cosine distance
bfs linearization | hierarchical | 4 | 0.599 | 0.565 | 0.851 | none | weighted linkage
tree level linearization | hierarchical | 4 | 0.597 | 0.615 | 0.676 | spearman | complete linkage
post-order linearization | hierarchical | 4 | 0.595 | 0.615 | 0.683 | spearman | average linkage
pre-order linearization | hierarchical | 4 | 0.578 | 0.599 | 0.660 | cosine | average linkage]
</Paragraph>
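The first horn of the antithesis above, that sequence-oriented models can be insensitive to document structure, can be made concrete with a toy example (hypothetical trees, not from the paper): two structurally different trees whose pre-order linearizations coincide, so any distance computed on those sequences is necessarily zero.

```python
# Toy illustration: distinct trees, identical pre-order linearization.
class Node:
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, list(children)

def pre_order(n):
    return [n.tag] + [t for c in n.children for t in pre_order(c)]

# tree A: b and c are siblings under a
a1 = Node("a", [Node("b"), Node("c")])
# tree B: c is nested inside b
a2 = Node("a", [Node("b", [Node("c")])])

# different structure, same sequence -> sequence distance is zero
assert pre_order(a1) == pre_order(a2) == ["a", "b", "c"]
```

A tree edit distance would assign these two trees a nonzero distance, which is the information the linearization discards.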
    <Paragraph position="7"> The experiment has shown that analyzing the document structure provides a remarkable amount of information for categorization. It also shows that the sensitivity of the approaches used in different contexts needs to be explored further; we will address this in our future research.</Paragraph>
  </Section>
</Paper>