File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-1710_metho.xml

Size: 17,349 bytes

Last Modified: 2025-10-06 14:10:47

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1710">
  <Title>Web Corpus Mining by instance of Wikipedia</Title>
  <Section position="3" start_page="67" end_page="69" type="metho">
    <SectionTitle>
2 Hypertext Categorization
</SectionTitle>
    <Paragraph position="0"> The basic assumption behind present day approaches to hypertext categorization is as follows: Web units of similar function/content tend to have similar structures. The central problem is that these structures are not directly accessible by segmenting and categorizing single web pages. This is due to polymorphism and its reversal relation of discontinuous manifestation: Generally speaking, polymorphism occurs if the same (hyper)textual unit manifests several categories. This one-to-many relation of expression and content units is accompanied by a reversal relation according to which the same content or function unit is distributed over several expression units. This combines to a many-to-many relation between explicit, manifesting web structure and implicit, manifested functional or content-based structure.</Paragraph>
    <Paragraph position="1"> Our hypothesis is that if polymorphism is a prevalent characteristic of web units, web pages cannot serve as input of categorization since polymorphic pages simultaneously instantiate several categories. Moreover, these multiple categorizations are not simply resolved by segmenting the focal pages, since they possibly manifest categories only discontinuously so that their features  do not provide a sufficient discriminatory power.</Paragraph>
    <Paragraph position="2"> In other words: We expect polymorphism and discontinuous manifestation to be accompanied by many multiple categorizations without being reducible to the problem of disambiguating category assignments. In order to show this, we perform a categorization experiment according to the classical setting of function learning, using a corpus of the genre of conference websites. Since these websites serve recurrent functions (e.g. paper submission, registration etc.) they are expected to be structured homogeneously on the basis of stable, recurrent patterns. Thus, they can be seen as good candidates of categorization.</Paragraph>
    <Paragraph position="3"> The experiment is performed as follows: We apply support vector machine (SVM) classification which proves to be successful in case of sparse, high dimensional and noisy feature vectors (Joachims, 2002). SVM classification is performed with the help of the LibSVM (Hsu et al., 2003). We use a corpus of 1,078 English conference websites and 28,801 web pages. Hyper-text representation is done by means of a bagof-features approach using about 85,000 lexical and 200 HTML features. This representation was done with the help of the HyGraph system which explores websites and maps them onto hypertext graphs (Mehler and Gleim, 2005). Following (Hsu et al., 2003), we use a Radial Basis Function kernel and make optimal parameter selection based on a minimization of a 5-fold cross validation error. Further, we perform a binary categorization for each of the 16 categories based on 16 training sets of pos./neg. examples (see table 1). The size of the training set is 1,858 pages (284 sites); the size of the test set is 200 (82 sites). We perform 3 experiments:  1. Experiment A - one against all: First we ap- null ply a one against all strategy, that is, we use X \ Yi as the set of negative examples for learning category Ci where X is the set of all training examples and Yi is the set of positive examples of Ci. The results are listed in table (1). It shows the expected low level of effectivity: recall and precession perform very low on average. In three cases the classifiers fail completely. This result is confirmed when looking at column A of table (2): It shows the number of pages with up to 7 category assignments. In the majority of cases no category could be applied at all - only one-third Category rec. prec.</Paragraph>
    <Paragraph position="4">  genre applied in the experiment.</Paragraph>
    <Paragraph position="5"> of the pages was categorized.</Paragraph>
    <Paragraph position="6"> 2. Experiment B - lowering the discriminatory  power: In order to augment the number of categorizations, we lowered the categories' selectivity by restricting the number of negative examples per category to the number of the corresponding positive examples by sampling the negative examples according to the sizes of the training sets of the remaining categories. The results are shown in table (2): The number of zero categorizations is dramatically reduced, but at the same time the number of pages mapped onto more than one category increases dramatically. There are even more than 1,000 pages which are mapped onto more than 5 categories.</Paragraph>
    <Paragraph position="7"> 3. Experiment C - segment level categorization: Thirdly, we apply the classifiers trained on the monomorphic training pages on segments derived as follows: Pages are segmented into spans of at least 30 tokens reflecting segment borders according to the third level of the pages' document object model trees. Column C of table (2) shows that this scenario does not solve the problem of multiple categorizations since it falls back to the problem of zero categorizations. Thus, polymorphism is not resolved by simply segmenting pages, as other segmentations along the same line of constraints confirmed.</Paragraph>
    <Paragraph position="8"> There are competing interpretations of these results: The category set may be judged to be wrong. But it reflects the most differentiated set applied so far in this area. Next, the representation model  number of ca- A B C tegorizations page level page level segment level  may be judged to be wrong, but actually it is usually applied in text categorization. Third, the categorization method may be seen to be ineffective, but SVMs are known to be one of the most effective methods in this area. Further, the classifiers may be judged to be wrong - of course the training set could be enlarged, but already includes about 2,000 manually selected monomorphic training units. Finally, the focal units (i.e. web pages) may be judged to be unsystematically polymorph in the sense of manifesting several logical units. It is this interpretation which we believe to be supported by the experiment.</Paragraph>
    <Paragraph position="9"> If this interpretation is true, the structure of web documents comes into focus. This raises the question, what can be gained at all when exploring the visible structuring of documents as found on the web. That is, what is the information gain when categorizing documents solely based on their structures. In order to approach this question we perform an experiment in structure-oriented classification in the next section. As we need to control the negative impact of polymorphism, we concentrate on monomorphic pages which uniquely belong to single categories. This can be guaranteed with the help of Wikipedia articles which, with the exception of special disambiguation pages, only address one topic respectively. null</Paragraph>
  </Section>
  <Section position="4" start_page="69" end_page="71" type="metho">
    <SectionTitle>
3 Structure-Based Categorization
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
3.1 Motivation
</SectionTitle>
      <Paragraph position="0"> In this section we investigate how far a corpus of documents can be categorized by solely considering the explicit document structure without any textual content. It is obvious that we cannot expect the results to reach the performance of content based approaches. But if this approach allows to significantly distinguish between categories in contrast to a reference random decider we can conclude that the involvement of structure information may positively affect categorization performance. A positive evaluation can be seen to motivate an implementation of the Logical Document Structure (LDS) algorithm proposed by Mehler et al. (2005) who include graph similarity measuring as its kernel. We expect the same experiment to perform significantly better on the LDS instead of the explicit structures. However this experiment can only be seen as a first attempt. Further studies with larger corpora are required.</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="70" type="sub_section">
      <SectionTitle>
3.2 Experiment setup
</SectionTitle>
      <Paragraph position="0"> In our experiment, we chose a corpus of articles from the German Wikipedia addressing the fol- null With the exception of the first category most articles, being represented as a HTML web page, share a typical, though not deterministic visible structure. For example a Wikipedia article about a city contains an info box to the upper right which lists some general information like district, population and geographic location. Furthermore an article about a city contains three or more sections which address the history, politics, economics and possibly famous buildings or persons. Likewise there exist certain design guidelines by the Wikipedia project to write articles about countries and universities. However these guidelines are not always followed or they are adapted from one case to another. Therefore, a categorization cannot rely on definite markers in the content. Nevertheless, the expectation is that a human reader, once he has seen a few samples of each category, can with high probability guess the category of an article by simple looking at the layout or visible structure and ignoring the written content. Since the layout (esp.</Paragraph>
      <Paragraph position="1"> the structure) of a web page is encoded in HTML we consider the structure of their DOM1-trees for our categorization experiment. If two articles of the same category share a common visible structure, this should lead to a significant similarity of  the DOM-trees. The articles of category 'American Presidents' form an exception to this principle up to now because they do not have a typical structure. The articles about the first presidents are relatively short whereas the articles about the recent presidents are much more structured and complex.</Paragraph>
      <Paragraph position="2"> We include this category to test how well a structure based categorizer performs on such diverse structurations. We examine two corpus variants: I. All HTML-Tags of a DOM-tree are used for similarity measurement.</Paragraph>
      <Paragraph position="3"> II. Only those HTML-tags of a DOM-tree are used which have an impact on the visible structure (i.e. inline tags like font or i are ignored). null Both cases, I and II, do not include any text nodes. That is, all lexical content is ignored. By distinguishing these two variants we can examine what impact these different degrees of expressiveness have on the categorization performance.</Paragraph>
    </Section>
    <Section position="3" start_page="70" end_page="71" type="sub_section">
      <SectionTitle>
3.3 Distance measurement and clustering
</SectionTitle>
      <Paragraph position="0"> The next step of the experiment is marked by a pairwise similarity measurement of the wikipedia articles which are represented by their DOM-trees according to the two variants described in section 3.2. This allows to create a distance matrix which represents the (symmetric) distances of a given article to any other. In a subsequent and final step the distance matrix will be clustered and the results analyzed.</Paragraph>
      <Paragraph position="1"> How to measure the similarity of two DOMtrees? This raises the question what exactly the subject of the measurement is and how it can be adequately modeled. Since the DOM is a tree and the order of the HTML-tags matters, we choose ordered trees. Furthermore we want to represent what tag a node represents. This leads to ordered labeled trees for representation. Since trees are a common structure in various areas such as image analysis, compiler optimization and bio informatics (i.e. RNA analysis) there is a high interest in methods to measure the similarity between trees (Tai, 1979; Zhang and Shasha, 1989; Klein, 1998; Chen, 2001; H&amp;quot;ochsmann et al., 2003). One of the first approaches with a reasonable computational complexity was introduced by Tai (1979) who extended the problem of sequence edit distance to trees.</Paragraph>
      <Paragraph position="2">  The following description of tree edit distances is due to Bille (2003): The principle to compute the edit distance between two trees T1, T2 is to successively perform elementary edit operations on the former tree to turn it into the formation of the latter. The edit operations on a given tree T are as follows: Relabel changes the label of a node v [?] T. Delete deletes a non-root node v [?] T with a parent node w [?] T. Since v is being deleted, its child nodes (if any) are inserted as children of node w. Finally the Insert operation marks the complement of delete. Next, an edit script S is a list of consecutive edit operations which turn T1 into T2. Given a cost function for each edit operation the cost of S is the sum of its elementary operation costs. The optimal edit script (there is possibly more than one) between T1 and T2 is given by the edit script of minimum cost which equals the tree edit distance.</Paragraph>
      <Paragraph position="3"> There are various algorithms known to compute the edit distance (Tai, 1979; Zhang and Shasha, 1989; Klein, 1998; Chen, 2001). They vary in computational complexity and whether they can be used for general purpose or under special restrictions only (which allows for better optimization). In this experiment we use the general-purpose algorithm of Zhang and Shasha (1989) which shows a complexity of O(|T1||T2|min(L1,D1)min(L2,D2)) where |Ti|, Li, Di denote the number of nodes, the number of leafs and the depth of the trees respectively. The approach of tree edit distance forms a good balance between accurate distance measurement of trees and computational complexity. However, especially for large corpora it might be useful to examine how well other (i.e. faster) methods  perform. We therefore consider another class of algorithms for distance measurement which are based on sequence alignments via dynamic programming. Since this approach is restricted to the comparison of sequences, a suitable linearization of the DOM trees has to be found. For this task we use several strategies of tree node traversal: Pre-Order, Post-Order and Breath-First-Search (BFS) traversal. Figure (1) shows a linearization of two sample trees using Post-Order and how the resulting sequences STi may have been aligned for the best alignment distance. We have enhanced the labels of the linearized nodes by adding the inand out degrees corresponding to the former position of the nodes in the tree. This information can be used during the computation of the alignment cost: An example of this is that the alignment of two nodes with identical HTML-tags but different in/out degrees will result in a higher cost than in cases where these degrees match. Following this strategy, at least part of the structure information is preserved. This approach is followed by Dehmer (2005) who develops a special form of tree linearization which is based on tree levels.</Paragraph>
      <Paragraph position="4"> Obviously, a linearization poses a loss of structure information which has impact on the results of distance measurement. But the computational complexity of sequence alignments (O(n2)) is significantly better than of tree edit distances. This leads to a trade-off between the expressiveness of the DOM-Tree representation (in our case tree vs.</Paragraph>
      <Paragraph position="5"> linearization to a sequence) and the complexity of the algorithms to compute the distance thereon.</Paragraph>
      <Paragraph position="6"> In order to have a baseline for tree linearization techniques we have also tested random linearizations. According to this method, trees are transformed into sequences of nodes in random order.</Paragraph>
      <Paragraph position="7"> For our experiment we have generated 16 random linearizations and computed the median of their categorization performances.</Paragraph>
      <Paragraph position="8"> Next, we perform pairwise distance measurements of the DOM-trees using the set of algorithms described above. We then apply two clustering methods on the resulting distance matrices: hierarchical agglomerative and k-means clustering. Hierarchical agglomerative clustering does not need any information on the expected number of clusters so we examine all possible clusterings and chose the one maximizing the F-measure.</Paragraph>
      <Paragraph position="9"> However we also examine how well hierarchical clustering performs if the number of partitions is restricted to the number of categories. In contrast to the previous approach, k-means needs to be informed about the number of clusters in advance, which in the present experiment equals the number of categories, which in our case is four. Because we know the category of each article we can perform an exhaustive parameter study to maximize the well known efficiency measures purity, inverse purity and the combined F-measure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>