File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-1710_intro.xml
Size: 6,733 bytes
Last Modified: 2025-10-06 14:04:03
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1710"> <Title>Web Corpus Mining by instance of Wikipedia</Title> <Section position="2" start_page="0" end_page="67" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In order to reliably judge the collocative affinity of linguistic items, it has to be considered that judgements of this kind depend on the scope of certain genres or registers. According to Stubbs (2001), words may have different collocates in different text types or genres and therefore may signal one of those genres when being observed. Consequently, corpus analysis requires, among other things, a comparison of occurrences in a given text with typical occurrences in other texts of the same genre (Stubbs, 2001, p. 120).</Paragraph> <Paragraph position="1"> This raises the question of how to judge the membership of texts, in which occurrences of linguistic items are observed, in the genres involved. Evidently, because of the size of the corpora involved, this question can only be answered adequately by means of automatic classification. This holds all the more for web corpus linguistics (Kilgarriff and Grefenstette, 2003; Baroni and Bernardini, 2006) where large corpora of web pages, whose membership in webgenres is presently unknown, have to be analyzed. Consequently, web corpus linguistics faces two related tasks: 1. Exploration: The task of initially exploring which webgenres actually exist.</Paragraph> <Paragraph position="2"> 2. Categorization: The task of categorizing hypertextual units according to their membership in the genres explored in the exploration step.</Paragraph> <Paragraph position="3"> In summary, web corpus linguistics is in need of webgenre-sensitive corpora, that is, of corpora in which the webgenre membership of each incorporated textual unit is annotated.
This in turn presupposes that these webgenres are first of all explored.</Paragraph> <Paragraph position="4"> Currently, two major classes of approaches can be distinguished: On the one hand, we find approaches to the categorization of macro structures (Amitay et al., 2003) such as web hierarchies, directories and corporate sites. On the other hand, we find approaches to the categorization of micro structures such as single web pages (Kleinberg, 1999) or even page segments (Mizuuchi and Tajima, 1999). The basic idea of all these approaches is to perform categorization as a kind of function learning which maps web units above, on or below the level of single pages onto at most one predefined category (e.g. genre label) per unit (Chakrabarti et al., 1998). Thus, these approaches focus on the categorization task while disregarding the exploration task. More specifically, the majority of these approaches utilize text categorization methods in conjunction with HTML markup, metatags and link structure beyond bag-of-words representations of the pages' wording as input of feature selection (Yang et al., 2002), in some cases also of linked pages (Fürnkranz, 1999). What these approaches are missing is a more general account of web document structure as a source of genre-oriented categorization. That is, they solely map web units onto feature vectors while disregarding their structure. This includes linkage beyond pairwise linking as well as document-internal structures according to the Document Object Model (DOM). A central pitfall of this approach is that it disregards the impact of genre membership on document structure and, thus, the signalling of the former by the latter (Ventola, 1987). Therefore, a structure-sensitive approach is needed in the area of corpus linguistics which allows for automatic webgenre tagging.
That is, an approach which takes both levels of structuring of web documents into account: the level of their hyperlink-based linkage and the level of their internal structure. In this paper we present an algorithm as a preliminary step for tackling the exploration and categorization tasks together. More specifically, we present an approach to unsupervised structure learning which uses tree alignment algorithms as similarity kernels and cluster analysis for class detection. The paper includes a comparative study of several approaches to tree alignment as a source of similarity measuring of web documents. Its central topics are: * To what extent is it possible to predict the membership of a web document in a certain genre (or register) solely on grounds of its structure when its lexical content and other content-bearing units are completely deleted? In other words, we ask to what extent structure signals membership in genre.</Paragraph> <Paragraph position="5"> * A more methodological question regards the choice of appropriate measures of structural similarity to be included in structure learning. In this context, we comparatively study several variants of measuring similarities of trees, that is, tree edit distance as well as a class of algorithms which are based on tree linearizations as input to sequence alignment.</Paragraph> <Paragraph position="6"> Our overall findings hint at two critical points: First, there is a significant contribution of structure-oriented methods to webgenre categorization which is unexplored in predominant approaches. Second, and most surprisingly, all methods analyzed compete closely with a method based on random linearization of input documents.</Paragraph> <Paragraph position="7"> Why is this research important for web corpus linguistics?
An answer to this question can be outlined as follows: * We explore a further resource for reliably tagging web genres and registers, respectively, in the form of document structure.</Paragraph> <Paragraph position="8"> * We further develop the notion of webgenre and thus help to make document structure accessible to collocation and other corpus linguistic analyses.</Paragraph> <Paragraph position="9"> In order to support this argumentation, we first present a structure-insensitive approach to web categorization in section (2). It shows that this insensitivity systematically leads to multiple categorizations which cannot be traced back to ambiguity of category assignment. In order to solve this problem, an alternative approach to structure learning is presented in sections (3.1), (3.2) and (3.3). This approach is evaluated in section (3.4) on grounds of a corpus of Wikipedia articles. The reason for utilizing this test corpus is that the content-based categories to which the explored web documents belong are known, so that we can apply the classical apparatus of evaluation of web mining. The final section concludes and outlines future work.</Paragraph> </Section> </Paper>