File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/j00-4001_abstr.xml
Size: 6,279 bytes
Last Modified: 2025-10-06 13:41:41
<?xml version="1.0" standalone="yes"?> <Paper uid="J00-4001"> <Title>Automatic Text Categorization in Terms of Genre and Author</Title> <Section position="2" start_page="0" end_page="472" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The rapid expansion of the World Wide Web (WWW) in recent years has resulted in the creation of large volumes of text in electronic form. NLP applications such as information retrieval and information extraction have been developed to treat this information automatically. Since the Internet is a very heterogeneous domain, these applications usually involve text categorization tasks with the following desiderata: (c) 2001 Association for Computational Linguistics Computational Linguistics Volume 26, Number 4 The two main factors that characterize a text are its content and its style, both of which can be used for categorization purposes. Nevertheless, the literature on computational stylistics is very limited in comparison to the work dealing with the propositional content of the text. This is due to the lack of a formal definition of style as well as to the inability of current NLP systems to incorporate stylistic theories that require complicated information. In contrast to traditional stylistics based on formal linguistic theories, the use of statistical methods in style processing has proved to be a reliable approach (Biber 1995). According to the stylostatisticians, a given style is defined as a set of measurable patterns, called style markers. We adopt this definition in this study. Typical classificatory tasks in computational stylistics are the following: * Text genre detection concerns the identification of the kind (or functional style) of the text (Karlgren and Cutting 1994; Michos et al. 
1996; Kessler, Nunberg, and Schütze 1997).</Paragraph> <Paragraph position="1"> Extraction of style markers: A set of quantifiable measures is defined, and a text-processing tool is usually developed to count them automatically.</Paragraph> <Paragraph position="2"> Classification procedure: A disambiguation method (e.g., statistical or connectionist) is applied to classify the text in question into a predefined category (i.e., a text genre or an author).</Paragraph> <Paragraph position="3"> The most important computational approaches to text genre detection have focused on the use of simple measures that can be easily detected and reliably counted by a computational tool (Kessler, Nunberg, and Schütze 1997). To this end, various sets of style markers have been proposed (Karlgren and Cutting 1994), all of which are, in essence, subsets of the set used by Biber (1995), who ranked registers along seven dimensions by applying factor analysis to a set of lexical and syntactic style markers that had been manually counted. In general, the current text genre detection approaches try to avoid using existing text processing tools rather than taking advantage of them. Authorship attribution studies have focused on the establishment of the authorship of anonymous or doubtful literary texts, such as the Federalist Papers, 12 of which are of disputed authorship (Mosteller and Wallace 1984; Holmes and Forsyth 1995). Typical methodologies deal with a limited number of candidate authors using long text samples of several thousand words. Almost all the approaches to this task are based mainly on distributional lexical style markers. In a review paper of authorship attribution studies, Holmes (1994) claims: &quot;yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items&quot; (p. 
87).</Paragraph> <Paragraph position="4"> To the best of our knowledge, there is still no computational system that can distinguish the texts of a randomly chosen group of authors without requiring human assistance in the selection of both the most appropriate set of style markers and the most accurate disambiguation procedure.</Paragraph> <Paragraph position="5"> Stamatatos, Fakotakis, and Kokkinakis Text Categorization In this paper, we describe an approach to text categorization in terms of genre and author based on a new stylometric method that utilizes already existing NLP tools. In addition to the style markers relevant to the actual output of the NLP tool (i.e., the analyzed text), we introduce analysis-level style markers, which represent the way in which the text has been analyzed by that tool. Such measures contain useful stylistic information and are easily available without additional computational cost.</Paragraph> <Paragraph position="6"> To illustrate, we apply the proposed technique to text categorization tasks for Modern Greek corpora using an already existing sentence and chunk boundaries detector (SCBD) for unrestricted Modern Greek text (Stamatatos, Fakotakis, and Kokkinakis 2000). We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method outperforms the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. Our approach is trainable and can be easily adapted to any set of stylistically homogeneous categories.</Paragraph> <Paragraph position="7"> We begin by discussing work relevant to text genre detection and authorship attribution, focusing on the various types of style markers employed (Section 2). 
Next, we describe the proposed solution for extracting style markers using already existing NLP tools (Section 3) and apply our method to Modern Greek (Section 4), briefly describing the SCBD and proposing our set of style markers. The techniques used for automatic categorization of the stylistic vectors are discussed in Section 5. Section 6 deals with the application of our approach to text genre detection, and Section 7, with authorship attribution, for both author identification and author verification. In Sections 8 and 9, we discuss important performance issues of the proposed methodology and the conclusions that can be drawn from this study.</Paragraph> </Section> </Paper>