File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-1068_intro.xml
Size: 3,756 bytes
Last Modified: 2025-10-06 14:03:36
<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1068">
<Title>A Study on Automatically Extracted Keywords in Text Categorization</Title>
<Section position="3" start_page="0" end_page="537" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Automatic text categorization is the task of assigning any of a set of predefined categories to a document. The prevailing approach is supervised machine learning, in which an algorithm is trained on documents with known categories. Before any learning can take place, the documents must be represented in a form that is understandable to the learning algorithm. A trained prediction model is subsequently applied to previously unseen documents to assign the categories. Performing a text categorization task thus involves two major decisions: how to represent the text, and which learning algorithm to use to create the prediction model. The decision about the representation is in turn divided into two subquestions: which features to select as input, and which type of value to assign to these features.</Paragraph>
<Paragraph position="1"> In most studies, the best-performing representation consists of the full-length text, keeping the tokens in the document separate, that is, as unigrams. In recent years, however, a number of experiments have evaluated richer representations. For example, Caropreso et al. (2001) compare unigrams and bigrams; Moschitti et al. (2004) add complex nominals to their bag-of-words representation; and Kołcz et al. (2001) and Mihalcea and Hassan (2005) present experiments in which automatically extracted sentences constitute the input to the representation. Of these three examples, only the sentence extraction seems to have had any positive impact on the performance of automatic text categorization.</Paragraph>
<Paragraph position="2"> In this paper, we present experiments in which automatically extracted keywords are used as input to the learning, both on their own and in combination with a full-text representation.</Paragraph>
<Paragraph position="3"> That the keywords are extracted means that the selected terms are present verbatim in the document.</Paragraph>
<Paragraph position="4"> A keyword may consist of one or several tokens.</Paragraph>
<Paragraph position="5"> In addition, a keyword may well be a whole expression or phrase, such as snakes and ladders.</Paragraph>
<Paragraph position="6"> The main goal of the study presented in this paper is to investigate whether automatically extracted keywords can improve automatic text categorization.</Paragraph>
<Paragraph position="7"> We investigate what impact keywords have on the task by predicting text categories on the basis of keywords only, and by combining full-text representations with automatically extracted keywords.</Paragraph>
<Paragraph position="8"> We also experiment with different ways of representing keywords, either as unigrams or intact.</Paragraph>
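The difference between the two keyword representations can be illustrated with a small Python sketch; the keyword list and the boolean feature values below are assumptions made for illustration only, not the paper's actual data or feature weighting:

# Illustrative keywords for one document; a keyword may span several tokens,
# such as the phrase "snakes and ladders" mentioned above.
keywords = ["snakes and ladders", "board game", "dice"]

# (a) Intact representation: each extracted keyword, even a multi-token
#     phrase, becomes a single feature (a boolean presence value is assumed here).
intact_features = {kw: True for kw in keywords}

# (b) Unigram representation: the keywords are split into tokens, and each
#     token becomes a feature of its own.
unigram_features = {token: True for kw in keywords for token in kw.split()}

print(sorted(intact_features))   # ['board game', 'dice', 'snakes and ladders']
print(sorted(unigram_features))  # ['and', 'board', 'dice', 'game', 'ladders', 'snakes']

Either feature set can then be used on its own or merged with a full-text bag-of-words representation, which is the combination examined in the experiments.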
<Paragraph position="9"> In addition, we investigate the effect of using the headlines -- represented as unigrams -- as input, to compare their performance to that of the keywords. The outline of the paper is as follows: in Section 2, we present the algorithm used to automatically extract the keywords. In Section 3, we present the corpus, the learning algorithm, and the experimental setup for the text categorization experiments. In Section 4, the results are described.</Paragraph>
<Paragraph position="10"> An overview of related studies is given in Section 5, and Section 6 concludes the paper.</Paragraph>
</Section>
</Paper>