File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1068_metho.xml
Size: 11,936 bytes
Last Modified: 2025-10-06 14:10:19
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1068"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Study on Automatically Extracted Keywords in Text Categorization</Title> <Section position="4" start_page="537" end_page="537" type="metho"> <SectionTitle> 2 Selecting the Keywords </SectionTitle> <Paragraph position="0"> This section describes the method that was used to extract the keywords for the text categorization experiments discussed in this paper. One reason why this method, developed by Hulth (2003; 2004), was chosen is because it is tuned for short texts (more specifically for scientific journal abstracts).</Paragraph> <Paragraph position="1"> It was thus suitable for the corpus used in the described text categorization experiments.</Paragraph> <Paragraph position="2"> The approach taken to the automatic keyword extraction is that of supervised machine learning, and the prediction models were trained on manually annotated data. No new training was done on the text categorization documents, but models trained on other data were used. As a first step to extract keywords from a document, candidate terms are selected from the document in three different manners. One term selection approach is statistically oriented. This approach extracts all uni-, bi-, and trigrams from a document. The two other approaches are of a more linguistic character, utilizing the words' parts-of-speech (PoS), that is, the word class assigned to a word. One approach extracts all noun phrase (NP) chunks, and the other all terms matching any of a set of empirically defined PoS patterns (frequently occurring patterns of manual keywords). All candidate terms are stemmed.</Paragraph> <Paragraph position="3"> Four features are calculated for each candidate term: term frequency; inverse document frequency; relative position of the first occurrence; and the PoS tag or tags assigned to the candidate term. To make the final selection of keywords, the three predictions models are combined. Terms that are subsumed by another keyword selected for the document are removed. For each selected stem, the most frequently occurring unstemmed form in the document is presented as a keyword.</Paragraph> <Paragraph position="4"> Each document is assigned at the most twelve keywords, provided that the added regression value Assign. Corr.</Paragraph> <Paragraph position="5"> mean mean P R F words in mean per document; the number of correct (Corr.) keywords in mean per document; precision (P); recall (R); and F-measure (F), when 312 keywords are extracted per document.</Paragraph> <Paragraph position="6"> (given by the prediction models) is higher than an empirically defined threshold value. To avoid that a document gets no keywords, at least three key-words are assigned although the added regression value is below the threshold (provided that there are at least three candidate terms).</Paragraph> <Paragraph position="7"> In Hulth (2004) an evaluation on 500 abstracts in English is presented. For the evaluation, key-words assigned to the test documents by professional indexers are used as a gold standard, that is, the manual keywords are treated as the one and only truth. 
<Section position="5" start_page="537" end_page="539" type="metho"> <SectionTitle> 3 Text Categorization Experiments </SectionTitle> <Paragraph position="0"> This section describes in detail the four experimental settings for the text categorization experiments.</Paragraph> <Section position="1" start_page="537" end_page="538" type="sub_section"> <SectionTitle> 3.1 Corpus </SectionTitle> <Paragraph position="0"> For the text categorization experiments we used the Reuters-21578 corpus, which contains 20 000 newswire articles in English with multiple categories (Lewis, 1997). More specifically, we used the ModApte split, containing 9 603 documents for training and 3 299 documents in the fixed test set, and the 90 categories that are present in both the training and test sets.</Paragraph> <Paragraph position="1"> As a first pre-processing step, we extracted the texts contained in the TITLE and BODY tags. The pre-processed documents were then given as input to the keyword extraction algorithm. In Table 2, the number of keywords assigned to the documents in the training set and the test set is displayed. As can be seen in this table, three is the number of keywords that is most often extracted.</Paragraph> <Paragraph position="2"> In the training data set, 9 549 documents are assigned keywords, while 54 are empty, as they have no text in the TITLE or BODY tags. Of the 3 299 documents in the test set, 3 285 are assigned keywords, and the remaining fourteen are those that are empty. The empty documents are included in the result calculations for the fixed test set, in order to enable comparisons with other experiments.</Paragraph> <Paragraph position="3"> The mean number of keywords extracted per document is 6.4 in the training set and 6.1 in the test set (not counting the empty documents).</Paragraph> <Paragraph position="4"> [Table 2: The number of keywords per document in the training set and the test set, respectively.]</Paragraph> </Section> <Section position="2" start_page="538" end_page="538" type="sub_section"> <SectionTitle> 3.2 Learning Method </SectionTitle> <Paragraph position="0"> The focus of the experiments described in this paper was the text representation. For this reason, we used only one learning algorithm, namely an implementation of Linear Support Vector Machines (Joachims, 1999). This is the learning method that has obtained the best results in text categorization experiments (Dumais et al., 1998; Yang and Liu, 1999).</Paragraph> </Section> <Section position="3" start_page="538" end_page="539" type="sub_section"> <SectionTitle> 3.3 Representations </SectionTitle> <Paragraph position="0"> This section describes in detail the input representations that we experimented with. An important step for the feature selection is the dimensionality reduction, that is, reducing the number of features. This can be done by removing words that are rare (occurring in too few documents or with too low a term frequency) or very common (by applying a stop-word list). Also, terms may be stemmed, meaning that they are merged into a common form. In addition, any of a number of feature selection metrics may be applied to further reduce the space, for example chi-square or information gain (see, for example, Forman (2003) for a survey).</Paragraph>
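As a rough illustration of the preprocessing and learning choices described so far (an occurrence threshold, a stop-word list, and one linear SVM per category), here is a sketch using scikit-learn on toy data. The original experiments used the SVM implementation of Joachims (1999), so the library, the toy texts, and the parameter values below are assumptions made only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the extracted TITLE+BODY texts and their categories.
train_texts = ["grain wheat exports rise sharply",
               "crude oil prices fall on weak demand",
               "wheat crop hit by severe drought"]
train_labels = [["grain", "wheat"], ["crude"], ["wheat"]]

binarizer = MultiLabelBinarizer()
y_train = binarizer.fit_transform(train_labels)  # one binary column per category

# min_df plays the role of the occurrence threshold, stop_words removes very
# common terms, and the vectorizer produces tf*idf feature values.
model = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english"),
    OneVsRestClassifier(LinearSVC()),  # one linear SVM per category
)
model.fit(train_texts, y_train)
predicted = model.predict(["crude oil markets steady"])
print(binarizer.inverse_transform(predicted))
```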
<Paragraph position="1"> Once the features have been set, the final decision to make is what feature value to assign.</Paragraph> <Paragraph position="2"> To this end, there are three common possibilities: a boolean representation (that is, whether or not the term occurs in the document), term frequency, or tf*idf.</Paragraph> <Paragraph position="3"> Two sets of experiments were run in which the automatically extracted keywords were the only input to the representation. In the first set, keywords that contained several tokens were kept intact. For example, a keyword such as paradise fruit was represented as paradise fruit and was -- from the point of view of the classifier -- just as distinct from the single token fruit as from meatpackers. No stemming was performed in this set of experiments.</Paragraph> <Paragraph position="4"> In the second set of keywords-only experiments, the keywords were split up into unigrams, and also stemmed. For this purpose, we used Porter's stemmer (Porter, 1980). Thereafter the experiments were performed identically for the two keyword representations.</Paragraph> <Paragraph position="5"> In a third set of experiments, we extracted only the content in the TITLE tags, that is, the headlines. The tokens in the headlines were stemmed and represented as unigrams. The main motivation for the title experiments was to compare their performance to that of the keywords.</Paragraph> <Paragraph position="6"> For all three of these feature inputs, we first evaluated which of the three possible feature values to use (boolean, tf, or tf*idf). Thereafter, we reduced the space by varying the minimum number of occurrences in the training data required for a feature to be kept.</Paragraph> <Paragraph position="7"> The starting point for the fourth set of experiments was a full-text representation, where all stemmed unigrams occurring three or more times in the training data were selected, with the feature value tf*idf. Assuming that extracted keywords convey information about a document's gist, the feature values in the full-text representation were given higher weights if the feature was identical to a keyword token. This was achieved by adding the term frequency of a full-text unigram to the term frequency of an identical keyword unigram. Note that this does not mean that the term frequency value was necessarily doubled, as a keyword often contains more than one token, and it was the term frequency of the whole keyword that was added.</Paragraph> </Section>
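Before turning to parameter tuning, the sketch below illustrates the keyword-boosted full-text weighting just described: the term frequency of a whole keyword is added to the term frequency of every matching full-text unigram, before tf*idf is computed. It assumes stemming has already been applied and that keywords are whitespace-separated token strings; the function name and the toy example are hypothetical.

```python
from collections import Counter

def boosted_term_frequencies(fulltext_tokens, keywords):
    """Add the term frequency of each extracted keyword to the term frequency
    of every full-text unigram that matches one of the keyword's tokens."""
    tf = Counter(fulltext_tokens)      # plain full-text unigram frequencies
    keyword_tf = Counter(keywords)     # frequency of each (possibly multi-token) keyword
    for keyword, freq in keyword_tf.items():
        for token in keyword.split():
            if token in tf:
                # the whole keyword's frequency is added, so a unigram's value
                # is not necessarily doubled when the keyword has several tokens
                tf[token] += freq
    return tf

tokens = ["paradise", "fruit", "import", "fruit"]
print(boosted_term_frequencies(tokens, ["paradise fruit"]))
# fruit: 2 + 1 = 3, paradise: 1 + 1 = 2, import: 1
```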
<Section position="4" start_page="539" end_page="539" type="sub_section"> <SectionTitle> 3.4 Training and Validation </SectionTitle> <Paragraph position="0"> This section describes the parameter tuning, for which we used the training data set. This set was divided into five equally sized folds to decide which setting of the following two parameters resulted in the best performing classifier: what feature value to use, and the threshold for the minimum number of occurrences in the training data (in this particular order).</Paragraph> <Paragraph position="1"> To obtain a baseline, we made a full-text unigram run with boolean as well as with tf*idf feature values, setting the occurrence threshold to three.</Paragraph> <Paragraph position="2"> As stated previously, in this study we were concerned only with the representation, and more specifically with the feature input. As we did not tune any parameters other than the two mentioned above, the results can be expected to be lower than the state-of-the-art, even for the full-text run with unigrams.</Paragraph> <Paragraph position="3"> The number of input features for the full-text unigram representation for the whole training set was 10 676, after stemming and removing all tokens that contained only digits, as well as those tokens that occurred fewer than three times. The total number of keywords assigned to the 9 603 documents in the training data was 61 034. Of these, 29 393 were unique. When splitting up the keywords into unigrams, the number of unique stemmed tokens was 11 273.</Paragraph> </Section> </Section> <Section position="6" start_page="539" end_page="539" type="metho"> <SectionTitle> 3.5 Test </SectionTitle> <Paragraph position="0"> As a last step, we tested the best performing representations in the four different experimental settings on the independent test set.</Paragraph> <Paragraph position="1"> The number of input features for the full-text unigram representation was 10 676. The total number of features for the intact keyword representation was 4 450 with the occurrence threshold set to three, while the number of stemmed keyword unigrams was 6 478, with an occurrence threshold of two. The total number of keywords extracted from the 3 299 documents in the test set was 19 904.</Paragraph> <Paragraph position="2"> Next, we present the results for the validation and test procedures.</Paragraph> </Section> </Paper>