<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1068">
<Title>A Study on Automatically Extracted Keywords in Text Categorization</Title>
<Section position="7" start_page="539" end_page="541" type="evalu">
<SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> To evaluate the performance, we used precision, recall, and micro-averaged F-measure, with the F-measure being the decisive measure. The results for the 5-fold cross-validation runs are shown in Table 3, where the values given are the averages of the five runs made for each experiment. As can be seen in this table, the full-text run with a boolean feature value gave 92.3% precision, 69.4% recall, and 79.2% F-measure. The full-text run with tf*idf gave a better result, yielding 92.9% precision, 71.3% recall, and 80.7% F-measure. We therefore defined the latter as the baseline.</Paragraph>
<Paragraph position="1"> In the first type of experiment, where each keyword was treated as one feature regardless of the number of tokens it contained, the recall rates were considerably lower (between 32.0% and 42.3%) and the precision rates somewhat lower (between 85.8% and 90.5%) than the baseline. The best performance was obtained with a boolean feature value and a minimum of three occurrences in the training data (giving an F-measure of 56.9%).</Paragraph>
<Paragraph position="2"> In the second type of experiment, where the keywords were split into unigrams and stemmed, recall was higher but still low (between 60.2% and 64.8%) and precision somewhat lower (between 88.9% and 90.2%) compared to the baseline. The best results were achieved with a boolean representation (as in the first experiment) and a minimum of two occurrences in the training data (giving an F-measure of 75.0%). In the third type of experiment, where only the text in the TITLE tags was used, represented as stemmed unigrams, the precision rates increased above the baseline, to between 93.3% and 94.5%. Here, the best representation was tf*idf with a token occurring at least four times in the training data (giving an F-measure of 79.9%).</Paragraph>
<Paragraph position="3"> In the fourth and last set of experiments, we gave a higher weight to a full-text token if the same token was present in an automatically extracted keyword. Here we obtained the best results. In these experiments, the term frequency of a keyword unigram was added to the term frequency of the full-text feature whenever the stems were identical. For this representation, we experimented with applying the minimum-occurrence threshold on the training data both before and after the keyword term frequencies were added to the unigram term frequencies.</Paragraph>
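To make this fourth representation concrete, the following is a minimal sketch of one way to implement it; it is not the authors' code, and the Doc type and the boosted_tf and select_features names are hypothetical. It adds keyword unigram frequencies to the matching full-text unigram frequencies and applies the minimum-occurrence threshold either before or after that addition.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Illustrative encoding: a document is a pair of stem lists --
# (full-text stems, stems of the automatically extracted keywords).
Doc = Tuple[List[str], List[str]]

def boosted_tf(doc: Doc) -> Counter:
    """Add the term frequency of each keyword unigram to the term
    frequency of the full-text unigram with the identical stem."""
    full_stems, keyword_stems = doc
    tf = Counter(full_stems)
    for stem, n in Counter(keyword_stems).items():
        if stem in tf:  # boost only stems that occur in the full text
            tf[stem] += n
    return tf

def select_features(train: Iterable[Doc], min_count: int,
                    threshold_before_boost: bool) -> Set[str]:
    """Keep unigrams occurring at least min_count times in the training
    data, counting either before or after the keyword boost."""
    totals: Counter = Counter()
    for doc in train:
        totals.update(Counter(doc[0]) if threshold_before_boost
                      else boosted_tf(doc))
    return {stem for stem, count in totals.items() if count >= min_count}
```

In the best-performing setting reported below, the threshold is applied before the keyword frequencies are added, i.e. threshold_before_boost=True in this sketch.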
[Table 3: Results of the experiments with various parameter settings.]
<Paragraph position="4"> The highest recall (72.0%) and F-measure (81.1%) for all validation runs were achieved when the occurrence threshold was set before the addition of the keywords.</Paragraph>
<Paragraph position="5"> Next, we present the results on the fixed test data set for the four experimental settings with the best performance on the validation runs.</Paragraph>
<Paragraph position="6"> Table 4 shows the results obtained on the fixed test data set for the baseline and for the experiments that obtained the highest F-measure within each of the four experiment types.</Paragraph>
<Paragraph position="7"> We can see that the baseline -- where the full text is represented as unigrams with tf*idf feature values -- yields 93.0% precision, 71.7% recall, and 81.0% F-measure. When the intact keywords are used as feature input, with a boolean feature value and at least three occurrences in the training data, the performance decreases greatly, both in the correctness of the predicted categories and in the number of categories found.</Paragraph>
<Paragraph position="8"> When the keywords are represented as unigrams, a better performance is achieved than when they are kept intact. This is in line with the findings on n-grams by Caropreso et al. (2001). However, the results are still not satisfactory, since both the precision and recall rates are lower than the baseline.</Paragraph>
<Paragraph position="9"> Titles, on the other hand, represented as stemmed unigrams, are shown to be a useful information source for correctly predicting the text categories. Here, we achieve the highest precision rate, 94.2%, although the recall rate and the F-measure are lower than the baseline.</Paragraph>
<Paragraph position="10"> Full texts combined with keywords result in the highest recall value, 72.9%, as well as the highest F-measure, 81.7%, both above the baseline.</Paragraph>
<Paragraph position="11"> Our results clearly show that automatically extracted keywords can be a valuable supplement to full-text representations and that their combination yields the best performance, measured as both recall and micro-averaged F-measure. Our experiments also show that a satisfactory categorization can be obtained from keywords alone, provided that they are treated as unigrams. Lastly, for higher precision in text classification, we can use the stemmed tokens in the headlines as features. The number of keywords assigned per document varies from zero to twelve. In Figure 1, we have plotted how the precision, the recall, and the F-measure for the test set vary with the number of assigned keywords for the keywords-only unigram representation.</Paragraph>
<Paragraph position="12"> [Figure 1: Precision, recall, and F-measure for each number of assigned keywords; the values in brackets denote the number of documents.]</Paragraph>
<Paragraph position="13"> We can see that the F-measure and the recall reach their highest points when three keywords are extracted. The highest precision (100%) is obtained when the classification is performed on a single extracted keyword, but only 36 documents are present in that group, and the recall is low. Further experiments are needed in order to establish the optimal number of keywords to extract.</Paragraph>
</Section>
</Paper>
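As a supplement to the section above, here is a minimal sketch of the micro-averaged precision, recall, and F-measure used in these evaluations, which pools all (document, category) decisions before the ratios are taken; the micro_prf name and the set-based category encoding are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Set, Tuple

def micro_prf(gold: List[Set[str]],
              pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F-measure: true positives,
    false positives, and false negatives are summed over every
    (document, category) decision before the ratios are computed."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

For example, micro_prf([{"grain"}, {"trade", "crude"}], [{"grain"}, {"trade"}]) yields a precision of 1.0, a recall of about 0.67, and an F-measure of 0.8.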