<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1068"> <Title>Sydney, July 2006. © 2006 Association for Computational Linguistics. A Study on Automatically Extracted Keywords in Text Categorization</Title> <Section position="8" start_page="541" end_page="542" type="relat"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> For the work presented in this paper, two aspects of previous work are of interest: how the alternative input features (that is, alternatives to unigrams) are selected, and how this alternative representation is combined with a bag-of-words representation (if it is). Early work on linguistic phrases was done by Fürnkranz et al. (1998), where all noun phrases matching any of a number of syntactic heuristics are used as features. This approach leads to higher precision at the low-recall end, when evaluated on a corpus of Web pages. Aizawa (2001) extracts PoS-tagged compounds matching pre-defined PoS patterns. The representation contains both the compounds and their constituents, and a small improvement is shown in the results on Reuters-21578. Moschitti and Basili (2004) add complex nominals as input features to their bag-of-words representation. The phrases are extracted by a system for terminology extraction1. The more complex representation leads to a small decrease in performance on the Reuters corpus. In these studies, it is unclear how many phrases are extracted and added to the representations.</Paragraph> <Paragraph position="1"> Li et al. (2003) map documents (e-mail messages) that are to be classified into a vector space of keywords with associated probabilities. 
The mapping is based on a training phase requiring both texts and their corresponding summaries.</Paragraph> <Paragraph position="2"> Another approach to combining different representations is taken by Sahlgren and Cöster (2004), where the full-text representation is combined with a concept-based representation by selecting one or the other for each category. They show that concept-based representations can outperform traditional word-based representations, and that a combination of the two types of representations improves the performance of the classifier over all categories.</Paragraph> <Paragraph position="3"> Keywords assigned to a particular text can be seen as a dense summary of that text. There are some reports on how automatic summarization can be used to improve text categorization. (Footnote 1: In terminology extraction, all terms describing a domain are to be extracted. The aim of automatic keyword indexing, on the other hand, is to find a small set of terms that describes a specific document, independently of the domain it belongs to. Thus, the set of terms must be limited to contain only the most salient ones.)</Paragraph> <Paragraph position="4"> For example, Ko et al. (2004) use methods from text summarization to find the sentences containing the important words. The words in these sentences are then given a higher weight in the feature vectors, by modifying the term frequency value with the sentence's score. The F-measure increases from 85.8 to 86.3 on the Newsgroups data set using support vector machines.</Paragraph> <Paragraph position="5"> Mihalcea and Hassan (2004) use an unsupervised method2 to extract summaries, which in turn are used to categorize the documents. In their experiments on a sub-set of Reuters-21578 (among others), Mihalcea and Hassan show that precision increases when using the summaries rather than the full-length documents. Özgür et al. 
(2005) have shown that limiting the representation to 2,000 features leads to better performance, as evaluated on Reuters-21578. There is thus evidence that using only a sub-set of a document can give a more accurate classification. The question, though, is which sub-set to use.</Paragraph> <Paragraph position="6"> In summary, the work presented in this paper most closely resembles that of Ko et al. (2004), who also use a denser version of a document to alter the feature values of a bag-of-words representation of the full-length document.</Paragraph> </Section> </Paper>