<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4005">
  <Title>Enhancing Linguistically Oriented Automatic Keyword Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> The approach taken to the keyword extraction task is that of supervised machine learning. This means that a set of documents with known keywords is used to train a model, which in turn is applied to select keywords to and from previously unseen documents. The keyword extraction discussed in this paper is based on work presented in Hulth (2003a) and Hulth (2003b).</Paragraph>
    <Paragraph position="1"> In Hulth (2003a) an evaluation of three different methods to extract candidate terms from documents is presented. The methods are: a0 extracting all uni-, bi, and trigrams that do not begin or end with a stopword.</Paragraph>
    <Paragraph position="2"> a0 extracting all noun phrase (NP) chunks as judged by a partial parser.</Paragraph>
    <Paragraph position="3"> a0 extracting all part-of-speech (PoS) tagged words or sequences of words that match any of a set of empirically defined PoS patterns.</Paragraph>
    <Paragraph position="4"> The best performing models use four attributes. These are:</Paragraph>
    <Paragraph position="6"> a0 relative position of the first occurrence a0 the POS tag or tags assigned to the term All terms are stemmed using Porter's stemmer (Porter, 1980), and an automatically selected keyword is considered correct if it is equivalent to a stemmed manually assigned keyword. The performance of the classifiers is evaluated by calculating the F-measure for the selected keywords, with equal weight given to the precision and the recall.</Paragraph>
    <Paragraph position="7"> In Hulth (2003b), experiments on how the performance of the keyword extraction can be improved by combining the judgement of three classifiers are presented. The classifiers differ in how the data are represented, and more specifically in how the candidate terms are selected from the documents. By only assigning keywords that are selected by at least two term selection approaches--that is by taking the majority vote--a better performance is achieved. In addition, by removing the subsumed key-words (keywords that are substrings of other selected keywords) the performance is yet higher.</Paragraph>
    <Paragraph position="8"> The classifiers are constructed by Rule Discovery System (RDS), a system for rule induction1. This means that the models consist of rules. The applied strategy is that of recursive partitioning, where the resulting rules are hierarchically organised (i.e., decision trees).</Paragraph>
    <Paragraph position="9"> The data set on which the models are trained and tested originates from the Inspec database2, and consists of abstracts in English from scientific journal papers. The set of 2 000 documents is divided into three sets: a training set of 1 000 documents (to train the models), a validation set consisting of 500 documents (to select the best performing model, e.g., for setting the threshold value for the regression runs), and the remaining 500 documents are saved for testing (to get unbiased results). Each abstract has two sets of keywords--assigned by a professional indexer--associated to them: a set of controlled terms (keywords restricted to the Inspec thesaurus); and a set of uncontrolled terms that can be any suitable terms.</Paragraph>
    <Paragraph position="10"> Both the controlled terms and the uncontrolled terms may or may not be present in the abstracts. However, the indexers had access to the full-length documents when assigning the keywords, and not only to the abstracts. For the experiments presented in this paper, only the uncontrolled terms are considered, as these to a larger extent are present in the abstracts (76.2% as opposed to 18.1% for the controlled terms). The performance is evaluated using the uncontrolled keywords as the gold standard.</Paragraph>
    <Paragraph position="11"> In the paper, three minor improvements to the keyword extraction algorithm are presented. These concern how one of the term selection approaches extract candidate terms; how the collection frequency is calculated; and how the weights are set to the positive examples. The major focus of the paper is how the learning task is defined. For these experiments, the same machine learning system--RDS--is used as for the experiments presented by Hulth (2003a). Also the same data are used to train the models and to tune the parameters. The results of the experiments are presented in Tables 1-5, which show: the average number of keywords assigned per document (Assign.); the average number of correct keywords per document (Corr.); precision (P); recall (R); and F-measure (F). On average, 7.6 manually assigned keywords are present per document. The total number of manual keywords present in the abstracts in the test data set is 3 816, and is the number on which the recall is calculated.</Paragraph>
  </Section>
class="xml-element"></Paper>