<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1042"> <Title>Evaluation of Automatically Identified Index Terms for Browsing Electronic Documents</Title> <Section position="2" start_page="0" end_page="304" type="intro"> <SectionTitle> 2. Introduction </SectionTitle> <Paragraph position="0"> In this paper, we consider the problem of how to evaluate the automatic identification of index terms that have been derived without recourse to lexicons or to other kinds of domain-specific information. By index terms, we mean natural language expressions that constitute a meaningful representation of a document for humans.</Paragraph> <Paragraph position="1"> The premise of this research is that if significant topics coherently represent information in a document, these topics can be used as index terms that approximate the content of individual documents in large collections of electronic documents.</Paragraph> <Paragraph position="2"> We compare three shallow processing methods for identifying index terms: * Keywords (KW) are terms identified by counting the frequency of stemmed words in a document; * Technical terms (TT) are noun phrases (NPs) or subparts of NPs repeated more than twice in a document \[Justeson and Katz 1995\]; * Head sorted terms (HS) are identified by a method in which simplex noun phrases (as defined below) are sorted by head and then ranked in decreasing order of frequency \[Wacholder 1998\].</Paragraph> <Paragraph position="3"> The three methods that we evaluated are domain-independent in that they use statistical and/or linguistic properties that apply to any natural language document in any field.
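As an illustration of the first of these methods, the KW count can be sketched in a few lines. This is a minimal sketch rather than the implementation evaluated here: the toy stoplist and the crude suffix-stripping stemmer are stand-in assumptions for a real stoplist and a Porter-style stemmer.

```python
from collections import Counter
import re

# Toy stoplist; a real KW system would use a much larger one.
STOPLIST = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "that"}

def crude_stem(word):
    # Illustrative suffix stripping only, standing in for a Porter-style stemmer.
    for suffix in ("ations", "ation", "ings", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def keywords(text, n=5):
    """Rank stoplisted, stemmed words of a single document by frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_stem(t) for t in tokens if t not in STOPLIST)
    return counts.most_common(n)
```

Note that the ranking depends only on the one document, which is the corpus-independence property discussed in this section.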
These methods are also corpus-independent, in that the ranking of terms for an individual document is not dependent on properties of the corpus.</Paragraph> <Section position="1" start_page="0" end_page="304" type="sub_section"> <SectionTitle> 2.1 Overview of methods and results </SectionTitle> <Paragraph position="0"> Subjects were drawn from two groups: professionals and students. Professionals included librarians and publishing professionals familiar with both manual and automatic text indexing. Students included undergraduate and graduate students with a variety of academic interests.</Paragraph> <Paragraph position="1"> To assess terms, we used a standard qualitative ranking technique. We presented subjects with an article and a list of terms identified by one of the three methods. Subjects were asked to answer the following general question: &quot;Would this term be useful in an electronic index for this article?&quot; Terms were rated on a scale of 1 to 5, where 1 indicates a high quality term that should definitely be included in the index and 5 indicates a junk term that definitely should not be included. (This research was partly funded by NSF IRI 97-12069, &quot;Automatic identification of significant topics in domain independent full text documents&quot;, and NSF IRI 97-53054, &quot;Computationally tractable methods for document analysis&quot;.) For example, the phrase court-approved affirmative action plans received an average rating of 1 from the professionals, meaning that it was ranked as useful for the article; the KW affirmative received an average rating of 3.75, meaning that it was less useful; and the KW action received an average rating of 4.5, meaning that it was not useful.</Paragraph> <Paragraph position="2"> The goal of our research is to determine which method, or combination of methods, provides the best results.
We measure results in terms of two criteria: quality and coverage.</Paragraph> <Paragraph position="3"> By quality, we mean the rating evaluators assigned to terms on the 1 to 5 scale, from highest to lowest. By coverage, we mean the thoroughness with which the terms cover the significant topics in the document. Our methodology permits us to measure both criteria, as shown in the figure. Our results from both the professionals and students show that TTs are superior with respect to quality; however, there are only a small number of TTs per document, so they do not provide adequate coverage in that they are not fully representative of the document as a whole. In contrast, KWs provide good coverage but relatively poor quality in that KWs are vague, and not well filtered. SNPs, which have been sorted using HS and filtered, provide a better balance of quality and coverage.</Paragraph> <Paragraph position="4"> From our study, we draw the following conclusions: * The KW approach identifies some useful index terms, but they are mixed in with a large number of low-ranked terms.</Paragraph> <Paragraph position="5"> * The TT approach identifies high quality terms, but with low coverage, i.e., relatively few indexing terms.</Paragraph> <Paragraph position="6"> * The HS approach achieves a balance between quality and coverage.</Paragraph> <Paragraph position="7"> 3. Domain-independent metrics for identifying significant topics In order to identify significant topics in a document, a significance measure is needed, i.e., a method for determining which concepts in the document are relatively important for a given task. The need to determine the importance of a particular concept within a document is motivated by a range of applications, including information retrieval \[Salton 1989\], automatic determination of authorship \[Mosteller and Wallace 1963\], similarity metrics for cross-document clustering \[Hatzivassiloglou et al. 1999\], automatic indexing \[Hodges et al.
1996\] and input to summarization \[Paice 1990\].</Paragraph> <Paragraph position="8"> For example, one of the earlier applications using frequency for identifying significant topics in a document was proposed by \[Luhn 1958\] for use in creating automatic abstracts. For each document, a list of stoplisted stems was created and ranked by frequency; the most frequent keywords were used to identify significant sentences in the original document. Luhn's premise was that emphasis, as indicated by repetition of words and collocation, is an indicator of significance. Namely, &quot;the more often certain words are found in each other's company within a sentence, the more significance may be attributed to each of these words.&quot; This basic observation, although refined extensively by later summarization techniques (as reviewed in \[Paice 1990\]), relies on the capability of identifying significant concepts. The standard IR technique known as tf*idf \[Salton 1989\] seeks to identify documents relevant to a particular query by relativizing keyword frequency in a document as compared to frequency in a corpus. This method can be used to locate at least some important concepts in full text. Although it has been effective for information retrieval, for other applications, such as human-oriented indexing, this technique is impractical. Ambiguity of stems (trad might refer to trader or tradition) and of isolated words (state might be a political entity or a mode of being) means that lists of keywords have not usually been used to represent the content of a document to human beings.
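The tf*idf weighting described above can be sketched as follows; this shows one common variant (raw term frequency times log inverse document frequency), chosen for illustration only, since \[Salton 1989\] discusses several weighting schemes.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score the terms of one document by their frequency in the document,
    discounted by how many corpus documents contain each term."""
    n_docs = len(corpus)
    df = Counter()               # document frequency per term
    for doc in corpus:
        df.update(set(doc))      # count each term at most once per document
    tf = Counter(doc_tokens)     # term frequency in this document
    return {term: count * math.log(n_docs / df[term])
            for term, count in tf.items()}
```

A term that occurs in every corpus document gets weight zero no matter how frequent it is locally, which is exactly how vague but common words are demoted.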
Furthermore, humans have a difficult time processing stems and parts of words out of phrasal context.</Paragraph> <Paragraph position="9"> The technical term (TT) method, another technique for identifying significant terms in text that can be used as index terms, was introduced by \[Justeson and Katz 1995\], who developed an algorithm for identifying repeated multi-word phrases such as central processing unit in the computer domain or word sense in the lexical semantic domain.</Paragraph> <Paragraph position="10"> This algorithm identifies candidate TTs in a corpus by locating NPs consisting of nouns, adjectives, and sometimes prepositional phrases. TTs are defined as those NPs, or their subparts, which occur above some frequency threshold in a corpus. However, as \[Boguraev and Kennedy 1998\] observe, the TT technique may not characterize the full content of documents. Indeed, even in a technical document, TTs do not provide adequate coverage of the NPs in a document that contribute to its content, especially since TTs are by definition multi-word. A truly domain-general method should apply to both technical and non-technical documents. The relevant difference between technical and non-technical documents is that in technical documents, many of the topics which are significant to the document as a whole may also be TTs.</Paragraph> <Paragraph position="11"> \[Wacholder 1998\] proposed the method of Head Sorting for identifying significant topics that can be used to represent a source document. HS also uses a frequency measure to provide an approximation of topic significance. However, instead of counting frequency of stems or repetition of word sequences, this method counts the frequency of a relatively easily identified grammatical element: heads of simplex noun phrases (SNPs).
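A minimal sketch of TT spotting in the Justeson and Katz style, under simplifying assumptions: it takes already-tagged input rather than running a tagger, considers only contiguous adjective/noun sequences ending in a noun (omitting the optional embedded preposition), and applies the multi-word and frequency-threshold requirements described in the previous paragraph.

```python
from collections import Counter

def technical_terms(tagged_doc, min_freq=2):
    """Sketch of Justeson & Katz-style term spotting: multi-word
    adjective/noun sequences ending in a noun, repeated in the text.
    tagged_doc is a list of (word, tag) pairs with tags 'A' (adjective),
    'N' (noun), or anything else for other parts of speech."""
    counts = Counter()
    run = []
    for word, tag in tagged_doc + [("", "")]:  # sentinel flushes the last run
        if tag in ("A", "N"):
            run.append((word, tag))
        else:
            # emit every multi-word subsequence of the run that ends in a noun
            for i in range(len(run)):
                for j in range(i + 2, len(run) + 1):
                    if run[j - 1][1] == "N":
                        counts[" ".join(w for w, _ in run[i:j])] += 1
            run = []
    return [term for term, c in counts.items() if c >= min_freq]
```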
For common NPs (NPs whose head is a common noun), an SNP is a maximal NP that includes premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. For example, the well-known book is an SNP, but the well-known book on asteroids includes two SNPs, well-known book and asteroids. For proper names, an SNP is a name that refers to a single entity. For example, Museum of the City of New York, the name of an organization, is an SNP even though the organizational name incorporates a city name. Others, such as \[Church 1988\], have discussed a similar concept, sometimes called simple or base NPs.</Paragraph> <Paragraph position="12"> The HS approach is based on the assumption that nominal elements can be used to convey the gist of a document. SNPs, which are semantically and syntactically coherent, appear to be at a good level of detail for content representation of the document. SNPs are identified by a system \[Evans 1998; Evans et al. 2000\] which uses a finite state machine to sequentially parse text that has been tagged with part of speech. Next, the complete list of SNPs identified in a document is sorted by the head of the phrase, which, at least for English-language common SNPs, is almost always the last word. The intuitive justification for sorting SNPs by head is based on the fundamental linguistic distinction between head and modifier: in general, a head makes a greater contribution to the syntax and semantics of a phrase than does a modifier. This linguistic insight can be extended to the document level. If, as a practical matter, it is necessary to rank the contribution to a whole document made by the sequence of words constituting an NP, the head should be ranked more highly than other words in the phrase. This distinction is important in linguistic theory; for example, \[Jackendoff 1977\] discusses the relationship of heads and modifiers in phrase structure.
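Returning to the mechanics, and assuming SNPs have already been extracted, the head-sorting step described above can be sketched as follows. The last-word head heuristic matches the observation about English common SNPs; the function and its names are illustrative, not the cited system.

```python
from collections import defaultdict

def head_sort(snps):
    """Group simplex noun phrases by head (approximated as the last word)
    and rank heads by how many SNPs share them, most frequent first."""
    groups = defaultdict(list)
    for snp in snps:
        groups[snp.split()[-1]].append(snp)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```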
It is also important in NLP, where, for example, \[Strzalkowski 1997\] and \[Evans and Zhai 1996\] have used the distinction between heads and modifiers to add query terms to information retrieval systems.</Paragraph> <Paragraph position="13"> Powerful corpus processing techniques have been developed to measure deviance from an average occurrence or co-occurrence in the corpus. In this paper we chose to evaluate methods that depend only on document-internal data, independent of corpus, domain or genre.</Paragraph> <Paragraph position="14"> We therefore did not use, for example, tf*idf, the purely statistical technique that is used by most information retrieval systems, or \[Smadja 1993\], a hybrid statistical and symbolic technique for identifying collocations.</Paragraph> </Section> </Section> </Paper>