<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1407"> <Title>A Simple but Powerful Automatic Term Extraction Method</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper, we propose a new idea for the automatic recognition of domain-specific terms. Our idea is based on the statistics relating a compound noun to its component single-nouns. More precisely, we focus on how many nouns adjoin the noun in question to form compound nouns. We propose several scoring methods based on this idea and experimentally evaluate them on the NTCIR1 TMREC test collection. The results are very promising, especially in the low recall area.</Paragraph> <Paragraph position="1"> Introduction Automatic term recognition (ATR) aims at extracting domain-specific terms from a corpus of a certain academic or technical domain. The majority of domain-specific terms are compound nouns, in other words, uninterrupted collocations: 85% of domain-specific terms are said to be compound nouns. These compound nouns very frequently include, as their components, the single-nouns that make up the remaining 15%, where &quot;single-noun&quot; means a noun that cannot be further divided into several shorter and more basic nouns. In other words, the majority of compound nouns consist of a much smaller number of single-noun terms from the remaining 15% together with other single-nouns. In this situation, it is natural to pay attention to the relation between single-nouns and compound nouns, especially how single-noun terms contribute to forming compound noun terms.</Paragraph> <Paragraph position="2"> Another important feature of domain-specific terms is termhood, proposed in (Kageura &amp; Umino 96), where &quot;termhood&quot; refers to the degree to which a linguistic unit is related to a domain-specific concept. 
Thus, what we really have to pursue is an ATR method that directly uses the notion of termhood.</Paragraph> <Paragraph position="3"> Considering these factors, the way compound nouns are formed must be closely related to the termhood of those compound nouns. The first reason is that termhood is usually calculated from term frequency and from measures of the bias of term frequency, such as inverse document frequency. Even though these calculations give a good approximation of termhood, they are still not directly related to termhood, because they are based on superficial statistics. That is, what they capture is not necessarily the meaning in a writer's mind but the meaning in actual use. Apparently, termhood is intended to reflect this type of meaning. The second reason is that if a certain single-noun, say N, expresses the key concept of a domain that the document treats, the writer of the document must be using N not only frequently but also in various ways. For instance, he/she composes quite a few compound nouns using N and uses these compound nouns in the documents he/she writes. Thus, we focus on the relation between single-nouns and compound nouns in pursuing new ATR methods.</Paragraph> <Paragraph position="4"> The first attempt to make use of this relation was made by (Nakagawa &amp; Mori 98), through the number of distinct single-nouns that come to the left or right of a single-noun term when it is used in compound noun terms. Using this type of number associated with a single-noun term, Nakagawa and Mori proposed a scoring function for term candidates.</Paragraph> <Paragraph position="5"> Their term extraction method, however, is just one example of employing the relation between single-nouns and compound nouns. Note that this relation is essentially based on a noun bigram. In this paper, we expand the relation based on noun bigrams that might be the components of longer compound nouns. 
Then we experimentally evaluate the power of several variations of scoring functions based on the noun bigram relation, using the NTCIR1 TMREC test collection. From this experimental clarification, we conclude that the single-noun term's power of generating compound noun terms is useful and essential in ATR.</Paragraph> <Paragraph position="6"> In this paper, Section 1 gives the background of ATR methods. Section 2 describes the proposed noun-bigram-based scoring function for term extraction. Section 3 describes the experimental results and discusses them.</Paragraph> </Section></Paper>