<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1407">
  <Title>A Simple but Powerful Automatic Term Extraction Method</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Single-Noun Bigrams as Components of Compound Nouns
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Single-Noun Bigrams
</SectionTitle>
      <Paragraph position="0"> The relation between a single-noun and the complex nouns that include it is very important.</Paragraph>
      <Paragraph position="1"> Nevertheless, to our knowledge, this relation has not received enough attention so far. Nakagawa and Mori (1998) proposed a term scoring method that utilizes this type of relation. In this paper, we extend that idea comprehensively. Here we focus on compound nouns among the various types of complex terms. In technical documents, the majority of domain-specific terms are noun phrases or compound nouns consisting of a small number of single-nouns. Considering this observation, we propose a new scoring method that measures the importance of each single-noun. In a nutshell, this scoring method measures how many distinct compound nouns contain a particular single-noun as their part in a given document or corpus. Consider the situation where a single-noun N occurs with other single-nouns that might form parts of many compound nouns, as shown in Figure 1, where [N M] means the bigram of nouns N and M.</Paragraph>
      <Paragraph position="3"> In Figure 1, [LNi N] (i=1,..,n) and [N RNj] (j=1,..,m) are single-noun bigrams which constitute (parts of) compound nouns. #Li and #Rj (i=1,..,n and j=1,..,m) mean the frequencies of the bigrams [LNi N] and [N RNj] respectively. Note that since we depict only bigrams, compound nouns like [LNi N RNj], which contain [LNi N] and/or [N RNj] as their parts, might actually occur in a corpus. Again, such a noun trigram might itself be a part of longer compound nouns.</Paragraph>
      <Paragraph position="4"> Let us show an example of a noun bigram. Suppose that we extract compound nouns including "trigram" as candidate terms from a corpus, as shown in the following example.</Paragraph>
      <Paragraph position="5"> Example 1.</Paragraph>
      <Paragraph position="6"> trigram statistics, word trigram, class trigram, word trigram, trigram acquisition, word trigram statistics, character trigram. Then, the noun bigrams containing the single-noun "trigram" are shown in the following, where the number beside each bigram indicates its frequency.</Paragraph>
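The bigram decomposition above can be sketched in Python (a hypothetical illustration, not the authors' implementation): each candidate compound noun from Example 1 is split into its adjacent single-noun pairs, and the pairs containing "trigram" are collected with their frequencies.

```python
from collections import Counter

# Candidate compound nouns from Example 1, each as a list of single-nouns.
candidates = [
    ["trigram", "statistics"], ["word", "trigram"], ["class", "trigram"],
    ["word", "trigram"], ["trigram", "acquisition"],
    ["word", "trigram", "statistics"], ["character", "trigram"],
]

# Count every adjacent single-noun bigram [N M] in the candidates.
bigrams = Counter()
for cn in candidates:
    for n, m in zip(cn, cn[1:]):
        bigrams[(n, m)] += 1

# Bigrams with "trigram" as the right noun ([LNi N]) or left noun ([N RNj]).
left = {n: f for (n, m), f in bigrams.items() if m == "trigram"}
right = {m: f for (n, m), f in bigrams.items() if n == "trigram"}
print(left)   # {'word': 3, 'class': 1, 'character': 1}
print(right)  # {'statistics': 2, 'acquisition': 1}
```

Note that "word trigram statistics" contributes both a [word trigram] and a [trigram statistics] bigram, which is why "word" precedes "trigram" three times in total.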
      <Paragraph position="8"> We focus on and utilize single-noun bigrams to define the function on which scoring is based. Note that we are concerned only with single-noun bigrams, not with a single-noun per se. The reason is that we want to focus sharply on the fact that the majority of domain-specific terms are compound nouns.</Paragraph>
      <Paragraph position="9"> Compound nouns are well analyzed as noun bigrams.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Scoring Function
</SectionTitle>
      <Paragraph position="0"> Since a scoring function based on [LNi N] or [N RNj] could have an infinite number of variations, we consider here the following simple but representative scoring functions.</Paragraph>
      <Paragraph position="1"> #LDN(N) and #RDN(N): These are the numbers of distinct single-nouns which directly precede or succeed N. They are exactly "n" and "m" in Figure 1. For instance, in the example shown in Figure 2, #LDN(trigram)=3 and #RDN(trigram)=2. LN(N,k) and RN(N,k): The general functions that take into account the number of occurrences of each noun bigram [LNi N] and [N RNj] are defined as follows: LN(N,k) = SUM_{i=1..n} (#Li)^k (1) and RN(N,k) = SUM_{j=1..m} (#Rj)^k (2).</Paragraph>
      <Paragraph position="2"> For instance, if we use LN(N,1) and RN(N,1) in Example 1, GM(trigram,1) = ((5+1)(3+1))^(1/2) = 4.90. In (3), GM does not depend on the length of a compound noun, that is, the number of single-nouns within it. This is because we do not yet have any idea about the relation between the importance of a compound noun and its length. It is fair to treat all compound nouns, including single-nouns, equally, no matter how long or short each compound noun is.</Paragraph>
      <Paragraph position="4"> We can obtain various functions by varying the parameter k of (1) and (2). For instance, #LDN(N) and #RDN(N) can be defined as LN(N,0) and RN(N,0). LN(N,1) and RN(N,1) are the frequencies of nouns that directly precede or succeed N. In the example shown in Figure 2, LN(trigram,1)=5 and RN(trigram,1)=3.</Paragraph>
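A minimal sketch of functions (1) and (2), using the left/right frequencies derivable from Example 1 (word/class/character precede "trigram" 3/1/1 times; statistics/acquisition follow it 2/1 times); the function names mirror the paper's notation, but the code is an illustration under those assumed counts:

```python
def LN(left_freqs, k):
    # (1): sum of (#Li)**k over the distinct single-nouns preceding N.
    return sum(f ** k for f in left_freqs.values())

def RN(right_freqs, k):
    # (2): sum of (#Rj)**k over the distinct single-nouns succeeding N.
    return sum(f ** k for f in right_freqs.values())

left = {"word": 3, "class": 1, "character": 1}   # [LNi trigram] frequencies
right = {"statistics": 2, "acquisition": 1}      # [trigram RNj] frequencies

print(LN(left, 0), RN(right, 0))  # 3 2  -> #LDN(trigram), #RDN(trigram)
print(LN(left, 1), RN(right, 1))  # 5 3  -> raw bigram frequencies
print(LN(left, 2), RN(right, 2))  # 11 5 -> frequency-emphasizing case
```

With k=0 every distinct neighbor counts once (productivity); larger k weights frequent bigrams more heavily, matching the discussion of the two extremes below.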
      <Paragraph position="5"> Now we consider the nature of (1) and (2) with various values of the parameter k.</Paragraph>
      <Paragraph position="6"> The larger k is, the more we take into account the frequencies of each noun bigram. One extreme is the case k=0, namely LN(N,0) and RN(N,0), where we do not take the frequency of each noun bigram into account at all. LN(N,0) and RN(N,0) describe how productive, linguistically and domain-dependently, the noun N is in a given corpus. That means that noun N presents a key and/or basic concept of the domain treated by the corpus. The other extreme is a large k, like k=2, 4, etc. In these cases, we focus rather on the frequency of each noun bigram.</Paragraph>
      <Paragraph position="7"> In other words, the statistically biased use of noun N is the main concern. In the example shown in Figure 2, LN(trigram,2)=11 and RN(trigram,2)=5.</Paragraph>
      <Paragraph position="8"> If k&lt;0, we discount the frequency of each noun bigram. However, this case did not show good results in our ATR experiments.</Paragraph>
      <Paragraph position="9"> Information we did not use in the bigram-based methods described in 2.2.1 and 2.2.2 is the frequency of single-nouns and compound nouns that occur independently, namely with left and right adjacent words that are not nouns. For instance, "word patterns" occurs independently in "... use the word patterns occurring in ... ." Since the scoring functions proposed in 2.2.1 are noun-bigram statistics, the number of such independent occurrences of the nouns themselves is not used. If we take this information into account, a new type of information is used and better results can be expected.</Paragraph>
      <Paragraph position="10"> In this paper, we employ a very simple method for this: if a single-noun or a compound noun occurs independently, its score is multiplied by the number of its independent occurrences. Then GM(CN,k) of formula (3) is revised.</Paragraph>
      <Paragraph position="11"> We call this new GM FGM(CN,k) and define it as follows.</Paragraph>
      <Paragraph position="12"> The next thing to do is to extend the scoring functions of a single-noun to scoring functions of a compound noun. We adopt a very simple method, namely a geometric mean. Now consider a compound noun CN = N1 N2 ... NL. Then the geometric mean GM of CN is defined as follows: GM(CN,k) = ( PROD_{i=1..L} (LN(Ni,k)+1)(RN(Ni,k)+1) )^(1/2L) (3).</Paragraph>
      <Paragraph position="13"> If CN occurs independently, then FGM(CN,k) = GM(CN,k) x f(CN) (4), where f(CN) means the number of independent occurrences of the noun CN.</Paragraph>
      <Paragraph position="15"> For instance, in Example 1, if we find an independent "trigram" three times in the corpus, FGM(trigram,1) = 3 x 4.90 = 14.70.</Paragraph>
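The GM and FGM scores can be sketched as follows. GM takes the geometric mean of (LN+1)(RN+1) over the single-nouns of CN, and FGM multiplies it by the independent-occurrence count; the per-noun (LN, RN) pairs are assumed to be precomputed for a chosen k (a hypothetical sketch, not the authors' code):

```python
def GM(lr_counts):
    # Geometric mean over the L single-nouns Ni of CN, where lr_counts
    # holds the precomputed pair (LN(Ni,k), RN(Ni,k)) for each Ni.
    prod = 1.0
    for ln, rn in lr_counts:
        prod *= (ln + 1) * (rn + 1)
    return prod ** (1.0 / (2 * len(lr_counts)))

def FGM(lr_counts, f_cn):
    # Revised score: multiply GM by f(CN), the number of independent
    # occurrences of CN (applied only when CN occurs independently).
    return GM(lr_counts) * f_cn

# CN = "trigram": LN(trigram,1)=5, RN(trigram,1)=3, seen independently 3 times.
print(round(GM([(5, 3)]), 2))      # 4.9
print(round(FGM([(5, 3)], 3), 2))  # 14.7
```

Because GM is a per-noun geometric mean, a two-noun compound with the same per-noun counts gets the same score as a single noun, which is the length-neutrality the text argues for.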
      <Paragraph position="17"> We compare our methods with the C-value based method (Frantzi and Ananiadou 1996) because 1) their method is very powerful for extracting and properly scoring compound nouns, and 2) their method is basically based on unithood. In contrast, our scoring functions proposed in 2.2.1 try to capture termhood. However, the original definition of C-value cannot score a single-noun, because the important part of the definition of C-value is: C-value(a) = (length(a)-1) x (n(a) - t(a)/c(a)) (5)</Paragraph>
      <Paragraph position="19"> where a is a compound noun, length(a) is the number of single-nouns which make up a, n(a) is the total frequency of occurrence of a in the corpus, t(a) is the frequency of occurrence of a in longer candidate terms, and c(a) is the number of those candidate terms.</Paragraph>
      <Paragraph position="20"> As seen from (5), the C-value of every single-noun comes to 0. The reason why the first factor of the right-hand side is (length(a)-1) is that C-value originally seemed to capture how much computational effort is needed to recognize the important part of the term. Thus, if length(a) is 1, no effort is needed to recognize a part, because the term a is a single word and has no proper part. But we intend to capture how important the term is for the writer or reader, namely its termhood. In order to make the C-value capture termhood, we modify (5) as follows.</Paragraph>
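The contrast can be sketched as follows. C-value uses the (length(a)-1) factor quoted above; for MC-value the code assumes the simplest modification consistent with the text, replacing (length(a)-1) with length(a) so that single-nouns receive a nonzero score (the exact form of the modified factor is an assumption here):

```python
def c_value(length, n, t, c):
    # (5): (length(a)-1) * (n(a) - t(a)/c(a)); always 0 for single-nouns.
    nested = t / c if c else 0.0
    return (length - 1) * (n - nested)

def mc_value(length, n, t, c):
    # Modified C-value: (length(a)-1) replaced by length(a), so a
    # single-noun can score (assumed form of the modification).
    nested = t / c if c else 0.0
    return length * (n - nested)

# A single-noun occurring 10 times, 6 of them inside 3 longer candidates:
print(c_value(1, 10, 6, 3))   # 0.0
print(mc_value(1, 10, 6, 3))  # 8.0
```

For terms of length 2 or more both scores are positive, but MC-value still boosts them by one extra length unit, which is consistent with the later observation that MC-value ranks shorter terms relatively higher than GM or FGM do.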
      <Paragraph position="22"/>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experiment
</SectionTitle>
      <Paragraph position="0"> In our experiment, we use the NTCIR1 TMREC test collection (Kageura et al. 1999). As an activity of TMREC, they have provided us with a Japanese test collection for a term recognition task. The goal of this task is to automatically recognize and extract terms from a text corpus which contains 1,870 abstracts gathered from the computer science and communication engineering domains of the NACSIS Academic Conference Database, together with 8,834 manually collected correct terms. The TMREC text corpus is morphologically analyzed and POS tagged by hand. From this POS tagged text, we extract uninterrupted noun sequences as term candidates. In total, 16,708 term candidates are extracted, and several scoring methods are applied to them. All extracted term candidates CN are ranked according to their GM(CN,k), FGM(CN,k) and MC-value(CN) in descending order. As for the parameter k of (1) and (2), we choose k=1 because its performance is the best among values of k in the range from 0 to 4. Thus, henceforth, we omit k from GM and FGM, writing GM(CN) and FGM(CN). We use GM(CN) as the baseline.</Paragraph>
      <Paragraph position="1"> In evaluation, we conduct experiments where we pick up the candidates from the highest ranked down to the PNth highest ranked by these three scoring methods, and evaluate the set of selected terms by the number of correct terms within it, which we call CT. In the following figures, we show only CT, since recall is CT/8834, where 8,834 is the number of all correct terms, and precision is CT/PN.</Paragraph>
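The two evaluation measures reduce to simple ratios of CT; as a sketch, using the 7,082-correct-terms-in-top-15,000 figure reported in Section 3.2:

```python
def precision_recall(ct, pn, total_correct=8834):
    # CT = correct terms among the PN highest-ranked candidates;
    # 8,834 is the total number of correct terms in the TMREC collection.
    return ct / pn, ct / total_correct

# FGM result from Section 3.2: 7,082 correct terms in the top 15,000.
p, r = precision_recall(7082, 15000)
print(round(p, 3), round(r, 3))  # 0.472 0.802
```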
      <Paragraph position="2"> Another measure NTCIR1 provides is the set of terms which include a correct term as their part. We call them "longer terms" or LT. They are sometimes valuable terms and also indicate in what context the correct terms are used. We therefore also use the number of longer terms in our evaluation.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> In Figures 3 through 5, the X-axis represents PN. In Figure 3, the Y-axis represents CT, in other words the number of correct terms picked up by GM(CN), and the number of longer terms picked up by GM(CN), for each PN. These are our baseline. Figure 4 shows the difference between CT of FGM(CN) and CT of GM(CN), and the difference between CT of MC-value(CN) and CT of GM(CN), for each PN. Figure 5 shows the difference between LT of GM(CN) and LT of FGM(CN) or LT of MC-value(CN) for each PN. As seen from Figure 4, the FGM based method outperforms MC-value up to the 1,400 highest ranked terms. Since in the domains of the TMREC task, computer science and communication engineering, these 1,400 technical terms are important core terms, the FGM method we propose is very promising for extracting and recognizing domain-specific terms. We also show CT of each method for larger PN, from 3,000 up to 15,000, in Table 1. As seen in these figures and tables, if we want more terms from these domains, MC-value is more powerful, but when PN is larger than 12,000, FGM again outperforms it. As for recognizing longer terms, GM(CN), the baseline, performs best for every PN; MC-value is the worst. From this observation we find that MC-value tends to assign higher scores to shorter terms than GM or FGM do. We are also interested in what kind of term is favored by each method. For this, we show the average length of the highest PN ranked terms of each method in Figure 6, where the length of CN means the number of single-words CN consists of. Clearly, GM prefers longer terms, and so does FGM. On the contrary, MC-value prefers shorter terms. However, as shown in Figure 6, the average length under MC-value fluctuates more, which means GM and FGM have a more consistent tendency in ranking compound nouns. Finally, we compare our results with the NTCIR1 results (Kageura et al. 1999). Unfortunately, since (Kageura et al. 1999) only provides the number of all extracted terms and the number of all extracted correct terms, we could not directly compare our results with those of other NTCIR1 participants. What is important is the fact that we extracted 7,082 correct terms from the top 15,000 term candidates with the FGM method. This indicates that our method shows the highest performance among all participants of the NTCIR1 TMREC task, because 1) the highest number of correct terms within the top 16,000 term candidates among all participants is 6,536, and 2) the highest number of correct terms among all participants is 7,944, but these were extracted from the top 23,270 term candidates, which means extremely low precision.</Paragraph>
      <Paragraph position="1"> (Figure caption fragment: terms by GM(CN), FGM(CN) and MC-value(CN) for each PN)</Paragraph>
    </Section>
  </Section>
</Paper>