<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1112">
  <Title>Chinese Term Extraction from Web Pages Based on Compound word Productivity</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Scoring methods with Simple word Bigrams
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Simple word Bigrams
</SectionTitle>
      <Paragraph position="0"> The relation between a simple word and complex words that include the simple word is very important in terms of term space structure. Nevertheless, to my knowledge, this relation has not been paid enough attention so far except for the method proposed by Nakagawa and Mori (2003).</Paragraph>
      <Paragraph position="1"> In this paper, taking over their works, we focus on compound words among the various types of complex terms. In technical documents, the majority of domain-specific terms are noun phrases or compound words consisting of small size vocabulary of simple words. This observation leads to a new scoring methods that measures how many distinct compound words contain the simple word in question as their part in a given document or corpus. Here, suppose the situation where simple word: N occurs with other simple words as a part of many compound words shown in Figure 1 where [N M] means bigram of noun N and M.</Paragraph>
      <Paragraph position="3"> In Figure 1, [LNi N] (i=1,..,n) and [N RNj] (j=1,...,m) are simple word bigrams which make (a part of) compound words. #Li and #Rj (i=1,..,n and j=1,..,m) mean the frequency of the bigram [LNi N] and [N RNj] in the corpus respectively. Note that since we depict only bigrams, compound words like [LNi N RNj] which contains [LNi N] and/or [N RNj] as their parts might actually occur in a corpus. Note that this noun trigram might be a part of longer compound words. We show an example of a set of noun bigrams. Suppose that we extract compound words including &amp;quot;trigram&amp;quot; as term candidates from a corpus as shown in the following example.</Paragraph>
      <Paragraph position="4"> Example 1.</Paragraph>
      <Paragraph position="5"> trigram statistics, word trigram, class trigram, word trigram, trigram acquisition, word trigram statistics, character trigram Then, noun bigrams consisting of a simple word &amp;quot;trigram&amp;quot; are shown in Figure 2 where the number between ( and ) shows the frequency in the corpus.</Paragraph>
      <Paragraph position="7"> Now we focus on and utilize simple word bigrams to define the scoring function. Note that we are only concerned with simple word bigrams and not with a simple word per se because, as stated before, we are concerned with the relation between a compound word and its component simple words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Scoring Function
3.2.1 Score of simple word
</SectionTitle>
      <Paragraph position="0"> Since there are infinite number of scoring functions based on [LNi N] or [N RNj], we here consider the following simple but representative scoring functions.</Paragraph>
      <Paragraph position="1"> #LDN(N) and #RDN(N) : These are the number of distinct simple words which directly precede or succeed N. These coincide with &amp;quot;n&amp;quot; and &amp;quot;m&amp;quot; in Figure 1 respectively. For instance, in an example shown in Figure 2, #LDN(trigram)=3, #RDN(trigram)=2.</Paragraph>
      <Paragraph position="2"> Using #LDN and #RDN we define LN(N) and RN(N): These are based on the number of occurrence of each noun bigram, and defined for [LNi N] and [N RNj] as follows respectively.</Paragraph>
      <Paragraph position="4"> that directly precede or succeed N. For instance, in an example shown in Figure 2, LN(trigram)=5, and RN(trigram)=3.</Paragraph>
      <Paragraph position="5"> Let's think about the background of these scoring functions. #LDN(N) and #RDN(N), where we do not take into account the frequency of each noun bigram but take into account the number of distinct nouns that adjoin to N to make compound words. That indicates how linguistically and domain dependently productive the noun:N is in a given corpus. That means that if N presents a key and/or basic concept of the domain treated by the corpus, writers in that domain work out many distinct compound words with N to express more complicated concepts. On the other hand, as for LN(N) and RN(N), we also focus on frequency of each noun bigram as well. In other words, statistic bias in actual use of noun:N is, this time, one of our main concern. For example, in Figure 2, LN(trigram,2)=11, and RN(trigram,2)=5. In conclusion, since LN(N) and RN(N) convey more information than #LDN(N) and #RDN(N), we adopt LN(N) and RN(N) in this research.</Paragraph>
      <Paragraph position="6">  The next thing to do is expanding those scoring functions for simple word to the scoring functions for compound words. We adopt a geometric mean for this purpose. Now think of a compound word :</Paragraph>
      <Paragraph position="8"> word. Then a geometric mean: LR of CN is defined as follows.</Paragraph>
      <Paragraph position="10"> LR does not depend on the length of CN where &amp;quot;length&amp;quot; means the number of simple words that consist of CN. This is because since we have not yet had any idea about the relation between the importance of a compound word and a length of the compound word, it is fair to treat all compound words, including simple words, equally no matter how long or short each compound word is.</Paragraph>
      <Paragraph position="11">  We still have not fully utilized the information about statistics of actual use in a corpus in the bigram based methods described in 3.2.1 and 3.2.2. Among various kinds of information about actual use, the important and basic one is the frequency of single-and compound words that occur independently. The term &amp;quot;independently&amp;quot; means that the left and right adjacent words are not nouns. For instance, &amp;quot;word patterns&amp;quot; occurs independently in &amp;quot;we use the word patterns which occur in this sentence.&amp;quot; Since the scoring functions proposed in 3.2.1 is noun bigram statistics, the number of this kind of independent occurrences of nouns themselves have not been used so far. If we take this information into account, the better results are expected. Thus, if a simple word or a compound word occurs independently, the score of the noun is simply multiplied by the number of its independent occurrences. We call this new scoring function as FLR(CN) which is defined as follows.</Paragraph>
      <Paragraph position="12"> if N occurs independently then f(CN)(CN)LR(CN)LRF x= where f(CN) means the number of independent occurrences of noun CN (4)</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Term Extraction for Chinese based on Morphological Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <Paragraph position="0"> If we try to apply the scoring method proposed in section 3 directly to a Chinese text, every word should be POS tagged because we extract multi-word unit of several types of POS tag sequences as candidates of domain specific terms. For this we need a Chinese morphological analyzer because Chinese is an agglutinative language. Actually, we use Chinese morphological analyzer: ICTCLAS(Zhang and Liu 2004). As term candidates, we extract compound word: MWU having the following POS tag sequence expressed in (5). A multi-word-unit: MWU is defined by the following CFG rules where the right hand sides are expressed as a regular expression.</Paragraph>
      <Paragraph position="1"> MWU &lt;--g2[ag a]* [ng n nr ns nt nz nx vn</Paragraph>
      <Paragraph position="3"> where &amp;quot;ag&amp;quot;, &amp;quot;a&amp;quot;, &amp;quot;n&amp;quot;..., &amp;quot;u&amp;quot; are all tags used in ICTCLAS.</Paragraph>
      <Paragraph position="4"> Roughly speaking (5) means an adjective followed by the repetition of [adjective noun particle] followed by a noun. The problem is the ambiguity of POS tagging because the same word is very often used verb as well as noun. In addition, unknown words like newly appeared proper names also impairs the accuracy. Due to this problem caused by morphological analyzer, the accuracy is degraded.</Paragraph>
      <Paragraph position="5"> Once we segment out word sequences conforming the above POS tag sequences, we calculate LN and RN of each component word. In calculation of LN and RN, a word whose POS is c, u or k is omitted. In other words, if a word sequence &amp;quot;w1 w2 w3&amp;quot; where POS of w2 is c u or k, then we calculate RN of w1 and LN of w3 by regarding the word sequence as &amp;quot;w1 w3.&amp;quot; Then we combine LN and RN of each word to calculate FLR by definition of (3) and (4) to sort all extracted candidates in descending order of FLR.</Paragraph>
      <Paragraph position="6"> We apply the proposed methods to 30 Web pages from People's Daily news. The areas are social, international and IT related news. The average length is 592.6 characters. Firstly, we extract relevant terms by hand from each news article and use it as the gold standard. The average number of gold standard terms per one news particle is 15.9 words. Secondly, we extract terms from each news article and sort them in descending order by the proposed method and evaluate them by a precision of top N terms defined as follows. CT(K)= 1 if Kth term is one of the gold standard terms.</Paragraph>
      <Paragraph position="8"> where N is the number of the gold standard terms, and in our experiment, N=20. Precision(K), where K=1,..,20, are shown in Figure 3 as &amp;quot;Strict.&amp;quot; We also use another precision rate precision ' which is not strict and defined as follows.</Paragraph>
      <Paragraph position="9"> CTpart(K)= 1 if one of gold standard terms.</Paragraph>
      <Paragraph position="10"> is a part of Kth term  From Figure 3, we see that If we pick up the ten highest ranked terms, about 75% of them meet the gold standard. The case we loosen the definition of precision shows better than the strict case of (6) but the difference is not so large. That means that the proposed word based ranking method works very well to extract important Chinese terms from news articles.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Character based Term Extraction
</SectionTitle>
    <Paragraph position="0"> There are several reasons why we would like to develop a term extraction system without morphological analyzer.</Paragraph>
    <Paragraph position="1"> The first reason is that the accuracy of morphological analyzer is, in spite of the great advancement of these years, still around 95% (GuoDong and Jian 2003).</Paragraph>
    <Paragraph position="2"> The second reason is that there possibly exist terminologies with unexpected POS sequences. If we deal only with academic papers or technical documents, we expect POS sequences of terminologies with high accuracy. However, if we consider terminology extraction from Web pages, the possibility of unexpected POS sequence may rise.</Paragraph>
    <Paragraph position="3"> The third reason is language independency.</Paragraph>
    <Paragraph position="4"> Currently proposed and/or used morphological analyzers heavily depend either upon the sophisticated linguistic knowledge about the target language or upon a big size corpus of the target language if machine learning is employed. These linguistic resources, however, are not always available.</Paragraph>
    <Paragraph position="5"> Due to these reasons, we also developed term candidate extraction system which does not use a morphological analyzer. Instead of morphological analyzer, we try to employ a stop word list. In Chinese, as stop words, we find many character unigrams and bigrams because one Chinese character conveys larger amount of information than a character of Latin alphabet. They are partly shown in Appendix A.</Paragraph>
    <Paragraph position="6"> As term candidates, we simply extract character strings between two stop words that are nearest each other within a sentence. Obviously, the character strings thus extracted are not necessarily meaningful compound words. Therefore we cannot directly use these strings as words to calculate LN and RN function. Back to the idea that Chinese characters are ideograms, we come up to the idea that we calculate LN and RN of each character appearing within every character strings extracted as candidates. An example is shown in Figure 4.  In calculation of LN and RN, we neglect characters whose POS are c ,u or k as same as we did in morphological analyzer based method.</Paragraph>
    <Paragraph position="7"> Once we calculate LN and RN of each character, FLR of every character string is calculated as defined by (3) and (4) to sort them in descending order of FLR.</Paragraph>
    <Paragraph position="8"> Actually this idea is very similar with left and right entropy used to extract two character Chinese words from a corpus (Luo and Sun. 2003).</Paragraph>
    <Paragraph position="9"> However what we would like to extract is a set of longer compound words or even phrases used in a Web page. Moreover we only use the Web page and do not use any other language resources such as a big corpus at all due to the reason described above in this section.</Paragraph>
    <Paragraph position="10"> We evaluate the proposed character based extraction method against the same Web pages from People's Daily news used in Morphological Analysis based method described in Section 4. We also use the same gold standard terms described in Section 4 for evaluation. The strict and partly precision defined by (6) and (7) are used. The result is shown in Figure 5.</Paragraph>
    <Paragraph position="11">  Comparing Figure 3 with Figure 5, apparently the result of extracted terms of word based method is better than that of character based method. However, it does not necessarily mean that the character based term extraction is inferior.</Paragraph>
    <Paragraph position="12"> If you take a glance at the stop word list of Appendix A, it seems that several of the stop words are selected mainly from words in auxiliary verbs, pronouns, adverbs, particles, prepositions, conjunctions, exclamations, onomatopoeic words and punctuation marks. However, in reality, our selection is based rather on meaning, usage and generally frequency of use than parts of speech. Thus some of them are not function words but content words in order to exclude non-domain-specific words. Actually, the stop words are not only character unigram but character bigram.</Paragraph>
    <Paragraph position="13"> Because Chinese character is ideograph and each character may have plural meanings, it is difficult only to use character unigram as a stop word in Chinese.</Paragraph>
    <Paragraph position="14"> Our method based on these viewpoints resulted in getting an interesting consequence. We show an example of news article and extracted terms from it by this method in Appendix B and Appendix C.</Paragraph>
    <Paragraph position="15"> This news article is entitled &amp;quot;The Culture of Tibetan Web Site is formally created.&amp;quot; Let's take a look at an underlined sentence in this short article and underlined terms extracted from there. This sentence says: According to the introduction, The Culture of Tibetan Web Site is a site of special pure culture for the purpose of &amp;quot;investigating the essence of Tibetan culture, showing the scale of Tibetan culture and raising the spirit of Tibetan culture&amp;quot;. In the case of method based on stop word list, we can extract compound term of &amp;quot;investigating the essence of Tibetan culture &amp;quot;, &amp;quot;showing the scale of Tibetan culture ( g12&amp;quot;g15g3 &amp;quot;raising the spirit of Tibetan culture ( g12&amp;quot;g3 and so ong3 from this sentence. On the contrary, by the term extraction method based on morphological analysis, gerund , for example, &amp;quot;showing( )&amp;quot; and &amp;quot;raising ( ), can not be extracted.</Paragraph>
    <Paragraph position="16"> We said that the majority of domain specific terms are noun phrases or compound words consisting of small size vocabulary of simple words as stated in section 3. So we especially have paid attention to relation among nouns.</Paragraph>
    <Paragraph position="17"> However most of Chinese nouns can also be used as verbs. Moreover inflection of Chinese verbs can hardly be recognized visually. It is difficult to distinguish verb from noun by morphological analysis. Certainly ICTCLAS classifies the character that has meaning of both verb and noun into the category of vn (verb and noun). But gerunds and verbal noun infinitives are not contained in vn. For instance, &amp;quot; &amp;quot; means not only &amp;quot;write a letter&amp;quot; but &amp;quot;writing letter.&amp;quot; Thus we have to pay attention to verbs in Chinese too. Only by tuning up stop word list, we can take gerunds and verbal noun infinitives into account to some extent. Appendix C shows one of the evidence of this observation.</Paragraph>
  </Section>
class="xml-element"></Paper>