<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1063">
  <Title>Text Chunking by Combining Hand-Crafted Rules and Memory-Based Learning</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Chunking Korean
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the structure of the chunking model for Korean. The main idea of this model is to apply rules to determine the chunk type of a word w i in a sentence, and then to refer to a memory based classifier in order to check whether it is an exceptional case of the rules. In the training phase, each sentence is analyzed by the rules and the predicted chunk type is compared with the true chunk type. In case of misprediction, the error type is determined according to the true chunk type and the predicted chunk type. The mispredicted chunks are stored in the error case library with their true chunk types. Since the error case library accumulates only the exceptions of the rules, the number of cases in the library is small if the rules are general enough to represent the instance space well.</Paragraph>
    <Paragraph position="1"> The classification phase in Figure 1 is expressed as a procedure in Figure 2. It determines the chunk type of a word w i given with the context C i . First of all, the rules are applied to determine the chunk type. Then, it is checked whether C i is an exceptional case of the rules. If it is, the chunk type determined by the rules is discarded and is determined again by the memory based reasoning. The condition to make a decision of exceptional case is whether the similar-</Paragraph>
    <Paragraph position="3"> , and the threshold t Output : a chunk type c</Paragraph>
    <Paragraph position="5"> memory based learning.</Paragraph>
    <Paragraph position="6"> case library is larger than the threshold t. Since the library contains only the exceptional cases, the more</Paragraph>
    <Paragraph position="8"> to the nearest instance, the more probable is it an exception of the rules.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Chunking by Rules
</SectionTitle>
    <Paragraph position="0"> There are four basic phrases in Korean: noun phrase (NP), verb phrase (VP), adverb phrase (ADVP), and independent phrase (IP). Thus, chunking by rules is divided into largely four components.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Noun Phrase Chunking
</SectionTitle>
      <Paragraph position="0"> When the part-of-speech of w i is one of determiner, noun, and pronoun, there are only seven rules to determine the chunk type of w i due to the well- null developed postpositions of Korean. 1. If POS(w</Paragraph>
      <Paragraph position="2"> does not have a postposition Then y</Paragraph>
      <Paragraph position="4"> does not have a postposition Then y</Paragraph>
      <Paragraph position="6"> has a possessive postposition Then y</Paragraph>
      <Paragraph position="8"> has a relative postfix Then y</Paragraph>
      <Paragraph position="10"> has a relative ending Then y</Paragraph>
      <Paragraph position="12"> B-NP represents the first word of a noun phrase, while I-NP is given to other words in the noun phrase.</Paragraph>
      <Paragraph position="13"> Since determiners, nouns and pronouns play the similar syntactic role in Korean, they form a noun phrase when they appear in succession without post-position (Rule 1-3). The words with postpositions become the end of a noun phrase, but there are only two exceptions. When the type of a postposition is possessive, it is still in the mid of noun phrase (Rule 4). The other exception is a relative postfix 'B2CWA3(jeok)' (Rule 5). Rule 6 states that a simple relative clause with no sub-constituent also constitutes a noun phrase. Since the adjectives of Korean have no definitive usage, this rule corresponds to the definitive usage of the adjectives in English.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Verb Phrase Chunking
</SectionTitle>
      <Paragraph position="0"> The verb phrase chunking has been studied for a long time under the name of compound verb processing in Korean and shows relatively high accuracy. Shin used a finite state automaton for verb phrase chunking (Shin, 1999), while K.-C. Kim used knowledge-based rules (Kim et. al, 1995). For the consistency with noun phrase chunking, we use the rules in this paper. The rules used are the ones proposed by (Kim et. al, 1995) and the further explanation on the rules is skipped. The number of the rules used is 29.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.3 Adverb Phrase Chunking
</SectionTitle>
      <Paragraph position="0"> When the adverbs appear in succession, they have a great tendency to form an adverb phrase. Though an adverb sequence is not always one adverb phrase, it usually forms one phrase. Table 1 shows this empirically. The usage of the successive adverbs is investigated from STEP 2000 dataset  where 270 cases are observed. The 189 cases among them form a phrase whereas the remaining 81 cases form two phrases independently. Thus, it can be said that the possibility that an adverb sequence forms a phrase is far higher than the possibility that it forms two phrases.</Paragraph>
      <Paragraph position="1"> When the part-of-speech of w i is an adjective, its chunk type is determined by the following rule.  1. If POS(w</Paragraph>
      <Paragraph position="3"> This dataset will be explained in Section 5.1.</Paragraph>
      <Paragraph position="4">  forms a chunk.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.4 Independent Phrase Chunking
</SectionTitle>
      <Paragraph position="0"> There is no special rule for independent phrase chunking. It can be done only through knowledge base that stores the cases where independent phrases take place. We designed 12 rules for independent phrases.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Chunking by Memory-Based Learning
</SectionTitle>
    <Paragraph position="0"> Memory-based learning is a direct descent of the k-Nearest Neighbor (k-NN) algorithm (Cover and Hart, 1967). Since many natural language processing (NLP) problems have constraints of a large number of examples and many attributes with different relevance, memory-based learning uses more complex data structure and different speedup optimization from the k-NN.</Paragraph>
    <Paragraph position="1"> It can be viewed with two components: a learning component and a similarity-based performance component. The learning component involves adding training examples to memory, where all examples are assumed to be fixed-length vectors of n attributes. The similarity between an instance x and all examples y in memory is computed using a distance metric, [?](x,y). The chunk type of x is then determined by assigning the most frequent category within the k most similar examples of x.</Paragraph>
    <Paragraph position="2"> The distance from x and y, [?](x,y) is defined to</Paragraph>
    <Paragraph position="4"> is determined by information gain (Quinlan, 1993), the k-NN algorithm with this metric is called IB1-IG (Daelemans et. al, 2001). All the experiments performed by memory-based learning in this paper are done with IB1-IG.</Paragraph>
    <Paragraph position="5"> Table 2 shows the attributes of IB1-IG for chunking Korean. To determine the chunk type of a word</Paragraph>
    <Paragraph position="7"> , the lexicons, POS tags, and chunk types of surrounding words are used. For the surrounding words, three words of left context and three words of right context are used for lexicons and POS tags, while two words of left context are used for chunk types. Since chunking is performed sequentially, the chunk types of the words in right context are not known in determining the chunk type of w</Paragraph>
    <Paragraph position="9"/>
  </Section>
  <Section position="6" start_page="1" end_page="3" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="3" type="sub_section">
      <SectionTitle>
5.1 Dataset
</SectionTitle>
      <Paragraph position="0"> For the evaluation of the proposed method, all experiments are performed on STEP 2000 Korean Chunking dataset (STEP 2000 dataset)  . This dataset is derived from the parsed corpus, which is a product of STEP 2000 project supported by Korean government. The corpus consists of 12,092 sentences with 111,658 phrases and 321,328 words, and the vocabulary size is 16,808. Table 3 summarizes the information on the dataset.</Paragraph>
      <Paragraph position="1"> The format of the dataset follows that of CoNLL2000 dataset (CoNLL, 2000). Figure 3 shows an example sentence in the dataset  . Each word in the dataset has two additional tags, which are a part-of-speech tag and a chunk tag. The part-of-speech tags are based on KAIST tagset (Yoon and Choi, 1999). Each phrase can have two kinds of chunk types: B-XP and I-XP. In addition to them, there is O chunk type that is used for words which are not part of any chunk. Since there are four types of phrases and one additional chunk type O, there exist nine chunk types.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.2 Performance of Chunking by Rules
</SectionTitle>
      <Paragraph position="0"> Table 4 shows the chunking performance when only the rules are applied. Using only the rules gives 97.99% of accuracy and 91.87 of F-score. In spite of relatively high accuracy, F-score is somewhat low.</Paragraph>
      <Paragraph position="1"> Because the important unit of the work in the applications of text chunking is a phrase, F-score is far more important than accuracy. Thus, we have much room to improve in F-score.</Paragraph>
      <Paragraph position="3"> labeled chunk type.</Paragraph>
      <Paragraph position="4"> Table 5 shows the error types by the rules and their distribution. For example, the error type 'B-ADVP I-ADVP' contains the errors whose true label is B-ADVP and that are mislabeled by I-ADVP. There are eight error types, but most errors are related with noun phrases. We found two reasons for this: 1. It is difficult to find the beginning of noun phrases. All nouns appearing successively without postpositions are not a single noun phrase. But, they are always predicted to be single noun phrase by the rules, though they can be more than one noun phrase.</Paragraph>
      <Paragraph position="5"> 2. The postposition representing a noun coordination, 'H0BO (wa)' is very ambiguous. When 'H0BO (wa)' is representing the coordination, the chunk types of it and its next word should be &amp;quot;I-NP I-NP&amp;quot;. But, when it is just an adverbial postposition that implies 'with' in English, the chunk types should be &amp;quot;I-NP B-NP&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.3 Performance of Machine Learning
Algorithms
</SectionTitle>
      <Paragraph position="0"> Table 6 gives the 10-fold cross validation result of three machine learning algorithms. In each fold, the corpus is divided into three parts: training (80%), held-out (10%), test (10%). Since held-out set is used only to find the best value for the threshold t in the combined model, it is not used in measuring the performance of machine learning algorithms.</Paragraph>
      <Paragraph position="1"> The machine learning algorithms tested are (i) memory-based learning (MBL), (ii) decision tree, and (iii) support vector machines (SVM). We use C4.5 release 8 (Quinlan, 1993) for decision tree induction and SVM light (Joachims, 1998) for support vector machines, while TiMBL (Daelemans et. al, 2001) is adopted for memory-based learning. Decision trees and SVMs use the same attributes with memory-based learning (see Table 2). Two of the algorithms, memory-based learning and decision tree, show worse performance than the rules. The F-scores of memory-based learning and decision tree are 91.38 and 91.36 respectively, while that of the rules is 91.87 (see Table 4). On the other hand, support vector machines present a slightly better performance than the rules. The F-score of support vector machine is 92.54, so the improvement over the rules is just 0.67.</Paragraph>
      <Paragraph position="2"> Table 7 shows the weight of attributes when only memory-based learning is used. Each value in this table corresponds to a i in calculating [?](x,y). The more important is an attribute, the larger is the weight of it. Thus, the most important attribute among 17 attributes is C i[?]1 , the chunk type of the previous word. On the other hand, the least important attributes are W  combining the rules and the memory-based learning. The average accuracy is 98.21+-0.43.</Paragraph>
      <Paragraph position="3"> is, the order of important lexical attributes is  same phenomenon is found in part-of-speech (POS) and chunk type (C). In comparing the part-of-speech information with the lexical information, we find out that the part-of-speech is more important. One possible explanation for this is that the lexical information is too sparse.</Paragraph>
      <Paragraph position="4"> The best performance on English reported is 94.13 in F-score (Zhang et. al, 2001). The reason why the performance on Korean is lower than that on English is the curse of dimensionality. That is, the wider context is required to compensate for the free order of Korean, but it hurts the performance (Cherkassky and Mulier, 1998).</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.4 Performance of the Hybrid Method
</SectionTitle>
      <Paragraph position="0"> Table 8 shows the final result of the proposed method. The F-score is 94.21 on the average which is improvement of 2.34 over the rules only, 1.67 over support vector machines, and 2.83 over memory-based learning. In addition, this result is as high as the performance on English (Zhang et. al, 2001).</Paragraph>
      <Paragraph position="1">  by combining the rules and MBL.</Paragraph>
      <Paragraph position="2"> The threshold t is set to the value which produces the best performance on the held-out set. The total sum of all weights in Table 7 is 2.48. This implies that when we set t&gt;2.48, only the rules are applied since there is no exception with this threshold. When t =0.00, only the memory-based learning is used. Since the memory-based learning determines the chunk type of w i based on the exceptional cases of the rules in this case. the performance is poor with t =0.00. The best performance is obtained when t is near 1.94.</Paragraph>
      <Paragraph position="3"> Figure 4 shows how much F-score is improved for each kind of phrases. The average F-score of noun phrase is 94.54 which is far improved over that of the rules only. This implies that the exceptional cases of the rules for noun phrase are well handled by the memory-based learning. The performance is much improved for noun phrase and verb phrase, while it remains same for adverb phrases and independent phrases. This result can be attributed to the fact that there are too small number of exceptions for adverb phrases and independent phrases. Because the accuracy of the rules for these phrases is already high enough, most cases are covered by the rules. Memory based learning treats only the exceptions of the rules, so the improvement by the proposed method is low for the phrases.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="3" end_page="3" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> In order to make the proposed method practical and applicable to other NLP problems, the following issues are to be discussed: 1. Why are the rules applied before the memory-based learning? When the rules are efficient and accurate enough to begin with, it is reasonable to apply the rules first (Golding and Rosenbloom, 1996). But, if they were deficient in some way, we should have applied the memory-based learning first.</Paragraph>
    <Paragraph position="1"> 2. Why don't we use all data for the machine learning method? In the proposed method, memory-based learning is used not to find a hypothesis for interpreting whole data space but to handle the exceptions of the rules. If we use all data for both the rules and memory-based learning, we have to weight the methods to combine them. But, it is difficult to know the weights of the methods. 3. Why don't we convert the memory-based learning to the rules? Converting between the rules and the cases in the memory-based learning tends to yield inefficient or unreliable representation of rules. The proposed method can be directly applied to the problems other than chunking Korean if the proper rules are prepared. The proposed method will show better performance than the rules or machine learning methods alone.</Paragraph>
  </Section>
class="xml-element"></Paper>