<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1063">
  <Title>Text Chunking by Combining Hand-Crafted Rules and Memory-Based Learning</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Text chunking has been one of the most interesting problems in natural language learning community since the first work of (Ramshaw and Marcus, 1995) using a machine learning method. The main purpose of the machine learning methods applied to this task is to capture the hypothesis that best determine the chunk type of a word, and such methods have shown relatively high performance in English (Kudo and Matsumoto, 2000; Zhang et. al, 2001).</Paragraph>
    <Paragraph position="1"> In order to do it, various kinds of information, such as lexical information, part-of-speech and grammatical relation, of the neighboring words is used. Since the position of a word plays an important role as a syntactic constraint in English, the methods are successful even with local information.</Paragraph>
    <Paragraph position="2"> However, these methods are not appropriate for chunking Korean and Japanese, because such languages have a characteristic of partially free wordorder. That is, there is a very weak positional constraint in these languages. Instead of positional constraints, they have overt postpositions that restrict the syntactic relation and composition of phrases.</Paragraph>
    <Paragraph position="3"> Thus, unless we concentrate on the postpositions, we must enlarge the neighboring window to get a good hypothesis. However, enlarging the window size will cause the curse of dimensionality (Cherkassky and Mulier, 1998), which results in the deficiency in the generalization performance.</Paragraph>
    <Paragraph position="4"> Especially in Korean, the postpositions and the endings provide important information for noun phrase and verb phrase chunking respectively. With only a few simple rules using such information, the performance of chunking Korean is as good as the rivaling other inference models such as machine learning algorithms and statistics-based methods (Shin, 1999). Though the rules are approximately correct for most cases drawn from the domain on which the rules are based, the knowledge in the rules is not necessarily well-represented for any given set of cases. Since chunking is usually processed in the earlier step of natural language processing, the errors made in this step have a fatal influence on the following steps. Therefore, the exceptions that are ignored by the rules must be com- null pensated for by some special treatments of them for higher performance.</Paragraph>
    <Paragraph position="5"> To solve this problem, we have proposed a combining method of the rules and the k-nearest neighbor (k-NN) algorithm (Park and Zhang, 2001). The problem in this method is that it has redundant k-NNs because it maintains a separate k-NN for each kind of errors made by the rules. In addition, because it applies a k-NN and the rules to each examples, it requires more computations than other inference methods.</Paragraph>
    <Paragraph position="6"> The goal of this paper is to provide a new method for chunking Korean by combining the hand-crafted rules and a machine learning method. The chunk type of a word in question is determined by the rules, and then verified by the machine learning method.</Paragraph>
    <Paragraph position="7"> The role of the machine learning method is to determine whether the current context is an exception of the rules. Therefore, a memory-based learning (MBL) is used as a machine learning method that can handle exceptions efficiently (Daelemans et. al, 1999).</Paragraph>
    <Paragraph position="8"> The rest of the paper is organized as follows. Section 2 explains how the proposed method works.</Paragraph>
    <Paragraph position="9"> Section 3 describes the rule-based method for chunking Korean and Section 4 explains chunking by memory-based learning. Section 5 presents the experimental results. Section 6 introduces the issues for applying the proposed method to other problems.</Paragraph>
    <Paragraph position="10"> Finally, Section 7 draws conclusions.</Paragraph>
  </Section>
class="xml-element"></Paper>