<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0112"> <Title>A Hybrid Approach to Chinese Base Noun Phrase Chunking</Title> <Section position="3" start_page="0" end_page="87" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Chunking means extracting non-overlapping segments from a stream of data; these segments are called chunks (Dirk and Satoshi, 2003). A base noun phrase (base NP) is a simple, non-recursive noun phrase that does not contain other noun phrases as descendants. Base NP chunking can serve as a precursor to more elaborate natural language processing tasks, such as information retrieval, named entity extraction and text summarization. Many problems beyond text processing can also benefit from base NP chunking, for example finding genes in DNA and phoneme information extraction.</Paragraph>
<Paragraph position="1"> The initial work on base NP chunking focused on grammar-based methods. Ramshaw and Marcus (1995) introduced a transformation-based learning method that treated chunking as a kind of tagging problem. Their work inspired many others to study the application of learning methods to noun phrase chunking.</Paragraph>
<Paragraph position="2"> Cardie and Pierce (1998, 1999) applied a scoring method to select new rules and a naive heuristic for matching rules to evaluate the accuracy of the results. CoNLL-2000 proposed a shared task (Tjong and Buchholz, 2000) aimed at dividing a text into syntactically correlated parts of words.</Paragraph>
<Paragraph position="3"> The eleven systems submitted to the CoNLL-2000 shared task used a wide variety of machine learning methods. The best system in this shared task was based on Support Vector Machines (Kudo and Matsumoto, 2000).</Paragraph>
<Paragraph position="4"> Recently, new statistical techniques such as CRFs (Lafferty et al., 2001) and structural learning methods (Ando and Zhang, 2005) have been applied to base NP chunking. (Fei and Fernando, 2003) treated chunking as a sequence labeling task and achieved good performance with an improved training method for CRFs. (Ando and Zhang, 2005) presented a novel semi-supervised learning method for chunking and produced performance higher than the previous best results.</Paragraph>
<Paragraph position="5"> The research on Chinese base NP chunking, however, is still at a developing stage. Researchers have applied methods similar to those used for English base NP chunking to Chinese. Zhao and Huang (1998) gave a strict definition of Chinese base NPs and put forward a quasi-dependency model to analyze the structure of Chinese base NPs. Other methods have been used for Chinese phrase (not only base NP) chunking, such as HMMs (Heng Li et al., 2003), Maximum Entropy (Zhou Yaqian et al., 2003) and Memory-Based Learning (Zhang and Zhou, 2002).</Paragraph>
<Paragraph position="6"> However, according to our experiments over 30,000 Chinese words, the best results for Chinese base NP chunking are about 5% lower than those for English chunking (although we should admit that chunking outcomes vary with corpus size and depend on the details of the experiments). The differences between Chinese NPs and English NPs can be summarized in the following points. First, the flexible structure of Chinese noun phrases often causes ambiguities during recognition; for example, many English base NPs begin with a determiner, while the boundary of a Chinese base NP is more uncertain. Second, when a base NP begins with two or more noun modifiers, such as "Gao (high)/JJ Xin (new)/JJ Ji Zhu (technology)/NN", the noun modifier "Gao/JJ" cannot always be completely recognized. Third, the usage of Chinese words is flexible: a Chinese word may carry multiple POS (part-of-speech) tags, for example a noun may serve as a verbal or adjectival component in a sentence, and such multi-use words confuse the chunker. Finally, there are no standard datasets and evaluation systems for Chinese base NP chunking comparable to the CoNLL-2000 shared task, which makes it difficult to compare and evaluate different Chinese base NP chunking systems.</Paragraph>
<Paragraph position="7"> In this paper, we propose a hybrid approach that extracts Chinese base NPs with the help of conditional probabilities derived from the CRF algorithm and some appropriate grammar rules.</Paragraph>
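<Paragraph> To make the data representation and the hybrid idea concrete, the following minimal Python sketch is our own illustration rather than the system described in the later sections. It assumes that per-token marginal probabilities over B/I/O chunk labels are already available from a CRF toolkit; the function name repair_bio, the POS set np_starters, the single grammar rule and the probability values are all hypothetical.

    # Illustrative sketch only: combine per-token marginal probabilities from
    # a CRF chunker with one simple grammar rule that repairs inconsistent
    # B/I/O label sequences. The marginals below are hypothetical placeholders;
    # in practice they would come from a CRF toolkit.

    def repair_bio(tokens, pos_tags, marginals):
        """Pick the most probable B-NP/I-NP/O label per token, then relabel any
        I-NP that does not follow B-NP or I-NP, preferring B-NP when the token's
        POS tag can plausibly start a base NP (JJ, NN, NR in this sketch)."""
        np_starters = {"JJ", "NN", "NR"}
        labels = [max(m, key=m.get) for m in marginals]
        for i, label in enumerate(labels):
            if label == "I-NP":
                prev_ok = i > 0 and labels[i - 1] in ("B-NP", "I-NP")
                if not prev_ok:
                    labels[i] = "B-NP" if pos_tags[i] in np_starters else "O"
        return list(zip(tokens, labels))

    # Example: "Gao (high)/JJ Xin (new)/JJ Ji Zhu (technology)/NN"
    tokens = ["Gao", "Xin", "JiZhu"]
    pos_tags = ["JJ", "JJ", "NN"]
    marginals = [                      # hypothetical CRF marginals
        {"B-NP": 0.30, "I-NP": 0.45, "O": 0.25},
        {"B-NP": 0.20, "I-NP": 0.70, "O": 0.10},
        {"B-NP": 0.05, "I-NP": 0.90, "O": 0.05},
    ]
    print(repair_bio(tokens, pos_tags, marginals))
    # [('Gao', 'B-NP'), ('Xin', 'I-NP'), ('JiZhu', 'I-NP')]

In this toy run the most probable label for "Gao" is an inconsistent I-NP, and the rule promotes it to B-NP so that the full base NP "Gao Xin Ji Zhu" is recovered; the actual rules and their interaction with the CRF model are described in the following sections.</Paragraph>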
<Paragraph position="8"> According to our preliminary experiments with SVMs and CRFs, our approach outperforms both of them.</Paragraph>
<Paragraph position="9"> The remainder of the paper is organized as follows. Section 2 gives a brief introduction to the data representations and methods. We explain the motivation for the hybrid approach in Section 3. The experimental results and conclusions are presented in Section 4 and Section 5, respectively.</Paragraph> </Section> </Paper>