
<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4005">
  <Title>Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach</Title>
  <Section position="2" start_page="0" end_page="532" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> This article is intended to address, with a unified and pragmatic approach, two fundamental questions in Chinese natural language processing (NLP): What is a 'word' in Chinese?, and How does a computer identify Chinese words automatically? Our approach is distinguished from most previous approaches by the following three unique [?] Natural Language Computing Group, Microsoft Research Asia, 5F, Sigma Center, No. 49, Zhichun Road, Beijing, 100080, China. E-mail: jfgao@microsoft.com, muli@microsoft.com, cnhuang@msrchina.research. microsoft.com.</Paragraph>
    <Paragraph position="1"> + The work reported in this article was done while the author was at Microsoft Research. His current e-mail address is andi.wu@grapecity.com.</Paragraph>
    <Paragraph position="2"> Submission received: 22 November 2004; revised submission received: 20 April 2005; accepted for publication: 17 June 2005.</Paragraph>
    <Paragraph position="3"> (c) 2006 Association for Computational Linguistics Computational Linguistics Volume 31, Number 4 components that are integrated into a single model: a taxonomy of Chinese words, a unified approach to word breaking and unknown word detection, and a customizable display of word segmentation.</Paragraph>
    <Paragraph position="4">  We will describe each of these in turn.</Paragraph>
    <Paragraph position="5"> Chinese word segmentation is challenging because it is often difficult to define what constitutes a word in Chinese. Theoretical linguists have tried to define Chinese words using various linguistic criteria (e.g., Packard 2000). While each of those criteria provides valuable insights into &amp;quot;word-hood&amp;quot; in Chinese, they do not consistently lead us to the same conclusions. Fortunately, this may not be a serious issue in computational linguistics, where the definition of words can vary and can depend to a large degree upon how one uses and processes these words in computer applications (Sproat and Shih 2002).</Paragraph>
    <Paragraph position="6"> In this article, we define the concept of Chinese words from the viewpoint of computational linguistics. We develop a taxonomy in which Chinese words can be categorized into one of the following five types: lexicon words, morphologically derived words, factoids, named entities, and new words.</Paragraph>
    <Paragraph position="7">  These five types of words have different computational properties and are processed in different ways in our system, as will be described in detail in Section 3. Two of these five types, factoids and named entities, are not important to theoretical linguists but are significant in NLP. Chinese word segmentation involves mainly two research issues: word boundary disambiguation and unknown word identification. In most of the current systems, these are considered to be two separate tasks and are dealt with using different components in a cascaded or consecutive manner.</Paragraph>
    <Paragraph position="8"> However, we believe that these two issues are not separate in nature and are better approached simultaneously. In this article, we present a unified approach to the five fundamental features of word-level Chinese NLP (corresponding to the five types of words described earlier): (1) word breaking, (2) morphological analysis, (3) factoid detection, (4) named entity recognition (NER), and (5) new word identification (NWI). This approach is based on a mathematical framework of linear mixture models in which component models are inspired by the source-channel models of Chinese sentence generation. There are basically two types of component models: a source model and a set of channel models. The source model is used to estimate the generative probability of a word sequence in which each word belongs to one word type. For each of the word types, a channel model is used to estimate the likelihood of a character string, given the word type. We shall show that this framework is flexible enough to incorporate a wide variety of linguistic knowledge and statistical models in a unified way.</Paragraph>
    <Paragraph position="9"> In computer applications, we are more concerned with segmentation units than words. While words are supposed to be unambiguous and static linguistic entities, segmentation units are expected to vary from application to application. In fact, different Chinese NLP-enabled applications may have different requirements that request different granularities of word segmentation. For example, automatic speech recognition (ASR) systems prefer longer &amp;quot;words&amp;quot; to achieve higher accuracy, whereas in1 In this article, we differentiate the terms word breaking and word segmentation. Word breaking refers to the process of segmenting known words that are predefined in a lexicon. Word segmentation refers to the process of both lexicon word segmentation and unknown word detection.</Paragraph>
    <Paragraph position="10"> 2 New words in this article refer to out-of-vocabulary words that are neither recognized as named entities or factoids nor derived by morphological rules. These words are mostly domain-specific and/or time-sensitive (see Section 5.5 for details).</Paragraph>
    <Paragraph position="11">  Gao et al. Chinese Word Segmentation: A Pragmatic Approach formation retrieval (IR) systems prefer shorter &amp;quot;words&amp;quot; to obtain higher recall rates (Wu 2003).</Paragraph>
    <Paragraph position="12"> Therefore, we do not assume that an application-independent universal word segmentation standard exists. We argue instead for the existence of multiple segmentation standards, each for a specific application. It is undesirable to develop a set of application-specific segmenters. A better solution would be to develop a generic segmenter with customizable output that is able to provide alternative segmentation units according to the specification that is either predefined or implied in the application data. To achieve this, we present a transformation-based learning (TBL; Brill 1995) method, to be described in Section 6.</Paragraph>
    <Paragraph position="13"> We implement the pragmatic approach to Chinese word segmentation in an adaptive Chinese word segmenter called MSRSeg. It consists of two components: (1) a generic segmenter that is based on the linear mixture model framework of word breaking and unknown word detection and that can adapt to domain-specific vocabularies, and (2) a set of output adaptors for adapting the output of (1) to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets. It thus demonstrates the possibility of a single adaptive Chinese word segmenter that is capable of supporting multiple applications.</Paragraph>
    <Paragraph position="14"> The remainder of this article is organized as follows. Section 2 presents previous work in this field. Section 3 introduces the taxonomy of Chinese words and describes the corpora we used in our study. Section 4 presents some of the theoretical background on which our unified approach is based. Section 5 outlines the general architecture of the Chinese word segmenter, MSRSeg, and describes each of the components in detail, presenting a separate evaluation of each component where appropriate. Section 6 presents the TBL method of standards adaptation. While in Section 5 we presume the existence of an annotated training corpus, we focus in Section 7 on the methods of creating training data in a (semi-)automatic manner, with minimal or no human annotation. We thus demonstrate the possibilities of unsupervised learning of Chinese words. Section 8 presents several evaluations of the system on the different corpora, each corresponding to a different segmentation standard, in comparison with other state-of-the-art systems. Finally, we conclude the article in Section 9.</Paragraph>
  </Section>
class="xml-element"></Paper>