<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4005"> <Title>Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach</Title> <Section position="4" start_page="535" end_page="543" type="intro"> <SectionTitle> 3. Chinese Words </SectionTitle> <Paragraph position="0"> This section defines Chinese words at three levels. We begin with a taxonomy by which Chinese words are categorized into five main types according to the way they are processed and used in realistic systems. Second, we develop the MSR standard, a set of specific rules that guides human annotators in segmenting Chinese sentences. Finally, we describe the development of a gold test set and how we evaluate Chinese word segmenters. Here, we use the term &quot;gold test set&quot; to refer to the manually annotated corpus, built according to the MSR standard on top of the &quot;test corpus,&quot; which is the raw text corpus.</Paragraph> <Section position="1" start_page="535" end_page="538" type="sub_section"> <SectionTitle> 3.1 Taxonomy </SectionTitle> <Paragraph position="0"> The taxonomy of Chinese words is summarized in Table 1, where Chinese words are categorized into five types: entries in a lexicon (or lexicon words, LWs), morphologically derived words (MDWs), factoids (FTs), named entities (NEs), and new words (NWs). These five types of words have different functions in Chinese NLP and are processed in different ways in our system. For example, a plausible word segmentation for the sentence in Figure 1(a) is shown in the same figure. Figure 1(b) is the output of our system, where words of different types are processed in different ways: - For LWs, word boundaries are detected, e.g., 教授 'professor'.</Paragraph> <Paragraph position="1"> - For MDWs, their morphological patterns and stems are detected, e.g., 朋友们 'friend+s' is derived by affixation of the plural affix 们 to the noun stem 朋友 (MA S indicates a suffixation pattern), and 高高兴兴 'happily' is a reduplication of the stem 高兴 'happy' (MR AABB indicates an AABB reduplication pattern).</Paragraph> <Paragraph position="2"> - For FTs, their types and normalized forms are detected, e.g., 12:30 is the normalized form of the time expression 十二点三十分 (&quot;tim&quot; indicates a time expression).</Paragraph> <Paragraph position="3"> - For NEs, their types are detected, e.g., 李俊生 'Li Junsheng' is a person name.</Paragraph> <Paragraph position="4"> The five types of words cannot be defined by any consistent classification criteria (e.g., the relation between MDWs and LWs depends on the lexicon being used); the taxonomy therefore does not give a clear definition of Chinese words. We do not intend for this article to give a standard definition of Chinese words. Instead, we treat Chinese word segmentation as a preprocessing step, where the best segmentation units depend on how they are used in the consuming applications. The five word types represent the kinds of Chinese words that appear in most applications. This is one of the reasons we title this article &quot;a pragmatic approach.&quot; We focus on two tasks in this approach: processing the five types of Chinese words in a unified framework that can be jointly optimized (Sections 4 and 5), and adapting our system to different applications (Section 6). Now, we describe each of the five word types in Table 1 in detail.
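Before walking through each type, the per-type output illustrated in Figure 1(b) can be pictured as a stream of typed tokens. The following is a minimal sketch of one possible representation; the class and field names and the 'PER' tag are our own assumptions, not data structures from the paper, and the pattern labels quoted above are rendered here with underscores:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """One segmentation unit of the kind illustrated in Figure 1(b)."""
    surface: str                       # the characters covered by the unit
    word_type: str                     # 'LW', 'MDW', 'FT', 'NE', or 'NW'
    subtype: Optional[str] = None      # e.g. 'MA_S' (suffixation), 'MR_AABB', 'tim' (time FT)
    stem: Optional[str] = None         # stem of an MDW, i.e., the LW it is derived from
    normalized: Optional[str] = None   # normalized form of a factoid, e.g. '12:30'

# A hypothetical fragment of output, reusing the examples from the text:
tokens = [
    Token("教授", "LW"),                                           # lexicon word 'professor'
    Token("朋友们", "MDW", subtype="MA_S", stem="朋友"),             # noun stem + plural affix
    Token("高高兴兴", "MDW", subtype="MR_AABB", stem="高兴"),         # AABB reduplication
    Token("十二点三十分", "FT", subtype="tim", normalized="12:30"),   # time factoid
    Token("李俊生", "NE", subtype="PER"),                            # person name (tag assumed)
]
```

Each of the five word types discussed next simply fills different fields of such a record.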
LWs (lexicon words): Although some previous research has suggested carrying out Chinese word segmentation without the use of dictionaries (e.g., Sproat and Shih 1990; Sun, Shen, and Tsou 1998), we believe that a dictionary is an essential component of many applications. For example, in a machine translation system, it is desirable to segment a sentence into LWs as much as possible so that the candidate translations of these words can be looked up in a bilingual dictionary. Similarly, we would also like to segment a sentence into LWs in a Chinese text-to-speech (TTS) system because the pronunciations stored in the dictionary are usually much more precise than those generated dynamically (for instance, by character-to-sound rules). In our system, we used a lexicon containing 98,668 words, including 22,996 Chinese characters stored as single-character words. This lexicon is a combination of several dictionaries authored by Chinese linguists and used in different Microsoft applications. Thus, all LWs are in theory similar to those described in Packard (2000), i.e., linguistic units that are &quot;salient and highly relevant to the operation of the language processor.&quot; Figure 1: (a) A Chinese sentence; slashes indicate word boundaries. (b) The output of our word segmentation system; square brackets indicate word boundaries, and + indicates a morpheme boundary.</Paragraph> <Paragraph position="5"> MDWs (morphologically derived words): Chinese words of this type have the following two characteristics. First, MDWs can be generated from one or more LWs (called stems) via a productive morphological process. For example, in Figure 2, the MDW 高高兴兴 'happily' is generated from the stem 高兴 'happy' via an AABB reduplication process. As shown in Table 1, there are five main categories of morphological processes, each of which has several subcategories, as detailed in Figure 2 (see Wu [2003] for a detailed description): - Affixation (MP and MS): 朋友们 (friend + plural affix) 'friends';</Paragraph> <Paragraph position="7"> - Splitting (MS) (i.e., a set of expressions that are separate words at the syntactic level but single words at the semantic level): 吃了饭 'already ate', where the bi-character word 吃饭 'eat' is split by the particle 了 'already'. The second characteristic of MDWs is that they form stable Chinese character sequences in the corpus. That is, the components within an MDW are strongly correlated (they have a high co-occurrence frequency), while the components at both ends have low correlations with words outside the sequence. We shall describe in Section 5.2 how the 'stability' of a Chinese character sequence is measured quantitatively, and how a morph-lexicon is constructed for Chinese morphological analysis.</Paragraph> <Paragraph position="8"> FTs (factoids): There are ten categories of factoid words, such as time and date expressions, as shown in Table 1. All FTs can be represented as regular expressions. Therefore, the detection and normalization of FTs can be achieved by finite-state machines.</Paragraph> <Paragraph position="9"> NEs (named entities): NEs are frequently used Chinese names, including person names, location names, and organization names. One cannot develop a regular grammar that rejects or accepts the constructions of NEs with high accuracy, as we can do with most FTs.
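To make the contrast between FTs and NEs concrete, here is a minimal sketch of a regular-expression detector and normalizer for one factoid class, time-of-day expressions such as 十二点三十分 -> 12:30. It is our own simplified illustration, not the factoid grammar used in MSRSeg:

```python
import re

# Chinese numerals 0-9 (simplified; variants such as 两 are ignored).
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

def cn_number(s: str) -> int:
    """Convert a Chinese numeral in [0, 59], e.g. '十二' -> 12, '三十' -> 30."""
    if "十" in s:
        tens, _, units = s.partition("十")
        return DIGITS.get(tens, 1) * 10 + (DIGITS[units] if units else 0)
    return DIGITS[s]

# One time-of-day pattern: <number>点<number>分, e.g. 十二点三十分.
TIME_RE = re.compile(r"([零一二三四五六七八九十]+)点([零一二三四五六七八九十]+)分")

def normalize_times(text: str):
    """Yield (surface, normalized) pairs, e.g. ('十二点三十分', '12:30')."""
    for m in TIME_RE.finditer(text):
        hour, minute = cn_number(m.group(1)), cn_number(m.group(2))
        yield m.group(0), f"{hour}:{minute:02d}"

print(list(normalize_times("会议十二点三十分开始")))   # [('十二点三十分', '12:30')]
```

A full factoid grammar would cover the remaining categories in Table 1 (dates and other expressions) in the same finite-state fashion.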
In Section 5.3, we shall describe how we use both heuristics and statistical models for NER.</Paragraph> <Paragraph position="10"> NWs (new words): NWs are OOV words that are neither recognized as named entities or factoids nor derived by morphological rules. In particular, we focus on low-frequency new words, including newly coined words and occasional words, which are mostly time-sensitive (Wu and Jiang 2000). Many current segmenters simply ignore NWs, assuming that they are of little significance in most applications. However, we argue that the identification of these words is critical, because a single unidentified word can cause segmentation errors in the surrounding words. For NLP applications that require full parsing, it is even more critical, because a single error can cause the analysis of a whole sentence to fail.</Paragraph> <Paragraph position="11"> Figure 2: Taxonomy of morphologically derived words (MDWs) in MSRSeg.</Paragraph> </Section> <Section position="2" start_page="538" end_page="539" type="sub_section"> <SectionTitle> 3.2 MSR Standard </SectionTitle> <Paragraph position="0"> The taxonomy employed here has been specified in detail in the MSR standard. There are two general guidelines for the development of the standard: 1. The standard should be applicable to a wide variety of NLP tasks, of which representative examples are Chinese text input, IR, TTS, ASR, and MT.</Paragraph> <Paragraph position="1"> 2. The standard should be compatible, as much as possible, with existing standards, of which representative examples are the Chinese NE standards in ET/ER-99, the Mainland standard (GB/T), Taiwan's ROCLING standard (CNS14366; Huang et al. 1997), and the UPenn Chinese Treebank (Xia 1999).</Paragraph> <Paragraph position="2"> We are seeking a standard that is &quot;linguistically felicitous, computationally feasible, and [ensures] data uniformity&quot; (Huang et al. 1997; Sproat and Shih 2002). The MSR standard consists of a set of specific rules that aim at unambiguously determining the word segmentation of a Chinese sentence, given a reference lexicon. The development of the standard is an iterative procedure that interacts with the development of the gold test set (described in the next section). We begin with an initial set of segmentation rules, based on which four human annotators label a test corpus. Whenever an interannotator conflict is detected (automatically), we resolve it by revising the standard, mostly by adding more specific rules. The process is iterated until no conflict is detected. For example, we begin with the rule for detecting MDWs: &quot;if a character sequence can be derived from an LW via a morphological process, then the sequence is treated as an MDW candidate.&quot; We then observe that both 吃了饭 'already ate' and 吃了一顿饭 'already had a meal' are derived from the LW 吃饭 'eat' via the morphological process of splitting. While 吃了饭 is a reasonable MDW, 吃了一顿饭 is debatable. We then add a rule: &quot;MDW candidates with complex internal structures should be segmented.&quot; We also add a set of specific rules to define what a complex internal structure is.
An example of those rules is &quot;for MDW candidates of type MS, we consider only sequences that are less than four characters long.&quot; One drawback of this approach is that the standard becomes increasingly complicated as we continue to add such specific rules, and annotators begin to make clerical errors. We currently do not have a systematic solution to this problem; the complexity has to be controlled manually. That is, all newly added rules are compiled by a linguist so that the total number of rules remains manageable.</Paragraph> </Section> <Section position="3" start_page="539" end_page="540" type="sub_section"> <SectionTitle> 3.3 MSR Gold Test Set and Training Set </SectionTitle> <Paragraph position="0"> Several questions had to be answered when we were developing the gold test set for evaluation.</Paragraph> <Paragraph position="1"> 1. How should we construct a test corpus for reliable evaluation?
2. Does the segmentation in the gold test set depend on a particular lexicon?
3. Should we assume a single correct segmentation for a sentence?
4. What are the evaluation criteria?
5. How should we perform a fair comparison across different systems using the gold test set?
We answer the first three questions in this section and leave the rest for Section 3.5. First, to conduct a reliable evaluation, we select a test corpus that contains approximately half a million Chinese characters and that has been proofread and balanced in terms of domains and styles. The distributions are shown in Table 2. The gold test set is developed by annotating the test corpus according to the MSR standard via the iterative procedure described in Section 3.2. The statistics are shown in Table 3. Some fragments of the gold test set are shown in Figure 3, and the notation is presented in Table 1. As discussed in Section 3.1, we believe that the lexicon is a critical component of many applications. The segmentation of the gold test set depends upon a reference lexicon, which is the combination of several lexicons used in Microsoft applications, including a Chinese text input system (Gao et al. 2002), ASR (Chang et al. 2001), TTS (Chu et al. 2003), and the MSR-NLP Chinese parser (Wu and Jiang 2000). The lexicon consists of 98,668 entries. We also developed a morph-lexicon, which contains 50,963 high-frequency MDWs. We will describe how the morph-lexicon was constructed in Section 5.2.</Paragraph> <Paragraph position="2"> Regarding the third question, though it is common for a Chinese sentence to have multiple plausible segmentations, we keep only a single gold segmentation for each sentence, for two reasons. The first is simplicity. The second is that we currently do not know of any effective way of using multiple segmentations in the above-mentioned applications. In particular, we segment each sentence as much as possible into words that are stored in the reference lexicon. When there are multiple such segmentations for a sentence, we keep the one that contains the fewest words (a small illustrative sketch of this criterion is given below).</Paragraph> <Paragraph position="3"> We have also manually developed a training set according to the MSR standard. It contains approximately 40 million Chinese characters from various domains of text, such as newspapers, novels, and magazines.
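As a concrete illustration of the fewest-words criterion mentioned above, the following dynamic-programming sketch picks, from all segmentations consistent with a lexicon, one with the smallest number of words. It is our own toy sketch with a hypothetical lexicon, not the procedure used to build the gold test set:

```python
def fewest_words_segmentation(sentence, lexicon, max_len=8):
    """Return one segmentation of `sentence` with the fewest words.

    Single characters are always allowed as a fallback, so every sentence
    receives some segmentation.  `max_len` bounds the longest lexicon entry.
    """
    n = len(sentence)
    best = [float("inf")] * (n + 1)   # best[i]: fewest words covering sentence[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in that split
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = sentence[j:i]
            if piece in lexicon or i - j == 1:
                if best[j] + 1 < best[i]:
                    best[i], back[i] = best[j] + 1, j
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Toy example with a hypothetical lexicon:
lexicon = {"研究", "研究生", "生命", "命", "起源"}
print(fewest_words_segmentation("研究生命起源", lexicon))   # ['研究', '生命', '起源']
```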
In our experiments, 90% of the training set is used for model parameter estimation, and the remaining 10% is used as a held-out set for tuning.</Paragraph> </Section> <Section position="4" start_page="540" end_page="541" type="sub_section"> <SectionTitle> 3.4 SIGHAN's Bakeoff Standards and Corpora </SectionTitle> <Paragraph position="0"> As mentioned in Section 1, MSRSeg is designed as an adaptive segmenter that consists of two components: (1) a generic segmenter that can adapt to different domain vocabularies, and (2) a set of output adaptors, learned from application data, for adapting to different application-specific standards.</Paragraph> <Paragraph position="1"> (Footnote 5) Chinese writing is normally divided into the first three styles: descriptive, expository, and narrative. Practical writing is an umbrella term covering texts such as notes, letters, e-mails, and marriage announcements.</Paragraph> <Paragraph position="2"> Figure 3: Fragments of the MSR gold test set.</Paragraph> <Paragraph position="3"> Therefore, we evaluate MSRSeg on five corpora, each corresponding to a different standard and with consistent train-test splits, as shown in Table 4. MSR is described in the previous sections, and the other four are the standards used in SIGHAN's First International Chinese Word Segmentation Bakeoff (or Bakeoff for brevity; Sproat and Emerson 2003). In the Bakeoff corpora, OOV is defined as the set of words in the test corpus that do not occur in the training corpus.</Paragraph> <Paragraph position="4"> In our experiments, we always consider the following adaptation paradigm. Suppose we have a general predefined standard according to which we create a large amount of training data, and on this data we develop a generic word segmenter. Whenever we deploy the segmenter for an application, we customize its output according to an application-specific standard that can be partially acquired from a small amount of application data (called adaptation data).</Paragraph> <Paragraph position="5"> The MSR standard described in Section 3.2 is used as the general standard in our experiments, and the generic segmenter has been developed on it. The four Bakeoff standards are used as the specific standards to which we wish to adapt the general standard. We note in Table 4 that the adaptation data sets (i.e., the training corpora of the four Bakeoff standards) are much smaller than the MSR training set. Thus, the experimental setting is a good simulation of the adaptation paradigm described above. In the rest of the article, we report results on the MSR data set by default, unless otherwise stated.</Paragraph> </Section> <Section position="5" start_page="541" end_page="542" type="sub_section"> <SectionTitle> 3.5 Evaluation Methodology </SectionTitle> <Paragraph position="0"> As described earlier, we argue that Chinese words (or segmentation units) cannot be defined independently of the applications, and hence a more flexible system (i.e., an adaptive segmenter such as MSRSeg) should be adopted. However, we are then faced with the challenge of performing an objective and rigorous evaluation of such a system. In general, the evaluation of NLP systems is concerned with both the evaluation criteria and the standard data sets. In this article, we argue that MSRSeg is a better system in two regards.
First, the generic segmenter provides not only word segmentation but also word-internal structures (e.g., the tree structures of MDWs, FTs, and NEs, as will be described in Section 6) that cover all possible segmentations. Ideally, such a segmenter provides a superset of segmentation units in which each application can find the subset it needs. Second, the output adaptors of MSRSeg can automatically pick different subsets (i.e., segmentation units) from this superset for different applications. Therefore, there are two criteria for evaluating an adaptive segmenter: how complete the superset is and how effective the adaptation is. A true evaluation would require application data sets (i.e., segmented texts used by different applications). However, such application data are not available yet, and no other system has undergone such an evaluation, so there is no way to compare our system against others in this fashion. The evaluation methodology we adopt in this article is therefore a simulation. On the one hand, we developed a generic standard and a corresponding gold test set that simulate the generic superset, attempting to cover as many applications as possible.</Paragraph> <Paragraph position="1"> We then evaluate the completeness of the generic segmenter on this data set. On the other hand, we will show that we can effectively adapt the generic segmenter to the four Bakeoff data sets, each of which simulates an application subset.</Paragraph> <Paragraph position="2"> The evaluation measures we use in this study are summarized in Table 5. The performance of MSRSeg is measured through multiple precision/recall (P/R) pairs and F-measures (defined as F = 2PR/(P + R)), one for each word type. Riv is the recall of in-vocabulary words, and Roov is the recall of OOV words; they are used to measure the segmenter's performance in resolving segmentation ambiguities and in detecting unknown words, respectively. We also test the statistical significance of results, using the criterion proposed by Sproat and Emerson (2003).</Paragraph> <Paragraph position="3"> In addition to Riv, the numbers of OAS (overlap ambiguity string) and CAS (combination ambiguity string) errors are used to measure the segmenter's performance in resolving segmentation ambiguities in more detail.
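Before turning to the OAS and CAS definitions, note that the word-level P/R/F of Table 5 can be computed by comparing gold and system words as character-offset spans. The following is a minimal sketch under our own conventions, not the official Bakeoff scoring script:

```python
def to_spans(words):
    """Turn a word sequence into character-offset spans,
    e.g. ['研究', '生命'] -> {(0, 2), (2, 4)}."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def prf(gold_words, system_words):
    """Word precision, recall, and F = 2PR/(P + R), as defined in Section 3.5."""
    gold, system = to_spans(gold_words), to_spans(system_words)
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example: the system makes one boundary error.
print(prf(["研究", "生命", "起源"], ["研究生", "命", "起源"]))   # (0.333..., 0.333..., 0.333...)
```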
Liang (1987) defines OAS and CAS as follows.</Paragraph> <Paragraph position="4"> Definition 1. A character string ABC is called an overlap ambiguity string (OAS) if it can be segmented into two words either as AB/C or as A/BC (not both), depending on the context.</Paragraph> </Section> <Section position="6" start_page="542" end_page="543" type="sub_section"> <SectionTitle> Table 5: Evaluation measures for the Chinese word segmenter </SectionTitle> <Paragraph position="0"> Measure | Remarks
P/R/F | Multiple pairs, one for each word type (LW, MDW, FT, NE); the P/R/F of NER are used for cross-system comparison
Roov | Tests the performance of detecting unknown (OOV) words
Riv | Tests the performance of resolving ambiguities in word segmentation
# OAS errors | Similar to crossing brackets; used for cross-system comparison
# CAS errors | Computed on a set of 70 high-frequency CASs in our study
Significance test | See Sproat and Emerson (2003)</Paragraph> <Paragraph position="1"> Definition 2. A character string AB is called a combination ambiguity string (CAS) if A, B, and AB are all words.</Paragraph> <Paragraph position="2"> Liang (1987) reports that the relative frequency of OASs is 1.2 per 100 characters in Chinese text, and that the relative frequency of CASs is 12 times lower than that of OASs. However, according to the above definition, the relative frequency of CASs can be much higher, because most single characters in Chinese can be words by themselves and, as a result, almost all two-character words can be CASs. This, however, is not desirable. Consider the word 高度 'altitude'. Though both 高 'high' and 度 'degree' are words by themselves, the segmentation 高/度 almost never occurs in real text. To remedy this problem, Sun and Tsou (2001) revise the definition as follows: Definition 3. A character string AB is called a combination ambiguity string (CAS) if A, B, and AB are words, and there is at least one context in which the segmentation A/B is plausible both semantically and syntactically.</Paragraph> <Paragraph position="3"> Though the revision clarifies the definition in principle, it requires a judgment of the syntactic and semantic plausibility of a segmentation, a task on which agreement cannot easily be reached among different human annotators. Therefore, we use the CAS measure only in a pilot study. As will be described in Section 7, the number of CAS errors is estimated by counting the wrong segmentations of the 70 predefined high-frequency CASs.</Paragraph> <Paragraph position="4"> While all the measures in Table 5 can be used to evaluate MSRSeg, most of them cannot be used for cross-system comparison. For example, since the MSR gold test set is based on a reference lexicon, some of the measures are meaningless when we compare our system to segmenters that use different lexicons. Therefore, in comparing different systems we consider only the P/R/F of NER and the number of OAS errors (i.e., crossing brackets), because these measures are lexicon-independent and there is always a single unambiguous answer.</Paragraph> </Section> </Section> </Paper>