<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4005"> <Title>Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach</Title> <Section position="5" start_page="543" end_page="549" type="metho"> <SectionTitle> 4. Theoretical Background </SectionTitle> <Paragraph> This section provides the theoretical background underlying the development of MSRSeg. We first present in Section 4.1 a Chinese word segmentation framework that uses source-channel models of Chinese sentence generation. Then, in Section 4.2, we generalize source-channel models to linear mixture models, in which a wide variety of linguistic knowledge and statistical models can be incorporated in a unified way. These models are constructed via two basic modeling tools: (1) n-gram language models (LMs; Chen and Goodman 1999) and (2) finite-state automata (FSA; Roche and Schabes 1997). More specifically, the LMs we use are bigram and trigram backoff models whose parameters are estimated using maximum-likelihood estimation (MLE) with a particular smoothing method, called modified absolute discounting, described in Gao, Goodman, and Miao (2001). LMs capture statistical information such as the likelihood of a word or character sequence. FSAs are used to represent (1) the lexicon, (2) the rules for detecting FTs, and (3) the rules for generating NE candidates.</Paragraph> <Section position="1" start_page="544" end_page="545" type="sub_section"> <SectionTitle> 4.1 Source-Channel Models </SectionTitle> <Paragraph> The task of MSRSeg is to detect not only word boundaries but also word types, so that words of different types can be processed as shown in Figure 1. Therefore, following the Chinese word taxonomy in Table 1, we define a Chinese word class as a group of words that are assumed to be generated according to the same distribution (or processed in the same manner), as follows: 1. Each LW is defined as a class; 2. Each MDW is defined as a class; 3. Each type of FT (e.g., time expressions) is defined as a class; 4. Each type of NE (e.g., person names) is defined as a class; and 5. All NWs belong to one class.</Paragraph> <Paragraph> Notice that both the LW and MDW classes are open sets, so we need to assign a floor value to words that are not stored in the lexicons. In particular, we define six unknown word classes: one class represents all unknown LWs and all unknown MDWs whose type cannot be detected, and the other five classes represent unknown MDWs, one for each of the five MDW types listed in Table 1. The probabilities of these unknown word classes are estimated using the Good-Turing method.</Paragraph> <Paragraph> Given an input string s, the segmenter selects the most probable word class sequence w* among all candidates:

    w* = argmax_{w in GEN(s)} P(w|s) = argmax_{w in GEN(s)} P(w) P(s|w)    (1)

where GEN(s) denotes the candidate set given s.</Paragraph> <Paragraph> Equation (1) is the basic form of source-channel models for Chinese word segmentation. The models assume that a Chinese sentence s is generated as follows: First, a person chooses a sequence of concepts (i.e., word classes w) to output, according to the probability distribution P(w); then the person attempts to express each concept by choosing a sequence of characters, according to the probability distribution P(s|w).</Paragraph>
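<Paragraph> A minimal sketch of the decision rule in Equation (1), in Python, working in log space. The candidate generator and the two models are assumed to be given elsewhere (gen_candidates, context_logprob, and class_logprob are hypothetical names); the real system searches a lattice rather than enumerating GEN(s) exhaustively, as described in Section 5.6.

    import math

    def best_segmentation(s, gen_candidates, context_logprob, class_logprob):
        """Pick w* = argmax_w P(w) * P(s|w), in log space.

        gen_candidates(s)   -> iterable of word class sequences GEN(s)  (hypothetical)
        context_logprob(w)  -> log P(w), e.g., from a trigram LM        (hypothetical)
        class_logprob(s, w) -> log P(s|w), summed over class models     (hypothetical)
        """
        best, best_score = None, -math.inf
        for w in gen_candidates(s):
            score = context_logprob(w) + class_logprob(s, w)
            if score > best_score:
                best, best_score = w, score
        return best
</Paragraph>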
<Paragraph> The source-channel models can also be interpreted in another way: P(w) is a stochastic model estimating the probability of a word class sequence. It indicates, given a context, how likely a word class is to occur. For example, person names are more likely to occur before a title such as 教授 'professor'. Consequently, we also refer to P(w) as the context model. P(s|w) is a generative model estimating how likely a character string is to be generated given a word class. For example, the character string 李俊生 is more likely to be a person name than 里俊生 'Li Junsheng', because 李 is a common family name in China while 里 is not. So P(s|w) is also referred to as the class model. In our system, we use only one context model (i.e., a trigram language model) and a set of class models of different types, one for each class of words, as shown in Table 6.</Paragraph> <Paragraph> Table 6 (only fragments survive in this extraction) pairs each word class with its class model and the corresponding feature value. The recoverable rows are: NE classes use n-gram class models, one per NE type, whose feature is defined on substrings of s that form an NE of that type; FT classes use one FSA per FT type, whose feature value is the number of FTs (of that type) in w; and the NW class uses an SVM classifier, whose feature value is the classifier score.</Paragraph> <Paragraph> It should be noted that different class models are constructed in different ways (e.g., NE models are n-gram models trained on corpora, whereas FT models use derivation rules and have binary values). The dynamic value ranges of the class model probabilities can be so different (some are not probabilities but scores) that it is inappropriate to combine all models through simple multiplication, as in Equation (1). One way to balance these quantities is to introduce for each class model (i.e., channel model) a weight λ that adjusts the class model score from P(s|w) to P(s|w)^λ. In our experiments, these weights are optimized so as to minimize the number of word segmentation errors on training data under the framework of linear models, as described in Section 4.2. It is worth noting that the source-channel models are the rationale behind our system (e.g., the decoding process described in Section 5.6 follows this framework); linear models are just another representation, motivated by the optimization algorithm for the class model weights.</Paragraph> </Section> <Section position="2" start_page="545" end_page="548" type="sub_section"> <SectionTitle> 4.2 Linear Models </SectionTitle> <Paragraph> The framework of linear models is derived from the linear discriminant functions widely used for pattern classification (Duda, Hart, and Stork 2001) and was recently introduced into NLP tasks by Collins and Duffy (2001). It is also related to the (log-)linear models described in Berger, Della Pietra, and Della Pietra (1996), Xue (2003), Och (2003), and Peng, Feng, and McCallum (2004).</Paragraph> <Paragraph> We use the following notation in the rest of the article. Training data are a set of example input/output pairs: in Chinese word segmentation, we have training samples {s_i, w_i^R}, for i = 1...M, where each s_i is an input sentence and w_i^R is its reference segmentation (word class sequence). We assume a set of D + 1 features f_d(s, w), for d = 0...D.
The features are arbitrary functions that map (s, w) to real values; in vector notation, f(s, w) ∈ R^{D+1}. In our system, features are defined on word classes (they are basically derived from the class models). Their values are either the sum of the logarithms of the probabilities assigned by the corresponding probabilistic models, or are assigned heuristically. For those features that are defined only on w, we omit s and write f(w). Finally, the parameters of the model form a vector of D + 1 values, one per feature function: λ = (λ_0, λ_1, ..., λ_D). These are in fact the class model weights described in Section 4.1. The likelihood score of a word class sequence w can be written as

    Score(λ, s, w) = Σ_{d=0}^{D} λ_d f_d(s, w)    (2)

Equation (2) is yet another representation of the source-channel models described in Section 4.1, obtained by introducing class weights (i.e., adjusting P(s|w) to P(s|w)^λ) and taking the logarithm of all probabilities. The decision rule of Equation (1) can then be rewritten as

    w* = argmax_{w in GEN(s)} Score(λ, s, w)    (3)

</Paragraph> <Paragraph> In what follows, we describe how λ is estimated under the framework of gradient descent: an iterative procedure that adjusts the parameters λ in the direction that minimizes the segmentation errors with respect to a loss function. We present in turn the loss function and the optimization algorithm.</Paragraph> <Paragraph> The number of errors of a candidate segmentation w is computed by comparing it with a reference segmentation w^R and is denoted Er(w^R, w) (i.e., edit distance, in our case). The training criterion that directly minimizes the segmentation errors over the training data is

    λ* = argmin_λ Σ_{i=1}^{M} Er(w_i^R, w_i*(λ)),  where  w_i*(λ) = argmax_{w in GEN(s_i)} Score(λ, s_i, w)    (4)

Equation (4) is referred to hereafter as the minimum sample risk (MSR; Gao et al. 2005) criterion. Notice that, without knowing the "true" distribution of the data, the best λ can only be chosen approximately based on training samples. This is known as the principle of empirical risk minimization (ERM; Vapnik 1998): if the segmenter were trained using exactly the MSR criterion, it would converge to Bayes-risk performance (minimal error rate) as the training size goes to infinity.</Paragraph> <Paragraph> However, Er(.) is a piecewise constant function of the model parameters λ and thus a poor candidate for optimization by any simple gradient-type numerical search: the gradient cannot be computed explicitly because Er(.) is not differentiable with respect to λ, and there are many local minima on the error surface. Therefore, we use an alternative loss function, minimum squared error (MSE):

    MSELoss(λ) = Σ_{i=1}^{M} ( Score(λ, s_i, w_i^R) − Score(λ, s_i, w_i*) )^2    (5)

where Score(.) is defined in Equation (2), w_i* is the top-scoring candidate of Equation (3), and s is suppressed in MSELoss(λ) for simplicity. MSELoss(λ) is simply the squared difference between the score of the correct segmentation and the score of the incorrect one, summed over all training samples.</Paragraph> <Paragraph> It is useful to note that the MSE solution, under certain conditions, approximates the maximum-likelihood solution; the quality of the approximation depends upon the form of the linear discriminant functions (e.g., Equation (2)). Because of its appealing theoretical properties, the MSE criterion has received considerable attention in the literature, and many solution procedures are available (Duda, Hart, and Stork 2001). We optimize MSELoss using the delta training rule for an unthresholded perceptron, following the description in Mitchell (1997).</Paragraph>
<Paragraph> The delta rule in its component form is

    λ_d ← λ_d + Δλ_d,   Δλ_d = −ε G_d    (6)

where ε is the step size and G is the gradient of MSELoss:

    G_d = ∂MSELoss/∂λ_d = 2 Σ_{i=1}^{M} ( Score(λ, s_i, w_i^R) − Score(λ, s_i, w_i*) ) ( f_d(s_i, w_i^R) − f_d(s_i, w_i*) )    (7)

However, the objective function of Equation (5), in the context of our task (i.e., Chinese word segmentation), has many local minima, so gradient descent cannot guarantee finding the global minimum. We therefore use a stochastic approximation to gradient descent. Whereas gradient descent computes parameter updates after summing over all training samples, as in Equation (7), the stochastic approximation method updates the parameters incrementally, following the calculation of the error for each individual training sample:

    Δλ_d = ε ( Score(λ, s, w*) − Score(λ, s, w^R) ) ( f_d(s, w^R) − f_d(s, w*) )    (8)

(A common refinement is to anneal the step size, using a large value in early iterations and a small value in final iterations; we tested this approach in our study, and it resulted in very limited improvement.)</Paragraph> <Paragraph> The optimization algorithm used in our experiments is shown in Figure 4 (Figure 4: The perceptron training algorithm for Chinese word segmentation). It takes T passes over the training set. All parameters are initially set to 1. The context model parameter λ_0 does not change during training. Class model parameters are updated in a simple additive fashion: parameters are altered according to the gradient of MSELoss with respect to each training sample, as in Equation (8). Here, w^R is defined as the w in GEN(s) with the fewest errors, so Score(λ, s, w*) ≥ Score(λ, s, w^R), with equality exactly when the top candidate is correct. That is, the model parameters are updated only when the sentence is wrongly segmented. The update rule increases the parameter values for word classes whose models were "underestimated" (i.e., the expected feature value f(s, w*) is less than the observed feature value f(s, w^R)) and decreases the parameter values for word classes whose models were "overestimated" (i.e., f(s, w*) is larger than f(s, w^R)). Empirically, the sequence of these updates, iterated over all training samples, provides a reasonable approximation to descending the gradient of the original loss function of Equation (5). Although this method cannot guarantee a globally optimal solution, we chose it for our modeling because of its efficiency and because it achieved the best results in our experiments.</Paragraph> <Paragraph> The algorithm is similar to the perceptron algorithm described in Collins (2002). The key difference is that, instead of using the delta rule of Equation (8) (as shown in line 5 of Figure 4), Collins (2002) updates parameters using the rule λ_d ← λ_d + f_d(s, w^R) − f_d(s, w*). Our pilot study shows that our algorithm achieves slightly better results.</Paragraph> </Section>
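<Paragraph> The following is a minimal sketch of the training loop of Figure 4, assuming the same hypothetical helpers as before plus a feature extractor and an error counter; the per-sample update is the delta rule of Equation (8).

    def train(samples, gen_candidates, features, num_errors, D, T=10, eps=0.01):
        """Sketch of the perceptron-style trainer of Figure 4 (hypothetical helpers).

        samples        : list of (sentence, reference segmentation) pairs
        gen_candidates : s -> list of candidate word class sequences GEN(s)
        features       : (s, w) -> list of D+1 feature values f_d(s, w)
        num_errors     : (w_ref, w) -> segmentation error count Er(w_ref, w)
        """
        lam = [1.0] * (D + 1)  # all parameters start at 1; lam[0] (context model) stays fixed

        def score(s, w):
            return sum(l * f for l, f in zip(lam, features(s, w)))

        for _ in range(T):                                        # T passes
            for s, w_ref in samples:
                cands = gen_candidates(s)
                w_r = min(cands, key=lambda w: num_errors(w_ref, w))  # fewest-error candidate
                w_star = max(cands, key=lambda w: score(s, w))        # current best (Eq. 3)
                if w_star == w_r:
                    continue  # correctly segmented: update of Eq. (8) is zero
                diff = score(s, w_star) - score(s, w_r)
                f_r, f_star = features(s, w_r), features(s, w_star)
                for d in range(1, D + 1):  # delta-rule update of Eq. (8); lam[0] skipped
                    lam[d] += eps * diff * (f_r[d] - f_star[d])
        return lam
</Paragraph>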
<Section position="3" start_page="548" end_page="549" type="sub_section"> <SectionTitle> 4.3 Discussions on Robustness </SectionTitle> <Paragraph> The training methods described in Section 4.2 aim at minimizing errors on a training set, but test sets can differ; the robustness issue concerns how well the minimal error rate on the training set is preserved on the test set. According to Dudewicz and Mishra (1988), the MSE function is in general not very robust, because it is not bounded and can be contaminated by training samples far from the decision boundary. One of many possible solutions for improving robustness is to introduce a margin into the training procedure of Figure 4. The basic idea is to enlarge the score difference (or score margin) between a correct segmentation w^R and its competing incorrect segmentations {w : w ∈ GEN(s), w ≠ w^R}. According to Equation (8), the perceptron training algorithm of Figure 4 does not adjust the parameters if the sentence is segmented correctly. Robustness could be improved if we continued to enlarge the score margin between the correct segmentation and the top competing candidate even when the input sentence has been correctly segmented, until the margin exceeds a preset threshold. More specifically, we can modify Equation (8) so that the update is applied whenever

    Score(λ, s, w^R) − Score(λ, s, w̄) < δ    (9)

where w̄ is the top competing candidate and δ is the desired margin, which can be either an absolute value or a quantity proportional to the score of the correct segmentation (Su and Lee 1994). The modified training algorithm is similar to the perceptron algorithm with margins proposed by Krauth and Mézard (1987). We leave the evaluation of the algorithm to future work. Readers can also refer to Duda, Hart, and Stork (2001) for a detailed description of margin-based learning algorithms.</Paragraph> </Section> </Section>
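<Paragraph> A sketch of how the margin check of Equation (9) could be added to the trainer above. This is one consistent way to realize the idea, not the paper's evaluated implementation (the paper leaves that to future work); delta and the helper names are assumptions.

    def margin_update(lam, s, w_r, w_bar, score, features, eps=0.01, delta=1.0):
        """Update unless the correct segmentation already wins by at least delta."""
        gap = score(s, w_r) - score(s, w_bar)
        if gap >= delta:
            return  # margin already satisfied: leave parameters unchanged
        f_r, f_bar = features(s, w_r), features(s, w_bar)
        for d in range(1, len(lam)):  # lam[0] (context model weight) stays fixed
            lam[d] += eps * (delta - gap) * (f_r[d] - f_bar[d])

With delta = 0 this reduces (up to the multiplier offset) to the update of Equation (8), which only fires on wrongly segmented sentences.</Paragraph>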
<Section position="6" start_page="549" end_page="558" type="metho"> <SectionTitle> 5. System Description </SectionTitle> <Section position="1" start_page="549" end_page="550" type="sub_section"> <SectionTitle> 5.1 Architecture Overview </SectionTitle> <Paragraph> MSRSeg consists of two components: a generic segmenter and a set of output adaptors. We describe the first component in this section and leave the second to Section 6.</Paragraph> <Paragraph> The generic segmenter has been developed on the basis of the mathematical models described in Section 4. It consists of several components, as shown in Figure 5 (Figure 5: Overall architecture of MSRSeg): 1. Sentence Segmenter: detects sentence boundaries using punctuation clues such as periods, question marks, and exclamation marks. 2. Word Candidate Generator: given an input string s, generates all word candidates and stores them in a word lattice; each candidate is assigned its word class and its class model score, e.g., log P(s'|w), where s' is a substring of s. 3. Decoder: selects the best (or the N best) word segmentation (i.e., word class sequence w*) from the lattice according to Equations (2) and (3), using the Viterbi (or A*) search algorithm. 4. Wrapper: outputs segmentation results in predefined canonical forms; e.g., MDWs and FTs are output in their normalized forms, as described in Section 3.1.</Paragraph> <Paragraph> Here, we describe the candidate generators for the different word classes. They are (1) the lexicon (and morph-lexicon) TRIEs, which generate LW (or MDW) candidates; (2) the NE class models, which generate NE candidates; (3) the finite-state automata (FSA), which generate FT candidates; and (4) the classifier, which generates NW candidates.</Paragraph> </Section> <Section position="2" start_page="550" end_page="551" type="sub_section"> <SectionTitle> 5.2 Lexicon Representation and Morphological Analysis </SectionTitle> <Paragraph> Lexicon words are represented as a set of TRIEs (Frakes and Baeza-Yates 1992), a particular implementation of the FSA described in Section 4. Given a character string, all prefix strings that form lexical words can be retrieved by browsing the TRIE whose root represents the string's first character (see the sketch at the end of this subsection).</Paragraph> <Paragraph> Though there are well-known techniques for English morphological analysis (e.g., finite-state morphology), they are difficult to extend to Chinese, for two reasons. First, Chinese morphological rules are not as general as their English counterparts: in most cases English plural nouns can be generated using the rule "noun + -s → plural noun", but only a small subset of Chinese nouns can be pluralized (e.g., 朋友们 'friends') using the Chinese counterpart "noun + 们 → plural noun", whereas others (e.g., 南瓜 'pumpkins') cannot. [Footnote 7: We do not consider special cases, such as a children's fairy story in which magic pumpkins can talk.] Second, the operations required by Chinese morphological analysis, such as copying in reduplication, merging, and splitting, cannot be implemented using current finite-state networks. [Footnote 8: Some types of copying operations can be implemented by FSMs (e.g., Sproat 1992), but the implementation is not trivial. Because the number of MDWs is manageable, storing them as a list is not much more expensive than storing them as a finite-state network, in terms of space and access speed. Our implementation can be viewed as a pragmatic solution, easy to implement and maintain.]</Paragraph> <Paragraph> Our solution is extended lexicalization: we simply collect all MDWs of the five types described in Section 3.1 and incorporate them into the TRIE lexicon, called the morph-lexicon. The TRIEs are essentially the same as those used for lexical words, except that not only each MDW's identity but also its morphological pattern and stem(s) are stored. Lexicalization involves three steps. (1) Candidate generation: candidates are generated by applying a set of morphological rules to both the word lexicon and a large corpus; for example, the rule "noun + 们 → plural noun" would generate candidates like 朋友们. (2) Statistical filtering: for each candidate, we obtain a set of statistical features, such as frequency, mutual information, and left/right context dependency, from a large corpus, and then use an information-gain-like metric described in Chien (1997) and Gao et al. (2002) to estimate how likely a candidate is to form a morphologically derived word, removing the "bad" candidates. The basic idea behind the metric is that a Chinese word should appear as a stable sequence in the corpus: the components within the word are strongly correlated, while the components at both ends should have low correlations with words outside the sequence. (3) Linguistic selection: finally, we manually check the remaining candidates and construct the morph-lexicon, where each entry is tagged with its morphological pattern and stem(s). The resulting morph-lexicon contains 50,963 MDWs.</Paragraph> </Section>
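<Paragraph> A minimal sketch of the TRIE-based prefix lookup described at the start of this subsection, assuming one TRIE per starting character and an opaque entry payload (word class, pattern, stems); all names are illustrative.

    class TrieNode:
        def __init__(self):
            self.children = {}   # character -> TrieNode
            self.entry = None    # lexicon payload: word class, pattern, stem(s), ...

    def insert(root, word, entry):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.entry = entry

    def prefix_matches(root, text, start=0):
        """Return every lexicon word that begins at position `start` of `text`."""
        node, out = root, []
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.entry is not None:
                out.append((text[start:i + 1], node.entry))
        return out

Candidate generation for the lattice then amounts to calling prefix_matches at every character position of the sentence.</Paragraph>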
<Section position="3" start_page="551" end_page="553" type="sub_section"> <SectionTitle> 5.3 Named Entities </SectionTitle> <Paragraph> We consider four types of named entities: person names (PNs), location names (LNs), organization names (ONs), and transliterations of foreign names (FNs). Because any character string can in principle be a named entity of one or more types, in order to limit the number of candidates for a more effective search, we generate named entity candidates from an input string in two steps. First, for each type, we use a set of constraints (compiled by linguists and represented as FSAs) to generate only the "most likely" candidates. Second, each generated candidate is assigned a class model probability. Class models are defined as generative models that are estimated on their corresponding named entity lists using MLE, together with a backoff smoothing schema, as described in Section 4.1. We briefly describe the constraints and the class models here. The Chinese person-name model is a modified version of that described in Sproat et al. (1996); the other NE models are novel, though they share some similarities with the person-name model.</Paragraph> <Paragraph> 5.3.1 Person Names. Two constraints are used. (1) PN patterns: we assume that a Chinese PN consists of a family name F and a given name G and is of the pattern F+G; both F and G are one or two characters long. (2) Family name list: we consider only PN candidates that begin with an F stored in the family name list (which contains 373 entries in our system).</Paragraph> <Paragraph> Given a PN candidate, which is a character string s, the class model probability P(s|PN) is computed by a character bigram model: generate the family name F, then the first given-name character G1, and then the second given-name character G2 given G1. For example, the generative probability of the string 李俊生, given that it is a PN, would be estimated as P(李俊生|PN) = P(李|F) P(俊|G1) P(生|俊, G2).</Paragraph> <Paragraph> 5.3.2 Location Names. An LN candidate is generated from s (which is less than 10 characters long) if one of the following conditions is satisfied: (1) s is an entry in the LN list (which contains 30,000 LNs); or (2) s ends in a keyword in a 120-entry LN keyword list (e.g., 市 'city'). The probability P(s|LN) is computed by a character bigram model. Consider the string 沙米尔河 'Shamir River': it is an LN candidate because it ends in the LN keyword 河 'river'. The generative probability of the string, given that it is an LN, would be estimated as P(沙米尔河|LN) = P(沙|<LN>) P(米|沙) P(尔|米) P(河|尔) P(</LN>|河), where <LN> and </LN> are symbols denoting the beginning and the end of an LN, respectively.</Paragraph>
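<Paragraph> A sketch of the character bigram class model used for LNs (FNs are scored analogously). The bigram table is assumed to be estimated elsewhere by MLE on an LN list; the real system adds backoff smoothing, which is omitted here.

    import math

    def ln_logprob(s, bigram):
        """log P(s|LN) under a character bigram model with boundary symbols.

        bigram[(prev, cur)] -> P(cur|prev), an assumed precomputed table.
        """
        chars = ["<LN>"] + list(s) + ["</LN>"]
        return sum(math.log(bigram[(p, c)]) for p, c in zip(chars, chars[1:]))
</Paragraph>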
<Paragraph> 5.3.3 Organization Names. ONs are more difficult to identify than PNs and LNs because ONs are usually nested named entities. For example, the ON 中国国际航空公司 'Air China Corporation' contains the LN 中国 'China'. Like the identification of LNs, an ON candidate is generated from a character string s (less than 15 characters long) only if it ends in a keyword in a 1,355-entry ON keyword list (e.g., 公司 'corporation'). [Footnote 9: For clarity, the constraint given here is a simplified version of that used in MSRSeg.]</Paragraph> <Paragraph> To estimate the generative probability of a nested ON, we introduce the word class segmentation of s, w, as a hidden variable. In principle, the ON class model recovers P(s|ON) by summing over all possible segmentations: P(s|ON) = Σ_w P(w|ON) P(s|w). We then assume that the sum is approximated by a single pair of terms, P(w*|ON) P(s|w*), where w* is the most probable word class segmentation, discovered by the decision rule of Equation (3). That is, we also use our system to find w*, but with the source-channel models estimated on the ON list.</Paragraph> <Paragraph> Consider the earlier example. Assuming that w* = LN/国际/航空/公司, where 中国 is tagged as an LN, the probability P(s|ON) would be estimated using a word class bigram model as P(中国国际航空公司|ON) ≈ P(LN/国际/航空/公司|ON) P(中国|LN) = P(LN|<ON>) P(国际|LN) P(航空|国际) P(公司|航空) P(</ON>|公司) P(中国|LN), where P(中国|LN) is the class model probability of 中国 given that it is an LN, and <ON> and </ON> are symbols denoting the beginning and the end of an ON, respectively.</Paragraph> <Paragraph> 5.3.4 Transliterations of Foreign Names. As described in Sproat et al. (1996), FNs are usually transliterated using Chinese character strings whose sequential pronunciation mimics the source-language pronunciation of the name. Since FNs can be of any length and their original pronunciations are effectively unlimited, recognizing such names can be tricky. Fortunately, only a few hundred Chinese characters are particularly common in transliterations. Therefore, an FN candidate is generated from s if it contains only characters stored in a transliterated-name character list (which contains 618 Chinese characters). The probability P(s|FN) is estimated using a character bigram model. Notice that in our system an FN can be a PN, an LN, or an ON, depending on the context. Given an FN candidate, three named entity candidates, one per category, are therefore generated in the lattice, with the class probabilities P(s|PN) = P(s|LN) = P(s|ON) = P(s|FN); that is, we delay the determination of the FN's type to the decoding phase, where the context model is used.</Paragraph> <Paragraph> 5.3.5 Abbreviations. For the sake of completeness, we describe below the basic ideas for tackling NE abbreviations within our framework. This is ongoing research, and we do not have conclusive results yet. Readers can refer to Zhu et al. (2003) and Sun, Zhou, and Gao (2003) for more details, where marginal improvements have been reported.</Paragraph> <Paragraph> Different strategies can be adopted for different types of NEs. We find that most abbreviations of Chinese PNs and LNs are single-character NEs. The PN and LN models described previously cannot handle them very well, because (1) single-character NEs are generated in a different way from multicharacter NEs, and (2) the context of single-character NEs differs from that of multicharacter ones: for example, single-character NEs usually appear adjacent to one another, such as 中 and 俄 in 中俄贸易 'China-Russia trade', but this is not the case for multicharacter NEs. We thus define single-character PNs and LNs as two separate word classes, denoted SCPN and SCLN, respectively. We assume that a character is an SCPN (or SCLN) candidate only when it is included in a predefined SCPN (or SCLN) list, which contains 151 (or 177) characters.</Paragraph>
<Paragraph> The class model probabilities are assigned by unigram models, estimated from counts in an annotated training set and smoothed over the list entries (e.g., P(s|SCPN) = (C(s) + 1) / (Σ_{s'} C(s') + N)), where C(s) is the count of the SCPN (or SCLN) s in the annotated training set, the sum runs over all entries of the list, and N is the size of the SCPN (or SCLN) list. In the context model, we also differentiate between PN (or LN) and SCPN (or SCLN); therefore, SCPNs and SCLNs must be labeled explicitly in the training data.</Paragraph> <Paragraph> It is much more difficult to detect abbreviations of ONs (denoted ONA), because ONAs are usually multicharacter strings and can be generated from their original ON arbitrarily. For example, the abbreviation of 清华大学 'Tsinghua University' is 清华, while the abbreviation of 北京大学 'Peking University' is 北大. We assume that an ONA candidate is generated from a character string s (fewer than 6 characters long) only if both of the following conditions are satisfied: (1) an ON has been detected in the same document, and (2) s can be derived from the ON using a generative pattern defined in Table 7. Since there is no training data for the ONA class model, we construct a score function to assign each ONA candidate a class model score. Consider the string 北大. The generative probability of the string, given that it is an ONA, would be approximated as P(北大|ONA) ≈ P(北京大学|ON) P(北大|北京大学, ON), where P(北大|北京大学, ON) is defined as a constant (0.8 in our experiments) if 北大 can be derived from 北京大学 using any generative pattern of Table 7, and 0 otherwise. If more than one previously detected ON in the same document can be used to derive the ONA candidate (e.g., both 北京大学 and 北方大学 'North University' could yield 北大), only the closest ON is taken into account. We also notice that ONAs and ONs occur in similar contexts, so we do not differentiate them in the context model.</Paragraph> </Section> <Section position="4" start_page="553" end_page="554" type="sub_section"> <SectionTitle> 5.4 Factoids </SectionTitle> <Paragraph> The types of factoids handled in MSRSeg are shown in Table 1 of Section 3.1. For each type of factoid, we write a grammar of regular expressions; the ten regular expressions are then compiled into a single FSA. Given an input string s, we scan it from left to right and output an FT candidate whenever a substring matches one of the ten regular expressions. We also remove those FT candidates that are substrings of other candidates of the same type: in the example of Figure 1, only the full time-expression string is accepted as an FT candidate, not its proper substrings.</Paragraph> <Paragraph> The use of FSA is motivated by the fact that most FTs can be detected exclusively from their internal properties, without relying on context. This is in principle justified by experiments: as shown in Table 8, the overall performance of FT detection using only the FSA is comparable to that of MSRSeg, where the contextual information of the FT is considered (i.e., in MSRSeg, the FSAs are used as feature functions, and FTs are detected simultaneously with other words). Reading the results more carefully, we see that the use of context information (in MSRSeg) achieves higher precision but lower recall, a small but significant difference.</Paragraph> </Section>
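<Paragraph> A sketch of the factoid scanner described above, using Python's standard re module in place of a compiled FSA. The patterns shown are illustrative stand-ins, not the system's actual ten grammars.

    import re

    # Illustrative patterns only; the real system compiles ten grammars into one FSA.
    FT_PATTERNS = {
        "time": re.compile(r"\d{1,2}:\d{2}"),
        "date": re.compile(r"\d{4}-\d{1,2}-\d{1,2}"),
        "percent": re.compile(r"\d+(\.\d+)?%"),
    }

    def factoid_candidates(s):
        cands = []
        for ftype, pat in FT_PATTERNS.items():
            for m in pat.finditer(s):
                cands.append((ftype, m.start(), m.end()))
        # Drop candidates that are substrings of a longer candidate of the same type.
        return [c for c in cands
                if not any(o != c and o[0] == c[0] and o[1] <= c[1] and c[2] <= o[2]
                           for o in cands)]
</Paragraph>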
<Section position="5" start_page="554" end_page="558" type="sub_section"> <SectionTitle> 5.5 New Words </SectionTitle> <Paragraph> New words in this section refer to OOV words that are neither recognized as named entities or factoids nor derived by morphological rules. These words are mostly domain-specific and/or time-sensitive, such as 三通 'Three Links' and 非典 'SARS'. The identification of such new words has not been studied extensively before, yet it is an important issue with substantial impact on word segmentation performance: approximately 30% of the OOV words in SIGHAN's PK corpus in Table 4 are new words of this type. There has been previous work on detecting Chinese new words from a large corpus in an off-line manner and updating the dictionary before word segmentation. Our approach, in contrast, detects new words on-line, i.e., it spots new words in a sentence on the fly during word segmentation, where widely used statistical features such as mutual information or term frequency are not available.</Paragraph> <Paragraph> For brevity, we focus on the identification of two-character new words, denoted NW11. Other types of new words, such as NW21 (a two-character word followed by a character) and NW12, can be detected similarly (e.g., by viewing the two-character word as an inseparable unit, like a character). These three types amount to 85% of all NWs in the PK corpus. We describe here the class model and the context model for new word identification (NWI), and the creation of training data by sampling.</Paragraph> <Paragraph> 5.5.1 Class Model. We use a classifier (SVM-light [Joachims 2002] in our experiments) to estimate the likelihood of two adjacent characters forming a new word. Of the great number of features with which we experimented, four linguistically motivated features were chosen for their effectiveness and their availability for on-line detection: independent word probability (IWP), anti-word pair (AWP), word formation analogy (WFA), and morphological productivity (MP). We now describe each feature in turn. In Section 5.5.2, we describe how the training data (a new word list) for the classifier is created by sampling.</Paragraph> <Paragraph> IWP (independent word probability) is a real-valued feature. Most Chinese characters can be used either as independent words or as component parts of multicharacter words, or both. The IWP of a single character is the likelihood of the character appearing as an independent word in text (Wu and Jiang 2000):

    IWP(x) = C(x, W) / C(x)

where C(x, W) is the number of occurrences of the character x as an independent word in the training data, and C(x) is the total number of occurrences of x in the training data. We assume that the IWP of a character string is the product of the IWPs of its component characters. Intuitively, the lower the IWP value, the more likely the character string forms a new word. In our implementation, the training data is word-segmented.</Paragraph>
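<Paragraph> IWP can be computed in one pass over a word-segmented corpus; a minimal sketch, assuming the corpus is an iterable of word lists:

    from collections import Counter

    def iwp_table(segmented_corpus):
        """IWP(x) = C(x, W) / C(x) for every character x in a segmented corpus."""
        as_word, total = Counter(), Counter()
        for words in segmented_corpus:
            for w in words:
                for ch in w:
                    total[ch] += 1
                if len(w) == 1:          # x used as an independent word
                    as_word[w] += 1
        return {ch: as_word[ch] / total[ch] for ch in total}

The IWP of a candidate string is then the product of its characters' values, e.g., iwp['下'] * iwp['岗'].</Paragraph>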
<Paragraph> AWP (anti-word pair) is a binary feature derived from IWP. The value of AWP for an NW11 candidate ab is defined as AWP(ab) = 1 if IWP(a) > th or IWP(b) > th, and 0 otherwise, where th ∈ [0, 1] is a preset threshold. Intuitively, if one of the component characters is very likely to be an independent word, it is unlikely to form a word with any other character. While IWP considers all component characters of a new word candidate, AWP considers only the one with the maximal IWP value.</Paragraph> <Paragraph> WFA (word formation analogy) is a binary feature. Given a character pair (x, y), a character (or a multicharacter string) z is called a common stem of (x, y) if at least one of the following two conditions holds: (1) the character strings xz and yz are lexical words (i.e., x and y act as prefixes); or (2) the character strings zx and zy are lexical words (i.e., x and y act as suffixes). We collect a list of such character pairs, called affix pairs, whose number of common stems is larger than a preset threshold. The value of WFA for a given NW11 candidate ab is defined as WFA(ab) = 1 if there exists an affix pair (a, x) (or (b, x)) such that the string xb (or ax) is a lexical word, and 0 otherwise. For example, given the NW11 candidate 下岗 (xia4-gang3, 'be laid off'), we have WFA(下岗) = 1 because (上, 下) is an affix pair (the two characters share 32 common stems) and 上岗 (shang4-gang3, 'take over a shift') is a lexical word.</Paragraph> <Paragraph> MP (morphological productivity) is a real-valued feature measuring the productivity of a particular construction (Baayen 1989):

    MP = n_1 / N

MP is strongly related to the Good-Turing estimate. Here, N is the number of tokens of a particular construction found in a corpus, e.g., the number of tokens of all nouns ending in -们, and n_1 is the number of types of that construction that occur only once, e.g., the number of such nouns appearing exactly once. Intuitively, a higher value of MP indicates a more productive construction, i.e., one whose component parts are more likely to appear in new words. For example, Sproat and Shih (2002) show that the MP values of the Chinese noun affix -们 and of a verb affix are 0.20 and 0.04, respectively, indicating that -们 is a far more productive affix, while the MP value of single-character nouns, which form a closed and nonproductive class, is close to 0. These results agree with intuition. Similarly, we find some very productive characters with high MP values: for example, our training set contains 236 words built on one such character, among which 13 occur only once.</Paragraph>
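<Paragraph> A sketch of the MP computation for a suffix over the same word-segmented corpus, directly transcribing the n_1/N form above (the helper name is illustrative):

    from collections import Counter

    def productivity(segmented_corpus, suffix):
        """MP = n1/N for words ending in `suffix`: hapax types over tokens."""
        counts = Counter(w for words in segmented_corpus
                           for w in words if w.endswith(suffix))
        N = sum(counts.values())                        # tokens of the construction
        n1 = sum(1 for c in counts.values() if c == 1)  # types occurring exactly once
        return n1 / N if N else 0.0
</Paragraph>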
<Paragraph> 5.5.2 Context Model. We use the context model for NWI, in addition to the classifier, for two reasons. The first is to capture useful contextual information: for example, new words are more likely to be nouns than pronouns, and POS tagging is context-sensitive. The second is more important: as described in Section 4, with a context model, NWI can be performed simultaneously with the other word segmentation tasks (e.g., word breaking, NER, and morphological analysis) in a unified manner.</Paragraph> <Paragraph> However, it is difficult to develop a training corpus in which new words are annotated, because "we usually do not know what we don't know." Our solution is Monte Carlo simulation: we sample a set of new words from our dictionary according to the probability that a lexical word w would be a new word, P(NW|w), and then generate a new-word-annotated corpus from a word-segmented text corpus. We now describe how P(NW|w) is estimated.</Paragraph> <Paragraph> It is reasonable to assume that new words are words whose probability of appearing in a new document is lower than that of general lexical words. Let P_i(k) be the probability that word w_i occurs k times in a document. In our experiments, we approximate P(NW|w_i) by the probability of w_i occurring fewer than K times in a new document:

    P(NW|w_i) ≈ Σ_{k=0}^{K−1} P_i(k)

where the constant K (7 in our experiments) depends on the size of the document: the larger the document, the larger the value. P_i(k) can be estimated using several term distribution models (Chapter 15.3 of Manning and Schütze [1999]). Following Gao and Lee (2000), we use the K-Mixture model (Katz 1996), which estimates P_i(k) from the observed mean λ and the observed inverse document frequency IDF:

    P_i(k) = (1 − α) δ_{k,0} + (α / (β + 1)) (β / (β + 1))^k

with

    λ = cf / N,   IDF = log2(N / df),   β = λ · 2^{IDF} − 1 = (cf − df) / df,   α = λ / β

where δ_{k,0} = 1 if k = 0 and 0 otherwise, cf is the total number of occurrences of word w_i in the training data, df is the number of training documents in which w_i occurs, and N is the total number of documents. In our implementation, the training data contain approximately 40,000 documents, balanced among domains, styles, and time periods.</Paragraph>
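<Paragraph> A sketch of the K-Mixture estimate of P(NW|w), transcribing the formulas above; cf, df, and the document count are assumed to be precomputed.

    def p_new_word(cf, df, n_docs, K=7):
        """P(NW|w) ~= sum_{k<K} P(k) under the K-Mixture model (Katz 1996)."""
        if cf <= df:           # degenerate case: the word never repeats in a document
            return 1.0 if cf else 0.0
        lam = cf / n_docs      # observed mean
        beta = (cf - df) / df  # equals lam * 2**IDF - 1
        alpha = lam / beta
        def p(k):
            spike = (1 - alpha) if k == 0 else 0.0
            return spike + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k
        return sum(p(k) for k in range(K))
</Paragraph>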
<Paragraph> 5.5.3 Evaluating the Classifier. This section discusses the two factors that we believe have the most impact on NWI performance. First, we investigate the relative contributions of the four linguistically motivated features. Second, we compare using the NWI component (i.e., the SVM classifier) as a post-processor versus as a feature function in the linear models of Equation (2).</Paragraph> <Paragraph> The NWI results on the PK test set are shown in Tables 9 and 10. We turned the features off one at a time and recorded the scores of each ablated NWI component. It turns out that, for both NW11 and NW12, IWP is clearly the most effective feature.</Paragraph> <Paragraph> Tables 11 and 12 show NWI results on the four Bakeoff test sets. Unified approaches (using the NWI component as a feature function) consistently and significantly outperform consecutive approaches (using the NWI component as a post-processor), in terms of both Roov and the P/R/F of overall word segmentation. This demonstrates empirically the benefit of using the context model for NWI and of the unified approach to Chinese word segmentation, as described in Section 5.5.2.</Paragraph> </Section> <Section position="6" start_page="558" end_page="558" type="sub_section"> <SectionTitle> 5.6 Decoder </SectionTitle> <Paragraph> The decoding process follows the source-channel framework and consists of three steps.</Paragraph> <Paragraph> Step 1: Throughout the process, we maintain an array of word class candidates, called a lattice, which is initialized to be empty.</Paragraph> <Paragraph> Step 2: Given a Chinese sentence, all possible words of the different types are generated simultaneously by the corresponding channel models described in Sections 5.2 to 5.5. For example, as shown in Table 6, the lexicon TRIE generates LW candidates, the SVM classifier generates NW candidates, and so on. All generated candidates are added to the lattice. Each element of the lattice is a 5-tuple (w, i, l, t, s), where w is the word candidate, i is its starting position in the sentence, l is its length, t is its word class tag, and s is the class model score assigned by its feature function in Table 6. Some examples are shown in Figure 5.</Paragraph> <Paragraph> Step 3: The Viterbi (or A*) algorithm is used to search for the best word class sequence among all candidate segmentations in the lattice, according to Equations (2) and (3).</Paragraph> <Paragraph> For efficiency, we sometimes need to control the search space. Given an input sentence s, all word candidates are ranked by their weighted class model score λ f(.), and the number of candidates (i.e., the size of the lattice) is controlled by two parameters: a number threshold (the number of candidates cannot exceed a given maximum) and a score threshold (the difference between the class model scores of the top-ranked and bottom-ranked candidates cannot exceed a given maximum).</Paragraph> </Section> </Section>
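<Paragraph> A sketch of Viterbi search over such a lattice, simplified to a bigram context model for brevity (the real system uses a trigram model; Edge and the model callable are assumptions):

    from collections import namedtuple

    Edge = namedtuple("Edge", "w i l t s")  # the 5-tuple described above

    def viterbi(n, lattice, trans_logprob, start="<s>"):
        """Best word class sequence for a sentence of n characters.

        lattice[i]          -> list of edges starting at character position i
        trans_logprob(p, e) -> context model log P(e | previous edge p)  (bigram)
        """
        best = {(0, start): (0.0, [])}  # state: (position, last edge) -> (score, path)
        for i in range(n):
            frontier = [(st, v) for st, v in best.items() if st[0] == i]
            for (_, prev), (sc, path) in frontier:
                for e in lattice[i]:
                    score = sc + e.s + trans_logprob(prev, e)  # class + context scores
                    state = (i + e.l, e)
                    if state not in best or score > best[state][0]:
                        best[state] = (score, path + [e])
        finals = [v for st, v in best.items() if st[0] == n]
        return max(finals, key=lambda v: v[0])[1]  # best path as a list of Edge tuples
</Paragraph>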
<Section position="7" start_page="558" end_page="566" type="metho"> <SectionTitle> 6. Standards Adaptation </SectionTitle> <Paragraph> This section describes the second component of MSRSeg: a set of adaptors that adjust the output of the generic segmenter to different application-specific standards.</Paragraph> <Paragraph> We consider the following standards adaptation paradigm. Suppose we have a general standard, predefined by ourselves, and a large amount of training data segmented according to this general standard, on which we have developed a generic word segmenter (the system described in Section 5). Whenever we deploy the segmenter for an application, we need to customize its output according to an application-specific standard. Such a standard is not always explicitly defined, but it is often implicitly defined in a given amount of application data (called adaptation data), from which the specific standard can be partially acquired.</Paragraph> <Section position="1" start_page="559" end_page="560" type="sub_section"> <SectionTitle> 6.1 Transformation-Based Learning Approach </SectionTitle> <Paragraph> In MSRSeg, standards adaptation is conducted by a postprocessor that performs an ordered list of transformations on the output of the generic segmenter, removing extraneous word boundaries and inserting new ones, to obtain a word segmentation that meets a different standard. The method we use is transformation-based learning (TBL; Brill 1995), which requires an initial segmentation, a goal segmentation into which we wish to transform the initial segmentation, and a space of allowable transformations (i.e., transformation templates). Under the adaptation paradigm above, the initial segmentation is the output of the generic segmenter, and the goal segmentation is the adaptation data. The transformation templates can make reference to words (lexicalized templates) as well as to predefined types (class-type templates), as described below.</Paragraph> <Paragraph> We notice that most variability in word segmentation across different standards comes from words that are not typically stored in the dictionary. Those words are dynamic in nature and are usually formed through productive morphological processes. In this study, we focus on three categories: MDW, NE, and FT. For each word class belonging to these categories, we define an internal structure similar to Wu (2003). The structure is a tree with the word class as the root and component types as the other nodes; there are 30 component types. As shown in Figure 6 (Figure 6: Word internal structure and class-type transformation templates), the word class Affixation has three component types: Prefix, Stem, and Suffix. Similarly, PersonName has two component types, and Date has nine (3 nonterminals and 6 terminals). These internal structures are assigned to words by the generic segmenter at run time.</Paragraph> <Paragraph> The transformation templates for words of the above three categories are of the form: Insert: place a new boundary between two component types; Delete: remove an existing boundary between two component types. Since the application of the transformations derived from these templates is conditioned on the word class and makes reference to component types, we call them class-type transformation templates; some examples are shown in Figure 6. In addition, we also use lexicalized transformation templates: Insert: place a new boundary between two lemmas; Delete: remove an existing boundary between two lemmas. Here, lemmas refer to basic lexical words that cannot be formed by any productive morphological process; they are mostly single characters, two-character words, and four-character idioms.</Paragraph> </Section> <Section position="2" start_page="560" end_page="561" type="sub_section"> <SectionTitle> 6.2 Evaluation Results </SectionTitle> <Paragraph> The results of standards adaptation on the four Bakeoff open test sets are shown in Tables 13 and 14. A set of transformations for each standard is learned using TBL from the corresponding Bakeoff training set. For each test set, we report results using our system with and without standards adaptation (Rows 1 and 2). For comparison, we also include in each table the results of using the FMM (forward maximum matching) greedy segmenter as the generic segmenter (Row 3) [Footnote 11: The FMM algorithm proceeds through the sentence from left to right, taking the longest match with a lexicon entry at each point. Similarly, the BMM (backward maximum matching) algorithm proceeds through the sentence from right to left.], and the top two scores (sorted by F) reported in SIGHAN's First International Chinese Word Segmentation Bakeoff (Rows 4 and 5) [Footnote 10: See Section 8.3 for the definitions of open test and closed test.].</Paragraph> <Paragraph> Table 13 (two of the Bakeoff open test sets; for each set, the columns are P, R, F, Roov, Riv):
    1. MSRSeg w/o adaptation  .824 .854 .839 .320 .861 | .799 .818 .809 .624 .861
    2. MSRSeg                 .952 .959 .955 .781 .972 | .895 .914 .904 .746 .950
    3. FMM w/ adaptation      .913 .946 .929 .524 .977 | .805 .874 .838 .521 .952
    4. Rank 1 in Bakeoff      .956 .963 .959 .799 .975 | .907 .916 .912 .766 .949
    5. Rank 2 in Bakeoff      .943 .963 .953 .743 .980 | .891 .911 .901 .736 .949
Table 14 (the other two Bakeoff open test sets; same column layout):
    1. MSRSeg w/o adaptation  .819 .822 .820 .593 .840 | .832 .838 .835 .405 .847
    2. MSRSeg                 .948 .960 .954 .746 .977 | .955 .961 .958 .584 .969
    3. FMM w/ adaptation      .818 .823 .821 .591 .841 | .930 .947 .939 .160 .964
    4. Rank 1 in Bakeoff      .954 .958 .956 .788 .971 | .894 .915 .904 .426 .926
    5. Rank 2 in Bakeoff      .863 .909 .886 .579 .935 | .853 .892 .872 .236 .906
</Paragraph>
<Paragraph> We can see that with adaptation, our generic segmenter achieves state-of-the-art performance on different standards, showing its superiority over other systems: performance improves dramatically across the board in all four test sets, and no single segmenter in the Bakeoff achieved top-two ranks in all four test sets (Sproat and Emerson 2003). We also notice in Tables 13 and 14 that the quality of adaptation seems to depend largely upon the size of the adaptation data (indicated by "# of Tr. Word" in the tables): we outperform the best Bakeoff systems on the AS set, where the adaptation data is large, but do worse on the CTB set, where the adaptation data is small. To verify this hypothesis, we evaluated adaptation results using subsets of the AS training set of different sizes and observed the same trend, as shown in Table 15. However, even with a much smaller adaptation data set (e.g., 250K words), we still outperform the best Bakeoff results.</Paragraph> </Section> </Section> <Section type="metho"> <SectionTitle> 7. Training Data Creation </SectionTitle> <Paragraph> This section describes (semi-)automatic methods of creating the training data used to estimate the context model probability P(w) (i.e., the trigram probability) in Equation (1). Ideally, given an annotated corpus in which each sentence is segmented into words tagged with their classes, the trigram word class probabilities could be calculated using MLE. Unfortunately, building such an annotated training corpus is very expensive.</Paragraph> <Section position="1" start_page="561" end_page="564" type="sub_section"> <SectionTitle> 7.1 Bootstrapping Approach and Beyond </SectionTitle> <Paragraph> Our basic solution is the bootstrapping approach described in Gao et al. (2002). It consists of three steps: (1) initially, we use a greedy word segmenter to annotate the corpus and obtain an initial context model from this initial annotated corpus; (2) we reannotate the corpus using the obtained models; and (3) we retrain the context model on the reannotated corpus. Steps 2 and 3 are iterated until the performance of the system converges. This approach is also called Viterbi iterative training, an approximation of EM training.</Paragraph> <Paragraph> In this approach, the quality of the context model depends to a large degree upon the quality of the initial annotated corpus, which is unsatisfactory for two reasons. First, the greedy segmenter cannot deal with segmentation ambiguities, and even after many iterations these ambiguities are only partially resolved. Second, many factoids and named entities cannot be identified by the greedy word segmenter, which is dictionary-based.</Paragraph> <Paragraph> To address the first problem, we use two methods to resolve segmentation ambiguities in the initially segmented training data. We classify word segmentation ambiguities into two classes, overlap ambiguity (OA) and combination ambiguity (CA), corresponding respectively to the OAS and CAS defined in Section 3.5.</Paragraph> <Paragraph> To resolve OA, we identify all OASs in the training data and replace them with a single token <OAS>; an example is shown in Figure 7. By doing so, we remove the portion of the training data that is likely to contain OA errors, and we then train a context model on the reduced training set, which contains no OASs. Intuitively, the resulting context model can be used to resolve the ambiguities.</Paragraph>
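<Paragraph> A sketch of the OAS-masking step. Detecting the overlap-ambiguous spans is assumed to be done elsewhere (e.g., by comparing FMM and BMM segmentations); find_oas is a hypothetical helper.

    def mask_oas(sentences, find_oas):
        """Replace each overlap-ambiguity span with the single token <OAS>.

        sentences: list of word lists; find_oas(words) -> list of (start, end) word spans.
        """
        masked = []
        for words in sentences:
            out, pos = [], 0
            for start, end in sorted(find_oas(words)):
                out.extend(words[pos:start])
                out.append("<OAS>")
                pos = end
            out.extend(words[pos:])
            masked.append(out)
        return masked
</Paragraph>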
<Paragraph> The method has been tested on the MSR test set; our main results are shown in Table 16. The FMM (or BMM) method can resolve only 73.1% (or 71.1%) of the OAs, whereas the context model resulting from our method resolves 94.3%. Our method differs from previous ones that use lexicalized rules to resolve OASs. For example, Sun and Zuo (1998) report that over 90% of OAs can be disambiguated simply by rules; we reimplemented their method in our experiments and found that 90.7% (or 91.3%) of the OAs in the MSR test set could be resolved. This result is similar to Sun and Zuo's but still not as good as ours, so we conclude that our method significantly outperforms the rule-based approaches. Another advantage of our method is that it is an unsupervised approach requiring no human annotation. Readers can refer to Li et al. (2003) for more details.</Paragraph> <Paragraph> To resolve CA, we select 70 high-frequency two-character CASs (e.g., 才能 'talent' versus 才/能 'just able'), as shown in Figure 8 (Figure 8: Results of 70 high-frequency two-character CASs; 'Voting' indicates the accuracy of the baseline method that always chooses the more frequent case of a given CAS, 'ME' the accuracy of a maximum-entropy classifier, and 'VSM' the accuracy of the VSM-based disambiguation method). For each CAS, we train a binary classifier using sentences that contain the CAS and that have been segmented by the greedy segmenter. Then, for each occurrence of a CAS in the initially segmented training data, the corresponding classifier is used to determine whether the CAS should be segmented. Our experiments show that 95.7% of the CAs can be resolved; detailed results are shown in Figure 8.</Paragraph> <Paragraph> The VSM-inspired (vector space model) binary classifier works as follows. Suppose we have a CAS, s, whose position in a sentence is i. We use its six surrounding words w, at positions i−3, i−2, i−1, i+1, i+2, and i+3, as features, and define a set of feature functions that simulate TF-IDF scores. Each feature function is a mapping f(s, w) ∈ R. In particular, let TF1(s, w) (or TF2(s, w)) be the term frequency of w in the cases where s is a one-word (or two-word) string. Let IDF1(s, w) (or IDF2(s, w)) be 1 if w occurs only in the cases where s is a one-word (or two-word) string; if w occurs in both cases, let IDF1(s, w) = IDF2(s, w) = 0.25. We also assign a weight λ_p to each position p, set empirically in our experiments. We then calculate the scores of s being one word or two words as

    Score1(s) = Σ_p λ_p TF1(s, w_p) IDF1(s, w_p)    (18)
    Score2(s) = Σ_p λ_p TF2(s, w_p) IDF2(s, w_p)    (19)

The CAS is taken to be a single word if Score1(s) > Score2(s), and two words otherwise.</Paragraph>
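<Paragraph> A sketch of the VSM-inspired scoring of Equations (18) and (19). The TF/IDF tables are assumed to be collected beforehand from greedy-segmented sentences containing the CAS, and the position weights are illustrative.

    def cas_is_one_word(context, tf1, tf2, idf1, idf2, weights):
        """context: the six surrounding words at positions i-3..i-1, i+1..i+3.

        tf1/tf2[w]  : term frequencies of w in the 1-word / 2-word cases
        idf1/idf2[w]: 1.0 if w occurs only in that case, else 0.25 in both
        weights     : one weight per context position (set empirically)
        """
        s1 = sum(l * tf1.get(w, 0) * idf1.get(w, 0) for l, w in zip(weights, context))
        s2 = sum(l * tf2.get(w, 0) * idf2.get(w, 0) for l, w in zip(weights, context))
        return s1 > s2  # single word if Score1 > Score2 (Eqs. 18-19)
</Paragraph>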
<Paragraph> Our experiments show that the VSM-inspired classifier outperforms other well-known classifiers for this particular task; for example, a maximum-entropy classifier using the same features achieved an overall accuracy of 94.1%, versus 95.7% for the VSM-inspired classifier.</Paragraph> <Paragraph> For the second problem, the detection of FTs and NEs, we can simply use the FSA-based approach described in Section 5.4 to detect FTs in the initially segmented corpus. Our method of NER in the initial step (i.e., step 1) is a little more sophisticated. First, we manually annotate named entities on a small subset of the training data (called the seed set). Then we obtain a context model on the seed set (called the seed model). We improve the context model trained on the initially annotated training corpus by interpolating it with the seed model, and finally we use the improved context model in steps 2 and 3 of the bootstrapping. We shall show in the next subsection that a relatively small seed set (e.g., 150K words) is enough to obtain a reasonably good context model for initialization.</Paragraph> </Section>
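<Paragraph> The interpolation step can be as simple as a per-n-gram linear mixture of the two context models; a minimal sketch, where the interpolation weight alpha is a hypothetical tuning parameter:

    def interpolate(p_init, p_seed, alpha=0.5):
        """Linearly interpolate two context models given as n-gram -> probability dicts."""
        grams = set(p_init) | set(p_seed)
        return {g: alpha * p_seed.get(g, 0.0) + (1 - alpha) * p_init.get(g, 0.0)
                for g in grams}
</Paragraph>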
</Section> <Section position="4" start_page="564" end_page="566" type="sub_section"> <SectionTitle> 7.2 Evaluation Results </SectionTitle> <Paragraph position="0"> To justify the methods just described, we built a large number of context models using different initial corpora. For each of the initial corpora, a context model is trained using the Viterbi iterative procedure until convergence, i.e., until the improvement in the word segmentation performance of the resulting system falls below a preset threshold. The results are shown in Table 17, where Row 1 (FMM) presents the segmentation results of using the initial corpus segmented by a greedy word segmenter (the basic solution described earlier); in Row 2, we resolve segmentation (overlap) ambiguities on top of the corpus in Row 1; we then tag FTs in Rows 3 and 4. In Rows 5 to 8, several NE-annotated seed sets of different sizes are used, showing the trade-off between performance and human cost. In Rows 1 to 8, we use the raw training set containing approximately 50 million characters. For comparison, we also include in Row 9 the results of MSRSeg, whose context model has been trained on a 20-million-word manually annotated corpus. The experimental results reveal several facts.</Paragraph> <Paragraph position="1"> Table 17 Comparison of performance of MSRSeg: the versions that are trained using (semi-)supervised iterative training with different initial training sets (Rows 1 to 8) versus the version that is trained on an annotated corpus of 20 million words (Row 9).</Paragraph> <Paragraph position="2"> Although the greedy segmenter (FMM) can resolve around 90% of ambiguities in word segmentation, as shown in Table 16, the resulting segmenter is still much worse than MSRSeg because a large number of unknown words cannot be detected correctly even after Viterbi iterative learning.</Paragraph> <Paragraph position="3"> The method of resolving OA brings marginal improvements. Since the method does not require any human annotation, Row 2 shows the best results we achieved in our experiments using unsupervised learning approaches.</Paragraph> <Paragraph position="4"> Factoid rules, although simple, bring substantial improvements.</Paragraph> <Paragraph position="5"> The Viterbi iterative training method does not turn out to be an effective way of resolving ambiguities in word segmentation or of detecting new words. In Rows 1 to 4, the word segmentation performance always saturates after two or three iterations, with little improvement. For example, FMM (Row 1) achieves an initial segmentation F-measure of 0.8771 and, after two iterations, saturates at 0.8773.</Paragraph> <Paragraph position="6"> The Viterbi iterative training is effective in boosting the precision of NER without greatly sacrificing recall (e.g., the recall remains almost the same when using the seed set of Row 5 in Table 17, and becomes a little worse when using the seed sets of Rows 6 to 8). As shown in Tables 18-20, we start with a series of seed sets of different sizes and, after two iterations, achieve a reasonable accuracy of NER that is comparable with that of MSRSeg.</Paragraph> <Paragraph position="7"> The use of a small NE-annotated seed set (e.g., in Row 5) achieves the best trade-off between performance and human effort, because after two iterations the accuracy of NER is very close to that obtained with larger seed sets, while the human effort of creating the seed set is much smaller.</Paragraph> </Section> </Section> <Section position="8" start_page="566" end_page="567" type="metho"> <SectionTitle> 8. System Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="566" end_page="567" type="sub_section"> <SectionTitle> 8.1 System Results </SectionTitle> <Paragraph position="0"> Our system is designed so that components such as the FT detector and the NE recognizer can be &quot;switched on or off&quot;, so that we can investigate the relative contribution of each component to the overall word segmentation performance. To date, we have not done a separate evaluation of MDW recognition; we leave that to future work.</Paragraph> <Paragraph position="1"> The main results are shown in Table 21. For comparison, we also include in the table (Row 1) the results of using the greedy segmenter (FMM) described in Section 7.</Paragraph> <Paragraph position="2"> Row 2 shows the baseline results of our system, where only the lexicon is used. It is interesting to find, in Rows 1 and 2, that the dictionary-based methods already achieve quite good recall, but their precision is not very good because they cannot correctly identify unknown words that are not in the lexicon, such as factoids and named entities. We also find that, even using the same lexicon, our approach based on the linear mixture models outperforms the greedy approach (by a slight but statistically significant difference) because the use of the context model resolves more ambiguities in segmentation. The most promising property of our approach is that the linear mixture models provide a flexible framework in which a wide variety of linguistic knowledge and statistical models can be combined in a unified way. As shown in Rows 3 to 6, when components are switched on in turn by activating the corresponding class models, the overall word segmentation performance increases consistently.</Paragraph> <Paragraph position="3"> We also conduct an error analysis, which shows that 85% of errors come from NER and factoid detection, especially NE abbreviations, although the tokens of these word types amount to only 8.3% of the MSR test set. The remaining 15% of errors are mainly due to new words.</Paragraph>
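All of the scores in this section are word-level precision, recall, and F-measure. As a reference for how such numbers are computed, here is a standard span-matching sketch (ours, not code from the paper): a predicted word counts as correct exactly when a word with the same character span appears in the gold-standard segmentation.

```python
def prf(gold_words, pred_words):
    """Word-level precision, recall, and F-measure for one segmented sentence
    (or a whole corpus, if the word lists are concatenated consistently)."""
    def spans(words):
        result, pos = set(), 0
        for w in words:
            result.add((pos, pos + len(w)))  # (start, end) character offsets
            pos += len(w)
        return result
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example: the system oversegments one gold word into two.
print(prf(["他", "来自", "北京"], ["他", "来", "自", "北京"]))
# -> (0.5, 0.666..., 0.571...): 2 of 4 output words and 2 of 3 gold words match.
```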
</Section> <Section position="2" start_page="567" end_page="567" type="sub_section"> <SectionTitle> 8.2 Comparison with Other Systems Using the MSR Test Set </SectionTitle> <Paragraph position="0"> We compare our system with three other Chinese word segmenters on the MSR test set:</Paragraph> <Paragraph position="1"> 1. The MSWS system is one of the best available products. It is released by Microsoft (as a set of Windows APIs). MSWS first conducts word breaking using MM (augmented by heuristic rules for disambiguation) and then conducts factoid detection and NER using rules.</Paragraph> <Paragraph position="2"> 2. The LCWS system is one of the best research systems in mainland China. It is released by Beijing Language University. The system works similarly to MSWS but has a larger dictionary containing more PNs and LNs.</Paragraph> <Paragraph position="3"> 3. The PBWS system is a rule-based Chinese parser (Wu and Jiang 2000) that can also output word segmentation results. It exploits high-level linguistic knowledge, such as syntactic structure, for Chinese word segmentation and NER.</Paragraph> <Paragraph position="4"> As mentioned earlier, to achieve a fair comparison, we compare the four systems only in terms of NER precision and recall and the number of OA errors. However, we find that, due to the different annotation specifications used by these systems, it is still very difficult to compare their results automatically. For example, 北京市政府 'Beijing city government' has been segmented inconsistently as 北京市/政府 'Beijing city' + 'government' or 北京/市政府 'Beijing' + 'city government', even within the same system. Worse still, some LNs tagged in one system are tagged as ONs in another. Therefore, we have to check the results manually. We picked 933 sentences at random, containing 22,833 words (including 329 PNs, 617 LNs, and 435 ONs), for testing. We also did not differentiate LNs and ONs in evaluation; that is, we only checked the word boundaries of LNs and ONs and treated the two tags as interchangeable. The results are shown in Table 22. We can see that on this small test set, MSRSeg achieves the best overall performance of NER and the best performance in resolving OAs.</Paragraph> </Section> <Section position="3" start_page="567" end_page="570" type="sub_section"> <SectionTitle> 8.3 Evaluations on Bakeoff Test Sets </SectionTitle> <Paragraph position="0"> Table 23 compares MSRSeg on four Bakeoff test sets with results reported previously. The layout of the table follows Peng, Feng, and McCallum (2004). SXX indicates participating sites in the 1st SIGHAN International Chinese Word Segmentation Bakeoff (Sproat and Emerson 2003). CRFs indicates the word segmenter reported in Peng, Feng, and McCallum (2004), which uses models of linear-chain conditional random fields (CRFs).
In Columns 2 to 5, entries contain the F-measure of each segmenter on different open runs, indicated by XXo, with the best performance in bold. Column Site-Avg is the average F-measure over the data sets on which a segmenter reported results of open runs, where a bolded entry indicates that the segmenter outperforms MSRSeg. Column Our-Avg is the average F-measure of MSRSeg over the same data sets, where a bolded entry indicates that MSRSeg outperforms the other segmenter. For completeness, we also include in Table 23 the results of closed runs, indicated by XXc. In a closed test, one can use only training material from the training data for the particular corpus being tested on; no other material is allowed (Sproat and Emerson 2003). Since MSRSeg uses the MSR corpus for training, our results are open tests.</Paragraph> <Paragraph position="1"> [Table 23 omitted; its columns are ASo, ASc, CTBo, CTBc, HKo, HKc, PKo, PKc, Site-Avg, and Our-Avg.]</Paragraph> <Paragraph position="2"> 15 It would be more informative if the comparison could be conducted in closed tests, which implies that the dictionaries and models of MSRSeg would have to be generated solely from the given training data. We leave this for future work.</Paragraph> <Paragraph position="3"> Several conclusions can be drawn from Table 23. First, for the same system, open tests generally achieve better results than closed tests because of the use of additional training material. Second, no single segmenter performs best on all four data sets. Third, MSRSeg achieves consistently high performance across all four data sets. For example, MSRSeg achieves better average performance than the other three segmenters that report results on all four data sets (i.e., S03, S11, and CRFs); in particular, MSRSeg outperforms them on every data set. Two segmenters achieve a better average F-measure than ours: one is S02, which reported results on CTB only; the other is S10, which reported results on CTB and PK. From these results, we conclude that MSRSeg is an adaptive word segmenter that achieves state-of-the-art performance on different data sets, corresponding to different domains and standards.</Paragraph> <Paragraph position="4"> As described in Section 2.1, most segmenters, including the ones in Table 23, can be roughly grouped into two categories: those that use a rule-based approach and those that use a statistical approach. MSRSeg is a hybrid system that takes advantage of both. Though rule-based systems (e.g., S08, S10, and S11 in Table 23) can achieve reasonably good results, they cannot effectively make use of increasingly large training data, and they are weak in unknown word detection and adaptation. Some statistical segmenters (e.g., S01 and S07 in Table 23) use generative models, such as HMMs, for Chinese word segmentation. However, it is very difficult to incorporate linguistic knowledge into the (approximated or assumed) generation process of Chinese sentences, on which those models are built.</Paragraph> <Paragraph position="5"> Discriminative models (e.g., the linear models in MSRSeg, whose component models are all derived from generative models but are combined using discriminatively trained weights) are free from this issue and provide a flexible mathematical framework for incorporating arbitrary linguistic knowledge. They do not assume any underlying generation process. Instead, they assume that the training and test sets are generated from the same distribution, but that the form of the distribution (i.e., the generative process) is unknown. If we view Chinese word segmentation as a classification problem, i.e., discriminating between &quot;good&quot; segmentations and &quot;bad&quot; ones, we may prefer discriminative models to generative models. Intuitively, it suffices to directly find the features that differentiate good segmentations from bad ones (as discriminative models do); it is not necessary to first estimate the distributions from which Chinese sentences (or segmentations) are generated and then use the estimated distributions to construct the desired features (as generative models do). As Vapnik (1998) points out: &quot;When solving a given problem, solve it directly and try to avoid solving a more general problem as an intermediate step.&quot; Our models are similar to the maximum entropy models in Xue (2003) and the CRFs in Peng, Feng, and McCallum (2004) in that all these models offer the flexibility to incorporate arbitrary features and can be discriminatively trained. Our models are novel in that many feature functions are derived from probabilistic or heuristic models inspired by source-channel models of Chinese sentence generation, as described in Section 4.3. Therefore, these feature functions are not only potentially more reasonable but also much more informative than, for instance, the binary features used in standard maximum entropy models in NLP.</Paragraph>
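To make the contrast concrete, the following sketch shows the general shape of such a discriminatively weighted linear combination; the particular feature functions named in the comments are illustrative assumptions about plausible components, not MSRSeg's actual implementation:

```python
def make_scorer(features, weights):
    """Score(s, w) = sum_d weights[d] * features[d](s, w): a linear combination
    of D+1 feature functions with discriminatively trained weights. Each feature
    is typically the log probability (or heuristic score) of one component model,
    e.g., f_0 = log P(w) from the trigram context model, f_1 = log P(s|w) from a
    class model such as the person-name model, f_2 = the number of factoids in w."""
    def score(s, w):
        return sum(lam * f(s, w) for lam, f in zip(weights, features))
    return score

# Decoding selects the candidate word-class sequence with the highest score:
#     best = max(candidates, key=lambda w: score(s, w))
# Setting a component's weight to zero "switches off" that component, which is
# one way to realize the per-component ablations of Section 8.1.
```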
<Paragraph position="6"> 16 It is interesting to see that, on AS, closed tests achieve better performance than open tests, albeit across different systems. This is because the training data of AS is much larger than that of the other three corpora, so the segmenters that apply statistical approaches, such as S09 (Xue 2003) and CRFs, have been well trained.</Paragraph> <Paragraph position="7"> We also notice that many segmenters (e.g., S03 and S04 in Table 23) separate unknown word detection from word segmentation. Though this makes the development of the segmenter easier, it seems to be a flawed solution in practice, as we discussed earlier. The benefits of integrating both tasks have also been shown empirically in</Paragraph> </Section> </Section> </Paper>