<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0134"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Pragmatic Chinese Word Segmentation System</Title> <Section position="4" start_page="0" end_page="189" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> All the words in our system are categorized into five types: Lexicon words (LW), Factoid words (FT), Morphologically derived words (MDW), Named entities (NE), and New words (NW).</Paragraph> <Paragraph position="1"> Figure 1 demonstrates our system structure.</Paragraph> <Paragraph position="2"> The input character sequence is converted into one or several sentences, which is the basic dealing unit. The &quot;Basic Segmentation&quot; is used to identify the LW, FT, MDW words, and &quot;Named Entity Recognition&quot; is used to detect NW words.</Paragraph> <Paragraph position="3"> We don't adopt the New Word detection algorithm in our system in this bakeoff. The &quot;Disambiguation&quot; module performs to classify complicated ambiguous words, and all the above results are connected into the final result, which is denoted by XML format.</Paragraph> <Section position="1" start_page="0" end_page="189" type="sub_section"> <SectionTitle> 2.1 Trigram and Smoothing Algorithm </SectionTitle> <Paragraph position="0"> We apply the trigram model to the word segmentation task (Jiang 2005A), and make use of Absolute Smoothing algorithm to overcome the sparse data problem.</Paragraph> <Paragraph position="1"> Trigram model is used to convert the sentence into a word sequence. Let w = w word sequence, then the most likely word sequence w* in trigram is:</Paragraph> <Paragraph position="3"> represents LW or a type of FT or MDW. In order to search the best segmentation way, all the word candidates are filled in the word lattice (Zhao 2005). And the Viterbi algorithm is used to search the best word segmentation path.</Paragraph> <Paragraph position="4"> FT and MDW need to be detected when constructing word lattice (detailed in section 2.2). The data structure of lexicon can affect the efficiency of word segmentation, so we represent lexicon words as a set of TRIEs, which is a tree-like structure. Words starting with the same character are represented as a TRIE, where the root represents the first Chinese character, and the children of the root represent the second characters, and so on (Gao 2004).</Paragraph> <Paragraph position="5"> When searching a word lattice, there is the zero-probability phenomenon, due to the sparse data problem. For instance, if there is no cooccurence pair &quot;Wo Men /Chi /Xiang Jiao &quot;(we eat bananas) in the training corpus, then P(Xiang Jiao |Wo Men ,Chi ) = 0. According to formula (1), the probability of the whole candidate path, which includes &quot;Wo Men /Chi / Xiang Jiao &quot; is zero, as a result of the local zero probability. In order to overcome the sparse data problem, our system has applied Absolute Dis-</Paragraph> <Paragraph position="7"> is meant to evoke the number of words that have one or more counts, and the * is meant to evoke a free variable that is summed over. The function ()c represents the count of one word or the cooccurence count of multiwords. In this case, the smoothing probability</Paragraph> <Paragraph position="9"/> <Paragraph position="11"> Because we use trigram model, so the maximum n may be 3. A fixed discount D (0[?]D[?]1) can be set through the deleted estimation on the training data. 
<Paragraph position="4"> After the basic segmentation, some complicated ambiguous segmentations can be further disambiguated. In the trigram model, only the previous two words are considered as context features, while in disambiguation processing we can use a Maximum Entropy model that fuses more features (Jiang 2005B) or a rule-based method.</Paragraph> </Section> <Section position="2" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 2.2 Factoid and Morphological words </SectionTitle> <Paragraph position="0"> All factoid words can be represented as regular expressions, so the detection of factoid words can be achieved with a finite state automaton (FSA).</Paragraph> <Paragraph position="1"> In our system, the categories of factoid words shown in table 1 can be detected.</Paragraph> <Paragraph position="2"> In a deterministic FSA (DFA), a unique &quot;next state&quot; is determined given an input symbol and the current state. However, it is more natural for a linguist to write rules, which can be represented directly as a non-deterministic FSA (NFA), i.e. one which allows several &quot;next states&quot; to follow a given input and state.</Paragraph> <Paragraph position="3"> Since every NFA has an equivalent DFA, we build an FT rule compiler to convert all the FT generative rules into a DFA, e.g.</Paragraph> <Paragraph position="4"> <digit> -> [0..9];
<year> ::= <digit>{<digit>+} Nian;
<integer> ::= {<digit>+};
where &quot;->&quot; is a temporary generative rule, and &quot;::=&quot; is a real generative rule.</Paragraph>
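As an illustration of how generative rules like these can be applied (regular expressions and finite state automata being equivalent in recognition power), the sketch below matches the <year> and <integer> factoids with Python regular expressions. The rule-to-pattern mapping, the ASCII digit class, and the romanized token "Nian" standing in for the year character are simplifying assumptions, not the authors' FT rule compiler, which builds a true DFA.

```python
import re

# Hypothetical compiled forms of the FT generative rules in section 2.2:
#   <digit>   -> [0..9]
#   <year>    ::= <digit>{<digit>+} Nian      (e.g. "1998Nian")
#   <integer> ::= {<digit>+}
FT_PATTERNS = [
    ("YEAR",    re.compile(r"\d+Nian")),   # one or more digits followed by Nian
    ("INTEGER", re.compile(r"\d+")),
]

def detect_factoids(sentence):
    """Scan left to right, preferring the first (more specific) pattern,
    much as a longest-match DFA built from these rules would."""
    spans, i = [], 0
    while i < len(sentence):
        for ft_type, pattern in FT_PATTERNS:
            m = pattern.match(sentence, i)
            if m:
                spans.append((m.start(), m.end(), ft_type))
                i = m.end()
                break
        else:
            i += 1
    return spans

print(detect_factoids("1998Nian chan liang 120 dun"))
# [(0, 8, 'YEAR'), (20, 23, 'INTEGER')]
```

Because YEAR is tried before INTEGER, "1998Nian" is recognized as a single year factoid rather than an integer followed by a suffix, mirroring the longest-match behavior expected when the rules are compiled into one DFA.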
<Paragraph position="5"> As for morphologically derived words, we removed the corresponding processing module, because the word segmentation definition of our system adopts the PKU standard.</Paragraph> </Section> </Section> <Section position="5" start_page="189" end_page="191" type="metho"> <SectionTitle> 3 Named Entity Recognition </SectionTitle> <Paragraph position="0"> We adopt the Maximum Entropy model to perform Named Entity Recognition. The extensive evaluations of NER systems in recent years (such as CoNLL-2002 and CoNLL-2003) indicate that the best statistical systems are typically achieved by using a linear (or log-linear) classification algorithm, such as the Maximum Entropy model, together with a vast amount of carefully designed linguistic features. This still seems to be true at present for statistics-based methods.</Paragraph> <Paragraph position="1"> The Maximum Entropy (ME) model is defined over H x T, where H is the set of possible contexts around the target word to be tagged, and T is the set of allowable tags, such as B-PER, I-PER, B-LOC, I-LOC, etc. in our NER task. The model's conditional probability is then defined as

p(t \mid h) = \frac{\exp\left( \sum_{j} \lambda_j f_j(h, t) \right)}{\sum_{t' \in T} \exp\left( \sum_{j} \lambda_j f_j(h, t') \right)}

where h is the current context, t is one of the possible tags, and the f_j are feature functions with weights \lambda_j.</Paragraph> <Paragraph position="2"> Several typical kinds of features can be used in an NER system. They usually include context features, entity features, and additional resources.</Paragraph> <Paragraph position="3"> Table 2 shows the context feature templates.</Paragraph> <Paragraph position="4"> Although we only list the local feature templates here, some other feature templates, such as long-distance dependency templates, are also helpful to NER performance. These trigger features can be collected with the Average Mutual Information or Information Gain algorithms, etc.</Paragraph>
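Table 2 itself is not reproduced in this extract, so the templates in the sketch below (word unigrams and bigrams in a window of two around the current position, plus the previous tag) are assumed for illustration only; the sketch simply shows how such local templates turn a tagging context h into the binary features f_j(h, t) that the ME model weights. The example sentence and names are hypothetical.

```python
def context_features(words, tags, i):
    """Sketch of local context feature extraction for an ME tagger.
    The +/-2 word window and previous-tag templates are illustrative
    assumptions; the actual templates are those listed in table 2."""
    def w(k):
        j = i + k
        return words[j] if 0 <= j < len(words) else "<PAD>"

    prev_tag = tags[i - 1] if i > 0 else "<BOS>"
    return [
        f"W-2={w(-2)}", f"W-1={w(-1)}", f"W0={w(0)}", f"W+1={w(1)}", f"W+2={w(2)}",
        f"W-1W0={w(-1)}|{w(0)}", f"W0W+1={w(0)}|{w(1)}",
        f"T-1={prev_tag}",
    ]

# Hypothetical word-segmented sentence, tagged left to right so far.
words = ["Jiang", "ZeMin", "ZhuXi", "FangWen", "MeiGuo"]
tags  = ["B-PER", "I-PER", "O", "O"]
print(context_features(words, tags, 4))   # features for the word "MeiGuo"
```

Each returned string corresponds to one instantiated template; during decoding, the model scores every candidate tag t against these active features via formula above.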
<Paragraph position="5"> Besides context features, entity features are another important factor, such as the suffixes of location or organization names. Eight kinds of dictionaries are usually useful (Zhao 2006), for example a rare-character dictionary (containing characters such as Bi, Dong, and Hao). In addition, some external resources may also improve NER performance; for example, we collected a large number of entities from the Chinese Daily Newspaper in 2000 and compiled entity features from them.</Paragraph> <Paragraph position="6"> However, our system is based on the Peking University (PKU) word segmentation definition and the PKU NER definition, so we only used the basic features in table 2 in this bakeoff. Another factor is the corpus: our system is trained on the 1998 Chinese People's Daily corpus, which conforms to the PKU NER definition. In section 4, we report our system's performance with the basic features on the Chinese People's Daily corpus.</Paragraph> <Section position="1" start_page="190" end_page="191" type="sub_section"> <SectionTitle> 4.1 The Evaluation in Word Segmentation </SectionTitle> <Paragraph position="0"> The performance of our system in the third bakeoff is presented in table 4 in terms of recall (R), precision (P) and F score, in percentages. The scoring software is the standard tool released by SIGHAN.</Paragraph> <Paragraph position="1"> The in-vocabulary recall (R_iv) in the closed test and the open test is 99.1% and 98.9%, respectively. This good performance owes to the class-based trigram with absolute discounting smoothing and the word disambiguation algorithm.</Paragraph> <Paragraph position="2"> In our system, the open test performs better than the closed test for the following reasons: (1) The Named Entity Recognition module is added to the open test system, and named entities, including PER, LOC and ORG, account for most of the out-of-vocabulary words. (2) The closed test system can only use a dictionary collected from the given training corpus, while the open test system can use a better dictionary, which also includes the words in the MSRA training corpus of SIGHAN 2005. The dictionary is one of the important factors affecting performance, because the LW candidates in the word lattice are generated from the dictionary.</Paragraph> <Paragraph position="3"> To measure the dictionary effect, we compared the two dictionaries collected from SIGHAN 2005 and SIGHAN 2006, evaluating both on the SIGHAN 2005 MSRA closed test.</Paragraph> <Paragraph position="4"> There are fewer training sentences in SIGHAN 2006; as a result, performance decreases by at least 1.2%. This result indicates that the dictionary has an important impact on our system. Table 5 (MSRA test in SIGHAN 2005, %) gives our system's performance in the second bakeoff for a brief comparison. Comparing table 4 with table 5, we find that the OOV rate is 3.4% in the third bakeoff, which is higher than in the last bakeoff; obviously, this is one of the reasons affecting our performance. In addition, for pragmatic reasons our system was simplified in several ways; for instance, we removed the new word detection algorithm and there is no morphological word detection.</Paragraph> </Section> <Section position="2" start_page="191" end_page="191" type="sub_section"> <SectionTitle> 4.2 Named Entity Recognition </SectionTitle> <Paragraph position="0"> In the MSRA NER open test, our NER system is trained on the first six months of the 1998 Chinese People's Daily corpus, which was annotated by Peking University. Table 6 shows the NER performance in the MSRA open test.</Paragraph> <Paragraph position="1"> As a result of insufficient preparation for the bakeoff, our system was only trained on the Chinese People's Daily corpus, in which NER is defined according to the PKU standard. However, the MSRA NER definition differs from the PKU one; e.g. in MSRA, &quot;Zhong Hua /LOC Min Zu &quot; and &quot;Ma /PER Lie /PER Zhu Yi &quot; contain entities, while they are not entities in PKU. So the training corpus becomes a main handicap that decreases the performance of our system, and it also explains the large difference between the recall and the precision in table 6. Table 7 gives the evaluation of our NER system on the Chinese People's Daily corpus, trained on the first five months and tested on the sixth month. We also use the feature templates in table 2, in order to allow comparison with table 6.</Paragraph> <Paragraph position="2"> This experiment indicates that our system can achieve good performance if the test corpus and the training corpus satisfy the independent and identically distributed assumption.</Paragraph> </Section> <Section position="3" start_page="191" end_page="191" type="sub_section"> <SectionTitle> 4.3 Analysis and Discussion </SectionTitle> <Paragraph position="0"> Some points need to be considered further: (1) The dictionary has a big impact on performance, as the LW candidates come from the dictionary. However, a big dictionary can be easily acquired in real applications.</Paragraph> <Paragraph position="1"> (2) Due to technical problems and insufficient preparation, we used the PKU NER definition, which is not unified with the MSRA definition.</Paragraph> <Paragraph position="2"> (3) Our NER system is a word-based model, and we have found that word segmentation with two different dictionaries can have a big impact on NER performance.</Paragraph> <Paragraph position="3"> (4) We removed the new word recognition algorithm from our system. We should instead explore the real annotated corpora and add a new word detection algorithm if it has a positive effect; e.g. &quot;He Hua Jiang &quot; (lotus prize) can be recognized as one word by a conditional random fields model.</Paragraph> </Section> </Section> </Paper>