<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2209"> <Title>BLENDING SEGMENTATION WITH TAGGING IN CHINESE LANGUAGE CORPUS PROCESSING ~</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> BLENDING SEGMENTATION WITH TAGGING IN CHINESE LANGUAGE CORPUS PROCESSING ~ </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> This paper proposes a new method for Chinese language corpus processing. Unlike previous research, our approach has the following characteristics: it blends segmentation with tagging, and it integrates a rule-based approach with a statistics-based one in grammatical disambiguation. The principal ideas presented in the paper are incorporated in the development of a Chinese corpus processing system. Experimental results show that the overall accuracy is 97.68% for segmentation and 94.55% for tagging on a corpus of about 400,000 Chinese characters.</Paragraph> </Section> <Section position="3" start_page="0" end_page="7276" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Processing a Mandarin Chinese corpus involves several stages. Starting from the initial text corpus and proceeding through word segmentation, grammatical category tagging, syntactic analysis (bracketing), and semantic and pragmatic analysis, one can obtain corpora with different tags, such as segmentation tags, word categories, phrase categories and so on. In this paper, we focus on the first two stages, i.e. word segmentation and category (i.e. part of speech) tagging.</Paragraph> <Paragraph position="1"> Word segmentation is essential in Chinese information processing because there are no obvious delimiting markers between Chinese words except for some punctuation marks. Matching input characters against the lexical entries in a large dictionary is helpful in identifying the embedded words.
However, some ambiguous segmentation strings (ASSs) and unregistered words (i.e. words that are not registered in the dictionary) in the text will lower the segmentation accuracy. To resolve these problems, various knowledge sources might have to be consulted.</Paragraph> <Paragraph position="2"> In the past decade, two different methodologies have been used for word segmentation: some approaches are rule-based ([1-5]), while others are statistics-based ([6-8]). Many automatic word segmentation systems adopting the above models have been developed and significant results have been achieved. But these systems were developed only at the word level. They did not take large-scale corpus category tagging into account and lacked an objective evaluation of segmentation accuracy at the category level. The development of these automatic segmentation systems has therefore been restricted.</Paragraph> <Paragraph position="3"> Grammatical category tagging for the Chinese language is very difficult, because Chinese words are frequently ambiguous. One Chinese word can represent lexical items of different categories. Apart from this, unlike English and other Indo-European languages, Chinese has no inflections, and therefore there are no obvious morphological variations in Chinese text that help to distinguish one grammatical category from another.</Paragraph> <Paragraph position="4"> In some Chinese category tagging systems, statistics-based algorithms were used ([10-12]). The basic processing procedure of these systems is as follows. First, a tagged corpus was made through editing. Then, a dictionary containing category tagging entries and a matrix of category collocational probabilities were derived from the tagged corpus. Using these parameters, a probability model was built and category tagging was completed automatically.
Up to now, there have been no reports of a rule-based approach to Chinese language category tagging.</Paragraph> <Paragraph position="5"> Compared with the above research on segmentation and tagging, our method has the following new characteristics: First, it blends segmentation with tagging. We use a segmentation dictionary, in which every word is marked with its word category, to complete segmentation and initial tagging simultaneously. The category becomes a bridge linking segmentation and tagging.</Paragraph> <Paragraph position="6"> Second, it integrates a rule-based approach with a statistics-based approach in category tagging. It therefore inherits the advantages of the two approaches and overcomes their respective disadvantages.</Paragraph> <Paragraph position="7"> The following sections will discuss this method in detail.</Paragraph> <Paragraph position="8"> i The project is supported by the National Science Foundation of China. 2. Corpus processing blending segmentation with tagging In the practice of segmenting many Chinese sentences, we find that it is helpful to make use of word categories in automatic segmentation processing. In general, there are three advantages: 1). Using the category collocational relations of the different words in ASSs and the contextual word categories, one can resolve most segmentation ambiguities.</Paragraph> <Paragraph position="9"> As we know, there are two types of ASS: intersecting ASS (IASS) and combining ASS (CASS).</Paragraph> <Paragraph position="10"> An IASS S=ABC has two possible segmentations: AB+C and A+BC. It thus results in two category combinations: C_AB + C_C and C_A + C_BC. But the probabilities of their appearing in a given context are not the same.
Depending on their context and the difference between the two category collocational probabilities (P(C_AB|C_C) and P(C_A|C_BC)), we can select the correct segmentation.</Paragraph> <Paragraph position="11"> Sometimes a CASS S=AB can be segmented into two words, A+B, but occasionally it is a single word S. Since the CASS itself cannot provide the special information for correct segmentation, it is necessary to take the relation between it and its preceding or following word into consideration. In this sense, the categories of the words in the CASS and of those beside the CASS play a very important role.</Paragraph> <Paragraph position="12"> 2). It helps to compound new words by using Chinese word-formation theory. In Chinese, a word is composed of morphemes. The combination of morphemes has its special rules. These rules tell us which morphemes, and what kinds of morphemes, can be combined into a word. Using these rules, we can find some unregistered words and segment them correctly from a sentence. For example, typical word-compounding cases of nouns are: A). monosyllabic noun + monosyllabic noun</Paragraph> <Paragraph position="14"> From such word-compounding cases, we can sum up many useful word-formation rules that are based on category combination. Therefore, we will achieve a better segmentation effect in spite of using a smaller segmentation dictionary.</Paragraph> <Paragraph position="15"> 3). It helps to discover some segmentation errors. In Chinese sentences, the frequency of some category collocations is very low, such as d+n+$, v+u+d+$ and so on, where d is adverb, n is noun, v is verb, u is auxiliary, and $ is the ending mark of a sentence. Therefore, if such a category combination appears in a segmented sentence, we can be almost certain that the segmentation is wrong. In the following examples, there are such errors: i). mai/v le/u yitou/d niu/n ./w (buy -ed head cow; correct result: bought a cow) ii).
ta/r qiu/n da/v de/u zuihao/d ./w (he ball play Prt had-better; correct result: He plays basketball best.) Here, we can see that the category information provides a powerful means to check segmentation errors automatically.</Paragraph> <Paragraph position="16"> Based on all the above understandings, we proposed a method combining segmentation with tagging and used it in the practice of segmentation and category tagging on a large-scale Chinese language corpus. The basic processing procedures are: First, complete automatic segmentation by using a segmentation dictionary with word categories. At the same time, assign an initial tag (all possible categories for a word) to every segmentation unit.</Paragraph> <Paragraph position="17"> Second, carry out some basic word compounding, such as combining stems with affixes, combining overlapping (reduplicated) morphemes, integrating Chinese numeral words and so on.</Paragraph> <Paragraph position="18"> Third, implement automatic category tagging through grammatical category disambiguation and assign a single category tag to every word.</Paragraph> <Paragraph position="19"> Fourth, find and combine unregistered words which accord with Chinese word-formation rules and assign a suitable category to the combined new words.</Paragraph> <Paragraph position="20"> Fifth, check the category combinations in segmented sentences, find possible errors and then go back to the segmentation process.</Paragraph> <Paragraph position="21"> 3. The designing strategy of category tagging Compared with many past automatic category tagging systems ([10-12]), our current processing has some new properties. The basic ideas can be briefly summarized as follows: 1). It is based on a dictionary with word categories. In the current process, the initial category tagging is made by looking up the segmentation dictionary with word categories during segmentation.
The category is derived from the &quot;Grammar Knowledge Base for Chinese Words&quot; (GKBCW), which has been developed by the Institute of Computational Linguistics of Peking University over the past five years [13]. Since the information in the dictionary was provided by linguists who follow the standard of classification based on the distribution of grammatical functions [14], it is of high accuracy. Therefore, by applying this information to initial category tagging, the coherence and reliability of the tagging results can be guaranteed. This lays a good foundation for the subsequent disambiguation processing.</Paragraph> <Paragraph position="22"> 2). Use a small tag set. In our current system, category tagging is restricted to the basic category descriptions, i.e. 26 categories. Meanwhile, in order to keep the new information found during manual proofreading, such as proper names, proper addresses, and so on, we add several subcategories: ng (proper noun), ngp (proper noun for a person), Ng (noun morpheme), Ag (adjective morpheme) and Vg (verb morpheme). All these categories and subcategories form a tag set of 31 tags.</Paragraph> <Paragraph position="23"> A small tag set can help us concentrate on the ambiguous words that appear most frequently in a sentence. Therefore, the processing complexity can be reduced and the tagging accuracy improved.</Paragraph> <Paragraph position="24"> 3). Form a stereo knowledge base by combining tagged results with the information in the dictionary. Although our tag set is small, we can easily expand the tag set for different applications by linking with the GKBCW, because in our GKBCW each category has many features, which were proposed by linguists. These features help to describe the grammatical functions and distributions of every category completely. For example, the verb category has about forty features, and the noun category has twenty-five features ([13]).
In general, these grammatical features are also one kind of information for classification.</Paragraph> <Paragraph position="25"> If we use the word and its basic category in the tagged corpus as a keyword to look up the GKBCW, we can get the detailed grammatical features of each word. Therefore, taking all tagged words as a plane and the grammatical features of every word as the depth, we obtain a stereo knowledge base. According to different needs, we can assign different grammatical categories or subcategories to the words in the corpus by using the grammatical features in the knowledge base. In addition, using the stereo knowledge base, we can also analyse the phrase structure of sentences in the corpus.</Paragraph> <Paragraph position="26"> 4). Integrate the rule-based approach with the statistics-based approach in disambiguation. Because the rule-based approach and the statistics-based approach have their respective advantages, we tried to integrate them in our category tagging system. Our method is as follows. First, through statistical analysis (manual or automatic) of a large-scale corpus, find the most frequent ambiguous phenomena, study their contexts, and extract contextual frame rules to eliminate the most frequent and comparatively simple ambiguities. Then, using the parameters trained on a correctly tagged corpus, build a probability model to disambiguate the ambiguous category combinations of lower frequency and to deduce the categories of unregistered words.</Paragraph> <Paragraph position="27"> In actual processing, however, we place different emphasis on these two approaches at different stages. At first, because there was no large-scale corpus tagged with correct categories, a small-scale corpus had to be tagged using the rule-based approach, and its remaining ambiguities and tagging errors were corrected manually. After statistical analysis of the correctly tagged corpus, the rule base was adjusted and some trained parameters were obtained.
Then some new sentences were added to the old corpus to form a new middle-scale corpus. Using the newly adjusted rules and trained parameters, the new corpus was tagged through both the rule-based approach and the statistics-based approach. In this way, the scale of the corpus was increased gradually like a snowball. As the corpus grew, the descriptions of the rules became more and more accurate and the statistical information became more and more comprehensive. The manual proofreading work therefore decreases drastically. As a result, the best integration of these two approaches was achieved.</Paragraph> <Paragraph position="28"> frequently, especially the monosyllabic words, such as &quot;yi&quot;, &quot;zhe&quot;, &quot;le&quot;, &quot;guo&quot;, &quot;ba&quot;, &quot;lai&quot;, &quot;hao&quot;, &quot;jiu&quot;, and so on. For these words, we set some special disambiguation rules, which describe the different contexts in which these words take different categories. Therefore, the category of such a word in a sentence can be determined easily. This is word-oriented disambiguation.</Paragraph> <Paragraph position="29"> 2). Disambiguation against special multi-tags. According to statistical analysis, some multi-tag combinations, such as v-q, p-v, v-n, q-n, v-d, a-v and so on, appear frequently in the corpus. In order to construct the disambiguation rules for these multi-tag combinations, the probability that one particular tag is selected from a multi-tag set in different contexts is counted. At the same time, the grammatical function features of the categories, especially the distribution information which distinguishes one category from the others, are summed up and extracted. Then the ambiguities can be eliminated by these rules. This is multi-tag-oriented disambiguation.</Paragraph> <Paragraph position="30"> 3). Disambiguation by context constraint. The approach applies a set of context frame rules.
Each rule, when its context is satisfied, has the effect of deleting one or more candidates from the list of possible tags for a word. If the number of candidates is reduced to one, disambiguation is considered successful. This is frame-oriented disambiguation.</Paragraph> <Paragraph position="31"> 4.2. Statistics-based approach Formally, the statistical scheme can be described as follows: Let W=W1...Wn be a span of ambiguous words in a sentence, where W1 and Wn are unambiguous, and let C=C1...Cn be a possible tag sequence for the span, where Ci is a category of Wi. P(C|W) is the conditional probability of C given W.</Paragraph> <Paragraph position="32"> Therefore, the goal of disambiguation is equivalent to finding the category sequence C' with the largest score P(C'|W), i.e. C' = argmax_C P(C|W).</Paragraph> <Paragraph position="34"> Computing the above formula with a bigram model, we get:</Paragraph> <Paragraph position="36"> of two probabilities can be calculated from the trained parameters.</Paragraph> <Paragraph position="37"> During actual processing, the category of an unregistered word is deduced first. Let Cu be the possible tag set for the unregistered word, Cl the tag of its left word and Cr the tag of its right word. q' is the set of all tags in the corpus. Then Cu={C1, C2}, where:</Paragraph> <Paragraph position="39"> So the unregistered-word phenomenon is turned into a category ambiguity problem.</Paragraph> <Paragraph position="40"> For a span of ambiguous words (bounded by unambiguous words), if we arrange the different tags of every word vertically and the different words horizontally, we form a directed chart whose nodes are labelled with P(Wi|Ci) and whose arcs are labelled with P(Ci|Cj). Using the VOLSUNGA algorithm ([9]) to find the best path in the directed chart, we complete the automatic category disambiguation.</Paragraph> <Paragraph position="41"> 5.
Experimental results and future work A segmentation and tagging system was built based on the method described above. The programs of the system are written in C. Using this system, a verb usage corpus of about 400,000 Chinese characters, or 300,000 Chinese words, was segmented and tagged. The test results are: segmentation accuracy 97.68%, tagging accuracy 94.55%.</Paragraph> <Paragraph position="42"> Some of the better results of previous segmentation and tagging systems are: about 99% segmentation accuracy on 150,000 Chinese characters ([5]) and 94.82% tagging accuracy in a closed test on 150,000 words of tagged corpus ([12]). Compared with these systems, the result of our system is promising.</Paragraph> <Paragraph position="43"> In our future research, we will try to further improve our method and add some new functions to our segmentation and tagging system, such as unregistered word deduction during segmentation, identity management in the knowledge base, and belief-degree analysis of tagging results. Then we will extend the scale of our corpus to about five million words.</Paragraph> <Paragraph position="44"> In addition, we will pursue research on Chinese phrase structure analysis and try to tag phrase categories in the corpus. We hope the work will be helpful for the study of Mandarin Chinese grammar.</Paragraph> </Section></Paper>
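<!--
Editorial illustration for Section 4.2, kept in a comment so the XML structure is not affected. The best-path search over the tag chart can be sketched roughly as below. This is a minimal Viterbi-style dynamic program, not the VOLSUNGA implementation used by the authors (their system is written in C). It assumes the bigram score of a tag sequence is the product of node scores P(Wi|Ci) and arc scores P(Ci|Ci-1), which is one reading of the chart description because the formula itself is missing from this extraction. The function name, the smoothing floor and all probabilities are hypothetical.

```python
import math


def best_tag_path(words, candidates, emit, trans):
    """Return the highest-scoring tag sequence for a span of words.

    words      : list of words; the first and last are assumed unambiguous
    candidates : dict mapping word to its list of possible tags
    emit       : dict mapping (word, tag) to P(word | tag)
    trans      : dict mapping (prev_tag, tag) to P(tag | prev_tag)
    """
    floor = 1e-12  # assumed floor for unseen events (not from the paper)

    def log_p(table, key):
        return math.log(table.get(key, floor))

    # score[t] = best log-probability of any path that ends in tag t
    score = {t: log_p(emit, (words[0], t)) for t in candidates[words[0]]}
    back = []  # one dict per later position: tag to best predecessor tag
    for word in words[1:]:
        new_score, pointers = {}, {}
        for tag in candidates[word]:
            prev = max(score, key=lambda p: score[p] + log_p(trans, (p, tag)))
            new_score[tag] = (score[prev]
                              + log_p(trans, (prev, tag))
                              + log_p(emit, (word, tag)))
            pointers[tag] = prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final tag to recover the whole path.
    tag = max(score, key=score.get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))


# Toy span with hypothetical numbers: "da" is ambiguous between v and q.
words = ["ta", "da", "qiu"]
candidates = {"ta": ["r"], "da": ["v", "q"], "qiu": ["n"]}
emit = {("ta", "r"): 0.1, ("da", "v"): 0.02,
        ("da", "q"): 0.01, ("qiu", "n"): 0.005}
trans = {("r", "v"): 0.4, ("r", "q"): 0.01,
         ("v", "n"): 0.3, ("q", "n"): 0.5}
result = best_tag_path(words, candidates, emit, trans)
print(result)
```

With these toy numbers the path r, v, n scores 1.2e-6 against 2.5e-8 for r, q, n, so the call returns ['r', 'v', 'n']. Working in log space avoids underflow on long spans; the cost grows linearly with the span length, which is the property the paper relies on when it cites VOLSUNGA.
-->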