<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1018"> <Title>Common words Candidates for Chinese Place Names Candidates for Chinese Personal Names</Title> <Section position="2" start_page="119" end_page="501" type="metho"> <SectionTitle> 2. The Complexity of the Task </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="119" end_page="121" type="sub_section"> <SectionTitle> Combinatorial Explosion 1: Word Segmentation Candidate Space </SectionTitle> <Paragraph position="0"> The number of possible segmentations for some sentences may be rather large. Observe: (4) [Chinese example sentence; characters not recoverable from text extraction] ...</Paragraph> <Paragraph position="1"> A total of 76 possible segmentations will be found if we simply match the sentence against a dictionary:</Paragraph> <Paragraph position="3"> Fig.1 shows the word segmentation candidate space for sentence (4).</Paragraph> <Paragraph position="4"> The situation becomes even more complicated when unknown words are taken into consideration (Fig.2). Generally, segmentation ambiguities can be classified into three categories: (a) ambiguities among common words (refer to all arcs in Fig.
1); (b) ambiguities among unknown words (see the arcs representing candidates for Chinese place names and for Chinese personal names in Fig.2); (c) ambiguities between common words and unknown words (see the arcs across the three Chinese personal name candidates and the arc across the common word meaning &quot;love, like&quot; in Fig.2; the characters themselves are not recoverable from the text extraction). In our experience, ambiguities of type (a) cause about a 3% loss in segmentation precision under the maximal matching strategy, one of the most popular methods employed in word segmentation systems, while types (b) and (c) cause about a 10.0% loss if the processing of unknown words is ignored (unfortunately, types (b) and (c) have received less attention than type (a) in the literature).</Paragraph> <Paragraph position="6"> We will get 1296 possible tag sequences solely for seg(76) of sentence (4) (Fig.3).</Paragraph> <Paragraph position="7"> Combinatorial Explosion 1 x Combinatorial Explosion 2: We find through experiments that word segmentation and POS tagging mutually interact; the performance of both increases if they are integrated\[18\]. Researchers have tried to do so before. The method reported in \[11\] is: (a) finding the N-best segmentation candidates explicitly in terms of word frequency and length; (b) POS tagging each of the N-best segmentation candidates, yielding the N-best tag sequences accordingly; and (c) using a score with weighted contributions from (a) and (b) to select the best solution. Note that the model used in (a) is just a word unigram, and that (a) and (b) are done successively (denoted &quot;(a)+(b)&quot;). It is a kind of pseudo-integration.
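The scale of the problem is easy to reproduce: enumerating dictionary segmentations alone grows exponentially with sentence length, before tagging multiplies the space further. A minimal sketch, with a hypothetical ASCII lexicon standing in for the Chinese dictionary:

```python
def all_segmentations(sentence, lexicon):
    """Enumerate every way to split `sentence` into in-lexicon words.
    The number of results grows exponentially with sentence length,
    which is exactly the combinatorial explosion of the candidate space."""
    if not sentence:
        return [[]]
    results = []
    for i in range(1, len(sentence) + 1):
        head = sentence[:i]
        if head in lexicon:
            for tail in all_segmentations(sentence[i:], lexicon):
                results.append([head] + tail)
    return results

# Toy lexicon: even the 3-character string "abc" has 4 segmentations
# (a|b|c, a|bc, ab|c, abc); tag assignment then multiplies each one.
lexicon = {"a", "b", "c", "ab", "bc", "abc"}
print(all_segmentations("abc", lexicon))
```

With each word carrying several candidate POS tags, every one of these segmentations expands again into many tag sequences, which is how a single sentence reaches figures like 76 segmentations and 1296 tag sequences.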
A more genuine integration, in our view, should be: (a) taking all segmentation possibilities into account; (b) expanding every segmentation candidate of the input sentence into a number of tag sequences one by one, deriving a considerably large segmentation-and-tagging candidate space; and (c) seeking the optimal path over that space with a bigram model, obtaining both the word segmentation and the POS tagging result from the path found. In this case, (a) and (b) are done simultaneously (denoted &quot;(a)||(b)&quot;). We regard this as the basic strategy and testbed for constructing our system. Obviously, a much more serious combinatorial problem is encountered here.</Paragraph> <Paragraph position="8"> 3. CSeg&Tag1.0: System Architecture and</Paragraph> </Section> <Section position="2" start_page="121" end_page="121" type="sub_section"> <SectionTitle> Algorithm Design </SectionTitle> <Paragraph position="0"> Although the Chinese information processing community has devoted great effort to related research over the last decade, we still do not have a practical word segmenter and POS tagger at hand.</Paragraph> <Paragraph position="1"> What is the problem? The crucial reason, we believe, lies in &quot;knowledge&quot;. As indicated in section 2, we face a very serious difficulty; without relevant knowledge, even human beings would definitely fail to solve it. The focus of research should no longer be solely on 'pure' or 'new' formal algorithms, whatever they may be; instead, what is urgently required is work on two issues: (1) what sorts of knowledge, and how much, are needed; and (2) how these various kinds of knowledge can be represented, extracted, and cooperatively managed in a system.</Paragraph> <Paragraph position="2"> This is also the philosophy in designing CSeg&Tag1.0, an integrated system for Chinese word segmentation and POS tagging, which is being developed at the National Key Lab.
of Intelligent Technology and Systems, Tsinghua University. The aim of CSeg&Tag is to be able to process unrestricted running texts. Fig.4 gives its architecture.</Paragraph> <Paragraph position="3"> Roughly speaking, CSeg&Tag1.0 can be viewed as a three-level multi-agent system (here an &quot;agent&quot; is an entity that can make decisions independently and communicate with others) plus some other necessary mechanisms. They are: (1) agents at the low level for treating unknown words; (2) a competition agent at the intermediate level for resolving conflicts among the low-level agents; (3) a bigram-based agent at the high level for coping with all the remaining ambiguities; (4) mechanisms employing so-called &quot;global statistics&quot; and &quot;local statistics&quot; (a cache); and (5) a rule base. We will introduce them briefly in turn (a detailed discussion of each part is beyond the scope of this paper).</Paragraph> <Paragraph position="4"> 3.1. Agents at the Low Level for Treating</Paragraph> </Section> <Section position="3" start_page="121" end_page="501" type="sub_section"> <SectionTitle> Unknown Words </SectionTitle> <Paragraph position="0"> The types of unknown words CSeg&Tag1.0 currently handles include Chinese personal names (CN), transliterated foreign personal names (TFN) and Chinese place names (CPN). They cannot be enumerated in any dictionary, however large.
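The three-level organization just described can be sketched as a pipeline. The agent behaviors and the belief-based competition rule below are simplified stand-ins assumed for illustration, not the actual CSeg&Tag1.0 logic:

```python
def run_pipeline(sentence, low_agents, compete, resolve):
    """Three levels: low-level agents propose unknown-word candidates,
    a competition agent settles conflicts among them, and a high-level
    agent resolves whatever ambiguity remains."""
    candidates = [c for agent in low_agents for c in agent(sentence)]
    survivors = compete(candidates)
    return resolve(sentence, survivors)

# Hypothetical stand-ins; each candidate is (type, start, end, belief).
cn_agent = lambda s: [("CN", 0, 3, 0.8)]
cpn_agent = lambda s: [("CPN", 1, 3, 0.6)]

def compete(cands):
    """Keep the highest-belief candidate of any overlapping pair."""
    kept = []
    for c in sorted(cands, key=lambda c: -c[3]):
        if all(c[1] >= k[2] or k[1] >= c[2] for k in kept):
            kept.append(c)
    return kept

resolve = lambda s, survivors: survivors  # bigram agent elided in this sketch
print(run_pipeline("input", [cn_agent, cpn_agent], compete, resolve))
```

Here the overlapping CN and CPN guesses conflict, and the competition level keeps only the higher-belief one before handing off to the high level.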
The difficulty of identifying unknown words in Chinese arises from their characteristics: (a) no explicit hint, such as capitalization in English, exists to signal the presence of unknown words, and the character sets used for unknown words are strict subsets of the Chinese character set (the complete Chinese character set contains 6763 characters; the CPN character set, for example, contains 2595); (b) the length of unknown words may vary arbitrarily; (c) some characters used in unknown words may also be used as mono-syllabic common words in texts; (d) the mono-syllabic words in (c) fall into the syntactic categories of not only notional words but also function words; (e) the character sets intersect one another to some extent; (f) some multi-syllabic words may occur inside unknown words.</Paragraph> <Paragraph position="1"> In our system, three agents, CNAgent, TFNAgent and CPNAgent, are set up to be responsible for finding the corresponding candidates in input texts. A candidate can be regarded as a &quot;guess&quot; with a belief value. In general, three steps are involved in all three agents: pre-processing, finding candidates over the resulting character fragments, and further determining candidate boundaries. There are two strategies for seeking candidates in the input sentence. One is simply to view it as a character string and find candidates over the whole of it in terms of the relevant character set: [architecture diagram residue from Fig.4: Input Text, SentToBeSeg, MainDic, DomainDic, Agents at Low Level]</Paragraph> <Paragraph position="3"/> Much noise will be introduced unnecessarily, as with CN2 and CN3 in (5a).
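This first strategy can be sketched as follows. The romanized surname and given-name character sets are invented for illustration (the real sets are subsets of the Chinese character set, derived from name banks), and the over-generation is visible in the sketch, just as with CN2 and CN3 in (5a):

```python
# Hypothetical romanized character sets standing in for the real ones.
SURNAMES = {"zhao", "qian", "sun"}
GIVEN = {"wei", "min", "hua"}

def cn_candidates(chars):
    """Emit every span that looks like a surname followed by one or two
    given-name characters. Every match is kept, so characters that also
    serve as common words yield spurious candidates (the noise above)."""
    spans = []
    for i, ch in enumerate(chars):
        if ch in SURNAMES:
            for j in (i + 2, i + 3):  # personal names of length 2 or 3
                if j > len(chars):
                    break
                if all(c in GIVEN for c in chars[i + 1:j]):
                    spans.append((i, j))
    return spans

print(cn_candidates(["zhao", "wei", "min", "le"]))
```

Both the length-2 and length-3 guesses survive here; deciding between them is exactly the boundary-determination and competition work described next.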
Another way is to view the input as a word string, applying MM (maximal matching) segmentation as a pre-processing step first and then trying to find candidates only over the fragments composed of successive single characters:</Paragraph> <Paragraph position="5"> Obviously, the word meaning (platinum) should be drawn back and added into the TFN candidate (the character itself is not recoverable from the text extraction). Such multi-syllabic words can be collected from the name banks.</Paragraph> <Paragraph position="6"> Step 3: Further determining the boundaries of the candidates. All of the useful information, usually language-specific and unknown-word-type-specific, is activated to perform this work.</Paragraph> <Paragraph position="7"> Internal information: (i) statistical information: each candidate is assigned a belief according to statistics derived from the banks. (ii) structural information (the nature of characters): absolute closure characters for CNs definitely belong to a Chinese personal name once they fall into the control domain of a surname; relative closure characters for CNs function as absolute closure characters under certain conditions.</Paragraph> <Paragraph position="9"> For this sort of character, both the possibility of being included in a name and that of being excluded from it must be preserved. The candidates given independently by the three agents may contradict each other on some occasions (see Fig.2). We observe from 497 randomly selected sentences that the low-level agents generate multiple (>=2) unknown word candidates in 17.7% of them (Fig.5), and that the probability of a conflict is about 88% if the candidate number is 2 and 100% if it is greater than 2 (Fig.6). A competition agent is established to deal with such conflicts. The evaluation is based on all information from the various resources. Its output, including correct candidates and some unresolved conflicts, is then sent to a high-level agent for further processing.</Paragraph> <Paragraph position="10"> 3.3.
The Bigram-based Agent at the High Level for Coping with All the Remaining Ambiguities. The conventional POS bigram model and a dynamic programming algorithm are used in this high-level agent. The search space of the algorithm is the complete combination of all possible word and tag sequences, and its complexity can be proved, both theoretically and experimentally, to remain polynomial.</Paragraph> </Section> <Section position="4" start_page="501" end_page="501" type="sub_section"> <SectionTitle> 3.4. Global Statistics & Local Statistics </SectionTitle> <Paragraph position="0"> Global statistics refer to statistical data derived from very large corpora, such as mutual information and t-test scores in CSeg&Tag1.0, whereas local statistics are derived from the article in which the input sentence stands -- like a cache. Both take characters as the basic unit of computation, because any Chinese word is a combination of characters in one way or another. Our experiments reveal that they (especially the latter) are quite important in the resolution of ambiguities and unknown words. Refer back to the two CN candidates in (8a) and (8b) as an example. Both candidates are reasonable given the isolated sentence alone, but with the cache, which is in fact a collection of entities left ambiguous so far in the current input article, the algorithm has more evidence on which to base a decision. We will discuss this in depth in another paper.</Paragraph> <Paragraph position="1"> 3.5. Rule Base. It contains knowledge in rule form, including almost all word formation rules in Chinese, a number of simple but very reliable syntactic rules, and some heuristic rules.</Paragraph> </Section> </Section> <Section position="3" start_page="501" end_page="501" type="metho"> <SectionTitle> 4. Experimental Results </SectionTitle> <Paragraph position="0"> CSeg&Tag1.0 is implemented in the Windows environment in the Visual C++ 1.0 programming language.
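A minimal sketch of the bigram-based dynamic programming search of section 3.3 follows. The lattice, tags and scores are invented for illustration; the real agent derives arc scores from the lexicon and the tagged corpus:

```python
def viterbi_lattice(n, edges, tag_bigram, start="BOS"):
    """Best path over a word lattice. `edges[i]` lists arcs
    (j, word, tag, log_score) covering character positions i..j-1;
    `tag_bigram` maps (prev_tag, tag) to a log transition score.
    Because states are (position, last_tag) pairs, the search stays
    polynomial even though the number of whole paths is exponential."""
    best = {(0, start): (0.0, [])}
    for i in range(n):
        for (pos, prev), (score, path) in list(best.items()):
            if pos != i:
                continue
            for j, word, tag, emit in edges.get(i, []):
                s = score + tag_bigram.get((prev, tag), -10.0) + emit
                if (j, tag) not in best or s > best[(j, tag)][0]:
                    best[(j, tag)] = (s, path + [(word, tag)])
    finals = [v for (p, t), v in best.items() if p == n]
    return max(finals, key=lambda v: v[0])[1] if finals else []

# Toy lattice over 2 characters: "a|b" vs the single word "ab".
edges = {0: [(1, "a", "N", 0.0), (2, "ab", "V", 0.0)],
         1: [(2, "b", "V", 0.0)]}
bigram = {("BOS", "N"): -1.0, ("N", "V"): -1.0, ("BOS", "V"): -3.0}
print(viterbi_lattice(2, edges, bigram))
```

The search returns segmentation and tagging jointly, one path over the combined candidate space, which is the (a)||(b) strategy of section 2.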
The dictionary supporting it contains 60,133 word entries along with word frequencies, parts of speech, and various other types of information necessary for segmentation and tagging. The manually tagged corpus for training the bigram model contains about 0.4M words, and the raw corpus for deriving global statistics contains 20M characters.</Paragraph> <Paragraph position="1"> We define: [metric definitions not recoverable from the text extraction]. The preliminary open tests show that for CSeg&Tag1.0, word segmentation precision ranges from 98.0% to 99.3%, POS tagging precision from 91.0% to 97.1%, and the recall and precision for unknown words from 95.0% to 99.0% and from 87.6% to 95.3% respectively. The speed is about 100 characters per second on a Pentium 133. A running sample of CSeg&Tag1.0 follows (tokens underlined in the output are unknown words successfully identified, while those in bold are wrongly tagged words): [sample output omitted: the segmented and tagged characters are not recoverable from the text extraction] ...</Paragraph> <Paragraph position="2"> It should be pointed out that CSeg&Tag1.0 is just the result of the first round of our investigation. To reach our goal, i.e., developing a system with approximately 99% segmentation precision and 95% tagging precision for any running Chinese text, quite a lot of work remains to be done. What we can say now is that we believe it is possible to reach this destination in the not too distant future, and that we know more than before about how to approach it.
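The segmentation precision and recall figures above can be computed by comparing character-offset spans of gold and system words. This sketch uses the usual span-matching definition, an assumption here since the paper's own metric definitions did not survive extraction:

```python
def spans(words):
    """Map a word sequence to character-offset spans (start, end)."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def precision_recall(gold, system):
    """Precision: correct spans over system spans; recall: over gold."""
    g, s = spans(gold), spans(system)
    correct = len(g.intersection(s))
    return correct / len(s), correct / len(g)

# Toy example: gold "ab|c" vs system "a|b|c"; only "c" matches.
p, r = precision_recall(["ab", "c"], ["a", "b", "c"])
print(round(p, 3), round(r, 3))
```

Tagging precision is computed analogously, requiring the tag as well as the span to match.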
The second round of work is currently under way, with emphasis on two aspects: (1) carefully improving the algorithms, particularly those associated with the agents and the cache; and (2) improving the quality of the knowledge base, both by enlarging the relevant resources (textual corpora, unknown word banks, etc.) and by refining the lexicon, the tagged corpus and the rule base.</Paragraph> </Section> <Section position="4" start_page="501" end_page="501" type="metho"> <SectionTitle> Acknowledgment </SectionTitle> <Paragraph position="0"> This research is supported by the National Natural Science Foundation of China and by the Youth Science Foundation of Tsinghua University, Beijing, P.R.China.</Paragraph> </Section> </Paper>