WORD IDENTIFICATION FOR MANDARIN CIlINESE SENTENCES 
Abstract 
Keh-Jiann Chen Shing-lluan Liu 
Institute of lnfl~rmation Science 
Academia Sinica 
Chinese sentences are composed with string of 
characters without blanks to mark words. However the 
basic unit for sentence parsing and understanding is 
word. Therefore the first step of processing Chinese 
sentences is to identify the words. The difficulties of 
identifying words include (l) the identification of com- 
plex words, such as Determinative-Measure, redupli- 
cations, derived words etc., (2) the identification of 
proper names,(3) resolving the ambiguous segmenta- 
tions. In this paper, we propose the possible solutions 
for the above difficulties. We adopt a matching algo- 
rithm with 6 different heuristic rules to resolve the am- 
biguities and achieve an 99.77% of the success rate. 
The statistical data supports that the maximal match- 
ing algorithm is the most effective heuristics. 
1. Introduction 
Chinese sentences arc cx)mposed with string of 
characters without blanks to mark words. However the 
basic unit for sentence parsing and understanding is 
word. Therefore the first step of processing Chinese 
sentences is to identify the words( i.e. segment the 
character strings of the sentences into word strings). 
Most of the current Chinese natural language 
processing systems include a processor for word iden- 
tification. Also there are many word segmentation 
techniques been developed. Usually they use a lexicon 
with a large set of entries to match input sentences 
\[2,10,12,13,14,21\]. It is very often that there are many 
l~)ssible different successful matchings. Therefore the 
major focus for word identification were on thc resolu- 
tion of ambiguities. However many other important as- 
pects, such as what should be done, in what depth and 
what are considered to be the correct identifications 
were totally ignored. High identification rates are 
claimed to be achieved, but none of them were mea- 
sured under equal bases. There is no agreement in 
what extend words are considered to be correctly iden- 
tified. For instance, compounds occur very often in Chi- 
nese text, but none of the existing systems except ours 
pay much attention to identify them. Proper name is 
another type of words which cannot be listed exhaus- 
tively in the lexicon. Therefore simple matching algo- 
rithms can not successfully identify either compounds 
or proper names. In this paper, we like to raise the 
ptx~blems and the difficulties in identifying words and 
suggest the possible solutions. 
2. Difficulties in the Identification of Words 
As we mentioned in the prevkms chapter, theba- 
sic technique to identify the words is by matching algo- 
rithms. It requires a well prepared lexicon with suffi- 
cient amount of lexical entries which covers all of the 
Chinese words. Iqowcver such a large lexicon is never 
existing nor will be composed, since the set of words is 
open ended. Not only because the new words will be 
generated but because there are unlimited number of 
compounds. Most of the word identification systems 
deliberately ignore the problem of compounds and 
leave the problem unsolved until the stage of parsing. 
We don't agree their view points and believe that dif- 
ferent type of Compounds should be handled by the 
different meth~Ms at different stages. Some types of 
the compounds had better to be handled before pars- 
ing, for they require different grammatical representa- 
tions and idcntificalion strategies compared with the 
parsing of phrase structures. On the camtrary, if the 
lnorphological rules \[or compounds have the ~me re- 
presentation as the phrase structure rules, it is better to 
be identified at parsing stage. We will di~uss this issue 
in more details in the later sections. 
The other problem is that ambiguous segmenta- 
tions frequently tv.:cur during thc processing of word 
matching. It is because that very often a multisyllabic 
word contains monosyllabic words as its components. 
We have to try the different strategies to re~flve such 
ambiguities. 
Many problems need to be ~lved, but first of all a 
lexicon shotdd be composed for the matching algo- 
rithm. 
2.1 What are the Set of Words 
According to Liang's \[14,15\] definition, word is a 
smallest,meaningful, and freely used unit. It is the basic 
processing unit fur Chinese natural language process- 
ing. Since there is no morphological features as word 
segmentation marks, we have to adopt such a wtgue 
definition of the Chinese word. Liang \[151 also pro- 
pose a word segmentation standard. However some of 
his view points are debatable and self contradictory. In 
fact it is almost impossible to define a standard for COr- 
rect identification. Thcrefl)re instead of proposing a 
AcrEs DE COLING-92. NAm'as, 23-28 Aot)r 1992 1 0 1 PRoe. ov COL1NG-92, NANTES, AUG. 23-28, 1992 
standard, we propose a criterion which should be fol- 
lowed by a good word segmentation algorithm. It is 
that a good segmentation algorithm should be able to 
produce a result which is suitable and sufficient for 
the propose of later processing, such as parsing and un- 
derstanding. 
The set of words is open ended. ~lherefore the 
existing lexicons contain lexical entries which vary from 
40 to 120 thousands. A large lexicon usually includes 
many compounds as well as many proper names, for it 
is hard to distinguish a word and a compound. Fbr 
systems with a small lexicou, they might not perform 
worse than systems with a large lexicon, if they incor- 
porate algorithms to identify compounds and proper 
names. However on the other hand there is no harm to 
have a large lexicon, once there is a way of handling the 
ambiguities, since statistically a large lexicon has the 
better chance to match words as well as producing 
ambiguous segmentations. Therefore we have the 
follow principle to collect the word set for the purpose 
of word segmentation. 
Principle for composing a lexicon: 
qqm lexicon should contam as many as possible 
words. If there is a doubt whether a string of charac- 
ters is a word or a compound, you can just collect it as 
an entry. 
Currently we have a lexicon of around 90 thou- 
sands entries, and keep updating for new words. A lexi- 
con with such a s~e of course would still leave out many 
compounds and proper names. We use this lexicon to 
match the Chinese text, the result of the algorithm is a 
sequence of words defined in the lexicon. (1) is an in- 
stance of result. 
(1) 
a. jieshuoyuan Jau Shian-Ting yindau tamen 
interpreter Jau Shian-Ting guide them 
'q'he interpreter Shian-Ting Jau guided 
them." 
b. jieshuo-yuan-Jau-Shian-Ting-yindau-ta- 
men 
However we can see that ~me of the compounds 
and proper names are not identified as shown in (1lb. 
They were segmented into words or characters. There- 
fore at later stage those pieces of segments should be 
regrouped into compounds and proper names. We will 
discuss tile issue at next two sections. 
2.2 Compmmds 
1here are many different type of compounds in 
Chinese and should be handled differently \[3, 6, 7, 11, 
17, 191. 
a. determinative-measure compounds (DM) 
A determinative-measure compound is composed 
of one or more determinatives, together with an op- 
tional mcasm'e. 
(2) je san ben 
this three CL 
"these three" 
It is used to determine the reference or the quan- 
tity of the noun phrase that co-~w.curs with it. De~ite 
the fact that I~)th categories of determinatives and 
measures are closed, the comhinations of them are not. 
However the set of DMs is a regular language which 
can be expressed by regular expressions and reCOg- 
nized by finite automata \[19\]. Mo \[191 also point out 
that the structure of I)Ms are exocentric. They are 
hardly similar to other phrase structures which are en- 
docentric and context-free and can bc analyzed by head 
driven parsing strategies. Therefore we suggest that 
tile identification of DMs should be done in parallel 
with the identification of common words. There are 76 
rules for DMs which covers ahnost all of the DMs \[19\]. 
Dor word identification, those rules function as a sup- 
plement for the lexicon, which works as ff the lexicon 
contains all of the DMs. We will show the test result in 
the section 3.3. 
b. Reduplications 
In Chinese many verhs can be reduplicated to 
denote all additional meaning of flying the actions 
gently and relaxedly(3)\[7\]. 
(3) tiau tiau wu 
jump jump dance 
"dance a little" 
This kind of molphological construction will m)t 
change the argument structure of the verbs, but do 
change their s3mtactic behavior. For instance, the re- 
duplications can not cooccur with the adjuncts of post- 
verbal location, aspect marker, duration, and quantifier 
\[171. In \[17\], they derived 12 different reduplication 
rules which cover the reduplication construction of 
verbs. Ill addition, there are 3 rules for the reduplica- 
tion of DMs and 5 rules for A-not-A questions forma- 
tion. 
The identification of the reduplication construc- 
tion should be done after the words have been identi- 
fied, since it is better to sec the words and then check 
whether part of the words has been reduplicated. It is a 
kind of context dependent process, so a separated pro- 
cess other than the process for DMs or Phrase struc- 
tures, should be incorporated. 
A(ZlT.S I)E COLING-92, NArCrl~s, 23-28 Ao(rr 1992 1 0 2 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
c. A-not-A construction 
A-not-A constructions are commonly used in 
Chinese to form a questiou \[13,71. Aswe mentioned be- 
fore, for A--not~A construction, there are 5 different 
rules for reduplicating part of tile verbs and coverbs 
\[17\]. 
(4) fang-bu- fangshin 
put not stop-worry 
"stop worrying or not" 
A-nof.A constructien is a kind of reduplication 
constrution. Therefore the technique for the idemifi- 
cation of reduplication is aim applicable for the identi- 
fication of the A-not--A construction. 
d. Derived Words 
A derived word is at compound which (x)mposed 
with a word of stem and a prefix or a suffix \[ 11 \]. I)eriva 
tive affixes are very productive in Mandarin Chinese. 
(5) difang-shing 
place quality 
"localityr" 
Those affixes usually are bound morphemes. In 
\[11\], we collect a sct of most frcqucntly occurred af- 
fixes and study their morphological behaviors. We 
found that there are syntactic and semantic restrictions 
bctween modifiers and heads. Such a nrorphok)gical 
patterns can bereprcsentedintermsoflnfonnation-. 
based Case Grammar\[5\], which is also the grammatical 
formalism adopted for representing Chinese phrase 
structures in our parsing system\[5,ll\]. Following is an 
example of representation. 
(6) shing 
Semantic:meaning: equivalent of "-NESS", 
"-lq~(", for cxprcssing abstract notation 
Syntactic: category: Nad 
feature: bound; I + N,--V\] 
constraints: MR: {Vhl, V\[ + transitivc \[, N} < < * 
Since the grammatical representation of derived 
words is the same as the representation of phrase struc- 
tures, wc suggest that the identification of the dcrivcd 
words are better to be done at tile parsing stage. Fur- 
thermore, the Ixoundaries of the derived words ave syJt- 
tactically ambigu(ms. They can not he identified with- 
out checking the contextual information. 
2.3 Proper Names 
Proper names (w.cur very frequently in all kinds of 
articles. ~\[lte identification of proper nanlcs become 
one of the most difficult problems in Chinese natural 
language processing, since we can not exhaustly list all 
t,f the ptToper name in the lcxiceu. Also there is no mor- 
phologieal nor punctuation makers to denote a proper 
name. Besides that a proper name may contain a suh- 
string of common words. It makes the identification of 
the proper names even harder. The only clue might be 
usetul in identifying proper names is the occurrences 
nf Ix)and nu)ll)heules. Usually each Chinese character 
is a meaningful unit. Some of them can I;e used freely 
as a word. Some are not; they have to be combined with 
other characters to form a word. Such characters can 
not be used freely as words, are named bound nmr-. 
phemes. If bound moqflmmes occur after word match- 
ing process, it means that there are derived words or 
proper names occurred iu the text and have not been 
identified. The semantic classification of morphemes 
can be utilized to identify the different type of proper 
names. For instance,in \[11, the setofsurnames were 
used its a clue to identify people's names and titles. 
There is no general solutions so far to handle the dif- 
ferent types of proper names. The only suggestion is 
that mark the proper names before identification pro- 
cuss or treat the unknown strings as proper names. 
2.4 Ambiguities 
For Chiuese character strings,they might have 
many different well fl~rmed segmentations, but ust,- 
ally there is tndy one grammatically aud semantically 
sound segmemation fur ~lch sentence (7). 
(7) yijing jeuglichu jiegno 
aheady arrangc~out result 
'q'he result has come out." 
yijiug-\[jengli-ehu\]-jiegno 
yijing-ljengqichu \] -jiegun 
Therefore many algorithms were proposed and 
heuristic or statistical pretcrence rules were adopted 
for remlving ambiguities. However none of those rules 
has been thoruughly tested and provided their success 
rates, ht the next section, we will state our algorithm as 
well as tile heuristic rulesand alsoprovides the experio 
merit results in section 3.2 to show the success rate of 
each individual tall:. 
3.Wnvd Identification Algnrithm 
According to the discussion of the chapter 2, the 
picture of the word identificatinn algoxithm should be 
clearly its Icit\[ows. (u) 
In fact trot all of the above processes were thor- 
oughly studies, but more ov less some of them were 
studied and have successful results \[2, 8, 12, 13, 14, 16, 
19, 20,211. Our word identification system adopt tl,e 
alnlve sequence of algorithms, lint we defer the second 
last prlycess of finding derived words until parsing stage 
and tile last In'ocess of finding proper names is tempo+ 
Acrt!s DE COLING-92, NANIES, 23-28 Aot'n' 1992 1 0 3 PROC. OV COLING 92, NANIES, A~,IG. 23.-28, 1992 
Words and DMs ql"~ | Lexicon and | 
matching 
I Find compounds of ~ 1. redupli~tion 
2, A-not-A 
rary ignored for not having a feasible identification al- 
gorithm. 
3.1 Matching algorithm and disambiguation 
rules 
The first two steps of word identification algo- 
rithm are the word matching and disambiguation. 
These two processes were performed in parallel. Once 
an ambiguous match occurs, the disambiguation pro- 
cess is invoked immediately. The algorithm reads the 
input sentences from left to right. Then match the in- 
put character string with lexemes as well as DMs rules, 
If an ambiguous segmentation do occur, then the 
matching algorithm looks ahead two more words, then 
apply the disambiguation rules for those three word 
chunks. For instance,ha (9), the first matched word 
could be 'wan' or 'wancheng'. Then the algorithm will 
look ahead to take all of the possible combinations of 
three word chunks, as shown in (10), into consideration. 
(9) wanchengjianding haugau 
complete authenticate report 
"complete the report about authenticating" 
(10) wan~:heng-jianding 
wancheng-jianding-bau 
wancheng-jianding-baugau 
The disambiguation algorithm will select the first 
word of the most plausible chunk as the solution. In 
this case, it is the word 'wancheng'. The algorithm then 
proceeds to process the next word until all the input 
text been processed. 
'/'he most powerful and commonly used disambi- 
guation rule is the heuristic of maximal matching \[12, 
13, 14, 21\]. There are a few variations of the sense of 
maximal matching, but after we have done the experi- 
ments with each of different variations, we adopt the 
following maximal matching rules. 
Heuristic rule 1: 
The most plausible segmentation is the three 
word sequence with the maximal length. 
This heuristic rules achieves as high as 99.69% 
accuracy and 93.21% of the ambiguities were resolved 
by this rule. We will see the detail statistics in the next 
section. However there are still about 6.79% of ambi- 
guities still can not be re~lved by the maximal match- 
ing rule. Therefore we adopt the next heuristic rule. 
Heuristic rule 2: 
Pick the three word chunk which has the smallest 
standard deviation in the word length. This is equiva- 
lent to find the chunk with the minimal value on ( 
L(W1)-Mean) **2 + (L(W2)-Mean)**2 + (14'W3)- 
Mean)**2 ,where Wl,W2,and W3 are three words in a 
chunk; Mean is the average length of Wl,W2.,and W3; 
L(W) denotes the length of the word W. 
Heuristic rule 2 simply says that the word length 
are usually evenly distributed. For instance in (11), the 
segmentation of (lla) has the value 0, but (1 lb) has val- 
ue 2. Therefore according to the heuristic rule number 
2, the (lla) will be the selected solution and it is the 
correct segmentation. 
(11) yianjiou shengminchiyuan 
research life origin 
"to investigate the origin of life" 
a. \[yianjiou-shengminl-chiyuan 
b. \[yianjiousheng-min\]-chiyuan 
However it may happen that there are more than 
two chunks with the same length and variance, we 
need a further resolution. 
Heuristic rule 3: 
Pick the chunk with fewer bound morphemes. 
Heuristic rule 4: 
Pick the chunk with fewer characters in DMs. 
That is to say the normal words get higher prior- 
ity than the bound morphemes and DMs. For instances 
examples (12,13) were resolved by the rule 3 and 4 re- 
spectively. (12a) and (13a) are right choices. 
(12) shietiau shang shoushiu jiau mafan 
negotiate up procedure more trouble- 
some 
AcrEs DE COLING-92, NANTES. 23-28 hOt'q" 1992 1 0 4 PROC. OF COLING-92, Nhr¢rEs, AUG. 23-28, 1992 
"In negotiation, the process is more compli- 
cated." 
a. shietiau-\[shang-shoushiu\]-jiau-mafan 
b. shietiau-\[shangshou-shiu\]-iiau-mafan 
(13) ta benren 
he serf 
"he himself" 
a. ta-benren 
b. taben-ren 
The heuristic rules 2,3,and 4 only resolve 1.71% 
of the ambiguities as shown in the table 2 of the next 
section. After we observe the remaining ambiguities we 
found that many ambiguities were occurred due to the 
occurrences of monosyllabic words. For instances, the 
character string in (14a) can be segmented as (14b) or 
(14c), but none of the above resolution rules Can re- 
solve this case. 
(14) ganran de chiuanshrrenshu shieshialai 
infect DE real number write 
down 
"write down the precise number of the in- 
fected" 
ganran-\[de-chiuanshr\]-renshu-shieshialai 
ganran-\[dechiuan-shr\]-renshu-shieshialai 
If we compare the correct segmentations with 
the incorrect segmentations, we find out that almost 
all of the monosyllabic word in the correct answer 
are function words, such as prepositions,con\]unctions, 
as well as a few high frequent adverbs. And that the 
monosyllabic words in the incorrect segmentations are 
lower frequency words. The set of such frequently oc- 
curred monosyllabic words are shown in appendix 1. 
We then have the following heuristic rule. 
Heuristic rule 5: 
Pickthechunk with the high frequentlyoccurred 
monosyllabic words. 
This rule contributes 3.46% of the success of the 
ambiguity resolution. The remaining unsolved ambi- 
guities are about 1.62% of the total input words. They 
usually should be resolved by applying real wurld 
knowledge or by checking grammatical validity. How- 
ever it is almost impossible to apply real world knowl- 
edge nor to check the grammatical validity at this stage, 
so applying Markov m(v..lel is a possible solution\[21\]. 
The other solution is much simpler ,i.e. to pick the 
chunk with the highest accumulated frequency of 
words\[221. It requires the frequency counts for each 
words only instead of word bigram or trigram which re- 
quired by the Markov model. 
Heuristic rule 6: 
Pick the chunk with the highest probability value. 
q~e prohability value of the sequence of words' W 1 W2 
W3' can be estimated by either 
a) Markov model with the bigram approximation 
P~P(W01Wl ) * P(WI\[W2) * P(W21W3) * 
P(W3) ; or 
b) Word probability accumulation 
P = PfWl) + t,(w2) + P(W3) 
Heuristic rule 6a might not be feasible, since it re- 
quires word bigram a matrix of size in the order of 
10"'10. But heuristic 6b) might not produce a satis- 
factory resolution. According to our experiment the 
success rate for 6b) is less than 70%. qlaerefore the 
other solution is to retain the ambiguities and resolve 
at the parsing stage, 
3.2 Experiment Results 
We designed a word identification system to test 
the matchingalgorithm and the above mentioned heu- 
ristic rules. The lexicon for our system has about 90 
thousands lexical entries plus mflimited amount of the 
DMs generated from 76 regular expressions. The 90 
thousands lexemes form a word tree data structure in 
order to speed up the word matching \[4,10\].For the 
same reasons, DM rules are compiled first to produce a 
Chomsky Normal Form like parsing table. The parsing 
table will then he interpreted during the word matching 
stage\[19\]. 'l~vo sets of test data are randomly selected 
from a Chinese corpus of 20 million characters. We 
summarize the testing result in "lhhle 1. qhble 2 shows 
the success rates and applied rates for each heuristic 
rule. 
qhe recall rate and recognition rate in the above 
table are defined as follows. Let 
NI = the number of words m the input text, 
N2 = the number of words segmented by the system 
for the input text. 
N3 = the number of words were correctly identified. 
Then the recall rate ss defined to bc N3/NI and the 
precision rate ks N3/N2. "lhc dclmlta)n of the other sta- 
tistical result are t~vlousJ) I'ollnwed the conven- 
tion.The above testing algorithm do not include the 
process of handling derived ~4 wds. "lhcrefore the above 
statistics do not ta)unt the mt~lakcs occurred due to the 
existence of derived word,,, or proper names. 
We can sec that the maximal matching algorithm 
is the most effective hcunst~t~, lbcrc are 10311 num- 
ber of ambiguities out of 17404 occurrences of the seg- 
ACRES DECOL\]NG-92, NANTES. 23-28 AO(;r 1992 l 0 5 PROC. OF COLING-92, NANTES, Aua. 23-28. 1992 
mentations. It counts 58.94% of the total segmenta- 
tions and 93.21% of ambiguities were resolved by this 
heuristics. 
4.Discussions and Concluding Remarks 
From the statistical results shown in table 2, it is 
clear that the ma.,dmal matching algorithm is the most 
useful heuristic method. Most of the mistakes caused 
by this heuristic are due to the occurrences of the words 
which are composed by two subwords. Those words are 
needed to have further investigations. If we want to 
further improve our system's performance, it seems 
that employing lexically dependent rules is unavoid- 
able. 
The errors caused by the heuristic rule 2are due to 
t he cases of a three character word followed by a mono- 
syllabic word and which can he divided into two hisyl- 
labic words, for instance (15). 
(15) tzai shia san jou 
at down three week 
"in the followlug three week" 
\[tzai-shiasanjou\] 
\[tzaishia-sanjou\] 
Such mistakes can be avoided by giving the sec- 
ond bisyllabic words a lexically dependent marker 
which denotes that a low priority is given to this word 
when the heuristic rule 2 is applied. 
Table 1. Testm results 
Sample 1 Sample 2 Total 
# of sentences 833 1968 2801 
# of characters 8455 20879 29334 
# of v~ords 5085 12409 17494 
+ , 
# of words identified 5076 12399 17475 by the system 
# of correct 5O64 12370 17434 identifications 
. , 
recall rate 99.58% 99.69% 99.66% 
precision rate 99.76% 99.77% 99.77% 
guLe2 
~e4 
H~mt~ 
Hcalrllt~ 
Rule6 
Table 2. The success rates of the heuristic rules 
Sample 1 S+mple 2 Total 
# of laentdicattom # of error~ succe~ts rate $ of mentilioaiom i of erron ~ucoms rate i r of klentificltlor # of errors suCCX~s rate 
2875 13 99.58% 6938 17 99,75% 9813 30 99.09% 
36 4 8~.89% 74 3 95.98% 110 7 93.64% 
0 0 100% 5 0 100% 8 0 100% 
18 O 100% 32 1 96.86% 50 1 98.00% 
104 2 98,08% .238 12 94.96% 342 14 95 90% 
48 15 6878% 109 36 66,97% 157 51 67.52% 
ACrF~ DE COL1NG-92, NANTES, 23-28 nOt3T 1992 1 0 6 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 
'FILe heuristic rules #3 and #4 are the most 
reliable disambiguation rules. However they only con- 
tribute 0.53% of the disambiguation processes. 
The heuristic rule #5 is ugeful, but the priority val- 
ues for each high frequent monosyllabic word has to 
be carefully rearranged in order to reduce possible mis- 
takes \[181. 
The heuristic title #6 needs to be further studied. 
It will be much more easier to use the bigram or trigram 
based on gt-ammatical categories instead of the word 
bigram or the simple accumulation of the word fre- 
quencies. It will be the future study. 
About the identification of the proper names, it 
requires a further investigation on the results of the 
proper names 'after segmentation algorithm is applied. 
Append~ 
~~~~~~T~b~~ 
AcEs DECOLING-92, NAbn'Es, 23-28 Ao~rr 1992 1 0 7 PP.oc. OF COLING-92, NAI~rES, Auo. 23-28, 1992 

References 

\[1\] J. S. Chang, "A Multiple-Corpus Approach to 
Identi cation nf Chinese Surname-Names." Proc. 
of Natural Language Processing Pacific Rim Sym- 
posium, Singapore, 1991 

\[2\] 3". S. Chang, J. I. Chang and S. D. Chert, "A Meth- 
¢yd of Constraint Satisfaction and Statistical Opti- 
mization for Chinese Word Segmentation," Proc. 
of the 1~1 R. O. C Computational Linguistics 
Conference, Taiwan, 1991 

\[3\] Y.R. Chad, A Grammar of Spoken Chinese, Uni~ 
versity of California Press, California, 1968 

\[4\] K.J. Chen, C. J. Chert and L. 3". Lee, "Analysis and 
Research in Chinese Sentences -- Segmentation 
and Construction," Technical Report, TR-864X)4, 
Nankang, Academia Sinica, 1986 

\[51 K. 3. Chen and C. R. Huang, "lutonnation-based 
Case Grammar," COLING - 90, Vol 2, p.54 - p.59 

\[61 K.J. Chen et al, "Compounds and I~arsing in Mau~ 
darin Chinese," Plot. of National Computer Sym- 
posium, 1987 

\[71 G.Y. Chen, " A-not-A Questions in Chinese," 
manuscript, CKIP group, Academia Sinica, 'Pulp- 
el, 1991 

\[8\] C.K. Fan and W. H. q.t;ai," Automatic Word Iden- 
tification in Chinese Sentences by the Relaxation 
"/~chniqne," Computer Processing of Chinese and 
Oriental Languages, Vol.4, No.l, November 1988 

\[9\] R. Garside, G. Leech and (;. Sampson, " The 
Computational Analysis of English -- a Corpus- 
based Approach," lamgman Group UK Limited, 
1987 

\[10\] W. H. Ho, " Automatic Recognition of Chinese 
Words," Master Thesis, National Taiwan Institute 
of qi:chnology, "Paipei, "lhiwan, 1983 

\[I1\] W. M. Hong, C. R. Huaug, T. Z. qhng and K. J. 
Chen," The Morphological Rules of Chinese De- 
rivative Words," rib be presented at the 1991 Inter- 
national Conference on "lEaching Chinese as a 
Second Language, December, 1991, 'Ihipei 

\[12\] C. Y. Jie, Y. Lin and N. Y. Liang, "On Methods of 
Chinese Automatic Segmentation," Journal of 
Chinese hfformation Processing, VoL3, No.l, 
1989 

\[13\] B. I. Li, S. Lien, C, F. Sun and M. S. Sun, "A Maxb 
real Matching Automatic Chinese Word Segmen- 
tation Algorithm Using Corpus "lhgging for Ambb 
guity Resolution," Proc. of the 1991 R. O. C Com- 
putational Linguistics Conference, Taiwan, 1991 

\[14\] N. Y. Liang, "Automatic Chinese Text Word Seg- 
mentation System -- CI)WS", Journal of Chinese 
Information Processing, VoL1, No.2, 1987 

\[151 N. Y. Liang, "Contemporary Chinese Language 
Word Segmentation Standard Used for Informa- 
tion Processing," 1989, a draft proposal 

\[16\] N. Y. Liang, "Fhe Knowledge of Chinese Words 
Segmentation," Journal of Chinese Information 
Processing, Vol.4, No.2., 1990 

\[ 17\] M. L. Lin, "The Grammatical and Semantic Prop- 
erties of Reduplications," manuscript, CKIP 
group, Academia Sinica, 1991 

\[ 18\] I. M. Liu, C. Z. Chang and S. C. Wang, "Frequency 
Count of Frequently Used Chinese Words," 'Pulp- 
el, qMwan, Lucky Bcmk Co., 1975 

\[19\] R. E Mo, Y. J. Yang, K. J. Chen and C. R. Huang, 
"Determinative-Measure Compounds in Manda- 
rin Chinese : Their Formation Rules and Parser 
Implementation," Proc. of the 1991 R.O.C Com- 
putational Linguistics Conference, qhiwan, 199l 

\[20\] R. Sproat and C. Shih, "A Statistical Method for 
Finding Word Boundaries in Chinese 'lext," Com- 
puter Processing of Chinese and Oriental Lan- 
guages, Vol.4, No.4, March 1990 

\[21\] C. L. Yeh and H. J. Lee, "Rule-based Word Iden- 
tification fi3r Mandarin Chinese Sentences -- A 
Unification Approach," Computer Processing of 
Chinese and Oriental Languages, Vol.5, No.2, 
March 1991 
