CHINESE SEGMENTATION DISAMBIGUATION 
Wanying Jin
Computing Research Laboratory 
New Mexico State University 
wanying@crl.nmsu.edu
Abstract 
A technique of reasoning under uncertainty is
studied in an attempt to solve disambiguation
problems of Chinese segmentation. A
knowledge-based inexact reasoning theory
incorporating knowledge in morphology, syntax,
semantics and pragmatics is presented.
1 Introduction 
Processing Chinese texts is specifically difficult
in its computation because normally
sentences in Chinese texts are represented as
strings of Chinese characters without spaces
to indicate word boundaries. This causes a
problem for Chinese machine translation,
statistical analysis of Chinese corpora, Chinese
information retrieval, etc., as usually these
projects are based on the assumption that all
lexicon distinctions have been recognized in
advance.
Several approaches aimed to transfer a Chinese
character string into a word string have
been studied in recent decades. Two competing
approaches commonly used for Chinese
text segmentation are the statistical approach
(Chang, et al, 1991; Sproat and Shih,
1991; Chiang, et al, 1992) and the heuristic
approach (Chen and Liu, 1992; He, et al,
1991; Jin and Nie, 1993; Jin, 1992; Liang
and Zhen, 1991; Wang, et al, 1991).
Although a high degree of precision has been
reported for both methods, each has its
limitations, particularly in identifying unknown
words and disambiguating multiple segmentations.
Recently, a hybrid approach incorporating
heuristics with statistics has been
studied in an attempt to solve unknown word
recognition problems (Chen and Liu, 1992;
Nie and Jin, 1994). However, ambiguous
segmentation is still a difficult problem.
In this paper a method of reasoning under
uncertainty intending to disambiguate Chinese
segmentation is presented. A model of
evidential strength in inexact reasoning has
been studied by (Buchanan and Shortliffe,
1984). In the process of Chinese segmentation
knowledge in morphology, syntax, semantics
and pragmatics is used as evidence to support
the disambiguation hypotheses. The similarity
of uncertain knowledge and inexact reasoning
between medical diagnosis and natural
language interpretation makes it possible
to apply the MYCIN technique to Chinese text
segmentation.
2 Difficulties in Chinese 
segmentation 
As claimed in (Liu, 1987), the main causes
of segmentation ambiguity are vagueness in
word definition and the phenomenon of word
chains. The vagueness of the word definitions
causes segmentation ambiguities, as in the
string [Chinese text]. It can stand either for
[Chinese text] (modern factory) or for
[Chinese text] (modern chemical factory).
A word chain is a sequence of Chinese
characters from which several words can be
produced with or without overlap. Two types
of word chains have been recognized in Chinese
literature, i.e. multi-sense combinations and
intersection combinations (Huang and Liu, 1988).
The string
冰箱 is an example of multi-sense combination;
冰 (ice), 箱 (box) and 冰箱 (refrigerator) are all
words. The character string [Chinese text] is an
example of intersection combination; [Chinese text]
(paddle) is a word and [Chinese text]
(sell-at-sale-price) is also a word, whereas
[Chinese text] is the intersection character.
The example of the string [Chinese text]
illustrates the typical segmentation ambiguity
caused by word chains. The segmentation
of this string can be either
[Chinese text]
(The ping-pong balls were sold out at sale
price.) or
[Chinese text]
(The paddles for table tennis were sold out.)
Some ambiguities can be solved by word
structure knowledge. Others can be
disambiguated by syntactic and/or semantic
knowledge. The most difficult disambiguation is
that requiring contextual or pragmatic knowledge
to arrive at an appropriate interpretation,
as in the string [Chinese text], which can be
segmented into:
[Chinese text]
(students will write a paper.) or
[Chinese text]
(student-association writes a paper.)
Both are syntactically and semantically correct.
In this case, contextual information
would allow the reader to trace the information
claimed in the previous statements to
solve ambiguity problems.
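
The two kinds of word chain described in this section can be sketched with a small enumeration of every lexicon-consistent split of a string. This is only an illustration, not the segmentation algorithm of the paper: the function name `segmentations` and the toy lexicons are hypothetical, and Latin letters stand in for Chinese characters.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicons; each letter stands in for one character.
# Multi-sense combination: "a", "b" and "ab" are all words.
multi_sense = segmentations("ab", {"a", "b", "ab"})
# Intersection combination: "ab" and "bc" overlap on the character "b".
intersection = segmentations("abc", {"ab", "bc", "a", "c"})

print(multi_sense)
print(intersection)
```

Each string yields two competing segmentations, and each one is a hypothesis that the reasoning model of the next section has to rank.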
3 Reasoning theory for 
Chinese segmentation 
disambiguation 
A model of evidential strength in inexact 
reasoning studied by (Buchanan and Short- 
liffe, 1984) has been successfully implemented 
in the MYCIN system. The theory is that,
if a hypothesis can be derived from various 
types of mutually exclusive evidence, then the 
strength of truth of the hypothesis can be in- 
creased to reach a plausible conclusion. 
Two concepts MB[h,e] and MD[h,e] have
been introduced as the measures of belief and
disbelief. MB[h,e] means the measure of
increased belief in the hypothesis h, based on
the evidence e. MD[h,e] means the measure of
increased disbelief in the hypothesis h, based
on the evidence e. To facilitate comparison
of the evidential strength of competing
hypotheses, certainty factor CF is introduced to
combine degrees of belief and disbelief as
follows:

$$ CF[h, e] = MB[h, e] - MD[h, e] $$

In the case that a hypothesis is derived from
a number of mutually exclusive observations,
the combining functions are defined as:

$$ MB[h, e_1 \& e_2] = \begin{cases} 0 & \text{if } MD[h, e_1 \& e_2] = 1 \\ MB[h, e_1] + MB[h, e_2] \cdot (1 - MB[h, e_1]) & \text{otherwise} \end{cases} $$

$$ MD[h, e_1 \& e_2] = \begin{cases} 0 & \text{if } MB[h, e_1 \& e_2] = 1 \\ MD[h, e_1] + MD[h, e_2] \cdot (1 - MD[h, e_1]) & \text{otherwise} \end{cases} $$

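
The combining functions can be read as a short incremental update. The sketch below is one possible reading, not the MYCIN implementation: the names `combine` and `combine_evidence` are hypothetical, and the treatment of the case where both measures reach 1 (disbelief is checked first) is an assumption, since the definitions do not cover it.

```python
def combine(prev, new):
    """Fold a new measure (MB or MD, both in [0, 1]) into a previous one."""
    return prev + new * (1.0 - prev)


def combine_evidence(mb1, md1, mb2, md2):
    """Return (MB, MD, CF) of a hypothesis given two pieces of evidence."""
    mb = combine(mb1, mb2)
    md = combine(md1, md2)
    # Certain disbelief cancels belief, and vice versa.
    # Assumption: when both reach 1, disbelief is checked first.
    if md == 1.0:
        mb = 0.0
    elif mb == 1.0:
        md = 0.0
    return mb, md, mb - md


# Two positive observations with measures 0.05 and 0.2, no disbelief.
mb, md, cf = combine_evidence(0.05, 0.0, 0.2, 0.0)
print(round(mb, 2), round(md, 2), round(cf, 2))
```

Because each new measure is scaled by the remaining headroom (1 - previous), the combined value never exceeds 1 however many observations are added.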
In the case that two hypotheses are estab- 
lished with positive evidence from syntactic 
and semantic knowledge with the same de- 
gree, no discrimination of the strength of 
truth of the hypotheses can be drawn. If world
knowledge provides positive evidence for the
first hypothesis and negative evidence for the
second, then the strength of the first hypothesis
is stronger than that of the second. Therefore,
the first hypothesis would be the most
likely correct segmentation.
A weighted certainty factor is proposed
here to represent the importance of various
linguistic aspects. The weight is a vector of
four elements representing the importance of
morphology, syntax, semantics and pragmatics,
respectively, which total 1, i.e.

$$ CF_i[h, e] = W_i \cdot CF[h, e] $$

where W_i is the weight of the certainty factor
CF_i in hypothesis h supported by the evidence
e with respect to one of the linguistic
aspects. Suppose the weight vector (0.1, 0.2,
0.3, 0.4) is assigned for morphology, syntax,
semantics and pragmatics, respectively; then
the following example illustrates the function
of the weighted certainty factor CF_i[h, e].
In the sentence [Chinese text]
(the third leader in our company does not
have much power) the word chain [Chinese text]
produces two segmentations:
[Chinese text]
(the third leader in our company does not
have much power) or:
[Chinese text]
(the third piece-of hand in our company does
not have much power)
To estimate the strength of truth of the first
hypothesis, suppose:
• the word structure rule gives the evidential
strength (0.5) for the hypothesis because
the word chain [Chinese text] can be either
[Chinese text] (piece-of hand) or
[Chinese text] (leader). Therefore,
$$ CF_1[h, e_1] = W_1 \cdot CF[h, e_1] = 0.05 $$ and
$$ CF[h, e_1] = MB[h, e_1] - MD[h, e_1] = 0.05 $$
• the syntactic rule gives the evidential
strength (1) because it definitely is a
grammatical sentence. Therefore,
$$ CF_2[h, e_2] = W_2 \cdot CF[h, e_2] = 0.2 $$ and
$$ CF[h, e_1 \& e_2] = MB[h, e_1 \& e_2] - MD[h, e_1 \& e_2] = 0.24 $$
• the semantic rule gives the evidential
strength (1) since [Chinese text] (the leader) can
have power. Therefore,
$$ CF_3[h, e_3] = W_3 \cdot CF[h, e_3] = 0.3 $$ and
$$ CF[h, e_1 \& e_2 \& e_3] = MB[h, e_1 \& e_2 \& e_3] - MD[h, e_1 \& e_2 \& e_3] = 0.46 $$
• the world knowledge rule gives the evidential
strength (0.8) because it is quite
true that the leader has less power than
that of the first or second leader. Therefore,
$$ CF_4[h, e_4] = W_4 \cdot CF[h, e_4] = 0.32 $$ and
$$ CF[h, e_1 \& e_2 \& e_3 \& e_4] = MB[h, e_1 \& e_2 \& e_3 \& e_4] - MD[h, e_1 \& e_2 \& e_3 \& e_4] = 0.63 $$
The certainty factor CF of the hypothesis
[Chinese text] is 0.63. Therefore,
this segmentation is likely to be a coherent
string.
To estimate the evidential strength of the second
hypothesis, suppose:
• the word structure rule gives the evidential
strength (0.5) for this hypothesis
since [Chinese text] can be either [Chinese text]
(piece-of hand) or [Chinese text] (leader).
Therefore,
$$ CF_1[h, e_1] = W_1 \cdot CF[h, e_1] = 0.05 $$ and
$$ CF[h, e_1] = MB[h, e_1] - MD[h, e_1] = 0.05 $$
• the syntactic rule gives the evidential
strength (1) because it is a grammatical
sentence. Therefore,
$$ CF_2[h, e_2] = W_2 \cdot CF[h, e_2] = 0.2 $$ and
$$ CF[h, e_1 \& e_2] = MB[h, e_1 \& e_2] - MD[h, e_1 \& e_2] = 0.24 $$
• the semantic rule gives the negative
evidential strength (-1) because the phrase
"the hand of a company" violates the
semantic constraint. Therefore,
$$ CF_3[h, e_3] = W_3 \cdot CF[h, e_3] = -0.3 $$ and
$$ CF[h, e_1 \& e_2 \& e_3] = MB[h, e_1 \& e_2 \& e_3] - MD[h, e_1 \& e_2 \& e_3] = -0.06 $$
• the world knowledge rule gives a negative
evidential strength (-1) because a company
does not have a hand as one of its
components. Therefore,
$$ CF_4[h, e_4] = -0.4 $$ and
$$ CF[h, e_1 \& e_2 \& e_3 \& e_4] = -0.34 $$
The certainty factor CF of the hypothesis
[Chinese text] is -0.34.
Therefore, this segmentation is unlikely to be
a coherent string.
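
The arithmetic of the two hypotheses can be retraced with a few lines of code. This is a sketch under stated assumptions, not the author's implementation: the function name `certainty_factor` is hypothetical, and it assumes that each weighted strength W_i * s_i is folded into MB when it is positive and into MD when it is negative, using the combining functions of this section.

```python
WEIGHTS = (0.1, 0.2, 0.3, 0.4)  # morphology, syntax, semantics, pragmatics


def certainty_factor(strengths, weights=WEIGHTS):
    """Combine weighted evidential strengths into one certainty factor."""
    mb = 0.0  # accumulated measure of belief
    md = 0.0  # accumulated measure of disbelief
    for weight, strength in zip(weights, strengths):
        weighted = weight * strength
        if weighted >= 0:
            # Assumption: positive evidence only raises belief.
            mb = mb + weighted * (1.0 - mb)
        else:
            # Assumption: negative evidence only raises disbelief.
            md = md + (-weighted) * (1.0 - md)
    return mb - md


first = certainty_factor((0.5, 1.0, 1.0, 0.8))     # "third leader" reading
second = certainty_factor((0.5, 1.0, -1.0, -1.0))  # "piece-of hand" reading
print(round(first, 3), round(second, 3))
```

Without intermediate rounding the first hypothesis comes out at about 0.638 and the second at -0.34. The reported 0.63 for the first hypothesis is what results if the intermediate value 0.468 is carried forward as 0.46, which appears to be what the worked example does.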
4 Discussion 
The assignment for the weight vector is
empirical. It is based on the following analysis
in which 1's represent the truth of each
evidence/hypothesis and 0's represent the false.
Since the segmentation algorithm always pro- 
duces a segmented string, it is assumed that 
the evidence from morphology is true in vary- 
ing degrees depending on the complexity of 
the word chain. The justification of a hy- 
pothesis is based on the evidence presented 
by the pragmatic, semantic and syntactic as- 
pects shown in the following table. 
case   pragmatic   semantic   syntactic   hypothesis
(1)        0           0          0            0
(2)        0           0          1            0
(3)        0           1          0            0
(4)        0           1          1            0
(5)        1           0          0            1
(6)        1           0          1            1
(7)        1           1          0            1
(8)        1           1          1            1
• Case(1) indicates that if no evidence can
prove the truth of the hypothesis, then
the hypothesis is false.
• Case(2) indicates that if the evidence
supports an incoherent grammatical
sentence inconsistent with the
context/circumstance, then the hypothesis
is false, as in the case of [Chinese text]
(a banana ate a monkey).
• Case(3) indicates that if the evidence
supports a meaningful but ungrammatical
string inconsistent with the
context/circumstance, then the hypothesis
is false, i.e. [Chinese text] (he wretch) against
the real fact that he is a nice guy.
• Case(4) indicates that even if the evidence
supports a grammatical meaningful
sentence, if it is inconsistent with the
context/circumstance, then the hypothesis
is false, i.e., [Chinese text] (the
president's forced resignation makes people
angry) violates the circumstance that
people hate the president.
• Case(5) indicates the case of an idiomatic
expression where the string is literally
ungrammatical and incoherent, but as a
whole it can be interpreted figuratively
to make perfect sense. Therefore, we
assume that the hypothesis is true, as in
the case of 车水马龙, which literally means
"car-water-horse-dragon", but figuratively
means "very crowded".
• Case(6) indicates the case of a metaphor
or metonymy which superficially is
an incoherent grammatical string, but
by reasoning with the support of world
knowledge can be interpreted as a
meaningful string. Then, it is assumed
that the hypothesis is true, i.e.,
[Chinese text] (I drink North-West wind) means
"I have nothing to eat".
• Case(7) indicates that if the evidence
supports a meaningful but ungrammatical
string consistent with the
context/circumstance, then the hypothesis
is true, as [Chinese text] (he wretch) is
consistent with the real fact that he is a bad
guy.
• Case(8) indicates that if all evidence
gives positive support to the hypothesis,
then the hypothesis is true.
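
The eight cases above can be enumerated mechanically. The sketch below is a reading of the table rather than part of the method: the function name `hypothesis` is hypothetical, and it encodes the observation that in every row the outcome equals the pragmatic value.

```python
from itertools import product


def hypothesis(pragmatic, semantic, syntactic):
    """Assumed reading of the table: the outcome follows pragmatic evidence."""
    return pragmatic


# Rows are generated in the order of cases (1) to (8):
# (pragmatic, semantic, syntactic, hypothesis)
rows = [
    (p, sem, syn, hypothesis(p, sem, syn))
    for p, sem, syn in product((0, 1), repeat=3)
]

for case, row in enumerate(rows, start=1):
    print(f"({case})", *row)
```

In this reading semantic and syntactic evidence never overturn the pragmatic verdict, which matches the choice of the largest weight for pragmatics.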
From the analysis, it seems that
pragmatic knowledge provides the strongest
evidence for the hypothesis. Therefore,
the highest weight is assigned to the pragmatic
aspect of the certainty factor. In
the absence of pragmatic information a default
assumption, that semantic evidence is
more important than syntactic evidence, is
made. This can be observed in daily life:
people communicate through many ungrammatical
expressions without having a problem
of transferring the message, such as a
brief email message [1]: DRAFT-comments
hard copy best-asap to yw pls. It means "To
write the comment for the DRAFT on the
hard copy would be the best. Please return it
to Yorick Wilks as soon as possible."

[1] A brief e-mail message from Dr. Yorick Wilks to
the researchers in Computing Research Laboratory at
New Mexico State University.

The certainty factor CF is used under the
premise that all of the evidence is rendered by
mutually exclusive observations. Since language
is an expression integrating syntactic,
semantic and pragmatic information, is the
syntactic, semantic and pragmatic evidence
mutually exclusive? This is not so clear. All
knowledge is culturally dependent, i.e. one
particular instance may be acceptable in one
culture but not in another. In this research a
default assumption is made that the observations
from various language aspects are independent.
The question is left open for further
discussion.
5 References 
Buchanan, B. and E. Shortliffe. (1984).
    Uncertainty and Evidential Support. In
    B. G. Buchanan and E. H. Shortliffe
    Ed., Rule-Based Expert Systems:
    The MYCIN Experiments of the Stanford
    Heuristic Programming Project,
    Addison-Wesley Publishing Company,
    pp. 209-232.
Chang, J. S., et al. (1991). Chinese
    word segmentation through constraint
    satisfaction and statistical optimization,
    Proc. of the 4th R.O.C. Computational
    Linguistics Conference, pp. 147-165.
Chen, K. J. and S. H. Liu. (1992). Word
    Identification for Mandarin Chinese
    Sentences. Proc. of the 5th International
    Conference on Computational Linguistics,
    Vol. 1, pp. 101-107.
Chiang, T. H., et al. (1992). Statistical
    models for segmentation and unknown
    word resolution. Proc. of the 5th
    R.O.C. Computational Linguistics
    Conference, pp. 123-146.
He, K. K., et al. (1991). The Design
    Principle for a Written Chinese Automatic
    Segmentation Expert System. Journal
    of Chinese Information Processing, Vol. 5,
    No. 2, pp. 1-14.
Huang, X. X. and D. Y. Liu. (1988). The
    Phenomenon of Word Chain and the
    Automatic Segmentation in Written
    Chinese. Journal of the Development of
    Knowledge Engineering, pp. 287-291.
Jin, W. and J. Y. Nie. (1993). Segmentation
    du Chinois -- une Etape Cruciale vers
    la Traduction Automatique du Chinois.
    In P. Bouillon and A. Clas Ed., La
    Traductique, Les presses de l'Universite de
    Montreal, pp. 349-363.
Jin, W. (1992). A Case Study: Chinese
    Segmentation and its Disambiguation.
    MCCS-92-227, Computing Research
    Laboratory, New Mexico State
    University.
Liang, N. Y. and Y. B. Zhen. (1991). A
    Chinese Word Segmentation Model and a
    Chinese Word Segmentation System
    PC-CWSS. Proc. of COLIPS, Vol. 1, No. 1,
    pp. 51-55.
Liu, Y. Q. (1987). Difficulties in Chinese
    Language Processing and Method
    to their Solution. Proc. of 1987
    International Conference on Chinese
    Information Processing, Vol. 2, pp. 125-126.
Nie, J. Y. and W. Jin. (1994). A Hybrid
    Approach to Unknown Word Detection
    and Segmentation of Chinese. To appear
    in Proc. of International Conference on
    Chinese Computing'94 (ICCC94).
Sproat, R. and C. Shih. (1991). A statistical
    method for finding word boundaries
    in Chinese text, Computer Processing of
    Chinese and Oriental Languages, Vol. 4,
    No. 4, pp. 336-351.
Wang, L. J., et al. (1991). A Parsing
    Method for Identifying Words in Mandarin
    Chinese Sentences. Proc. of the
    12th International Joint Conference on
    Artificial Intelligence, Vol. 2, pp. 1018-
    1023.
