Chinese Word Segmentation 
based on Maximum Matching and Word Binding Force 
Pak-kwong Wong and Chorkin Chan 
Department of Computer Science
The University of Hong Kong
Pokfulam Road
Hong Kong
pkwong@cs.hku.hk and cchan@cs.hku.hk
Abstract 
A Chinese word segmentation algorithm
based on forward maximum matching
and word binding force is proposed in
this paper. This algorithm plays a key
role in post-processing the output of a
character or speech recognizer in
determining the proper word sequence
corresponding to an input line of character
images or a speech waveform. To support
this algorithm, a text corpus of over
63 million characters is employed to
enrich an 80,000-word lexicon in terms of
its word entries and word binding forces.
As it stands now, given an input line of
text, the word segmentor can process on
average 210,000 characters per second
when running on an IBM RISC
System/6000 3BT workstation with a
correct word identification rate of 99.74%.
1 Introduction 
A language model as a post-processor is essential
to a recognizer of speech or characters in order
to determine the appropriate word sequence and
hence the semantics of an input line of text or
utterance. It is well known that an N-gram statistics
language model is just as effective as, but much
more efficient than, a syntactic/semantic analyser
in determining the correct word sequence. A
necessary condition to successful collection of N-gram
statistics is the existence of a comprehensive
lexicon and a large text corpus. The latter must
be lexically analysed in order to identify all the
words, from which N-gram statistics can be
derived.
About 5,000 characters are being used in modern
Chinese and they are the building blocks of all
words. Almost every character is a word and most
words are one or two characters long, but there
are also abundant words longer than two
characters. Before it is segmented into words, a line of
text is just a sequence of characters and there are
numerous word segmentation alternatives.
Usually, all but one of these alternatives are
syntactically and/or semantically incorrect. This is the
case because, unlike texts in English, Chinese texts
have no word markers. A first step towards building
a language model based on N-gram statistics is
to develop an efficient lexical analyser to identify
all the words in the corpus.
Word segmentation algorithms belong to one of
two types in general, viz., the structural (Wang et
al., 1991) and the statistical type (Lua, 1990; Lua
and Gan, 1994; Sproat and Shih, 1990)
respectively. A structural algorithm resolves
segmentation ambiguities by examining the structural
relationships between words, while a statistical
algorithm compares the usage frequencies of the words
and their ordered combinations instead. Both
approaches have serious limitations.
2 Maximum Matching Method for 
Segmentation 
Maximum matching (Liu et al., 1994) is one of the
most popular structural segmentation algorithms
for Chinese texts. This method favours long words
and is a greedy algorithm by design, hence
sub-optimal. Segmentation may start from either end
of the line without any difference in segmentation
results. In this paper, the forward direction
is adopted. The major advantage of maximum
matching is its efficiency while its segmentation
accuracy can be expected to lie around 95%.
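
The greedy behaviour described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `max_len` limit and the toy lexicon (Latin letters standing in for Chinese characters) are assumptions made only for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as the fallback word.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; letters stand in for Chinese characters.
toy_lexicon = {"ab", "abc", "cd"}
print(forward_max_match("abcd", toy_lexicon))  # prints ['abc', 'd']
```

The toy run shows why the method is sub-optimal: the greedy choice of "abc" leaves a stray single character "d", although "ab" followed by "cd" is also a valid segmentation.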
3 Word Frequency Method for 
Segmentation 
In this statistical approach in terms of word
frequencies, a lexicon needs not only a rich repertoire
of word entries, but also the usage frequency of
each word. To segment a line of text, each possible
segmentation alternative is evaluated according
to the product of the word frequencies of the
words segmented. The word sequence with the
highest frequency product is accepted as correct.
This method is simple but its accuracy depends
heavily on the accuracy of the usage frequencies.
The usage frequency of a word differs greatly from 
one type of document to another, say, a passage
of world news as against a technical report. Since
there are tens of thousands of words actively used,
one needs a gigantic collection of texts to make
an accurate estimate, but by then, the estimate
is just an average and it may not be suitable for
any type of document at all. In other words, the
variance of such an estimate is too great, making
the estimate useless.
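
A minimal sketch of the frequency-product criterion follows. It assumes relative frequencies (values below 1) and enumerates the alternatives exhaustively, which is only practical for short spans. The function names and the toy frequency table are illustrative assumptions, not data from the paper.

```python
from math import prod


def all_segmentations(text, freq):
    """Yield every way of splitting text into words that appear in freq."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in freq:
            for rest in all_segmentations(text[end:], freq):
                yield [head] + rest


def best_by_frequency(text, freq):
    """Return the segmentation whose word-frequency product is largest."""
    candidates = list(all_segmentations(text, freq))
    if not candidates:
        return None  # some character is not covered by the lexicon
    return max(candidates, key=lambda words: prod(freq[w] for w in words))


# Hypothetical relative frequencies; letters stand in for Chinese characters.
toy_freq = {"a": 0.02, "b": 0.01, "c": 0.02, "d": 0.01,
            "ab": 0.03, "abc": 0.001, "cd": 0.04}
print(best_by_frequency("abcd", toy_freq))  # prints ['ab', 'cd']
```

Changing a single entry of the table, for example raising the frequency of "abc", can flip the result, which is the sensitivity to frequency estimates noted above.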
4 The Lexicon 
Most Chinese linguists accept the definition of a
word as the minimum unit that is semantically
complete and can be put together as building
blocks to form a sentence. However, in Chinese,
words can be united to form compound words,
and they in turn can combine further to form yet
higher ordered compound words. As a matter of
fact, compound words are extremely common and
they exist in large numbers. It is impossible to
include all compound words into the lexicon but
just to keep those which are frequently used and
have the word components united closely. A
lexicon was acquired from the Institute of
Information Science, Academia Sinica in Taiwan. There
are 78410 word entries in this lexicon, each
associated with a usage frequency. A corpus of over 63
million characters of news lines was acquired from
China. Due to cultural differences of the two
societies, there are many words encountered in the
corpus but not in the lexicon. The latter must
therefore be enriched before it can be applied to
perform the lexical analysis. The first step
towards this end is to merge a lexicon published in
China into this one, increasing the number of word
entries to 85,855.
5 The Proposed Word 
Segmentation Algorithm 
The proposed algorithm of this paper makes use of
a forward maximum matching strategy to identify
words. In this respect, this algorithm is a
structural approach. Under this strategy, errors are
usually associated with single-character words. If
the first character of a line is identified as a
single-character word, what it means is that there is
no multi-character word entry in the lexicon that
starts with such a character. In that case, there is
not much one can do about it. On the other hand,
when a character is identified as a single-character
word β following another word α in the line, one
cannot help wondering whether the sole character
composing β should not be combined with the
suffix of α to form another word instead, even if that
means changing α into a shorter word. In that
case, every possible word sequence alternative
corresponding to the sub-sequence of characters from
α and β together will be evaluated according to
the product of its constituent word binding forces.
The binding force of a word is a measure of how
strongly the characters composing the word are
bound together as a single unit. This force is
often equated to the usage frequency of the word.
In this respect, the proposed algorithm is a
statistical approach. It is as efficient as the maximum
matching method because word binding forces are
utilized only in exceptional cases. However, many
of the word ambiguities are eliminated, leading
to a very high word identification accuracy.
Segmentation errors associated with multi-character
words can be reduced by adding or deleting words
to or from the lexicon as well as adjusting word
binding forces.
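
The interplay of the two strategies can be sketched as follows. This is a simplified reading of the description above, not the authors' code: binding forces are assumed to be relative frequencies stored in a dictionary, the alternatives for the span α + β are enumerated exhaustively, and all names and toy values are hypothetical.

```python
from math import prod


def segmentations(span, force):
    """Yield every split of span into words that have a binding force."""
    if not span:
        yield []
        return
    for end in range(1, len(span) + 1):
        if span[:end] in force:
            for rest in segmentations(span[end:], force):
                yield [span[:end]] + rest


def segment(text, force, max_len=4):
    """Forward maximum matching, re-checked whenever a lone character follows a word."""
    words, i = [], 0
    while i < len(text):
        # Plain forward maximum matching step (a single character always matches).
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in force:
                break
        i += length
        # Exceptional case: single-character word (beta) after a longer word (alpha).
        if length == 1 and words and len(words[-1]) > 1:
            options = list(segmentations(words[-1] + cand, force))
            if options:
                best = max(options, key=lambda ws: prod(force[w] for w in ws))
                words[-1:] = best  # replace alpha by the best split of alpha + beta
                continue
        words.append(cand)
    return words


# Hypothetical binding forces; letters stand in for Chinese characters.
toy_force = {"a": 0.02, "b": 0.01, "c": 0.02, "d": 0.01,
             "ab": 0.03, "abc": 0.001, "cd": 0.04}
print(segment("abcd", toy_force))  # prints ['ab', 'cd']
```

Maximum matching alone would return "abc" and "d" here. The binding-force product is consulted only because the lone "d" follows a multi-character word, which keeps the extra cost confined to the exceptional cases.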
6 Structure of the Lexicon 
Words in the lexicon are divided into 5 groups
according to word lengths. They correspond to
words of 1, 2, 3, 4, and more than 4 characters
with group sizes equal to 7025, 53532, 12939,
11269, and 1090 respectively. Since most of the
time spent in analyzing a line of text is in finding
a match among the lexicon entries, a clever
organization of the lexicon speeds up the searching
process tremendously. Most Chinese words are of
one or two characters only. Searching for longer
words before shorter ones as practised in maximum
matching means spending a great deal of
time searching for non-existent targets. To
overcome this problem, the following measures are
taken to organize the lexicon for fast search:
• All single-character words are stored in a
table of 32768 bins. Since the internal code of
a character takes 2 bytes, bits 1-15 are used
as the bin address for the word.
• All 2-character words are stored in a separate
table of 65536 bins. The two low order bytes
of the two characters are used as a short
integer for bin address. Should there be other
words contesting for the same bin, they are
kept in a linked list.
• Any 3-character word is split into a
2-character prefix and a 1-character suffix. The
prefix will be stored in the bin table for
2-character words with clear indication of its
prefix status. The suffix will be stored in the
bin table for 1-character words, again, with
clear indication of its suffix status. All
duplicate entries are combined, i.e., if α is a word
as well as a suffix, the two entries are
combined into one with an indication that it can
serve as a word as well as a suffix.
• Any 4-character word is divided up into a
2-character prefix and a 2-character suffix, both
stored in the bin table for 2-character words,
with clear indications of their respective
status. Each prefix points to a linked list of
associated suffixes.
• Any word longer than 4 characters will be di- 
vided into a 2-character prefix, a 2-character 
infix and a suffix. The prefix and the infix are
stored in the bin table for 2-character words, 
with clear indications of their status. Each 
prefix points to a linked list of associated in- 
fixes and each infix in turn, points to a linked 
list of associated suffixes. 
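
The bin addressing in the first two items can be sketched as below. Several details are assumptions, since the paper does not give them: the Unicode code point stands in for the 2-byte internal code, "bits 1-15" is read as the low 15 bits, the low-order byte of the first character is placed in the high half of the address, and a Python list stands in for each linked list.

```python
from collections import defaultdict


def code16(ch):
    """Stand-in for the 2-byte internal code (assumption: Unicode code point)."""
    return ord(ch) & 0xFFFF


def bin_single(ch):
    """15-bit address for a 1-character entry (assumption: the low 15 bits)."""
    return code16(ch) & 0x7FFF  # 0 .. 32767


def bin_pair(two_chars):
    """16-bit address built from the low-order byte of each character."""
    first, second = two_chars
    return ((code16(first) & 0xFF) << 8) | (code16(second) & 0xFF)  # 0 .. 65535


class PairTable:
    """Table of 2-character entries; colliding entries share a chain."""

    def __init__(self):
        self.bins = defaultdict(list)  # bin address -> chain of entries

    def add(self, two_chars):
        chain = self.bins[bin_pair(two_chars)]
        if two_chars not in chain:
            chain.append(two_chars)

    def contains(self, two_chars):
        # .get avoids creating an empty chain for an absent bin.
        return two_chars in self.bins.get(bin_pair(two_chars), [])


table = PairTable()
table.add("\u4e2d\u56fd")
table.add("\u4f2d\u56fd")  # same low-order bytes, so it lands in the same bin
print(table.contains("\u4e2d\u56fd"), len(table.bins[bin_pair("\u4e2d\u56fd")]))
```

Because only the low-order bytes enter the address, distinct character pairs can share a bin, which is why each bin needs a chain of entries.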
Maximum matching segmentation of a sequence
of characters "...abcdefghij..." at the character "a"
starts with matching "ab" against the 2-character
words table. If no match is found, then "a" is
assumed a 1-character word and maximum matching
moves on to "b". If a match is found, then
"ab" is investigated to see if it can be a prefix.
If it cannot, then "ab" is a 2-character word and
maximum matching moves on to "c". If it can,
then one examines if it can be associated with an
infix. If it can, then one examines if "cd" can be
an infix associated with "ab". If the answer is
negative, then the possibility of "abcd" being a
word is considered. If that fails again, then "c"
in the table of 1-character words is examined to
see if it can be a suffix. If it can, then "abc" will
be examined to see if it can be a word by searching
the 1-character suffix linked list pointed at by
"ab". Otherwise, one has to accept that "ab" is a
2-character word and move on to start matching
at "c". If "cd" can be an infix preceded by "ab",
the linked list pointed at by "cd" as an infix will
be searched for the longest possible suffix to
combine with "abcd" as its prefix. If no match can be
found, then one has to give up "cd" as an infix to
"ab".
7 Training of the System 
Despite the fact that the lexicon acquired from
Taiwan has been augmented with words from
another lexicon developed in China, when it is
applied to segment 1.2 million characters of news
passages in blocks of 10,000 characters each, randomly
selected over the text corpus, an average word
segmentation error rate (μ) of 2.51% was found with
a standard deviation (σ) of 0.57%, mostly caused
by uncommon words not included in the enriched
lexicon. It was then decided that the lexicon should
be further enriched with new words and adjusted
word binding forces over a number of generations.
In generation i, n new blocks of text are picked
randomly from the corpus and segmented into words
using the lexicon enriched in the previous
generation. This process will stop when μ levels off over
several generations. The 100(1 - α)% confidence
interval of μ in generation i is
$\pm t_{0.5\alpha,\,n-1}\,\sigma/\sqrt{n}$,
where σ is the standard deviation of error rates
in generation i-1, and n is the number of blocks
to be segmented in generation i. $t_{0.5\alpha,\,n-1}$ is the
upper 0.5α critical value of the t distribution with
n-1 degrees of freedom (Devore, 1991). Throughout
the experiments below, n is always chosen to be 20
so that the 90% confidence interval (i.e., α = 0.1)
of μ is about ±0.23%.
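
As a rough arithmetic check of the interval, the sketch below evaluates the half-width for n = 20 and σ = 0.57%. The critical value 1.729 for 19 degrees of freedom is taken from standard t tables rather than from the paper, and the function name is hypothetical.

```python
from math import sqrt

# Upper 5% point of Student's t with 19 degrees of freedom (standard table value).
T_CRIT_05_19 = 1.729


def half_width(sigma, n, t_crit):
    """Half-width of the confidence interval: t * sigma / sqrt(n)."""
    return t_crit * sigma / sqrt(n)


hw = half_width(sigma=0.57, n=20, t_crit=T_CRIT_05_19)
print(f"{hw:.2f}")  # prints 0.22
```

The result, roughly 0.22 percentage points, is in line with the figure of about ±0.23% quoted above.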
8 Experimental Results 
The lexicon has been updated over six generations 
after being applied to word segment 1.2 million 
characters. The vocabulary increases from 85855
words to 87326 words. The segmentation error 
rates over seven generations of the training process 
are shown in the table below: 
Lexicon       Error rate μ over a text of 200,000 characters
generation    Max. Mat.      Max. Mat. +
number                       Word Bind. Force
0             5.71%          2.32%
1             5.20%          2.16%
2             4.66%          1.88%
3             [illegible]    1.69%
4             2.60%          0.43%
5             2.47%          0.30%
6             2.44%          0.26%
Most of these errors occur in proper nouns not 
included in the lexicon. They are hard to avoid 
unless they become popular enough to be added
to the lexicon. The CPU time used for segmenting
a text of 1,200,000 characters is 5.7 seconds on an
IBM RISC System/6000 3BT computer.
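
The throughput and accuracy figures quoted in the abstract follow from these numbers by simple arithmetic, as the short check below shows. The variable names are illustrative only.

```python
chars = 1_200_000          # characters segmented
seconds = 5.7              # CPU time reported
final_error_rate = 0.26    # percent, generation 6 with word binding force

throughput = chars / seconds           # characters per second
accuracy = 100.0 - final_error_rate    # percent of words identified correctly

print(f"{throughput:,.0f} characters per second, {accuracy:.2f}% correct")
```

The throughput comes to a little over 210,000 characters per second and the accuracy to 99.74%, matching the abstract.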
9 Conclusion 
Lexical analysis is a basic process of analyzing and 
understanding a language. The proposed algo- 
rithm provides a highly accurate and highly effi- 
cient way for word segmentation of Chinese texts. 
Due to cultural differences, the same language
used in different geographical regions and different
applications can be quite different, causing
problems in lexical analysis. However, by introducing
new words into and adjusting word binding forces
in the lexicon, such difficulties can be greatly
mitigated.
This word segmentor will be applied to word 
segment the entire corpus of 63 million characters 
before N-gram statistics will be collected for post- 
processing recognizer outputs. 
References 
Jay L. Devore. 1991. Probability and Statistics
for Engineering and Sciences. Duxbury Press,
pages 272-276.
Yuan Liu, Qiang Tan, and Kun Xu Shen. 1994.
The Word Segmentation Rules and Automatic
Word Segmentation Methods for Chinese
Information Processing (in Chinese). Qing Hua
University Press and Guang Xi Science and
Technology Press, page 36.
Kim-Teng Lua and Kok-Wee Gan. 1994. An
Application of Information Theory in Chinese
Word Segmentation. Computer Processing of
Chinese and Oriental Languages, Vol. 8, No. 1,
pages 115-123, June.
K.T. Lua. 1990. From Character to Word - An
Application of Information Theory. Computer
Processing of Chinese and Oriental Languages,
Vol. 4, No. 4, pages 304-313, March.
Liang-Jyh Wang, Tzusheng Pei, Wei-Chuan Li,
and Lih-Ching R. Huang. 1991. A Parsing
Method for Identifying Words in Mandarin
Chinese Sentences. In Proceedings of 12th
International Joint Conference on Artificial
Intelligence, pages 1018-1023, Darling Harbour,
Sydney, Australia, 24-30 August.
Richard Sproat and Chilin Shih. 1990. A
Statistical Method for Finding Word Boundaries in
Chinese Text. Computer Processing of Chinese
and Oriental Languages, Vol. 4, No. 4, pages
336-349, March.
