N-th Order Ergodic Multigram HMM for Modeling of
Languages without Marked Word Boundaries
Hubert Hin-Cheung LAW
Dept. of Computer Science
The University of Hong Kong
hhclaw@cs.hku.hk
Chorkin CHAN
Dept. of Computer Science
The University of Hong Kong
cchan@cs.hku.hk
Abstract
Ergodic HMMs have been successfully used for modeling sentence production. However, for some oriental languages such as Chinese, a word can consist of multiple characters, and there are no word boundary markers between adjacent words in a sentence. This makes word segmentation of the training and testing data necessary before an ergodic HMM can be applied as the language model. This paper introduces the N-th order Ergodic Multigram HMM for language modeling of such languages. Each state of the HMM can generate a variable number of characters, corresponding to one word. The model can be trained without a word-segmented and tagged corpus, and both segmentation and tagging are trained in one single model. Results of its application to a Chinese corpus are reported.
1 Motivation 
Statistical language modeling offers advantages including minimal need for domain-specific knowledge and hand-written rules, trainability, and scalability given a language corpus. Language models such as N-gram class models (Brown et al., 1992) and Ergodic Hidden Markov Models (Kuhn et al., 1994) have been proposed and used in applications such as syntactic class (POS) tagging for English (Cutting et al., 1992), and clustering and scoring of recognizer sentence hypotheses.
However, in Chinese and many other oriental languages, there are no boundary markers, such as spaces, between words. Therefore preprocessors have to be used to perform word segmentation in order to identify individual words before applying these word-based language models. As a result, current approaches to modeling these languages are separated into two distinct processes.
Word segmentation is by no means a trivial process, since ambiguity often exists. For proper segmentation of a sentence, some linguistic information about the sentence should be used. However, commonly used heuristic or statistical approaches, such as maximal matching, frequency counts or mutual information statistics, have to perform the segmentation without knowledge such as the resulting word categories.
To reduce the impact of erroneous segmentation on the subsequent language model, (Chang and Chan, 1993) used an N-best segmentation interface between them. However, since this is still a two-stage model, the parameters of the whole model cannot be optimized together, and an N-best interface is inadequate for processing outputs from recognizers, which can be highly ambiguous.
A better approach is to keep all possible segmentations in a lattice form, score the lattice with a language model, and finally retrieve the best candidate by dynamic programming or some searching algorithm. N-gram models are usually used for scoring (Gu et al., 1991) (Nagata, 1994), but their training requires the sentences of the corpus to be manually segmented, and even class-tagged if a class-based N-gram is used, as in (Nagata, 1994).
A language model which considers segmentation ambiguities, integrates them with an N-gram model, and can be trained and tested on a raw, unsegmented and untagged corpus is highly desirable for processing languages without marked word boundaries.
2 The Ergodic Multigram HMM Model
2.1 Overview 
Based on the Hidden Markov Model, the Ergodic Multigram Hidden Markov Model (Law and Chan, 1996), when applied as a language model, can operate directly on an unsegmented input corpus, as it allows a variable number of characters in each word class. Otherwise its properties are similar to those of Ergodic Hidden Markov Models (Kuhn et al., 1994), in that both training and scoring can be done directly on a raw, untagged corpus, given a lexicon with word classes.
Specifically, the N-th order Ergodic Multigram HMM, as in a conventional class-based (N+1)-gram model, assumes a doubly stochastic process in sentence production. The word-class sequence in a sentence follows the N-th order Markov assumption, i.e. the identity of a class in the sentence depends only on the previous N classes, and the word observed depends only on the class it belongs to. The difference is that this is a multigram model (Deligne and Bimbot, 1995) in the sense that each state (i.e. node in the HMM) can generate a variable number of observed character sequences. Sentence boundaries are modeled as a special class.
This model can be applied to an input sentence or a character lattice as a language model. The maximum likelihood state sequence through the model, obtained using the Viterbi or Stack Decoding Algorithm, represents the best particular segmentation and class tagging for the input sentence or lattice, since a transition between states denotes a word boundary and the state identity denotes the current word class.
2.2 Lexicon
A lexicon (CKIP, 1993) of 78,322 words, each containing up to 10 characters, is available for use in this work. Practically all characters have an entry in the lexicon, so that out-of-vocabulary words are modeled as individual characters. There is a total of 192 syntactic classes, arranged in a hierarchical way. For example, the month names are denoted by the class Ndabc, where N denotes Nouns, Nd denotes Temporal Nouns, Nda Time Names and Ndab reusable time names. There is a total of 8 major categories.
Each word in the dictionary is annotated with one or more syntactic tags, representing the different syntactic classes the word can possibly belong to. Also, a frequency count for each word, based on a certain corpus, is given, but without information on its distribution over the different syntactic classes.
2.3 Terminology
Let $W$ be the set of all Chinese words in the lexicon. A word $w_k \in W$ is made up of one or more characters. Let $s_1^T = (s_1, s_2, \ldots, s_T)$ denote a sentence as a $T$-character sequence. A function $\delta_w$ is defined such that $\delta_w(w_k, s_t^{t+r-1})$ is 1 if $w_k$ is the $r$-character word $s_t \ldots s_{t+r-1}$, and 0 otherwise.¹ Let $R$ be the upper bound of $r$, i.e. the maximum number of characters in a word (10 in this paper).
Let $\mathcal{C} = \{c_1 \ldots c_L\}$ be the set of syntactic classes, where $L$ is the number of syntactic classes in the lexicon (192 in our case). Let $C \subseteq W \times \mathcal{C}$ denote the relation for all syntactic classifications of the lexicon, such that $(w_k, c_l) \in C$ iff $c_l$ is one of the syntactic classes for $w_k$. Each word $w_k$ must belong to one or more of the classes.
A path through the model represents a particular segmentation and class tagging for the sentence. Let $\mathcal{L} = (w_1, c_{l_1}; \ldots; w_K, c_{l_K})$ be a particular segmentation and class tagging for the sentence $s_1^T$, where $w_k$ is the $k$th word and $c_{l_k}$ denotes the class assigned to $w_k$, as illustrated below.

$(\underbrace{s_1 \ldots s_{t_1-1}}_{w_1,\, c_{l_1}} \ldots \underbrace{s_{t_{k-1}} \ldots s_{t_k-1}}_{w_k,\, c_{l_k}} \ldots \underbrace{s_{t_{K-1}} \ldots s_T}_{w_K,\, c_{l_K}})$

For $\mathcal{L}$ to be proper, $\delta_w(w_k, s_{t_{k-1}}^{t_k-1}) = 1$ and $(w_k, c_{l_k}) \in C$ must be satisfied, where $t_0 = 1$, $t_K = T+1$ and $t_{k-1} < t_k$ for $1 \le k \le K$.
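The matching function $\delta_w$ and the enumeration of proper segmentations can be sketched in a few lines of Python. The lexicon below is a toy stand-in (single letters play the role of Chinese characters); the real lexicon has 78,322 words of up to 10 characters.

```python
# Toy sketch of the delta_w matching function and the enumeration of all
# proper segmentations of a sentence. The lexicon here is hypothetical.

R = 10  # maximum word length in characters

lexicon = {"a", "b", "ab", "c", "abc"}  # stand-ins for multi-character words

def delta_w(w, s, t, r):
    """1 if word w equals the r-character span s[t:t+r] and w is in the lexicon."""
    return 1 if w in lexicon and s[t:t + r] == w else 0

def segmentations(s):
    """Enumerate every word sequence (w_1 ... w_K) that exactly covers s."""
    if not s:
        return [[]]
    results = []
    for r in range(1, min(R, len(s)) + 1):
        w = s[:r]
        if delta_w(w, s, 0, r):
            results.extend([w] + rest for rest in segmentations(s[r:]))
    return results

print(segmentations("abc"))
```

Each returned list corresponds to one candidate path $\mathcal{L}$ through the model (before class tags are attached).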
2.4 HMM States for the N-th order model
In the first order HMM (class bigram) model, each HMM state corresponds directly to the word class of a word. But in general, for an N-th order HMM model, since each class depends on the N previous classes, each state has to represent the combination of the classes of the most recent N words, including the current word.
Let $Q_i$ represent a state of the N-th order Ergodic Multigram HMM. Thus $Q_i = (c_{i_0} \ldots c_{i_{N-1}})$, where $c_{i_0}$ is the current word class, $c_{i_1}$ is the previous word class, etc. There is a total of $L^N$ states, which may mean too many parameters for the model if N is anything greater than one ($L^{N+1}$ possible state transitions, as each state can transit to $L$ other states).
To solve this problem, a reasonable assumption can be made that the detailed class identities of a more distant word have, in general, less influence on the current word class than those of closer words. Thus instead of using C as the classification relation for all previous words, a set of
¹The algorithm to be described assumes that the character identities are known for the sentence $s_1^T$, but it can also be applied when each character position $s_t$ becomes a set of possible character candidates, by simply letting $\delta_w(w_k, s_t^{t+r-1}) = 1$ for all words $w_k$ which can be constructed from the character positions $s_t \ldots s_{t+r-1}$ of the input character lattice. This enables the model to be used as the language model component for recognizers and for decoding phonetic input.
classification relations $\{C^{(0)}, C^{(1)}, \ldots, C^{(N-1)}\}$ can be used, where $C^{(0)} = C$ represents the original, most detailed classification relation for the current word, and $C^{(n)}$ is the less detailed classification scheme for the nth previous word at each state. Thus the number of states reduces to $L_Q = L^{(0)} L^{(1)} \cdots L^{(N-1)}$, in which $L^{(n)} \le L$. Each state is represented as $Q_i = (c_{i_0}^{(0)} \ldots c_{i_{N-1}}^{(N-1)})$, where $C^{(n)} = \{c_l^{(n)}\}$, $1 \le l \le L^{(n)}$, is the class tag set for the nth previous word.
However, if no constraints are imposed on the series of classification relations $C^{(n)}$, the number of possible transitions may increase despite a decrease in the number of states, since state transitions may become possible between every two states, resulting in a total of $L^{(0)2} L^{(1)2} \cdots L^{(N-1)2}$ possible transitions.
A constraint is imposed that, given that a word belongs to the class $c_l^{(n)}$ in the classification $C^{(n)}$, we can determine the corresponding word class $c_{l'}^{(n+1)}$ the given word will belong to in $C^{(n+1)}$, and for every word there are no extra classifications in $C^{(n+1)}$ not corresponding to one in $C^{(n)}$. Formally, there exist mapping functions $\mathcal{F}^{(n)}: C^{(n)} \to C^{(n+1)}$, $0 \le n \le N-2$, such that if $\mathcal{F}^{(n)}(c_l^{(n)}) = c_{l'}^{(n+1)}$ then $((w_k, c_l^{(n)}) \in C^{(n)}) \Rightarrow ((w_k, c_{l'}^{(n+1)}) \in C^{(n+1)})$ for all $w_k \in W$, and that $\mathcal{F}^{(n)}$ is surjective. In particular, to model sentence boundaries, we allow \$ to be a valid class tag for all $C^{(n)}$, and define $\mathcal{F}^{(n)}(\$) = \$$.
The above constraint ensures that a given state $Q_i = (c_{i_0}^{(0)}, \ldots, c_{i_{N-1}}^{(N-1)})$ can only transit to $Q_j = (c_{j_0}^{(0)}, \mathcal{F}^{(0)}(c_{i_0}^{(0)}), \ldots, \mathcal{F}^{(N-2)}(c_{i_{N-2}}^{(N-2)}))$, where $c_{j_0}^{(0)}$ is any class in $C^{(0)}$, thus reducing the maximum number of possible transitions to $L^{(0)2} L^{(1)} \cdots L^{(N-1)}$.
This constraint is easily satisfied by using a hierarchical word-class scheme, such as the one in the CKIP lexicon or one generated by hierarchical word-clustering, so that the classification for more distant words (higher n in $C^{(n)}$) uses a higher level, less detailed tag set in the scheme.
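As a sketch of how such a hierarchical scheme yields the mapping functions $\mathcal{F}^{(n)}$, the toy code below truncates CKIP-style tags to shorter prefixes for more distant history positions. The tag names, prefix lengths and the second-order state are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: mapping functions F^(n) derived from a hierarchical tag scheme.
# The class for the n-th previous word is the tag truncated to a shorter
# prefix, so the constraint on transitions is satisfied by construction.

prefix_len = [5, 2, 1]  # tag detail used for C^(0), C^(1), C^(2) (assumed)

def F(n, tag):
    """Map a class tag in C^(n) to its coarser counterpart in C^(n+1)."""
    if tag == "$":                    # sentence boundary maps to itself
        return "$"
    return tag[:prefix_len[n + 1]]

def next_state(state, new_class):
    """Under the transition constraint, the next state's history is fully
    determined by the current state: only the new current class is free."""
    return (new_class, F(0, state[0]), F(1, state[1]))

# A state of a 2nd order model: (current class, previous, one before that).
state = ("Ndabc", "Na", "V")
print(next_state(state, "Vcbde"))  # ("Vcbde", "Nd", "N")
```

This is why each state can only transit to $L^{(0)}$ successors: the coarse history entries are functions of the current state.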
2.5 Sentence Likelihood Formulation 
Let $\{\mathcal{L}\}$ be the set of all possible segmentations and class taggings of a sentence. Under the N-th order model $\Theta^N$, the likelihood of each valid segmentation and tagging $\mathcal{L}$ of the sentence $s_1^T$, $P(s_1^T, \mathcal{L} | \Theta^N)$, can be derived as follows.

$P(w_1, c_{l_1}; w_2, c_{l_2}; \ldots; w_K, c_{l_K} | \Theta^N)$
$= P(w_1|c_{l_1})\, P(c_{l_1}|\$^N)\, P(\$|c_{l_K} \ldots c_{l_{K-N+1}}) \times \left( \prod_{k=2}^{K} P(w_k|c_{l_k})\, P(c_{l_k}|c_{l_{k-1}} \cdots c_{l_{k-N}}) \right)$
$= P(w_1|c_{l_1})\, P(Q_{l_1}|\$^N)\, P(\$|Q_{l_K}) \times \left( \prod_{k=2}^{K} P(w_k|c_{l_k})\, P(Q_{l_k}|Q_{l_{k-1}}) \right)$

using the Nth order Markov assumption and representing the class history as HMM states. \$ denotes the sentence boundary, $c_{l_k}$ is \$ for $k \le 0$, and $Q_{l_k} = (c_{l_k}^{(0)} \ldots c_{l_{k-N+1}}^{(N-1)})$. Note that $Q_{l_k}$ can be determined from $c_{l_k}$ and $Q_{l_{k-1}}$ due to the constraint on the classification, and thus $P(Q_{l_k}|Q_{l_{k-1}}) = P(c_{l_k}|Q_{l_{k-1}})$.
The likelihood of the sentence $s_1^T$ under the model is given by the sum of the likelihoods of its possible segmentations:

$P(s_1^T | \Theta^N) = \sum_{\mathcal{L}} P(s_1^T, \mathcal{L} | \Theta^N)$
3 The Algorithms 
3.1 The Parameters 
As in conventional HMMs, the Ergodic Multigram HMM consists of parameters $\Theta^N = \{A, B\}$, in which $A = \{a_{ij}\}$, $0 \le i, j \le L_Q$ (the total number of states), denotes the set of state transition probabilities from $Q_i$ to $Q_j$, i.e. $P(Q_j|Q_i)$. In particular, $a_{0i} = P(Q_i|\$^N)$ and $a_{i0} = P(\$|Q_i)$ denote the probabilities that the state $Q_i$ is the initial and final state in traversing the HMM, respectively; $a_{00}$ is left undefined. $B = \{b_j(w_k)\}$, where $1 \le j \le L_Q$, denotes the set of word observation probabilities of $w_k$ at the state $Q_j$, i.e. $P(w_k|Q_j)$.
The B matrix, as shown above, models the probabilities that $w_k$ is observed given the N most recent classes, and contains $L_Q|W|$ parameters (recall that $L_Q = L^{(0)} L^{(1)} \cdots L^{(N-1)}$). Our assumption that $w_k$ only depends on the current class reduces the number of parameters to $L^{(0)}|W|$ for the B matrix. Thus in the model, the $b_j(w_k)$ representing $P(w_k|Q_j)$ are tied together for all states $Q_j$ with the same current word class, i.e. $P(w_k|Q_j) = P(w_k|c_l)$ if $Q_j = (c_l \ldots)$. Also, $a_{ij}$ is 0 if $Q_i$ cannot transit to $Q_j$. As a result the number of parameters in the A matrix is only $L^{(0)} L_Q$.
Given the segmentation and class sequence $\mathcal{L}$ of a sentence, the state sequence $(Q_{l_1} \ldots Q_{l_K})$ can be derived from the class sequence $(c_{l_1} \ldots c_{l_K})$. Thus the observation probability of the sentence $s_1^T$ given $\mathcal{L}$ and the model $\Theta^N$, $P(s_1^T, \mathcal{L}|\Theta^N)$, can be reformulated as

$a_{0 l_1}\, b_{l_1}(w_1) \left( \prod_{k=2}^{K} a_{l_{k-1} l_k}\, b_{l_k}(w_k) \right) a_{l_K 0}$

Given this formulation the training procedure is mostly similar to that of the first order Ergodic Multigram HMM.
3.2 Forward and Backward Procedure 
The forward variable is defined as

$\alpha_t(i) = P(s_1 \ldots s_t, Q_{i(t)} = Q_i \,|\, \Theta^N)$

where $Q_{i(t)}$ is the state of the HMM when the word containing the character $s_t$ as its last character is produced.
The recursive equations for $\alpha_t(i)$ are

$\alpha_t(j) = 0$ for $t < 0$
$\alpha_0(j) = a_{0j}$
$\alpha_t(j) = \sum_{r=1}^{R} \sum_{w_k \in W} \left[ \sum_{i=1}^{L_Q} \alpha_{t-r}(i)\, a_{ij}\, b_j(w_k) \right] \delta_w(w_k, s_{t-r+1}^t)$ for $1 \le t \le T$
Similarly, the backward variable is defined as

$\beta_t(i) = P(s_{t+1} \ldots s_T \,|\, Q_{i(t)} = Q_i, \Theta^N)$

The recursive equations for $\beta_t(i)$ are

$\beta_t(i) = 0$ for $t > T$
$\beta_T(i) = a_{i0}$
$\beta_t(i) = \sum_{r=1}^{R} \sum_{w_k \in W} \sum_{j=1}^{L_Q} a_{ij}\, b_j(w_k)\, \beta_{t+r}(j)\, \delta_w(w_k, s_{t+1}^{t+r})$ for $1 \le t \le T-1$
As the $A$ and $B$ arrays and the $\delta_w$ function are mostly zeros, considerable simplification can be made in the implementation.
The likelihood of the sentence given the model can be evaluated as

$P(s_1^T | \Theta^N) = \sum_{i=1}^{L_Q} \alpha_T(i)\, a_{i0}$

The Viterbi algorithm for this model can be obtained by replacing the summations of the forward algorithm with maximizations.
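A minimal sketch of the forward pass for the first order (N=1) case, where each state is a single word class and a state emission may span several characters. The lexicon, classes and probability values are all invented for illustration; a real implementation would exploit the sparsity of $A$, $B$ and $\delta_w$ as noted above.

```python
# Toy forward pass for a first-order Ergodic Multigram HMM: states are
# word classes, and each emission is a variable-length word. All numbers
# and the tiny lexicon below are invented for illustration.

from collections import defaultdict

R = 3  # maximum word length in characters (toy value)

lexicon = {("ab", "N"), ("c", "V"), ("abc", "N"), ("a", "N"), ("b", "V")}
classes = ["N", "V"]

a_init = {"N": 0.6, "V": 0.4}                      # P(class | $)
a_trans = {("N", "N"): 0.3, ("N", "V"): 0.7,
           ("V", "N"): 0.8, ("V", "V"): 0.2}       # P(class' | class)
a_final = {"N": 0.5, "V": 0.5}                     # P($ | class)
b_obs = {("ab", "N"): 0.3, ("abc", "N"): 0.2, ("a", "N"): 0.5,
         ("c", "V"): 0.6, ("b", "V"): 0.4}         # P(word | class)

def forward_likelihood(s):
    T = len(s)
    # alpha[t][j]: probability of producing s[:t] with a word of class j ending at t.
    alpha = [defaultdict(float) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for r in range(1, min(R, t) + 1):
            w = s[t - r:t]
            for j in classes:
                if (w, j) not in lexicon:
                    continue  # delta_w(w, s_{t-r+1}^t) = 0
                if t - r == 0:
                    inner = a_init[j]  # word starts the sentence
                else:
                    inner = sum(alpha[t - r][i] * a_trans[(i, j)]
                                for i in classes)
                alpha[t][j] += inner * b_obs[(w, j)]
    # Close the sentence with the boundary class $.
    return sum(alpha[T][j] * a_final[j] for j in classes)

print(forward_likelihood("abc"))
```

Replacing the two summations inside `forward_likelihood` with `max` (and keeping back-pointers) yields the corresponding Viterbi decoder.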
3.3 Re-estimation Algorithm 
$\xi_t(i,j)$ is defined as the probability that, given a sentence $s_1^T$ and the model $\Theta^N$, a word ends at the character $s_t$ in the state $Q_i$ and the next word starts at the character $s_{t+1}$ in the state $Q_j$. Thus $\xi_t(i,j)$ can be expressed as

$\xi_t(i,j) = \frac{\alpha_t(i) \sum_{r=1}^{R} \sum_{w_k \in W} a_{ij}\, b_j(w_k)\, \delta_w(w_k, s_{t+1}^{t+r})\, \beta_{t+r}(j)}{P(s_1^T | \Theta^N)}$

for $1 \le t \le T-1$, $1 \le i,j \le L_Q$. Furthermore, define $\gamma_t(i)$ to be the probability that, given $s_1^T$ and $\Theta^N$, a word ends at the character $s_t$ in the state $Q_i$. Thus

$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(s_1^T | \Theta^N)}$ for $1 \le t \le T$, $1 \le i \le L_Q$.

Summation of $\xi_t(i,j)$ over $t$ gives the expected number of times that state $Q_i$ transits to state $Q_j$ in the sentence, and summation of $\gamma_t(i)$ over $t$ gives the expected number of times state $Q_i$ occurs in it. Thus the quotient of their summations over $t$ gives $\bar{a}_{ij}$, the new estimate of $a_{ij}$:

$\bar{a}_{ij} = \sum_{t=1}^{T-1} \xi_t(i,j) \Big/ \sum_{t=1}^{T} \gamma_t(i)$ for $1 \le i,j \le L_Q$
The initial and final class probability estimates, $a_{0j}$ and $a_{i0}$, can be re-estimated as follows:

$\bar{a}_{0j} = \frac{\sum_{r=1}^{R} \sum_{w_k \in W} a_{0j}\, b_j(w_k)\, \delta_w(w_k, s_1^r)\, \beta_r(j)}{P(s_1^T | \Theta^N)}$

$\bar{a}_{i0} = \frac{\alpha_T(i)\, a_{i0}}{P(s_1^T | \Theta^N)} \Big/ \sum_{t=1}^{T} \gamma_t(i)$
To derive $\bar{b}_j(w_k)$, first define $\alpha_t^{w_k}(j)$ as the probability of the sentence prefix $(s_1 \ldots s_t)$ with $w_k$ in state $Q_j$ as the last complete word. Thus

$\alpha_t^{w_k}(j) = \sum_{r=1}^{R} \sum_{i=1}^{L_Q} \alpha_{t-r}(i)\, a_{ij}\, b_j(w_k)\, \delta_w(w_k, s_{t-r+1}^t)$

This represents the contribution of $w_k$, occurring as the last word in $s_1^t$, to $\alpha_t(j)$. Also define $\gamma_t^{w_k}(j)$ to be the probability that, given the sentence $s_1^T$ and the model, $w_k$ is observed to end at character $s_t$ in the state $Q_j$:

$\gamma_t^{w_k}(j) = \frac{\alpha_t^{w_k}(j)\, \beta_t(j)}{P(s_1^T | \Theta^N)}$
Let $Q_j \circ Q_{j'}$ denote the relation that both $Q_j$ and $Q_{j'}$ represent the same current word class. Summation of $\gamma_t^{w_k}(j)$ over $t$ gives the expected number of times that $w_k$ is observed in state $Q_j$, and summation of $\gamma_t(j)$ over $t$ gives the total expected number of occurrences of state $Q_j$. Since states with the same current word class are tied together by our assumption, the required value of $\bar{b}_j(w_k)$ is given by

$\bar{b}_j(w_k) = \frac{\sum_{j':\, Q_j \circ Q_{j'}} \sum_{t=1}^{T} \gamma_t^{w_k}(j')}{\sum_{j':\, Q_j \circ Q_{j'}} \sum_{t=1}^{T} \gamma_t(j')}$
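The tying in the $\bar{b}_j(w_k)$ update can be sketched as follows: expected counts are pooled over all states $Q_{j'}$ sharing the same current word class before normalizing. States are represented as class tuples, and the $\gamma$ values below are made-up expected counts, not outputs of an actual forward-backward run.

```python
# Sketch of the tied re-estimation of b_j(w_k). Expected counts for all
# states with the same current word class (the first tuple entry) are
# pooled, so tied states receive identical observation probabilities.

from collections import defaultdict

# Sum over t of gamma_t^{w_k}(j), keyed by (state, word); hypothetical values.
gamma_word = {(("N", "V"), "ab"): 1.2, (("N", "N"), "ab"): 0.3,
              (("N", "V"), "abc"): 0.5}
# Sum over t of gamma_t(j), keyed by state; hypothetical values.
gamma_state = {("N", "V"): 2.0, ("N", "N"): 2.0}

def reestimate_b():
    num = defaultdict(float)  # pooled word counts per (current class, word)
    den = defaultdict(float)  # pooled state occupancy per current class
    for (state, w), v in gamma_word.items():
        num[(state[0], w)] += v
    for state, v in gamma_state.items():
        den[state[0]] += v
    return {(c, w): v / den[c] for (c, w), v in num.items()}

print(reestimate_b())
```

Here both states with current class "N" contribute to the same estimate, e.g. the count for "ab" becomes (1.2 + 0.3) / (2.0 + 2.0).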
4 Experimental Results 
4.1 Setup 
A corpus of daily newspaper articles is divided into training and testing sets for the experiments, 21M and 4M in size respectively. The first order (N=1) algorithms are applied to the training sets, and the parameters obtained after different numbers of iterations are used for testing.
The initial parameters of the HMM are set based on the frequency counts from the lexicon. The class-transition probability $a_{ij}$ is initialized as the a priori probability of the state, $P(Q_j)$, estimated from the relative frequency counts of the lexicon. $b_j(w_k)$ is initialized as the relative count of the word $w_k$ within the class corresponding to the current word class in $Q_j$. Words belonging to multiple classes have their counts distributed equally among them. Smoothing is then applied by adding 0.5 to each word count and normalizing.
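The initialization just described can be sketched as below; the word frequencies and class memberships are hypothetical, and the add-0.5 smoothing and normalization follow the description above.

```python
# Sketch of the B-matrix initialization: corpus counts are split equally
# among a word's possible classes, smoothed by adding 0.5, and normalized
# within each class. All numbers here are invented.

from collections import defaultdict

# word -> (corpus frequency, possible class tags); hypothetical data.
lexicon = {"ab": (40, ["N"]), "c": (10, ["V", "N"]), "abc": (6, ["N"])}

def init_b():
    counts = defaultdict(dict)
    for w, (freq, tags) in lexicon.items():
        for c in tags:
            # distribute the count equally, then apply add-0.5 smoothing
            counts[c][w] = freq / len(tags) + 0.5
    b = {}
    for c, wc in counts.items():
        total = sum(wc.values())
        b[c] = {w: v / total for w, v in wc.items()}
    return b

print(init_b())
```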
After training, the Viterbi algorithm is used to retrieve the best segmentation and tagging $\mathcal{L}^*$ of each sentence of the test corpus, by tracing the best state sequence traversed.
4.2 Perplexity 
The test-set perplexity, calculated as

$PP = \exp\left(-\frac{1}{M} \sum_i \log P(s_1^{T_i} | \Theta^N)\right)$

where the summation is taken over all sentences $s_1^{T_i}$ in the testing corpus and $M$ represents the number of characters in it, is used to measure the performance of the model.
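A sketch of this per-character perplexity computation, with placeholder sentence likelihoods standing in for $P(s_1^{T_i}|\Theta^N)$:

```python
# Per-character test-set perplexity: exp of the negative average
# log-likelihood per character. The likelihood values are placeholders.

import math

def perplexity(sentence_data):
    """sentence_data: list of (num_characters, sentence_likelihood) pairs."""
    M = sum(n for n, _ in sentence_data)
    log_sum = sum(math.log(p) for _, p in sentence_data)
    return math.exp(-log_sum / M)

print(perplexity([(3, 1e-6), (5, 1e-9)]))
```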
The results for models trained on training corpus subsets of various sizes, and after various numbers of iterations, are shown in Table 1. It is evident that with a small training corpus, over-training occurs as the iterations proceed. With more training data, the performance improves and over-training is not evident.
4.3 Phonetic Input Decoding 
A further experiment is performed in which the models are used to decode phonetic input (Gu et al., 1991).
Training Size      2         4         6         8
98K            194.009   214.096   246.613   286.721
1.3M           126.084   122.304   121.606   121.776
6.3M           118.531   113.600   111.745   110.783
21M            116.376   111.275   109.282   108.142

Table 1: Test set perplexities of the testing set after different numbers of training iterations on subsets of the training set
This is not trivial, since each Chinese syllable can correspond to up to 80 different characters. Sentences from the testing corpus are first expanded into a lattice, formed by generating all the common homophones of each Chinese character. Tested on 360K characters, a character recognition rate of 91.24% is obtained for the model trained after 8 iterations with 21M of training text. The results are satisfactory given that the test corpus contains many personal names and out-of-vocabulary words, and given the highly ambiguous nature of the problem.
5 Discussion and Conclusion 
In this paper the N-th order Ergodic Multigram HMM is introduced, whose application enables integrated, iterative language model training on an untagged and unsegmented corpus in languages such as Chinese.
The performance of higher order models is expected to be better when the size of the training corpus is relatively large. However, some form of smoothing may have to be applied when the training corpus size is small.
With some modification this algorithm would work on phoneme candidate input instead of character candidate input. This is useful in decoding phonetic strings without character boundaries, such as in continuous Chinese/Japanese/Korean phonetic input, or for speech recognizers which output phonemes.
This model also makes a wealth of techniques developed for HMMs in the speech recognition field available for language modeling in these languages.
References 
Brown, P.F., deSouza, P.V., Mercer, R.L., Della Pietra, V.J., Lai, J.C. 1992. Class-Based n-gram Models of Natural Language. In Computational Linguistics, 18:467-479.
Chang, C.H., Chan, C.D. 1993. A Study on Integrating Chinese Word Segmentation and Part-of-Speech Tagging. In Comm. of COLIPS, Vol 3, No. 1, pp. 69-77.
Chinese Knowledge Information Group 1993. Technical Report No. 93-05. Institute of Information Science, Academia Sinica, Taiwan.
Cutting, D., Kupiec, J., Pedersen, J., Sibun, P. 1992. A Practical Part-of-Speech Tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140.
Deligne, S., Bimbot, F. 1995. Language Modeling by Variable Length Sequences: Theoretical Formulation and Evaluation of Multigrams. In ICASSP 95, pp. 169-172.
Gu, H.Y., Tseng, C.Y., Lee, L.S. 1991. Markov Modeling of Mandarin Chinese for Decoding the Phonetic Sequence into Chinese Characters. In Computer Speech and Language, Vol 5, pp. 363-377.
Kuhn, T., Niemann, H., Schukat-Talamazzini, E.G. 1994. Ergodic Hidden Markov Models and Polygrams for Language Modeling. In ICASSP 94, pp. 357-360.
Law, H.H.C., Chan, C. 1996. Ergodic Multigram HMM Integrating Word Segmentation and Class Tagging for Chinese Language Modeling. To appear in ICASSP 96.
Nagata, M. 1994. A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm. In COLING 94, pp. 201-207.
