Decision-Tree based Error Correction for Statistical Phrase Break 
Prediction in Korean * 
Byeongchang Kim and Geunbae Lee 
Del)artment of Computer Science & Engineering 
Pohang University of Science & Technology 
Pohang, 790-784, South Korea 
{bckim, gblee}((~postech.ac.kr 
Abstract 
tn this paper, we present a new 1)hrase break 
prediction architecture that integrates proba- 
bilistic apt)roach with decision-tree based error 
correction. The probabilistic method alone usu- 
ally sufl'crs fronl performance degradation due 
to inherent data sparseness l)rolflems and it only 
covers a limited range of contextual informa- 
tion. Moreover, the module can not utilize the 
selective morpheme tag and relative distance 
to the other phrase breaks. The decision-tree 
based error correction was tightly integrated to 
overt:ohm these limitations. 
The initially phrase break tagged morphcnm se- 
quence is corrected with the error correcting de- 
cision tree which was induced by C4.5 fl'om the 
correctly tagged corpus with the outtmt of the 
15mbabilistic predictor. The decision tree-based 
post error correction l)rovided improved results 
even with the phrase break predictor that has 
l)oor initial performance. Moreover, tim system 
can be flexibly tamed to new corI)uS without 
massive retraining. 
1 Introduction 
During 1;15(; past thw years, there has l)een a 
great deal of interest in high quality text-to- 
speech (TTS) systelns (van Santen et al., 1997). 
One of the essential prolflenlS ill developing high 
quality TTS systems is to predict phrase breaks 
flora texts. Phrase breaks are especially es- 
sential fbr subsequent processing in the TTS 
systems such as grapheme-to-iflloneme conver- 
sion and prosodic feature generation. More- 
over, gral)helnes in the phrase,-break bound- 
aries are not phonologically changed and should 
be i)ronommed as their original corresponding 
p honenles. 
There have been two apln'oaches to predict 
phrase breaks (Taylor and Black, 1998). The 
* This paper was supported by the University Research 
Program of the Ministry of Intbrmation & Communica- 
tion in South Korea through the IITA(1998.7-2000.6). 
first: uses some sort of syntactic information to 
In:edict prosodic boundaries based on the fact 
that syntactic structure and prosodic structure 
are co-related. This method needs a reliable 
parser and syntax-to-prosody 1nodule. These 
modules are usnally implemented in rule-driven 
methods, consequently, they are difficult to 
write, modi(y, maintain and adapt to new do- 
mains and s. Ill addition, a greater use 
of syntactic information will require, more con> 
lmtation for finding n more detailed syntactic 
parse. Considering these shortcomings, the sec- 
ond approach uses some probabilistic methods 
on the crude POS sequence of the text:, and this 
lnethod will be fln:ther developed in this paper. 
However, t:he. probabilistic method alone usu- 
ally sufl'ers front pertbrmance degradation due 
to inherent data sparseness problems. 
So we adopted decision tree-based error COl'- 
re, ction to overconm these training data limi- 
tations. Decision tree induction iv 1;t5(; most 
widely used \]calming reel;hod. Espcci~flly in lla~l;- 
m:al  and speech processing, decision 
tree learning has been apt)lied to many prob- 
h,.nls including stress acquisition fl'om texts, 
gralflmme to phonenm conversion and prosodic 
phrase, modeling (Daelemans et al., 1994) (van 
Santen et al., 1997) (Lee and Oh, 1999). 
In the next section, linguistic fb, atures of Ko- 
rean relevant to phrase break prediction are de- 
scribed. Section 3 presents the probabilistic 
phrase break prediction method and the tree- 
based error correction method. Section 4 shows 
experimental results to demonstrate the t>erfor - 
mam:e of the method and section 5 draws st)me 
conclusions. 
2 Features of Korean 
This section brMly explains the linguistic char- 
acterists of spoken Korean before describing the 
phrase break prediction. 
1) A Korean word consists of more than 
one morpheme with clear-cut morphenm bound- 
aries (Korean is all agglutinative ). 
1051 
2) Korean is a postpositional  with 
many kinds of noun-endings, verb-endings, and 
prefinal verb-endings. These functional mor- 
phemes determine a noun's case roles, a verb's 
tenses, modals, and modification relations be- 
twcen words. 3) Korean is basically an SOV 
 but has relatively free word order 
compared to other rigid word-order s 
such as English ,except tbr the constraints that 
the verb must appear in a sentence-final posi- 
tion. However, in Korean, some word-order con- 
straints actually do exist such that the auxiliary 
verbs representing modalities must follow the 
main verb, and modifiers must be placed betbre 
the word (called head) they modify. 4) Phono- 
logical changes can occur in a morpheme, be- 
tween morphemes in a word, and even between 
words in a phrase, but not between phrases. 
3 Hybrid Phrase Break Detection 
Part-of speech (POS) tagging is a basic step 
to phrase break prediction. POS tagging sys- 
tems have to handle out-of vocabulary (OOV) 
words for an unlimited vocabulary TTS sys- 
tem. Figure 1 shows the architecture of our 
phrase break predictor integrated with the POS 
tagging system. The POS tagging system era- 
ploys generalized OOV word handling mecha- 
nisms in the morphological analysis and cas- 
cades statistical and rule-based approaches in 
the two-phase training architecture tbr POS dis- 
ambiguation. 
Morphological !;i I analyzc r~ 
{M~pl me ..... ~se~i~ ': 
~,: . / Protmbilistlc I rtgram ~z'~*t $ 
I 
Figure 1: Architecture of the hybrid phrase 
break prediction. 
Tire probabilistic phrase break predictor seg- 
ments the POS sequences into several phrases 
according to word trigram probabilities. Tire 
irdtial phrase break tagged morpheme sequence 
is corrected with the error correcting tree 
learned by the C4.5 (Quinlan, 1983). 
Tire next two subsections will give detailed 
descriptions of the probabilistic phrase predic- 
tion and error correcting tree learning. The hy- 
brid POS tagging system will not l)e explained 
in this paper, and the interested readers can see 
(Cha et al., 1998) tbr further reference. 
3.1 Probabilistie Phrase Break 
Detection 
3.1.1 Probabillstie Models 
For phrase break prediction, we develop tire 
word POS tag trigrmn model. Some experi- 
ments are performed on all the possible trigram 
sequences and 'word-tag word-tag break word- 
tag' sequence turns out to be the most fl'uitful 
of any others, which are the same results as the 
previous studies in English (Sanders, 1995). 
The probability of a phrase break bi appear- 
ing after the second word POS tag is given by 
P(bilt¢,2ta) = C(tlt2bit3) Ej=o,~,2 
C(ht2bjt3) ' 
where C is a frequency count flmction and b0, 
bl and b2 mean no break, minor break and ma- 
jor break, respectively. Even with a large num- 
ber of training patterns it is very clear that 
there will be a number of word POS tag se- 
quences that never occur or occur only once in 
the training corpus. One solution to this data 
sparseness problem is to smooth the probabili- 
ties by using the bigram and unigram probabil- 
ities, which adjusts the fl'equency counts of rare 
or non-occurring POS tag sequences. We use 
the smoothed probabilities: 
P(biltlt,2ta) = )q C(trt2bit3) 
~j=o,l,2 C(t~ t2bjt3) 
C(t2bita) 
q- A2 ~j=0,1,2 C(t2bjt3) 
+ C(t2bi) )t3  j=_0,1,2 C( 2bj) ' 
where )~1, A2 and ),a are three nommgative con- 
stants such that h I q- ~2 Jr- ~3 = 1. In some 
experiments, we can get the weights ~1, ~2 and 
A3 as 0.2, 0.7 and 0.1, respectively. 
3.1.2 Adjusting the POS Tag 
Sequences of Words 
Previous researchers of phrase break predic- 
tion used mainly content-flnmtion word rule, 
wherel)y a phrase break is placed before every 
flmction word that follows a content word (Allen 
and Hmmicut, 1987) (Taylor et al., 1991). The 
1052 
researchers used tag set size of only 3, including 
function, content ~md t)lmctuation in the rule. 
However, Korean is a post-positional aggln- 
tinative . If the eontent-t'unction word 
rule is to be adapted in Korean, the rule nmst 
be changed so that a phrase break is placed 
before every content mort/henm that R)llows a 
fimction morl)heme. Unfortunately this rule 
is very inet\[icient in Korean since it tends to 
create too many pauses. In our works, only 
the POS tags of Nnction mort)heroes are used 
be, cause the function morphelnes constrain the 
classes of precedent n:orpheanes and t)b\y impor- 
tant roles in syntactic relation. So, each word 
is represented by the I?OS tag of its fimction 
morpheme. In the case of the word which has 
no function mort)heine , simplified POS tags of 
content mort)henms are used. The nmnber of 
POS tags use, d in this rese, m'ch is a2. 
3.2 Decision-Tree Based Error 
Correction 
The t)robabilistic phrase break prediction only 
covers a limited range of contextual infornm- 
tion, i.e. two preceding words and one. follow- 
ing word. Moreove, r, the module can not utilize 
the morl)heme tag se.lectively and relative dis- 
tance to the other phrase breaks. For this rea- 
son we designed error correcting tree to con> 
pensate for |;tie limitations of the. probal)ilistic 
phrase break prediction. However, designing er- 
ror corre, cting rules with knowledge engineering 
is te, dious and error-prone,, lTnstead, we, adopte, d 
decision tree learning ai)proa(:h to auton~atically 
learn the error correcting rules froln a correctly 
t:,hrase break tagged eorlms. 
Most algorithms th:~t have t)een develope, d 
for lmilding decision trees employ a top-down, 
greedy search through the space of possible deci- 
sion trees (Mitchell, 1997). The 04.5 (Quinlan, 1983) 
is adequate to buiht a decision tree easily 
for successively dividing the regions of feature 
vector to minimize the prediction error. It also 
uses intbrmation gain which lneasures how well 
a given attril)ute separates the training vectors 
according to their target classification in order 
to select the most critical attrilmtes at each step 
while growing |;tit tree (hence the nmne is IG- 
~l'ree). Now, we utilize it for correcting the ini- 
tially phrase break tagged POS tag sequences 
generated by probabilistic predictor. 
However, wc invented novel way of using the 
decision tree as trmlsibrmation-t)ased rule in- 
duction (Brill, 1992). l?igure, 2 shows the tree 
learning architecture tbr phrase break error cor- 
rection. The initial phrase break tagged POS 
tag sequences support the ti;ature vectors tbr 
attributes which are used tbr decision mak- 
ing. Because the ii;atm'e vectors include phrase 
break sequences as well as POS tag sequences, 
a learned decision tree can check |;lie morphenm 
tag selectively and utilize |;lie relative distanee 
to the other phrase breaks. The correctly phrase 
break tagged POS tag sequences support the 
classes into which the feature vectors are classi- 
fied. C4.5 lmilds a decision tree fl'om the t)airs 
which consist of the feature vectors and their 
classes. 
.... I Prol,al,ilislic 11 
l'I,rase l,reak ta~ged 
POS tag seque nee 
..... Correctly plmlse break tagged 
~-~\]:: ........ ...... POS tag sequence 
ErrOr Correcting decision tree 
Figure 2: Architecture of the error correcting 
decision tree learner. 
4 Experimental Results 
4.1 Corpus 
The, experiments are t)ertbrmed on a Korean 
news story database,, called MBCNF, WS\])I~, of 
spoken Korean directly recorded from broad- 
casting news. The size of th(; database is now 
6,111 sentences (75,647 words) and it is eontin- 
nously growing. '12) lm used in the phrase break 
prediction experiments, |;tie database has been 
POS tagged and break-b~beled with major and 
minor phrase breaks. 
4.2 Phrase Break Detection and Error 
Correction 
We, I)eribrmed three experiments to show syner- 
gistic results of probabilistic method and tree- 
based error correction method. First, only prob- 
abilistic method was used to predict phrase 
breaks. %'igrams, bigrams and unigrams for 
phrase break prediction were trained fl:om the 
break-labeled an(1 POS tagged 5,492 sentences 
of the MBCNEWSDB by adjusting the POS 
sequences of words as described in sut)section 
3.1.2. The other 619 sentences are used to 
test the t)ertbrnum(:e of the probabilistic I)hrase 
break predictor. In the second experiment, we 
made a decision tree, which can be used only 
to predict phrase breaks and cannot be used to 
1053 
correct phrase breaks, from the 5,429 sentences. 
Also the 619 sentences were used to test the 
performance of the decision tree-based phrase 
break predictor. The size of feature vector (the 
size of the window) is w~ried fi'om 7 (the POS 
tag of current word, preceding 3 words and fol- 
lowing 3 words) to 15 (the POS tag of current 
word, preceding 7 words and following 7 words). 
The third experiment utilized a decision tree as 
post error corrector as presented in this paper. 
We trained trigrams, bigrams and unigrams us- 
ing 60% of totM sentences, and learned the deci- 
sion tree using 3(1% of total sentences. For the 
other experiment, 50% aim 40% of total sen- 
fences are used tbr probability training and tbr 
decision tree learning, respectively. Tim other 
10% of total sentences were used to test as in 
the prevkms ext)eriments(Figure 3). For the de- 
cision tree in the tlfird experiment, though the 
size of the window is also varied from 7 words 
to 15 words, the size of feature vector is varied 
from 14 to 30 because phrase breaks tagged by 
probabilistic predictor are include in the feature 
vector. 
El I:oI prababililies trai,ling D For decision Iree induction \[3 I:(!r lest 
(}9; ..... 
PI ol~:lbilislic Iil~,lhod 
only 
\[ 
\]GITtee mlly Prolmbilisfic ii~01\]lod Pmballilistic iii0thod 
lind post error ~tlltl IX)st ¢IlOl 
cDrl ocliOll(6:3 ) t'olteclJoll(4:5 ) 
Fignre 3: The number of sentences for the prob- 
ability training, the decision tree learning and 
the test in the experiments. 
Tit(; performance is assessed with reference to 
N, the total number of junctures (spaces in text 
including any type of phrase breaks), and B, the 
total number of phrase breaks (only minor(b1) 
and major(b,)) breaks) in the test set. The er- 
rors can be divided into insertions, deletions and 
substitutions. An insertion (I) is a break in- 
serted in the test sentence, where there is not a 
break in the reference sentence. A deletion (D) 
occurs when a break is marked in the rethrence 
sentence but not in the test sentence. A substi- 
tution (S) is an error between major break and 
minor break or vice versa. Since there is no sin- 
gle way to measure the performance of phrase 
break prediction, we use the following peribr- 
mance measures (Taylor and Black, 1998). 
Break_Correct - B-D-S B x 100%, 
N-D-S-I Juncture_Correct = x 100% 
N 
We use another pertbrmance nmasure, cMled 
adjusted score, which refer to the prediction ac- 
curacy in proportion to the total nmnber of 
phrase breaks as following performance measure 
proposed by Sanders (Sanders, 1995). 
Adjusted_Score - ,IC - NB 1-NI3 ' 
where NB 1 means the proportion of no 
breaks to the number of interword spaces and 
,lC means the Juncture_Correct/lO0. 
Table 1. shows the experimental results of our 
phrase break prediction and error con:ection 
method on the 619 open test sentences (10% 
of the total corpus). In the table, W means the 
thature vector size tbr the decision tree, and 6:3 
and 4:5 mean ratio of the number of sentences 
used in the probabilistic train and the decision 
tree induction. 
The performance of probabilistic method is 
better than that of IG-tree method with any 
window size in U'reak_Cor'rect. However, as the 
ti;ature vector size is growing in IO-tree method, 
.lv, nctv, re_Co'rrect and Adj,(sled_Score become 
better than those of the l)robM)ilitic method. 
From the fact that the attribute located in the 
first level of the decision trees is the POS tag of 
preceding word, we can see that the POS tag of 
preceding word iv the most useful attribute for 
predicting phrase breaks. 
The pertbrmance before the error correction 
in hyl)rid experiments iv worse ttlan that of the 
original 1)robabilistic method because the size of 
training corlms for probabilistic method is only 
66.6% and 44.4:% of that of the original one, re- 
spectively. However, the performance sets im- 
proved by the post error correction tree, and be- 
corns finally higher than that of both the prob- 
abilistic nmthod and the IG-tree method. The 
attribute located in the first level of the decision 
tree is the phrase break that was predicted in 
the probatfilistic method phase. Although the 
initial pertbrmmme (beibre error correction) of 
the exImriment using 4:5 corpus ratio is worse 
than that of the experiment using 6:3 corlms 
ratio, the final perfbrmance gets impressively 
improved as the decision tree induction corpus 
1NB _ N-I~ N 
1054 
'li~l)l( ~, 1: 1)hrase break t)\]e{ti(:tion ~md error eorr{',ction results. 
\]3r(;~k_Correet 
1}rol)al)ilis|;ic method only 
IG-Tree only 
Prol)al)ilisl;ie 
mel;ho(t 6:3 
~I~15(I 
l)ost error 
eorr(~,e|;ioll 
4:5 
W=7 
W- 11 
W= 15 
l)efore error eorre{;t;iol~ 
W=7 
W=ll 
W= 15 
1)~;~, ore, error con'e(;tion 
W=7 
W--ll 
W--15 
52.17% 
5O.58% 
51.66% 
51.77% 
52.03% 
57.34{~) 
59.8O% 
60.75% 
51.30% 
59.O4% 
61.83% 
62.7/1% 
J un{:l;{~re_CorreeI; 
81.39% 
81.39% 
81.65% 
81.71.% 
81.29~/~0 
83.67% 
84.69% 
85.O6% 
80.85'~ 
84.42% 
85.16% 
85.57% 
Adj uste(l_-~ 
0.48O 
0.480 
0,487 
0.488 
0.477 
0.543 
0.572 
0.582 
0.465 
0.564: 
0.585 
0.597 
incre,~ses from 30% 1;o 50% of the to|;al (:()rims. 
~l)his result; shows t;h~t |;he prol)osed ~rehilx~e- 
ture c~m 1)rovi(te, improved results evell with the 
phrase 1)re~k \])re(tie|:or |:h~l; h~s I)oor |nit|a,1 per- 
f()z'51 I~L51(;(~,. 
5 Conclusion 
This t)~l)er l)r{;s(:nts ;~ new 1)hr;tse t)rea.k predic- 
|;ion ~rt:hil;eel;ure l;h~l; inl;egr~tes the t}rob~bilis- 
tie ;~t)t)ro~eh wii;h the (le(:ision-tree 1)~s{'(t ~tl)- 
t)ro~teh in ;~ synergistic w~y. Our m~in contl'it)u- 
|,ions include presenl;ing (leeision |;ree-1);~se(1 er- 
ror correction for 1)hr~se t)re~k prediel;ion. Also, 
i)rol)~fl)ilistic t)hrase break prediction w~ts im- 
t)leme, nt;e(l a.s ;I,51 inil:i~tl am~ot~l;or of the (tet:i- 
si()n tree-t)~tse(t e, rror (:orre('tiolL The m:ehite,('- 
ture ('~m t)rovi(te imt)l"ove(t results even with l;he 
1)\]u:;~se t)re;~k l)redi{:tor |;h;t|; ha,s l)o{)r inil;i~d t}er- 
t'orm~mee. Moreover, I;he, syslxun (:~m 1)e ttexit)ly 
t;u, ned t;o new eort)us wil;houl; nmssive rel;r;~ining 
which is necess~ry in the t)rol}~bilisti(: metho(t. 
As shown in the result, t)erli)rmamce, of the hy- 
t)rid t)hr~se 1)re~k t)redietion is dctermilmd t)y 
how well the error eorre('|;or (;tin (;Ollll)ellS;~l;e 
i;he defi(:iencies of the 1)rol)~d)ilistie t)hrase 1)re~k 
predict;ion. 
The next sl;el) will 1)e to :m~lyze the le~rned 
de(:ision trees eareflflly I;() exi;r;~(:l; more desir~t)le 
ti',~tl;ur(', veel;ors. W(', ~re now working 055 in(:or- 
l)or~|;ing this i)hl'~se, break 1)redi('tion 5he|hod 
into the ext)erimenl;~l Korean 'I?'\]?S sysl;em. 

References 

J. Alle, n ~md S. Humli{'ut. 1987. l'"ro~tt !l'ext to 
St)eech: the: MITalk Systcnt. Cmnl)ridge Uni- 
versity Press. 

IB. \]\]rill. 1992. A simple rule-based p~rl;-ot/- 
speech t~gg{'r. In \]}roccediT~,.qs of the co~:fer- 
c'~tcc o~, applied ~,at'u,r'rd la~zg'aage processi~zg. 

.\]eos\]gwon Ch~, Oeunbae \]~(;e, ~md Jong-Hyeok 
Lee. :1998. Ge, nc, r~dized mlknown morphelne 
guessing for hybrid P()S l;~tgging of Kor(';m. 
In P'lw('ecdi'l~,gs of th.(: Sizth, Wo'r~:.sh, op o'l~, l&-"cq 
ha'rqc Co'rpora, t}~g('s 85 93. 

Dvr~llx'r l)a(;lem~ms, St;ev{'n Gills, ~md Gerl; 
\])urieux. 1994. 'l'he ~cquisil;ion of sl;ress: A 
d~l;a-orienl;ed ~l)l)ro~{:h. Co~tI)~tl, al, io'l~,al Liw,- 
g'ltil, ics, 20(3):421 451. 

Sangho Lee ;rod Yung-Itw~m O15. 1999. ~.lS:ee- 
t)~rse(t mo(l(;ling of 1)roso(tie 1)hr,(sing ~1~(t 
segmei~l;~tl (tm:a.t;ion for kore~m |.t.q ,qy,qJ;elll,q. 
,~peech, Co'll~,~n,'w~,icatio~,, 28(4):283 300. 

q'om M. Mil;ehell. 1!)97. Mach, i~,c Lea'~"H,i~,g. 
MeGr~w-Ilill. 

J. l/,. {~uinDm. 1.983. C~.5: l'ro!tr(~ms fo'~" M(z- 
ch, i'~,e Lea'r'~,i'~ 9. Morgan K~mflnlmn. 

\]Brie S~mtlers. 1.9!t5. Using prot}al)ilistic mel:h- 
()<Is |;o t)re(lict phr~se t)oundaries for ~t text- 
to-sl)eeeh system. Master's thesis, University 
of Nijmegen. 

P~ml T~,ylor ;md Alum W. BD~t:k. 1!)98. As- 
signing phrase l)rc,~ks kom 1)ar|;-of-st)e, eeh sc,- 
qllelIces. Cotlt\])'lttcr Speech (t~td La~tg'~t(,,gc, 12(2):9!} 
1 ~7. 

lhml A. 'l?aylor, I. A. N~irn, A. M. fiutherl;md, 
~md M. A..l~ck. 1991. A re~l time speech 
synthesis system. 1151 l~rot:ce, dirt.q.s of th, c \]~'~t- 
rospecch, '9.l. 

,l~m P.II. wm Santen, l/.iehard W. Sl)roat, 
3osel)h P. Olive, ~md .luli~ Hirsehl)erg. 
1997. l}rog'r(;ss in Speech Sy~,th, esis. Springer- 
VerLzg. 
