A Rule Induction Approach to Modeling Regional Pronunciation 
Variation. 
Vdronique Hoste and Steven Gillis and Walter Daelemans* 
CNTS - Language Te(:hnology Group 
University of Antwerp 
Universiteitsl)leinl, 2610 Wilrijk 
hoste(@uia.ua.ac.1)e, gillis((~uia.ua.a(:.l)e, daelem(O)uimua.ac.be 
Abstract 
This 1)~q)er descril)es the use of rule indue- 
tion techniques fi)r the mli;omatic exl;ra(:l;ion of 
l)honemic knowledge mM rules fl'om pairs of 
l:,romm(:intion lexi(:a. This (:xtra(:ted knowl- 
edge allows the ndat)tntion of sl)ee(:h pro(:ess- 
ing systelns tO regional vm'iants of a . 
As a case sl;u(ty, we apply the approach to 
Northern Dutch and Flemish (the wtriant of 
Dutch spoken in Flan(lers, a t)art; of Bel- 
gium), based Oll C(?lex and l'bnilex, promm- 
clarion lexi(:a tbr Norttmrn l)utch mM Fhm,- 
ish, r(~sl)e(:tively. In our study, we (:omt)ar(~ 
l;wo rule ilMu(:tion techniques, franslbrmation- 
B;tsed Error-l)riven Learning ('I'I/E\])I,) (Brill, 
1995) mM C5.0 (Quinl~m, 1993), and (,valu- 
ate the extr~tct(xl knowh;dge quanl:it~l;ively (a(:- 
(:ura.cy) mM qualitatively (linguistic r(;levanc:e 
of the rules). We (:onchMe that. whereas 
classificntion-1)ased rule. induct;ion with C5.0 is 
11101.'0 a(;(:(lr&l;e~ th(? |;rallSt~)rnl;~l;ion l"ules le;~rne(t 
with TBE1)I, can 1)e more easily ini;ert)reted. 
1. Introduction 
A central (:onq)onenl; of speech l)ro(;essing sys- 
tems is a t)rommciation lexicon detining the re- 
lntionshi t) between the sl)elling mM t)rommcin- 
|;ioi1 of words. Regionnl wMants of ~ langut~ge 
may differ considerably in their l)ronunci:ttion. 
Once a spe~ker from a particular region is de- 
tected, speech inlmt and output systems should 
be al)lc to ~Mat)l; their t)rommei;Ltion lexi(:on l;o 
this regionM vm'bml;. Regional l)rommciation 
(litiin'ences are mostly systeln~ti(: mM can t)e 
modeled using rules designed by experts. How- 
ever, in this 1)at)er, we investigate the :mtoma- 
* This resear(:h was l)artially funded 1)y the. F\V() 
1)reject Linguaduct and the i\VT project CGN (Cortms 
Gesprokcn Nedcrhmds). 
tion of this process by using data-driven ted> 
niques, more. specitically, rule induction tech- 
niques. 
l)ata-(lriven reel;hods have proven their ef- 
fi(',;tcy in severM  engineering tasks: 
such as gr~l)hemc-to-tfl~oncmc conversion, tmrt;- 
of-sl)eech tagging, el;(:. Extraction of linguistic 
knowledge, fl'(nn a snmple corlms instead of num- 
uM encoding of linguistic intbrmation proved to 
be ml extremely powcrflfl method tbr overcom- 
ing the, linguistic knowledge acquisition bottle- 
ne(:k. \])itt'erent at)preaches are awfilM)le, such 
as decision-tree le~rrning (l)ietterich, 1997), lleu- 
ral lml;work or (:onne(:tionist al)proaches (Se- 
jnowski ~tnd l/.os(ml)erg, 1987), lnemory-base(1 
lena'ning (Daelemans mM van den Bosch, 1996) 
el;(:, l)at~-driv(m al)i)roaehcs (:~m yield (:Oral);> 
ral)le (;111(t often eVell better) results ttum the 
rule-lmsed at)t)ro;mh, as described in the work of 
l)aelemans nnd wm den \]~os(:h (199(i) in which 
a (:omt)~rison is mnde 1)ctwe(m Morpa-cmn- 
Morphon (Heemskerk and wm He, uv(m, 1993), 
an ex:mlt)le of n linguistic knowledge 1)a.sed at)- 
1)roacll |;o gr~t)heme-to-1)honem(~ (:OllVersion and 
\[G-'.lh'ee, an examph; of n m(mloryd)ased at)- 
1)roach (Daelen~ms et M., 1996). 
Ill this study, we will look tbr the patterns 
and generalizations in the i)honemic ditrer(m(:es 
1)et;ween Dutch and Fhmfish 1)y using two (tat;n- 
driven t(~chniques. It; is our aim to extract the 
regularities that are implicitly contained in the 
data. Two corpora were used tbr this study, 
r(~l)resenting the Norl;hern Dul, eh and Sout;hern 
Dutch w~rbmts. D)r Northenl Dut(:h Celex (re- 
leas(; 2) was used and for Flemish Fonilex (ver- 
sioll 1.01)). The Celex datM)ase contains fie- 
quen(:y infi)rlnation (based on the INL corl)uS of 
the hlsl;itute fi)r 1)ul;(:h Lexieology), and i)hono- 
logi(:al~ morl)hologicM , and synt;a(:tic lexicM in- 
tbrmation tbr more l;tmn 384.000 word forms, 
327 
and uses DISC as encoding tbr word pronuncia- 
tion. The Fonilex database is a list of more than 
200.000 word tbrms together with their Flemish 
pronunciation. For each word tbrm, an abstract 
lexical representation is given, together with 
the concrete pronunciation of that word tbrm 
in three speech styles: highly formal st)eech, 
sloppy speech and "normal" speech (which is 
an intermediate level). A set of phonological 
rewrite rules was used to deduce these concrete 
speech styles ti'om the abstract t)honological 
tbrm. The initial phonological transcription was 
obtained by a grapheme-to-phoneme converter 
and was afterwards corrected by hand. Fonilex 
uses YAPA as encoding scheme. By means of 
their identification number, the Fonilex entries 
also contain a rethrence to the Celex entries, 
since Celex served as basis tbr the list of word 
tbrms in Fonilex. E.g. tbr the word "aaitje" 
(Eng.: "stroke"), the relevant Celex entry is "25/aait.je/5/'aj-tj¢}/" 
and the corresponding 
Fonilex entry looks like "251aaitjel'ajts@l". The 
word tbrms in Celex with a fl'equency of 1 and 
higher (indicated in field 3) ~re included in 
Fonilex and fl:om the list with tiequency 0, only 
the monomorphematic words were selected. 
In the fi)llowing section, a brief explanation 
is given of the method we used to search for 
the overlap and ditfhrences between both re- 
gional w~riants of Dutch. Section 3 provides 
a quantitative analysis of the results. Section 
4 discusses the dittbrences between Celex and 
Fonilex, starting fl'om tile set of transtbrmation 
rules that is learned during Transtbrmation- 
Based Error-Driven Learning (TBEDL). These 
rules are COlnpared to the production rules pro- 
duced by C5.0. In addition, we present an 
overview of the non-systematic diflhrences. In 
a final section, some concluding remarks are 
given. 
"2 Rule Induction 
Our starting i)oild; is the assumption that the 
differences in the phonemic transcriptions be- 
tween Flemish and Dutch are highly systematic, 
and can be represented in a set of rules. Hence, 
these rules provide linguistic insight into the 
overlap and discrepancies between both w~ri- 
ants. Moreover, they can be used to adapt pro- 
mmciation databases tbr Dutch md;omatically 
to Flemish and vice versa. A possil)le w~y to 
find the regul~trities within the diflbrences be- 
tweet, both corpora is to make the rules by 
hand, which is time-consmning and error-prone. 
Another option is to make use of a data-oriented 
learning method in which linguistic knowledge 
is learned automatically. In our experiment we 
have made use of two rule induction techniques, 
viz. ~:a~nsformation-Based Error Driven Learn- 
ing (TBEDL) (Brill, 1995) and C5.0 (Quinlan, 
1993). 
In the process of 'Transfbnnation-Based 
Error-Driven Learning, transtbrmation rules are 
learned by comparing a corpus that is annotated 
by an initial state annotator to a correctly amlo- 
tared corpus, which is called the "truth". l)ur- 
ing that comparison, an ordered list of trans- 
tbrmation rules is learned. This ordering im- 
plies that the application of an earlier rule some- 
times makes it possible tbr a later rule to apply 
(so-called "feeding"). In other cases, as also 
descrit)ed in the work of Roche and Schabes 
(1995), a given structure fiTdls to undergo a rule 
as a consequence of s()me earlier rule ("bleed- 
ing"). These rules are applied to tile output of 
the initial state ammtator ill order to t)ring that 
outt)ut closer to the "truth". A rule consists of 
two parts: a transtbrmation and a "triggering 
environment". For each iteration in the learning 
process, it is investigated tbr each possible rule 
how many mistakes can be corrected through 
al)t)lication of that rule. The rule which causes 
the greatest error reduction is retained. 
Figure 1 shows the TBEDL learning process 
apt)lied to the comparison of the Celex repre- 
sentation and the Fonilex "n()rmal" representa- 
|;ion, which flmctions as "truth". In this case, 
the task is to learn how to transtbrm Celex rel)- 
resentations into Fonilex representations (i.e., 
translate Dutch pronunciation to Flemish pro- 
mmciation). Both corpora serve as input tbr 
the "transtbrmation rule learner" (Brill, 1995). 
This learning process results in an ordered list of 
transformation rules which reflects the system- 
atic differences between both representations. 
A rule is read as: "change x (Celex representa- 
tion) into y (Fonilex representation) in the fol- 
lowing triggering enviromnent". 
E.g. /i:/ /~/ NEXT 1 Oil. 2 ()I/l. 3 PHON/e:/ 
(change a tense/i/to a lax/i/when one of the 
three tollowing Celex phonemes is at tense/e/). 
328 
Cclex graphemic Fonilex graphcmic 1 
and phonemic and phonemic / 
/ representation representalion ) 
¢ 
l?igure 1: Architecture of the learning process 
m~king use, of TBEDL 
C5.0 (Quinlnn, 1993), on the other hand, 
wMch is a commercial version of the C4.5 
t)rogr~mh gener~l;(;s a classifier in the form of 
~r decision tree. This decision tree (:~Lll \])e used 
to cLassi(y ;~ case 1)y starting a.t the root of 
|;11(; I;ree mM then moving througLl the tree 
untiL a le~ff node (associated with ~ class) is 
eneomltered. Since decision t;recs c;m be hard 
to read, the decision tree is (:onver(;ed to a set 
of production rules, which ~re more intcJligibh'. 
to the user. All rules h~we the form "L -> H.", 
in w\]fich the left-hml(1 side is ~ conjmmtion of 
a.l,I;ribute-b~tsc(l tests and the rightqm.ud side 
is a. (:l~ss. Note that in th¢'. imt)hmleul,at;ion 
of C5.0, feeding mM t)leeding effects of rules 
do not occur, due to |.h('. (:onlti(:l. resolution 
sl, r~tegy used, whi(:h ensures that tbr each case 
only one rule (:ml apply (Quinla.n, 199:3). In 
this cxperillR',nt; we ha,ve 111~,(t(; use of a, (:onl;exl; 
of three phonemes preceding (in(Lic:tted by t-1~ 
52, and f3) and fi)llowing (f+l, 1"+2, f+3) th(' 
fi)(:us phoneme, which is in(tic;d;(;d t)y an 'f~. 
The t)r(;(li(:t;e(1 class for this ease is then t;t1(', 
right-hand side of the rule. At the top of the 
rule the nmnber of training cnses covered by the 
rule is given together with the number of cases 
that do not 1)elong to the class L)redicted t)y 
the rule. The "lift;" is l;he (;stim~l;ed ac(:urt~cy 
of the rule divided by the prior probnt)ility of 
l;he t)redicl;c(l class. 
E.g.: (4370/138, lift 82.8) 
f= i: 
f+2 in {e, ~, e:, ~:, y:, J', c:} 
-> class ~ \[0.968\] 
Before presenting the d~ta to TBEDL and 
C5.0, aLignment is required (Daek;mans and 
v}l, ll (tOll Bosch, 1.996) for l;he gr~phenfic 
and phonemic rel)resentations of CeLex and 
FoniLex, since the l)honemic representation mid 
the spelling of a word often differ in length. 
Therefore, the phonemic symbols are aLigned 
with the graphemes of the written word tbrm. 
In case the phonemic transcription is shorter 
than the speLLing, mill phonemes ('-') are used 
to filL the g~L)S. In the exmnpLe '%ahnoezenier" 
(Eng.: "chaL)Lain" ) this results in: 
~ a 1 m o (: z e n i e r - l m u: - z o n i: - r 
A further step in the t)reparation of the (latch 
consists of the use of an extensive set of so-called 
"compound i)honemes". Compound phonem(;s 
~I,I;C llSCd whelleVer gr~l)hemes ilia 4) with more 
than one phoneme, a.s in the word ':taxi", in 
which the <x> is 1)honemically reL)reseni;ed as 
il, pr(,1,Le,11 .',olv,.,l t,y 
defining a new t)honemic sylnbol l;h;tl; (:orre- 
Sl)on(ls to the two l)honemes. 
Our d~t|;ase|; consists of all Fonilex entries 
wil;h omission of the doul)le transcriptions (only 
tl,x; Lirst tr;mscriL)t;ion is taken), as in the word 
"(:~mw;m", which can be i)honemi(:ally rcl)re,- 
s(',nl;(;(t ;ts /k(ir(lv(m/ or ;is /k~;l't:v~:n/. Also 
wor(ls of which the l)hon(;mi( • l;rm~s(:ript;ion is 
longer l;h~m l;he orl;hograL)\]g and for whi('h no 
(:oml)ound phonemes ~tr(; l)rovi(ted: are omitted, 
e.g. "b'tje" (Eng.: "little b")(L)honenfically: 
/',e:O/). The ,)f 20 . 36 
word forms or 1.972.577 phonemes. I)ISC is 
used as phonemi(" encoding st;henle. All DISC 
phonemes are included and new phonem(;s are 
created for the t)honemic symbols which only 
occur in the Fonih'x (lttl;ab;~se. V~k; h~we divided 
the corlms into a training part, consisting of 
90% of the data and ~ 10% test part. 
InitialLy, an overla L) of 59.07% on the word 
level and !)2.77% on 1;he 1)honeme level was 
observed in the 10% test sol; l)etween Dutch 
and Flemish reL)resentations. CollSOll}lJlt;S ~gll(t 
dit)hthollgs are highly overlapping. 
~d \[ Phon. { Cons. I Vowel I~ 
\[59.07 \[ 92.77_1 95.95 I 85.58 I 99.76~ 
~lhble J: initial overLap between Celex en 
Foni\]ex 
329 
3 Quantitative analysis 
We first test whettmr rule induction techniques 
are able to learn to adapt Northern Dutch pro- 
nunciations to Flemish when trained on a nun:- 
ber of examples. With .Transformation-Based 
Error-Driven Learning and C5.0, we looked 
for the systematic differences between Northern 
Dutch and Flenfish. 
In TBEDL, the complete training set of 90% 
was used for learning the transfbrmation rules. 
A threshold of 15 errors was specified, which 
means that learning stops if the error reduc- 
tion lies under that threshold. Due to the large 
amount of training data, this threshold was cho- 
sen to reduce training time. This resulted in 
about 450 rules. In figure 2, the number of 
transformation rules is plotted against the ac- 
curacy of the conversion between both w, riants. 
100 ..... 
95 
9O 
85 
~ 80 
~ 75 
7O 
65 
6O 
55 
0 50 100 150 200 250 300 350 400 450 
number ot' transformation rules 
Figure 2: Descrii)tion of the accuracy of the 
word and phoneme level in relation to the 
nnmber of transtbrmation rules. 
Figure 2 shows that especially tile first 50 
rules lead to a considerable increase of perfor- 
mance fl'om 59.07% to 79A0% on the word level 
and from 92.77% to 96.98% for phonemes, which 
indicates the high applicability of these rules. 
Afterwards, the increase of accuracy is more 
graduah from 79.40% to 88.95% (words) and 
fl'om 96.98% to 98.52% (phonemes). 
For the C5.0 experiment~ 50% (887.647 cases) 
of the original training set served as training set 
(more training data was not feasible). A deci- 
sion tree model and a production rule model 
were lmilt from the training cases. The tree 
gave rise to 745 rules. These production rules 
were applied to the original 10% test set we nsed 
in the TBEDL experiment. In order to make 
the type of task comparable for the transfbr- 
marion based approach used by TBEDL, and 
the classification-based approach used in C5.0, 
the output class to be predicted by C5.0 was 
either ~0' when the Celex and Fonilex phoneme 
are identical (i.e. no change), or the Fonilex 
phoneme when Celex and Fonilex differ. 
Table 2 gives an overview of the overlap 
between Celex and Fonilex after application of 
t)oth rule induction techniques. A comparison 
of these results shows that, when evaluating 
both TBEDL and C5.0 on the test set, the rules 
learned by the Brill-tagger have a higher error 
rate, even when C5.0 is only trained on half the 
data used by TBEDL. On tile word level, the 
initial overlap of 59.07% is raised to 88.95% af 
ter application of the 450 transformation rules, 
and to 90.35% when using the C5.0 rules. On 
the phoneme level, the initial 92.77% overlap 
is increased to 98.52% (TBEDL) and 98.74% 
(C5.0). C5.0 also has a slightly lower error rate 
for the consonants, vowels and diphthongs. 
~_ I Word Phon. Cons. Vowel Diph. \] ~ 
8.95 98.52 99,35 96.88 99.32 
0.35 98.74 99,19 97.70 99.68 
Table 2: Overlap between Celex en Fonilex 
after application of' 450 transformation rules 
and all C5.() production rules. 
When looking at those cases where Celex and 
Fonilex difl'er, we see that it; ix possible to learn 
Brill rules which predict 73% of these differences 
at the word level and 79.5% of the differences 
at the phoneme level. Tile C5.0 rules are more 
or less 3% more accurate: 76.4% (words) and 
82.6% (phonemes). It is indeed possible to reli- 
ably 'translate' Dutch into Flenfish. 
4 Qualitative Analysis 
In this section, we are concerned with the lin- 
guistic quality of tile rules that were extracted 
using TBEDL and C5.0. To gain more insight 
ill the important differences between both pro- 
nunciation variants, a qualitative analysis of tile 
rules was performed. Therefore, the conver- 
sion rules were listed and compared. The fol- 
lowing list presents some examples for conso- 
nants, vowels and diphthongs. Starting point 
330 
is the first 10 rules that were learned during 
TBEDL, which are compared with the 10 C5.0 
rules, which most reduce the error rate. In the, 
transtbrmation rules 1)resented below, the rela- 
tionship between Dutch art(1 Flemish, especially 
the most important differences, are extracted 
fronl the eorl)ora and tbrmulated in a set of eas- 
ily understmldal)le rules. The C5.0 produ(:tion 
rules, Oll the other hand also descrit/e the over- 
lapl)ing phonelnes between Celex and Dmilex, 
which makes it hard to have at clear overview of 
the regularities in the dilt'erences 1)etween both 
variants of Dutch. The fact that the category 
'0' was used to describe the overlap between 
the databases (no chauge) does not; really hell). 
Even if C5.0 discovers that no change is the 
default rifle, additional specific rules (lescrit)ing 
the, default condition are neverthel(~ss ne('c, ssary 
to l)revent the other rules fl'om tiring incorrectly. 
4.1 Consonants 
Nearly 60% of the differences on the conso- 
nant level concerns l;he alternation 1)etween 
voiced and unvoiced consonants. In the word 
"gelijkaardig" (Eng.: "equal"), for example, 
we lind a /xolcika:rdox/ with a voi('eless velm: 
fri('ative in Dutch and /golcika:rdax/ with a 
voiced velar fricative in Flenlish. The word 
'hnachiavellisme" (Eug.: "Machiavellism") is 
pronommed as /ln(igi:ja:w:hsm,)/ in Dutch an(t 
as /m(d¢J.j(,wchzmo/ in Flelnish. 
F I" 1'1~7-1 v I '~ T~ Ix~ 
t 14774 127 
d 30 6516 
f 2/138 14 
v 2d 321!) 
s 1(149~ 327 
z 57 1992 
x 2743 1880 
g 92 2373 
Table 3: Conthsion matrix tbr the voiced and 
unvoiced consonants in the test cortms. 
T~tble 3 clearly shows the alternation t)e- 
tween /x/ and /,g/. This alternation also is 
the subject of the first transformation rule, 
namely "/x/ changes into/,g/ it, case of a word 
t)eginning (indicated by "STAAll.T") one or 
two I)ositions t)efbre". When looking at the 
to t) ten of the C5.0 l)roduction rules that most 
reduce error rate, the two most important rules 
also describe this alternation: 
ilnle 682: (7749/29, lift 110.9) 
t-1 in {=, {, c:} 
f in {x, .q, ;, Q}1 
-> class y \[0.996\] 
Rule 683: (7749/29, lift 110.91 
fl in {=, {, ~:} 
f in {x, ,q} 
-> class ,g \[0.996\] 
Another important phenomenon is the use 
of p~tlatalisation in Flemish, as in the word 
"aaitje" (Eng.: "stroke"), where Folfilex uses 
the t)alatalized fortll/aljtJ'o/instead of/a:jtjo/. 
The two sul)sequent transtbrmation rules 3 and 
d make this change possible. In the top 10 of 
C5.0 rules, only the tirst i)arl; of this change 
is descril)ed. Transtbrmation rule 8 (les('rit)es 
the omission of the i)honeme /t/ in ea,se, of the 
gral)hemic combination <ti>, as in "t)olitie '~ 
(Eng.: "police"). 
NIL 
1. 
3. 
4. 
8. 
Tal)le 
(tuent 
4.2 
96% 
C. F. ~I¥iggering environnm\]fl; 
x ,g PREV 1 ()l{ 2 PHON STAART 
j tJ' SURROUND PHON to 
t - NEXT PHON tJ" 
ts s RBIGRAM t i 
4: %'anstbrmation rules for the most fre- 
ditl'erences at the (;otlSOll,'l, tlt level. 
Vowels 
of the difl'erences at; the vowel level 
1)etwe('al l)ut;(:h mM Ph;mish concerns t;he use 
of a lax vowel instead of a tellse vowel fi)r the 
/i:/, le:l, la:l, I(,:/.'.. I.:1. This aH, ernatio. 
is illustrated t)y the following confllsion matrix, 
wlfich clearly shows that tense Celex-vowels 
not only (:orre, sl)ond with tense, but also with 
lax vowels in Fonilex. Other less frequent dif 
t'erences are glide insertion, e.g. in "geshakct" 
and the use of schwa instead of another vowel, 
as in %eleprocessing" in Flelnish. 
I I i: by:l'": I ~: I": I I I" I'- I" I:' I 
i: 23O') 2~a'; 
y: 38~ 51{ 
e: 438~1 ,(19:~ 
~'u 3507 1797 
o: 254( 160( 
Tal)le 5: Confusion matrix showing the use of 
Flemish btx and tense vowels given the Dutch 
tense vowels. 
'The,/;/and/Q/m'e compound phonemes wc intro- 
duced. They do not have an IPA equivalent. 
331 
In transformation rules 2, 5, 6, 7, 9, there is a 
transition from a tense vowel into a lax vowel in 
a certain triggering environment. An example 
is the word "multipliceer" (Eng.: "multiply") 
which is transcribed as/multi:pli:se:r/in Celex 
and as/multlphse:r/in Fonilex. 
Nr. 
2. i: 
5. i: 
6. i:j 
7. o: 
9. a: 
Table 6: 
for the 
Flemish 
C. F. Triggering environment 
I 
I 
Ct 
NEXT 1 OR 2 OR 3 PHON e: 
NEXT 1 OR 2 GRAPH c 
CUR GRAPH i 
NEXT 1 OR 2 OR 3 PHON e: 
NEXT 2 GRAPH a 
Most important transtbrmation rules 
differences between the Dutdl and 
vowels. 
A closer look at the ten most important C5.0 
production rules shows that seven out of ten 
rules descrit)e this transition from a Celex tense 
vowel to a Fonilex lax vowel. E.g. 
Rule 322: (4370/138, lift 82.8) 
f= i: 
f+2 in {e, ~, e:, a:, y:, J', e:} 
-> , \[0.968\] 
4.3 Diphthongs 
For tile dit)hthongs, few transformation rules 
are learned during training, since Celex and 
Fonilex are highly overlapping (see table 1). 
The rnles concern the phonemes that follow 
the diphthongs: /.j/ after /ei/ and /u/ afl;er 
/ou/. E.g. '%lanw" "blue"), the/l,/ 
is omitted in Flemish: /bkm/. In the top ten 
of C5.0 rules, no rules are given describing this 
phenomenon. 
Nr. C. F. %'iggering envirolmmnt 
10. u - PREV PHON (m I 
Table 7: %'ansfonnation rule concerning the 
lack or presence of a/u/ tbllowing an/au/. 
These rules, describing the differences be- 
tween Northern Dutch and Flemish consonants, 
vowels and diphthongs also make linguistic 
sense. Linguistic literature, such as the work 
of Booij (1995) and De Schutter (1978) in- 
dicates tendencies such as voicing and devoic- 
ing on the consonant level and the confllsion of 
tense and lax vowels as important differences 
between Northern Dutch and Flemish. The 
same discrepancies are f(mnd in the transcrip- 
tions made by both Flemish and Dutch sub- 
jects in the Dutch transcription experiments de- 
scribed in Gillis (19!)9). 
5 Error Analysis 
Besides tile systematic phonenlic differences be- 
tween Flemish and Dutch, there are a num- 
ber of mmystematic differences between both 
databases. After application of 450 transfor- 
mation rules, 88.95% of the words makes a cor- 
rect; transition from the Celex-transcription to 
the Fonilex-transcriptiolL The 7d5 C5.0 rules 
lead to a 90.35%. Using the Brill-tagger, it also 
has to be taken into account that rules can be 
undone by a later rule (see also Roche and Sch- 
abes (1995)), as in tile word "feuilleteer" (Bug.: 
"leaf througlf'). Celex provides the transcrit)- 
tion/t'(x'.y.iol;e:r/, while Fonilex transcribes it as 
/f~:jate:r/. During learning, the transtbrmation 
rule "change /my/ into /~:/ if the preceding 
graphenm is an <e>" is learned. This results 
in the correct Fonilex-/ff~:jote:r/. This trans- 
formation, however, is canceled by a later rule, 
which changes /0:/ back into /oey/ if tile *bl- 
lowing grapheme is an <i>. This leads again 
to the original Celex-transcription. C5.0, which 
does not suflhr from sinfilar consequences of rule 
ordering, will correctly classify "feuilleteer". 
hi this section, we are concerned with the 
relnaining errors after application of all rules. 
Making use of a rule induction technique to ex- 
tract the sub-regularities in the differences be- 
tween the corpora can lead to some rules, which, 
however, may be based on noise or errors in 
the databases. Theretbre, a manual analysis 
was done, which showed that the explanation 
of these remaining errors is twotbhl. 
A first reason is that no rule is awfilable tbr 
less frequent cases. TILe rules are induced on 
the basis of a sufficiently big frequency effect. 
This leads to no rule at all tbr less frequent 
1)honemes and phoneme coml)inations mid also 
for phonemes which are not always consistently 
transcribed. Examples are loan words, such as 
"points" and "pantys '~ or the loan sound /-/ 
which only appears in Fonilex. 
Another cause tbr errors is that rules will 
overgeneralise in certain cases. The confusion 
332 
matrix for vowels in table 5 clearly indicates 
the tendency to use more l~x vowels in Flem- 
ish. This leads to a mmfl)er of tr;mstbrInation 
rules ~md C5.0 rules describing this tendency. A 
(:loser investigation of the errors committed t)y 
the, Brill-tagger, however, shows thnt 41..7% of 
the errors concerns the use of a wrong vowel. In 
25% of the errors conmfitted on the t)honeme 
level, there was an incorrect transition fl'om a 
tense to a b~x vowel, as in '%ntagonislne" (Eng.: 
'%ntagonisln") where there was no transition 
from an /o:/ to ;u, /:)/. in 16.8% of the er- 
rors, a tense vowel is errolleOllsly used instead 
of a \]~x vowel, as in "atfi('he" (Eng.: "I)OSt(~r ") 
where an/,/ is used instead of a ((:orr(x:t) /i/. 
1) ifliculties ill the alternation t)etween voiced ;u 
mlvoiced consort;mrs account for 6.3% of the er- 
rors on the phoneme level. E.g. in "a(hninis- 
trestle" the/t/ was not (:onverted into/d/. 
In order to analyse why C5.0 1)ertbrms bet- 
ter on our task them TBEI)\],, a (:loser compari- 
son was made of the errors ex(:lusively made 1)y 
the Brill-tagger ;rod those ex(:lusively re;Me l)y 
C5.(). Ih)wever, no system~ttie dilt'eren(:es in er- 
rors were t'(mnd .which could exl)la.in the higher 
accuracies when using C5.0. 
6 Concluding remarks 
In this 1):~l)er, we hz~ve prol)osed the llSe o\]' 
rule induction |;eclmiques to h'.m'n to adapt i)ro- 
mmciation ret)resel~tations to regional vayiants, 
and to study the linguisti(: ast)e(:ts ()f su('h wu'i- 
ntion. A qmnltitative and qualitative rarely- 
sis was given of the t)honemie ditlbxences dis- 
covered 1)y these teehni(tues when trained on 
the Celex dnt~d)ase (Dutch) and the Fonilex 
(t~tat)ase (Flemish). \]n order to stu(ty the 
relationshi l) between both pronunciation sys- 
t(;lllS~ we \]l&VC lll}L(te llS(? Of tWO rifle in(hu:- 
tion techniques, nnmely 3h:mlsformation-Based 
Error-Driven Learning (Brill, 1995) and C5.0 
(Quinlan, 1993). Studying the accuracy of 1)oth 
systems, we noted that M'ter ~l)plication of the 
transtbrmation rules that were learned t)y the 
TBEDL method, 73% of the differences (m the 
word level and 80% of the ditl'crences on the 
t)honeme level was covered by the rules. The 
C5.0 I)ercentages are some 3('/o higher. This (:()r- 
resl)onds with an overall a(:(:ura(:y in 1)redicting 
the 1)ronun(:iation of n lq('.mish word l)romm- 
cb~tion ti'om the l)utch pronunciation of about 
89% for TBEDL and 90% for C5.0 (about 99% 
at i)honeme level for l)oth). 
A qualitative analysis of the first ten rules 
produced by both methods, suggested that l)oth 
TBEDL and 05.0 extract valuable rules describ- 
ing the most important linguistic differences be- 
tween Dutch mid Flelnish on the consonant and 
the vowel level. The C5.() production rules, 
however, are more numerous and more dit\[icult 
to interpret. The results of the transtbrnlation- 
based le~rnillg approach are clearly more un- 
derstandMfle than those of a classification-based 
lenrning approach for this problem. 

References 

G. \]h)oij. 1995. The, phonok)gy of \])utch. ()xford: 
Clarendon Press. 

E. BrilI. 1995. Transformation-based error-driven 
learning and natural  processing: A case 
study in part of speech tagging. Computational 
Linguistics, 21:543 565. 

W. Daelomans told A. van den Bosch. 1996. 
\]mnguage-indo, t)endent data-oriented grapheme- 
to-t)honeme conversion. In Progress in Speech 
Synthesis, pages 77 90. New York: Springer Ver- 
lag. 

W. Daelemans, A. vml den Bosch, and T. Weijters. 
1996. Igtree: Using trt;es for compression ml(l 
cla.~sitication in lazy learning algorithms. Arti- 
fit:ial intelligence, R, cvicw, special issue on Lazy 
Lcavnin 9. 

C,. \])c S(:hul;ter. 1978. Aspektcn van dc Ncdcvlwn, dsc 
t,:la,'n, kstrukt',,'u,'r, volunw, 1.5. Antwerp Papers In 
Linguistics. 

T.G. l)M;terich. 1997. Machine learning research: 
l))ur era:rent directions. A1 Magazine , 18(4):97 
136. 

S. Gillis. \]999. Phonemic transcriptions: qualitative 
and quantital;ivc aSl)e(:gs. Pat)er t)resent('xl at l;he 
International Workshol) about D('.sign an(1 Almo- 
tation of Speech Corpora, Tilburg. 

J. H('x:mskerk and V.J. wm Heuven. \]993. M()I~- 
PItA, a h',xicon-based MORphological PArser. 
Berlin, Mouton de Oruyter. 

3.I{. Quinlan. 1993. C/t.5: programs for mach.ine 
learning. San Mateo: Morgan kaufmann Publisl> 
ers. 

E. Roche and Y. Schabes. 1995. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 2112):227--253. 

T.J. Sejnowski mid C.S. Rosenberg. 1987. Paral- 
M networks that learn to pronounce english text. 
Complex Systems, 1:145d68. 
