<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1048">
  <Title>A Rule Induction Approach to Modeling Regional Pronunciation Variation.</Title>
  <Section position="3" start_page="327" end_page="329" type="metho">
    <SectionTitle>
2 Rule Induction
</SectionTitle>
    <Paragraph position="0"> Our starting point is the assumption that the differences in the phonemic transcriptions between Flemish and Dutch are highly systematic, and can be represented in a set of rules. Hence, these rules provide linguistic insight into the overlap and discrepancies between both variants. Moreover, they can be used to adapt pronunciation databases for Dutch automatically to Flemish and vice versa. A possible way to find the regularities within the differences between both corpora is to make the rules by hand, which is time-consuming and error-prone.</Paragraph>
    <Paragraph position="1"> Another option is to make use of a data-oriented learning method in which linguistic knowledge is learned automatically. In our experiment we have made use of two rule induction techniques, viz. Transformation-Based Error-Driven Learning (TBEDL) (Brill, 1995) and C5.0 (Quinlan, 1993).</Paragraph>
    <Paragraph position="2"> In the process of Transformation-Based Error-Driven Learning, transformation rules are learned by comparing a corpus that is annotated by an initial state annotator to a correctly annotated corpus, which is called the "truth". During that comparison, an ordered list of transformation rules is learned. This ordering implies that the application of an earlier rule sometimes makes it possible for a later rule to apply (so-called "feeding"). In other cases, as also described in the work of Roche and Schabes (1995), a given structure fails to undergo a rule as a consequence of some earlier rule ("bleeding"). These rules are applied to the output of the initial state annotator in order to bring that output closer to the "truth". A rule consists of two parts: a transformation and a "triggering environment". For each iteration in the learning process, it is investigated for each possible rule how many mistakes can be corrected through application of that rule. The rule which causes the greatest error reduction is retained.</Paragraph>
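    <Paragraph>The greedy learning loop described above can be sketched as follows. This is a minimal illustration, not Brill's implementation: the single rule template ("change src into dst when the next symbol is nxt"), the function names and the toy threshold are all assumptions made for the example.

```python
def apply_rule(seq, rule):
    """Apply one rule (src, dst, nxt): src becomes dst when followed by nxt."""
    src, dst, nxt = rule
    out = list(seq)
    for i in range(len(seq) - 1):
        if seq[i] == src and seq[i + 1] == nxt:
            out[i] = dst
    return out


def count_errors(current, truth):
    """Number of positions where the current annotation differs from the truth."""
    return sum(c != t for cur, tru in zip(current, truth) for c, t in zip(cur, tru))


def learn_rules(initial, truth, threshold=1):
    """Greedily learn an ordered rule list; stop when the best gain is under threshold."""
    current = [list(s) for s in initial]
    rules = []
    while True:
        # Propose candidate rules only at positions that are currently wrong.
        candidates = set()
        for cur, tru in zip(current, truth):
            for i in range(len(cur) - 1):
                if cur[i] != tru[i]:
                    candidates.add((cur[i], tru[i], cur[i + 1]))
        base = count_errors(current, truth)
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = base - count_errors([apply_rule(s, rule) for s in current], truth)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None or threshold > best_gain:
            break
        rules.append(best)  # the position in this list is the rule order
        current = [apply_rule(s, best) for s in current]
    return rules
```

Because each retained rule is applied to the corpus before the next iteration, later rules are scored against the output of earlier ones, which is where feeding and bleeding can arise.</Paragraph>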
    <Paragraph position="3"> Figure 1 shows the TBEDL learning process applied to the comparison of the Celex representation and the Fonilex "normal" representation, which functions as "truth". In this case, the task is to learn how to transform Celex representations into Fonilex representations (i.e., translate Dutch pronunciation to Flemish pronunciation). Both corpora serve as input for the "transformation rule learner" (Brill, 1995). This learning process results in an ordered list of transformation rules which reflects the systematic differences between both representations.</Paragraph>
    <Paragraph position="4"> A rule is read as: "change x (Celex representation) into y (Fonilex representation) in the following triggering environment".</Paragraph>
    <Paragraph position="5"> E.g. /i:/ /ɪ/ NEXT 1 OR 2 OR 3 PHON /e:/ (change a tense /i/ to a lax /i/ when one of the three following Celex phonemes is a tense /e/).</Paragraph>
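    <Paragraph>As an illustration of how such a rule might be applied, the sketch below implements a "one of the next three phonemes" triggering environment. The function name, the ASCII symbol "I" for the lax vowel and the toy phoneme strings are assumptions for the example, not taken from the Brill tagger or from the corpora.

```python
def apply_next_1_or_2_or_3(phonemes, src, dst, trigger):
    """Change src into dst when trigger is among the three following phonemes."""
    out = list(phonemes)
    for i, phoneme in enumerate(phonemes):
        # Slicing past the end is safe: it simply yields a shorter window.
        if phoneme == src and trigger in phonemes[i + 1:i + 4]:
            out[i] = dst
    return out


# Hypothetical phoneme strings, for illustration only.
changed = apply_next_1_or_2_or_3(["m", "i:", "n", "e:", "r"], "i:", "I", "e:")
unchanged = apply_next_1_or_2_or_3(["i:", "n", "t", "r", "e:"], "i:", "I", "e:")
```

In the first toy string the tense /e/ is two positions to the right, so the rule fires; in the second it is four positions away, so the string is left as it is.</Paragraph>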
    <Paragraph position="6">  C5.0 (Quinlan, 1993), on the other hand, which is a commercial version of the C4.5 program, generates a classifier in the form of a decision tree. This decision tree can be used to classify a case by starting at the root of the tree and then moving through the tree until a leaf node (associated with a class) is encountered. Since decision trees can be hard to read, the decision tree is converted to a set of production rules, which are more intelligible</Paragraph>
    <Paragraph position="7"> to the user. All rules have the form "L -&gt; R", in which the left-hand side is a conjunction of attribute-based tests and the right-hand side is a class. Note that in the implementation of C5.0, feeding and bleeding effects of rules do not occur, due to the conflict resolution strategy used, which ensures that for each case only one rule can apply (Quinlan, 1993). In this experiment we have made use of a context of three phonemes preceding (indicated by f-1, f-2, and f-3) and following (f+1, f+2, f+3) the focus phoneme, which is indicated by an 'f'.</Paragraph>
    <Paragraph position="8"> The predicted class for this case is then the right-hand side of the rule. At the top of the rule the number of training cases covered by the rule is given together with the number of cases that do not belong to the class predicted by the rule. The "lift" is the estimated accuracy of the rule divided by the prior probability of the predicted class.</Paragraph>
    <Paragraph position="9"> E.g.: (4370/138, lift 82.8)</Paragraph>
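    <Paragraph>The rule header can be unpacked with a small calculation. The sketch below assumes that the rule accuracy is estimated with the Laplace ratio (n - m + 1) / (n + 2), where n is the number of covered cases and m the number of covered cases outside the predicted class; this estimator is an assumption about the tool and is not stated in the text above. The function names are illustrative.

```python
def laplace_accuracy(covered, wrong):
    """Laplace-corrected accuracy estimate of a rule (assumed estimator)."""
    return (covered - wrong + 1) / (covered + 2)


def lift(covered, wrong, class_prior):
    """Estimated rule accuracy divided by the prior probability of the class."""
    return laplace_accuracy(covered, wrong) / class_prior


def implied_prior(covered, wrong, reported_lift):
    """Prior probability of the predicted class implied by a reported lift."""
    return laplace_accuracy(covered, wrong) / reported_lift


accuracy = laplace_accuracy(4370, 138)   # about 0.968
prior = implied_prior(4370, 138, 82.8)   # about 0.0117
```

Under that assumption, the header (4370/138, lift 82.8) would correspond to a predicted class that accounts for roughly 1.2% of the training cases.</Paragraph>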
    <Paragraph position="11"> Before presenting the data to TBEDL and C5.0, alignment is required (Daelemans and van den Bosch, 1996) for the graphemic and phonemic representations of Celex and Fonilex, since the phonemic representation and the spelling of a word often differ in length.</Paragraph>
    <Paragraph position="12"> Therefore, the phonemic symbols are aligned with the graphemes of the written word form.</Paragraph>
    <Paragraph position="13"> In case the phonemic transcription is shorter than the spelling, null phonemes ('-') are used to fill the gaps. In the example "aalmoezenier" (Eng.: "chaplain") this results in the graphemes a a l m o e z e n i e r aligned with the phonemes a: - l m u: - z ə n i: - r. A further step in the preparation of the data consists of the use of an extensive set of so-called "compound phonemes". Compound phonemes are used whenever graphemes map with more than one phoneme, as in the word "taxi", in which the &lt;x&gt; is phonemically represented as /ks/. This problem is solved by defining a new phonemic symbol that corresponds to the two phonemes.</Paragraph>
    <Paragraph position="14"> Our dataset consists of all Fonilex entries with omission of the double transcriptions (only the first transcription is taken), as in the word "caravan", which can be phonemically represented as /kɑrɑvɑn/ or as /kɛrəvɛn/. Also words of which the phonemic transcription is longer than the orthography and for which no compound phonemes are provided, are omitted, e.g. "b'tje" (Eng.: "little b") (phonemically: /',e:O/). The dataset consists of 20 . 36 word forms or 1.972.577 phonemes. DISC is used as phonemic encoding scheme. All DISC phonemes are included and new phonemes are created for the phonemic symbols which only occur in the Fonilex database. We have divided the corpus into a training part, consisting of 90% of the data and a 10% test part.</Paragraph>
    <Paragraph position="15"> Initially, an overlap of 59.07% on the word level and 92.77% on the phoneme level was observed in the 10% test set between Dutch and Flemish representations. Consonants and diphthongs are highly overlapping.</Paragraph>
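    <Paragraph>Overlap figures of this kind can be computed from aligned transcription pairs. The sketch below is one plausible way to do it, assuming that both transcriptions of a word have already been aligned to the same length; the function name and the toy pairs are made up for illustration.

```python
def overlap(pairs):
    """Return (word-level, phoneme-level) overlap in percent.

    pairs is a list of (celex, fonilex) tuples of aligned phoneme symbols.
    """
    identical_words = sum(celex == fonilex for celex, fonilex in pairs)
    total_phonemes = sum(len(celex) for celex, _ in pairs)
    identical_phonemes = sum(
        c == f for celex, fonilex in pairs for c, f in zip(celex, fonilex)
    )
    word_level = 100.0 * identical_words / len(pairs)
    phoneme_level = 100.0 * identical_phonemes / total_phonemes
    return word_level, phoneme_level


# Two toy words: one identical, one differing in a single phoneme.
toy_pairs = [(("a", "b"), ("a", "b")), (("a", "c"), ("a", "d"))]
word_level, phoneme_level = overlap(toy_pairs)  # 50.0 and 75.0
```

A single differing phoneme makes the whole word count as different, which is why the word-level overlap is so much lower than the phoneme-level overlap.</Paragraph>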
  </Section>
  <Section position="4" start_page="329" end_page="329" type="metho">
    <SectionTitle>
3 Quantitative analysis
</SectionTitle>
    <Paragraph position="0"> We first test whether rule induction techniques are able to learn to adapt Northern Dutch pronunciations to Flemish when trained on a number of examples. With Transformation-Based Error-Driven Learning and C5.0, we looked for the systematic differences between Northern Dutch and Flemish.</Paragraph>
    <Paragraph position="1"> In TBEDL, the complete training set of 90% was used for learning the transformation rules. A threshold of 15 errors was specified, which means that learning stops if the error reduction lies under that threshold. Due to the large amount of training data, this threshold was chosen to reduce training time. This resulted in about 450 rules. In figure 2, the number of transformation rules is plotted against the accuracy of the conversion between both variants. Figure 2: Accuracy on the word and phoneme level in relation to the number of transformation rules.</Paragraph>
    <Paragraph position="2"> Figure 2 shows that especially the first 50 rules lead to a considerable increase of performance from 59.07% to 79.40% on the word level and from 92.77% to 96.98% for phonemes, which indicates the high applicability of these rules. Afterwards, the increase of accuracy is more gradual: from 79.40% to 88.95% (words) and from 96.98% to 98.52% (phonemes).</Paragraph>
    <Paragraph position="3"> For the C5.0 experiment, 50% (887.647 cases) of the original training set served as training set (more training data was not feasible). A decision tree model and a production rule model were built from the training cases. The tree gave rise to 745 rules. These production rules were applied to the original 10% test set we used in the TBEDL experiment. In order to make the type of task comparable for the transformation-based approach used by TBEDL and the classification-based approach used in C5.0, the output class to be predicted by C5.0 was either '0' when the Celex and Fonilex phoneme are identical (i.e. no change), or the Fonilex phoneme when Celex and Fonilex differ.</Paragraph>
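    <Paragraph>The case encoding described here can be sketched as follows: each Celex phoneme becomes one case with a window of three phonemes on either side, and the class is '0' for no change or the Fonilex phoneme otherwise. The padding symbol, the function name and the toy input are assumptions; the actual attribute layout fed to C5.0 is not given in the text.

```python
PAD = "="  # hypothetical symbol for positions outside the word


def make_cases(celex, fonilex, width=3):
    """Build (window, label) cases from two aligned phoneme sequences."""
    if len(celex) != len(fonilex):
        raise ValueError("sequences must be aligned to the same length")
    padded = [PAD] * width + list(celex) + [PAD] * width
    cases = []
    for i, (c, f) in enumerate(zip(celex, fonilex)):
        # Window layout: f-3, f-2, f-1, focus phoneme f, f+1, f+2, f+3.
        window = tuple(padded[i:i + 2 * width + 1])
        label = "0" if c == f else f
        cases.append((window, label))
    return cases


# Toy aligned pair; "x" and "G" stand in for a voiceless and a voiced fricative.
cases = make_cases(["x", "a:", "l"], ["G", "a:", "l"])
```

With this encoding most cases carry the label '0', since the two databases agree on the large majority of phonemes.</Paragraph>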
    <Paragraph position="4"> Table 2 gives an overview of the overlap between Celex and Fonilex after application of both rule induction techniques. A comparison of these results shows that, when evaluating both TBEDL and C5.0 on the test set, the rules learned by the Brill-tagger have a higher error rate, even when C5.0 is only trained on half the data used by TBEDL. On the word level, the initial overlap of 59.07% is raised to 88.95% after application of the 450 transformation rules, and to 90.35% when using the C5.0 rules. On the phoneme level, the initial 92.77% overlap is increased to 98.52% (TBEDL) and 98.74% (C5.0). C5.0 also has a slightly lower error rate for the consonants, vowels and diphthongs.</Paragraph>
    <Paragraph position="5"> Table 2: Overlap between Celex and Fonilex (columns: Word, Phon., Cons., Vowel, Diph.) after application of 450 transformation rules and all C5.0 production rules.</Paragraph>
    <Paragraph position="6"> When looking at those cases where Celex and Fonilex differ, we see that it is possible to learn Brill rules which predict 73% of these differences at the word level and 79.5% of the differences at the phoneme level. The C5.0 rules are more or less 3% more accurate: 76.4% (words) and 82.6% (phonemes). It is indeed possible to reliably 'translate' Dutch into Flemish.</Paragraph>
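    <Paragraph>These percentages appear to follow from the overlap figures reported earlier, if "differences predicted" is read as the gain in overlap divided by the initial non-overlap. That reading is an assumption, but the arithmetic below reproduces the reported values; the function name is illustrative.

```python
def share_of_differences_resolved(initial_overlap, final_overlap):
    """Percentage of the initial Celex/Fonilex differences that are removed."""
    return 100.0 * (final_overlap - initial_overlap) / (100.0 - initial_overlap)


tbedl_words = share_of_differences_resolved(59.07, 88.95)     # about 73.0
tbedl_phonemes = share_of_differences_resolved(92.77, 98.52)  # about 79.5
c50_words = share_of_differences_resolved(59.07, 90.35)       # about 76.4
c50_phonemes = share_of_differences_resolved(92.77, 98.74)    # about 82.6
```

For example, on the word level TBEDL closes 29.88 of the 40.93 percentage points that initially separate the two databases.</Paragraph>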
  </Section>
  <Section position="5" start_page="329" end_page="331" type="metho">
    <SectionTitle>
4 Qualitative Analysis
</SectionTitle>
    <Paragraph position="0"> In this section, we are concerned with the linguistic quality of the rules that were extracted using TBEDL and C5.0. To gain more insight in the important differences between both pronunciation variants, a qualitative analysis of the rules was performed. Therefore, the conversion rules were listed and compared. The following list presents some examples for consonants, vowels and diphthongs. The starting point is the first 10 rules that were learned during TBEDL, which are compared with the 10 C5.0 rules that most reduce the error rate. In the transformation rules presented below, the relationship between Dutch and Flemish, especially the most important differences, is extracted from the corpora and formulated in a set of easily understandable rules. The C5.0 production rules, on the other hand, also describe the overlapping phonemes between Celex and Fonilex, which makes it hard to have a clear overview of the regularities in the differences between both variants of Dutch. The fact that the category '0' was used to describe the overlap between the databases (no change) does not really help.</Paragraph>
    <Paragraph position="1"> Even if C5.0 discovers that no change is the default rule, additional specific rules describing the default condition are nevertheless necessary to prevent the other rules from firing incorrectly.</Paragraph>
    <Section position="1" start_page="330" end_page="331" type="sub_section">
      <SectionTitle>
4.1 Consonants
</SectionTitle>
      <Paragraph position="0"> Nearly 60% of the differences on the consonant level concern the alternation between voiced and unvoiced consonants. In the word "gelijkaardig" (Eng.: "equal"), for example, we find /xəlɛika:rdəx/ with a voiceless velar fricative in Dutch and /ɣəlɛika:rdəx/ with a voiced velar fricative in Flemish. The word "machiavellisme" (Eng.: "Machiavellism") is pronounced as /ln(igi:ja:w:hsm,)/ in Dutch and [...] Table 3 clearly shows the alternation between /x/ and /ɣ/. This alternation also is the subject of the first transformation rule, namely "/x/ changes into /ɣ/ in case of a word beginning (indicated by "STAART") one or two positions before". When looking at the top ten of the C5.0 production rules that most reduce error rate, the two most important rules also describe this alternation: [...] Another important phenomenon is the use of palatalisation in Flemish, as in the word "aaitje" (Eng.: "stroke"), where Fonilex uses the palatalized form /a:jtʃə/ instead of /a:jtjə/. The two subsequent transformation rules 3 and 4 make this change possible. In the top 10 of C5.0 rules, only the first part of this change is described. Transformation rule 8 describes the omission of the phoneme /t/ in case of the graphemic combination &lt;ti&gt;, as in "politie" (Eng.: "police").</Paragraph>
      <Paragraph position="2"> Table 4: Transformation rules for the most frequent differences at the consonant level.</Paragraph>
      <Paragraph position="3"> 4.2 Vowels. [...] of the differences at the vowel level between Dutch and Flemish concern the use of a lax vowel instead of a tense vowel for /i:/, /e:/, /a:/, /o:/ and /u:/. This alternation is illustrated by the following confusion matrix, which clearly shows that tense Celex vowels not only correspond with tense, but also with lax vowels in Fonilex. Other less frequent differences are glide insertion, e.g. in "geshakct", and the use of schwa instead of another vowel, as in "teleprocessing" in Flemish.</Paragraph>
      <Paragraph position="4"> Table 5: Confusion matrix of the Flemish lax and tense vowels given the Dutch tense vowels.</Paragraph>
      <Paragraph position="5"> The /;/ and /Q/ are compound phonemes we introduced. They do not have an IPA equivalent. In transformation rules 2, 5, 6, 7, 9, there is a transition from a tense vowel into a lax vowel in a certain triggering environment. An example is the word "multipliceer" (Eng.: "multiply") which is transcribed as /multi:pli:se:r/ in Celex and as /multɪplɪse:r/ in Fonilex.</Paragraph>
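      <Paragraph position="6"> Nr. and Celex phoneme of the vowel rules: 2. i:; 5. i:; 6. i:j; 7. o:; 9. a:.</Paragraph>
      <Paragraph position="7"> Table 6: Transformation rules for the most frequent differences between the Dutch and Flemish vowels.</Paragraph>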
      <Paragraph position="8"> A closer look at the ten most important C5.0 production rules shows that seven out of ten rules describe this transition from a Celex tense vowel to a Fonilex lax vowel.</Paragraph>
    </Section>
    <Section position="2" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
4.3 Diphthongs
</SectionTitle>
      <Paragraph position="0"> For the diphthongs, few transformation rules are learned during training, since Celex and Fonilex are highly overlapping (see table 1).</Paragraph>
      <Paragraph position="1"> The rules concern the phonemes that follow the diphthongs: /j/ after /ei/ and /u/ after /au/. E.g. in "blauw" (Eng.: "blue"), the /u/ is omitted in Flemish: /blau/. In the top ten of C5.0 rules, no rules are given describing this phenomenon.</Paragraph>
      <Paragraph position="2"> Nr. 10: Celex /u/, Fonilex '-', triggering environment PREV PHON /au/. Table 7: Transformation rule concerning the lack or presence of a /u/ following an /au/.</Paragraph>
      <Paragraph position="3"> These rules, describing the differences between Northern Dutch and Flemish consonants, vowels and diphthongs, also make linguistic sense. Linguistic literature, such as the work of Booij (1995) and De Schutter (1978), indicates tendencies such as voicing and devoicing on the consonant level and the confusion of tense and lax vowels as important differences between Northern Dutch and Flemish. The same discrepancies are found in the transcriptions made by both Flemish and Dutch subjects in the Dutch transcription experiments described in Gillis (1999).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="331" end_page="332" type="metho">
    <SectionTitle>
5 Error Analysis
</SectionTitle>
    <Paragraph position="0"> Besides the systematic phonemic differences between Flemish and Dutch, there are a number of unsystematic differences between both databases. After application of 450 transformation rules, 88.95% of the words make a correct transition from the Celex transcription to the Fonilex transcription. The 745 C5.0 rules lead to 90.35%. Using the Brill-tagger, it also has to be taken into account that rules can be undone by a later rule (see also Roche and Schabes (1995)), as in the word "feuilleteer" (Eng.: "leaf through"). Celex provides the transcription /fœyjəte:r/, while Fonilex transcribes it as /fø:jəte:r/. During learning, the transformation rule "change /œy/ into /ø:/ if the preceding grapheme is an &lt;e&gt;" is learned. This results in the correct Fonilex /fø:jəte:r/. This transformation, however, is canceled by a later rule, which changes /ø:/ back into /œy/ if the following grapheme is an &lt;i&gt;. This leads again to the original Celex transcription. C5.0, which does not suffer from similar consequences of rule ordering, will correctly classify "feuilleteer". In this section, we are concerned with the remaining errors after application of all rules.</Paragraph>
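    <Paragraph>The way a later rule can cancel an earlier one is easy to reproduce with an ordered rule list. In the sketch below the grapheme-to-phoneme alignment of "feuilleteer", the ASCII phoneme names ("oey", "eu:", "@") and the rule encoding are all assumptions made for illustration; only the two rule conditions come from the description above.

```python
def apply_ordered(graphemes, phonemes, rules):
    """Apply rules in order; each rule is (src, dst, grapheme offset, trigger)."""
    out = list(phonemes)
    for src, dst, offset, trigger in rules:
        for i in range(len(out)):
            j = i + offset  # position of the grapheme that must match the trigger
            if out[i] == src and j in range(len(graphemes)) and graphemes[j] == trigger:
                out[i] = dst
    return out


# Hypothetical alignment: one phoneme slot per grapheme, "-" is the null phoneme.
graphemes = list("feuilleteer")
celex = ["f", "-", "oey", "-", "j", "-", "@", "t", "e:", "-", "r"]

rules = [
    ("oey", "eu:", -1, "e"),  # earlier rule: preceding grapheme is "e"
    ("eu:", "oey", +1, "i"),  # later rule: following grapheme is "i"
]

after_first = apply_ordered(graphemes, celex, rules[:1])
after_both = apply_ordered(graphemes, celex, rules)
```

After the first rule the vowel slot holds the Fonilex-style symbol, but applying both rules in order restores the original Celex sequence, which mirrors the "feuilleteer" case.</Paragraph>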
    <Paragraph position="1"> Making use of a rule induction technique to extract the sub-regularities in the differences between the corpora can lead to some rules which, however, may be based on noise or errors in the databases. Therefore, a manual analysis was done, which showed that the explanation of these remaining errors is twofold.</Paragraph>
    <Paragraph position="2"> A first reason is that no rule is available for less frequent cases. The rules are induced on the basis of a sufficiently big frequency effect.</Paragraph>
    <Paragraph position="3"> This leads to no rule at all for less frequent phonemes and phoneme combinations and also for phonemes which are not always consistently transcribed. Examples are loan words, such as "points" and "pantys", or the loan sound /-/ which only appears in Fonilex.</Paragraph>
    <Paragraph position="4"> Another cause for errors is that rules will overgeneralise in certain cases. The confusion matrix for vowels in table 5 clearly indicates the tendency to use more lax vowels in Flemish. This leads to a number of transformation rules and C5.0 rules describing this tendency. A closer investigation of the errors committed by the Brill-tagger, however, shows that 41.7% of the errors concern the use of a wrong vowel. In 25% of the errors committed on the phoneme level, there was an incorrect transition from a tense to a lax vowel, as in "antagonisme" (Eng.: "antagonism") where there was no transition from an /o:/ to an /ɔ/. In 16.8% of the errors, a tense vowel is erroneously used instead of a lax vowel, as in "affiche" (Eng.: "poster") where an /i:/ is used instead of a (correct) /ɪ/.</Paragraph>
    <Paragraph position="5"> Difficulties in the alternation between voiced and unvoiced consonants account for 6.3% of the errors on the phoneme level. E.g. in "administratie" the /t/ was not converted into /d/. In order to analyse why C5.0 performs better on our task than TBEDL, a closer comparison was made of the errors exclusively made by the Brill-tagger and those exclusively made by C5.0. However, no systematic differences in errors were found which could explain the higher accuracies when using C5.0.</Paragraph>
  </Section>
</Paper>