<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1048">
  <Title>A Rule Induction Approach to Modeling Regional Pronunciation Variation.</Title>
  <Section position="2" start_page="0" end_page="327" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> A central component of speech processing systems is a pronunciation lexicon defining the relationship between the spelling and pronunciation of words. Regional variants of a language may differ considerably in their pronunciation.</Paragraph>
    <Paragraph position="1"> Once a speaker from a particular region is detected, speech input and output systems should be able to adapt their pronunciation lexicon to this regional variant. Regional pronunciation differences are mostly systematic and can be modeled using rules designed by experts. However, in this paper, we investigate the automation of this process by using data-driven techniques, more specifically, rule induction techniques. Data-driven methods have proven their efficacy in several language engineering tasks, such as grapheme-to-phoneme conversion, part-of-speech tagging, etc. Extraction of linguistic knowledge from a sample corpus instead of manual encoding of linguistic information proved to be an extremely powerful method for overcoming the linguistic knowledge acquisition bottleneck. Different approaches are available, such as decision-tree learning (Dietterich, 1997), neural network or connectionist approaches (Sejnowski and Rosenberg, 1987), memory-based learning (Daelemans and van den Bosch, 1996), etc. Data-driven approaches can yield results comparable to (and often even better than) those of the rule-based approach, as described in the work of Daelemans and van den Bosch (1996), in which a comparison is made between Morpa-cum-Morphon (Heemskerk and van Heuven, 1993), an example of a linguistic knowledge based approach to grapheme-to-phoneme conversion, and IG-Tree, an example of a memory-based approach (Daelemans et al., 1996).</Paragraph>
    <Paragraph position="2"> In this study, we will look for the patterns and generalizations in the phonemic differences between Dutch and Flemish by using two data-driven techniques. It is our aim to extract the regularities that are implicitly contained in the data. Two corpora were used for this study, representing the Northern Dutch and Southern Dutch variants. For Northern Dutch, Celex (release 2) was used, and for Flemish, Fonilex (version 1.0b). The Celex database contains frequency information (based on the INL corpus of the Institute for Dutch Lexicology), and phonological, morphological, and syntactic lexical information for more than 384,000 word forms, and uses DISC as encoding for word pronunciation. The Fonilex database is a list of more than 200,000 word forms together with their Flemish pronunciation. For each word form, an abstract lexical representation is given, together with the concrete pronunciation of that word form in three speech styles: highly formal speech, sloppy speech and &quot;normal&quot; speech (which is an intermediate level). A set of phonological rewrite rules was used to deduce these concrete speech styles from the abstract phonological form. The initial phonological transcription was obtained by a grapheme-to-phoneme converter and was afterwards corrected by hand. Fonilex uses YAPA as encoding scheme. By means of their identification number, the Fonilex entries also contain a reference to the Celex entries, since Celex served as basis for the list of word forms in Fonilex. E.g. for the word &quot;aaitje&quot; (Eng.: &quot;stroke&quot;), the relevant Celex entry is &quot;25/aait.je/5/'aj-tjC/}/&quot; and the corresponding Fonilex entry looks like &quot;251aaitjel'ajts@l&quot;. The word forms in Celex with a frequency of 1 and higher (indicated in field 3) are included in Fonilex, and from the list with frequency 0, only the monomorphematic words were selected.</Paragraph>
    <Paragraph position="3"> In the following section, a brief explanation is given of the method we used to search for the overlap and differences between both regional variants of Dutch. Section 3 provides a quantitative analysis of the results. Section 4 discusses the differences between Celex and Fonilex, starting from the set of transformation rules that is learned during Transformation-Based Error-Driven Learning (TBEDL). These rules are compared to the production rules produced by C5.0. In addition, we present an overview of the non-systematic differences. In a final section, some concluding remarks are given.</Paragraph>
  </Section>
</Paper>