<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3202"> <Title>Improving Syllabification Models with Phonotactic Knowledge</Title> <Section position="4" start_page="14" end_page="15" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> In this section, we report on our experiments with the four different phonotactic grammars introduced in Section 2.1 (see grammars 2.1.3-2.1.6), as well as with a re-implementation of Müller's less complex grammar (Müller, 2002). All these grammars are trained on a corpus of transcribed words from the pronunciation lexicon CELEX. We use the full forms of the lexicon instead of the lemmas. The German lexicon contains 304,928 words and the English lexicon 71,493 words. Homographs with the same pronunciation but with different part-of-speech tags are taken only once. For our German experiments, we use 274,435 words for training and 30,492 for testing (evaluating). For our English experiments, we use 64,343 words for training and 7,249 for testing.</Paragraph> <Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 3.1 Training procedure </SectionTitle> <Paragraph position="0"> We use the same training procedure as Müller (2001). It is a kind of treebank training where we obtain a probabilistic context-free grammar (PCFG) by observing how often each rule was used in the training corpus. The brackets of the input guarantee an unambiguous analysis of each word. Thus, the formula of treebank training given by Charniak (1996) is applied: if r is a rule, |r| is the number of times r occurred in the parsed corpus, and l(r) is the non-terminal that r expands, then the probability assigned to r is given by p(r) = |r| / Σ_{r' : l(r') = l(r)} |r'|.</Paragraph> <Paragraph position="2"> After training, we transform the PCFG by dropping the brackets in the rules, resulting in an analysis grammar.
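The treebank estimation just described (relative-frequency counts of rules read off the unambiguous bracketed parses) can be sketched as follows; the rule and category names in the toy corpus are invented placeholders, not the paper's actual CELEX grammar:

```python
from collections import Counter, defaultdict

# Hypothetical bracketed training analyses: each tree is the list of
# (lhs, rhs) rules read off one unambiguous bracketed parse.
parsed_corpus = [
    [("Word", ("Syl",)), ("Syl", ("On", "Nuc")), ("On", ("t",)), ("Nuc", ("u",))],
    [("Word", ("Syl",)), ("Syl", ("Nuc",)), ("Nuc", ("a",))],
]

def treebank_estimate(trees):
    """p(r) = |r| / sum of counts of all rules sharing r's left-hand side."""
    rule_counts = Counter(r for tree in trees for r in tree)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {r: n / lhs_totals[r[0]] for r, n in rule_counts.items()}

probs = treebank_estimate(parsed_corpus)
```

Dropping the brackets afterwards only changes the rule shapes, not this estimation step.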
The bracket-less analysis grammar is used for parsing input without brackets; i.e., the phoneme strings are parsed and the syllable boundaries are extracted from the most probable parse.</Paragraph> <Paragraph position="3"> In our experiments, we use the same technique.</Paragraph> <Paragraph position="4"> The advantage of this training method is that we learn the distribution of the grammar which maximizes the likelihood of the training corpus.</Paragraph> </Section> <Section position="2" start_page="14" end_page="15" type="sub_section"> <SectionTitle> 3.2 Evaluation procedure </SectionTitle> <Paragraph position="0"> We evaluate our grammars on a syllabification task; that is, we use the trained grammars to predict the syllable boundaries of an unseen corpus.</Paragraph> <Paragraph position="1"> As we drop the explicit markers for syllable boundaries, the grammar can be used to predict the boundaries of arbitrary phoneme sequences. The boundaries can be extracted from the syl-span which governs an entire syllable.</Paragraph> <Paragraph position="2"> Our training and evaluation procedure is a 10-fold cross-validation procedure. We divide the original (German/English) corpus into ten parts of equal size.</Paragraph> <Paragraph position="3"> We start the procedure by training on parts 1-9 and evaluating on part 10. In the next step, we train on parts 1-8 and 10 and evaluate on part 9. Then, we evaluate on part 8, and so forth. In the end, this procedure yields evaluation results for all 10 parts of the original corpus. Finally, we calculate the average of all evaluation results.</Paragraph> <Paragraph position="4"> Our three evaluation measures are word accuracy, syllable accuracy, and syllable boundary accuracy.</Paragraph> <Paragraph position="5"> Word accuracy is a very strict measure and does not depend on the number of syllables within a word.
If a word is correctly analyzed, the accuracy increases.</Paragraph> <Paragraph position="6"> We define word accuracy as: # of correctly analyzed words / total # of words. Syllable accuracy is defined as: # of correctly analyzed syllables / total # of syllables. The last evaluation metric we use is the syllable boundary accuracy. It expresses how reliably the boundaries were recognized. It is defined as: # of correctly analyzed syllable boundaries / total # of syllable boundaries. The difference between the three metrics can be seen in the following example. Let our evaluation corpus consist of two words, transferring and wet. The transcription and the syllable boundaries are displayed in table 1. Let our trained grammar predict the boundaries shown in table 2. Then the word accuracy will be 50% (1 correct word / 2 words) and the syllable boundary accuracy is 75% (3 correct syllable boundaries / 4 syllable boundaries). The difference between syllable accuracy and syllable boundary accuracy is that the first metric punishes a wrongly predicted syllable boundary twice, as the complete syllable has to be correct. The syllable boundary accuracy only judges the end of the syllable and counts how often it is correct. Monosyllabic words are also included in this measure. They serve as a baseline, as the syllable boundary of a monosyllabic word is always correct. If we compare the baselines for English and German (tables 3 and 4, respectively), we observe that the English dictionary contains 10.3% monosyllabic words and the German one 1.59%.</Paragraph> <Paragraph position="7"> Table 3 and table 4 show that phonotactic knowledge improves the prediction of syllable boundaries. The syllable boundary accuracy increases from 95.84% to 97.15% for English and from 95.9% to 96.48% for German. One difference between the two languages is that German profits from encoding the nucleus in the onset or coda rules, whereas English does not.
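The three accuracy measures can be sketched on the transferring/wet example; since tables 1 and 2 are not reproduced here, the predicted syllabification of transferring below is an assumed illustration, chosen so that one internal boundary is wrong:

```python
def spans(syllables):
    """Character spans (start, end) covered by each syllable of a word."""
    out, start = set(), 0
    for s in syllables:
        out.add((start, start + len(s)))
        start += len(s)
    return out

def boundaries(syllables):
    """Positions of syllable-final boundaries, including the word end."""
    return {end for _, end in spans(syllables)}

def evaluate(gold, pred):
    words_ok = sum(1 for w in gold if gold[w] == pred[w])
    syl_ok = sum(len(spans(gold[w]) & spans(pred[w])) for w in gold)
    syl_total = sum(len(g) for g in gold.values())
    bnd_ok = sum(len(boundaries(gold[w]) & boundaries(pred[w])) for w in gold)
    bnd_total = sum(len(boundaries(g)) for g in gold.values())
    return words_ok / len(gold), syl_ok / syl_total, bnd_ok / bnd_total

# Gold syllabification vs. an assumed (wrong) prediction for "transferring".
gold = {"transferring": ["trans", "fer", "ring"], "wet": ["wet"]}
pred = {"transferring": ["trans", "ferr", "ing"], "wet": ["wet"]}
word_acc, syl_acc, bnd_acc = evaluate(gold, pred)
```

With this prediction, word accuracy is 50% and syllable boundary accuracy is 75%, matching the worked numbers in the text; the misplaced boundary costs two syllables but only one boundary.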
This might point to a dependency of German onsets on the nucleus.</Paragraph> <Paragraph position="8"> For English, it is even the case that the on-nuc and the nuc-cod grammars worsen the results compared to the phonotactic base grammar. Only the combination of the two grammars (the on-nuc-cod grammar) achieves a higher accuracy than the phonotactic grammar. We suspect that the on-nuc-cod grammar encodes that onset and coda constrain each other with respect to the repetition of liquids or nasals in /s/C onsets and codas. For instance, lull and mam are acceptable, whereas slull and smame are less good.</Paragraph> </Section> </Section> <Section position="5" start_page="15" end_page="18" type="metho"> <SectionTitle> 4 Learning phonotactics from PCFGs </SectionTitle> <Paragraph position="0"> We want to demonstrate in this section that our phonotactic grammars do not only improve syllabification accuracy but can also be used to reveal interesting phonotactic information. Our intention is to show that it is possible to augment symbolic studies such as Hall (1992), Pierrehumbert (1994), Wiese (1996), Kessler and Treiman (1997), or Ewen and van der Hulst (2001) with extensive probabilistic information. Due to time and space constraints, we concentrate on the two-consonantal clusters of grammar 2.1.3.</Paragraph> <Paragraph position="1"> Syllabification results for English (cf. the discussion of tables 3 and 4 above):
grammar version      word accuracy   syllable accuracy   syll. boundary accuracy
baseline             10.33%          --                  --
(Müller, 2002)       89.27%          91.84%              95.84%
phonot. grammar      92.48%          94.35%              97.15%
phonot. on-nuc       92.29%          94.21%              97.09%
phonot. nuc-cod      92.39%          94.27%              97.11%
phonot. on-nuc-cod   92.64%          94.47%              97.22%</Paragraph> <Paragraph position="2"> Phonotactic restrictions are often expressed by tables which describe the possible combinations of consonants. Table 5 shows the possible combinations of German two-consonantal onsets (Wiese, 1996). However, such a table cannot express differences in frequency of occurrence between certain clusters.
For instance, it does not distinguish between onset clusters such as [pfl] and [kl]. If we consider their frequency of occurrence in a German dictionary, there is indeed a great difference: [kl] is much more common than [pfl].</Paragraph> <Section position="1" start_page="15" end_page="16" type="sub_section"> <SectionTitle> 4.1 German </SectionTitle> <Paragraph position="0"> Our method allows additional information to be added to tables such as table 5. In what follows, the probabilities are taken from the rules of grammar 2.1.3. Table 6 shows the probability of occurrence of German obstruents, ordered by their probability of occurrence.</Paragraph> </Section> <Section position="2" start_page="16" end_page="17" type="sub_section"> <SectionTitle> Sonorants Obstruents </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> [S] occurs very often in German words as the first consonant in two-consonantal onsets word-initially. In the first row of table 6, the consonants which occur as second consonants are listed. We observe, for instance, that [St] is the most common two-consonantal onset in monosyllabic words. This consonant cluster appears in words such as Staub (dust), stark (strong), or Stolz (pride). We believe that there is a threshold indicating that a certain combination is very likely to come from a loanword. If we define the probability of a two-consonantal onset as p(onset_ini_2) =def p(C1) × p(C2), where p(C1) is the probability of the rule</Paragraph> <Paragraph position="4"> then we get a list of two-consonantal onsets ordered by their probabilities: p(St) > ... > p(sk) > p(pfl) > p(sl) > ... > p(sf). These onsets occur in words such as Steg (footbridge), stolz (proud), Staat (state), Skalp (scalp), Skat (skat), Pflicht (duty), Pflock (stake), or Slang (slang) and Slum (slum). The least probable combination is [sf], which appears in the German word Sphäre (sphere), derived from the Latin word sphaera.
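The ranking of two-consonantal onsets under the independence assumption p(C1 C2) = p(C1) × p(C2) can be sketched as follows; the rule probabilities are invented placeholders chosen to reproduce the ordering in the text, not the values actually estimated from grammar 2.1.3:

```python
# Placeholder probabilities for first- and second-consonant rules.
p_c1 = {"S": 0.40, "s": 0.10, "p": 0.05}
p_c2 = {"t": 0.50, "k": 0.20, "fl": 0.12, "l": 0.05, "f": 0.01}

def onset_prob(c1, c2):
    """Joint probability of a two-consonantal onset, assuming independence."""
    return p_c1[c1] * p_c2[c2]

onsets = [("s", "k"), ("p", "fl"), ("S", "t"), ("s", "f"), ("s", "l")]
ranked = sorted(onsets, key=lambda o: onset_prob(*o), reverse=True)
# With these placeholder numbers the ordering mirrors the text:
# p(St) > p(sk) > p(pfl) > p(sl) > p(sf)
```

Such a ranked list is what turns a binary possible/impossible table like table 5 into a graded phonotactic description.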
The consonant cluster [sl] is also a very uncommon onset. Words with this onset are usually loanwords from English. The onset [sk], however, occurs more often in German words.</Paragraph> <Paragraph position="5"> Most of these words are originally from Latin and were borrowed into German long ago. Interestingly, the onset [pfl] is also a very uncommon onset. Most of these onsets result from the second sound shift, where in certain positions the simple onset consonant /p/ became the affricate /pf/. The English translations of these words show that the second sound shift was not applied to English. However, the most probable two-consonantal onset is [St]. The whole set of two-consonantal onsets can be seen in Table 8.</Paragraph> </Section> <Section position="3" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 4.2 English </SectionTitle> <Paragraph position="0"> English two-consonantal onsets show that unvoiced first consonants are more common than voiced ones.</Paragraph> <Paragraph position="1"> However, two combinations are missing: the alveolar plosives /t/ and /d/ do not combine with the lateral /l/ in English two-consonantal onsets. Table 8 shows the most probable two-consonantal onsets sorted by their joint probability.</Paragraph> </Section> <Section position="4" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 4.3 Comparison between English and German </SectionTitle> <Paragraph position="0"> The fricatives /s/ and /S/ are often regarded as extrasyllabic. According to our study on two-consonantal onsets, these fricatives are very probable first consonants and combine with more second consonants than all other first consonants. They seem to form a class of their own. Liquids and glides are the most important second consonants. However, English prefers /r/ over /l/ in all syllable positions, and /j/ over /w/ (except in monosyllabic words) and /n/ as second consonants. Nasals can only combine with very few first consonants.
In German, we observe that /R/ is preferred over /l/, and /v/ over /n/ and /j/. Moreover, the nasal /n/ is much more common as a second consonant in German than in English, which applies especially to medial and final syllables.</Paragraph> <Paragraph position="1"> When we compare the phonotactic restrictions of two languages, it is also interesting to observe which combinations are missing. If certain consonant clusters are unlikely or never occur in a language, this might have consequences for language understanding and language learning. Phonotactic gaps in one language might cause spelling mistakes in a second language. For instance, a typical Northern German name is Detlef, which is often misspelled in English as Deltef. The onset cluster /tl/ can occur in medial and final German syllables but not in English. The different phonetic realization of /l/ may play a certain role in why /lt/ is more natural than /tl/ in English.</Paragraph> </Section> </Section> <Section position="6" start_page="18" end_page="18" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Comparison of the syllabification performance with other systems is difficult: (i) different approaches differ in their training and evaluation corpus; (ii) comparisons across languages are hard to interpret; (iii) comparisons across different approaches require cautious interpretation. Nevertheless, we want to refer to several approaches that examined the syllabification task. Van den Bosch (1997) investigated the syllabification task with five inductive learning algorithms. He reported a generalization error for words of 2.22% on English data.</Paragraph> <Paragraph position="1"> However, his evaluation procedure differs from ours, as he evaluates each decision (after each phoneme) made by his algorithms. Marchand et al. (to appear 2006) evaluated different syllabification algorithms on three different pronunciation dictionaries.
Their best algorithm (SbA) achieved a word accuracy of 91.08%. The most direct point of comparison is the results presented by Müller (2002). Her approach differs from ours in two ways: first, she only evaluates the German grammar, and second, she trains on a newspaper corpus. As we are interested in how her grammars perform on our corpus, we re-implemented her grammars and tested both in our 10-fold cross-validation procedure. We find that the first grammar (Müller, 2001) achieves 85.45% word accuracy, 88.94% syllable accuracy, and 94.37% syllable boundary accuracy for English, and 84.21%, 90.86%, and 95.36% respectively for German. The results show that the syllable boundary accuracy increases from 94.37% to 97.2% for English and from 95.36% to 97.2% for German. The experiments show that phonotactic knowledge is a valuable source of information for syllabification.</Paragraph> </Section> </Paper>