File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1030_metho.xml
Size: 21,147 bytes
Last Modified: 2025-10-06 14:10:10
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1030"> <Title>Learning Pronunciation Dictionaries Language Complexity and Word Selection Strategies</Title> <Section position="4" start_page="232" end_page="233" type="metho"> <SectionTitle> 3 LTS Algorithm Basics </SectionTitle> <Paragraph position="0"> A wide variety of approaches have been applied to the problem of letter-to-sound rule induction. Due to simplicity of representation and ease of manipulation, our LTS rule learner follows the Default & Refine algorithm of Davel (Davel and Barnard, 2004). In this framework, each letter c is assigned a default production p1-p2... denoting the sequence of zero or more phonemes most often associated with that letter. Any exceptions to a letter's default rule is explained in terms of the surrounding context of letters. The default rules have a context width of one (the letter itself), while each additional letter increases the width of the context window. For example, if we are considering the first occurrence of 's' in the word basics, the context windows are as listed in Table 1. By convention, the underscore character denotes the predicted position, while the hash represents word termination.</Paragraph> <Paragraph position="1"> width context sets ordered by increasing width In this position there are 20 possible explanatory contexts. The order in which they are visited defines an algorithm's search strategy. In the class of algorithms knows as &quot;dynamically expanding context (DEC)&quot;, contexts are considered top-down as depicted in Table 1. Within one row, some algorithms follow a fixed order (e.g. center, left, right). Another variant tallies the instances of productions associated with a candidate context and chooses the one with the largest count. For example, in Spanish the letter 'c' may generate K (65%), or TH when followed by e or i (32%), or CH when followed by h (3%). These are organized by frequency into a &quot;rule chain.&quot; If desired, rules 2 and 3 in this example can be condensed into 'c' - TH /_{i,e}, but in general are left separated for sake of simplicity.</Paragraph> <Paragraph position="2"> In our variant, before adding a new rule all possible contexts of all lengths are considered when selecting the best one. Thus the rule chains do not obey a strict order of expanding windows, though shorter contexts generally precede longer ones in the rule chains.</Paragraph> <Paragraph position="3"> One limitation of our representation is that it does not support gaps in the letter context. Consider the word pairs tom/tome, top/tope, tot/tote. A CART tree can represent this pattern with the rule: if (c-1 = 't' and c0='o' and c2='e') then ph=OW. In practice, the inability to skip letters is not a handicap.</Paragraph> <Section position="1" start_page="233" end_page="233" type="sub_section"> <SectionTitle> 3.1 Multiple Pronunciation Predictions </SectionTitle> <Paragraph position="0"> Given a word, finding the predicted pronunciation is easy. Rule chains are indexed by the letter to be predicted, and possible contexts are scanned starting from the most specific until a match is found.</Paragraph> <Paragraph position="1"> Continuing our example, the first letter in the Spanish word ciento fails rule 4, fails rule 3, then matches rule 2 to yield TH. For additional pronunciations the search continues until another match is found: here, the default rule 'c' - K /_. This procedure is akin to predicting from progressively smoother models. In a complex language such as English, a ten letter word can readily generate dozens of alternate pronunciations, necessitating an ordering policy to keep the total manageable.</Paragraph> </Section> </Section> <Section position="5" start_page="233" end_page="236" type="metho"> <SectionTitle> 4 Language Characterization </SectionTitle> <Paragraph position="0"> English is notorious for having a highly irregular spelling system. Conversely, Spanish is admired for its simplicity. Most others lie somewhere in between. To estimate how many words need to be seen in order to acquire 90% coverage of a language's LTS rules, it helps to have a quantitative measure. In this section we offer a perplexity-based measure of LTS regularity and present measurements of several languages with varying corpus size. These measurements establish, surprisingly, that a rule system's perplexity increases without bound as the number of training words increases. This holds true whether the language is simple or complex. In response, we resort to a heuristic measure for positioning languages on a scale of relative difficulty.</Paragraph> <Section position="1" start_page="233" end_page="233" type="sub_section"> <SectionTitle> 4.1 A Test Suite of Seven Languages </SectionTitle> <Paragraph position="0"> Our test suite consists of pronunciation dictionaries from seven languages, with English considered under two manifestations.</Paragraph> <Paragraph position="1"> English. Version 0.6d of CMU-DICT, considered without stress (39 phones) and with two level stress marking (58 phones). German. The Celex dictionary of 321k entries (Burnage, 1990). Dutch.</Paragraph> <Paragraph position="2"> The Fonilex dictionary of 218k entries (Mertens and Vercammen, 1998). Fonilex defines an abstract phonological level from which specific dialects are specified. We tested on the &quot;standard&quot; dialect. Afrikaans. A 37k dictionary developed locally. Afrikaans is a language of South Africa and is a recent derivative of Dutch. Italian. A 410k dictionary distributed as part of a free Festival-based Italian synthesizer (Cosi, 2000). Spanish. Generated by applying a set of hand written rules to a 52k lexicon. The LTS rules are a part of the standard Festival Spanish distribution. Telugu. An 8k locally developed dictionary. In its native orthography, this language of India possess a highly regular syllabic writing system. We've adopted a version of the Itrans-3 transliteration scheme (Kishore 2003) in which sequences of two to four English letters map onto Telugu phonemes.</Paragraph> </Section> <Section position="2" start_page="233" end_page="234" type="sub_section"> <SectionTitle> 4.2 Perplexity as a Measure of Difficulty </SectionTitle> <Paragraph position="0"> A useful way of considering letter to sound production is as a Markov process in which the generator passes through a sequence of states (letters), each probabilistically emitting observation symbols (phonemes) before transitioning to the next state (following letter). For a letter c, the unpredictability of phoneme emission is its entropy Hc=[?][?]pilog pi or equivalently its perplexity Pc=eHc . The perplexity can be interpreted as the average number of output symbols generated by a letter. The production perplexity of the character set is the sum of each individual letter's perplexity weighted by its unigram probability pc.</Paragraph> <Paragraph position="2"> Continuing with our Spanish example, the letter 'c' emits the observation symbols (K, TH, CH) with a probability distribution of (.651, .321, .028), for a perplexity of 2.105. This computation applies when each letter is assigned a single probabilistic state. The process of LTS rule discovery effectively splits the state 'c' into four context-defined substates: (-,c,-), (-,c,i), (-,c,e), (-,c,h). Each of these states emits only a single symbol. Rule addition is therefore an entropy reduction process; when the rule set is complete the letter-to-sound system has a perplexity of 1, i.e. it is perfectly predictable.</Paragraph> <Paragraph position="3"> The &quot;price paid&quot; for perfect predictability is a complex set of rule chains. To measure rule complexity we again associate a single state with each letter. But, instead of phonemes, the rules are the emission symbols. Thus the letter 'c' emits the symbols (K/_, TH/_i, TH/_e, CH/_h) with a distribution of (.651, .236, .085, .028), for a perplexity of 2.534. Applying equation (1) to the full set of rules defines the LTS system's average perplexity.</Paragraph> </Section> <Section position="3" start_page="234" end_page="236" type="sub_section"> <SectionTitle> 4.3 Empirical Measurements </SectionTitle> <Paragraph position="0"> In the Default & Refine representation, the rule chain for each letter is is initialized with its most probably production. Additional context-dependent rules are appended to cover additional letter productions, with the rule offering the greatest incremental coverage being added first. (Ties are broken in an implementation-dependent way.) Figure 1 uses Spanish to illustrate a characteristic pattern: the increase in coverage as rules are added one at a time. Since the figure of merit is letter-based, the upper curve (% letters correct) increases monotonically, while the middle curve (% words correct) can plateau or decrease briefly.</Paragraph> <Paragraph position="1"> In the lower curve of Figure 1 the growth procedure is constrained such that all width 1 rules are added before width 2 rules, which in turn must be exhausted before width 3 rules are considered.</Paragraph> <Paragraph position="2"> This constraint leads to its distinctive scalloped shape. The upper limit of the W=1 region shows the performance of the unaided default rules (68% words correct).</Paragraph> <Paragraph position="3"> function of rule size. For the lower curve, W indicates the rule context window width. The middle (blue) curve tracks near-optimal performance improvement with the introduction of new rules.</Paragraph> <Paragraph position="4"> For more complex languages the majority of rules have a context width in the range of 3 to 6. This is seen in Figure 2 for English, Dutch, Afrikaans, and Italian. However, a larger rule set does not mean that the average context width is greater. In Table 2, below, compare Italian to Dutch.</Paragraph> <Paragraph position="5"> Beyond a window width of 7, rule growth tapers off considerably. In this region most new rules serve to identify particular words of irregular spelling, as it is uncommon for long rules to generalize beyond a single instance. Thus when training a smoothed LTS rule system it is fair to ignore contexts larger than 7, as is done for example in the Festival synthesis system (Black, 1998).</Paragraph> <Paragraph position="6"> Figure 2 contrasts four languages with training data of around 40k words, but says nothing of how rule sets grow as the corpus size increases. Figure 3 summarizes measurements taken on eight encodings of seven languages (English twice, with and without stress marking), tested from a range of 100 words to over 100,000. Words were subsampled from each alphabetized lexicon at equal spacings.</Paragraph> <Paragraph position="7"> The results are interesting, and for us, unexpected. increased, for seven languages. From top to bottom: English (twice), Dutch, German, Afrikaans, Italian, Telugu, Spanish. The Telugu lexicon uses an Itrans-3 encoding into roman characters, not the native script, which is a nearly perfect syllabic transcription. The context window has a maximum width of 9 in these experiments.</Paragraph> <Paragraph position="8"> Within this experimental range none of the languages reach an asymptotic limit, though some hint at slowed growth near the upper end. A straight line on a log-log graph is characteristic of geometric growth, to which a power law function y=axb+c is an appropriate parametric fit. For difficult languages the growth rates (power exponent b) vary between 0.5 and 0.9, as summarized in Table 3. The language with the fastest growth is English, followed, not by Dutch, but Italian. Italian is nonetheless the simpler of these two, as indicated by the smaller multiplicative factor a.</Paragraph> <Paragraph position="9"> y=axb+c to the growth of LTS system size.</Paragraph> <Paragraph position="10"> It would be good if a tight ceiling could be estimated from partial data in order to know (and report to the lexicon builder) that with n rules defined the system is m percent complete. However, this trend of geometric growth suggests that asking &quot;how many letter-to-sound rules does a given language have?&quot; is an ill-posed question.</Paragraph> <Paragraph position="11"> In light of this, two questions are worth asking.</Paragraph> <Paragraph position="12"> First, is the geometric trend particular to our rule representation? And second, is &quot;total number of rules&quot; the right measure of LTS complexity? To answer the first question we repeated the experiments with the CART tree builder available from the Festival speech synthesis toolkit. As it turns out - see Table 4 - a comparison of contextual rules and node counts for Italian demonstrate that a CART tree representation also exhibits geometric growth with respect to lexicon size.</Paragraph> <Paragraph position="13"> Italian as the corpus size is increased. CART tree nodes (i.e. questions) are the element comparable to LTS rules used in letter context chains. The fitted parameters to the CART data are a=2.29 and b=0.765. This compares to a=2.16 and b=0.69.</Paragraph> <Paragraph position="14"> If geometric growth and lack of an obvious asymptote is not particular to expanding context rule chains, then what of the measure? The measure proposed in Section 4.2 is average chain perplexity. The hypothesis is that a system close to saturation will still add new rules, but that the average perplexity levels off. Instead, the data shows little sign of saturation (Figure 4). In contrast, the average perplexity of the letter-to-phoneme distributions remains level with corpus size (Figure 5). production perplexity as a function of lexicon size. Considering these observations we've resorted to the following heuristic to measure language complexity: a) fix the window width to 5, b) measure the average rule perplexity at lexicon sizes of 10k, 20k, and 40k, then c) take the average of these three values. Fixing the window width to 5 is somewhat arbitrary, but is intended to prevent the system from learning an unbounded suite of exceptions. Available values are contained in Table 5. The third (rightmost) column is the ratio of the second divided by the first. A purely phonetic system has a heuristic perplexity of one.</Paragraph> <Paragraph position="15"> From these measurements we conclude, for example, that Dutch and German are equally difficult, that English is 3 times more complex than either of these, and that English is 40 times more complex than Spanish.</Paragraph> </Section> </Section> <Section position="6" start_page="236" end_page="238" type="metho"> <SectionTitle> 5 Word Selection Strategies </SectionTitle> <Paragraph position="0"> A selection strategy is a method for choosing an ordered list of words from a lexicon. It may be based on an estimate of expected maximum return, or be as simple as random selection. A good strategy should enable rapid learning, avoid repetition, be robust, and not overtax the human verifier.</Paragraph> <Paragraph position="1"> This section compares competing selection strategies on a single lexicon. We've chosen a 10k Italian lexicon as a problem of intermediate difficulty, and focus on early stage learning. To provide a useful frame of reference, Figure 6 shows the results of running 5000 experiments in which the word sequence has been chosen randomly. The x-axis is number of letters examined.</Paragraph> <Paragraph position="2"> four deterministic strategies. They are: alphabetical word ordering, reverse alphabetical, alphabetical sorted by word length (groups of single character words first, followed by two character words, etc.), and a greedy ngram search. Of the first three, reverse alphabetical performs best because it introduces a greater variety of ngrams more quickly than the others. Yet, all of these three are substantially worse than random. Notice that grouping words from short to long degrades performance.</Paragraph> <Paragraph position="3"> This implies that strategies tuned to the needs of humans will incur a machine learning penalty.</Paragraph> <Paragraph position="4"> Figure 7. Comparison of three simple word orderings to the average random curve, as well as greedy ngram search.</Paragraph> <Paragraph position="5"> It might be expected that selecting words containing the most popular ngrams first would out-performs random, but as is seen in Figure 7, greedy selection closely tracks the random curve. This leads us to investigate active leaning algorithms, which we treat as variants of ngram selection.</Paragraph> <Section position="1" start_page="237" end_page="237" type="sub_section"> <SectionTitle> 5.1 Algorithm Description </SectionTitle> <Paragraph position="0"> Let W = {w1,w2,...} be the lexicon word set, having A = {'a', 'b',...} as the alphabet of letters. We seek an ordered list V = (... wi ...) s.t. score(wi) [?] score (wi+1). V is initially empty and is extended one word at a time with wb, the &quot;best&quot; new word. Let g=c1c2...cn ` A* be an ngram of length n, and Gw={gi}, gi ` w are all the ngrams found in word w. Then GW = 5 Gw, w ` W, is the set of all ngrams in the lexicon W, and GV = 5 Gw, w ` Vis the set of all ngrams in the selected word list V. The number of occurrences of g in W is score(g), while score(w) = [?] score(g) st. g ` w and g v GV. The scored ngrams are segmented into separately sorted lists, forming an ordered list of queues Q = (q1,q2,...qN) where qn contains ngram of length n and only n.</Paragraph> <Paragraph position="2"> In this search the outer loop orders ngrams by length, while the inner loop orders words by length. For selection based on ngram coverage, the queue Q is computed only once for the given lexicon W. In our active learner, Q is re-evaluated after each word is selected, based on the ngrams present in the current LTS rule contexts. Let GLTS = {gi} s.t. gi ` some letter context in the LTS rules.</Paragraph> <Paragraph position="3"> Initially GLTS,0 = {}. Then, at any iteration k, GLTS,k are the ngrams present in the rules, and G'LTS,k+1 is an expanded set of candidate ngrams that constitute the elements of Q. G' is formed by prepending each letter c of A to each g in G, plus appending each c to g. That is, G'LTS,k+1 = A%GLTS,k 4 GLTS,k%A where % is the Cartesian product. Executing the algorithm returns wb and yields GLTS,k+1 the set of ngrams covered by the expanded rule set. In this way knowledge of the current LTS rules guides the search for maximally informative new words.</Paragraph> </Section> <Section position="2" start_page="237" end_page="238" type="sub_section"> <SectionTitle> 5.2 Active Learner Performance </SectionTitle> <Paragraph position="0"> Figure 8 displays the performance of our active learner on the Italian 10k corpus, shown as the blue curve. For the first 500 characters encountered, the active learner's performance is almost everywhere better than average random, typically one half to one standard deviation above this reference level.</Paragraph> <Paragraph position="1"> Two other references are shown. Immediately above the active learner curve is &quot;Oracle&quot; word selection. The Oracle has access to the final LTS system and selects words that maximally increases coverage of the known rules. The topmost curve is for a &quot;Perfect Oracle.&quot; This represents an even more unrealistic situation in which each letter of each word carries with it information about the corresponding production rule. For example, that 'g' yields /F/ 10% of the time, when followed by the letter 'h' (as in &quot;laugh&quot;) . Carrying complete information with each letter allows the LTS system to be constructed directly and without mistake. In contrast, the non-perfect oracle makes mistakes sequencing rules in each letter's rule chain. This decreases performance.</Paragraph> <Paragraph position="2"> Italian, 10k dict, maxwin=5 word selection Oracle, our active learner, and average random performance. The perfect Oracle demarcates (impossibly high) optimal performance, while Oracle word selection suggests near-optimality. For comparison, standard deviation error bars are added to the random curve.</Paragraph> <Paragraph position="3"> Encouragingly, the active learning algorithm straddles the range in between average random (the baseline) and Oracle word selection (near-optimality). Less favorable is the non-monotonicity of the performance curve; for example, when the number of letters examined is 135, and 210. Analysis shows that these drops occur when a new letter-to-sound production is encountered but more than one context offers an equally likely explanation.</Paragraph> <Paragraph position="4"> Faced with a tie, the LTS learner sometimes chooses incorrectly. Not being aware of this mistake it does not seek out correcting words. Flat plateaus occur when additional words (containing the next most popular ngrams) do not contain previously unseen letter-to-sound productions.</Paragraph> </Section> </Section> class="xml-element"></Paper>