<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1001">
  <Title>Learning Phonological Rule Probabilities from Speech Corpora with Exploratory Computational Phonology</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PRONLEX (COMLEX 1994), BRITPRON (Robinson 1994)
</SectionTitle>
    <Paragraph position="0"> A text-to-speech system was used to generate phone sequences from word orthography as an additional source of pronunciations.2</Paragraph>
    <Paragraph position="1"> 2Although it was not relevant to the experiments described here, our lexicon also included two sources which directly supply surface forms. These were 13,362 hand-transcribed pronunciations of 5871 words from TIMIT (TIMIT 1990), and 230 pronunciations of 36 words derived in-house from the OGI Numbers database (Cole et al. 1994).</Paragraph>
    <Paragraph position="2"> [Table 2: IPA, ARPABET, and ICSI symbols for the phone set, covering the stops and their closures (b/bcl, d/dcl, g/gcl, p/pcl, t/tcl, k/kcl), vowels, fricatives, affricates, nasals, approximants, the syllabic consonants (el, em, en), reduced vowels (ax, ix, axr), flap (dx), voiced h (hv), and silence (h#). The set is based on ARPABET and was expanded to include syllabics, stop closures, reduced vowels, the alveolar flap, and voiced h.]</Paragraph>
    <Paragraph position="3"> We represent pronunciations with the set of 54 ARPABET-like phones detailed in Table 2. All the lexicon sources except LIMSI use ARPABET-like phone sets.3 The CMU, BRITPRON, and PRONLEX phone sets include three levels of vowel stress. The pronunciations from all these sources were mapped into our phone set using a set of obligatory rules for stop closures [bcl, dcl, gcl, pcl, tcl, kcl], and optional rules to introduce the syllabic consonants [el, em, en], reduced vowels [ax, ix, axr], voiced h [hv], and alveolar flap [dx].</Paragraph>
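The obligatory stop-closure mapping described above can be sketched as follows; the function name and closure table are illustrative, not the paper's actual code.

```python
# Each stop phone gets its obligatory closure inserted before it
# (b -> bcl b, t -> tcl t, etc.), per the mapping described above.
CLOSURES = {"b": "bcl", "d": "dcl", "g": "gcl",
            "p": "pcl", "t": "tcl", "k": "kcl"}

def insert_stop_closures(phones):
    """Map a source-dictionary phone string into the surface phone set
    by inserting the obligatory closure before every stop."""
    out = []
    for ph in phones:
        if ph in CLOSURES:
            out.append(CLOSURES[ph])
        out.append(ph)
    return out

print(insert_stop_closures(["b", "ah", "t", "er"]))
# ['bcl', 'b', 'ah', 'tcl', 't', 'er']
```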
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Applying Phonological Rules to Build a
Surface Lexicon
</SectionTitle>
      <Paragraph position="0"> We next apply phonological rules to our base lexicon to produce the surface lexicon. (Footnote 3: The LIMSI pronunciations already included the syllabic consonants and reduced vowels. For this reason, the words found only in the LIMSI source lexicon did not participate in the probability estimates for the syllabic and reduced-vowel rules.) Since the rules</Paragraph>
      <Paragraph position="1">
  [-stress] [aa ae ah ao eh er ey ow uh] → ax
  [-stress] [iy ih uw] → ix
  [-stress] er → axr
  [ax ix] n → en
  [ax ix] m → em
  [ax ix] l → el
  [ax ix] r → axr
  [tcl dcl] [t d] → dx / V __ [ax ix axr]
  [tcl dcl] [t d] → dx / V r __ [ax ix axr]
  hh → hv / [+voice] __ [+voice]</Paragraph>
    </Section>
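The context-dependent rules of Table 3 can be encoded as simple pattern matchers over phone strings. The sketch below implements only the first flapping rule ([tcl dcl] [t d] → dx / V __ [ax ix axr]); all function names are hypothetical.

```python
# Vowel set used for the flapping rule's left context; illustrative only.
VOWELS = {"aa", "ae", "ah", "ao", "eh", "er", "ey", "ih", "iy",
          "ow", "uh", "uw", "ax", "ix", "axr"}

def flap_sites(phones):
    """Return indices where flapping could apply: a closure+stop pair
    between a vowel and a reduced vowel."""
    sites = []
    for i in range(1, len(phones) - 2):
        if (phones[i] in ("tcl", "dcl") and phones[i + 1] in ("t", "d")
                and phones[i - 1] in VOWELS
                and phones[i + 2] in ("ax", "ix", "axr")):
            sites.append(i)
    return sites

def apply_flap(phones, i):
    """Replace the closure+stop pair starting at i with the flap dx."""
    return phones[:i] + ["dx"] + phones[i + 2:]

p = ["bcl", "b", "ah", "tcl", "t", "axr"]
print(flap_sites(p))       # [3]
print(apply_flap(p, 3))    # ['bcl', 'b', 'ah', 'dx', 'axr']
```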
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Table 3: Phonological Rules
</SectionTitle>
    <Paragraph position="0"> are optional, the surface lexicon must contain each underlying pronunciation unmodified, as well as the pronunciation resulting from the application of each relevant phonological rule. Table 3 gives the 10 phonological rules used in these experiments.</Paragraph>
    <Paragraph position="1"> One goal of our rule-application procedure was to build a tagged lexicon to avoid having to implement a phonological-rule parser to parse the surface pronunciations. In a tagged lexicon, each surface pronunciation is annotated with the names of the phonological rules that applied to produce it. Thus when the speech recognizer finds a particular pronunciation in the speech input, the list of rules which applied to produce it can simply be looked up in the tagged lexicon.</Paragraph>
    <Paragraph position="2"> The algorithm applies rules to pronunciations recursively; when a context matches the left-hand side of a phonological rule RULE, two pronunciations are produced: one unchanged by the rule (marked -RULE), and one with the rule applied (marked +RULE). The procedure places the +RULE pronunciation on the queue for later recursive rule application, and continues trying to apply phonological rules to the -RULE pronunciation. See Figure 1 for details of the algorithm. While our procedure is not guaranteed to terminate, in practice the phonological rules we apply have a finite recursive depth. The nondeterministic mapping produces a tagged, equiprobable, multiple-pronunciation lexicon of 510,000 pronunciations for 160,000 words. For example, Table 4 gives our base forms for the word.
  [Figure 1:]
  For each lexical item L, do:
    Place all base prons of L onto queue Q
    While Q is not empty, do:
      Dequeue pronunciation P from Q
      For each phonological rule R, do:
        If context of R could apply to P:
          Apply R to P, giving P'
          Tag P' with +R, put on queue
          Tag P with -R
      Output P with tags
The resulting tagged surface lexicon would have the entries in Table 5.</Paragraph>
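Under the simplifying assumption that each rule is a (find-sites, apply) pair and that only the first matching site is expanded, the Figure 1 procedure might be sketched as:

```python
from collections import deque

def expand(base_prons, rules):
    """Queue-based recursive rule application: for each applicable rule,
    queue the +RULE variant for further expansion and keep working on
    the -RULE variant. Returns (pronunciation, tags) pairs."""
    surface = []
    q = deque((p, []) for p in base_prons)
    while q:
        pron, tags = q.popleft()
        for name, find, apply_ in rules:
            sites = find(pron)
            if sites:
                q.append((apply_(pron, sites[0]), tags + ["+" + name]))
                tags = tags + ["-" + name]
        surface.append((pron, tags))
    return surface

# Toy optional rule: hh -> hv (the voicing context check is elided here).
def find_hv(pron):
    return [i for i, ph in enumerate(pron) if ph == "hh"]

def apply_hv(pron, i):
    return pron[:i] + ["hv"] + pron[i + 1:]

lex = expand([["ax", "hh", "eh", "dx"]], [("HV", find_hv, apply_hv)])
for pron, tags in lex:
    print(pron, tags)
```

Both the unmodified (-HV) and rule-applied (+HV) pronunciations appear in the output, as the surface lexicon requires.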
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Filtering with forced-Viterbi
</SectionTitle>
      <Paragraph position="0"> Given a lexicon with tagged surface pronunciations, the next required step is to count how many times each of these pronunciations occurs in a speech corpus. The algorithm we use has two steps:</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="4" type="metho">
    <SectionTitle>
PHONETIC LIKELIHOOD ESTIMATION and FORCED-
VITERBI ALIGNMENT.
</SectionTitle>
    <Paragraph position="0"> In the first step, PHONETIC LIKELIHOOD ESTIMATION, we examine each 20ms frame of speech data, and probabilistically label each frame with the phones that were likely to produce the data. That is, for each of the 54 phones in our phone set, we compute the probability that the slice of acoustic data was produced by that phone. The result of this labeling is a vector of phone likelihoods for each acoustic frame.</Paragraph>
    <Paragraph position="1"> Our algorithm is based on a multi-layer perceptron (MLP) which is trained to compute the conditional probability of a phone given an acoustic feature vector for one frame, together with 80 ms of surrounding context. Bourlard &amp; Morgan (1991) and Renals et al. (1991) show that with a few assumptions, an MLP may be viewed as estimating the probability P(q|x) where q is a phone and x is the input acoustic speech data. The estimator consists of a simple three-layer feed-forward MLP trained with the back-propagation algorithm (see Figure 2). The input layer consists of 9 frames of input speech data. Each frame, representing 10 msec of speech, is typically encoded by 9 PLP (Hermansky 1990) coefficients, 9 delta-PLP coefficients, 9 delta-delta-PLP coefficients, and delta-energy and delta-delta-energy terms. Typically, we use 500-4000 hidden units. The output layer has one unit for each phone.
  [Table 5:]
  bcl b ah dx ax: +BPU +FL1; +CMU +FL1 +RV1; +PLX +FL1 +RV1
  bcl b ah dx axr: +TTS +FL1; +BPU +FL1; +CMU +FL1 -RV1 +RV3; +LIM +FL1; +PLX +FL1 -RV1 +RV3
  bcl b ah tcl t ax: +BPU -FL1; +CMU -FL1 +RV1; +PLX -FL1 +RV1
  bcl b ah tcl t axr: +TTS -FL1; +BPU -FL1; +CMU -FL1 -RV1 +RV3; +LIM -FL1; +PLX -FL1 -RV1 +RV3
  bcl b ah tcl t er: +CMU -RV1 -RV3; +PLX -RV1 -RV3</Paragraph>
    <Paragraph position="2"> The MLP is trained on phonetically hand-labeled speech (TIMIT), and then further trained by an iterative Viterbi procedure (forced-Viterbi providing the labels) with Wall Street Journal corpora.</Paragraph>
    <Paragraph position="3"> [Figure 2: The MLP phone-probability estimator.] The probability P(q|x) produced by the MLP for each frame is first converted to the likelihood P(x|q) by dividing by the prior P(q), according to Bayes' rule; we ignore P(x) since it is constant here:
  P(x|q) / P(x) = P(q|x) / P(q)</Paragraph>
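As a toy illustration of this conversion (the posterior and prior values are invented), each frame's posterior vector is divided elementwise by the phone priors to give scaled likelihoods:

```python
# Convert MLP posteriors P(q|x) into scaled likelihoods P(x|q)/P(x)
# by dividing by the phone priors P(q). All values are made up.
posteriors = {"t": 0.60, "d": 0.25, "dx": 0.15}   # P(q|x) for one frame
priors = {"t": 0.40, "d": 0.35, "dx": 0.25}       # P(q) from training data

scaled = {q: posteriors[q] / priors[q] for q in posteriors}
print(scaled)
```

Note that a phone with a high posterior can still have a modest scaled likelihood if it is also frequent a priori.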
    <Paragraph position="5"> The second step of the algorithm, FORCED-VITERBI ALIGNMENT, takes this vector of likelihoods for each frame and produces the most likely phonetic string for the sentence. If each word had only a single pronunciation and if each phone had some fixed duration, the phonetic string would be completely determined by the word string. However, phones vary in length as a function of idiolect and rate of speech, and of course the very fact of optional phonological rules implies multiple possible pronunciations for each word. These pronunciations are encoded in a hidden Markov model (HMM) for each word.</Paragraph>
    <Paragraph position="6"> The Viterbi algorithm is a dynamic programming search, which works by computing for each phone at each frame the most likely string of phones ending in that phone. Consider a sentence whose first two words are "of the", and assume the simplified lexicon in Figure 3.</Paragraph>
    <Paragraph position="8"> Each pronunciation of the words 'of' and 'the' is represented by a path through the probabilistic automaton for the word. For expository simplicity, we have made the (incorrect) assumption that consonants have a duration of 1 frame, and vowels a duration of 2 or 3 frames. The algorithm analyzes the input frame by frame, keeping track of the best path of phones. Each path is ranked by its probability, which is computed by multiplying each of the transition probabilities and the phone probabilities for each frame. Figure 4 shows a schematic of the path computation. The size of each dot indicates the magnitude of the local phone likelihood. The maximum path at each point is extended; non-maximal paths are pruned.</Paragraph>
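A minimal forced-Viterbi sketch under similar simplifications (one HMM state per phone, with a uniform self-loop/advance probability); the function and its input format are assumptions for illustration, not the recognizer's actual interface.

```python
import math

def forced_viterbi(pron, frame_loglik, self_loop=math.log(0.5)):
    """pron: list of phones; frame_loglik: one {phone: log-likelihood}
    dict per frame. Returns the best phone label for each frame, forced
    to traverse the pronunciation left to right."""
    n_frames, n_states = len(frame_loglik), len(pron)
    NEG = float("-inf")
    # delta[s] = best log score of paths ending in state s at this frame
    delta = [NEG] * n_states
    delta[0] = frame_loglik[0][pron[0]]
    back = []
    for t in range(1, n_frames):
        new, ptr = [NEG] * n_states, [0] * n_states
        for s in range(n_states):
            stay = delta[s] + self_loop                     # repeat phone
            advance = delta[s - 1] + self_loop if s else NEG  # next phone
            best, ptr[s] = max((stay, s), (advance, s - 1))
            new[s] = best + frame_loglik[t][pron[s]]
        delta, back = new, back + [ptr]
    # backtrace from the final state (forced: must end in the last phone)
    s, states = n_states - 1, [n_states - 1]
    for ptr in reversed(back):
        s = ptr[s]
        states.append(s)
    return [pron[s] for s in reversed(states)]

# Made-up per-frame log-likelihoods for a two-phone pronunciation.
ll = [{"ah": -1.0, "v": -3.0},
      {"ah": -1.2, "v": -2.5},
      {"ah": -3.0, "v": -0.8}]
print(forced_viterbi(["ah", "v"], ll))   # ['ah', 'ah', 'v']
```

Non-maximal predecessors are pruned at every frame, exactly as in the path computation described above.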
    <Paragraph position="9"> The result of the forced-Viterbi alignment on a single sentence is a phonetic labeling for the sentence (see Figure 5 for an example), from which we</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Wall Street Journal sentence
</SectionTitle>
      <Paragraph position="0"> can produce a phonetic pronunciation for each word.</Paragraph>
      <Paragraph position="1"> By running this algorithm on a large corpus of sentences, we produce a list of "bottom-up" pronunciations for each word in the corpus.</Paragraph>
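Tallying these bottom-up pronunciations over the corpus might look like the following; the alignment output format is hypothetical.

```python
from collections import Counter

# Each aligned sentence yields one (word, pronunciation) pair per token;
# counting them gives per-pronunciation corpus frequencies.
alignments = [("about", "ax bcl b aw tcl t"),
              ("about", "ax bcl b aw dx"),
              ("about", "ax bcl b aw tcl t")]

counts = Counter(alignments)
print(counts[("about", "ax bcl b aw tcl t")])   # 2
```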
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
2.4 Rule probability estimation
</SectionTitle>
      <Paragraph position="0"> The rule-tagged surface lexicon described in §2.1 and the counts derived from the forced-Viterbi described in §2.3 can be combined to form a tagged lexicon that also has counts for each pronunciation of each word. Following is a sample entry from this lexicon for the word Adams, which shows the five derivations for its single pronunciation: Adams: ae dx ax m z: count=2</Paragraph>
      <Paragraph position="1"> To produce the initial rule probabilities, we need to count the number of times each rule applies, out of the number of times it had the potential to apply. If each pronunciation only had a single derivation, this would be computed simply as follows:
  P(r) = Count(r applied) / Count(r could have applied)</Paragraph>
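For this single-derivation case, the count ratio is straightforward to compute; the tags and counts below are invented for illustration.

```python
def rule_prob(entries, rule):
    """entries: list of (tags, count) pairs, with tags like '+FL1' or
    '-FL1'. Returns applications / opportunities for the given rule."""
    applied = sum(c for tags, c in entries if "+" + rule in tags)
    possible = sum(c for tags, c in entries
                   if "+" + rule in tags or "-" + rule in tags)
    return applied / possible

entries = [(["+FL1"], 30), (["-FL1"], 10), (["+RV1"], 5)]
print(rule_prob(entries, "FL1"))   # 0.75
```

Entries in which the rule's context never arose (like the +RV1 entry here) contribute to neither the numerator nor the denominator.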
      <Paragraph position="3"> However, since each pronunciation can have multiple derivations, the counts for each rule from each derivation need to be weighted by the probability of the derivation. The derivation probability is computed simply by multiplying together the probability of each of the applications or non-applications of the rule. Let
  * DERIVS(p) be the set of all derivations of a pronunciation p,
  * POSRULES(p, r, d) be 1.0 if derivation d of pronunciation p uses rule r, else 0,</Paragraph>
      <Paragraph position="4"> * ALLRULES(p, r) be the count of all derivations of p in which rule r could have applied (i.e., in which d has either a +R or a -R tag),</Paragraph>
      <Paragraph position="5"> * P(d|p) be the probability of the derivation d of pronunciation p,</Paragraph>
      <Paragraph position="6"> * PRON be the set of pronunciations derived from the forced-Viterbi output.</Paragraph>
      <Paragraph position="7"> Now a single iteration of the rule-probability algorithm must perform the following computation:
  P(r) = [ Σ_{p ∈ PRON} Σ_{d ∈ DERIVS(p)} P(d|p) · POSRULES(p, r, d) ] / [ Σ_{p ∈ PRON} Σ_{d ∈ DERIVS(p): r could apply in d} P(d|p) ]</Paragraph>
      <Paragraph position="9"> Since we have no prior knowledge, we make the zero-knowledge initial assumption that P(d|p) = 1/|DERIVS(p)|. The algorithm can then be run as a successive estimation-maximization to provide successive approximations to P(d|p). For efficiency reasons, we actually compute the probabilities of all rules in parallel, as shown in Figure 6.</Paragraph>
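A compact sketch of one such estimation-maximization pass; the data layout (pronunciation id, Viterbi count, derivations as lists of signed rule tags) is a hypothetical encoding, not the paper's own representation.

```python
def em_step(prons, deriv_prob):
    """Weighted rule counting: each derivation contributes its P(d|p)
    times the pronunciation's Viterbi count. Returns {rule: P(rule)}."""
    applied, possible = {}, {}
    for pid, count, derivs in prons:
        for di, tags in enumerate(derivs):
            w = count * deriv_prob[(pid, di)]
            for tag in tags:
                rule = tag[1:]
                possible[rule] = possible.get(rule, 0.0) + w
                if tag.startswith("+"):
                    applied[rule] = applied.get(rule, 0.0) + w
    return {r: applied.get(r, 0.0) / possible[r] for r in possible}

def update_deriv_prob(prons, rule_probs):
    """Re-estimate P(d|p) as the product of rule (non-)application
    probabilities, renormalized within each pronunciation."""
    new = {}
    for pid, _count, derivs in prons:
        scores = []
        for tags in derivs:
            s = 1.0
            for tag in tags:
                p = rule_probs[tag[1:]]
                s *= p if tag.startswith("+") else (1.0 - p)
            scores.append(s)
        z = sum(scores) or 1.0
        for di, s in enumerate(scores):
            new[(pid, di)] = s / z
    return new

# Start from the zero-knowledge uniform assumption P(d|p) = 1/|DERIVS(p)|.
prons = [("p1", 2, [["+FL1", "+RV1"], ["+FL1", "-RV1"]]),
         ("p2", 1, [["-FL1"]])]
dp = {("p1", 0): 0.5, ("p1", 1): 0.5, ("p2", 0): 1.0}
probs = em_step(prons, dp)
dp = update_deriv_prob(prons, probs)
print(probs)
```

Iterating the two functions alternates the weighted rule counts with refreshed derivation probabilities, giving successive approximations to P(d|p) as described.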
      <Paragraph position="10"> [Figure 6:]
  For each word/pron pair P ∈ PRON from forced-Viterbi alignment:
    Let DERIVS(P) be the set of rule derivations of P
    For every d ∈ DERIVS(P):
      For every rule R ∈ d:
        if (R = +RULE)</Paragraph>
    </Section>
  </Section>
</Paper>