<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1222">
  <Title>Extracting Phoneme Pronunciation Information from Corpora</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Determining Alternative Pronunciations
</SectionTitle>
    <Paragraph position="0"> For each word in each sentence in the corpus, we match the transcribed phones with the phonemes in that word as recorded in a lexicon, and record the frequency of occurrence of each phoneme/phone match over the entire corpus. The major difficulties in this process are (1) transcriptions include extra phones that do not appear in the phoneme sequences corresponding to words in the lexicon, and (2) there are expected phones that were not pronounced. As a result, the phone sequences rarely align exactly with the phonemes corresponding to the words in the lexicon. This makes a simple alignment unreliable, and calls for a more flexible method of matching.</Paragraph>
    <Paragraph position="1"> Our method consists of examining the words of the corpus aligned with their lexicon entries, and choosing phoneme to phone pairs that we are confident are a match. This information is then used to make further matches in less certain areas. This process has three main steps: (1) aligning each corpus word with its corresponding lexicon entry, (2) building initial data structures, and (3) iteratively making additional certain matches.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Aligning Corpus Words with Lexicon
Entries
</SectionTitle>
      <Paragraph position="0"> We use the dynamic string alignment algorithm described by Sankoff and Kruskal (1983) to determine the minimum number of substitutions, insertions and deletions needed to turn one string into another.</Paragraph>
      <Paragraph position="1"> This algorithm can produce several edit sequences with the same cost, so the edit sequence with the highest number of exact matches is selected. Table 1 describes selected symbols from the ARPAbet symbol set (Shoup, 1980) used for representing phonemes and phones in TIMIT. 2 A typical alignment of words in a sentence is given in Figure 1. The first row contains phonemes from the lexicon words, and the second row contains phones from the corresponding words in a corpus sentence. Note the use of C to represent the sharing of the N phone between mayan and neoclassic due to co-articulation.</Paragraph>
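The alignment step described above can be sketched as a standard edit-distance computation with a traceback that prefers substitutions and exact matches on ties. This is an illustrative implementation, not the authors' code; the function name and data layout are our own:

```python
# Minimal sketch of the dynamic-programming string alignment
# (after Sankoff and Kruskal, 1983) used in Section 2.1.

def align(phonemes, phones):
    """Return a minimum-cost edit script aligning two symbol sequences.

    Each step is a (phoneme, phone) pair; '-' marks an insertion
    or a deletion.
    """
    n, m = len(phonemes), len(phones)
    # cost[i][j] = minimum edits to turn phonemes[:i] into phones[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (phonemes[i - 1] != phones[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back, preferring the diagonal (match/substitution) on ties,
    # which maximises exact matches among equal-cost edit sequences.
    steps, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (phonemes[i - 1] != phones[j - 1])):
            steps.append((phonemes[i - 1], phones[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            steps.append((phonemes[i - 1], '-'))
            i -= 1
        else:
            steps.append(('-', phones[j - 1]))
            j -= 1
    return list(reversed(steps))
```

For example, aligning the phonemes d ey of day against a transcription dcl d ey yields one insertion (the closure dcl) followed by two exact matches.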
      <Paragraph position="2"> The TIMIT transcriptions of sentences use some symbols that are not present in the lexicon entries.</Paragraph>
      <Paragraph position="3"> For example, stop closures (bcl, dcl, gcl, pcl, tcl, kcl) and releases (b, d, g, p, t, k) are provided in the corpus transcriptions, but only releases are used in the lexicon entries. In order to prevent this mismatch from causing the surrounding phones to be misaligned, we remove stop closures when they appear before stop releases. For example, in Figure 1 we have removed the kcl closure that preceded the first K phone in neoclassic, but the final kcl of that word was not removed because there was no release found for that closure (we don't want to eliminate the evidence of the k sound completely). [Table 1: selected ARPAbet symbols (Shoup, 1980) with example words, e.g., d (day), t (tea), k (key), s (sea), sh (she), z (zone), dh (then), n (noon), en (button), l (lay), y (yacht), hh (hay), el (bottle), iy (beet), ih (bit), eh (bet), ey (bait), ae (bat), ay (bite), ah (but), ow (boat), uw (boot), ux (toot), er (bird), ax (about), ix (debit), axr (butter).]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Building Initial Data Structures
</SectionTitle>
      <Paragraph position="0"> We scan through each aligned word pair in every sentence in the corpus and record certain matches and uncertain areas, generating frequency counts of the certain matches.</Paragraph>
      <Paragraph position="1"> A certain match is a pairing between a phoneme in a lexicon word and a phone in the same corpus word which we are confident represent the same intended sound. Initially, the certain matches are insertions, deletions and substitutions bounded immediately on the left and right by an exact match or the beginning or end of a word. Since we know the boundaries of words from the transcriptions, we can reliably match up phonemes on word boundaries provided they are bounded on the other side by an exact match. Exact matches are also recorded as certain matches. In the sentence in Figure 1, ow/ix in neoclassic is a certain match, as is ih/ix in disappeared, hh/ax and ay/aa in while, and ix/- in surveying. [Figure 2: (a) co-articulation of the words did you; (b) multiple-choice matches.]</Paragraph>
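As a rough sketch, assuming each aligned word is a list of (phoneme, phone) pairs produced in Step 1, the separation into certain matches and uncertain areas might look like this (the function name and data layout are our assumptions, not the paper's code):

```python
# Split one aligned word into certain matches and uncertain areas.
# A single non-exact operation bounded by exact matches or word edges
# is certain; a run of two or more operations is an uncertain area.

def split_matches(aligned):
    """Return (certain, uncertain_areas) for one aligned word."""
    certain, uncertain = [], []
    run = []  # current run of non-exact operations
    for phoneme, phone in aligned:
        if phoneme == phone:             # exact match: always certain
            if len(run) == 1:
                certain.append(run[0])   # single op bounded by matches/edges
            elif run:
                uncertain.append(run)    # two or more ops: uncertain area
            run = []
            certain.append((phoneme, phone))
        else:
            run.append((phoneme, phone))
    if len(run) == 1:                    # word edge also bounds a run
        certain.append(run[0])
    elif run:
        uncertain.append(run)
    return certain, uncertain
```

For instance, a lone ih/ix substitution between two exact matches is recorded as certain, whereas a run of three consecutive operations becomes one uncertain area.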
      <Paragraph position="2"> An uncertain area is a group of two or more operations (insert, delete or substitute) bounded by a certain match or the beginning or end of a word. Examples of uncertain areas in Figure 1 are the last three operations in ancient and the second and third operations in ruins. Uncertain areas potentially have matches within them, but we have not committed to an alignment at this stage.</Paragraph>
      <Paragraph position="3"> TIMIT uses a number of phonological rules for sharing and deleting phones on the boundaries of the words in the transcriptions (such as shown between the second and third word in Figure 1). These rules correspond to cases of co-articulation of phones (Giachin et al., 1991). Such co-articulated phones remove word boundaries, resulting in the concatenation of the end of a word with the beginning of the next word. For example, Figure 2(a) illustrates how co-articulation of the words did you renders the entire phone sequence an uncertain area. In contrast, the co-articulated N/N phones between mayan and neoclassic in Figure 1 constitute a certain match.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Making Additional Certain Matches
</SectionTitle>
      <Paragraph position="0"> We use frequency counts of certain matches obtained in Step 2 to generate additional certain matches from the uncertain areas. To this end, we consider each possible phoneme/phone match in each uncertain area, and select the phoneme/phone match with the highest frequency of certain matches. For instance, if the match n/en occurs more often than n/tcl, then n/en will be chosen in the uncertain area between SH and the end of ancient in Figure 1.</Paragraph>
      <Paragraph position="1"> Whenever there are multiple instances of the same phoneme in an uncertain area, it is matched to the phone that is positionally closest. For example, both d phonemes in the uncertain area in Figure 2(b) match the phones dx and dcl. In this case, we match the first d to dx and the second d to dcl. During this process, we consider only potential phoneme/phone matches, and we ignore any match involving a &amp;quot;-&amp;quot; symbol (which indicates an insertion or a deletion).</Paragraph>
      <Paragraph position="2"> Insertions and deletions that are certain matches are collected for statistical purposes but do not influence the match decision process.</Paragraph>
      <Paragraph position="3"> After a phoneme/phone match has been determined, the phonemes and phones of the uncertain area are shifted so that the matched phoneme and phone are lined up. This new match is then removed from the uncertain area and recorded as another certain match. This process can create other bounded matches that are also removed from the uncertain areas and added to the certain matches. For example, in Figure 1, finding the certain match ih/ix in disappeared in Step 2 suggests that the match to be made in the uncertain areas in ruins and neoclassic is ih/ix. The match in ruins in turn creates a uw/ux pairing on its left-hand side, which is also recorded as a certain match, eliminating this uncertain area completely. Similarly, the match in neoclassic yields the k/kcl match on its right-hand side. This step is repeated until the number of matches being made levels off.</Paragraph>
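The iterative loop of Step 3 can be sketched as follows, with uncertain areas simplified to lists of candidate (phoneme, phone) pairings. The bounded-match cascade described above (shifting and re-bounding neighbours) is omitted for brevity, and all names and data structures are illustrative assumptions rather than the paper's implementation:

```python
# Repeatedly promote the highest-frequency candidate in each uncertain
# area to a certain match, feeding the new match back into the counts,
# until no further matches are made.

from collections import Counter

def resolve(uncertain_areas, freq):
    """freq: Counter of (phoneme, phone) certain matches from Step 2."""
    promoted = []
    progress = True
    while progress:
        progress = False
        for area in uncertain_areas:
            candidates = [(freq[(pm, pn)], pm, pn)
                          for pm, pn in area
                          if pm != '-' and pn != '-']  # skip ins/del pairs
            if not candidates:
                continue
            best = max(candidates)
            if best[0] > 0:                 # only promote attested matches
                area.remove((best[1], best[2]))
                promoted.append((best[1], best[2]))
                freq[(best[1], best[2])] += 1
                progress = True
    return promoted
```

For example, if ih/ix has been seen as a certain match but uw/ux has not, only ih/ix is promoted from an area containing both candidates.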
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> For the TIMIT corpus, the number of certain matches levels off at 123115 from 109898 initial matches (after six iterations of Step 3), and the number of remaining uncertain matches falls from 13913 to 908. Figure 3 shows the percentage of phones (columns) found for every phoneme in the lexicon words (rows) after six iterations of matching attempts in each uncertain area. For example, between 12-16% of t and d are pronounced as dx.</Paragraph>
      <Paragraph position="1"> It is important to note that exact phoneme/phone matches registered frequencies in the 1000-2000 range, while most of the alternative pronunciations had frequencies of only a few dozen.</Paragraph>
      <Paragraph position="2"> Figure 3 suggests that phonemes are often mispronounced as phones in the same broad sound group, i.e., both the intended phonemes and the uttered phones are generated by the same physical method of production. This result is most evident for vowels. Some of the other types of sounds, such as fricatives and nasals, have not been differentiated so clearly.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Considering Context
</SectionTitle>
    <Paragraph position="0"> To investigate the effect of the context of a phone (the attributes of the phones before and after this phone) on its pronunciation, we examined sentences aligned like the sentence in Figure 1 and recorded phoneme/phone pairs, along with acoustic features for the phones that appear before and after each phoneme/phone pair.3 The context includes the following attributes: broad sound category, voiced/voiceless, sibilant/non-sibilant and sonorant/obstruent. Table 2 shows how phones are classified according to these acoustic features, following (Yannakoudakis and Hutton, 1987) and (Rabiner and Juang, 1993).</Paragraph>
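A hypothetical sketch of how such context records might be assembled is given below; the tiny feature tables are illustrative stand-ins for the full classification in Table 2, and all names are our own:

```python
# Build a context record for one phoneme/phone pair: the attributes of
# the neighbouring phones, plus the uttered phone and intended phoneme.

BROAD = {'s': 'fricative', 'z': 'fricative', 'n': 'nasal',
         'iy': 'vowel', 'd': 'stop'}           # tiny excerpt, not Table 2
VOICED = {'s': False, 'z': True, 'n': True, 'iy': True, 'd': True}

def context_features(phone):
    if phone is None:                 # word/sentence boundary or missing
        return ('boundary', None)
    return (BROAD.get(phone, 'other'), VOICED.get(phone))

def context_record(prev_phone, pair, next_phone):
    phoneme, phone = pair
    return {'phone': phone, 'intended': phoneme,
            'prev': context_features(prev_phone),
            'next': context_features(next_phone)}
```

A record for an uttered d intended as t at the end of a word, for example, carries a boundary marker as its right context.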
    <Paragraph position="1"> The contexts for all the phones were fed into an inductive inference program by Wallace and Patrick (1993) in order to find functions of the context attributes (i.e., acoustic features) that are good predictors of the phoneme intended by a speaker when s/he utters a particular phone. The inference program uses the Minimum Message Length (MML) principle (Georgeff and Wallace, 1984) to measure the significance of these functions. These functions are realized by a decision tree from the contexts for each uttered phone, with nodes splitting on the values of particular attributes. 4 Leaf nodes in a decision tree represent a partition of the context sample. Each leaf node contains a collection of contexts in the sample and the phoneme intended by a speaker for each context. An internal node of the tree is split on an attribute only when doing so creates a statistically significant reduction in the number of different types of intended phonemes in the leaf nodes (better than that expected from random effects). Ideally, each leaf node should contain several contexts, all of which have the same intended phoneme. This means that the attributes these contexts have in common are sufficient to predict this intended phoneme from an uttered phone.</Paragraph>
    <Paragraph position="2"> The MML principle is based on the following premise: if a sender knows both the attribute values and the class of the objects in a set, and wants to send the class of each object to a receiver (who knows the attribute values but not the classes), the sender aims to send the shortest possible message (in bits). The MML criterion is used to produce the decision tree that can be sent by means of the shortest possible message. A split is made in the decision tree only if it decreases the message length for transmitting the intended information. The decision tree is then sent in a two-part message: (1) instructions for the receiver on how to reconstruct the tree; and [Footnote 3: Uncertain areas were treated as a single phoneme/phone pair involving complicated sounds.]</Paragraph>
    <Paragraph position="3"> (2) the labels for the classes of the objects in the leaves of the tree. Standard coding techniques show that the encoded set of labels for the classes in a leaf node will be short if most of the objects in that leaf node have the same label, and will be longer if the labels of the objects are equally likely. Therefore, if the objects in each leaf node are predominantly of one class, then the message encoding the decision tree will be shorter than a message which simply encodes the class label for each object in the set (without the tree). [Footnote 4: Words which are common in the corpus generate contexts that appear with high frequency. We assume that these frequencies are representative of those in English.]</Paragraph>
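The message-length argument can be illustrated with a back-of-the-envelope computation: the cost in bits of sending a leaf's labels is roughly its sample size times the entropy of its label distribution, so a split pays off only when the children's label cost plus the cost of describing the split beats the parent's. The fixed split cost below is an assumption for illustration, not the actual MML coding used by the inference program:

```python
# Entropy-based approximation of the label-transmission cost for a leaf,
# and a toy version of the split-acceptance test.

import math
from collections import Counter

def label_bits(labels):
    """Approximate bits needed to transmit the class labels in a leaf."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c * math.log2(c / total) for c in counts.values())

def split_is_worthwhile(parent, left, right, split_cost_bits=8.0):
    """Accept a split only if it shortens the total message."""
    return (label_bits(left) + label_bits(right) + split_cost_bits
            < label_bits(parent))
```

A mixed leaf of 50 t and 50 d labels costs about 100 bits; splitting it into two pure leaves costs nothing for the labels, so the split is accepted even after paying to describe it, whereas the same split on a leaf of only 8 samples is not.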
    <Paragraph position="4"> Given an uttered phone, the resulting decision tree shows the significant attributes and values for classifying the intended phoneme, which is the effect that the surrounding sounds have on predicting the intended phoneme.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Evaluation
</SectionTitle>
      <Paragraph position="0"> As indicated in Figure 3, in the majority of cases the uttered phone and the intended phoneme are the same. Table 3 summarizes the decision trees for situations where uttered phones are different from intended phonemes. This summary shows the effect of contextual phonetic information on the intended phoneme for each possible uttered phone (one line per phone). For example, for the uttered phone d, the intended phoneme was t when the next sound was neither a consonant nor a vowel, i.e., a word boundary, a sentence boundary or missing; this occurred in 12 of the samples. Also, the intended phoneme was missing (i.e., d was uttered when no phoneme was intended) when the next sound was an obstruent; this happened in 2 of the samples. In 220 samples, the uttered phone was iy with intended phoneme ax when iy was the last sound in a word, and the previous sound was a fricative.</Paragraph>
      <Paragraph position="1"> Some uttered phones found in Figure 3 are missing from Table 3 because either there were not enough samples to create a tree (the stops b, g and p, the nasal m, and the pause), or more often because the trees produced had no discriminatory power. This occurred when each leaf node in a decision tree had an evenly spread mixture of intended phonemes, or when the same intended phoneme appeared throughout the tree.</Paragraph>
      <Paragraph position="2"> The decision trees were evaluated using test contexts from 1344 sentences (out of 5040 sentences) spoken by 26% of the speakers. No speaker appeared in both the test and training sets. Each phone and its context was classified into a leaf node using the attributes in the context. A phoneme prediction was considered correct when the intended phoneme was the same as the most common phoneme in the leaf node (determined during training). 75% of the test samples were predicted correctly.</Paragraph>
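The evaluation rule above can be sketched as follows; the leaf contents and identifiers are invented for illustration, and the names are our own:

```python
# Predict the intended phoneme for a test context as the most common
# intended phoneme observed in its leaf during training, then score
# the fraction of correct predictions.

from collections import Counter

def leaf_prediction(training_phonemes):
    return Counter(training_phonemes).most_common(1)[0][0]

def accuracy(test_items, leaves):
    """test_items: list of (leaf_id, intended_phoneme) pairs.

    leaves: mapping from leaf_id to the intended phonemes seen there
    during training.
    """
    correct = sum(leaf_prediction(leaves[leaf]) == intended
                  for leaf, intended in test_items)
    return correct / len(test_items)
```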
    </Section>
  </Section>
</Paper>