<?xml version="1.0" standalone="yes"?> <Paper uid="N03-3002"> <Title>The Importance of Prosodic Factors in Phoneme Modeling with Applications to Speech Recognition</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Prosodic Annotation </SectionTitle>
<Paragraph position="0"> The set of 38 different phonemes shown in figure 1 was used in the experiments.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Allophone Modeling </SectionTitle>
<Paragraph position="0"> Recognition experiments were performed for four different allophone sets: Tied, Accent, Boundary, and Untied.</Paragraph>
<Paragraph position="1"> The Tied set contained no prosodically labeled data.</Paragraph>
<Paragraph position="2"> The Accent set contained monophones that were split into two groups, accented and unaccented. Phonemes were not distinguished on the basis of phrase position.</Paragraph>
<Paragraph position="3"> The Boundary set modeled monophones as phrase initial, phrase medial, or phrase final. Accented phonemes were not distinguished from unaccented phonemes.</Paragraph>
<Paragraph position="4"> The Untied set distinguished phonemes by both phrasal position and accentuation. A monophone in this group could be labeled as phrase medial, phrase medial accented, phrase initial, phrase initial accented, phrase final, or phrase final accented.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Allophone Definitions </SectionTitle>
<Paragraph position="0"> Figure 2 contains the six different labels used to represent the allophones of a single imaginary phoneme &quot;phn.&quot; A phrase final phoneme was any phoneme that occurred in the nucleus or coda of the final syllable of a word directly preceding an intonational phrase boundary. A phrase initial phoneme, on the other hand, was any phoneme in the onset or nucleus of the initial syllable of a word that followed an intonational phrase boundary. All other phonemes were considered phrase medial.</Paragraph>
<Paragraph position="1"> An accented vowel was the lexically stressed vowel in a word containing a transcribed pitch accent. Because accented consonants are not clearly defined, three different labeled sets of accented consonants were developed. The first set treated the consonants of a syllable with an accented vowel as also being accented. After Vowel treated only the coda consonants of the accented syllable as accented. Before Vowel treated only the onset consonants of the accented syllable as accented. Accents were considered to be limited to a single syllable.</Paragraph>
<Paragraph position="2"> Because there were three different groups of accented consonants and because there is only one way a vowel can be labeled as accented, vowels were separated into a fourth group of their own, entitled Vowels. The four groups along with the four different allophone models led to the sixteen experimental conditions illustrated in figure 3.</Paragraph> </Section>
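To make the allophone definitions above concrete, the following minimal Python sketch assigns Untied labels to the phones of one word under the After Vowel condition. The syllable representation, the function name, and the label suffixes (_PI, _PM, _PF and a trailing "!") are illustrative assumptions, not the paper's implementation; the actual label set used in the experiments is the one defined in figure 2.

    # Sketch only: assumed syllable representation, not the paper's implementation.
    def label_phones(syllables, phrase_initial, phrase_final, pitch_accented):
        """Assign Untied allophone labels to the phones of one word.

        Phrase position: onset/nucleus phones of the first syllable of a word
        following an intonational phrase boundary are phrase initial (_PI),
        nucleus/coda phones of the final syllable of a word preceding a
        boundary are phrase final (_PF), and everything else is phrase medial
        (_PM).  Accent follows the After Vowel condition: only the stressed
        vowel and the coda consonants of the pitch-accented syllable get "!".
        """
        labels = []
        last = len(syllables) - 1
        for i, syl in enumerate(syllables):
            for part in ("onset", "nucleus", "coda"):
                for phone in syl[part]:
                    if phrase_initial and i == 0 and part in ("onset", "nucleus"):
                        pos = "_PI"
                    elif phrase_final and i == last and part in ("nucleus", "coda"):
                        pos = "_PF"
                    else:
                        pos = "_PM"
                    accented = (pitch_accented and syl["stressed"]
                                and part in ("nucleus", "coda"))
                    labels.append(phone + pos + ("!" if accented else ""))
        return labels

    # Example: "beyond" (b iy . y aa n d), stress on the second syllable,
    # occurring phrase finally in a pitch-accented word.
    beyond = [
        {"onset": ["b"], "nucleus": ["iy"], "coda": [], "stressed": False},
        {"onset": ["y"], "nucleus": ["aa"], "coda": ["n", "d"], "stressed": True},
    ]
    print(label_phones(beyond, phrase_initial=False, phrase_final=True,
                       pitch_accented=True))
    # -> ['b_PM', 'iy_PM', 'y_PM', 'aa_PF!', 'n_PF!', 'd_PF!']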
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Dictionaries and Transcription Types </SectionTitle>
<Paragraph position="0"> Each experimental condition required its own dictionary and transcription. Just as each phoneme had six distinct allophones, each word had six distinct types: a word could be phrase initial, medial, or final, and accented or unaccented. Each word type had its own definition.</Paragraph>
<Paragraph position="1"> An example dictionary is shown in figure 4.
[Figure 4: An example dictionary for the word &quot;beyond,&quot; defined with Untied allophones for the After Vowel experimental condition:
beyond      b iy y aa n d
beyond!     b iy y aa! n! d!
beyondB4    b iy y aaB4 nB4 dB4
beyondB4!   b iy y aaB4! nB4! dB4!
B4beyond    B4b B4iy y aa n d
B4beyond!   B4b B4iy y aa! n! d!
Boundary allophones could only be used to define three distinct word types, Accent only two, and Tied only one.]</Paragraph>
<Paragraph position="2"> Every experimental condition had both a word level transcription and a phone level transcription. Figure 5 shows an example of the two different levels of transcription files.
[Figure 5: word level and phone level transcriptions for the After Vowel accent condition; the transcribed word is &quot;wanted.&quot;]</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Experiments </SectionTitle>
<Paragraph position="0"> Experiments were performed using the Hidden Markov Toolkit (HTK), which is distributed by the University of Cambridge (2002). Phonemes were modeled using a three-state HMM with non-emitting start and end states. Each emitting state used a three-component Gaussian mixture, and no state skipping was allowed.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Procedure </SectionTitle>
<Paragraph position="0"> The Radio News Corpus data was divided into two sets: a training set and a test set. The test set was approximately 10% of the size of the training set. The experimental procedure was completed for all sixteen experimental conditions.</Paragraph>
<Paragraph position="1"> The experimental procedure can be divided into two steps. In step one, the training data was used to re-estimate the HMM definitions for each phoneme. Re-estimation was performed with the HTK tool HRest, which uses Baum-Welch re-estimation, described in detail in the HTK Book available from Cambridge University (2002). HMM parameters were re-estimated until either the log likelihood converged or HRest had performed 100 iterations of the re-estimation algorithm.</Paragraph>
<Paragraph position="2"> In the second step, HRest was used to perform a single iteration of the re-estimation algorithm on the test data, using the HMM definitions updated from the re-estimation of the training set. During this re-estimation, the log likelihood of each phoneme was output and saved for later comparison.</Paragraph> </Section>
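The two-step procedure of section 4.1 amounts to a small piece of control flow, sketched below in Python. This is a schematic stand-in for the actual HTK workflow: reestimate is a hypothetical callback representing one Baum-Welch pass (HRest in the paper), and the convergence threshold EPSILON is an assumption, since the paper states only the 100-iteration cap.

    # Schematic only: "reestimate" stands in for one Baum-Welch pass (HRest);
    # it must return the updated models and the total log likelihood.
    MAX_ITERATIONS = 100      # iteration cap stated in the paper
    EPSILON = 1e-4            # assumed convergence threshold (not given in the paper)

    def train_and_score(hmms, train_set, test_set, reestimate):
        # Step 1: re-estimate on the training set until the log likelihood
        # converges or 100 iterations have been performed.
        prev_ll = float("-inf")
        for _ in range(MAX_ITERATIONS):
            hmms, ll = reestimate(hmms, train_set)
            if abs(ll - prev_ll) < EPSILON:
                break
            prev_ll = ll
        # Step 2: a single re-estimation pass over the test set with the
        # trained models; the resulting log likelihood is saved for the
        # later merge/split comparisons.
        hmms, test_ll = reestimate(hmms, test_set)
        return hmms, test_ll

    # Toy usage with a fake re-estimator, just to show the control flow:
    fake = lambda models, data: (models, -1234.5)
    print(train_and_score({"phn": None}, "train.scp", "test.scp", fake))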
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Post Processing </SectionTitle>
<Paragraph position="0"> Once all the log likelihoods had been recorded, the Untied allophone sets were used as a basis for determining whether each monophone was better modeled as prosody independent or prosody dependent. To determine the best modeling strategy for a particular monophone, six different weighted averages (WA's) were calculated from the Untied log likelihoods and compared to the computed log likelihoods of the Boundary, Accent, and Tied models.</Paragraph>
<Paragraph position="1"> [Table: results of the experiments for the Accented allophone sets. The &quot;Merge&quot; column lists phonemes with WA >= LL; the &quot;Separate&quot; column indicates phonemes where WA < LL. Due to the relatively small size of the data set, several phonemes are missing from the table.] [Figure, panel b: the proposed modeling of Vowels; numbers 1-6 indicate six different distinguishable prosodic types.]</Paragraph>
<Paragraph position="2"> The following three formulas were used to calculate the WA's of the Untied set for comparison with the values computed for the Boundary set: WA_PM = W_PM L_PM + W_PM! L_PM!, WA_PI = W_PI L_PI + W_PI! L_PI!, and WA_PF = W_PF L_PF + W_PF! L_PF!, where PM, PI, and PF stand for phrase medial, initial, and final, respectively. L_x represents the computed log likelihood of the allophone label x in the Untied allophone set, and W_x = N_x / TOTAL represents the frequency of that label, where N_x is the number of examples of the token x and TOTAL is the sum over all the different phoneme tokens taken into account in the computation of the WA of some set of phonemes.</Paragraph>
<Paragraph position="3"> The two formulas used in calculating the WA's for comparison with the Accent allophone set are as follows: WA_accented = W_PI! L_PI! + W_PM! L_PM! + W_PF! L_PF! and WA_unaccented = W_PI L_PI + W_PM L_PM + W_PF L_PF, which are the weighted averages of the log likelihoods for the accented and unaccented tokens, respectively.</Paragraph>
<Paragraph position="4"> The WA compared to the Tied set was computed as WA_all, the sum of W_x L_x over all six Untied allophone labels x, which is the weighted average of all of the phonemes in the Untied model.</Paragraph>
<Paragraph position="5"> The weighted averages were then compared to the log likelihoods using the following rule: if (WA < LL), then split using prosodic labels; if (WA >= LL), then do not split using prosodic labels. LL is the log likelihood computed using HRest.</Paragraph> </Section> </Section> </Paper>
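As a worked illustration of the post-processing step, the Python sketch below computes the six weighted averages for a single phoneme from made-up Untied log likelihoods and token counts, following the definition WA = sum of W_x L_x with W_x = N_x / TOTAL, and applies the split/merge rule quoted above. The numbers are invented for illustration and are not results from the paper.

    # Made-up counts and log likelihoods for the six Untied allophones of one
    # phoneme (phrase initial/medial/final, unaccented or accented "!").
    counts = {"PI": 40, "PI!": 10, "PM": 120, "PM!": 30, "PF": 50, "PF!": 15}
    loglik = {"PI": -62.0, "PI!": -58.5, "PM": -60.2, "PM!": -57.9,
              "PF": -61.4, "PF!": -59.1}

    def weighted_average(labels):
        """WA = sum_x W_x * L_x with W_x = N_x / TOTAL over the given labels."""
        total = sum(counts[x] for x in labels)
        return sum(counts[x] / total * loglik[x] for x in labels)

    # Three WAs compared against the Boundary models (one per phrase position):
    wa_boundary = {pos: weighted_average([pos, pos + "!"])
                   for pos in ("PI", "PM", "PF")}
    # Two WAs compared against the Accent models (accented vs. unaccented):
    wa_accent = {"accented": weighted_average(["PI!", "PM!", "PF!"]),
                 "unaccented": weighted_average(["PI", "PM", "PF"])}
    # One WA compared against the Tied model (all six allophones):
    wa_tied = weighted_average(list(counts))

    def decide(wa, ll):
        """The paper's rule: split using prosodic labels when WA < LL,
        otherwise keep the merged (prosody-independent) model."""
        return "split" if wa < ll else "merge"

    # Example comparison against a hypothetical Tied-model log likelihood:
    print(wa_tied, decide(wa_tied, ll=-59.0))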