<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1008"> <Title>Statistical Modeling for Unit Selection in Speech Synthesis</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experimental results </SectionTitle> <Paragraph position="0"> We used the AT&T Natural Voices Product speech synthesis system to synthesize 107,987 AP news articles, generating a large corpus of 8,731,662 unit sequences representing a total of 415,227,388 units.</Paragraph> <Paragraph position="1"> We used this corpus to build several n-gram Katz backoff language models with n = 2 or 3. Table 1 gives the size of the resulting language model weighted automata. These language models were built using the GRM Library (Allauzen et al., 2004).</Paragraph> <Paragraph position="2"> We evaluated these models by using them to synthesize an AP news article of 1,000 words, corresponding to 8250 units or 6 minutes of synthesized speech. Table 2 gives the unit selection time (in seconds) taken by our new system to synthesize this AP system when used to synthesize the same AP news article.</Paragraph> <Paragraph position="3"> news article. Experiments were run on a 1GHz Pentium III processor with 256KB of cache and 2GB of memory. The baseline system mentioned in this table is the AT&T Natural Voices Product which was also used to generate our training corpus using the concatenation cost caching method from (Beutnagel et al., 1999b). For the new system, both the computation times due to composition and to the search are displayed. Note that the AT&T Natural Voices Product system was highly optimized for speed. In our new systems, the standard research software libraries already mentioned were used. The search was performed using the standard speech recognition Viterbi decoder from the DCD library (Allauzen et al., 2003). With a trigram language model, our new statistical unit selection system was about 2.6 times faster than the baseline system.</Paragraph> <Paragraph position="4"> A formal test using the standard mean of opinion score (MOS) was used to compare the quality of the high-quality AT&T Natural Voices Product synthesizer and that of the synthesizers based on our new unit selection system with shrunken and unshrunken trigram language models. In such tests, several listeners are asked to rank the quality of each utterance from 1 (worst score) to 5 (best). The MOS results of the three systems with 60 utterances tested by 21 listeners are reported in Table 3 with their correspond-Model raw score normalized score system, the mean and standard error of the raw and the listener-normalized scores.</Paragraph> <Paragraph position="5"> ing standard error. The difference of scores between the three systems is not statistically significant (first column), in particular, the absolute difference between the two best systems is less than .1.</Paragraph> <Paragraph position="6"> Different listeners may rank utterances in different ways. Some may choose the full range of scores (1-5) to rank each utterance, others may select a smaller range near 5, near 3, or some other range. To factor out such possible discrepancies in ranking, we also computed the listener-normalized scores (second column of the table). This was done for each listener by removing the average score over the full set of utterances, dividing it by the standard deviation, and by centering it around 3. The results show that the difference between the normalized scores of the three systems is not significantly different. 
<Paragraph position="7"> We also measured the similarity of the two best systems by comparing the number of common units they produce for each utterance. On the AP news article already mentioned, more than 75% of the units were common to both systems.</Paragraph>
</Section>
</Paper>