<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1046"> <Title>Pronouncing Text by Analogy</Title> <Section position="6" start_page="270" end_page="271" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="270" end_page="271" type="sub_section"> <SectionTitle> 5.1 Pseudowords </SectionTitle> <Paragraph position="0"> Pronunciations have been obtained for: * the 70 pseudowords from Glushko (1979) used by Dedina and Nusbaum to test PRONOUNCE. The 'correct' pronunciation for these strings is taken to be that given by Dedina and Nusbaum (1991, pp. 61-62). We refer to this test set as D&N 70.</Paragraph> <Paragraph position="1"> * the full set of 131 pseudowords from Glushko, plus two others (goot, pome), plus two lexical words (cat and play), plus the pseudohomophone kwik, as used by Sullivan (1992). The 'correct' pronunciations are those read aloud by Sullivan's 20 non-phonetician subjects, and transcribed by him as British Received Pronunciation. We refer to this test set as Sull 136. Our expectation is that the error rate will be relatively high for this test set, partly because of its larger size but more importantly because the subjects' dialect of English is British RP rather than General American, i.e. there is a very significant inconsistency with the lexical databases.</Paragraph> <Paragraph position="2"> The output has been scored on words correct and also on symbol score (i.e. phonemes correct) using the Levenshtein (1966) string-edit distance, as shown in Table 1. Our best comparison with Dedina and Nusbaum (D&N 70 test set, D&N model, Webster's database) gives a figure of 77.1% words correct. This is far poorer than their approximately 91% words correct - yet the implementation, reference pronunciations and test set are (as far as we can tell) identical. The only relevant difference is that the Webster's database is automatically aligned in their work and hand-aligned in ours. The clear expectation, given the crude nature of their alignment, is that they should have experienced a higher error rate, not a dramatically lower one. Overall, this result accords far more closely with Sullivan and Damper (1993), whose best word score for automatic alignment (and using smaller databases but a larger test set) was just over 70%.</Paragraph> <Paragraph position="3"> The re-implementation made 16 errors under the above conditions. Dedina and Nusbaum's claimed error rate of 9% amounts to just 6 errors, 3 of which are the same as ours. The commonest problem is vowel substitution. It is possible to discount a very few errors as essentially trivial, reducing the error rate marginally to some 20%. We conclude, therefore, that Dedina and Nusbaum's reported error rate of 9% is unattainable.</Paragraph> <Paragraph position="4"> In our opinion, a major deficiency of the simple shortest-path length heuristic is that the output can become unreasonably sensitive to rare or unique pronunciations. For instance, mone receives the strange pronunciation /moni/ by analogy with anemone. Also, the pseudoword shead receives the bizarre, vowelless pronunciation /ʃ___d/ (where '_' denotes the null phoneme) when using the D&N model and the TWB database. As illustrated in Fig. 2 earlier, this turns out to be a result of matching the unique but long mapping head → /___d/, as in forehead (arc frequency 1), in conjunction with the very common mapping sh → /ʃ_/, as in she and shed (arc frequency 174), which swamps the overall score of 175.</Paragraph> <Paragraph position="5"> The same bizarre pronunciation does not occur with the PROD model. In this case, the path through the (/e/, 3) node has a product score of 12 × 30 = 360 for the pronunciation /ʃed/, which considerably exceeds the score of 174 for /ʃd/.</Paragraph>
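The difference between the two scoring heuristics on this example can be made concrete with a small sketch. This is our own illustration in Python, not the paper's implementation: only the arc frequencies 174, 1, 12 and 30 are taken from the text, /S/ is an ASCII stand-in for /ʃ/, and the path representation is purely schematic.

from math import prod

def arc_sum(freqs):
    # D&N-style heuristic: path score = sum of the arc frequencies on the path
    return sum(freqs)

def arc_product(freqs):
    # PROD-style heuristic: path score = product of the arc frequencies
    return prod(freqs)

# Arc frequencies along the two competing paths for 'shead'.
candidates = {
    "/Sd/":  [174, 1],   # sh -> /S_/ (freq 174) + head -> /___d/ (freq 1)
    "/Sed/": [12, 30],   # path through the (/e/, 3) node: arcs of frequency 12 and 30
}

for name, score in [("arc-sum", arc_sum), ("arc-product", arc_product)]:
    scores = {p: score(f) for p, f in candidates.items()}
    print(name, scores, "winner:", max(scores, key=scores.get))

# arc-sum     {'/Sd/': 175, '/Sed/': 42}  winner: /Sd/   (the vowelless output)
# arc-product {'/Sd/': 174, '/Sed/': 360} winner: /Sed/  (the sensible output)

Under the sum, the single high-frequency sh arc dominates whatever it is combined with, whereas the product heavily penalises any path that relies on a frequency-1 arc.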
<Paragraph position="6"> Replacing the arc-sum heuristic of the D&N model by arc-product as in the PROD model leads to a considerable increase in performance, e.g. from 77.1% words correct to 82.9% for the D&N 70 test set with Webster's database. In turn, the MP model performs better than PROD in all cases.</Paragraph> <Paragraph position="7"> For the Sull 136 test set, our expectation of poorer performance (because of the larger test set and the inconsistency of dialect between the target pronunciations and the lexical databases) is borne out for Webster's dictionary. For TWB, however, the performance difference between test sets is less consistent.</Paragraph> </Section> <Section position="2" start_page="271" end_page="271" type="sub_section"> <SectionTitle> 5.2 Lexical Words </SectionTitle> <Paragraph position="0"> The primary ability of a text-to-speech system must be to produce correct pronunciations for lexical words (rather than pseudowords) which just happen to be absent from the system's dictionary. Accordingly, we have tested the PbA implementations by removing each word in turn from its relevant database and obtaining a pronunciation by analogy with the remainder.</Paragraph> <Paragraph position="1"> In these tests, the transcription standard employed by the compilers of the dictionary becomes its own reference, and problems of transcription inconsistencies between input strings and lexical entries are avoided.</Paragraph> <Paragraph position="2"> Results for the testing of lexical words are shown in Table 2. Again there are consistent performance differences, with the 'standard' D&N model worst and the mapping probability (MP) model best. All models perform better with the TWB database than with Webster's, probably simply because of its smaller size.</Paragraph> <Paragraph position="3"> For some lexical words, no pronunciation at all was produced because there was no complete path from Start to End in the lattice. This occurred for 92 of the TWB words and 117 of the Webster's words, irrespective of the scoring model. This is a serious shortcoming: a PbA system should always produce a best-attempt pronunciation, even if it cannot produce the correct one. Sometimes, this failure is a consequence of the form of pronunciation lattice, in which nodes are used to represent the 'end-points' of mappings. One of the inputs for which no pronunciation was found is anecdote, whose (partial) lattice is shown in Fig. 3.</Paragraph> <Paragraph position="4"> There is in fact no arc in the complete lattice between nodes (/k/, 4) and (/d/, 5) because there is no cd → /kd/ mapping anywhere in either dictionary. Nor is there an ecd or cdo trigram - with or without the right end-point phonemes - which could possibly bridge the gap.</Paragraph>
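The failure can be pictured as a simple reachability problem over the lattice. The sketch below is our own schematic in Python, not the paper's code: the node labels and the fragment of arcs are hypothetical, except for the crucial missing arc between (/k/, 4) and (/d/, 5) described above.

from collections import defaultdict, deque

def has_complete_path(arcs, start="Start", end="End"):
    # Breadth-first search: is End reachable from Start in the lattice?
    graph = defaultdict(list)
    for src, dst in arcs:
        graph[src].append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical fragment of a lattice for 'anecdote': arcs lead into (k, 4)
# and out of (d, 5), but nothing bridges the two because no cd -> /kd/
# mapping (nor a suitable ecd/cdo trigram) occurs in the dictionary.
arcs = [
    ("Start", ("a", 1)), (("a", 1), ("n", 2)), (("n", 2), ("e", 3)),
    (("e", 3), ("k", 4)),
    (("d", 5), ("o", 6)), (("o", 6), ("t", 7)), (("t", 7), "End"),
    # missing: (("k", 4), ("d", 5))
]

print(has_complete_path(arcs))   # False -- the system emits no pronunciation at all

A best-attempt system would need either default single-letter arcs (as in the Sullivan and Damper lattice discussed next) or some fallback strategy when this check fails.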
<Paragraph position="5"> This problem is entirely avoided with the Sullivan and Damper style of lattice, because the shortest-length arc corresponds to a single-symbol mapping rather than to a bigram (which may be unique). Thus, there will always be a 'default' single-symbol mapping corresponding to the commonest pronunciation of the letter. This is not to say that Sullivan and Damper's system will necessarily produce the correct output here: it almost certainly will not, because of the rarity of the c → /k/ mapping in the _d context.</Paragraph> <Paragraph position="6"> Another input which fails to produce a pronunciation is aardvark. The problem here is not that there is no aa bigram in the dictionary (the bigram is found in words such as bazaar), but that it only appears towards the end of other words. Dedina and Nusbaum's strategy of performing substring matching only over a restricted range (the number of matching comparisons is equal to the difference in length between the input string and the lexical entry) is at the root of this problem.</Paragraph> </Section> </Section> </Paper>