<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1055">
  <Title>Phonological Parsing for Bi-directional Letter-to-Sound/Sound-to-Letter Generation</Title>
  <Section position="6" start_page="290" end_page="292" type="evalu">
    <SectionTitle>
EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> Experiments on both letter-to-sound and sound-to-letter generation were conducted using 26 letters, one graphemic place-holder and 52 phonemes (including several unstressed vowels and pseudo diphthongs such as/o r/). Each entry in the test corpus contains a spelling corresponding to a single pronunciation. The generation procedures use evaluation criteria that directly mirror one another. Word accuracy is the percentage of parsable words for which the top-ranking theory generates a spelling/pronunciation that matches the lexical entry exactly. Non-parsable words are those for which no sPelling/pronunciation output is produced. &amp;quot;Top N&amp;quot; word accuracy refers to the percentage of parsable words for which the correctly generated spelling/pronunciation appears in the top N complete theories. Letter/Phoneme accuracies include insertion, substitution and deletion error rates, and are obtained using the program provided by NIST for evaluating speech recognition systems.</Paragraph>
    <Paragraph position="1"> Results on Letter-to-Sound Generation In letter-to-sound generation, about 6% of the test set was nonparsable. This set consists of compound words, proper names, and words that failed due to sparse data problems. Results for the parsable portion of the test set are shown in Table 2. The 69.3% word accuracy corresponds to a phoneme accuracy of 91.7%, where an insertion rate of 1.2% has been taken into account.</Paragraph>
    <Paragraph position="2"> Thus far there are no standardized evaluation methods for text-to-speech systems, and therefore comparison among different systems remains difficult. Errors in the generated stress pattern and/or phoneme insertion errors are often neglected. Evaluation criteria that have been used include word accuracy, accuracy per phoneme and accuracy per letter (in measuring the accuracy per letter, silent letters are regarded as mapping to a \[NULL\] phone). We believe that accuracy per letter would generally be higher than accuracy per phoneme, because there are generally more letters than phonemes per word, and the letters mapping to the generic category \[NULL\] would usually be correct. To verify our claim, we computed the two measurements based on our training set, using the alignment provided by the training parse trees. Our re- null ories as a function of N-best depth for the test set sult shows that a per letter measurement would lead to a .10% reduction in error rate.</Paragraph>
    <Paragraph position="3"> Figure 2 is a plot of cumulative percent correct of whole word theories as a function of the N-best depth for the test set. Although 30 complete theories were generated for each word, no correct theories occur beyond N ----18 after resorting, with an asymptotic value of just over 89%.</Paragraph>
    <Section position="1" start_page="291" end_page="291" type="sub_section">
      <SectionTitle>
Results on Sound-to-Letter Generation
</SectionTitle>
      <Paragraph position="0"> In sound-to-letter generation, about 4% of the test set was nonparsable. Results for the parsable words are shown in Table 3; top-choice word accuracy for sound-to-letter is about 52%. This corresponds to a letter accuracy of 88.6%, with an insertion error rate of 2.5% taken into account. This performance compares favorably with those reported in previous work.</Paragraph>
      <Paragraph position="1"> Figure 3 is a plot of the cumulative percent correct (in sound-to-letter generation) of whole word theories as a function of N-best depth of the test set. The asymptote of the graph shows that the first 30 complete theories generated by the parser contain a correct theory for about 83% of the test words. Within this pool, resorting using the actual parse score has put the correct theory within the top 10 choices for about 81% of the cases, while the remaining 2% have their correct theories ranked between N = 10 and N = 30. Resorting seems to be less effective in the sound-to-letter case, presumably because many more &amp;quot;promising&amp;quot; theories can be generated than for letter-to-sound. A possible reason for this is the ambiguity in phoneme-to-letter mapping, and another reason is that geminant letters are often mapped to the same (consonantal) phoneme. For example, the generated spellings from the pronunciation of &amp;quot;connector&amp;quot; i.e., the phoneme string (k t n e k t a~), include: &amp;quot;conecter&amp;quot;, &amp;quot;conector&amp;quot;, &amp;quot;connecter&amp;quot;, &amp;quot;connector&amp;quot;, &amp;quot;conectar&amp;quot;, &amp;quot;conectyr&amp;quot;, &amp;quot;conectur&amp;quot;, &amp;quot;connectyr&amp;quot;, &amp;quot;eonnectur&amp;quot;, &amp;quot;conectter', &amp;quot;connectter&amp;quot; and &amp;quot;cannecter'. Many of these hypotheses can be rejected with the avail-' ability of a large lexicon of legitimate English spellings.</Paragraph>
    </Section>
    <Section position="2" start_page="291" end_page="292" type="sub_section">
      <SectionTitle>
Error Analyses
</SectionTitle>
      <Paragraph position="0"> Both of the cumulative plots shown above reach an asymptotic value well below 100%. The words that belong to the portion of the test set lying above the asymptote appear intractable - a correct pronunciation/spelling did not emerge as one of the 30 complete theories. Detailed analysis of these words shows that they fall into approximately 4 categories. (1) Generated pronunciations that have subtle deviations from the reference strings. (2) Unusual pronunciations due to influences from foreign languages. (3) Generated pronunciations which agree with the regularity of English letter-phoneme mappings, but were nevertheless incorrect. (4) Errors attributable to sparse data problems. Some examples are shown in Table 4. It is interesting to note that there is much overlap between the set of problematic words in letter-to-sound and sound-to-letter generation. This implies that  Category correct generated generated correct spelling spelling pronunciation pronunciation  (1) Subtle acquiring equiring IkwoYrzl\] ikwo/~llj balance balence correct ba~hns launch lawnch correct lon5 pronounced pronounst pnnoWnst proWnaWnst (2) Umzsual champagne shampain ~a~mplgniY ~a:mpeYn debris dibree diYbns dlbriY (3) Regular basis correct ba~sls beYsts elite aleat doYt diYt violence viallence correct voYdms viscosity viscossity v,skoWs,ti y vIskos~ti y (4) Sparse braque brack bra~kwiY bra~k</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="292" end_page="293" type="evalu">
    <SectionTitle>
EVALUATING THE
HIERARCHY
</SectionTitle>
    <Paragraph position="0"> We believe that the higher level linguistic knowlege incorporated in the hierarchy is important for our generation tasks. Consequently, we would like to empirically assess: (1) the relative contribution of the different linguistic layers towards generation accuracy, and (2) the relative merits of the overall design of the hierarchical lexical representation. Our studies \[13\] are based on letter-to-sound generation only, although we expect that the implications of our study should carry over to sound-to-letter generation.</Paragraph>
    <Paragraph position="1"> Investigations on the Hierarchy The implementation of our parser is flexible, in that it can train and test on a variable number of layers in the hierarchy. This enables us to explore the relative contribution of each linguistic level in the generation task. We conducted a series of experiments whereby an increasing amount of linguistic knowledge (in terms of the number of layers in the hierarchy) is omitted from the training parse trees. For each reduced configuration, the system is re-trained and re-tested on the same training and testing corpora as described earlier. For each experiment we compute the top-choice word accuracy and perplexity, which reflect the amount of constraint provided by the hierarchical representation. We also measure the coverage to show the extent to which the parser can generalize to account for previously unseen structures, and count the number off system parameters in order to observe the computational load, as well as the parsimony of the hierarchical framework in capturing English orthographic-phonological regularities. We found that for every layer omitted from the representation, linguistic constraints are lost, manifested as a lower generation accuracy, higher perplexity and greater coverage.</Paragraph>
    <Paragraph position="2"> Fewer layers also require fewer training parameters.</Paragraph>
    <Paragraph position="3"> The significant exception was the case of omitting the layer of broad classes (layer 5), which seems to introduce additional constraints, thus giving the highest generation performance. The word accuracy based on the parsable portion of the test set was 71.8%, 5 which corresponds to a phoneme accuracy of 92.5%. This improvement 6 can be understood by realizing that broad classes can be predicted from phonemes with certainty, and the inclusion of the broad class layer probably led to excessive smoothing across the individual phonemes within each broad class7 Again, about 6% of the test set was nonparsable. When a robust parsing scheme is used to recover the nonparsable words, 100% coverage was achieved, but performance degrades to 69.2% word and 91.3% phoneme accuracy.</Paragraph>
    <Paragraph position="4"> Comparison with a Single-Layer Approach We also compared our current hierarchical framework with an alternative approach which uses a single-layer representation. Here, a word is represented mainly by its spelling and an aligned phonemic transcription, using the \[NULL\] phoneme for silent letters. The alignment is based on the training parse trees from the hierarchical approach. For example, &amp;quot;bright&amp;quot; is transcribed as/b raY NULL NULL \[/. The word is then fragmented exhaustively to obtain letter sequences (word fragments) shorter than a set maximum length. During training, bigram probabilities and phonemic transcription probabilities are computed for each letter sequence. Therefore this approach  89.4% letter accuracy, and 5% of the words were nonparsable. 7However, broad classes may still serve a role as a &amp;quot;fast match&amp;quot; layer in recognition experiments, where their predictions could no longer be certain, due to recognition errors.</Paragraph>
    <Paragraph position="5">  captures some graphemic constraints within the word fragment, but higher level linguistic knowledge is not explicitly incorporated. Letter-to-sound generation is accomplished by finding the &amp;quot;best&amp;quot; concatenation of letter sequences which constitutes the spelling of the test word.</Paragraph>
    <Paragraph position="6"> TO facilitate comparison with the hierarchical approach, we use the same training and test sets to run letter-to-sound generation experiments with the single-layer approach. Several different value settings were used for the maximum word fragment length. We expect generation accuracy to improve as the maximum word fragment length increases, because longer letter sequences can capture more context. However, this should be accompanied by an increase in the number of system parameters due to the combinatorics of the letter sequences. Furthermore, there are no nonparsable test words in the single-layer approach, because it can always &amp;quot;backolT' to mapping a single letter to its most probable phoneme.</Paragraph>
    <Paragraph position="7"> The hierarchical approach (without the broad class layer) achieved the same performance as the highest performing single-layer approach, which allowed a maximum fragrnent length of 6. 8 The mean fragment length of the segmentations used in the test set by the single-layer approach was 3.7, while the mean grapheme length used by the hierarchical approach was only 1:2. The hierarchical approach is capable of reversible generation using about 32,000 parameters, while the single-layer approach requires 693,300 parameters (a 20-fold increase) for uni-directional letter-to-sound generation. In order to achieve reversibility, the number of parameters would have to be doubled.</Paragraph>
  </Section>
class="xml-element"></Paper>