<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1005"> <Title>Automatic Acquisition of Names Using Speak and Spell Mode in Spoken Dialogue Systems</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5.1 Results and Discussion </SectionTitle> <Paragraph position="0"> For each test set, letter error rates (LER) and word error rates (WER) are computed for the output of the first-stage letter recognizer and for the output of the entire multi-stage system. In a further experiment, the subword trigram is omitted in the intermediate stage (Multi-Stage System II). Results are summarized in Tables 1 and 2.</Paragraph> <Paragraph position="1"> When the output of the first-stage letter recognizer alone is evaluated, error rates remain high (40.4% WER for Test Set A and 58.9% WER for Test Set B).</Paragraph> <Paragraph position="2"> It should be noted that none of the training data for the acoustic models contains any letter spellings, which contributes to relatively poor performance compared with that in other domains using the same models. Many of the errors are also caused by poor detection of the transition from the spoken name to the spelled portion of the waveform. Deletions occur when part of the spelled portion is mistakenly identified as part of the unknown word, and insertions arise when the end of a spoken word is mistaken for a spelled letter. However, the multi-stage system yields a marked improvement over the single-stage letter recognizer taken as a baseline. WER is reduced by a relative 36.4% (from 40.4% to 25.7%) for Test Set A, and by a relative 17.0% (from 58.9% to 48.9%) for Test Set B. The improvement is more pronounced for Test Set A because its words have been observed in the ANGIE training data.</Paragraph> <Paragraph position="3"> The most commonly confused letter pairs are: M/N, A/E, J/G, Y/I, L/O, D/T.
These letters are confusable both acoustically, as spelled letters, and in the pronunciation of the spoken word.</Paragraph> <Paragraph position="4"> When the subword trigram is removed from the language model in the later stages, WER improves further for Test Set B (to 46.1%), although performance on Test Set A deteriorates. We infer that unknown words benefit more from a less constrained language model, and when more weighting is given to the ANGIE model for test sets. In Test Set A, 309 words (74.3%) are spelled correctly, and 107 words (25.7%) are incorrect. In Test Set B, 112 words (51.1%) are spelled correctly.</Paragraph> <Paragraph position="5"> generating possible spelling alternatives.</Paragraph> <Paragraph position="6"> To evaluate the phoneme extraction accuracy, the best letter hypothesis of the multi-stage system is used to compute the phonemes, as described in Section 3.2.4. In the actual ORION system, when a user confirms the correct spelling of their name and the name exists in the training pronunciation lexicon, the phoneme extraction stage may be redundant. This assumes that the pronunciation lexicon itself is reliable and contains all the correct alternate pronunciations of the word. For the purpose of evaluation, we examine the phoneme outputs for both the in-vocabulary Test Set A and the OOV Test Set B, whose phonemic baseforms have been hand-transcribed.</Paragraph> <Paragraph position="7"> Within ANGIE, phonemes are marked for lexical stress and syllable onset positions. There are also many special compound phonemic units (e.g., /sp, sk, st/). A much smaller phoneme set of 50 units is derived for evaluation by applying rules to collapse the phoneme hypotheses.</Paragraph> <Paragraph position="8"> The phoneme error rates (PER) for Test Sets A and B are shown in Table 3. Error rates are provided for the subsets of words whose letter hypotheses are either correct or incorrect.
Many of the confusable phoneme pairs are vowels: ih/iy, ae/aa, eh/ey. Other commonly confused phoneme pairs are: m/n, en/n, er/r, l/ow, d/t, s/z, th/dh. In another experiment, we evaluated the accuracy of the phoneme extraction by using the correct letter sequence as input instead of the highest-scoring letter sequence. The PER for Test Set A is 7.2% and the PER for Test Set B is 13.3%. While phoneme error rates are generally higher than letter error rates, it should be noted that the reference baseforms for the names contain only one or two alternate pronunciations for each name. It is not uncommon for a name to have many irregular pronunciation variants that are not covered in the reference baseforms. Also, the phonemic baseform determined by the recognizer is likely to be the one preferred by the system for the particular speaker, who is assumed to be the owner of the name. Therefore, we believe that the baseforms favored by the system may be more appropriate for subsequent recognition, especially if the name is to be spoken by the same speaker. This may be the case in spite of a mismatch between the favored phonemic baseform and that in the pronunciation dictionary.</Paragraph> </Section> </Paper>