<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0204"> <Title>PLASER: Pronunciation Learning via Automatic Speech Recognition</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The phenomenal advances in automatic speech recognition (ASR) technologies in the last decade have led to their recent use in computer-aided language learning (CALL) 1. One example is the LISTEN project (Mostow et al., 1994). However, one has to bear in mind that the goal of ASR in most other common classification applications (such as automated call centers and dictation) is orthogonal to that in CALL: while the former requires ASR in general to be forgiving of allophonic variations due to speaker idiosyncrasies or accent, pronunciation learning demands strict distinction among different sounds, although the degree of strictness can be very subjective with a human teacher. As a result, technologies developed for mainstream ASR applications may not work satisfactorily for pronunciation learning.</Paragraph> <Paragraph position="1"> In the area of pronunciation learning, ASR has been used in CALL for two different purposes: teaching correct pronunciation of a foreign language to students (Kawai and Hirose, 2000), and assessing the pronunciation quality of a speaker speaking a foreign language (Witt and Young, 2000; Neumeyer et al., 2000; Franco et al., 2000). The former calls for accurate and precise phoneme recognition, while the latter may tolerate more recognition noise. The judgment for the former task is comparatively more objective than that for the latter, which, on the other hand, is usually required to correlate well with human judges. In this paper, we describe a multimedia tool we built for high-school students in Hong Kong to self-learn American English pronunciation. Their mother tongue is Cantonese Chinese.
The objective is to teach correct pronunciation of basic English phonemes (possibly with a local accent), not to assess a student's overall pronunciation quality. (Footnote 1: CALL applies many different technologies to help language learning, but this paper concerns only the area of pronunciation learning in CALL.)</Paragraph> <Paragraph position="2"> Although there exist commercial products for the purpose, they have two major problems: first, they are not built for Cantonese-speaking Chinese; and, second, the feedback from these products does not pinpoint precisely which phonemes are poorly pronounced and which are well pronounced. As a matter of fact, most of these systems provide only an overall score for a word or utterance. As the feedback is not indicative, students would not know how to improve or correct their mistakes. One reason is the relatively poor performance of phoneme recognition: the best phoneme recognition accuracy is about 75% for the TIMIT corpus.</Paragraph> <Paragraph position="3"> We took a pragmatic view and designed a multimedia learning tool called PLASER (Pronunciation Learning via Automatic SpEech Recognition) according to the following beliefs and guidelines: 1. It is an elusive goal for average students to learn to speak a second language without a local accent.</Paragraph> <Paragraph position="4"> Therefore, PLASER should be tolerant of minor Cantonese accents, lest the students become too frustrated from continually getting low scores.</Paragraph> <Paragraph position="5"> For example, there is no &quot;r&quot; sound in Cantonese, and consequently Cantonese speakers usually pronounce the &quot;r&quot; phoneme with weak retroflexion.</Paragraph> <Paragraph position="6"> 2. Performance of phoneme recognition over a long continuous utterance is still far from satisfactory for pedagogical purposes.</Paragraph> <Paragraph position="7"> 3. 
PLASER's performance must be reliable even at the expense of lower accuracy.</Paragraph> <Paragraph position="8"> 4. To be useful for correcting mistakes, PLASER must provide meaningful and indicative feedback that pinpoints which parts of an utterance are wrongly pronounced and to what extent.</Paragraph> <Paragraph position="9"> 5. Knowledge of IPA symbols is not a prerequisite for learning pronunciation.</Paragraph> <Paragraph position="10"> This paper is organized as follows: in the next section, we first present the overall system design of PLASER.</Paragraph> <Paragraph position="11"> This is followed by a discussion of our acoustic models in Section 3. Section 4 gives a detailed description of our confidence-based approach to pronunciation scoring, and the related feedback visualization is described in Section 5. Both quantitative and qualitative evaluation results are given in Section 6. Finally, we summarize the lessons we learned in building PLASER and point out some future work in Section 7.</Paragraph> </Section></Paper>