<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0402"> <Title>ELICITING NATURAL SPEECH FROM NON-NATIVE USERS: COLLECTING SPEECH DATA FOR LVCSR</Title>
<Section position="5" start_page="6" end_page="7" type="metho"> <SectionTitle> 3 Pilot Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="6" end_page="7" type="sub_section"> <SectionTitle> 3.1 Recording Setup </SectionTitle>
<Paragraph position="0"> All recordings were made with a DAT recorder; speakers wore a Sennheiser headset. Recordings were done in a small computer lab with some incidental noise but no excessive outside noise.</Paragraph>
<Paragraph position="1"> On some occasions there were other people in the room while the recording was being done; this will be discussed further below. In non-interactive recordings, users were seated at a table with the instruction sheets, a pen or pencil, and water. Speakers were permitted to stop and restart recording at any time.</Paragraph>
<Paragraph position="2"> We ran two pilot experiments that greatly helped us to understand the needs of our speakers and how we could make them more comfortable, in turn improving the quality of our data. For these experiments, we recorded native speakers of Japanese.</Paragraph>
<Paragraph position="3"> In the first experiment, we drew on a human-machine collection task that we had had success with for native speakers in a similar application in another domain. Speakers were provided with a set of prompts, given first in English. As we had predicted, they were strongly influenced in their word choice by the phrasings used in the prompts. The second time they came in, they were given the prompts in their native language. They felt that this task was much harder; they perceived it as a translation task in which they were expected to give a correct answer, whereas with the English prompts they were effectively given the correct answer. Their productions, however, were more varied, differing both from each other and from the original English prompts.</Paragraph>
<Paragraph position="4"> In addition to the prompt-based task, we had speakers read from a local travel guide, specifically about the university area so that the context would be somewhat familiar. We found that there were indeed reading errors of the type that would not occur in spontaneous speech.</Paragraph>
<Paragraph position="5"> We observed that some speakers were stumbling over words that they obviously didn't know. We attempted to normalize for this by having them read utterances that had been previously recorded and transcribed, hoping that they would be more likely to be familiar with words that other speakers of similar fluency had used. We still found that they had some difficulty in reading. Our speakers were native speakers of Japanese, however, which has a different writing system; this would have had some influence. There was also a fair amount of stumbling over words in the prompted tasks, especially with proper nouns, and we have not yet looked at the correspondence between stumbling over familiar words in read speech and stumbling in spontaneous speech. It may be that the two are more closely related than they are for native speakers.</Paragraph>
<Paragraph position="6"> In the second pilot experiment, we attempted a wizard-of-oz collection using an interactive map; the speakers could ask for locations and routes to be highlighted on the map, and there was a text screen to which the wizard could send messages answering speaker queries. Instead of a list of prompts, the speakers were given a sheet of paper listing some points of interest in the city, hotel names, some features they could ask about (business hours, location, etc.), and the dates they would be in the city. Their task was to plan a weekend, finding hotels, restaurants, and things to do. Our thought was that speakers might speak more naturally in an information-gathering task, in which they are actually trying to communicate rather than simply producing sentences.</Paragraph>
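The two wizard channels described above (map highlights and text replies) can be pictured with a small sketch. The following is a hypothetical illustration only, not the tool used in the experiment: the class, method, and file names (WizardSession, highlight, reply, JP-03_woz.json) and all session content are invented, and the only assumption carried over from the text is that the wizard both drives the map and sends messages, with everything logged for later analysis.

```python
# Hypothetical sketch of a wizard-of-oz session logger; not the authors' tool.
# The wizard notes what the speaker asked, then either highlights locations on
# the map or sends a text reply; every event is timestamped for later analysis.
import json
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class WizardSession:
    """Timestamped event log for one wizard-of-oz session (illustrative)."""
    speaker_id: str
    events: List[dict] = field(default_factory=list)

    def _log(self, kind: str, **payload) -> None:
        self.events.append({"t": time.time(), "kind": kind, **payload})

    def note_utterance(self, rough_transcript: str) -> None:
        # Wizard's on-the-fly note of what the speaker just asked.
        self._log("utterance", rough_transcript=rough_transcript)

    def highlight(self, location: str) -> None:
        # In the real setup this would drive the interactive map display.
        self._log("highlight", location=location)

    def reply(self, text: str) -> None:
        # In the real setup this would appear on the speaker's text screen.
        self._log("reply", text=text)

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump({"speaker": self.speaker_id, "events": self.events},
                      f, indent=2)


# Example session (all content invented for illustration):
session = WizardSession("JP-03")
session.note_utterance("where is ... the art museum?")
session.highlight("Carnegie Museum of Art")
session.reply("It is near the university; I have highlighted it on the map.")
session.save("JP-03_woz.json")
```

Keeping the wizard's rough note of each query alongside the wizard's actions is what would make exchanges such as the repair dialogues discussed next recoverable from the log afterwards.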
<Paragraph position="7"> Our general impression was that although the visual highlighting of locations was a feature the users enjoyed, and one that helped them become involved in the task, the utterances could not be characterized as more natural than those produced in the prompted task. It was also our feeling that speakers were less sure of what to do in a less structured task; both lack of confidence in speaking and unfamiliarity with a "just say whatever comes to mind" approach contributed to their general discomfort. It took time to read and understand the responses from the wizard; speakers were also aware that someone (the wizard) was listening in. Both of these factors were additional sources of self-consciousness. Although we thought that the repair dialogues that arose when the wizard misunderstood the speaker were valuable data, and that someone trained to provide responses geared to the fluency level of the speaker would have more success as a wizard, it was our opinion that, given the range of fluency levels we were targeting, wizard-of-oz collection would not be ideal, for the following two reasons:
* communication and speaker confidence break down when the speaker is really having trouble expressing himself and the wizard cannot understand
* simulating a real-life experience, such as making a hotel reservation, without the real goal of wanting to stay in a hotel and without background knowledge about the trip can be very difficult, depending on language ability and cultural background</Paragraph> </Section> </Section>
<Section position="6" start_page="7" end_page="8" type="metho"> <SectionTitle> 4 Final Protocol </SectionTitle>
<Paragraph position="0"> The final data collection protocol that we settled on has three parts. The first is a series of scenarios, in each of which a situation is described in the speaker's native language (L1) and a list is given, in bullet form, of things relevant to the situation that the speaker is to ask about. For instance, if the situation is a Pittsburgh Steelers game, the speakers would see a short list of bullets relevant to the game. The bullets are made as short as possible so that the speakers can absorb them at a glance and concentrate on formulating an original question instead of on translating a specific phrase or sentence.</Paragraph>
<Paragraph position="1"> The second part is a read task. The pilot experiments left no doubt that speakers' patience with the prompted task was limited; after the novelty wore off, speakers tired quickly. Although spontaneous data would be better than read data, read data would be better than no data, and speakers seemed willing to spend at least as much time again on reading as they had on the prompted task. We considered two types of material for the reading. Some sort of phonetically balanced text is often used for data collection, so that the system is trained with a wide variety of phonetic contexts. Given that our speakers are even more restricted in their phrasings than native speakers are in conversational speech, it is likely that some phonetic contexts are extremely sparsely represented in our data. However, it may be that semi-fluent speakers avoid some constructions precisely because they are difficult to pronounce, and sparsity in the training data is probably a good predictor of sparsity in unseen data; even with new words, which may contain as-yet-unseen phonetic contexts, non-native speakers may not pronounce them at all in the way the designer of the phonetically balanced text anticipated. We chose a 1000-word version of the fairy tale Snow White for our read texts; it had the highest syllable growth rate of any of the fairy tales we looked at. We augmented the syllable inventory by replacing some words with others, trying at the same time to ensure that all of the words were ones our speakers were likely to have encountered before.</Paragraph>
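The selection criterion can be made concrete with a short sketch. This is a hedged illustration, not the authors' procedure: the paper does not define "syllable growth rate", so it is interpreted here as how quickly new syllable types keep accumulating as the text unfolds, the orthographic syllabifier is a crude stand-in for a pronouncing dictionary, and the function names and sample sentence are invented.

```python
# A minimal sketch of comparing candidate read texts by syllable growth rate,
# under the interpretation stated above (an assumption, not the paper's method).
import re
from typing import List, Tuple

VOWELS = "aeiouy"


def rough_syllables(word: str) -> List[str]:
    # Very rough orthographic split: onset consonants plus the following
    # vowel group. Real work would use a pronouncing dictionary instead.
    return re.findall(rf"[^{VOWELS}]*[{VOWELS}]+", word.lower())


def growth_curve(text: str, step: int = 100) -> List[Tuple[int, int]]:
    # (syllable tokens seen, distinct syllable types seen) at each checkpoint.
    seen: set = set()
    curve: List[Tuple[int, int]] = []
    tokens = 0
    for word in re.findall(r"[a-z']+", text.lower()):
        for syl in rough_syllables(word):
            tokens += 1
            seen.add(syl)
            if tokens % step == 0:
                curve.append((tokens, len(seen)))
    curve.append((tokens, len(seen)))
    return curve


def growth_rate(text: str) -> float:
    # Distinct syllable types per syllable token at the end of the text;
    # a rough proxy for how quickly the inventory keeps growing.
    tokens, types = growth_curve(text)[-1]
    return types / tokens if tokens else 0.0


# Compare candidate texts (here a short invented sample; real use would pass
# the full text of each fairy tale under consideration):
sample = ("Once upon a time, in the middle of winter, a queen sat sewing "
          "at her window, and the snow fell past like feathers.")
print(round(growth_rate(sample), 3))
```

Under this reading, a text with a higher growth rate keeps introducing new syllable types rather than recycling the same ones, which is what makes a short read text cover more of the inventory.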
<Paragraph position="2"> Finally, we ask speakers to read a selection of previously recorded and transcribed utterances from the prompted task, by both native and non-native speakers, randomly selected and with small modifications made to preserve anonymity. Our objective here was threefold: to quantify the difference between read dialogues and spontaneous dialogues; to quantify the difference between read dialogues and read prose; and to compare the end recognizer's performance on native grammar with non-native pronunciation against its performance on non-native grammar with non-native pronunciation.</Paragraph>
<Paragraph position="3"> We have recorded 23 speakers so far in the post-pilot phase of data collection, and all have expressed satisfaction with the protocol.</Paragraph> </Section> </Paper>