<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1005"> <Title>THE HCRC MAP TASK CORPUS: NATURAL DIALOGUE FOR SPEECH RECOGNITION</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> THE HCRC MAP TASK CORPUS: NATURAL DIALOGUE FOR SPEECH RECOGNITION </SectionTitle> <Paragraph position="0"> 1: Human Communication Research Centre 2: Department of Artificial Intelligence 3: Centre for Cognitive Science 4: Department of Linguistics 5: Department of Psychology. 2 Buccleuch Place, SCOTLAND</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The HCRC Map Task corpus has been collected and transcribed in Glasgow and Edinburgh, and recently published on CD-ROM. This effort was made possible by funding from the British Economic and Social Research Council.</Paragraph> <Paragraph position="1"> The corpus is composed of 128 two-person conversations in both high-quality digital audio and orthographic transcriptions, amounting to 18 hours and 150,000 words respectively.</Paragraph> <Paragraph position="2"> The experimental design is quite detailed and complex, allowing a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way.</Paragraph> <Paragraph position="3"> The corpus is a uniquely valuable resource for speech recognition research in particular, as we move from developing systems intended for controlled use by familiar users to systems intended for less constrained circumstances and naive or occasional users. Examples supporting this claim are given, including preliminary evidence of the phonetic consequences of second mention and the impact of different styles of referent negotiation on communicative efficacy.</Paragraph> </Section> <Section position="5" start_page="0" end_page="25" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> The HCRC Map Task corpus has been collected and transcribed in Glasgow and Edinburgh, and recently published on CD-ROM (HCRC 1993). This effort was made possible by funding from the British Economic and Social Research Council.</Paragraph> <Paragraph position="1"> The group which designed and collected the corpus covers a wide range of interests, and the corpus reflects this, providing a resource for studies of natural dialogue from many different perspectives.</Paragraph> <Paragraph position="2"> In this paper we will give a brief summary of the experimental design, and then concentrate on those aspects of the corpus which make it a uniquely valuable resource for speech recognition research in particular, as we move from developing systems intended for controlled use by familiar users to systems intended for less constrained circumstances and naive or occasional users. 
Some preliminary results of work on the phonetic consequences of second mention and on the impact of different styles of referent negotiation on communicative efficacy will also be presented.</Paragraph> </Section> <Section position="6" start_page="25" end_page="25" type="metho"> <SectionTitle> 2. CORPUS DESIGN AND CHARACTERISTICS </SectionTitle> <Section position="1" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.1. The Task </SectionTitle> <Paragraph position="0"> The conversations were elicited by an exercise in task-oriented cooperative problem solving. The two participants sat facing one another in a small recording studio, separated by a table on which sat back-to-back reading stands. On each stand was a schematic map, each visible only to one participant. Each map consisted of an outline and roughly a dozen labelled features (e.g. &quot;white cottage&quot;, &quot;Green Bay&quot;, &quot;oak forest&quot;). Most features are common to the two maps, but not all, and the participants were informed of this. One map had a route drawn in, the other did not. The task was for the participant without the route to draw one on the basis of discussion with the participant with the route.</Paragraph> <Paragraph position="1"> Using an elaboration of a design developed over a number of years (see e.g. Brown, Anderson et al. 1983), we recorded 128 two-person conversations (each talker in four conversations), employing 64 talkers (32 male, 32 female), almost all born and raised in the Glasgow area, speaking with an educated West of Scotland accent. High-quality recordings were made using Shure SM10A close-talking microphones, one talker per channel on stereo DAT (Sony DTC1000ES).</Paragraph> <Paragraph position="2"> The experimental design is quite detailed and complex, allowing a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts, and to provide a range of different stimuli for referent negotiation, based on matches and mismatches between the two maps.</Paragraph> <Paragraph position="3"> Among the independent variables in the design were: * Eye-contact--in half the conversations, the participants could see one another's faces; in half, they could not.</Paragraph> <Paragraph position="4"> * Familiarity--in half the conversations, the talkers were acquaintances; in half, strangers.</Paragraph> <Paragraph position="5"> * Task role--each talker participated in four conversations, two as Instruction Giver (the one with the route) and two as Instruction Follower (the one trying to draw it).</Paragraph> <Paragraph position="6"> For a complete description of the experimental design, see Anderson, Bader et al. (1991).</Paragraph> 
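<Paragraph position="7"> As a purely illustrative aside, the design figures quoted above can be checked mechanically. The short Python sketch below is not part of the corpus documentation; it only verifies that 64 talkers, four conversations per talker, two talkers per conversation and an even Giver/Follower split jointly imply the 128 recorded conversations.

# Illustrative consistency check on the reported design figures.
TALKERS = 64                # 32 male, 32 female
CONVS_PER_TALKER = 4        # each talker appears in four conversations
TALKERS_PER_CONV = 2        # two-person conversations

total_conversations = TALKERS * CONVS_PER_TALKER // TALKERS_PER_CONV
assert total_conversations == 128  # matches the number of conversations reported

# Each talker serves twice as Instruction Giver and twice as Instruction Follower,
# so Giver slots and Follower slots each equal the number of conversations.
giver_slots = TALKERS * 2
follower_slots = TALKERS * 2
assert giver_slots == total_conversations
assert follower_slots == total_conversations
print(total_conversations, "conversations in the full design")
</Paragraph>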
</Section> <Section position="2" start_page="25" end_page="25" type="sub_section"> <SectionTitle> 2.3. Corpus Characteristics </SectionTitle> <Paragraph position="0"> Subjects accommodated easily to the task and experimental setting, and produced evidently unselfconscious and fluent speech. The syntax is largely clausal rather than sentential, and turn-taking is good, with relatively little overlap or interruption. The total corpus runs to about 18 hours of speech, yielding 150,000 word tokens drawn from 2,000 word form types. Word lists containing all the feature names were also elicited from all speakers, along with a number of 'accent diagnosis' utterances.</Paragraph> <Paragraph position="1"> The acoustic quality of the recordings is good but not outstanding--in particular, stereo separation is not perfect, in that it is often possible to detect the voice of one talker very faintly on the other talker's channel. A very modest amount of rumble and other non-specific background noise is occasionally detectable.</Paragraph> </Section> </Section> <Section position="7" start_page="25" end_page="26" type="metho"> <SectionTitle> 3. THE TRANSCRIPTIONS </SectionTitle> <Paragraph position="0"> The transcriptions are at the orthographic level and quite detailed, including filled pauses, false starts and repetitions, broken words, etc. Considerable care has been taken to ensure consistency of notation, which is thoroughly documented. Although the full complexity of overlapped regions has not been reflected in the transcriptions, such regions are clearly set off from the rest of the transcripts. Transcripts are connected to the sampled acoustic data by sample numbers marked every few turns.</Paragraph> <Paragraph position="1"> Text Encoding Initiative-compliant SGML markup is used, both within transcripts to indicate turn boundaries and for other metatextual purposes, and also in separate corpus header and transcript header files, but this was done in a manner designed to make accessing the transcripts as plain text very easy.</Paragraph> <Paragraph position="2"> We also used a very light-weight non-TEI markup for textual annotations, to mark such things as abandoned words, letter names, filled pauses and editorial uncertainties.</Paragraph> <Paragraph position="3"> A brief extract from a transcript is given below as Figure 1, illustrating various aspects of the transcription, including the tags u for utterance, sfo for speech file offset, bo for begin overlap and eo for end overlap, as well as the le microtag for a letter name.</Paragraph> <Paragraph position="4"> <u who=G n=3> <sfo samp=107715> <bo id=o75a> About half an inch above it, we've got an {le x} marking start. Have</Paragraph>
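<Paragraph position="5"> Because the markup is deliberately plain-text friendly, a turn such as the one in Figure 1 can be processed with very simple tools. The Python fragment below is only an illustrative sketch: the tag syntax is inferred from the extract above rather than from the corpus documentation, and the sample rate used to convert the sfo offset into a time is an assumed value.

import re

# Minimal sketch: pull the speaker, sample offset and plain text out of one
# transcript turn. Tag syntax inferred from Figure 1; check the corpus
# documentation before relying on it.
SAMPLE_RATE = 20000  # samples per second -- an assumption, not stated in this paper

TAG = re.compile(r"<(u|sfo|bo|eo)\b[^>]*>")
LETTER = re.compile(r"\{le (\w)\}")  # letter-name microtag, e.g. {le x}

def parse_turn(line):
    """Return (speaker, sample_offset, plain_text) for one transcript turn."""
    who = re.search(r"<u who=(\w+)", line)
    samp = re.search(r"<sfo samp=(\d+)", line)
    text = TAG.sub("", line)         # drop the SGML tags
    text = LETTER.sub(r"\1", text)   # expand letter-name microtags
    return (who.group(1) if who else None,
            int(samp.group(1)) if samp else None,
            " ".join(text.split()))  # normalise whitespace

speaker, offset, text = parse_turn(
    "<u who=G n=3> <sfo samp=107715> <bo id=o75a> About half an inch above it, "
    "we've got an {le x} marking start."
)
print(speaker, offset / SAMPLE_RATE, text)  # links the turn to its place in the audio
</Paragraph>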
</Section> <Section position="8" start_page="26" end_page="26" type="metho"> <SectionTitle> 4. THE CD-ROMS </SectionTitle> <Paragraph position="0"> The published version of the corpus occupies a set of CD-ROMs containing: * the transcriptions and both channels of the associated speech for all the conversations; * for each talker, sampled audio for an accent diagnostic passage and a scripted reading of a list of all the feature names from the map; * images of the maps employed; * documentation; * UNIX (TM) tools for linking the spoken and written material and for other manipulations of the corpus materials.</Paragraph> <Paragraph position="1"> Preparation of the corpus for publication was a much larger task than we had expected, and is described in some detail in (Thompson & Bader, 1993).</Paragraph> </Section> <Section position="9" start_page="26" end_page="28" type="metho"> <SectionTitle> 5. IMPLICATIONS FOR SPEECH RECOGNITION </SectionTitle> <Section position="1" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 5.1. High-quality unscripted dialogue </SectionTitle> <Paragraph position="0"> Recorded collections of natural conversation are not new--not only do many linguists have a drawer full of tapes of dinner table or staff room talk, but also more systematic and extensive collection efforts have been carried out on several occasions as part of major reference corpus-building projects. But, with no exceptions that we are aware of, all such material is of highly varying acoustic quality, and is rarely if ever suitable for extensive computational processing.</Paragraph> <Paragraph position="1"> On the other hand, the large development corpora collected and used to such good effect by the speech recognition community, although of a very high standard acoustically, have to date been exclusively monologue, and until very recently exclusively scripted.</Paragraph> <Paragraph position="2"> Thus the Map Task corpus occupies a hitherto vacant position in corpus design space: it is natural, unscripted dialogue recorded to a standard suitable for digital processing.</Paragraph> <Paragraph position="3"> We hope the widespread availability of such a resource will help to stimulate a change in the way phonology, morphology, syntax and semantics are pursued, parallel to the change which has already occurred in phonetics: that is, a change from theory development dependent on small amounts of data, often constructed by the theorist, to theory development dependent on, indeed immersed within, a large amount of naturally occurring data.</Paragraph> <Paragraph position="4"> Note that this methodological change is, or at least ought to be, independent of meta-theoretical disposition, and in particular the above remarks are not meant to imply a bias in favour of stochastic or self-organising theoretical frameworks.</Paragraph> </Section> <Section position="2" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 5.2. Syntax </SectionTitle> <Paragraph position="0"> There is modest controversy brewing about the relation between spoken and written language, particularly in a highly literate language/culture context such as obtains for English. It has been argued (see e.g. Miller 1993) that the grammar of spoken English is qualitatively different from that of written English, and demands separate treatment.</Paragraph> <Paragraph position="1"> In so far as the progress of speech recognition beyond relatively constrained interaction situations and relatively constrained language will depend on grammars and/or models of natural English conversation, the resource provided by the Map Task has an obvious role to play.</Paragraph> </Section> <Section position="3" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 5.3. Prosody </SectionTitle> <Paragraph position="0"> It has long been assumed that there is a mutually informing relationship between prosody and discourse structure. The simple goal-oriented nature of the Map Task conversations, and the ease with which quite local, short-term goals can be identified in terms of the part of the route in question at any given time, mean that the corpus provides an excellent basis for attempting to explicate this relationship in some detail. Work has begun on relating the inventories of intonation on the one hand and moves within conversational games on the other, with initially encouraging results (Kowtko, Isard and Doherty 1992).</Paragraph> <Paragraph position="1"> As in the case of syntax, we would hope that widespread provision of the corpus will enable comparative exploration of the numerous theories of discourse structure, prosody and their relations now being suggested.</Paragraph> </Section> <Section position="4" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 5.4. Fast speech rules </SectionTitle> <Paragraph position="0"> The names associated with the landmarks drawn on the maps were designed inter alia to provide opportunities for various forms of phonological modification, in particular t-deletion (&quot;vast meadow&quot;), d-deletion (&quot;reclaimed fields&quot;), glottalisation (&quot;white mountain&quot;) and nasal assimilation (&quot;crane bay&quot;). Furthermore, on each map one such name would be paired with another, similar name, with the intention of assessing the impact of the (putative) necessity of contrastive stress (&quot;crane bay&quot; vs. &quot;green bay&quot;). The availability in the corpus of citation form pronunciations by each speaker will provide a very useful baseline for studies in this area.</Paragraph>
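<Paragraph position="1"> As a purely illustrative sketch of how such studies might begin, the Python fragment below scans orthographic transcript text for the feature names cited in this subsection and labels each hit with the process it was designed to probe, so that the conversational tokens can later be compared acoustically with the speaker's citation-form word-list reading. The name-to-process mapping repeats only the examples given above, and the code itself is not part of the corpus tools.

import re

# Feature names quoted in this subsection and the process each was designed to elicit.
PROCESS_BY_NAME = {
    "vast meadow": "t-deletion",
    "reclaimed fields": "d-deletion",
    "white mountain": "glottalisation",
    "crane bay": "nasal assimilation",
    "green bay": "contrast pair for crane bay (contrastive stress)",
}

def find_candidates(transcript_text):
    """Yield (feature_name, process, character_offset) for every mention found."""
    lowered = transcript_text.lower()
    for name, process in PROCESS_BY_NAME.items():
        for match in re.finditer(re.escape(name), lowered):
            yield name, process, match.start()

sample = "Right, go just below the vast meadow, then up towards crane bay."
for name, process, pos in find_candidates(sample):
    print(pos, name, "--", process)
</Paragraph>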
</Section> <Section position="5" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 5.5. The role of eye contact </SectionTitle> <Paragraph position="0"> Not surprisingly, there are obvious gross effects on the conversations of the difference between the eye-contact and no-eye-contact conditions. The no-eye-contact conversations contained 22% more turns on average, but only 18% more words, i.e. more turns, but fewer words per turn. This is presumably because of the increased need for frequent back-channel confirmations in the no-eye-contact condition.</Paragraph> <Paragraph position="1"> The overall statistics for word tokens and turns are as given in Table 1. The implications of the language differences induced by the presence or absence of eye-contact are clearly significant for a range of different potential speech technology applications. See (Boyle, Anderson & Newlands, in press) for more details.</Paragraph> </Section> </Section> </Paper>