<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2021">
  <Title>Evaluating spoken language interaction</Title>
  <Section position="1" start_page="0" end_page="153" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> To study the spoken language interface in the context of a complex problem-solving task, a group of users were asked to perform a spreadsheet task, alternating voice and keyboard input. A total of 40 tasks were performed by each participant, the first thirty in a group (over several days), the remaining ones a month later. The voice spreadsheet program used in this study was extensively instrumented to provide detailed information about the components of the interaction. These data, as well as analysis of the participants's utterances and recognizer output, provide a fairly detailed picture of spoken language interaction.</Paragraph>
    <Paragraph position="1"> Although task completion by voice took longer than by keyboard, analysis shows that users would be able to perform the spreadsheet task faster by voice, if two key criteria could be met: recognition occurs in real-time, and the error rate is sufficiently low. This initial experience with a spoken language system also allows us to identify several metrics, beyond those traditionally associated with speech recognition, that can be used to characterize system performance. Introduction The ability to communicate by speech is known to enhance the quality of communication, as reflected in shorter problem-solving times and general user satisfaction \[2\]. Recent advances in speech recognition technology \[4\] have made it possible to build &amp;quot;spoken language&amp;quot; systems that create the opportunity for interacting naturally with computers. Spoken language systems combine a number of desirable properties.</Paragraph>
    <Paragraph position="2"> Recognition of continuous speech allows users to use a natural speech style. Speaker independence allows casual users to easily use the system and eliminates training as well as its associated problems (such as drift). Large vocabularies make it possible to create habitable languages for complex applications. Finally, a natural language processing capability allows the user to express him or herself using familiar locutions.</Paragraph>
    <Paragraph position="3"> While the recognition technology base that makes spoken language systems possible is rapidly maturing, there is no corresponding understanding of how such systems should be designed or what capabilities users will expect to have available. It is intuitively apparent that speech will be suited for some functions (e.g., data entry) but unsuited for others (e.g., drawing). We would also expect that users will be willing to tolerate some level of recognition error, but do not know what this is or how it would be affected by the nature of the task being performed or by the error recovery facilities provided by the system.</Paragraph>
    <Paragraph position="4"> Meaningful exploration of such issues is difficult without some baseline understanding of how humans interact with a spoken language system. To provide such a baseline, we implemented a spoken language system using currently available technology and used it to study humans performing a series of simple tasks.</Paragraph>
    <Paragraph position="5"> We chose to work with a spreadsheet program since the spreadsheet supports a wide range of activities, from simple data entry to complex problem solving. It is also a widely used program, with a large experienced user population to draw on. We chose to examine performance over an extended series of tasks because we believe that regular use will be characteristic of spoken language applications.</Paragraph>
    <Paragraph position="6"> The voice spreadsheet system The voice spreadsheet (henceforth &amp;quot;vsc&amp;quot;) consists of the uNIx-based spreadsheet program sc interfaced to a recognizer embodying the SPHINX technology described in \[4\]. Additional description of vsc is available elsewhere \[6\], as is a description of the spreadsheet language \[9\].</Paragraph>
    <Paragraph position="7"> The recognition component of the voice spreadsheet makes use of two pieces of special-purpose hardware: a signal processing unit (the USA) and a search accelerator BEAM. See \[1\] for fuller descriptions of these units. The recognition code is embedded in the spreadsheet program, so that the complete system runs as a single process.</Paragraph>
    <Paragraph position="8">  To train the phonetic models used in the recognizer, we combined several different databases, all recorded at Carnegie Mellon using the same microphone as used for the spreadsheet study (a close-talking Sennheiser HMD-414). The training speech consisted of: calculator sentences (1997 utterances), a (general) spreadsheet database (1819 utterances), and a task-specific database for financial data (196 utterances). A total of 4012 utterances was thus included in the training set. Table 1 provides some performance data that characterize system performance.</Paragraph>
    <Paragraph position="9"> The basic recognition performance (&amp;quot;Reference&amp;quot;), as tested on speech collected at the same time as the training data, is about what might be expected given the known performance characteristics of the SPI-mqx system (specifically, 94% word accuracy for the perplexity 60 version of the Resource Management task).</Paragraph>
    <Paragraph position="10"> The Table also presents recognition performance for speech collected in the user study described below (&amp;quot;Live Session&amp;quot;). The &amp;quot;complete&amp;quot; version shows system performance over 4 sessions representing 4 different talkers and chosen from about the mid-point of the initial 30 task series (details below). Note that this set includes utterances that contain various spontaneous speech phenomena that cannot be handled correctly by the current system. The &amp;quot;clean speech&amp;quot; set includes only those utterances that both contain no interjected material (e.g., audible non-speech) and that are grammatical. Performance on this set is quite good, and there is no evidence that mere &amp;quot;spontaneity&amp;quot; leads to poorer recognition performance. We can verify this equivalence more concretely by comparing read and spontaneous speech produced by the same talkers. To do this, we asked the four participants whose speech comprised the spontaneous test sets to return and record read versions of their spontaneous utterances, using scripts taken from our transcriptions.</Paragraph>
    <Paragraph position="11"> As can be seen in the Table, performance is comparable for read and live speechl.</Paragraph>
    <Paragraph position="12"> Given that this pattern of results can be shown to generalize to other tasks (and there is no reason to believe that they would not), the implications of this experiment are highly significant: A system trained on read speech will not substantially degrade in accuracy when presented with spontaneous speech provided that certain other characteristics, such as speech rate, will be comparable. Note that this only applies to those utterances that are comparable to read speech insofar as they are grammatical and contain no extraneous acoustic events. The system will still need to deal with these phenomena. This result is encouraging for those approaches to spontaneous speech \[10\] that deal with such speech in terms of accounting for extraneous events and interpreting agrammatical utterances. If these problems can be solved in a satisfactory manner, then we can comfortably expect spontaneous spoken language system performance to be comparable to system performance evaluated on read speech.</Paragraph>
    <Paragraph position="13"> A study of spoken language system usage To understand how users approach a voice-driven system and how they develop strategies for dealing with this type of interface, we had a group of users perform a series of more or less comparable task over an extended period of time and monitored various 1The slightly better performance with Live speech might seem counter-intuitive. Examination of specific errors in the Read version indicates that one of the speakers read her raated~l at a distinctly slower pace than she spoke it spontaneously (we estimate 34% slower). The bulk of the excess errors can be accounted for by this interpretation. For example, many of the errors are splits, characteristic of slow speech.</Paragraph>
    <Paragraph position="14">  aspects of system and user performance over this period.</Paragraph>
    <Section position="1" start_page="151" end_page="151" type="sub_section">
      <SectionTitle>
Method
</SectionTitle>
      <Paragraph position="0"> We were interested in not only how a casual user approaches a spoken language system, but also how his or her skill in using the system develops over time.</Paragraph>
      <Paragraph position="1"> Accordingly, we had a total of 8 participants complete a series of 40 spreadsheet tasks.</Paragraph>
      <Paragraph position="2"> The task chosen for this study was the entry of personal financial data from written descriptions of various items in a fictitious person's monthly finances. An attempt was made to make each version of the task comparable in the amount of information it contained and in the number of complex arithmetic operations required. On the average, each task required entering 38 pieces of financial information, an average of 6 of these entries required arithmetic operations such as addition and multiplication. Movement within the worksheet, although generally following a top to bottom order, skipped around, forcing the user to make arbitrary movements, including off-screen movements.</Paragraph>
      <Paragraph position="3"> Users were presented with preformatted worksheets containing appropriate headings for each of the items they would have to enter. In addition, each relevant cell location was given a label that would allow the user to access it using symbolic movement instructions (as defined in \[9\]).</Paragraph>
      <Paragraph position="4"> The information to be entered was presented on separate sheets of paper, one entry to a sheet, conmined in a binder positioned to the side of the workstation. This was done to insure that all users dealt with the information in a sequential manner and would follow a predetermined movement sequence within the worksheet. To aid the user, the bottom of each sheet gave the category heading for the information to be entered and, if existing, a symbolic label for the cell into which the information was to be entered.</Paragraph>
      <Paragraph position="5"> PROCEDURE AND DESIGN. All participants performed 40 tasks. The first 30 tasks were completed in a block, over several days. The last ten were completed after an interval of about one month. The purpose of the latter was to determine the extent to which users remembered their initial extended experience with the voice spreadsheet and to what degree this retest would reflect the performance gains realized over the course of the original block of sessions. Since we were interested in studying a spoken language system in an environment that realistically reflects the settings in which such a system might eventually be used, we made no special attempt to locate the experiment in a benign environment or to control the existing one.</Paragraph>
      <Paragraph position="6"> The workstation was located in an open laboratory and was not surrounded by any special enclosure.</Paragraph>
      <Paragraph position="7"> At the beginning of each session, each participant was given a standard-format typing test to determine their facility with the keyboard. The typing test revealed two categories of participant, touch typists (3 people) with a mean typing rate of 63 words per minute (wpm) and &amp;quot;hunt and peck&amp;quot; typists (5 people), with a mean typing rate of 31 wpm. Task modality (whether speech or typing) alternated over the course of the experiment, each successive task being carried out in a different modality. To control for order and task-version effects the initial modality and the sequence of tasks (first-to-last vs last-to-firs0 was varied to produce all possible combinations (four). Two people were assigned to each combination.</Paragraph>
      <Paragraph position="8"> The participants were informally solicited from the university community through personal contact and bulletin board announcements. There were 3 women and 5 men, ranging in age from 18 to 26 (mean of 22).</Paragraph>
      <Paragraph position="9"> With the exception of one person who was of English/Korean origin, all participants were native speakers of English. All had previous experience with spreadsheets, an average of 2.3 years (range 0.75 to 5), though current usage ranged from daily to &amp;quot;several times a year&amp;quot;. None of the participants reported any previous experience with speech recognition systems (though one had previously seen a SPHINX demonstration). null</Paragraph>
    </Section>
    <Section position="2" start_page="151" end_page="153" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> The data collected in this study consisted of detailed timings of the various stages of interaction as well as the actual speech uttered over the course of system interaction. The analyses presented in this section are based on the first 30 sessions completed by the 8 participants. null  Recognition performance and language habitability To analyze recognizer performance we captured and stored each utterance spoken as well as the corresponding recognition string produced by the system. All utterances were listened to and an exact lexical transcription produced. The transcription conventions are described more fully in \[8\], but suffice it to note that in addition to task-relevant speech, we coded a variety of spontaneous speech phenomena, including speech and non-speech interjections, as well as interrupted words and similar phenomena.</Paragraph>
      <Paragraph position="1"> The analyses reported here are based on a total of 12507 recorded and transcribed utterances, comprising 43901 tokens. We can use these data to answer a variety of questions about speech produced in a complex problem-solving environment. Recognition performance data are presented in Figure 1. The values plotted represent the error rate averaged across all  The top line in Figure 1 shows exact utterance accuracy, calculated over all utterances in the corpus, including system firings for extraneous noise and abandoned (i.e., user interrupted) utterances. It does not include begin-end detector failures (which produce a zero-length utterance), of which there were on the average 10% per session. Exact accuracy corresponds to utterance accuracy as conventionally reported for speech recognition systems using the NBS scoring algorithm \[5\]. The general trend of recognition performance over time is improvement, though the improvement appears to be fairly gradual. The improvement indicates that users are sufficiently aware of what might improve system performance to modify their behavior accordingly. On the other hand, the amount of control they have over it appears to be limited.</Paragraph>
      <Paragraph position="2"> The next line down shows semantic accuracy, calculated by determining, for each utterance, no matter what its content, whether the correct action was taken by the system 2. Semantic accuracy, relative to exact accuracy, represents the added performance that can be realized by the parsing and understanding components of an SLS. In the present case, the added performance results from the 'silent' influence of the word-pair grammar which is part of the recognizer. Thus, grammatical constraints are enforced not through, say, explicit identification and reanalysis of out-of-language utterances, but implicitly, through the word-pair grammar. The spread between semantic and exact accuracy defines the contribution of higher-level process and is a parameter that can be used to track the performance of &amp;quot;higher-lever' components of a spoken language system.</Paragraph>
      <Paragraph position="3"> The line at the bottom of the graph shows grammaticality error. Grammaticality is determined by first eliminating all non-speech events from the transcribed corpus then passing these filtered utterances through the parsing component of the spreadsheet system. Grammaticality provides a dynamic measure of the coverage provided by the system task language (on the assumption that the user's task language evolves with experience) and is one indicator of whether the language is sufficient for carrying out the task in question.</Paragraph>
      <Paragraph position="4"> The grammaticality function can be used to track a number of system attributes. For example, its value over the period that covers the user's initial experience with a system indicate the degree to which the im2For example, the user might say &amp;quot;LET' S GO DOWN FIVE&amp;quot;, which lies outside the system language. Nevertheless, because of grammatical constraints, the system might force this utterance into &amp;quot;DOWN FIVE&amp;quot;, which happens to be grammatically acceptable and which also happens to cany out the desired action. From the task point of view, this recognition is correct; from the recognition point of view it is, of course, wrong.</Paragraph>
      <Paragraph position="5">  plemented language covers utterances produced by the inexperienced user and provides one measure of how successfully the system designers have anticipated the speech language that users intuitively select for the task. Examined over time, the grammaticality function indicates the speed with which users modify their speech language for the task to reflect the constraints imposed by the implementation and how well they manage to stay within it. Measurement of grammaticality after some time away from the system indicates how well the task language can be retained and is an indication of its appropriateness for the task. We believe that grammaticality is an important component of a composite metric for the language habitability of an SLS and can provide a meaningful basis for comparing different SLS interfaces to a particular application 3.</Paragraph>
      <Paragraph position="6"> Examining the curves for the present system we find, unsurprisingly, that vsc is rather primitive in its ability to compensate for poor recognition performance, as evidenced by how close the semantic accuracy line is to the exact accuracy line. On the other hand, it appears to cover user language quite well, with only an average of 2.9% grammaticality error 4. In all likelihood, this indicates that users found it quite easy to stay within the confines of the task, which in turn may not be surprising given its simplicity.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>