<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1074"> <Title>Mode preference in a simple data-retrieval task</Title> <Section position="3" start_page="0" end_page="364" type="metho"> <SectionTitle> SYSTEM IMPLEMENTATION </SectionTitle> <Paragraph position="0"> The Personal Information Database (PID) component of the OM system [3, 7] served as the database system in this study. Given a search request specified as some combination of first name, last name and affiliation, PID displays a window with the requested information (in this study, the information consisted of name, affiliation and all known telephone numbers).</Paragraph> <Paragraph position="1"> If an unknown name was entered, an error panel came up. If a query was underspecified, a choice panel containing all entries satisfying the query was shown; for example, asking for "Smith" produced a panel showing all Smiths in the database. The existing PID was altered to incorporate a scroll window in addition to the already available keyboard and speech interfaces.</Paragraph> <Paragraph position="2"> The remainder of this section provides detailed descriptions of each input mode.</Paragraph> <Paragraph position="3"> Speech Input The OM system uses a hidden Markov model (HMM) recognizer based on Sphinx [2] and is capable of speaker-independent continuous speech recognition.</Paragraph> <Paragraph position="4"> The subject interacted with the system through a NeXT computer which provided attention management [3] as well as application-specific displays. To offload computation, the recognition engine ran on a separate NeXT computer and communicated through an ethernet connection. For the 731-word vocabulary and perplexity 33 grammar used in the first experiment, the system responded in 2.1 times real time (xRT). Database retrieval was initiated by a command phrase such as SHOW ME ALEX RUDNICKY. While subjects were instructed to use this specific phrase, the system also understood several variants, such as SHOW, GIVE (ME), LIST, etc. The input protocol was "Push and Hold", meaning that the user had to depress the mouse button before beginning to speak and release it after the utterance was complete. In case of a recognition error, subjects were instructed to keep repeating a spoken command until it was processed correctly and the desired information appeared in the result window.</Paragraph> <Paragraph position="5"> Keyboard Subjects were required to click on a field in a window, then type a name into it, followed by a carriage return (which would drop them to the next field or would initiate the retrieval). Three fields were provided: First Name, Last Name and Organization. Subjects were provided with some shortcuts: last names were often unique and might be sufficient for a retrieval. They were also informed about the use of a wildcard character which would allow them to minimize the number of keystrokes needed for a retrieval. Ambiguous search patterns produced a panel of choices; the subject could click on the desired one.</Paragraph>
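To make the retrieval behavior concrete, the following Python sketch shows how a wildcard query against the three fields might behave; the data, field handling and use of fnmatch-style patterns are illustrative assumptions, not details taken from PID.

```python
import fnmatch

# Hypothetical in-memory database: (first, last, organization) -> phone numbers.
DATABASE = {
    ("Alex", "Rudnicky", "CMU"): ["555-0123"],
    ("John", "Smith", "CMU"): ["555-0456"],
    ("Jane", "Smith", "MIT"): ["555-0789"],
}

def lookup(first="*", last="*", org="*"):
    """Return all entries matching the (possibly wildcarded) query fields."""
    return [
        (fields, phones) for fields, phones in DATABASE.items()
        if fnmatch.fnmatch(fields[0].lower(), first.lower())
        and fnmatch.fnmatch(fields[1].lower(), last.lower())
        and fnmatch.fnmatch(fields[2].lower(), org.lower())
    ]

matches = lookup(last="Smith")      # underspecified: two Smiths match
if not matches:
    print("error panel: no such entry")
elif len(matches) > 1:
    print("choice panel:", [m[0] for m in matches])
else:
    print("result window:", matches[0])
```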
<Section position="1" start_page="364" end_page="364" type="sub_section"> <SectionTitle> Scroller </SectionTitle> <Paragraph position="0"> The scroller window displayed the names in the database sorted alphabetically by last name. Eleven names were visible in the window at any one time, providing approximately 4-5% exposure of the 225-name list. The NeXT scroller provides a handle and two arrow buttons for navigation. Clicks on the scrollbar move the window to the corresponding position in the list, and the arrow buttons can be amplified to jump by page when a control key is simultaneously depressed. Each navigation technique was demonstrated to the subject.</Paragraph> <Paragraph position="1"> Session controller The experiment was controlled by a separate process, visible to the subject as a window displaying a name to look up, a field in which to enter the retrieved information and a field containing special instructions such as Please use KEYBOARD only or Use any mode. The subject progressed through the experiment by clicking a button in this window labeled Next; this would display the next name to retrieve. Equidistant from the Next button were three windows corresponding to the three input modes used in the experiment: voice, keyboard and scroller. All modes required a mouse action to initiate input, either a click on the speech input button, a click on a text input field or button in the keyboard window, or the (direct) initiation of activity in the scroller.</Paragraph> <Paragraph position="2"> [Figure 1: Trial time line, showing events logged by the control program. Marked points include T0, T1, T2, ... and the segments Ready, Acquire task, Initiate, Select mode, Response, Travel and End-application processing.]</Paragraph> <Paragraph position="3"> Instrumentation All applications were instrumented to generate a stream of time-stamped events corresponding to user and system actions. Figure 1 shows the time line for a single trial. In addition to the overall timeline, each mode was also instrumented to generate logging events corresponding to significant internal events. All logged events were time-stamped using absolute system time, then merged in analysis to produce a composite timeline corresponding to the entire experimental session.</Paragraph> <Paragraph position="4"> The merged event stream was processed using a hierarchical set of finite-state machines (FSMs). Figure 2 shows the FSM for a single transaction with the database retrieval program. [Figure 2: FSM for a single transaction. From the initial state (0) the subject can click the Next button to move to state 1, at which point the subject has a name to look up and can initiate a query. Queries are described by mode-specific FSMs which are invoked within this state. If properly formed, a query will produce a database retrieval and move the transaction to state 4. The subject can opt to enter a response, moving the transaction to state 2, or to repeat queries (by re-entering state 1). At this point, the subject is ready to begin a new trial by transitioning to state 0.] Figure 3 shows the FSM for the voice mode. During the analysis process, the latter FSM (as well as the FSMs for keyboard and scroller) would be invoked within state 1 of the transaction FSM (Figure 2). An intermediate level of analysis (corresponding to conditions) is also used to simplify analysis. Arcs in the FSMs correspond to observable events, either system outputs or user inputs. The products of the analysis include transition frequencies for all arcs in an FSM as well as transition times. The analysis can be treated in terms of Markov chains [6] to compactly describe recognition error, user-mode preferences and other system characteristics.</Paragraph> </Section> </Section>
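The logging-and-analysis pipeline described above can be illustrated with a short sketch: merge the per-application event logs by absolute time stamp, then drive a transition table that mirrors the transaction FSM. The event names and the toy transition table below are assumptions for illustration; the paper's actual FSMs are hierarchical and considerably richer.

```python
from heapq import merge

# Hypothetical time-stamped logs (seconds, event) from two instrumented processes.
controller_log = [(0.0, "next_clicked"), (5.2, "response_entered")]
mode_log = [(1.1, "query_initiated"), (3.0, "db_retrieval")]

# Composite session timeline, merged by absolute time stamp.
timeline = list(merge(controller_log, mode_log))

# Toy transaction FSM as a transition table: (state, event) -> next state.
TRANSITIONS = {
    (0, "next_clicked"): 1,      # subject now has a name to look up
    (1, "query_initiated"): 1,   # a mode-specific FSM would run inside state 1
    (1, "db_retrieval"): 4,      # a well-formed query produced a retrieval
    (4, "response_entered"): 2,  # subject entered the retrieved number
}

state, prev_t = 0, 0.0
arc_counts, arc_times = {}, {}
for t, event in timeline:
    arc = (state, event)
    arc_counts[arc] = arc_counts.get(arc, 0) + 1      # transition frequencies
    arc_times.setdefault(arc, []).append(t - prev_t)  # transition times
    state, prev_t = TRANSITIONS[arc], t

print(arc_counts)   # per-arc frequencies, as tabulated in the analysis
```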
<Section position="4" start_page="364" end_page="366" type="metho"> <SectionTitle> USER MODE PREFERENCE IN DATA RETRIEVAL </SectionTitle> <Paragraph position="0"> The purpose of the first experiment was to establish what mode-preference patterns users would display when using the PID system. To ensure that subjects were equally familiar with each of the input modes, the experiment was divided into two parts (although it was run as a single session, without breaks). In the first part, subjects were asked to perform 20 retrievals using each mode. Initial testing determined that this was sufficient to acquaint the subjects with the operation of each mode. In the second part, they were instructed to use "any mode", with the expectation that they would choose on the basis of their assessment of the suitability of each mode. A total of 55 entries were presented in the second part.</Paragraph> <Paragraph position="1"> The same sequence of 60 entries was used for the familiarization stage for all subjects. However, the order in which the subject was exposed to the different modes was counterbalanced according to a Latin square. Three different blocks of test items (each containing 55 entries) were used, for a total of nine different combinations.</Paragraph>
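A cyclic Latin square is one simple way to realize the counterbalancing described; the sketch below crosses three mode orders with three test blocks to yield the nine combinations. The specific square and the block labels are assumptions, not taken from the paper.

```python
from itertools import product

MODES = ["voice", "keyboard", "scroller"]

# Cyclic 3x3 Latin square: each mode occupies each ordinal position exactly once.
latin_square = [MODES[i:] + MODES[:i] for i in range(len(MODES))]

# Crossing the three mode orders with three test-item blocks (hypothetical
# labels) gives the nine subject conditions mentioned in the text.
TEST_BLOCKS = ["block_A", "block_B", "block_C"]
for i, (order, block) in enumerate(product(latin_square, TEST_BLOCKS), 1):
    print(f"condition {i}: familiarization order {order}, test items {block}")
```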
<Paragraph position="2"> Details about the operation of the different modes, as well as the experiment controller, were explained to the subject during a practice session prior to the experiment proper (a total of four practice retrievals were performed by the subject in this phase).</Paragraph> <Section position="1" start_page="365" end_page="365" type="sub_section"> <SectionTitle> Subjects </SectionTitle> <Paragraph position="0"> Nine subjects participated in this study, 7 male and 2 female. All had had some previous exposure to speech systems, primarily through their participation in ongoing speech data collection efforts conducted by our research group. This prior exposure ensured that the subjects were familiar with the mechanics of using a microphone and of interacting with a computer by voice. No attempt was made to select on demographic characteristics or on computer skills. The group consisted primarily of students, none of whom, however, were members of our research group.</Paragraph> </Section> <Section position="2" start_page="365" end_page="366" type="sub_section"> <SectionTitle> Results and Analysis </SectionTitle> <Paragraph position="0"> A finite state machine (FSM) description of user behavior was used to analyze session data. Separate FSMs were defined for the condition, transaction, sequence and intra-modal levels and were used to tabulate metrics of interest.</Paragraph> <Paragraph position="1"> Table 1 shows the durations of transactions for each of the modes during the familiarization phase. A transaction is timed from the click on the Next button to the carriage return terminating the entry of the retrieved telephone number. Speech input leads to the longest transaction times. Input time measures the duration between the initiation of input and the system response (note that in the first experiment these times include recognition time as well as the consequences of misrecognition, i.e., having to repeat an input). Here speech is also at a disadvantage (though note that the duration of a single utterance is only 2.464 sec).</Paragraph> <Paragraph position="2"> Transaction durations for the modes are statistically different (F(2, 14) = 5.54, MSerr = 0.836, p < 0.05), though in individual comparisons only voice and scroller differ (p < 0.05; the Newman-Keuls procedure was used for this and all subsequent comparisons). Order of presentation was a significant factor (F(2, 14) = 8.3, p < 0.01), with the first mode encountered requiring the greatest amount of time.</Paragraph> <Paragraph position="3"> Table 2 shows the choice of mode in the Free block. The mixed-mode line refers to cases where subjects would first attempt a lookup in one mode and then switch to another (for example, because of misrecognition in the speech mode). The right-hand column in the table shows the first mode chosen in a mixed-mode transaction. In this case, voice is preferred 62.8% of the time as a first choice. The pattern of choices is statistically significant (F(2, 14) = 6.31, MSerr = 288, p < 0.01), with speech preferred significantly more than either keyboard or scroller (p < 0.05).</Paragraph>
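Tabulating these choices from the merged logs is straightforward; the sketch below classifies each Free-block trial as single-mode or mixed-mode and records the first mode attempted in mixed trials. The trial records shown are invented for illustration.

```python
from collections import Counter

# Hypothetical per-trial records: the ordered sequence of modes used in a trial.
trials = [
    ["voice"],
    ["voice", "keyboard"],   # mixed mode: voice attempted first, then keyboard
    ["keyboard"],
    ["voice"],
    ["scroller"],
]

choice, first_in_mixed = Counter(), Counter()
for modes in trials:
    if len(modes) == 1:
        choice[modes[0]] += 1
    else:
        choice["mixed"] += 1
        first_in_mixed[modes[0]] += 1   # first choice within a mixed transaction

total = sum(choice.values())
for mode, n in choice.most_common():
    print(f"{mode}: {100 * n / total:.1f}%")
print("first mode in mixed trials:", dict(first_in_mixed))
```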
<Paragraph position="4"> This experiment suggests that speech is the preferred mode of interaction for the task we examined. This is particularly notable since speech is the least efficient of the three modes offered to the user, as measured in traditional terms such as time-to-completion. Most previous investigations (see, e.g., the review in [4]) have concentrated on this dimension, treating it as the single most important criterion for the suitability of speech input. The present result suggests that other aspects of performance may be equally important to the user.</Paragraph> </Section> </Section> <Section position="5" start_page="366" end_page="367" type="metho"> <SectionTitle> EXTENDED EXPERIENCE </SectionTitle> <Paragraph position="0"> One possible explanation of the above result is that it is due to a novelty effect. That is, users displayed a preference for speech input in this task not because of any inherent preference or benefit but simply because it was something new and interesting. Over time we might expect the novelty to wear off and users to refocus their attention on system response characteristics and perhaps shift their preference.</Paragraph> <Paragraph position="1"> To test this possibility, we performed a second experiment, scaling up the amount of time spent on the task by different amounts. Since it was not possible to predict the length of a novelty effect a priori, three separate experience levels were examined. A total of 9 subjects participated (4 male and 5 female): 3 did 720 trials, 3 did 1440 trials and 3 did 2160. This is in contrast to the 115 trials per subject in the first experiment.</Paragraph> <Section position="1" start_page="366" end_page="367" type="sub_section"> <SectionTitle> Method </SectionTitle> <Paragraph position="0"> Based on observations made during the first experiment, several changes were made to the system, primarily to make the speech and keyboard inputs more efficient. Recognition response was improved from 2.1 xRT to 1.5 xRT by the use of an IBM RS/6000-530 computer as the recognition engine. Keyboard entry was made more efficient by eliminating the need for the user to clear entry fields prior to entry. These changes resulted in improved transaction times for these two modes relative to the scroller, which was unchanged except for a slight reduction in exposure (due to an increase in the number of entries to 240, made to accommodate details of the design).</Paragraph> </Section> <Section position="2" start_page="367" end_page="367" type="sub_section"> <SectionTitle> Results and Analysis </SectionTitle> <Paragraph position="0"> The mean preference for the different modes in this experiment is shown in Table 3. Subjects display a strong bias in favor of voice input (74.9%). Preference for voice across individual subjects ranged from 28% to 91%, with all but one subject (S3) showing preference levels above 70% (the median preference is 82.5%). Differences in mode preference are significant (F(2, 16) = 34.6, MSerr = 0.037, p < 0.01) and the preference is greater (p < 0.01) for voice than for either of the other input modes.</Paragraph> <Paragraph position="1"> Since some of the names in the database were difficult to pronounce, we also tabulated choice data excluding such names. Nineteen names (about 8% of the database) were excluded on the basis of ratings provided by subjects.1 The data thus filtered are shown in Table 3; in this case (for names that subjects were reasonably comfortable about pronouncing) preference for speech rises to 79.9% (median of 86.1%).</Paragraph> <Paragraph position="2"> 1 Participants in this experiment rated each name in the database prior to the experiment itself. A name was presented to the subject, who was asked to rate, on a 4-point scale, their lack of confidence in their ability to pronounce it. They then heard a recording of the name pronounced as expected by the recognizer and finally rated the degree to which the canonical pronunciation disagreed with their own expectation. A conservative criterion was used to place names on the exclusion list: any name for which both ratings averaged over 1.0 (on a 0-3 scale) was excluded.</Paragraph>
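The exclusion criterion in the footnote reduces to a simple filter; a sketch, with invented names and ratings on the paper's 0-3 scale:

```python
# Hypothetical ratings: name -> (pronunciation-confidence ratings,
# canonical-mismatch ratings), each a list of per-subject scores
# on a 0-3 scale, higher = worse.
ratings = {
    "Rudnicky": ([0.4, 0.8], [0.2, 0.5]),
    "Przybylski": ([2.1, 1.8], [1.5, 2.0]),
}

def mean(xs):
    return sum(xs) / len(xs)

# Conservative criterion: exclude only if BOTH ratings average over 1.0.
excluded = {
    name for name, (confidence, mismatch) in ratings.items()
    if mean(confidence) > 1.0 and mean(mismatch) > 1.0
}
print(excluded)   # -> {'Przybylski'}
```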
<Paragraph position="3"> Table 4 shows the mean transaction and input times for the second experiment, computed over subjects. Compared to the first experiment, these times are faster, probably reflecting the greater amount of experience with the task for the second group of subjects. Transaction times are significantly different, with scroller times longer than keyboard or speech times (p < 0.01), which in turn do not differ. If subjects were attending to the time necessary to carry out the task, keyboard and voice should have been chosen with about equal frequency. The subjects in this experiment nevertheless chose speech over keyboard (and scroller) input.</Paragraph> <Paragraph position="4"> Figure 4 shows preference for voice input over the course of the experiment. Preference for speech increases over time, and begins to asymptote at about 10-15 blocks (representing about 250 utterances). This pattern suggests that speech input, while highly appealing to the user, requires a certain amount of confidence building, certainly a period of extended familiarization with what is, after all, a novel input mode. Additional investigation would be needed, however, to establish the accuracy of this observation. In any case, this last result underlines the importance of providing sufficient training.</Paragraph> <Paragraph position="5"> As can be seen in Figure 4, preference for speech shows no sign of decreasing over time for the duration examined in this experiment. Preference for voice input appears to be robust. The 36-block version of the experiment took on the average 8-9 hours to complete, with subjects working up to 2 hours per day.</Paragraph> <Paragraph position="6"> A possible explanation for this finding may be that, rather than basing their choice on overall transaction time, users focus on simple input time (in both experiments voice input is the fastest). This would imply that users are willing to disregard the cost of recognition error, at least for the error levels associated with the system under investigation. Data from follow-up experiments not reported here suggest that this may be the case: increasing the duration of the query utterance decreases the preference for speech.</Paragraph> </Section> </Section> </Paper>