Speech Graffiti vs. Natural Language: Assessing the User Experience

3 Results

3.1 Subjective assessments

Seventeen of the 23 participants preferred Speech Graffiti to the natural language interface. User assessments were significantly higher for Speech Graffiti overall and on each of the six subjective factors, as shown in Fig. 2 (REML analysis: system response accuracy F=13.8, p<0.01; likeability F=6.8, p<0.02; cognitive demand F=5.7, p<0.03; annoyance F=4.3, p<0.05; habitability F=7.7, p<0.02; speed F=34.7, p<0.01; overall F=11.2, p<0.01). Some component statements are reversal items whose values were converted for analysis, so that high scores in all categories are good. All mean SG-ML scores except annoyance and habitability are positive (i.e., > 4), while the NL-ML did not receive a positive mean rating in any category. At the individual level, exactly those users who stated that they preferred the NL-ML to the SG-ML gave the NL-ML higher overall subjective ratings.

Figure 2. Mean user satisfaction for system response accuracy, likeability, cognitive demand, annoyance, habitability, speed, and overall.

Although users with CSE/programming backgrounds tended to give the SG-ML higher user satisfaction ratings than non-CSE/programming participants did, the differences were not significant. Training domain likewise had no significant effect on user satisfaction.

3.2 Objective assessments

Task completion. Task completion did not differ significantly between the two interfaces. In total, just over two thirds of the tasks were completed successfully with each system: 67.4% for the NL-ML and 67.9% for the SG-ML. The average participant completed 5.2 tasks with the NL-ML and 5.4 tasks with the SG-ML. As with user satisfaction, users with a CSE or programming background generally completed more tasks with the SG-ML than non-CSE/programming users did, but again the difference was not significant. Training domain had no significant effect on task completion for either system.

To account for incomplete tasks when comparing the interfaces, we ordered the task completion measures (times or turn counts) for each system, left all incomplete tasks at the end of the list as if they had been completed in "infinite time," and compared the medians; a short sketch of this computation follows the turns-to-completion results below.

Time-to-completion. For completed tasks, the average time users spent on each SG-ML task was lower than for the NL-ML system, though not significantly so: 67.9 versus 71.3 seconds. With incomplete tasks included, the SG-ML again performed better than the NL-ML, with a median time of 81.5 seconds compared to 103 seconds.

Turns-to-completion. For completed tasks, the average number of turns users took per SG-ML task was significantly higher than for the NL-ML system: 8.2 versus 3.8 (F=26.4, p<0.01). With incomplete tasks included, the median SG-ML turns-to-completion was twice that of the NL-ML: 10 versus 5.
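Because incomplete tasks simply sort to the end of the ordered list, the median remains well defined even though a mean would not be. The following minimal Python sketch (ours, not from the paper; the task times are hypothetical) illustrates the computation:

    import math

    def median_with_incompletes(completed, n_incomplete):
        # Incomplete tasks count as "infinite time," so they sort to the
        # end of the ordered list of task measures.
        values = sorted(completed) + [math.inf] * n_incomplete
        n = len(values)
        mid = n // 2
        if n % 2:  # odd count: the middle element
            return values[mid]
        return (values[mid - 1] + values[mid]) / 2  # even count: mean of middle two

    # Hypothetical per-task times in seconds: five completed, one incomplete.
    print(median_with_incompletes([50, 60, 70, 80, 90], 1))  # -> 75.0

The median stays finite as long as more than half of the tasks were completed, which both systems satisfied here.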
Word-error rate. The SG-ML had an overall word-error rate (WER) of 35.1%, compared to 51.2% for the NL-ML. Calculated per user, WER ranged from 7.8% to 71.2% (mean 35.0%, median 30.0%) for the SG-ML and from 31.2% to 78.6% (mean 50.3%, median 48.9%) for the NL-ML. The six users with the highest SG-ML WER were the same six who preferred the NL-ML system, and four of them were the only users in the study whose NL-ML error rate was lower than their SG-ML error rate. This suggests, not surprisingly, that WER is strongly related to user preference. To explore this relationship further, we plotted WER against users' overall subjective assessments of each system; the results are shown in Fig. 3. There is a significant, moderate correlation between WER and user satisfaction for Speech Graffiti (r=-0.66, p<0.01), but no similar correlation for the NL-ML system (r=0.26).

Understanding error. Word-error rate may not be the most useful measure of system performance for many spoken dialogue systems. Because of grammar redundancies, systems can often "understand" an utterance correctly even when some individual words are misrecognized. Understanding error rate (UER) may therefore give a more accurate picture of the error rate a user actually experiences. For this analysis we made only a preliminary attempt at assessing UER: the rates were hand-scored and are thus an approximation of actual UER. For both systems, we calculated UER over entire user utterances rather than over the individual concepts within them.

SG-ML UER per user ranged from 2.9% to 65.5% (mean 26.6%, median 21.1%). The average change per user from WER to UER for the SG-ML interface was -29.2%.

The NL-ML UER differed little from the NL-ML WER: per user it ranged from 31.4% to 80.0% (mean 50.7%, median 48.5%), and the average change per user from NL-ML WER was +0.8%.
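The paper does not spell out how the per-user change from WER to UER is aggregated; a natural reading, and an assumption on our part, is the relative difference (UER - WER) / WER averaged over users. A minimal Python sketch under that assumption, with hypothetical per-user rates:

    def mean_relative_change(wer, uer):
        # Average per-user relative change from WER to UER, in percent.
        # Negative values mean users are understood better than raw word
        # recognition alone would suggest.
        changes = [(u - w) / w for w, u in zip(wer, uer)]
        return 100 * sum(changes) / len(changes)

    # Hypothetical per-user error rates (as fractions) for three users.
    wer = [0.30, 0.40, 0.20]
    uer = [0.21, 0.30, 0.16]
    print(round(mean_relative_change(wer, uer), 1))  # -> -25.0

Under this reading, the near-zero NL-ML figure (+0.8%) indicates that NL-ML word misrecognitions almost always became misunderstandings, while the large negative SG-ML figure (-29.2%) reflects the grammar redundancy discussed above.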