PREDICTING AND MANAGING 
SPOKEN DISFLUENCIES DURING 
HUMAN-COMPUTER INTERACTION* 
Sharon Oviatt 
Computer Dialogue Laboratory & Artificial Intelligence Center 
SKI International, 333 Ravenswood Avenue, Menlo Park, CA. 94025 
ABSTRACT 
This research characterizes the spontaneous spoken disfluen- 
cies typical of human-computer interaction, and presents a 
predictive model accounting for their occurrence. Data were 
collected during three empirical studies in which people spoke 
or wrote to a highly interactive simulated system. The stud- 
ies involved within-subject factorial designs in which input 
modality and presentation format were varied. Spoken dis- 
fluency rates during human-computer interaction were doc- 
umented to be substantially lower than rates typically ob- 
served during comparable human-human speech. Two sep- 
arate factors, both associated with increased planning de- 
mands, were statistically related to increased speech disflu- 
ency rates: (1) length of utterance, and (2) lack of struc- 
ture in the presentation format. Regression techniques re- 
vealed that a linear model based simply on utterance length 
accounts for over 77% of the variability in spoken disflu- 
encies. Therefore, design techniques capable of channeling 
users' speech into briefer sentences potentially could elim- 
inate most spoken disfluencies. In addition, the degree of 
structure in the presentation format was manipulated in a 
manner that successfully elimluated 60 to 70% of all disflu- 
ent speech. The long-term goal of this research is to provide 
empirical guidance for the design of robust spoken language 
technology. 
1. INTRODUCTION 
Recently, researchers interested in spoken language process- 
ing have begun searching for reliable methods to detect and 
correct disfluent input automatically during interactions with 
spoken language systems \[2, 4, 9\]. In general, this research 
has focused on identifying acoustic-prosodic cues for detect- 
ing self-repairs, either alone or in combination with syntac- 
tic, semantic, and pattern matching information. To date, 
however, possible avenues for simply reducing or eliminating 
disfluencies through manipulation of basic interface features 
have not been explored. 
Another underdeveloped but central theme in disfluency re- 
search is the relation between spoken disfluencies and plan- 
ning demands. Although it is frequently claimed that dis- 
fluencies rise with increased planning demands of different 
kinds \[3\], the nature of this relation remains poorly under- 
stood. The major factors contributing to planning have yet 
*This research was supported by Grant No. IRI-9213472 
from the National Science Foundation, contracts from USWest, 
AT&T/NCR, and ATR International to SRI International, and 
equipment donations from Apple Computer, Sun Microsystems, 
and Wacom Inc. 
to be identified and defined in any comprehensive manner, 
or linked to disfluencies and self-repairs. From the viewpoint 
of designing systems, information on the dynarnics of what 
produces disfluencies, and how to structure interfaces to min- 
imize them, could improve the robust performance of spoken 
language systems. 
A related research issue is the extent to which qualitatively 
different types of speech may differ in their disfluency rates. 
That is, does the rate of spoken disfluencies tend to be stable, 
or variable? If variable, do disfluency rates differ systemat- 
ically between human-human and human-computer speech? 
And are disfluency rates sufficiently variable that techniques 
for designing spoken language interfaces might exert much 
leverage in reducing them? To compare disfluency rates di- 
rectly across different types of human-human and human- 
computer interactions, research needs to be based on com- 
parable rate-per-word measures, the same definition of dis- 
fluencies and self-repairs, and so forth, in order to obtain 
meaningful comparisons. 
For the purpose of the present research, past studies by 
the author and colleagues \[1, 6, 7\] were reanalyzed: (1) 
to yield data on the rate of disfluencies for four different 
types of human-human speech, and (2) to conduct com- 
parative analyses of whether human-human disfluencies .dif- 
fer from human-computer ones. In addition, three simula- 
tion studies of human-computer interaction were conducted, 
which generated data on spoken and handwritten disfluen- 
cies. Apart from comparing disfluencies in different com- 
munication modalities, two separate factors associated with 
planning demands were examined. First, presentation format 
was manipulated to investigate whether degree of structure 
might be associated with disfluencies. It was predicted that 
a relatively unconstrained format, which requires the speaker 
to self-structure and plan to a greater degree, would lead to a 
higher rate of speech disfluencies. Second, the rate of disflu- 
encies was examined in sentences of varying length. Spoken 
utterances graduated in length were compared to determine 
whether longer sentences have an elevated rate of disfluencies 
per word, since they theoretically require more planning. Fi- 
nally, implications are outlined for designing future interfaces 
capable of substantially reducing disfluent input. 
2. SIMULATION EXPERIMENTS ON 
HUMAN-COMPUTER 
INTERACTION 
This section outlines three experiments on human spoken and 
handwritten input to a simulated system, with spoken dlsflu- 
222 
encies constituting the primary analytical focus. 
2.1. Method 
Subjects, Tasks, and Procedure- Forty-four subjects 
participated in this research as paid volunteers. A USer- 
vice 'l~ansaction System" was simulated that could assist 
users with tasks that were either (1) verbal-temporal (e.g., 
conference registration or cax rental exchanges, in which 
proper names and scheduling information predominated), or 
(2) computational-numeric (e.g., personal banking or scien- 
tific calculations, in which digits and symbol/sign informa- 
tion predominated). During the study, subjects first received 
a general orientation to the Service Transaction System, and 
then were given practice using it to complete tasks. They re- 
ceived instructions on how to enter information on the LCD 
tablet when writing, speaking, and free to use both morali- 
ties. When speaking, subjects held a stylus on the tablet as 
they spoke. 
People also were instructed on completing tasks in two dif- 
ferent presentation formats. In an unconstrained format, 
they expressed information in an open workspace, with no 
specific system prompts used to direct their speech or writ- 
ing. People simply continued providing information while 
the system responded interactively with confirmations. For 
example, in this format they spoke digits, computational 
signs, and requested totals while holding their stylus on an 
open %cratch pad" area of their LCD screen. During other 
interactions, the presentation format was explicitly struc- 
tured, with linguistic and graphical cues used to structure 
the content and order of people's input as they worked. 
For example, in the verbal-temporal simulations, form-based 
prompts were used to elicit input (e.g., Car pickup lo- 
cation I 0, and in the computational- 
numeric simulation, patterned graphical layouts were used 
to elicit specific digits and symbols/signs. 
Other than specifying the input modality and format, an ef- 
fort was made not to influence the manner in which people 
expressed themselves. People's input was received by an in- 
formed assistant, who performed the role of interpreting ~nd 
responding as a fully functional system would. Essentially, 
the assistant tracked the subject's written or spoken input, 
and clicked on predefined fields at a Sun SPARCstation to 
send confirmations back to the subject. 
Semi-Automatic Simulation Technique- In developing 
this simulation, an emphasis was placed on providing auto- 
mated support for streamlining the simul~ttion to the extent 
needed to create facile, subject-paced interactions with deax 
feedback, and to have compaxable specifications for the differ- 
ent input modalities. In the present simulation environment, 
response delays averaged 0.4 second, with less than a 1-second 
delay in all conditions. In addition, the simulation was or- 
ganized to transmit analogues of human backchannel and 
propositional confirmations, with propositional-level comb- 
inations embedded in a compact transaction receipt. The 
simulation also was designed to be sufficiently automated so 
that the assistant could concentrate attention on monitor- 
ing the accuracy of incoming information, and on maintain- 
ing sufficient vigilance to ensure prompt responding. This 
semi-automation contributed to the fast pace of the simula- 
tion, and to a low rate of technical errors. Details of the 
simulation technique and its capabilities have been detailed 
elsewhere \[8\]. 
Research Design and Data Capture- Three studies 
were completed in which the research design was a com- 
pletely crossed factorial with repeated measures. In all stud- 
ies, the main factors of interest included: (1) communication 
modality - speech-only, pen-only, combined pen/voice, and 
(2) presentation format - form-based, unconstrained. The 
first two studies exmmined disfluencies during communica- 
tion of verbal-temporal content. To test the generality of 
certain findings, a third study was conducted that compared 
disfluencies in computational-numeric content. 
In total, data were available from 528 tasks for analysis of 
spoken and written disfluencies. All human-computer inter- 
actions were videotaped. Hardcopy transcripts also were cre- 
ated, with the subject's handwritten input captured auto- 
matically, and spoken input transcribed onto the printouts. 
Transcript Coding- To summarize briefly, spontaneously 
occurring disfluencies and self-corrections were totaled for 
each subject and condition. The total number of disflu- 
encies per condition then was converted to a rate per 100 
words, and average disfluency rates were summaxized as a 
function of condition and utterance length. Disfluencies 
were classified into the following types: (1) content self- 
corrections- task-content errors that were spontaneously 
corrected as the subject spoke or wrote, (2) false starts-- 
alt~ations to the grammatical structure of an utterance that 
occurred spontaneously as the subject spoke or wrote, (3) 
verbatim repetitions-- retracings or repetitions of a letter, 
phoneme, syllable, word, or phrase that occurred sponta- 
neously as the subject spoke or wrote, (4) frilled pauses-- 
spontaneous nonlexical sounds that frill pauses in running 
speech, which have no analogue in writing, (5) self-corrected 
sp~lllngs and abbreviations-- spontaneously corrected mis- 
spelled words or further specification of abbreviations, which 
occur in writing but have no analogue in speech. 
2.2. Results 
Figure 1 summarizes the percentage of all spoken and writ- 
ten distiuencies representing different categories during com- 
munication of verbal-temporal content (i.e., studies 1 and 
2). However, when people communicated digits (i.e., study 
3), disfluencies representing the diiferent categories were dis- 
tributed differently. Filled pauses dropped from 46% to 
15.5% of all observed disfluencies. In contrast, content cor- 
rections of digits increased from 25% to 34%, repetitions in- 
creased from 21% to 31.5%, and false staxts increased from 
8% to 19% of all disfluencies. This drop in frilled pauses and 
increase in other types of disfluency is niost likely related 
to the much briefer utterance lengths observed during the 
computational-numeric tasks. CCleaxly, the relative distribu- 
tion of different types of disfluency fluctuates with the content 
and structure of the information presented. 
The overall baseline rate of spontaneous disfluencies and self- 
corrections was 1.33 per 100 words in the verbal-ten~poral 
simulations, or a total of 1.51 disfluencies per task set. The 
rate per condition ranged from an average of 0.78 per 100 
223 
100% 
80 
60 
i Speech 
i Wriung 
40 
20 
0 a--~======i~====.~===i=~ +D<=" --°°'/ -+'o+" i 
,- o ~" .m" . .~,',~..':,~ 
Figure 1. Psrcentage cf all spoken anct written disfluencies in 
different care.gores. 
words when speaking to a form, 1.17 when writing to a form, 
1.61 during unconstrained writing, and a high of 1.74 dur- 
ing unconstrained speech. Figure 2 illustrates this rate of 
disfluencies as ~ function of mode and format. 
Wilcoxon Signed Ranks tests revealed no significant modal- 
ity difference in the rate of disfluent input, which averaged 
1.26 per 100 words for speech and 1.39 for writing, T+ = 75 
(N = 17), z < 1. However, the rate of disfluencies was 1.68 
per 100 words in the unconstrained format, in comparison 
with a reduced .98 per 100 words during form-based interac- 
tions. Followup analyses revealed no significant difference in 
the disfluency rate between formats when people wrote, T+ 
= 64.5 (N = 14), p > .20. However, significantly increased 
disfluencies were evident in the unconstrained format com- 
pared to the form-based one when people spoke, T+ = 88 
(N = 14), p < .015, one-tailed. This significant elevation was 
replicated for unconstrained speech that occurred during the 
free choice condition, 7% = 87 (N -- 14), p < .015, one-tailed, 
which simulated a multimodal spoken exchange rather than 
a unimodal one. 
A very similax pattern of disfluency rates per condition 
emerged when people communicated digits. In study 3, the 
baseline rate of spontaneous disfluencies averaged 1.37 per 
100 words, with 0.87 when speaking to a form, 1.10 when 
writing to a. form, 1.42 during unconstrained writing, and a 
high of 1.87 during unconstrained speech. Likewise, Wilcoxon 
Signed Ranks tests revealed no significant dit~erence in the 
disfluency rate between formats when people wrote, T-t- = 
36.5 (N = 11), p > .20, although significantly increased dis- 
fluencies again were apparent in the unconstrained format 
compared to the form-based one when people spoke, T+ = 
.2 
O 
Speech 
1.7 
t.6 WHt~ng 
l.:~' 
1.4' 
I.3' 
I.?T 
l.l' 
1.01 o9 t 
0.8 
0.7 i , 
Fonm-bascd Unconswalnecl 
?r~scnmz/on Form= 
Figure 2. Increasing raze of spoken disfluencics per 100 
words as a function ofsmacmre in presentation format. 
77 (N = 13), p < .015, one-tailed. 
For studies 1 and 2, disfluency rates were examined further 
for specific utterances that were graduated in length from 1 
to 18 words. I First, these analyses indicated that the aver- 
age rate of disfluencies per 100 words increased as a function 
of utterance length for spoken disfluencies, although not for 
written ones. When the rate of spoken disfluencies was com- 
pared for short (I-6 words), medium (7-12 words), and long 
utterances (13-18 words), it increased from 0.66, to 2.14, to 
3.80 disfluencies per 100 words, respectively. Statistical com- 
parisons confirmed that these rates represented significant in- 
creases from short to medium sentences, t = 3.09 (dr = 10), 
p < .006, one-tailed, and also from medium to long ones, t = 
2.06 (dr = 8), p < .04, one-tailed. 
A regression analysis indicated that the strength of predictive 
association between utterance length and disfluency rate was 
P~¢T = .77 (N = 16). That is, 77% of the variance in the 
rate of spoken disfluencies was predictable simply by knowing 
an utterance's specific length. The following simple linear 
model, illustrated in the scatterplot in Figure 3, summarizes 
this relation: l'~j = #Y-I- 13Y.x (X# -/.iX) -I- eij, with a Y-axis 
constant coefficient of-0.32, and a.u X-axis beta coefficient 
representing utterance length of +0.26. These data indicate 
that the demands associated with planning and generating 
longer constructions lead to substantial elevations in the rate 
of disfluent speech. 
To assess whether presentation format had an additional in- 
fluence on spoken disfluency rates beyond that of utterance 
length, comparisons were made of disfluency rates occur- 
1The average utterance length in study 3, in which people con- 
veyed digits during scientific calculations and personal banking 
tasks, was too brief to permit a parallel analysis. 
224 
6 
" Y = -0.32 + 0.26X.~,iw~i 
~3 
" f. . 
'I S , 
°; " ,; " ,'5 2'o 
U tttcran= I~n~h 
Figure 3. Linear regression model summarizing increasing 
rate of spoken disfluencies per 100 words as a function of 
utterance length. 
ring in unconstrained and form-based utterances that were 
matched for length. These analyses revealed that the rate of 
spoken disfluencies also was significantly higher in the uncon- 
strained format than in form-based speech, even with utter- 
ance length controlled, t (paired) -- 2.42 (df = 5), p < .03, 
one-tailed. That is, independent of utterance length, lack of 
structure in the presentation format also was associated with 
elevated disfluency rates. 
From a pragmatic viewpoint, it also is informative to com- 
pare the total number of disfluencies that would require pro- 
cessing during an application. Different design alternatives 
can be compared with respect to effective reduction of total 
disfluencies, which then would require neither processing nor 
repair. In studies 1 and 2, a comparison of the total num- 
ber of spoken disfiuencies revealed that people averaged 3.33 
per task set when using the unconstrained format, which re- 
duced to an average of 1.00 per task set when speaking to 
a form. That is, 70% of all disfluencies were eliminated by 
using a more structured form. Likewise, in study 3, the aver- 
age number of disfluencies per subject per task set dropped 
from 1.75 in the unconstrained format to 0.72 in the struc- 
tured one. In this simulation, a more structured presentation 
format successfully eliminated 59% of people's disfluencies as 
they spoke digits, in comparison with the same people com- 
pleting the same tasks via an unconstrained format. 
During post-experimental interviews, people reported their 
preference to interact with the two different presentation for- 
mats. Results indicated that approximately two-thirds of the 
subjects preferred using the more structured format. This 2- 
to-1 preference for the structured format replicated across 
both the verbal and numeric simulations. 
3. EXPERIMENTS ON 
HUMAN-HUMAN SPEECH 
This section reports on data that were analyzed to explore the 
degree of variability in disfluency rates among different types 
of human-human and human-computer spoken interaction, 
and to determine whether these two classes differ systemati- 
cally. 
3.1. Method 
Data originally collected by the author and colleagues during 
two previous studies were reanalyzed to provide comparative 
information on human-human disfluency rates for the present 
research \[1, 6, 7\]. One study focused on telephone speech, 
providing data on both: (1) two-person telephone conver- 
sations, and (2) three-person interpreted telephone conver- 
sations, with a professional telephone interpreter interme- 
dinting. Methodological details of this study are provided 
elsewhere \[7\]. Essentially, within-subject data were collected 
from 12 native speakers while they participated in task- 
oriented dialogues about conference registration and travel 
arrangements. In the second study, also outlined elsewhere 
\[1, 6\], speech data were collected on task-oriented dialogues 
conducted in each of five different communication modalities. 
For the present comparison, data from two of these modal- 
ities were reanalyzed: (1) two-party face-to-face dialogues, 
and (2) single-party monologues into an audiotape machine. 
A between-subject design was used, in which 10 subjects de- 
scribed how to assemble a water pump. All four types of 
speech were reanalyzed from tape-recordings for the same 
categories of disfluency and self-correction as those coded 
during the simulation studies, and a rate of spoken disflu- 
encies per 100 words was calculated. 
3.2. Comparative Results 
Table 1 summarizes the average speech disfluency rates for 
the four types of human-human and two types of human- 
computer interaction that were studied. Disfluency rates for 
each of the two types of human-computer speech are listed 
in Table 1 for verbal-temporal and computational-numeric 
content, respectively, and are corrected for number of sylla- 
bles per word. All samples of human-human speech reflected 
substantially higher disfluency rates than human-computer 
speech, with the average rates for these categories confirmed 
to be significantly different, t = 5.59 (df = 38), p < .0001, 
one-tailed. Comparison of the average disfluency rate for 
human-computer speech with human monologues, the least 
discrepant of the human-human categories, also replicated 
this difference, t = 2.65 (df = 21), p < .008, one-tailed. The 
magnitude of this disparity ranged from 2-to-ll-times higher 
disfluency rates for human-human as opposed ;to human- 
computer speech, depending on the categories compared. 
Further analyses indicated that the average disfluency rate 
was significantly higher during telephone speech than the 
other categories of human-human speech, t = 2.12 (df = 20), 
p < .05, two-tailed. 
4. DISCUSSION 
Spoken disfluencies are strikingly sensitive to the increased 
planning demands of generating progressively longer utter- 
225 
Type of Spoken Interaction Disfluency 
Rate 
Humau-human speech: 
Two-person telephone call 
Three-person interpreted telephone call 
Two-person face-to-face dialogue 
One-person noninteractive monologue 
Human-computer speech: 
Unconstrained computer interaction 
Structured computer interaction 
8.83 
6.25 
5.50 
3.60 
x.r4 / L8v 
0.r8 / 0.8r 
Table 1: Spoken disfluency rates per 100 words for differ- 
ent types of human-human and simulated human-computer 
interaction. 
ances. Of all the variance in spoken disfluencies in the first 
two studies, 77% was predictable simply by knowing an ut- 
terance's specific length. A linear model was provided, 
Y = -0.32 -F 0.26X, to summarize the predicted rate of spo- 
ken disiluencies as a function of utterance length. Knowledge 
of utterance length alone, therefore, is a powerful predictor 
of speech disfiuencies in human-computer interaction. 
Spoken disfluencies also are influenced substantially by the 
presentation format used during human-computer interac- 
tion. An unconstrained format, which required the speaker 
to self-structure and plan to a. greater degree, led speak- 
ers to produce over twice the rate of disfluencies as & more 
structured interaction. Furthermore, this format effect was 
replicated across unimodal and multimodal spoken input, 
and across qualitatively very different spoken content. Since 
the observed difference between formats occurred in samples 
matched for length, it is clear that presentation format and 
utterance length each exert an independent influence on spo- 
ken disfluency levels. 
In these three studies, a substantial 60 to 70% of all spoken 
disfluencies were eliminated simply by using a more struc- 
tured format. That is, selection of presentation format was 
remarkably effective at channeling a speaker's language to 
be less disfluent. In part, this was accomplished by reducing 
sentential planning demands during use of the structured for- 
mats - i.e., reducing the need for people to plan the content 
and order of information delivered (see Oviatt, forthcoming 
\[5\]). It also was accomplished in part by the relative brevity 
of people's sentences in the structured formats. The percent- 
age of moderate to long sentences increased from 5% of all 
sentences during structured interactions to 20% during un- 
constrained speech-- a 4-fold or 300% increase. In addition, 
whereas the average disfluency rate was only 0.66 for short 
sentences, this rate increased to 2.81 for sentences categorized 
as moderate or lengthy-- a 326% increase. The structured 
format not only was effective at reducing disfluencies, it also 
was preferred by a factor of 2-to-1. 
Wide variability can be expected in the disfluency rates typi- 
cal of qualitatively different types of spoken language. Based 
on the six categories compared here, rates were found to vary 
by a magnitude of 2-to-11-fold between individual categories, 
with the highest rates occurring in telephone speech, and 
the lowest in human-computer interaction. This variability 
suggests that further categories of spoken language should 
be studied individually to evaluate how prone they may be 
to disfluencies, rather than assuming that the phenomenon 
is stable throughout spoken language. Future work explor- 
ing disfluency patterns during more complex multimodal ex- 
changes will be of special interest. 
Finally, future work needs to investigate other major human- 
computer interface features that may serve to decrease plan- 
ning load on users, and to estimate how much impact they 
have on reducing disfluencies. Such information would per- 
mit proactive system design aimed at supporting more robust 
spoken language processing. For future applications in which 
an unconstrained format is preferred, or disfluencies and self- 
repairs otherwise are unavoidable, methods for correctly de- 
tecting and processing the ones that occur also will be re- 
quired. To the extent that promising work on this topic can 
incorporate probabilistic information on the relative likeli- 
hood of a disfluency for a particular utterance (e.g., of length 
N), based on either the present or future predictive models, 
correct detection and judicious repair of actual disfluencies 
may become feasible. 
5. ACKNOWLEDGMENTS 
Sincere thanks to the generous people who volunteered to 
participate in this research as subjects. Thanks also to 
Michael Frank, Martin Fong, and John Dowding for program- 
ming the simulation environment, to Martin Fong and Dan 
Wilk for playing the role of the simulation assistant during 
testing, to Jeremy Gaston, Zak Zaidman, ~nd Aaron Hall- 
mark for careful preparation of transcripts, and to Jeremy 
Gaston, Zak Zaidman, Michelle Wang, and Erik Olsen for 
assistance with data analysis. Finally, thanks to Gary Dell 
and Phil Cohen for helpful manuscript comments. 
226 
References 
1. P. R. Cohen. The pragmatics of referring and the modal- 
ity of communication. Computational Linguistics, 1984, 
10(2):97-146. 
2. D. Hindle. Deterministic parsing of syntactic non- 
fluencies. In Proceedings of the 21st. Annual Meeting 
o\] the ACL, 1983, Cambridge, Mass. 123-128. 
3. W. J. M. Levelt. Speaking: From Intention to Articula- 
tion. ACL/M.I.T. Press, Cambridge, Mass:, 1989. 
4. C. Nakz~tani and J. Hirschberg. A corpus-based study 
of repair cues in spontaneous speech. In Journal of the 
Acoustical Society of America, in press. 
5. S. L. Oviatt. Predicting spoken disfluencies during 
human-computer interaction. Journal manuscript, in 
submission. 
6. S. L. Oviatt and P. R. Cohen. Discourse strtlcture 
and performance efficiency in interactive and noninterac- 
tive spoken modalities. Computer Speech and Language, 1991, 5(4):297-326. 
7. S. L. Oviatt and P. R. Cohen. Spoken language in inter- 
preted telephone dialogues. Computer Speech and Lan- 
guage, 1992, 6:277-302. 
8. S. L. Ovi~tt, P. R. Cohen, M. W. Fong, and M. P. Frank. 
A rapid semi-automatic simulation technique for investi- 
gating interactive speech and handwriting. In Proceed- 
ings of the 199~ ICSLP, 1992, ed. by J. Ohala et al., 
University of Alberta, vol. 2, 1351-1354. 
9. E. Shriberg, J. Bear, and g. Dowding. Automatic de- 
tection and correction of repairs in human-computer di- 
alog. In Proceedings of the DARPA Speech and Natural 
Language Workshop, 1992, Morgan Kanfmann, Inc., San 
Mateo, CA, 23-26. 
227 
