Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 995–1002, Vancouver, October 2005. ©2005 Association for Computational Linguistics
The Vocal Joystick: A Voice-Based Human-Computer Interface for
Individuals with Motor Impairments∗
Jeff A. Bilmes†, Xiao Li†, Jonathan Malkin†, Kelley Kilanski‡, Richard Wright‡,
Katrin Kirchhoff†, Amarnag Subramanya†, Susumu Harada§, James A.
Landay§, Patricia Dowden¶, Howard Chizeck†
†Dept. of Electrical Engineering
§Dept. of Computer Science & Eng.
‡Dept. of Linguistics
¶Dept. of Speech & Hearing Science
University of Washington
Seattle, WA
Abstract
We present a novel voice-based human-
computer interface designed to enable in-
dividuals with motor impairments to use
vocal parameters for continuous control
tasks. Since discrete spoken commands
are ill-suited to such tasks, our interface
exploits a large set of continuous acoustic-
phonetic parameters like pitch, loudness,
vowel quality, etc. Their selection is opti-
mized with respect to automatic recogniz-
ability, communication bandwidth, learn-
ability, suitability, and ease of use. Pa-
rameters are extracted in real time, trans-
formed via adaptation and acceleration,
and converted into continuous control sig-
nals. This paper describes the basic en-
gine, prototype applications (in particu-
lar, voice-based web browsing and a con-
trolled trajectory-following task), and ini-
tial user studies confirming the feasibility
of this technology.
1 Introduction
Many existing human-computer interfaces (e.g.,
mouse and keyboard, touch screens, pen tablets,
etc.) are ill-suited to individuals with motor
impairments. Specialized (and often expensive)
human-computer interfaces that have been devel-
oped specifically for this target group include sip
and puff switches, head mice, eye-gaze devices, chin
joysticks and tongue switches. While many indi-
viduals with motor impairments have complete use
of their vocal system, these devices make little use
of it. Sip and puff switches, for example, have low
communication bandwidth, making it impossible to
achieve more complex control tasks.
∗This material is based on work supported by the National
Science Foundation under grant IIS-0326382.
Natural spoken language is often regarded as
the obvious choice for a human-computer inter-
face. However, despite significant research efforts
in automatic speech recognition (ASR) (Huang et
al., 2001), existing ASR systems are still not suf-
ficiently robust to a wide variety of speaking condi-
tions, noise, accented speakers, etc. ASR-based in-
terfaces are therefore often abandoned by users after
a short initial trial period. In addition, natural speech
is optimal for communication between humans but
sub-optimal for manipulating computers, windows-
icons-mouse-pointer (WIMP) interfaces, or other
electro-mechanical devices (such as a prosthetic ro-
botic arm). Standard spoken language commands
are useful for discrete but not for continuous op-
erations. For example, in order to move a cursor
from the bottom-left to the upper-right of a screen,
the user might have to repeatedly utter “up” and
“right” or “stop” and “go” after setting an initial tra-
jectory and rate, which is quite inefficient. For these
reasons, we are developing alternative and reusable
voice-based assistive technology termed the “Vocal
Joystick” (VJ).
2 The Vocal Joystick
The VJ approach has three main characteristics:
1) Continuous control parameters: Unlike standard
speech recognition, the VJ engine exploits continu-
ous vocal characteristics that go beyond simple se-
quences of discrete speech sounds (such as syllables
or words) and include e.g., pitch, vowel quality, and
loudness, which are then mapped to continuous con-
trol parameters.
2) Discrete vocal commands: Unlike natural speech,
the VJ discrete input language is based on a pre-
designed set of sounds. These sounds are selected
with respect to acoustic discriminability (maximiz-
ing recognizer accuracy), pronounceability (reduc-
ing potential vocal strain), mnemonic characteris-
tics (reducing cognitive load), robustness to environ-
mental noise, and application appropriateness.
3) Reusable infrastructure: Our goal is not to create
a single application but to provide a modular library
that can be incorporated by developers into a variety
of applications that can be controlled by voice. The
VJ technology is not meant to replace standard ASR
but to enhance and be compatible with it.
2.1 Vocal Characteristics
Three continuous vocal characteristics are extracted
by the VJ engine: energy, pitch, and vowel qual-
ity, yielding four specifiable continuous degrees of
freedom. The first of these, localized acoustic en-
ergy, is used for voice activity detection. In addi-
tion, it is normalized relative to the current vowel
detected (see Section 3.3), and is used by our cur-
rent VJ-WIMP application (Section 4) to control the
velocity of cursor movements. For example, a loud
voice causes a faster movement than does a quiet
voice. The second parameter, pitch, is also extracted;
it is not yet mapped to any control dimension in the
VJ-WIMP application, but will be in the future.
The third parameter is vowel quality. Unlike conso-
nants, which are characterized by a greater degree of
constriction in the vocal tract, vowels have much in-
herent signal energy and are therefore well-suited to
environments where both high accuracy and noise-
robustness are crucial. Vowels can be characterized
using a 2-D space parameterized by F1 and F2, the
first and second vocal-tract formants (resonant fre-
quencies). We initially experimented with directly
extracting F1/F2 and using them for directly spec-
ifying 2-D continuous control. While we have not
ruled out the use of F1/F2 in the future, we have
so far found that even the best F1/F2 detection al-
gorithms available are not yet accurate enough for
precise real-time specification of movement. There-
fore, we classify vowels directly and map them onto
the 2-D vowel space characterized by degree of
constriction (i.e., tongue height) and tongue body
position (Figure 1). In our VJ-WIMP application, we use
the four corners of this chart to map to the 4 principal
directions of up, down, left, and right as shown
in Figure 2 (note that the two figures are flipped and
rotated with respect to each other). We have four
different VJ systems running: A) a 4-class system
allowing only the specification of the 4 principal
directions; B) a 5-class system that also includes the
phone [ax] to act as a carrier when wishing to vary
only pitch and loudness; C) an 8-class system that
includes the four diagonal directions; and D) a 9-class
system that includes all phones and directions. Most
of the discussion in this paper refers to the 4-class
system.

                         Tongue Body Position
                         Front    Central   Back
Degree of        High    [iy]     [ix]      [uw]
Constriction     Mid     [ey]     [ax]      [ow]
                 Low     [ae]     [a]       [aa]
Figure 1: Vowel configurations as a function of their
dominant articulatory configurations.

A fourth vocal characteristic is also extracted
by the VJ engine, namely discrete sounds. These
sounds may correspond to button presses as on a
mouse or joystick. The sounds depend on the
application and are chosen according to
characteristic 2 above.
3 The VJ Engine
Our system-level design goals are modularity, low
latency, and maximal computational efficiency. For
this reason, we share common signal processing
operations across multiple signal extraction modules,
which yields real-time performance and leaves con-
siderable computational headroom for the back-end
applications being driven by the VJ engine.
Figure 3 shows the VJ engine architecture having
three modules: signal processing, pattern recogni-
tion, and motion control.
3.1 Signal Processing
(Diagram labeling directions with the vowels [iy], [ix],
[uw], [ey], [ow], [ae], [a], [aa], and [ax].)
Figure 2: Vowel-direction mapping: vowels corre-
sponding to directions.

(Block diagram with three modules. Signal Processing:
acoustic waveform, feature extraction, and the features
energy, NCCF, F1/F2, and MFCC. Pattern Recognition:
energy, vowel classification, pitch tracking, and discrete
sound recognition. Motion Control: space transformation,
adaptation, the motion parameters xy-directions, speed,
and acceleration, and the computer interface driver.)
Figure 3: System organization

The goal of the signal processing module is to
extract low-level acoustic features that can be used in
estimating the vocal characteristics. The features we
use are energy, normalized cross-correlation
coefficients (NCCC), formant estimates, and
Mel-frequency cepstral coefficients (MFCCs). To extract
features, the speech signal is PCM sampled at a rate of
Fs = 16,000 Hz. Energy is measured on a frame-by-frame
basis with a frame size of 25 ms and a frame step of
10 ms. Pitch is extracted with a frame size of 40 ms and
a frame step of 10 ms. Multiple pattern recognition tasks
may share the same acoustic features: for example,
energy and NCCCs are used for pitch tracking, and energy
and MFCCs can be used in vowel classification and
discrete sound recognition. Therefore, it is more
efficient to decouple feature extraction from pattern
recognition, as is shown in Figure 3.
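The framing parameters above can be illustrated with a minimal sketch. It assumes the signal is a plain list of PCM samples and uses the log of the mean squared amplitude as the energy measure; the paper does not state its exact energy definition, so `log_energy` and the function names here are illustrative assumptions, not the VJ engine's implementation.

```python
import math

SAMPLE_RATE_HZ = 16_000   # Fs from the text
ENERGY_FRAME_MS = 25      # frame size used for energy
PITCH_FRAME_MS = 40       # frame size used for pitch
FRAME_STEP_MS = 10        # frame step shared by both


def frame_signal(samples, frame_ms, step_ms=FRAME_STEP_MS, rate=SAMPLE_RATE_HZ):
    """Split a list of PCM samples into overlapping fixed-length frames."""
    frame_len = rate * frame_ms // 1000   # 400 samples for 25 ms at 16 kHz
    step_len = rate * step_ms // 1000     # 160 samples for 10 ms at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step_len
    return frames


def log_energy(frame, floor=1e-10):
    """Log mean squared amplitude (an assumed energy definition)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return math.log(max(mean_square, floor))


def extract_energy(samples):
    """One log-energy value per 25 ms frame, computed every 10 ms."""
    return [log_energy(f) for f in frame_signal(samples, ENERGY_FRAME_MS)]
```

Because both feature streams use the same 10 ms step, their frames stay time-aligned even though the window lengths differ, which is what lets several recognizers share one feature extraction pass.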
3.2 Pattern Recognition
The pattern recognition module uses the acoustic
features to extract desired parameters. The estima-
tion and classification system must simultaneously
perform energy computation (available from the in-
put), pitch tracking, vowel classification, and dis-
crete sound recognition.
Many state-of-the-art pitch trackers are based on
dynamic programming (DP). This, however, often
requires the meticulous design of local DP cost
functions. The forms of these cost functions are
usually empirically determined and/or their parameters
are tuned by algorithms such as gradient descent
(Talkin, 1995). Since different languages and
applications may follow very different pitch transition
patterns, the cost functions optimized for certain
languages and applications may not be the most
appropriate for others. Our VJ system utilizes a
graphical model mechanism to automatically optimize the
parameters of these cost functions, an approach that has
been shown to yield state-of-the-art performance (Li et
al., 2004; Malkin et al., 2005).
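To make the role of the DP cost functions concrete, the following is a minimal sketch of a generic DP pitch tracker with hand-set costs. It is not the graphical-model method used in VJ: the local cost (negative candidate score) and the transition cost (absolute log-frequency jump scaled by `transition_weight`) are illustrative assumptions of exactly the kind that the VJ system learns automatically.

```python
import math


def track_pitch(candidates, transition_weight=1.0):
    """Pick one pitch candidate per frame by dynamic programming.

    candidates: list over frames; each frame is a non-empty list of
    (f0_hz, local_score) pairs, where a higher local_score means a
    stronger candidate (e.g. a correlation peak height).
    """
    # best[t][k] = lowest total cost of any path ending at candidate k of frame t
    best = [[-score for _, score in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for f0, score in candidates[t]:
            # Hand-set transition cost: penalize large jumps in log-frequency.
            options = [
                best[t - 1][j] + transition_weight * abs(math.log(f0 / prev_f0))
                for j, (prev_f0, _) in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[j_best] - score)
            row_back.append(j_best)
        best.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][k][0])
        k = back[t][k]
    return path[::-1]
```

With a nonzero `transition_weight`, an isolated octave-doubled candidate with a slightly higher local score is rejected in favor of the smooth path; with the weight set to zero the tracker simply follows the strongest candidate in each frame. How much to penalize jumps is the parameter choice that the text describes as language- and application-dependent.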
For frame-by-frame vowel classification, our de-
sign constraints are the need for extremely low la-
tency and low computational cost. Probability es-
timates for vowel classes thus need to be obtained
as soon as possible after the vowel has been uttered
or after any small change in voice quality has oc-
curred. It is well known that models of vowel clas-
sification that incorporate temporal dynamics such
as hidden Markov models (HMMs) can be quite ac-
curate. However, the frame-by-frame latency re-
quirements of VJ make HMMs unsuitable for vowel
classification since HMMs estimate the likelihood
of a model based on the entire utterance. An alter-
native is to utilize causal “HMM-filtering”, which
computes likelihoods at every frame based on all
frames seen so far. We have empirically found,
however, that slightly non-causal and quite localized
estimates of the vowel category probability
are sufficient to achieve user satisfaction. Specifically,
we obtain probability estimates of the form
$p(V_t \mid X_{t-\tau}, \ldots, X_{t+\tau})$, where $V_t$ is a
vowel class, and $X_{t-\tau}, \ldots, X_{t+\tau}$ are feature
frames within a length $2\tau + 1$ window of features
centered at time $t$. After several empirical trials, we
decided on neural networks for vowel classification
because of the availability of efficient discriminative
training algorithms and their computational simplicity.
Specifically, we use a simple 2-layer multi-layer
perceptron (Bishop, 1995) whose input layer consists of
$26 \times 7 = 182$ nodes, where 26 is the dimension of
$X_t$, the MFCC feature vector, and $2\tau + 1 = 7$ is the
number of consecutive frames, and that has 50 hidden
nodes (the numbers 7 and 50 were determined
empirically). The output layer has 4 output nodes
representing 4 vowel probabilities. During training,
the network is optimized to minimize the Kullback-
Leibler (K-L) divergence between the output and the
true label distribution, thus achieving the
aforementioned probabilistic interpretation.
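The network dimensions above can be sketched as follows. The layer sizes (182 inputs, 50 hidden nodes, 4 outputs) come from the text; the sigmoid hidden activation, the softmax output, the edge-frame padding, and the random initialization are assumptions made only so that the sketch runs, and all function names are hypothetical.

```python
import math
import random

FEATURE_DIM = 26      # dimension of one MFCC feature vector X_t
CONTEXT = 3           # tau, so the window holds 2 * tau + 1 = 7 frames
HIDDEN = 50
NUM_VOWELS = 4


def stack_window(frames, t, tau=CONTEXT):
    """Concatenate frames t - tau .. t + tau into one input vector."""
    window = []
    for k in range(t - tau, t + tau + 1):
        k = min(max(k, 0), len(frames) - 1)   # repeat edge frames (an assumption)
        window.extend(frames[k])
    return window


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def affine(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def vowel_posteriors(params, x):
    """Forward pass: sigmoid hidden layer (assumed), softmax output (assumed)."""
    w1, b1, w2, b2 = params
    hidden = [1.0 / (1.0 + math.exp(-a)) for a in affine(w1, b1, x)]
    return softmax(affine(w2, b2, hidden))


def kl_to_label(posteriors, label):
    """K-L divergence from a one-hot label, which reduces to -log p(label)."""
    return -math.log(posteriors[label])


def random_params(rng, n_in=FEATURE_DIM * (2 * CONTEXT + 1)):
    """Small random weights, standing in for trained parameters."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return matrix(HIDDEN, n_in), [0.0] * HIDDEN, matrix(NUM_VOWELS, HIDDEN), [0.0] * NUM_VOWELS
```

With $\tau = 3$ and a 10 ms frame step, the classifier must wait for three future frames before emitting a posterior for frame $t$, which is the "slightly non-causal" look-ahead referred to in the text.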
The VJ engine needs not only to detect that the
user is specifying a vowel (for continuous control)
but also a consonant-vowel-consonant (CVC) pat-
tern (for discrete control) quickly and with a low
probability of confusion (a VJ system also uses C
and CV patterns for discrete commands). Requir-
ing an initial consonant will phonetically distinguish
these sounds from the pure vowel segments used
for continuous control — the VJ system constantly
monitors for changes that indicate the beginning of
one of the discrete control commands. The vowel
within the CV and CVC patterns, moreover, can help
prevent background noise from being mis-classified
as a discrete sound. Lastly, each such pattern cur-
rently requires an ending silence, so that the next
command (a new discrete sound or continuous con-
trol vowel) can be accurately initiated. In all cases, a
simple threshold-based rejection mechanism is used
to reduce false positives.
To recognize the discrete control signals, HMMs
are employed since, as in standard speech recogni-
tion, time warping is necessary to normalize for dif-
ferent signal durations corresponding to the same
class. Specifically, we embed phone HMMs into
“word” (C, CV, or CVC) HMMs. In this way, it
is possible to train phone models using a training
set that covers all possible phones, and then con-
struct an application-specific discrete command vo-
cabulary without retraining by recombining existing
phone HMMs into new word HMMs. Therefore,
each VJ-driven application can have its own appro-
priate discrete sound set.
3.3 Motion Control: Direction and Velocity
The VJ motion control module receives several pat-
tern recognition parameters and processes them to
produce output more appropriate for determining 2-
D movement in the VJ-WIMP application.
Initial experiments suggested that using pitch to
affect cursor velocity (Igarashi and Hughes, 2001)
would be heavily constrained by an individual’s vo-
cal range. Giving priority to a more universal user-
independent VJ system, we instead focused on rela-
tive energy. Our observation that users often became
quiet when trying to move small amounts confirmed
energy as a natural choice. Drastically different in-
trinsic average energy levels for each vowel, how-
ever, meant that comparing all sounds to a global av-
erage energy would create a large vowel-dependent
bias. To overcome this, we distribute the energy per
frame among the different vowels, in proportion to
the probabilities output by the neural network, and
track the average energy for each vowel indepen-
dently. By splitting the power in this way, there is
no effect when probabilities are close to 1, and we
smooth out changes during vowel transitions when
probabilities are more evenly distributed.
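One way to read the per-vowel energy tracking described above is as a probability-weighted running mean, sketched below. The update rule and the names `averages` and `weights` are our interpretation for illustration; the paper does not give the exact formula.

```python
def update_vowel_energy(averages, weights, probs, frame_energy):
    """Update per-vowel running average energies in place.

    averages[i] is the running mean energy of vowel i, and weights[i]
    is the total probability mass that vowel has received so far.
    """
    for i, p in enumerate(probs):
        if p <= 0.0:
            continue
        weights[i] += p
        # Probability-weighted running mean: a frame assigned to vowel i
        # with probability p moves that vowel's average by a share of p.
        averages[i] += (p / weights[i]) * (frame_energy - averages[i])
    return averages
```

When one vowel has probability close to 1, only its average is updated, so the split has no effect; during a transition the frame contributes fractionally to each competing vowel, which smooths the averages as the text describes.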
There are many possible options for determining
velocity (a vector capturing both direction and speed
magnitude) and “acceleration” (a function determin-
ing how the control-to-display ratio changes based
on input parameters), and the different schemes have
a large impact on user satisfaction. Unlike a standard
mouse cursor, where the mapping is from 2-D hand
movement to a 2-D screen, the VJ system maps from
vocal-tract articulatory movement to a 2-D screen,
and the transformation is not as straightforward. All
values are for the current time frame $t$ unless
indicated otherwise. First, a raw direction value is
calculated for each axis $j \in \{x, y\}$ as

$$d_j = \sum_i p_i \cdot \langle v_i, e_j \rangle \qquad (1)$$

in which $p_i = p(V_t = i \mid X_{t-\tau}, \ldots, X_{t+\tau})$ is the
probability for vowel $i$ at time $t$, $v_i$ is a unit vector in
the direction of vowel $i$, $e_j$ is the unit-length positive
directional basis vector along the $j$ axis, and
$\langle v, e \rangle$ is the projection of vector $v$ onto unit
vector $e$. To determine movement speed, we first
calculate a scalar for each axis $j$ as

$$s_j = \sum_i \max\left[0,\; g_i\!\left(p_i \cdot f\!\left(\frac{E}{\mu_i}\right)\right)\right] \cdot |\langle v_i, e_j \rangle|$$

where $E$ is the energy in the current frame, $\mu_i$ is the
average energy for vowel $i$, and $f(\cdot)$ and $g_i(\cdot)$ are
functions used for energy normalization and perceptual
scaling (such as logs and/or cube-roots). This
therefore allocates frame energy to direction based
on the vowel probabilities. Lastly, we calculate the
velocity for axis $j$ at the current frame as

$$V_j = \beta \cdot s_j^{\alpha} \cdot \exp(\gamma s_j), \qquad (2)$$

where β represents the overall system sensitivity and
the other values (α and γ) are warping constants, al-
lowing the user to control the shape of the accelera-
tion curve. Typically only one of α and γ is nonzero.
Setting both to zero results in constant-speed move-
ment along each axis, while α = 1 and γ = 0
gives a linear mapping that will scale motion with
energy but have no acceleration. The current user-
independent system uses β = 0.6, γ = 1.0 and sets
α = 0. Lastly, the final velocity along axis j is $V_j d_j$.
Future publications will report on systematic evalu-
ations of different f(·) and gi(·) functions.
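The direction, speed, and velocity computations of this section can be combined in a short sketch. The defaults β = 0.6, γ = 1.0, and α = 0 are the values reported above; the assignment of vowel indices to the directions in `DIRECTIONS`, the choice of a cube root for f, and a single identity function standing in for every g_i are assumptions for illustration, since the text leaves these open.

```python
import math

# Unit vectors for a 4-class system: up, down, left, right.
# Which vowel maps to which index is a hypothetical choice here.
DIRECTIONS = [(0.0, 1.0), (0.0, -1.0), (-1.0, 0.0), (1.0, 0.0)]


def cube_root(x):
    return x ** (1.0 / 3.0) if x > 0.0 else 0.0


def identity(x):
    return x


def velocity(probs, energy, mean_energy, beta=0.6, alpha=0.0, gamma=1.0,
             f=cube_root, g=identity):
    """Return (vx, vy) for one frame, following Equations (1) and (2)."""
    result = []
    for axis in (0, 1):                       # 0 = x axis, 1 = y axis
        # Equation (1): d_j = sum_i p_i * <v_i, e_j>
        d = sum(p * v[axis] for p, v in zip(probs, DIRECTIONS))
        # s_j = sum_i max(0, g(p_i * f(E / mu_i))) * |<v_i, e_j>|
        s = sum(
            max(0.0, g(p * f(energy / mu))) * abs(v[axis])
            for p, mu, v in zip(probs, mean_energy, DIRECTIONS)
        )
        # Equation (2): V_j = beta * s_j^alpha * exp(gamma * s_j)
        speed = beta * (s ** alpha) * math.exp(gamma * s)
        result.append(speed * d)              # final velocity V_j * d_j
    return tuple(result)
```

Calling `velocity` with `alpha=1.0, gamma=0.0` gives the linear mapping mentioned above, while the defaults give the exponential acceleration curve, so a louder voice relative to the per-vowel average moves the cursor disproportionately faster.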
3.4 Motion Control: User Adaptation
Since vowel quality is used for continuous control,
inaccuracies can arise due to speaker variability ow-
ing to different speech loudness levels, vocal tract
lengths, etc. Moreover, a vowel class articulated by
one user might partially overlap in acoustic space
with a different vowel class from another user. This
imposes limitations on a purely user-independent
vowel classifier. Differences in speaker loudness
alone could cause significant unpredictability. To
mitigate these problems, we have designed an adap-
tation procedure where each user is asked to pro-
nounce four pre-defined vowel sounds, each last-
ing 2-3 seconds, at the beginning of a VJ ses-
sion. We have investigated several novel adaptation
strategies utilizing both neural networks and support
vector machines (SVM). The fundamental idea be-
hind them both is that an initial speaker-independent
transformation of the space is learned using train-
ing data, and is represented by the first layer of a
neural network. Adaptation data then is used to
transform various parameters of the classifier (e.g.,
all or sub-portions of the neural network, or the
parameters of the SVM). Further details of some of these
novel adaptation strategies appear in (Li et al.,
2005), and the remainder will appear in forthcoming
publications. Also, the average energy values of
each vowel for each user are recorded and used to
normalize the speed control rate mentioned above.
Preliminary evaluations on the data collected so far
show that adaptation reduces the vowel classification
error rate by 18% for the 4-class case and by 35% for
the 8-class case. Moreover, informal studies have shown
that users greatly prefer the VJ system after adaptation
to the system before adaptation.
4 Applications and Videos
Our overall intent is for VJ to interface with a va-
riety of applications, and our primary application
so far has been to drive a standard WIMP interface
with VJ controls, what we call the VJ-WIMP ap-
plication. The current VJ version allows left but-
ton clicks (press and release, using the consonant
[k]) as well as left button toggles (using consonant
[ch]) to allow dragging. Since WIMP interfaces
are so general, this allows us to indirectly control
a plethora of different applications. Video demon-
strations are available at http://ssli.ee.washington.edu/vj.
One of our key VJ applications is vocal web
browsing. The video (dated 6/2005) shows exam-
ples of two web browsing tasks, one as an exam-
ple of navigating the New York Times web site, the
other using Google Maps to select and zoom in on a
target area. Section 5 describes a preliminary evalu-
ation on these tasks. We have also started using the
VJ engine to control video games (third video ex-
ample), have interfaced VJ with the Dasher system
(Ward et al., 2000) (we call it the “Vocal Dasher”),
and have also used VJ for figure drawing.
Several additional direct VJ-applications have
also been developed. In particular, we have directly
interfaced the VJ system into a simple blocks world
environment, where more precise object movement
is possible than via the mouse driver. This
environment can draw arbitrary trajectories, and
can precisely measure user fidelity when moving an
object along a trajectory. Fidelity depends both on
positional accuracy and task duration. One use of
this environment shows the spatial direction
corresponding to vocal effort (useful for training, fourth
video example). Another shows a simple robotic
arm being controlled by VJ. We plan to use this
environment to perform formal and precise user-
performance studies in future work.
5 Preliminary User Study
We conducted a preliminary user study¹ to evaluate
the feasibility of VJ and to obtain feedback regarding
specific difficulties in using the VJ-WIMP system.
This study is limited in two respects: 1) it
does not yet involve the intended target population
of individuals with motor impairments, and 2) the
users had only a small amount of time to practice and
become adept at using VJ. The study is nevertheless
indicative of the VJ approach’s overall viability as a novel
voice-based human-computer interface method. The
study quantitatively compares VJ performance with
a standard desktop mouse, and provides qualitative
measurement of the user’s perception of the system.
¹The user study presented here used an earlier version of VJ
than the current improved one described in the preceding pages.
5.1 Experiment Setup
We recruited seven participants ranging in age from 22
to 26, none of whom had any motor impairment.
Of the seven participants, two were female and five
were male. All of them were graduate students in
Computer Science, although none of them had pre-
viously heard of or used VJ. Four of the participants
were native English speakers; the other three had an
Asian language as their mother tongue.
We used a Dell Inspiron 9100 laptop with a 3.2
GHz Intel Pentium IV processor running the Fedora
Core 2 operating system, with a 1280 x 800 24-bit
color display. The laptop was equipped with an ex-
ternal Microsoft IntelliMouse connected through the
USB port which was used for all of the tasks in-
volving the mouse. A head-mounted Amanda NC-
61 microphone was used as the audio input device,
while the audio feedback from the laptop was output
through the laptop speakers. The Firefox browser
was used for all of the tasks, with the browser screen
maximized such that the only portion of the screen
which was not displaying the contents of the web
page was the top navigation toolbar which was 30
pixels high.
5.2 Quantitative and Qualitative Evaluation
At the beginning of the quantitative evaluation, each
participant was given a brief description of the VJ
operations and was shown a demonstration of the
system by a practiced experimenter. The participant
was then guided through an adaptation process dur-
ing which she/he was asked to pronounce the four
directional vowels (Section 3.4). After adaptation,
the participant was given several minutes to practice
using a simple target clicking application. The quan-
titative portion of our evaluation followed a within-
participant design. We exposed each participant to
two experimental conditions which we refer to as
input modalities: the mouse and the VJ. Each par-
ticipant completed two tasks on each modality, with
one trial per task.
The first task was a link navigation task (Link),
in which the participants were asked to start from a
specific web page and follow a particular set of links
to reach a destination. Before the trial, the experi-
menter demonstrated the specified sequence of links
to the participant by using the mouse and clicking at
the appropriate links. The participant was also pro-
vided with a sheet of paper for their reference that
listed the sequence of links that would lead them to
the target. The web site we used was a Computer
Science Department student guide and the task in-
volved following six links with the space between
each successive link including both horizontal and
vertical components.
The second task was map navigation (Map), in
which the participant was asked to navigate an on-
line map application from a starting view (showing
the entire USA) to get to a view showing a partic-
ular campus. The size of the map was 400x400
pixels, and the set of available navigation controls
surrounding the map included ten discrete zoom
level buttons, eight directional panning arrows, and
a click inside the map causing the map to be centered
and zoomed in by one level. Before the trial, a prac-
ticed experimenter demonstrated how to locate the
campus map starting from the USA view to ensure
that the participant was familiar with the geography.
For each task, the participants performed one trial
using the mouse, and one trial using a 4-class VJ.
The trials were presented to the participants in a
counterbalanced order. We recorded the completion
time for each trial, as well as the number of false
positives (system interprets a click when the user
did not make a click sound), missed recognitions
(the user makes a click sound but the system fails to
recognize it as a click), and user errors (whenever
the user clicks on an incorrect link). The recorded
trial times include the time used by all of the above
errors including recovery time.
After the completion of the quantitative evalu-
ation, the participants were given a questionnaire
which consisted of 14 questions related to the partic-
ipants’ perception of their experience using VJ such
as the degree of satisfaction, frustration, and embar-
rassment. The answers were encoded on a 7-point
Likert scale. We also included a space where the
participants could write in any comments, and an
informal post-experiment interview was performed to
solicit further feedback.

(Plot of task completion time in seconds, 0 to 100, by
task type (Link, Map) for the mouse and the Vocal Joystick.)
Figure 4: Task completion times

(Plot of the number of missed recognitions, 0 to 20, for
the Link and Map tasks by participant (gender, origin):
M, Korea; M, Northeast; M, Midwest; M, Northeast;
F, Mid-Atlantic; F, China; M, China.)
Figure 5: Missed recognitions by participant

5.3 Results
Figure 4 shows the task completion times for the Link
and Map tasks, Figure 5 shows the breakdown of
click errors by individual participant, and Figure 6
shows the average number of false positive and
missed recognition errors for each of the tasks.
There was no instance of user error in any trial.
Figure 7 shows the median of the responses to each of
the fourteen questionnaire questions (error bars in
each plot show ± standard error). In our measurement
of the task completion times, we considered
the VJ’s recognition error rate as a fixed factor, and
thus did not subtract the time spent during those
errors from the task completion time.

(Plot of the number of errors, 0 to 10, by task type
(Link, Map) for false positives and missed recognitions.)
Figure 6: Average number of click errors per task

(Plot of responses on a scale from 1.0 to 7.0, with ends
labeled "Strongly agree" and "Strongly disagree", for the
questionnaire items: Easy to learn; Easy to use; Difficult
to control; Frustrating; Fun; Tiring; Embarrassing;
Intuitive; Error prone; Self-conscious; Self-consciousness
decreased; Vowel sounds distinguishable; Map harder than
search; Motion matched intention.)
Figure 7: Questionnaire results

There were several other interesting observations
made throughout the study. We noticed
that the participants who had the least trouble with
missed recognitions for the clicking sound were
either female or had an Asian language background,
as shown in Figure 5. Our hypothesis regarding the
better performance by female participants is that the
original click sound was trained on the voice of one
of our female researchers. In future work we also plan
to determine how the characteristics of different
native language speakers influence VJ, and ultimately
to correct for any bias.
All but one user explicitly expressed their confu-
sion in distinguishing between the [ae] and [aa] vow-
els. Four of the seven participants independently
stated that their performance would probably have
been better if they had been able to practice longer,
and did not attribute their perceived suboptimal per-
formance to the quality of the VJ’s recognition sys-
tem. Several participants reported that they felt their
vocal cords were strained due to having to produce a
loud sound in order to get the cursor to move at the
desired speed. We suspect this is due either to ana-
log gain problems or to their adapted voice being too
loud, and therefore the system calibrating the nor-
mal speed to correspond to the loud voice. We have
since removed this problem by adjusting our adapta-
tion strategy to express preference for a quiet voice.
In summary, the results from our study suggest
that users without any prior experience were able
to perform basic mouse-based tasks using the Vocal
Joystick system with a relative slowdown of four to
nine times compared to a conventional mouse. We
anticipate that planned improvements in the
algorithms underlying the VJ engine (to improve
accuracy, user-independence, adaptation, and speed)
will further increase the VJ system’s viability and,
combined with practice, could improve VJ enough
that it becomes a reasonable alternative to a
standard mouse.
6 Related Work
Related voice-based interface studies include
(Igarashi and Hughes, 2001; Olwal and Feiner,
2005). Igarashi & Hughes presented a system where
non-verbal voice features control a mouse system —
their system requires a command-like discrete sound
to determine direction before initiating a movement
command, where pitch is used to control veloc-
ity. We have empirically found an energy-based
mapping for velocity (as used in our VJ system)
both more reliable (no pitch-tracking errors) and
intuitive. Olwal & Feiner’s system moves the mouse
only after recognizing entire words. de Mauro’s
“voice mouse”
(http://www.dii.unisi.it/~maggini/research/voice_mouse.html)
focuses on continuous cursor movements similar
to the VJ scenario; however, the voice mouse
only starts moving after the vocalization has been
completed, leading to long latencies, and it is not
easily portable to other applications. Lastly, the
commercial dictation program Dragon by ScanSoft
includes MouseGrid™ (Dra, 2004), which allows
discrete vocal commands to recursively 9-partition
the screen, thus achieving log-command access to a
particular screen point. A VJ system, by contrast,
uses continuous aspects of the voice, has change
latency (about 60 ms) not much greater than reaction
time, and allows the user to make instantaneous
directional changes using the voice (e.g., a user
can draw a “U” shape in one breath).
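The "log-command access" of a recursive 9-partition can be made concrete with a small worked sketch. It assumes each command selects one cell of a 3 x 3 grid, so each side of the selected region shrinks by a factor of three per command; the function name and the stopping criterion `target_px` are hypothetical and are not taken from the Dragon product.

```python
def grid_commands_needed(width_px, height_px, target_px=1):
    """Count 3x3 subdivisions until the cell is at most target_px per side."""
    commands = 0
    w, h = float(width_px), float(height_px)
    while w > target_px or h > target_px:
        w /= 3.0      # each command keeps one of three columns
        h /= 3.0      # and one of three rows
        commands += 1
    return commands
```

For the 1280 x 800 display used in the user study, this gives seven commands to isolate a single pixel and four to reach a region no larger than 30 pixels on a side. The count grows logarithmically with screen size, but every step is a separate discrete utterance, in contrast to the continuous control offered by VJ.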
7 Conclusions
We have presented new voice-based assistive tech-
nology for continuous control tasks and have
demonstrated an initial system implementation of
this concept. An initial user study using a group
of individuals from the non-target population con-
firmed the feasibility of this technology. We plan
next to further improve our system by evaluating a
number of novel pattern classification techniques to
increase accuracy and user-independence, and to in-
troduce additional vocal characteristics (possibilities
include vibrato, degree of nasality, rate of change
of any of the above as an independent parameter)
to increase the available simultaneous degrees of
freedom controllable via the voice. Moreover, we
plan to develop algorithms to decouple unintended
user correlations of these parameters, and to further
advance both our adaptation and acceleration algo-
rithms.
References
C. Bishop. 1995. Neural Networks for Pattern Recogni-
tion. Clarendon Press, Oxford.
2004. Dragon NaturallySpeaking, MouseGrid™, Scan-
Soft Inc.
D. Talkin. 1995. A robust algorithm for pitch track-
ing (RAPT). In W. B. Kleijn and K. K. Paliwal, editors,
Speech Coding and Synthesis, pp. 495–515, Amster-
dam. Elsevier Science.
X. Huang, A. Acero, and H.-W. Hon. 2001. Spoken Lan-
guage Processing: A Guide to Theory, Algorithm, and
System Development. Prentice Hall.
T. Igarashi and J. F. Hughes. 2001. Voice as sound: Us-
ing non-verbal voice input for interactive control. In
ACM UIST 2001, November.
J. Malkin, X. Li, and J. Bilmes. 2005. A graphical model
for formant tracking. In Proc. IEEE Intl. Conf. on
Acoustics, Speech, and Signal Processing.
A. Olwal and S. Feiner. 2005. Interaction techniques us-
ing prosodic feature of speech and audio localization.
In Proceedings of the 10th International Conference
on Intelligent User Interfaces, pp. 284–286.
D. Ward, A. F. Blackwell, and D. C. MacKay. 2000.
Dasher - a data entry interface using continuous ges-
tures and language models. In ACM UIST 2000.
X. Li, J. Malkin, and J. Bilmes. 2004. A graphical model
approach to pitch tracking. In Proc. Int. Conf. on Spo-
ken Language Processing.
X. Li, J. Bilmes, and J. Malkin. 2005. Maximum mar-
gin learning and adaptation of MLP classifiers. In 9th
European Conference on Speech Communication and
Technology (Eurospeech’05), Lisbon, Portugal, Sep-
tember.
