WIT: A Toolkit for Building Robust and Real-Time Spoken Dialogue 
Systems 
Mikio Nakano* Noboru Miyazaki, Norihito Yasuda, Akira Sugiyama, 
Jun-ichi Hirasawa, Kohji Dohsaka, Kiyoaki Aikawa 
NTT Corporation 
3-1 Morinosato-Wakamiya 
Atsugi, Kanagawa 243-0198, Japan 
E-mail: nakano@atom.brl.ntt.co.jp 
Abstract 
This paper describes WI'I; a toolkit 
for building spoken dialogue systems. 
WIT features an incremental under- 
standing mechanism that enables ro- 
bust utterance understanding and real- 
time responses. WIT's ability to com- 
pile domain-dependent system specifi- 
cations into internal knowledge sources 
makes building spoken dialogue sys- 
tems much easier than :it is from 
scratch. 
1 Introduction 
The recent great advances in speech and  
technologies have made it possible to build fully 
implemented spoken dialogue systems (Aust et 
al., 1995; Allen et al., 1996; Zue et al., 2000; 
Walker et al., 2000). One of the next research 
goals is to make these systems task-portable, that 
is, to simplify the process of porting to another 
task domain. 
To this end, several toolkits for building spo- 
ken dialogue systems have been developed (Bar- 
nett and Singh, 1997; Sasajima et al., 1999). 
One is the CSLU Toolkit (Sutton et al., 1998), 
which enables rapid prototyping of a spoken di- 
alogue system that incorporates a finite-state dia- 
logue model. It decreases the amount of the ef- 
fort required in building a spoken dialogue sys- 
tem in a user-defined task domain. However, it 
limits system functions; it is not easy to employ 
the advanced  processing techniques de- 
veloped in the realm of computational linguis- 
tics. Another is GALAXY-II (Seneffet al., 1998), 
*Mikio Nakano is currently a visiting scientist at MIT 
Laboratory for Computer Science. 
which enables modules in a dialogue system to 
communicate with each other. It consists of the 
hub and several servers, such as the speech recog- 
nition server and the natural  server, and 
the hub communicates with these servers. Al- 
though it requires more specifications than finite- 
state-model-based toolkits, it places less limita- 
tions on system functions. 
Our objective is to build robust and real-time 
spoken dialogue systems in different task do- 
mains. By robust we mean utterance understand- 
ing is robust enough to capture not only utter- 
ances including grammatical errors or self-repairs 
but also utterances that are not clearly segmented 
into sentences by pauses. Real time means the 
system can respond to the user in real time. The 
reason we focus on these features is that they are 
crucial to the usability of spoken dialogue sys- 
tems as well as to the accuracy of understand- 
ing and appropriateness of the content of the sys- 
tem utterance. Robust understanding allows the 
user to speak to the system in an unrestricted 
way. Responding in real time is important be- 
cause if a system response is delayed, the user 
might think that his/her utterance was not recog- 
nized by the system and make another utterance, 
making the dialogue disorderly. Systems having 
these features should have several modules that 
work in parallel, and each module needs some 
domain-dependent knowledge sources. Creat- 
ing and maintaining these knowledge sources re- 
quire much effort, thus a toolkit would be help- 
ful. Previous toolkits, however, do not allow us to 
achieve these features, or do not provide mecha- 
nisms that achieve these features without requir- 
ing excessive efforts by the developers. 
This paper presents WIT 1, which is a toolkit 
IWIT is an acronym of Workable spoken dialogue lnter- 
150 
for building spoken dialogue systems that inte- 
grate speech recognition,  understanding 
and generation, and speech output. WIT features 
an incremental understanding method (Nakano et 
al., 1999b) that makes it possible to build a robust 
and real-time system. In addition, WIT compiles 
domain-dependent system specifications into in- 
ternal knowledge sources so that building systems 
is easier. Although WIT requires more domain- 
dependent specifications than finite-state-model- 
based toolkits, WIT-based systems are capable 
of taking full advantage of  processing 
technology. WIT has been implemented and used 
to build several spoken dialogue systems. 
In what follows, we overview WIT, explain its 
architecture, domain-dependent system specifica- 
tions, and implementation, and then discuss its 
advantages and problems. 
2 Overview 
A WIT-based spoken dialogue system has four 
main modules: the speech recognition module, 
the  understanding module, the lan- 
guage generation module, and the speech out- 
put module. These modules exploit domain- 
dependent knowledge sources, which are auto- 
matically generated from the domain-dependent 
system specifications. The relationship among 
the modules, knowledge sources, and specifica- 
tions are depicted in Figure 1. 
WIT can also display and move a human-face- 
like animated agent, which is controlled by the 
speech output module, although this paper does 
not go into details because it focuses only on spo- 
ken dialogue. We also omit the GUI facilities pro- 
vided by WIT. 
3 Architecture of WIT-Based Spoken 
Dialogue Systems 
Here we explain how the modules in WIT work 
by exploiting domain-dependent knowledge and 
how they interact with each other. 
3.1 Speech Recognition 
The speech recognition module is a phoneme- 
HMM-based speaker-independent continuous 
speech recognizer that incrementally outputs 
face Toolldt. 
word hypotheses. As the recogn/fion engine, 
either VoiceRex, developed by NTI" (Noda et 
al., 1998), or HTK from Entropic Research can 
be used. Acoustic models for HTK is trained 
with the continuous speech database of the 
Acoustical Society of Japan (Kobayashi et al., 
1992). This recognizer incrementally outputs 
word hypotheses as soon as they are found in the 
best-scored path in the forward search (Hirasawa 
et al., 1998) using the ISTAR (Incremental 
Structure Transmitter And Receiver) protocol, 
which conveys word graph information as well as 
word hypotheses. This incremental output allows 
the  understanding module to process 
recognition results before the speech interval 
ends, and thus real-time responses are possible. 
This module continuously runs and outputs 
recognition results when it detects a speech 
interval. This enables the  generation 
module to react immediately to user interruptions 
while the system is speaking. 
The  model for speech recognition 
is a network (regular) grammar, and it allows 
each speech interval to be an arbitrary number 
of phrases. A phrase is a sequence of words, 
which is to be defined in a domain-dependent 
way. Sentences can be decomposed into a cou- 
ple of phrases. The reason we use a repeti- 
tion of phrases instead of a sentence grammar 
for the  model is that the speech recog- 
nition module of a robust spoken dialogue sys- 
tem sometimes has to recognize spontaneously 
spoken utterances, which include self-repairs and 
repetition. In Japanese, bunsetsu is appropriate 
for defining phrases. A bunsetsu consists of one 
content word and a number (possibly zero) of 
function words. In the meeting room reservation 
system we have developed, examples of defined 
phrases are bunsetsu to specify the room to be re- 
served and the time of the reservation and bun- 
setsu to express affirmation and negation. 
When the speech recognition module finds a 
phrase boundary, it sends the category of the 
phrase to the  understanding module, 
and this information is used in the parsing pro- 
cess. 
It is possible to hold multiple  mod- 
els and use any one of them when recogniz- 
ing a speech interval. The  models are 
151 
Semantic \[ I ~e I 
/ specifications \[ r 1 
I R~ae I /L..___..-- -'-'-~'- / Ph~e l /de~;,i~ions I /" "-. I de~i*~._l 
I Feature I L._.___..~.. -~-'-'-'~ , ( "~ "'. 
L____i----',.,'-... ~ ~ "-.. 
• Surface- I de.~ons \[:- ,, l ;~::--°Y\l Language II Language I ~Generafio~n_/ . I 
, ~, "., "~"----~__Z_--.--" M . -. i i . ) i ....... l,,_J generaraon I ~\ ,, ~ ~ unaerstanding I I generation IO t procedures I 
TM / .... ~ .... I 
\ I I I .... J 
I definitions I '\ \ I word I strings I hypothesea + 
I.-- I___~'~setof-"-L__~I Speech i I~,~ L iT-st of--/L/i.,istof I ~" t I ~angu.~ge I- -I ~tio. I I ou~u, r'--'\] pre-r~o.dedr'--\] pre-r~o~ed I 
I d~f~i~23~_l t.models .....~1 I module I I ~oam~ I~Peech m~._J , I L_ j 
user utterance system utterance 
domain-dependent: specification knowledge source module 
Figure 1: Architecture of WIT 
switched according to the requests from the lan- 
guage understanding module. In this way, the 
speech recognition success rate is increased by 
using the context of the dialogue. 
Although the current version of WIT does not 
exploit probabilistic  models, such mod- 
els can be incorporated without changing the ba- 
sic WIT architecture. 
3.2 Language Understanding 
The  understanding :module receives 
word hypotheses from the speech recognition 
module and incrementally understands the se- 
quence of the word hypotheses to update the di- 
alogue state, in which the resnlt of understand- 
ing and discourse information are represented 
by a frame (i.e., attribute-value pairs). The un- 
derstanding module utilizes ISSS (Incremental 
Significant-utterance Sequence Search) (Nakano 
et al., 1999b), which is an integrated parsing and 
discourse processing method. ISSS enables the 
incremental understanding of user utterances that 
are not segmented into sentences prior to pars- 
ing by incrementally finding the most plausible 
sequence of sentences (or significant utterances 
in the ISSS terms) out of the possible sentence 
sequences for the input word sequence. ISSS 
also makes it possible for the  generation 
module to respond in real time because it can out- 
put a partial result of understanding at any point 
in time. 
The domain-dependent knowledge used in this 
module consists of a unification-based lexicon 
and phrase structure rules. Disjunctive feature 
descriptions are also possible; WIT incorporates 
an efficient method for handling disjunctions 
(Nakano, 1991). When a phrase boundary is de- 
tected, the feature structure for a phrase is com- 
puted using some built-in rules from the feature 
structure rules for the words in the phrase. The 
phrase structure rules specify what kind of phrase 
sequences can be considered as sentences, and 
they also enable computing the semantic repre- 
sentation for found sentences. Two kinds of sen- 
tenees can be considered; domain-related ones 
that express the user's intention about the reser- 
152 
vafion and dialogue-related ones that express the 
user's attitude with respect to the progress of the 
dialogue, such as confirmation and denial. Con- 
sidering the meeting room reservation system, ex- 
amples of domain-related sentences are "I need to 
book Room 2 on Wednesday", "I need to book 
Room 2", and "Room 2" and dialogue-related 
ones are "yes", "no", and "Okay". 
The semantic representation for a sentence is 
a command for updatingthe dialogue state. The 
dialogue state is represented by a list of attribute- 
value pairs. For example, attributes used in the 
meeting room reservation system include task- 
related attributes, such as the date and time of 
the reservation, as well as attributes that represent 
discourse-related information, such as confirma- 
tion and grounding. 
3.3 Language Generation 
How the  generation module works 
varies depending on whether the user or system 
has the initiative of turn taking in the dialogue 2. 
Precisely speaking, the participant having the ini- 
tiative is the one the system assumes has it in the 
dialogue. 
The domain-dependent knowledge used by the 
 generation module is generation proce- 
dures, which consist of a set of dialogue-phase 
definitions. For each dialogue phase, an initial 
function, an action function, a time-out function, 
and a  model are assigned. In addition, 
phase definitions designate whether the user or 
the system has the initiative. In the phases in 
which the system has the initiative, only the ini- 
tial function and the  model are assigned. 
The meeting room reservation system, for exam- 
ple, has three phases: the phase in which the 
user tells the system his/her request, the phase in 
which the system confirms it, and the phase in 
which the system tells the user the result of the 
database access. In the first two phases, the user 
holds the initiative, and in the last phase, the sys- 
tern holds the initiative. 
Functions defined here decide what string 
should be spoken and send that string to the 
speech output module based on the current di- 
alogue state. They can also shift the dialogue 
2The notion of the initiative in this paper is different from 
that of the dialogue initiative of Chu-Carroll (2000). 
phase and change the holder of the initiative as 
well as change the dialogue state. When the dia- 
logue phase shifts, the  model foi" speech 
recognition is changed to get better speech recog- 
nition performance. Typically, the  gen- 
eration module is responsible for database access. 
The  generation module works as fol- 
lows. It first checks which dialogue participant 
has the initiative. If the initiative is held by the 
user, it waits until the user's speech interval ends 
or a duration of silence after the end of a system 
utterance is detected. The action function in the 
dialogue phase at that point in time is executed in 
the former case; the time-out function is executed 
in the latter case. Then it goes back to the initial 
stage. If the system holds the initiative, the mod- 
ule executes the initial function of the phase. In 
typical question-answer systems, the user has the 
initiative when asking questions and the system 
has it when answering. 
Since the  generation module works in 
parallel with the  understanding module, 
utterance generation is possible even while the 
system is listening to user utterances and that ut- 
terance understanding is possible even while it is 
speaking (Nakano et al., 1999a). Thus the system 
can respond immediately after user pauses when 
the user has the initiative. When the system holds 
the initiative, it can immediately react to an in- 
terruption by the user because user utterances are 
understood in an incremental way (Dohsaka and 
Shimazu, 1997). 
The time-out function is effective in moving 
the dialogue forward when the dialogue gets 
stuck for some reason. For example, the system 
may be able to repeat the same question with an- 
other expression and may also be able to ask the 
user a more specific question. 
3.4 Speech Output 
The speech output module produces speech ac- 
cording to the requests from the  gener- 
ation module by using the correspondence table 
between strings and pre-recorded speech data. It 
also notifies the  generation module that 
speech output has finished so that the  
generation module can take into account the tim- 
ing of the end of system utterance. The meeting 
room reservation system uses speech files of short 
153 
phrases. 
4 Building Spoken Dialo~te Systems 
with WIT 
4.1 Domain-Dependent System 
Specifications 
Spoken dialogue systems can be built with WIT 
by preparing several domain-dependent specifica- 
tions. Below we explain the specifications. 
Feature Definitions: Feature definitions spec- 
ify the set of features used in the grammar for lan- 
guage understanding. They also specify whether 
each feature is a head feature or a foot feature 
(Pollard and Sag, 1994). This information is used 
when constructing feature structures for phrases 
in a built-in process. 
The following is an example of a feature defini- 
tion. Here we use examples from the specification 
of the meeting room reservation system. 
(case head) 
It means that the case feature is used and it is a 
head feature 3. 
Lexieal Descriptions: Lexical descriptions 
specify both pronunciations and grammatical 
features for words. Below is an example lexical 
item for the word 1-gatsu (January). 
(l-gatsu ichigatsu month nil i) 
The first three elements are the identifier, the pro- 
nunciation, and the grammatical category of the 
word. The remaining two elements are the case 
and semantic feature values. 
Phrase Definitions: Phrase definitions specify 
what kind of word sequence can be recognized 
as a phrase. Each definition is a pair compris- 
ing a phrase category name and a network of 
word categories. In the example below, month- 
phrase is the phrase category name and the re- 
maining part is the network of word categories. 
opt means an option and or means a disjunc- 
tion. For instance, a word sequence that con- 
sists of a word in the month category, such as 1- 
gatsu (January), and a word in the adraoninal- 
particle category, such as no (of), forms a 
phrase in the month-phrase category. 
3In this section, we use examples of different description 
from the actual ones for simplicity. Actual specifications are 
written in part in Japanese. 
(month-phrase 
(month 
(opt 
(or 
expression-following-subject 
(admoninal-particle 
(opt 
sentence-final-particle)))))) 
Network Definitions: Network definitions 
specify what kind of phrases can be included in 
each  model. Each definition is a pair 
comprising a network name and a set of phrase 
category names. 
Semantic-Frame Specifications: The result of 
understanding and dialogue history can be stored 
in the dialogue state, which is represented by a 
flat frame structure, i.e., a set of attribute-value 
pairs. Semantic-frame specifications define the 
attributes used in the frame. The meeting room 
reservation system uses task-related attributes. 
Two are start and end, which represent the 
user's intention about the start and end times of 
the reservation for some meeting room. It also 
has attributes that represent discourse informa- 
tion. One is confirmed, whose value indicates 
whether if the system has already made an utter- 
ance to confirm the content of the task-related at- 
tributes. 
Rule Definitions: Each rule has one of the fol- 
lowing two forms. 
((rule name) 
(child feature structure) 
•.. (child feature structure) 
=> (mother feature structu_e) 
(priority increase) ) 
((role name) 
(child feature structure) 
•.. (child feature structure) 
=> (flame operation command) 
(priority increase) ) 
These roles are similar to DCG (Pereira and War- 
ren, 1980) rules; they can include logical vari- 
ables and these variables can be bound when 
these rules are applied. It is possible to add to the 
rules constraints that stipulate relationships that 
must hold among variables (Nakano, 199 I), but 
we do not explain these constraints in detail in this 
154 
paper. The priorities are used for disambiguat- 
ing interpretation in the incremental understand- 
ing method (Nakano et al., 1999b). 
When the command on the right-hand side of 
the arrow is a frame operation command, phrases 
to which this rule can be applied can be consid- 
ered a sentence, and the sentence's semantic rep- 
resentation is the command for updating the dia- 
logue state. The command is one of the follow- 
ing: 
• A command to set the value of an attribute 
of the frame, 
• A command to increase the priority, 
Conditional commands (If-then-else type 
command, the condition being whether the 
value of an attribute of the flame is or is not 
equal to a specified value, or a conjunction 
or disjunction of the above condition), or 
• A list of commands to be sequentially exe- 
cuted. 
Thanks to conditional commands, it is possible 
to represent the semantics of sentences context- 
dependently. 
The following rule is an example. 
(start-end-times-command 
(time-phrase :from *start) 
(time-phrase (:or :to nil) *end) 
=> (command (set :start *start) 
(set :end *end))) 
The name of this rule is start-end-times- 
command. The second and third elements 
are child feature structures. In these elements, 
time-phrase is a phrase category, : from and 
( : or : to nil ) are case feature values, and 
*start and *end are semantic feature val- 
ues. Here :or means a disjunction, and sym- 
bols starting with an asterisk are variables. The 
right-hand side of the arrow is a command to up- 
date the frame. The second element of the com- 
mand, (set :start *start), changes the 
:start atttribute value of the frame to the in- 
stance of *start, which should be bound when 
applying this rule to the child feature structures. 
Phase Definitions: Each phase definition con- 
sists of a phase name, a network name, an ini- 
tiative holder specification, an initial function, an 
action function, a maximum silence duration, and 
a time-out function. The network name is the 
identifier of the  model for the speech 
recognition. The maximum silence duration spec- 
ifies how long the generation module should wait 
until the time-out function is invoked. 
Below is an example of a phase definition. 
The first element request is the name of this 
phase, "frar_request" is the name of the 
network, and move-to-reques t-phase and 
request-phase-action are the names of 
the initial and action functions. In this phase, 
the maximum silence duration is ten seconds and 
the name of the time-out function is request- 
phas e- t imeou t. 
(request "fmr_request" 
move- to-reques t -phase 
request-phase-action 
10.0 
request-phase- t imeout ) 
For the definitions of these functions, WIT pro- 
vides functions for accessing the dialogue state, 
sending a request to speak to the speech out- 
put module, generating strings to be spoken us- 
ing surface generation templates, shifting the di- 
alogue phase, taking and releasing the initiative, 
and so on. Functions are defined in terms of the 
Common Lisp program. 
Surface-generation Templates: Surface- 
generation templates are used by the surface 
generation library function, which converts 
a list-structured semantic representation to a 
sequence of strings. Each string can be spoken, 
i.e., it is in the list of pre-recorded speech files. 
For example, let us consider the conversion 
of the semantic representation (date (date- 
expression 3 15) ) to strings using the fol- 
lowing template. 
( (date 
(date-expression *month *day)) 
( (*month gatsu) (*day nichi) ) ) 
The surface generation library function matches 
the input semantic representation with the first el- 
ement of the template and checks if a sequences 
155 
of strings appear in the speech file list. It re- 
turns (' '3gagsul5nichi'') (March 15th) 
if the string "3gatsul5nichi" is in the list of 
pre-recorded speech files, and otherwise, returns 
( ' ' 3gatsu .... 15nichi' ' ) when these 
strings are in the list. 
List of Pre-recorded Speech Files: The list of 
pre-recorded speech files should show the corre- 
spondence between strings and speech files to be 
played by the speech output module. 
4.2 Compiling System Specifications 
From the specifications explained above, domain- 
dependent knowledge sources are created as indi- 
cated by the dashed arrows in Figure 1. When cre- 
ating the knowledge sources, WIT checks for sev- 
eral kinds of consistency. For example, the set of 
word categories appearing in the lexicon and the 
set of word categories appearing in phrase deft- 
nifions are compared. This makes it easy to find 
errors in the domain specifications. 
5 Implementation 
WIT has been implemented in Common Lisp and 
C on UNIX, and we have built several experi- 
mental and demonstration dialogue systems using 
it, including a meeting room reservation system 
(Nakano et al., 1999b), a video-recording pro- 
gramming system, a schedule management sys- 
tem (Nakano et al., 1999a), and a weather in- 
formation system (Dohsaka et al., 2000). The 
meeting room reservation system has vocabulary 
of about 140 words, around 40 phrase structure 
rules, nine attributes in the semantic frame, and 
around 100 speech files. A sample dialogue be- 
tween this system and a naive user is shown 
in Figure 2. This system employs HTK as the 
speech recognition engine. The weather informa- 
tion system can answer the user's questions about 
weather forecasts in Japan. The vocabulary size 
is around 500, and the number of phrase structure 
rules is 31. The number of attributes in the se- 
mantic flame is 11, and the number of the files of 
the pre-recorded speech is about 13,000. 
6 Discussion 
As explained above, the architecture of WIT al- 
lows us to develop a system that can use utter- 
ances that are not clearly segmented into sen- 
tences by pauses and respond in real time. Below 
we discuss other advantages and remaining prob- 
lems. 
6.1 Descriptive Power 
Whereas previous finite-state-model-based tool- 
kits place many severe restrictions on domain de- 
scriptions, WIT has enough descriptive power to 
build a variety of dialogue systems. Although the 
dialogue state is represented by a simple attribute- 
value matrix, since there is no limitation on the 
number of attributes, it can hold more compli- 
cated information. For example, it is possible to 
represent a discourse stack whose depth is lim- 
ited. Recording some dialogue history is also 
possible. Since the  understanding mod- 
ule utilizes unification, a wide variety of lin- 
guistic phenomena can be covered. For exam- 
ple, speech repairs, particle omission, and fillers 
can be dealt with in the framework of unifica- 
tion grammar (Nakano et al., 1994; Nakano and 
Shimazu, 1999). The  generation mod- 
ule features Common Lisp functions, so there is 
no limitation on the description. Some of the 
systems we have developed feature a generation 
method based on hierarchical planning (Dohsaka 
and Shirnazu, 1997). It is also possible to build a 
simple finite-state-model-based dialogue system 
using WIT. States can be represented by dialogue 
phases in WIT. 
6.2 Consistency 
In an agglutinative  such as Japanese, 
there is no established definition of words, so dia- 
logue system developers must define words. This 
sometimes causes a problem in that the defini- 
tion of word, that is, the word boundaries, in the 
speech recognition module are different from that 
in the  understanding module. In WIT, 
however, since the common lexicon is used in 
both the speech recognition module and  
understanding module, the consistency between 
them is maintained. 
6-3 Avoiding Information Loss 
In ordinary spoken  systems, the speech 
recognition module sends just a word hypoth- 
esis to the  processing module, which 
156 
speaker start end utterance 
time (s) time (s) 
system: 614.53 615.93 
user: 616.38 618.29 
system: 619.97 620.13 
user: 622.65 624.08 
system: 625.68 625.91 
user: 626.65 627.78 
system: 629.25 629.55 
user: 629.91 631.67 
system: 633.29 633.57 
user: 634.95 636.00 
system: 637.50 645.43 
user: 645.74 646.04 
system: 647.05 648.20 
Figure 2: 
donoy6na goy6ken desh6 ka (how may I help you?) 
kaigishitsu o yoyaku shitai ndesu ga (I'd like to make a reserva- 
tion for a meeting room) 
hai (uh-huh) 
san-gatsujfini-nichi (on March 12th) 
hal (uh-huh) 
jayo-ji kara (from 14:00) 
hai (uh-huh) 
jashichi-ji sanjup-pun made (to 17:30) 
hai (uh-huh) 
dai-kaigishitsu (the large meeting room) 
san-gatsu jani-nichi, j~yo-ji kara, jashichi-ji sanjup-pun made, 
dai-kaigishitsu toyfi koto de yoroshf deshrka (on March 12th, 
from 14:00 to 17:30, the large meeting room, is that right?) " 
hai (yes) 
kashikomarimashitd (all right) 
An example dialogue of an example system 
must disambiguate word meaning and find phrase 
boundaries by parsing. In contrast, the speech 
recognition module in WIT sends not only words 
but also word categories, phrase boundaries, and 
phrase categories. This leads to less expensive 
and better  understanding. 
6.4 Problems and Limitations 
Several problems remain with WIT. One of the 
most significant is that the system developer must 
write  generation functions. If the gen- 
eration functions employ sophisticated dialogue 
strategies, the system can perform complicated 
dialogues that are not just question answering. 
WIT, however, does not provide task-independent 
facilities that make it easier to employ such dia- 
logue strategies. 
There have been several efforts aimed at de- 
veloping a domain-independent method for gen- 
erating responses from a frame representation of 
user requests (Bobrow et al., 1977; Chu-CarroU, 
1999). Incorporating such techniques would deo 
crease the system developer workload. However, 
there has been no work on domain-independent 
response generation for robust spoken dialogue 
systems that can deal with utterances that might 
include pauses in the middle of a sentence, which 
WIT handles well. Therefore incorporating those 
techniques remains as a future work. 
Another limitation is that WIT cannot deal with 
multiple speech recognition candidates such as 
those in an N-best list. Extending WIT to deal 
with multiple recognition results would improve 
the performance of the whole system. The ISSS 
preference mechanism is expected to play a role 
in choosing the best recognition result. 
7 Conclusion 
This paper described WIT, a toolkit for build- 
ing spoken dialogue systems. Although it re- 
quires more system specifications than previous 
finite-state-model-based toolkits, it enables one 
to easily construct real-time, robust spoken dia- 
logue systems that incorporates advanced compu- 
tational linguistics technologies. 
Acknowledgements 
The authors thank Drs. Ken'ichiro Ishii, Nori- 
hiro Hagita, and Takeshi Kawabata for their sup- 
port of this research. Thanks also go to Tetsuya 
Kubota, Ryoko Kima, and the members of the 
Dialogue Understanding Research Group. We 
used the speech recognition engine VoiceRex de- 
veloped by NTT Cyber Space Laboratories and 
thank those who helped us use it. Comments by 
157 
the anonymous reviewers were of' great help. 

References 
James F. Allen, Bradford W. Miller, Eric K. Ringger, 
and Teresa Sikorski. 1996. A robust system for nat- 
ural spoken dialogue. In Proceedings of the 34th 
Annual Meeting of the Association for Computa- 
tional Linguistics (A CL-96), pages 62-70. 
Harald Aust, Martin Oerder, Frank Seide, and Volker 
Steinbiss. 1995. The Philips automatic train 
timetable information system. Speech Communi- 
cation, 17:249--262. 
James Barnett and Mona Singh. 1997. Designing 
a portable spoken  system. In Elisabeth 
Maier, Marion Mast, and Susann LuperFoy, editors, 
Dialogue Processing in Spoken Language Systems, 
pages 156--170. Springer-Vedag. 
Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, 
Dona!d A. Norman, Henry Thompson, and Terry 
Winograd. 1977. GUS, a frame driven dialog sys- 
tem. Arnficial Intelligence, 8:155-173. 
Jennifer Chu-Carroll. 1999. Fo:rrn-based reason- 
ing for mixed-initiative dialogue management in 
information-query systems. In Proceedings of the 
Sixth European Conference on Speech Communica- 
tion and Technology (Eurospeech-99) , pages 1519- 
1522. 
Junnifer Chu-Carroll. 2000. MIMIC: An adaptive 
mixed initiative spoken dialogue system for infor- 
mation queries. In Proceedings of the 6th Con- 
f~rence on Applied Natural Language Processing 
(ANLP-O0), pages 97-104. 
Kohji Dohsaka and Akira Shimazu. 1997. System ar- 
chitecture for spoken utterance production in col- 
laborative dialogue. In Working Notes of IJCAI 
1997 Workshop on Collaboration, Cooperation and 
Conflict in Dialogue Systems. 
Kohji Dohsaka, Norihito Yasuda, Noboru Miyazaki, 
Mikio Nakano, and Kiyoaki AJkawa. 2000. An ef- 
ficient dialogue control method under system's lim- 
ited knowledge. In Proceedings of the Sixth Inter- 
national Conference on Spoken Language Process- 
ing (ICSLP-O0). 
Jun-ichi Hirasawa, Noboru Miyazaki, Mikio Nakano, 
and Takeshi Kawabata. 1998. Implementation 
of coordinative nodding behavior on spoken dia- 
logue systems. In Proceedings of the Fgth Interna- 
tional Conference on Spoken Language Processing 
(1CSLP-98), pages 2347-2350. 
Tetsunod Kobayashi, Shuichi Itahashi, Satoru 
Hayamizu, and Toshiyuki Takezawa. 1992. Asj 
continuous speech corpus for research. The journal 
of th e Acoustical Society o f Japan, 48(12): 888-893. 
Mikio Nakano and Akira Shimazu. 1999. Pars- 
ing utterances including self-repairs. In Yorick 
Wilks, editor, Machine Conversations, pages 99- 
112. Kluwer Academic Publishers. 
Mikio Nakano, Aldra Shimazu, and Kiyoshi Kogure. 
1994. A grammar and a parser for spontaneous 
speech. In Proceedings of the 15th Interna- 
tional Conference on Computational Linguistics 
(COLING-94), pages 1014-1020. 
Mildo Nakano, Kohji Dohsaka, Noboru Miyazald, 
Inn ichi Hirasawa, Masafiami Tamoto, Masahito 
Kawarnon, Akira Sugiyama, and Takeshi Kawa- 
bata. 1999a. Handling rich turn-taking in spoken 
dialogue systems. In Proceedings of the Sixth Eu- 
ropean Conference on Speech Communication and 
Technology (Eurospeech-99), pages 1167-1170. 
Mikio Nakano, Noboru Miyazaki, Jun-ichi Hirasawa, 
Kohji Dohsaka, and Takeshi Kawabata. 1999b. 
Understanding unsegmented user utterances in real- 
time spoken dialogue systems. In Proceedings of 
the 37th Annual Meeting of the Association for 
Computational Linguistics (ACL-99), pages 200-- 
207. 
Mikio Nakano. 1991. Constraint projection: An ef- 
ficient treatment of disjunctive feature descriptions. 
In Proceedings of the 29th Annual Meeting of the 
Association for Computational Linguistics (ACL- 
90, pages 307-314. 
Yoshiaki Noda, Yoshikazu Yamaguchi, Tomokazu 
Yamada, Akihiro Imamura, Satoshi Takahashi, 
Tomoko Matsui, and Kiyoaki Aikawa. 1998. The 
development of speech recognition engine REX. In 
Proceedings of the 1998 1EICE General Confer- 
ence D-14-9, page 220. (in Japanese). 
Fernando C. N. Pereira and David H. D. Warren. 
1980. Definite clause grammars for  
analysis--a survey of the formalism and a compar- 
ison with augmented transition networks. Artificial 
Intelligence, 13:231-278. 
Carl J. Pollard and Ivan A. Sag. 1994. Head-Driven 
Phrase Structure Grammar. CSLI, Stanford. 
Munehiko Sasajima, Yakehide Yano, and Yasuyuki 
Kono. 1999. EUROPA: A genetic framework for 
developing spoken dialogue systems. In Proceed- 
ings of the Sixth European Conference on Speech 
Communication and Technology (Eurospeech-99), 
pages 1163--1166. 
Stephanie Seneff, Ed Hurley, Raymond Lau, Chris- 
fine Pao, Philipp Sehmid, and Victor Zue. 1998. 
GALAXY-H: A reference architecture for conver- 
sational system development. In Proceedings of 
the Fifth International Con l~rence on Spoken Lan- 
guage Processing (ICSLP-98). 
Stephen Sutton, Ronaid A. Cole, Jacques de Villiers, 
Johan SchMkwyk, Pieter Vermeulen, Michael W. 
Macon, Yonghong Yah, Edward Kaiser, Brian Run- 
die, K.haldoun Shobaki, Paul Hosom, Alex Kain, 
Johan Wouters, Dominic W. Massaro, and Michael 
Cohen. 1998. Universal speech tools: The 
CSLU toolkit. In Proceedings of the Fifth Interna- 
tional Conference on Spoken Language Processing 
(1CSLP-98), pages 3221-3224. 
Marilyn Walker, Irene Langkilde, Jerry Wright, Allen 
Gorin, and Diane Litman. 2000. Learning to pre- 
dict problematic situations in a spoken dialogue 
system: Experiments with how may I help you? In 
Proceedings of the First Meeting of the North Amer- 
ican Chapter of the Association for Computational 
Linguistics (NAA CL-O0), pages 210--217. 
Victor Zue, Stephanie Seneff, James Glass, Joseph Po- 
lifroni, Christine Pao, Timothy J. Hazen, and Lee 
He~erington. 2000. Jupiter: A telephone-based 
conversational interface for weather information. 
1EEE Transactions on Speech and Audio Process- 
ing, 8(1):85-96. 
