Collection of Spontaneous Speech for the ATIS Domain and 
Comparative Analyses of Data Collected at MIT and TI 1 
Joseph Polifroni, Stephanie Seneff, and Victor W. Zue 
Spoken Language Systems Group 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 02139 
ABSTRACT 
As part of our development of a spoken language system in 
the ATIS domain, we have begun a smail-scale effort in collecting 
spontaneous speech data. Our procedure differs from the one used 
at 'i~xas Instruments (TI) in many respects, the most important 
being the reliance on an existing system, rather than a wizard, to 
participate in data collection. Over the past few months, we have 
collected over 3,600 spontaneously generated sentences from 100 
subjects. This paper documents our data collection process, and 
makes some comparative anaiyses of our data with those collected 
at TI. The advantages as well as disadvantages of this method of 
data collection will be discussed. 
INTRODUCTION 
AWLS, or Air Travel Information Service, is the desig- 
nated common task of the DARPA Spoken Language Sys- 
tems (SLS) Program \[8\]. As part of our development of a 
spoken language system in this domain, we have recently be- 
gun a small-scale effort in collecting spontaneous speech data. 
This effort is motivated partly by our desire to contribute to 
the data collection efforts already underway elsewhere \[4,2,1\], 
so that more data can be available to the community sooner 
for system development, training, and evaluation. In addi- 
tion, we were interested in exploring various alternatives of 
the data collection procedure itself. It is our belief that we as 
a community do not fully understand how goal-directed spon- 
taneous speech should best be collected. This is not surpris- 
ing, since we have little experience in this area. Nevertheless, 
data collection is an impori~ant area of research for the SLS 
Program, since the type of data that we collect will directly 
affect the capabilities of systems that we develop, and the 
evaluations that we can perform. Therefore, we thought it 
would be appropriate to experiment with different aspects of 
this process. There is evidence that even very small changes 
in the procedure, such as the instructions to the subject, can 
drastically alter the nature of the data collected \[l\]. 
The paper is organized as follows. We will first discuss 
1This research was supported by DARPA under Contract N00014- 
89-.11-1332, monitored through the Office of Naval Research. 
some methodological considerations that led to the particular 
collection procedure that we adopted. We will then briefly 
describe the procedure itself. This will be followed by some 
comparative analyses of a subset of the data that we have 
collected with those collected at Texas Instruments (TI). Im- 
plications of our findings will be discussed. 
DATA COLLECTION 
As is the case with other efforts \[4~2,1\], our data are col- 
lected under simulation. Nevertheless, we wanted the simula- 
tion to reflect as much as possible the system that we are de- 
veloping. In this section, we will briefly describe some deslgn 
issues and document the actual collection process. Further 
details can be found elsewhere \[7\]. 
Methodological Considerations 
While many years may pass before we are able to build 
systems with capabilities approaching those of humans, we 
believe strongly that it should soon be possible to develop 
functioning systems with limited capabilities. The successful 
development of such systems will partly depend on our ability 
to train subjects to stay within the restricted domain of the 
system. Therefore, we should try to collect data intention- 
ally restricting the user in ways that closely match system 
capability. In this section we will describe some aspects of 
our data collection paradigm that support this viewpoint. 
Wizard vs. System By far the most important difference 
between the data collection procedures at TI and MIT is the 
way system simulation is conducted during data collection. 
TI made use of a "wizard" paradigm, in which a highly skilled 
experimenter interprets what was spoken, converts it into a 
form that enables database access, and produces an answer 
for the subject \[4,2\]. Based on our previous positive expe- 
rience with collecting spontaneous speech for a different do- 
main \[10\], we decided to explore an alternative paradigm from 
the one used at TI, in which we make use of the system under 
development to do most of the work. That is, prior to the 
beginning of data collection, the natural language component 
is developed to the point where it has reasonable coverage o-r 
360 
Subject: Show flights from Philadelphia to Denver serving lunch or dinner on February second and also show their fares. 
ATIS Response: 
.......................................................................................................................... 
(RAW DISPLAY) (PROCESSED DISPLAY) 
These are the flights from Philadelphia to Denver serving 
lunch and dinner on Friday February 2. 
AIRLINE FLIGHT AIRLINE NUMBER FROM TO DEPARTURE ARRIVAL STOPS MEALS 
CODE NUMBER SERVED 
DL 1083 DELTA 1083 PHL DEN 12:30 P.M. 4:15 P.M. 1 LUNCH SNACK 
UA 355 UNITED 355 PHL DEN 5:53 P.M. 7:55 P.M. 0 DINNER 
CO 1631 CONT 1631 PHL DEN 6:00 P.M. 8:37 P.M. 0 DINNER 
.......................................................................................................................... 
(RAW DISPLAY) (PROCESSED DISPLAY) 
FROM TO DEPARTURE ARRIVAL STOPS MEAL 
AIRPORT AIRPORT TIME TIME CODE 
PHL DEN 1230 1615 1 LS 
PHL DEN 1753 1955 0 D 
PBL DEN 1800 2037 0 D 
AIRLINE FLIGHT RESTRICT ONE WAY END TRIP FARE 
CODE NUMBER CODE COST COST CLASS 
CO 1631 470 940 Y 
CO 1631 706 1412 Y 
DL 1083 420 840 Y 
DL 1083 630 1260 F 
UA 355 470 940 Y 
UA 355 706 1412 F 
These are the fares for the flights from Philadelphia to Denver 
serving lunch and dinner on Friday February 2. 
AIRLINE NUMBER RESTRICTION ONE WAY ROUND TRIP FARE CLASS 
CONT 1631 NONE $470,00 $940.00 Y: COACH CLASS 
CONT 1631 NONE $706.00 $1412.00 F: FIRST CLASS 
DELTA 1083 NONE $420,00 $840.00 Y: COACH CLASS 
DELTA 1083 NOHE $630.00 $1260.00 F: FIRST CLASS 
UNITED 355 NONE $470.00 $940.00 Y: COACH CLASS 
UNITED 355 NONE $706.00 $1412.00 F: FIRST CLASS 
Figure 1: Comparison of the displays as returned from the OAG database (left panels) and those presented to the subject (right panels) 
for a query. 
the possible queries. In addition, the system must be able to 
automatically translate the text into a query to the database, 
and return the information to the subject. Once such a sys- 
tem is available, data collection is accomplished by having 
the experimenter, a fast and accurate typist, type verbatim 
to the system what was spoken, after removing spontaneous 
speech disfluencies. The actual interpretation and response 
generation is accomplished by the system without further hu- 
man intervention. If the sentence cannot be understood by 
the system, an error message is produced to help the subject 
make appropriate modifications. 
Another feature of our paradigm is that the underlying 
system can be improved incrementally using the data col- 
lected thus far. The resulting expansion in system capabil- 
ities permit us to accommodate more complex sentences as 
well as those that previously failed. 
Displays One of the considerations that led to the se- 
lection of ATm as the common task is the realization that, 
since most people have planned air travel at one time or an- 
other, there will be no shortage of subjects familiar with the 
task. Since the average traveller is not likely to be knowledge- 
able of the format and display of the Official Airline Guide 
(OAG), we have translated many of the cryptic symbols and 
abbreviations that OAG uses into easily recognizable words. 
We believe that this change has the positive effect of helping 
the subject focus on the travel planning aspect, and not be 
confused by the cryptic displays that are intended for more 
experienced users. In fact, we try to keep the displayed infor- 
mation at a minimum in order to encourage verbal problem 
solving. In general, we only display the airline, flight number, 
origination and destination cities, departure and arrival time, 
and the number of stops. Additional columns of information 
are included only when specifically requested. 
Figure 1 illustrates the difference between the raw dis- 
plays returned from the OAG database and the ones that we 
present to the subjects by applying some post-processing to 
the raw display. The query is "Show flights from Philadel- 
phia to Denver serving lunch or dinner on February second 
and also show their fares." Note that the airlines, meal codes, 
and fare codes in the processed displays (shown in the right- 
hand panels) have all been translated into words as much as 
possible while keeping the displays manageable on a screen, e 
The military-time displays for departure and arrival times 
have also been converted to more familiar forms to facilitate 
interpretation. Furthermore, under the TI data collection 
scheme the answer is assembled as one large table. However, 
we break it up into two answers, one for the flights and one 
for the fares. 
System Feedback Our system provides explicit feedback 
to the subject in the form of text and synthetic speech, para- 
phrasing its understanding of the sentence. This feature is 
illustrated in Figure 1 in the right-hand panels, immediately 
above the display tables. By providing confirmation to the 
subject of what was understood, the system greatly reduces 
the confusion and frustration that may arise later on in the 
dialogue caused by an earlier error in the system's responses. 
In addition, the generation of a verbal response implicitly en- 
courages the notion of human/machine interactive dialogue. 
Interactive Dialogue We believe that, for the ATIS sys- 
tem to be truly useful, the user must be able to carry out an 
2Some of the headings in the raw display have been reformatted so 
that, the table will stay within the width of this page. 
361 
interactive dialogue with the system, in the same way that a 
traveller would with a travel agent. Our data collection pro- 
cedure therefore encourages natural dialogue by incorporat- 
ing some primitive discourse capabilities, allowing subjects to 
make indirect as well as direct anaphoric references, and frag- 
mentary responses where appropriate. In some sessions, we 
even use a version of the system that plays an active role in 
guiding the subject through flight reservations. Details of the 
interactive dialogue capabilities of our system are described 
in a companion paper \[9\]. 
Data Collection Process 
The data are collected in an office environment where the 
ambient noise is approximately 60 dB SPL, measured on the 
C scale. The subject sits in front of a display monitor, with 
a window slaved to a Lisp process running on the experi- 
menter's Sun workstation located in a nearby room. The 
experimenter's typing is hidden from the subject to avoid 
unnecessary distractions. A push-to-talk mechanism is used 
to collect the data. The subject is instructed to hold down 
a mouse button while talking, and release the button when 
done. The resulting speech is captured both by a Sennheiser 
HMD-224 noise-cancelling microphone and a Crown PZM 
desk-top microphone, digitized simultaneously. 
Prior to the session, the subject is given a one-page de- 
scription of the task, a one-page summary of the system's 
knowledge base, and three sets of scenarios \[7\]. The first set 
contains simple tasks such as finding the earliest flight from 
one city to another that serves a meal, and is intended as a 
"warm-up" exercise. The second set involves more complex 
tasks, and includes the official ATIS scenarios. Finally, the 
subject is asked to make up a scenario and attempt to solve 
it. The subjects are instructed to choose a pre-determined 
number of scenarios from each category. In addition, they 
are asked to clearly delineate each scenario with the com- 
mands "begin scenario x" and "end scenario x," where x is a 
number, so that we can keep track of discourse utilization. A 
typical session lasts about 40 minutes, including initial task 
familiarization and final questionnaire. 
For several initial data collection sessions, the first two 
authors took turns serving as the experimenter. Once we 
began daily sessions, however, it was possible to hire a part- 
time helper to serve as the scheduler, experimenter, and tran- 
scriber. The experimenter can hear everything that the sub- 
ject says and can communicate with the subject via a two-way 
microphone/speaker hook-up. However, the experimenter 
rarely communicates with the subject once the experiment 
is under way. The digitized speech is played back to the ex- 
perimenter, allowing him/her to confirm that the recording 
was successful. The voice input during the session, minus 
disfluencies, is typed verbatim to ATIS by the experimenter, 
and saved a.utomatically in a computer log. The system re- 
sponse is generated automatically from this text, and is also 
recorded into the log. The system's response typically takes 
less than 10 seconds after the text has been entered. At a 
later time, the experimenter listens again to each digitized 
sentence and inserts false starts and non-speech events into 
the orthography to form a detailed orthographic transcrip- 
tion, following the conventions described in \[6\]. 
There are basically three ways that the system can fail, 
each of which provides a distinct error message. If the sen- 
tence contains unknown words, then the system identifies to 
the subject the words that it doesn't know. If it knows all the 
words, but the sentence fails to parse, then it identifies the 
point in the sentence where the parse failed. Finally, it may 
parse the sentence but fail to produce a response due to, for 
instance, an incorrect database query. In that case, it simply 
says, "I ran into an error trying to evaluate this sentence." 
Our long-term goal is to make error messages sufficiently in- 
formative that the subject knows how to best rephrase the 
query. By examining how subjects react to the various kinds 
of error messages, we hope to improve the overall usability of 
the system. Figure 2 illustrates the data collection process 
with a simple dialogue between a subject and ATIS. 
SCENARIO: "Find the earliest (or latest) flight from one 
city to another that serves a meal of your choice." 
Subject: BEGIN SCENARIO TWO 
Subject: SHOW ME THE FLIGHTS FROM BOSTON TO ATLANTA 
Atis: These are the flights from Boston to Atlanta. 
AIRLINE NUMBER FROM TO DEPARTURE ARRIVAL STOPS 
DELTA 225 BOS ATL 6:30 A.M 9:13 A.M. 0 
EASTERN 147 BOS ATL 7:00 A.M. 9:46 A.M. 0 
DELTA 219 BOS ATL 8:15 A.M. 11:00 A.M. 0 
EASTERN 643 BOS ATL 8:32 A.M. 11:26 A.M. 0 
DELTA 547 BOS ATL 9:55 A.M. 12:35 P.M. 0 
EASTERN 145 EOS ATL 10:35 A.M. 1:27 P.M. 0 
DELTA 708 BOS ATL 11:56 A.M. 2:35 P.M. 0 
EASTERN 143 BOS ATL 12:40 P.M. 3:25 P.M. 0 
EASTERN 646 BOS ATL 2:21 P.M. 5:09 P.M. 0 
DELTA 975 EOS ATL 3:20 P.M. 6:04 P.M. 0 
EASTERN 141 BOS ATL 4:05 P.M. 7:00 P.M. 0 
EASTERN 149 BOS ATL 8:58 P.M. 8:49 P.M. 0 
DELTA 323 BOS ATL 6:50 P.M. 9:28 P.M. 0 
DELTA 54 EOS ATL 8:24 P.M. 10:88 P.M. 0 
Subject: PLEASE SHOW ME THE THREE EARLIEST FLIGHTS 
Atis: I'm sorry I Eot stuck just before the words 
"earliest fliEhts". 
Subject: WHICH ONE IS THE EARLIEST FLIGHT 
Atis: The earliest flight from Boston to Atlanta? 
AIRLINE ~UMBER FROM TO DEPARTURE ARRIVAL STOPS 
DELTA 225 BOS ATL 6:30 A.M 9:13 A.M. 0 
Subject: DOES THIS FLIGHT SERVE BREAKFAST 
Atis: Yes, the earliest flight from Boston to Atlanta does 
serve breakfast. 
AIRLINE NUMBER FROM TO DEPARTURE ARRIVAL STOPS MEALS 
DELTA 225 BOS ATL 6:30 A.M. 9:13 AiM. 0 BREAKFAST 
Subject: END SCENARIO TWO 
Figure 2: An example log for one scenario created by a subject 
Subjects are recruited from the general vicinity of MIT. 
No restrictions in age or sex are imposed, nor do we insist 
362 
that they be native speakers. For their efforts, each subject 
is given a gift certificate at a popular Chinese restaurant or 
an ice-cream parlor. Presently, we are collecting data from 
two to three subjects per day. 
COMPARATIVE ANALYSES 
To facilitate system development, training, and testing, 
we arbitrarily partitioned part of the collected data into train- 
ing, development-test, and test sets, as summarized in Ta- 
ble 1. All the comparative analyses reported in this section 
are based on our designated training set and the TI training 
set, the latter defined as the total amount of training data 
released by TI prior to June 1990. 
Data Set \[# Speakers 
\]Training 
i Development-Test 
Test 
41 
10 371 
10 324 
Sentences 
1582' 
Table 1: Designation of various data sets collected at MIT. 
General Characteristics Table 2 compares some general 
statistics of the data in the TI and MIT training set. On 
the average, the wizard paradigm used at TI can collect 25 
sentences over approximately 40 minutes, for a yield of 39 
sentences per hour \[2\]. In contrast, we were able to collect an 
average of about 39 sentences in approximately 45 minutes, 
for a yield of 53 sentences per hour. Our higher yield is 
presumably due to the fact that the system can respond much 
faster than a wizard; the process of translating the sentences 
into an NLParse command \[2\] by hand can sometimes be 
quite time-consuming: Note that the yields in both cases do 
not include the generation of the ancillary files, which is an 
essential task performed after data collection. 
Variables TI Data MIT Data 
-# Speakers 31 41 
Sentences 774 1582 
Ave. # Sentences/Speaker 25.0 38.6 
Ave..# Words/Sentence 10.65 9.14 
\]% of Table Clarification Sentences 25 1 
i Ave. ~ Words/Second 1.18 2.04 
Table 2: General statistics of the training data collected at TI 
and MIT. 
The average number of words per sentence for the MIT 
data is 15% fewer than that for the TI data. The shorter 
sentences in the MIT data can be due to several reasons. 
The system's inability to deal with longer sentences and the 
feedback that it provides may coerce the subject into making 
shorter sentences. The limited display may discourage the 
construction of lengthy and sometimes contorted sentences 
that attempt to solve the scenarios expeditiously. The inter- 
active nature of problem solving may encourage the user to 
take a "divide-and-conquer" attitude and ask simpler ques- 
tions. Closer examination of the data reveals that the stan- 
dard deviation on sentence length is very different between 
the two data sets (o'TI = 5.53 and aMiT = 3.68). We suspect 
that this is primarily due to the preponderance of short sym- 
bol clarification sentences such as "What does EA mean?" 
in the TI data, along with occasional very long sentences. 
Table 2 shows that 25% of the TI sentences deal with table 
clarification compared to only 1% of our sentences. In fact, 8 
of our 16 table clarification sentences concern airline code ab- 
breviations. They were collected from earlier sessions when 
the display was still somewhat cryptic. Once we made some 
extremely simple changes in the display, such sentences no 
longer appeared. 
The speaking rate of the MIT sentences was more than 
70% higher that that of the TI sentences. We believe that 
the speaking rate of the TI sentences (70 words/minute) is 
unnaturally low. This may be due to the insertion of many 
pauses, or the fact that the subjects simply spoke tentatively, 
due to their unfamiliarity with the task. Acoustic analysis is 
clearly needed before we can know for certain. 
System Growth Rate Figure 3 compares the size of the 
lexicon, i.e., the number of unique words, as a function of the 
number of training sentences collected at TI and MIT. The 
Figure shows that the vocabulary size grows at a much slower 
rate (about 20 words per 100 training sentences) for the MIT 
AWlS data than the TI data (about 50 words per 100 training 
sentences). Also included on the Figure for reference is a 
plot of the growth rate for our VOYAGER. corpus, which was 
collected using the same paradigm as we have used for ATIS. 
A previous comparison of the TI data and the MIT VOYAGER. 
data \[5\] led to the conclusion that the VOYAGER. domain was 
intrinsically more restricted. Since the MIT ATIS data are 
more similar to the VOYAGER. data, it may be the case that a 
more critical factor was the data collection paradigm. Thus, 
one may argue that our data collection procedure is better 
able to encourage the subjects to stay within the domain. A 
slow growth rate may also be an indication that the training 
data is more representative of the unseen test data. 
As further evidence that our training data is represen- 
tative of the data that the system is likely to see, Table 3 
compares the system's performance on the MIT training and 
development-test sets. The similarities in performance be- 
tween the two data sets is striking, suggesting that the system 
is able to generalize well from training data. Since the sys- 
tem can deal with over 70% of the sentences, we feel that the 
subject is not likely to be overly frustrated by the system's 
inability to deal with the remaining sentences. This also re- 
flects the apparent ability of subjects to adjust their speech 
so as to stay generally within the domain of the system. 
Disfluencies Table 4 compares the occurrence of spon- 
taneous speech disfluencies in the two data sets. We define 
363 
0 o 
i 
Iu 
.=_ 
i z 
80O 
600 
4O0 
20C JJJF" I .... MKATIS Data I 
P~ I ....... MIT VOYAGER Data I 
0 0 200 400 600 800 1000 1200 1400 1600 
Number of Training Sentences 
Figure 3: The size of the lexicon as a function of the number of 
training sentences for the TI and MIT ATIS training sets, as well 
as the MIT VOYAGER training set. 
% of Sentences Training Set Development-Test Set 
with New Words 9.3 8.6 
with NL failure 16.9 
Parsed 73.8 
19.7 
71.7 
Table 3: System performance on the training and develop- 
ment-test sets. Evaluation on the training set was conducted 
after the system had been trained on these sentences, whereas 
the development-test set represents unseen data. 
% of Sentences TI Data \[MIT Data 
with filled pauses 8.1 I 1.3 
with lexical false starts i 6.0 \] 2.8 
with linguistic false starts 5.9 1.0 
Table 4: Analyses of disfluencies in the training data collected 
at TI and MIT. 
lexical false starts as the appearance of a partial word and 
linguistic false starts as the appearance of one or more extra- 
neous whole words. Again, our analyses show quite a differ- 
ence between the two data sets along all dimensions. A total 
of 73 filled pauses appear in 63 (or 8.1%) of the TI sentences, 
whereas only 25 appear in 21 (or 1.3%) of the MIT sentences. 
Similarly, it is twice as likely to find a sentence with a lexical 
false start in the TI data as is in the MIT data, and almost 
six times more likely for a linguistic false start. 
DISCUSSION 
By far the most important feature of our data collection 
process is that the system under development is used in place 
of a wizard. We believe that this "system-assisted" paradigm 
offers several advantages. Since the system used to collect 
the data is under continual development, we can periodically 
replace it with an improved version, where the improvement 
will be guided by the data already collected. If, for example, 
we observe that subjects frequently use a certain linguistic 
construct, then we can modify the system to provide that 
capability. Alternatively, we may decide that the capability 
is outside the domain of expertise of the system (for example, 
booking flights). We can then try to keep the subject ':in 
bounds" by experimenting with different subject instructions. 
Furthermore, this type of data collection provides relatively 
realistic sample data of human-machine interaction. This 
distinguishes it from wizard data, where the human wizard 
will answer any query that the back-end can answer. We 
believe that by providing the subject with a more realistic 
situation, where there are a significant number of queries that 
the system cannot handle, we can gather better data about 
the possible training effects of the system on the subject, 
the subject's ability to adapt to the system, and the system 
capabilities to provide useful diagnostics on its limitations. 
By combining data collection and system development into 
closely coupled cycles, we can potentially ensure that the 
type of data that we collect is appropriate for the system 
that we want to develop, thus increasing the efficiency of 
system development. 
Our data collection procedure is also cost effective. Since 
the experimenter plays a very passive role during data collec- 
tion, we eliminate the need for a highly skilled, and presum- 
ably expensive, person to interpret what was said and coerce 
the back-end to generate the necessary responses. This has 
the dual effect of freeing the researchers to concentrate on sys. 
tem development, and reducing the cost of data collection. 
We estimate that the unburdened cost for data collection, in- 
cluding the subject, the experimenter, and the generation of 
the correct transcription% is about $0.85 per sentence. Sub- 
sequent categorization of the sentences, and the generation of 
the reference answers will add another $2.30 to each sentence, 
although this may not be needed for all the data collected. 
The disadvantage of this method of data collection is that 
the baseline system is constantly evolving. This has several 
effects. First, data collected in an earlier session may not 
be comparable to data collected in a later session. This is 
not important if one merely wishes to collect as much spon- 
taneous speech within a given domain as possible. However, 
it can create a consistency problem if the data are used for 
training at a later stage. For example, in an earlier session, 
the system may provide an error message stating that it does 
not understand a particular word, whereas in a later session 
it may be able to handle the identical query with no prob- 
lem. The system response can also change from a diagnostic 
message to a request for further information or clarification. 
The result is that a dialogue collected at one stage in system 
development may not be coherent at a later stage in system 
development. This is a problem for development of dialogue 
handling. However, it may be possible to handle this by eval- 
364 
uating the system after "resetting the context" based on the 
actual system response, as suggested in \[3\]. Despite these 
difficulties, we feel that the ease of data collection with this 
methodology by far outweighs any disadvantages. 
Comparative analyses of the data collected at TI and MIT 
reveal significant differences in many dimensions. Given the 
many ways in which the two procedures differ, it is not al- 
ways easy to attribute the discrepancies to one single fac- 
tor. One of the most striking differences between the TI 
and MIT data is the fraction of sentences dealing with ta- 
ble clarification. One may argue that these sentences are 
unnecessary by-products of the cryptic display format, and 
they contribute very little to the problem of providing grace- 
ful human/machine interface for travel planning. By a very 
simple change in the display format, we were able to reduce 
this type of question by 25 foldl Similarly, a change in data 
collection procedure can reduce the number of spontaneous 
speech phenomena several fold. It is therefore conceivable 
that we can minimize the occurrence of spontaneous speech 
phenomena in veal systems. These effects again underscore 
the importance of collecting the type of data that is as closely 
matched as feasible to the capabilities of the system that we 
are developing. 
While we have used our own natural language system, 
TINA, for data collection, it is important to note that the 
choice was primarily motivated by convenience and availabil- 
ity. Clearly, any functioning natural language system could 
be freely substituted. In fact, a richer pool of data would 
probably arise if data were collected independently at several 
sites, each of which used their own system for the back-end 
responses. 
At this writing, we have collected 3,690 sentences from 
102 subjects. Orthographic transcriptions for all of the sen- 
tences are available, as are the categorizations and reference 
answers for the development-test and test sets. 
ACKNOWLEDGEMENT 
We would like to acknowledge the help of Claudia Sears, 
who served as experimenter and transcriber for much of the 
data, Victoria Palay, who helped in the coordination of the 
data collection, and Bridget Bly of SRI, whom we hired as a 
consultant in order to generate some of the ancillary files for 
the development-test and test sets. 
REFERENCES 
\[1\] Bly, B., P. Price, S. Tepper, E. Jackson, and V. Abrash, 
"Designing the Human Machine Interface in the ATIS Do- 
main," Proc. Third Darpa Speech and Language Workshop, 
pp. 136-140, Hidden Valley, PA, June 1990. 
\[2\] Hemphill, C. T., J. J. Godfrey, and G. R.. Doddington, "The 
ATIS Spoken Language System Pilot Corpus," Proc. Third 
Darpa Speech and Language Workshop, pp. 96-101, Hidden 
Valley, PA , June 1990. 
\[3\] 
\[4\] 
\[5\] 
\[6\] 
\[7\] 
\[s\] 
\[9\] 
\[10\] 
Hirschman, L., D. A. Dahl, D. P. McKay, L. M. Norton, 
L., and M. C. Linebarger, "Beyond Class A: A Proposal 
for Automatic Evaluation of Discourse," Proc. Third Darpa 
Speech and Language Workshop, pp. 109-113,Hidden Valley, 
PA, June 1990. 
Kowtko, J. C. and P. J. Price, "Data Collection and Analysis 
in the Air Travel Planning Domain," Proc. Second Darpa 
Speech and Language Workshop, pp. 119-125, Harwichport, 
MA, October 1989. 
Norton, L. M., D. A. Dahl, D. P. McKay, L. Hirschman, M. 
C. Linebarger, D. Magerman, and C. N. Ball, "Management 
and Evaluation of Interactive Dialog in the Air Travel Do- 
main," Proc. Third DaTa Speech and Language Workshop, 
pp. 141-146, Hidden Valley, PA , June 1990. 
Polifroni, J. and M. Soclof, "Conventions for Transcrib- 
ing Spontaneous Speech Events in the VOYAGER. Corpus," 
DARPA SLS Note 6, Spoken Language Systems Group, MIT 
Laboratory for Computer Science, Cambridge, MA, Febru- 
ary, 1990. 
Polifroni, J., S. Seneff, V. W. Zue, and L. Hirschman, , "Awls 
Data Collection at MIT," DAR.PA SLS Note 8, Spoken Lan- 
guage Systems Group, MIT Laboratory for Computer Sci- 
ence, Cambridge, MA, November, 1990. 
Price P., "Evaluation of Spoken Language Systems: The 
ATIS Domain," Proc. Third Darpa Speech and Language 
Workshop, pp. 91-95, Hidden Valley, PA , June 1990. 
Seneff, S., L. Hirschman, and V. W. Zue, "Interactive Prob- 
lem Solving and Dialogue in the ATIS Domain," These Pro- 
ceedings. 
Zue, V., N. Daly, J. Glass, H. Leung, M. Phillips, J. Po- 
lJfroni, S. Seneff, and M. Soclof, "The Collection and Pre- 
lim_inary Analysis of a Spontaneous Speech Database," Proc. 
Second Darpa Speech and Language Workshop, pp. 126-134, 
Harwichport, MA, October 1989. 
365 
