CORPUS COLLECTION FOR ATIS 
Jared Bernstein
SRI International 
Menlo Park, CA 94025 
PROJECT GOALS 
The project goal is to collect and deliver a corpus of speech
data that supports DARPA SLS system development. As of February
1991, SRI has set up a hardware and software environment
for the collection of spoken interactions with a simulated Air
Travel Information System (ATIS), established a data collection
procedure, collected and distributed prototype data, and evaluated
the prototype data with feedback from the SLS system
developers. Having implemented revisions in the environment
and procedures, SRI has begun collecting and distributing a corpus
of data for ATIS SLS development.
RECENT RESULTS 
• Completed a plan for the interface to the relational database,
the collection of the prototype and production data, and the
subject environment.
• Collected 10 prototype subject sessions, prepared speech
and auxiliary files, and shipped data to NIST for distribution
to interested SLS developers. Interacted with NIST and with
SLS sites to refine certain aspects of the data collection
environment and procedures.
• Modified and augmented the tools used in data collection
and file preparation; e.g., automated parts of the transcription
task and the derivation of additional auxiliary files, and
augmented wizard tools to accelerate database responses.
• Provided yield and cost estimates for revised transcription
protocols and for extended categorization of utterances.
• Shipped 35 subject sessions to NIST. Recorded and transcribed
sessions, generated auxiliary files, prepared session
logs and categorized utterances; checked and prepared
material for shipment to NIST.
• Shipped 32 more subject sessions to NIST. Categorized,
prepared auxiliary files for, checked, and shipped 32 sessions
previously recorded and transcribed in summer 1990
under SRI's ATIS SLS contract.
PLANS FOR THE COMING YEAR 
• Resume and accelerate data collection in the ATIS domain. 
• Document systems and procedures in preparation for export 
of the wizard data collection system. 
• Work with NIST and the DARPA community to define and 
implement new speech corpus collections. 
