Corpus Development Activities at the Center for Spoken 
Language Understanding 
Ron Cole, Mike Noel, Daniel C. Burnett, Mark Fanty, 
Terri Lander, Beatrice Oshika, Stephen Sutton 
Center for Spoken Language Understanding 
Oregon Graduate Institute of Science and Technology 
Portland, Oregon 97291 
ABSTRACT 
This paper describes eight telephone-speech corpora at vari- 
ous stages of development at the Center for Spoken Language 
Understanding. For each corpus, we describe data collection 
procedures, methods of soliciting callers, protocol used to col- 
lect the data, transcriptions that accompany the speech data, 
and the expected release date. The corpora are available at 
no charge to academic institutions. 
1. INTRODUCTION 
The Center for Spoken Language Understanding (CSLU)
collects and transcribes telephone-speech data to enable
research activities at CSLU and elsewhere. Corpus development
activities are performed by four full-time staff, aided by
graduate students and part-time employees. In 1994, we
anticipate collecting and transcribing speech from 10,000
callers in twenty languages. Corpus development activities are
supported by industrial memberships and research grants.
Corpus development activities at CSLU include: (a) collect- 
ing telephone speech data in different languages; (b) tran- 
scribing speech at word and phonetic levels; (c) developing 
and documenting transcription conventions for each level; (d) 
measuring the level of agreement among transcribers; (e) de- 
veloping interactive speech tools for labeling; (f) distributing 
the speech corpora to academic institutions free of charge; 
and (g) placing speech tools and labeling conventions in the 
public domain for use by others. 
In this section, we present some general information about 
our corpus development activities. In the following sections, 
we will describe individual corpora. 
Data Collection. Telephone-speech data are collected over
analog and digital telephone lines. Prior to November 1993,
speech data were collected over analog lines using several
Gradient Technology Desklabs. Since November 1993, the
majority of our data has been collected using a 24-channel
T1 line connected to three LINKON FC3000 Communication
Boards. We are also using an Apple GeoPort Telecom
Adapter connected to a Macintosh Quadra A/V to collect
analog speech data for one of the corpora to be described.
Transcription. Each call is processed by one or more
listeners. Calls are verified to determine that the caller
followed instructions and, in some cases, transcribed at some
level. Transcription of corpora occurs at three different
levels: non-time-aligned word level, time-aligned word level,
and time-aligned phonetic level. Non-time-aligned word level
transcription involves producing an orthographic representation
of the utterance, including indications of extra-speech events
such as breaths or lip smacks, without time markings.
Time-aligned word level transcription provides the same
orthographic transcription augmented with time alignment
markings. Time-aligned phonetic transcription involves aligning
phonetic symbols to the acoustic signal.
A precise description of the conventions used for all levels
of labeling, including a complete list of all phonetic labels
for each language, is presented in the CSLU conventions
document [1].
Transcription Reliability. We are conducting experiments
to determine the level of agreement among labelers. In these
experiments, CSLU staff and professional phoneticians are
using Worldbet [5] to transcribe the same intervals of speech.
Initial results for English indicate overall agreement of
approximately 80% across all labels, ranging from
approximately 70% for vowels to greater than 90% for stops and
nasals.
Speech Tools. The OGI Speech Tools support data
manipulation, analysis and display [2]. All corpus development
activities are performed using these tools. They were
developed at CSLU, then made portable and documented for
distribution with support from NSF. The tools have been made
available to the research community through anonymous ftp.
2. CORPORA
The first three corpora described in this section are
considered to be complete and are now available from CSLU. They
were collected over an analog telephone line using a Gradient
Technology Desklab connected via the SCSI port to a
workstation. The data were digitized at 8000 samples per second
with 14-bit resolution. All data are stored in the NIST
wav file format, some with MIT shortpack compression. The
remaining corpora are under development, and estimated
release dates are provided for each.
2.1. Spelled and Spoken Names Corpus
The Spelled and Spoken Names Corpus [3] contains
utterances from 3667 calls. Callers were solicited through
computer newsgroups and a public relations campaign initiated
by OGI. The majority of callers were from the Pacific
Northwest. The ratio of male to female callers is 1.15 : 1.
The goal was to collect samples of spoken English letters and 
spoken words to support a research project funded by U S 
WEST. Callers received the following prompts: 
• What city are you calling from? 
• What is your last name? 
• Please spell your last name. 
• Please spell your last name with short pauses between 
letters. 
• Does your last name contain the letter A as in apple? 
• What is your first name? 
• Please spell your first name, with short pauses between 
letters. 
• What city and state did you grow up in? 
• We will now ask you to say the alphabet. We need you 
to pause briefly between letters, like this: A B C D E 
F G. You may hang up when you are finished. Please 
begin speaking now. 
• Would you like to receive more information about the 
results of this project? 
• If you would like more information about this project,
please leave your name and address at the tone.
Documentation of the Spelled and Spoken Names Corpus includes
a speaker-by-speaker log file containing orthographic
transcriptions of each utterance. Each utterance was tran- 
scribed by two separate listeners. The log also contains the 
global judgments of gender, age, connection quality, accent, 
and intelligibility. In addition, occurrences of extraneous 
speech, environmental noise, excessive breath, or line noise 
are indicated in the log file for each utterance. 
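As a rough illustration of how the double transcription recorded in the log could be used, the sketch below models one log entry and lists the utterances on which the two listeners disagree. The record layout, field names, and sample values are hypothetical; the actual CSLU log file format is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class UtteranceLogEntry:
    """One utterance in a speaker's log (hypothetical layout, not the CSLU format)."""
    prompt: str                               # e.g. "spell last name"
    transcription_a: str                      # first listener's orthographic transcription
    transcription_b: str                      # second listener's orthographic transcription
    flags: set = field(default_factory=set)   # e.g. {"line noise"}


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def disagreements(entries):
    """Return the entries whose two transcriptions differ after normalization."""
    return [e for e in entries
            if normalize(e.transcription_a) != normalize(e.transcription_b)]


# Made-up sample data, for illustration only.
log = [
    UtteranceLogEntry("say last name", "Smith", "smith"),
    UtteranceLogEntry("spell last name", "S M I T H", "S M Y T H", {"line noise"}),
]
for entry in disagreements(log):
    print(entry.prompt, "->", entry.transcription_a, "|", entry.transcription_b)
```

Utterances flagged this way would be candidates for a third listening pass; whether CSLU resolves disagreements in that manner is an assumption.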
A subset of the data was transcribed at the time-aligned
phonetic level. The utterances were labeled by hand; the labels
and their time alignments with the speech spectrogram were then
verified by an expert spectrogram reader. The subsets of
phonetically labeled utterances available to date are as follows:
Type                          Number
alphabet                      100
hometown                      1359
callfrom                      693
say first name                100
say last name                 101
spell last name with pause    300
2.2. Enhanced OGI Multi-Language Corpus
The OGI Multi-Language Telephone-Speech Corpus [4]
consists of telephone speech from 10 languages: English, Farsi,
French, German, Japanese, Korean, Mandarin, Spanish,
Tamil and Vietnamese. The initial corpus included 900
calls, 90 calls for each language.
Callers were solicited through computer newsgroups. Each
caller was asked to respond to the following prompts:
• What is your native language?
• What language do you speak most of the time?
• Please recite the seven days of the week.
• Please say the numbers zero through ten.
• Tell us something that you like about your hometown.
• Tell us about the climate in your hometown.
• Describe the room that you are calling from.
• Describe your most recent meal.
In addition, unconstrained speech was obtained by asking
callers to speak for 1 minute on any topic of their choice.
Each utterance was listened to by a native speaker of the
language to verify that the caller responded appropriately. The
native speaker also made judgments concerning the caller's
gender, the caller's age, and the line quality.
The enhanced corpus is augmented with: (a) 200 Hindi calls;
(b) speech files that were collected during the original
collection but were not included in the original distribution;
and (c) time-aligned phonetic transcriptions of over five hours
of speech (up to 50 sec per call) in six languages: English,
Japanese, German, Spanish, Hindi, and Mandarin. For the
broad phonetic transcription, we have adopted the Worldbet
labeling scheme, a set of orthographic symbols for
multi-language transcription that correspond to IPA symbols [5].
The rationale for using Worldbet and the inventory of
symbols for each language are provided in [1].
2.3. Stories Corpus
Collection for the OGI Multi-Language Corpus produced
additional calls from English speakers not included in the
Multi-Language Corpus. The Stories Corpus consists of up to 50 sec
of spontaneous speech (hereafter "stories") from 692 English
calls. All 692 calls have been transcribed at the
non-time-aligned word level, 300 at the time-aligned word level, and
200 at the time-aligned phonetic level.
2.4. Twenty-one Language Corpus
CSLU plans to collect and verify calls from at least 200 fluent
native speakers in 21 languages: Eastern Arabic, Cantonese,
Czech, Farsi, French, German, Hindi, Hungarian, Japanese,
Korean, Malay, Mandarin, Italian, Polish, Portuguese,
Russian, Spanish, Swedish, Swahili, Tamil, and Vietnamese.
Verification and global judgments will be performed by native
speakers.
The following is the English version of the protocol for the 
twenty-one language corpus. The protocol will be presented 
to the caller in their language. 
Thank you for calling the Oregon Graduate In- 
stitute language database. We are currently 
recording speech in (language). We are studying 
the different languages of the world. To do this, 
we need to record samples of speech from fluent 
speakers of (language). Please respond to the fol- 
lowing questions and instructions in (language) 
only. This will take about 7 minutes. Please 
wait for the beep before speaking. 
• What is your native language? 
• What language do you speak most of the time? 
• What language do you speak at home? 
• What other languages do you speak and understand? 
• How old are you? 
• What is your date of birth? 
• Are you male or female? 
• How long have you been in the United States? 
• What city and state did you spend most of your child- 
hood? 
• What is your zipcode? 
• What area code are you calling from? 
• What day is today? 
• What time is it? 
• Say a familiar telephone number. 
• How would you ask someone if they speak (language)? 
• Give us the greeting you usually use when answering the 
phone. 
• For each of the following descriptions, we will 
record the first ten seconds of your answer. Be- 
gin speaking at the beep. A second beep will 
indicate when we have finished recording your 
answer to each question. 
• Describe the route you take to work or to the store. 
• Tell us something that you like about your hometown. 
• Tell us about the climate in your hometown. 
• Describe the room you are calling from. 
• Describe your most recent meal. 
• We now want you to talk for a longer period of 
time. We do not care what you say as long as 
you keep talking. You can tell us anything about 
yourself, your hobbies and interests, the city that 
you live in, and the sports that you like. Or you 
can make up a story, tell a fairy-tale or recite 
a poem. You will have 1 minute to speak. We 
will now give you 10 seconds to think about what 
to say. Please do not read anything, we would 
prefer you make something up. 
• Please begin talking at the beep. You will hear a second 
beep when you have 10 seconds left. 
• For the last question, we would like you to tell us some- 
thing about yourself in English. If you do not speak 
English, you may push any button on your phone, or 
simply wait for 20 seconds. At the beep, please tell us 
something about yourself in English. 
• If you are calling from a touch tone phone, please push 
the number 2 button. 
• Would you like to receive a gift certificate for McDonalds 
or for TCBY frozen yogurt? 
• Thank you for your participation. If you would like a 
gift certificate please leave your name, address, and gift 
certificate selection. Your name and address will be kept 
confidential. 
To date, the prompts for several of the languages have been 
recorded by native speakers. We expect to begin collection 
for five languages in March 1994 and then will add five more 
languages every two weeks until the collection is finished. The
expected completion date is yet to be determined.
2.5. English Census Corpus 
In conjunction with the U.S. Bureau of the Census, CSLU 
is collecting data to develop a prototype automated census 
system. Callers were solicited by the Census Bureau; a mem- 
orandum was sent to regional offices asking Census Bureau 
employees, their family members and family friends to call an 
800 number on a voluntary basis to provide speech data for 
the study. A different 800 number was provided for each city. 
The cities are Dallas, Chicago, Boston, Charlotte, Atlanta, 
Philadelphia, Denver, Kansas City, Detroit, and Seattle. 
Two protocols were used that differed in the wording of some 
of the prompts. Each protocol was recorded by both male and 
female speakers. In addition, male and female synthesized 
voices were used. Incoming calls were assigned to the eight 
conditions (prompt X gender X source) in rotation. 
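The rotation over the eight conditions can be sketched as a simple round-robin. This is a minimal illustration, assuming the three factors are protocol wording, voice gender, and voice source (recorded or synthesized); the label strings and the function name are invented for the example and do not come from the collection software.

```python
from itertools import cycle, product

# Factor levels taken from the text; the label strings are illustrative.
PROTOCOLS = ("protocol_1", "protocol_2")
VOICE_GENDERS = ("male", "female")
VOICE_SOURCES = ("recorded", "synthesized")

# 2 x 2 x 2 = 8 conditions, in a fixed order.
CONDITIONS = list(product(PROTOCOLS, VOICE_GENDERS, VOICE_SOURCES))


def assign_conditions(call_ids):
    """Pair each incoming call with the next condition, wrapping after the eighth."""
    return list(zip(call_ids, cycle(CONDITIONS)))


assignments = assign_conditions(range(1, 11))
for call_id, condition in assignments:
    print(call_id, condition)
```

With this scheme the ninth call receives the same condition as the first, so the conditions stay balanced to within one call regardless of how many calls arrive.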
An interesting feature of the data collection was the use of 
automatic recognition to control the protocol. Recognition of 
"yes," "no," "other," and "American Indian" was performed 
at certain decision points to determine subsequent prompts. 
This is illustrated in the following protocol: 
• Thank you for calling the OGI census project. 
We appreciate your help. The goal of this study 
is to determine the feasibility of using a comput- 
erized questionnaire for the Year 2000 Census. 
This research is sponsored by the United States 
Census Bureau. The answers you give to the fol- 
lowing questions will be kept confidential. After- 
wards we will ask you some questions to help us 
evaluate this questionnaire. It will take approx- 
imately four minutes to complete. Please wait 
for the tone before answering each question. 
• Please say your first name. 
• Please spell your first name. 
• Please say your last name. 
• Please spell your last name. 
• Please say your middle initial. If you have no middle 
initial, say "none". 
• What is your sex, female or male? 
• We will now ask about your marital status. Have you 
ever been married? Please say yes or no. 
• (if yes, then) Which one of the following options best 
describes your current marital status: now married, wid- 
owed, divorced, or separated? 
• We will now ask about your date of birth. What month 
were you born? 
• What day of the month? 
• What year? 
• We will now ask about your origin. Are you of Spanish
or Hispanic origin? Please say yes or no.
• (if yes then) Are you of Mexican, Mexican-American or
Chicano origin? Please say yes or no.
• (if no then) Are you of Puerto Rican origin?
• (if no then) Are you of Cuban origin?
• (if no then) Please say what other Spanish or Hispanic
group is your origin.
• Please spell that.
• We will now ask about your race. Are you: White, Black
or Negro, American Indian, Eskimo, Aleut, or other?
• (if American Indian, then) What is the name of your
tribe?
• Please spell that.
• (if other, then) Okay. Are you: Chinese, Japanese,
Asian Indian, Korean, Vietnamese, or other?
• (if other, then) Okay. Are you: Filipino, Hawaiian,
Samoan, Guamanian, or other?
• (if other, then) Please say the name of your race.
• Please spell that.
• Is that the name of an Asian or Pacific Islander race?
• Do you have a telephone at home? Please say yes or no.
• (if yes, then) Please say your home telephone number,
area code first.
• Finally, we'd like some additional information to help us
with our study. What is your native language?
• In what city and state did you spend most of your
childhood?
• Are you a Census Bureau employee?
• This concludes the questionnaire portion. We
will now ask you some questions to help us
evaluate this questionnaire.
• Would you be willing to provide census information
using a questionnaire of this type over the telephone?
• In this questionnaire, we asked about your name, sex,
marital status, date of birth, origin, race and telephone
number. Please tell us about any questions you found
unclear or poorly worded.
• What, if anything, did you like about this questionnaire?
• What, if anything, do you suggest we do to improve this
questionnaire?
• We would like to hear any further comments you may
have. You may begin speaking at the tone. When you're
through, if you would like a gift certificate to either
Baskin Robbins, TCBY Yogurt, B. Dalton Books,
McDonald's, or Blockbuster Video, please say which one
and leave your mailing address. Thank you for your
help.
Each call will be transcribed at the time-aligned word level,
including indications of filled pauses and other non-speech
events. Each utterance will also be assigned a behavior code
which characterizes the usability of the response. The
behavior codes are described in the following table.
Code  Full Name                           Meaning
AA1   Adequate Answer 1: concise          Answer is concise and responsive.
AA2   Adequate Answer 2: usable           Answer is usable but not concise.
AA3   Adequate Answer 3: responsive       Answer is responsive but not usable.
QA    Qualified Answer                    An adequate answer in which
                                          respondent expresses uncertainty.
IA1   Inadequate Answer 1: unresponsive   Answer does not seem to be responsive.
IA2   Inadequate Answer 2: says nothing   Respondent says nothing at all (may
                                          have hung up, or may be lurking).
RC    Request for Clarification           A request for clarification as to the
                                          meaning of a concept or survey
                                          question. Not used for respondent
                                          asking for a repeat due to
                                          background noise, etc.
IN    Interruption                        Respondent interrupts the speaking of
                                          the question. This code implies a
                                          second code to account for the
                                          content of the interruption.
DK    Don't Know                          "I don't know" or any other
                                          equivalent formulation.
RF    Refusal                             Respondent refuses to answer.
O     Other respondent behavior           Respondent behavior not captured in
                                          codes listed above. Also includes
                                          requests for repetition based on not
                                          hearing the question.
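For readers who want to process the behavior codes programmatically, one possible representation is an enumeration keyed by the short codes (AA1 through O). This is a sketch only: the class and function names are hypothetical, and the single check implemented is the rule stated for Interruption, namely that it must be accompanied by a second code.

```python
from enum import Enum


class BehaviorCode(Enum):
    """Behavior codes and their full names as given for the census corpus."""
    AA1 = "Adequate Answer 1: concise"
    AA2 = "Adequate Answer 2: usable"
    AA3 = "Adequate Answer 3: responsive"
    QA = "Qualified Answer"
    IA1 = "Inadequate Answer 1: unresponsive"
    IA2 = "Inadequate Answer 2: says nothing"
    RC = "Request for Clarification"
    IN = "Interruption"
    DK = "Don't Know"
    RF = "Refusal"
    O = "Other respondent behavior"


def codes_are_valid(codes):
    """Check the codes assigned to one utterance.

    Only the rule stated in the code descriptions is enforced: an
    Interruption (IN) needs a second code describing its content.
    """
    if not codes:
        return False
    if BehaviorCode.IN in codes:
        return any(code is not BehaviorCode.IN for code in codes)
    return True


print(codes_are_valid([BehaviorCode.AA1]))                   # a plain concise answer
print(codes_are_valid([BehaviorCode.IN]))                    # interruption without content code
print(codes_are_valid([BehaviorCode.IN, BehaviorCode.RC]))   # interruption asking for clarification
```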
We are in the process of transcribing the calls that have been 
collected. We expect that the transcriptions will be com- 
pleted and the corpus ready for distribution by September 
1st, 1994. 
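The recognition-controlled branching described earlier in this section can be sketched for the Spanish/Hispanic origin questions of the census protocol. The sketch assumes a hypothetical `recognize` callable that plays a prompt and returns the recognized response as a string; the real system's recognizer interface is not described in this paper, and the reading of the "(if no then)" chain as a sequence that stops at the first "yes" is an interpretation of the protocol listing.

```python
def origin_prompts(recognize):
    """Walk the origin questions and return the prompts that were played.

    `recognize` is a hypothetical callable: it takes a prompt string and
    returns the caller's recognized response (for example "yes" or "no").
    """
    played = []

    def ask(prompt):
        played.append(prompt)
        return recognize(prompt)

    if ask("Are you of Spanish or Hispanic origin? Please say yes or no.") == "yes":
        for group_prompt in (
            "Are you of Mexican, Mexican-American or Chicano origin? "
            "Please say yes or no.",
            "Are you of Puerto Rican origin?",
            "Are you of Cuban origin?",
        ):
            if ask(group_prompt) == "yes":
                break  # a group was confirmed; skip the remaining questions
        else:
            # No listed group was confirmed: collect a free response.
            ask("Please say what other Spanish or Hispanic group is your origin.")
            ask("Please spell that.")
    return played


def scripted(answers):
    """Stand-in recognizer that replays a fixed list of responses."""
    remaining = iter(answers)
    return lambda prompt: next(remaining)


for prompt in origin_prompts(scripted(["yes", "no", "yes"])):
    print(prompt)
```

A caller who answers "no" to the first question hears only that one prompt, which is how recognition shortens the call for most respondents.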
2.6. Cellular Words, Numbers and Alpha- 
bet Corpus 
This corpus will consist of up to 600 calls made from cellular
phones. Each caller answers nine questions, says words that
might be used in voice messaging applications, says a familiar
phone number, and recites the letters of the English alphabet.
Callers are being provided by a private company that helped
fund the data collection.
The corpus is being collected using the Gradient Technology 
Desklab over an analog line. Non-time-aligned word level 
transcriptions are being produced. 
The protocol for the corpus is: 
• Are you calling from a cellular phone? 
• If you happen to know if you are calling from an analog 
or digital phone, please say which one. 
• Are you using a speaker phone? 
• What is your native language?
• Where were you born?
• Where did you spend your childhood?
• What is the month day and year of your birth?
• Please say your name.
• Please say the name of the company or organization you 
are with. 
• We will now say a set of words, and would like 
you to repeat each word after you hear it. The 
words that you speak are intended to be com- 
mands to a voice processing system. When you 
say each command, try to imagine that you are 
telling the system what to do. 
• The caller was prompted for the following words one at
a time. Each word was presented in the carrier phrase
"Say ... now".
Cancel, Change Greeting, Continue, Copy, Erase, Help, 
Listen, No, Operator, Pause, Replay, Rerecord, Reply, 
Resume, Review, Save, Send copy, Yes, Add, Dial, Call, 
Edit, Callback, Change, Delete, Phonebook, Beginning, 
Choices, End, Directory assistance, Customer support, 
Next, Repeat, Replay message, Return call, Skip, Tu- 
torial, Customer care, Verify, Scan, Messages, Message, 
List, Rewind, Fax, Voice, Print. 
• Please say a familiar phone number, one digit at a time. 
• We would now like you to recite the English alphabet 
with a brief pause between letters, like this: A B C D E. 
Please hang up when you are finished. Thanks again. 
Currently, approximately 300 calls have been collected and 
transcribed. We estimate that the corpus will be ready for 
distribution May 1994. 
2.7. Words, Numbers and Phrases Corpus 
With support from Apple Computer, CSLU is collecting
both analog and digital speech data for utterances related
to voice messaging and voice control of computer
applications. Callers are being provided both by Apple Computer
and by CSLU through newspaper advertisements.
The protocol consists of two questions to help determine the 
caller's language background, followed by instructions to re- 
peat 35 words or phrases given in the prompt. To increase 
the usefulness of the corpus, several sub-vocabularies, includ- 
ing first names, last names, digits, numbers and days of the 
week were inserted into the prompts. For example, the phrase 
"phone (first name)" is expanded to 50 different phrases using 
50 common first names. 
There are about 350 different phrases that will be recorded 
from different speakers. 
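The expansion of a template such as "phone (first name)" into one phrase per sub-vocabulary entry can be sketched as follows. The vocabulary entries below are invented stand-ins, since the actual name and number lists are not given here, and filling a two-slot template with every combination of entries is an assumption rather than a documented choice.

```python
import re
from itertools import product

SLOT = re.compile(r"\((\w+)\)")

# Stand-in sub-vocabularies; the real lists are not reproduced in the paper.
SUB_VOCABULARIES = {
    "firstname": ["Alice", "Bob", "Carol"],
    "day": ["Monday", "Tuesday"],
}


def expand(template, vocab=SUB_VOCABULARIES):
    """Return every phrase obtained by filling the '(slot)' markers in a template."""
    slots = SLOT.findall(template)
    if not slots:
        return [template]
    phrases = []
    for words in product(*(vocab[slot] for slot in slots)):
        fillers = iter(words)
        # Slots are replaced left to right, matching the order of `slots`.
        phrases.append(SLOT.sub(lambda match: next(fillers), template))
    return phrases


print(expand("phone (firstname)"))
print(expand("call my daughter at eleven pm on (day)"))
print(expand("quit"))
```

With a list of 50 first names, `expand("phone (firstname)")` yields the 50 phrases mentioned above.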
The goal is to collect speech from 1000 speakers using an Apple
Macintosh Quadra A/V and from 2000 speakers on the digital T1
system using the LINKON setup.
The protocol is as follows: 
• Thank you for calling the Center for Spoken 
Language Understanding speech data base. We 
appreciate your willingness to participate in 
our study. This research is directly related to 
developing better human computer interaction 
through the use of voice control. During this 
call we will be asking you to answer questions 
and repeat phrases. After each prompt please 
wait for the beep before responding. 
• First we would like to ask a couple of questions to help
us characterize your speaking patterns. What is your 
native language? 
• In what city or state did you spend most of your child- 
hood? 
• For the rest of this call we will say a phrase and 
ask you to repeat it. For example, we would 
say "read this text" and you would respond by 
saying "read this text". Please say the phrase as 
if you were giving a command to a computer. 
• play previous message again 
• cancel my ten AM appointment 
• make a meeting for today 
• what is my street address 
• quit 
• forward this message to my wife 
• set-up a call with (firstname) and (firstname) 
• conference call (lastname) and (lastname) 
• who is at work 
• stop 
• what is the area code for this state 
• add my son to the phone book 
• remove number (digit) from the directory
• hello, what are my messages 
• skip the next name 
• help 
• good-bye 
• please send a car from the city 
• dial (number) 
• delete my email tomorrow 
• cancel 
• read this text 
• correct my balance 
• call my daughter at eleven pm on (day) 
• erase all information 
• no 
• record extended phonebook 
• get my office 
• transfer all calls to home at twelve oclock 
• use voice 
• record urgent message 
• yes 
• find the operator 
• call (firstname) 
• dial (lastname)
• phone (firstname)
• call (number) 
• phone (number) 
• Thank you for your participation. If you would like to 
receive a gift certificate for either McDonalds, TCBY yo- 
gurt, B Dalton Books, Blockbuster, or Baskin Robbins 
please leave your name, address, and selection. You may 
hang up when you are done. Thank you. 
The data collection is just beginning. We expect this corpus 
will be available September 1994. 
2.8. OPERA Corpus 
CSLU is collaborating with the International Computer
Science Institute (ICSI) at Berkeley to develop speech corpora
for Open Performance Evaluation of Recognition Algorithms
(OPERA). These corpora will be distributed with designated
training and test sets to all researchers who wish to compare
recognition performance on a common task. Performance
evaluation and summary of results will also be provided.
The first OPERA corpus, now under development, consists
of numbers taken from three of the corpora described earlier:
the Spelled and Spoken Names Corpus, the Cellular Words,
Numbers and Alphabet Corpus, and the English Census
Corpus. We estimate the final corpus will consist of about 10,000
different numbers.
Thus far, we have created numbers files from utterances in
the Spelled and Spoken Names Corpus in which the caller
provided their street address and zip code. Speech intervals
containing numbers found in street addresses, street names
(e.g., "fifth") and zip codes were located manually, and new
files were created containing just the numbers. From
approximately 1300 different speakers, 2167 files have been created.
Each file has been transcribed at the non-time-aligned word
level and at the time-aligned phonetic level.
3. AVAILABILITY
CSLU is dedicated to promoting progress in the field of
computer speech recognition. To this end, corpora are made
available at no charge to academic institutions. Each corpus
becomes available once it is completed. Portions of the
Enhanced Multi-Language Corpus have been placed in the
public domain.
For information on obtaining any of these corpora, the
conventions document, or the speech tools, contact Mike Noel at
noel@cse.ogi.edu.
4. ACKNOWLEDGMENTS
We are indebted to the organizations that helped fund the
projects: U.S. Bureau of the Census, ONR, NSF, Linguistic
Data Consortium, U S West, Digital Equipment Corporation,
LINKON Corporation, and Apple Computer.
Much of the corpus development would have been impossible
without the dedicated efforts of the labeling and transcribing
staff. Many thanks are due to Terri Durham, Vince
Weatherhill, Amie Wilson, Victoria Noel, Alexandra Guerra, Troy
Bailey, Johan Schalkwyk, and many others.
References
1. Terri Lander, S. T. Metzler, The CSLU Labeling Guide,
CSLU, Oregon, February 1994.
2. CSLU, OGI speech tools user's manual, Technical
report, Center for Spoken Language Understanding,
Oregon Graduate Institute, 1993.
3. R. A. Cole, K. Roginski, and M. Fanty, A Telephone
Speech Database of Spelled and Spoken Names,
Proceedings of the International Conference on Spoken
Language Processing, Banff, Alberta, Canada, October
1992, pp. 891-893.
4. Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, The
OGI multi-language telephone speech corpus,
Proceedings of the International Conference on Spoken
Language Processing, Banff, Alberta, Canada, October
1992, pp. 895-898.
5. James L. Hieronymus, ASCII phonetic symbols for the
world's languages: Worldbet, Journal of the
International Phonetic Association, 1993.
