DARPA ATIS Test Results 
June 1990 
D. S. Pallett, W. M. Fisher, J. G. Fiscus, and J. S. Garofolo 
Room A 216 Technology Building 
National Institute of Standards and Technology (NIST) 
Gaithersburg, MD 20899 
Introduction 
The first Spoken Language System tests to be 
conducted in the DARPA Air Travel 
Information System (ATIS) domain took place 
during the period June 15 - 20, 1989. This 
paper presents a brief description of the test 
protocol, comparator software used for scoring 
results at NIST, test material selection process, 
and preliminary tabulation of the scored 
results for seven SLS systems from five sites: 
BBN, CMU, MIT/LCS, SRI and Unisys. One 
system, designated cmu-spi(r) in this paper, 
made use of digitized speech as input (.wav 
files), and generated CAS-format answers. 
Other systems made use of SNOR 
transcriptions (.snr files) as input. 
Test Protocol 
The test protocol for these tests was modelled 
after precedents established over the past 
several years in the DARPA Resource 
Management speech recognition benchmark 
tests. On June 11, 1990, participating sites 
were notified of the availability of SNOR- 
format transcriptions for a designated set of 
93 "Class A" test utterances. Data were 
available using an FTP protocol that had been 
used earlier for access to system training 
material. Copies of the speech waveform files 
were distributed to three sites using Exabyte 
tapes. Responses were provided (in most 
cases) to NIST on June 15. In return, NIST 
provided a key to an encrypted version of the 
complete ATIS session data from which the 
test material was selected, and sites were free 
to access that data in scoring their system 
results locally. A preliminary summary of the 
test results was distributed by NIST to 
participants on June 18th. 
Availability of the ATIS Data 
Access to all of the ATIS data released by 
NIST (except for the speech waveform data) 
has been available via anonymous FTP. The 
speech waveform data has been made 
available to three sites to date on 8ram 
Exabyte tape: AT&T, CMU and SRI. 
Production of the entire Pilot ATIS Corpus is 
planned for release by NIST on CD-ROM 
media after completion of the corpus 
collection effort at TI. 
Comparator Software 
Both REF (reference) and HYP (hypothesized) 
answers were to be written in a CAS 
(Common Answer Specification) format that 
was a slight adaptation of the CAS originally 
developed by BBN, and which had been 
agreed on by the CAS/Comparator Task 
Group. Two programs were available to aid 
in the evaluation by automatically comparing 
matching REF and HYP answers: one in LISP, 
contributed by BBN, and one in C, developed 
at NIST. Final responsibility for decisions on 
114 
whether or not an answer is correct rested 
with human judges at NIST. 
When the comparator programs ran, there 
were a few disagreements in two areas: (1) 
some answers were scored "correct" by the 
more forgiving NIST code even though the 
REF and HYP answers disagreed in the use of 
quotation marks delimiting an answer; and (2) 
some HYP answers consisting of tables of 
numeric codes were incorrectly counted as 
matching the REF answers by the NIST code. 
The first area is a trivial matter of formats and 
how forgiving a program should be. Our 
judgement in these cases was that the content 
of the answer was correct. 
The second area raises some interesting logical 
questions. In a typical case, the required 
column in the table was a code that looked 
like a number, such as flight_code; because 
integers and floating point numbers were to 
be treated as the same type, the tolerance for 
floating point comparison was used in deciding 
equality. Because the key fields of the extra 
erroneous rows were "close enough" to the 
correct ones, they were ignored. An ad hoc 
code change was subsequently made in the 
NIST code so that the tolerance was used only 
in equality tests when at least one of the 
numbers was floating point. But we think 
that in principle there is nothing wrong about 
using a tolerance in comparing two integers; 
the real wrongness is treating a pointer (or 
name) as a number. One principled way to 
clear this up would be to consistently use 
enclosing quotation marks to indicate tokens 
that are not to be treated numerically. 
We had to increase considerably the space 
allotted to input buffers in the NIST C 
software, since one answer that was submitted 
took more than 175 K bytes. 
As a result of seeing some particular answers, 
one more change was made in the NIST 
Comparator code to make it more forgiving: 
leading and trailing whitespace in a string is 
now ignored. This made several answers from 
one of the sites count as correct, in agreement 
with our judgement that they had the right 
content. 
Several examples came up in the test answers 
to illustrate the trouble with looking for 
matches of only values, without constraining 
the values to be of the same variable, in 
conjunction with allowing extra values in a 
tuple. For instance, query bdO0cls, "WHAT IS 
CLASS Y", has the REF answer (CY")), and 
one of the HYP answers supplied is: 
(('Y' '~' "COACH" "NO" "YES" "NO" "NO" 
"NONE" "1234567")) 
Our CAS specification counts this is a correct 
match, although it is indeterminate which of 
the tuple's two "Y" fields was matched. A 
HYP answer with '~' value in any field would 
count as equally correct. In tabular answers 
intended for human consumption, this problem 
is solved by supplying column headings. It 
would be easy to incorporate a similar system 
into our computerized scoring methods. 
Test Material Selection Process 
Time was available to do only cursory study 
of material in sorting it into training and test 
bins. A vague, intuitive sense of "plain 
vanilla" vs. "weird" was used. Several sessions 
were ruled out as test material because they 
were unusual: one had an extremely low 
frequency of "Class A" queries; in another, 
almost all the queries were just NP's, without 
verbs. In the sessions accepted as test 
material, all "Class A" queries were used. 
It was strongly suggested that we partition the 
results into "new word" and "old word" sets, 
the "new word" set being those queries 
containing words not in the training material. 
This motivated us to think some about the 
"new word" problem. Probably the principle 
being implicitly addressed here is that test 
queries are unfair ff they are not answerable 
by the logical generalization of the 
i15 
conjunction of training material and the initial 
state of the language model. (In a spelling 
bee, it is probably unfair to expect a 
contestant to come up with the "k" in "knight" 
if that word -- or a related word -- had never 
been seen in training.) This is the other side 
of the usual constraint between testing and 
training: that they be statistically independent, 
for a valid test. 
Violation of the "fair generalization" constraint 
between training set and test set does not 
make a test "invalid", or necessarily biased, but 
only inefficient and "unfair" if only the bottom 
line is paid attention to. 
Since some words are understandable even 
though one has not previously heard them, 
and polysemous words are not necessarily 
understandable after limited exposure, "words" 
are not the right unit to look at in deciding if 
a test case is implied by the training set. The 
real constraint is that all sound-to-meaning 
mappings that are required in order to answer 
the test question be learnable from the 
training material (assuming an initial "tabula 
rasa" language model). This points to idioms: 
sound-meaning pairs that are not predictable 
by general rule from a knowledge of the 
sounds and meanings of their constituent 
parts. Knowing the meaning of "time" and 
"table" does not make "time table" 
understandable. And morphemes (roughly 
non-complex words) qualify as idioms, since 
they have no sound-meaning constituent parts. 
New syntactic constructions mediating 
between sound and meaning would also make 
the sound-to-meaning mapping of a query 
unhandleable. 
Here are the qualitatively new elements that 
we found in the test material (including non- 
Class A) utterances : 
1. Morphemes: EARLY, EQUAL, ITS, 
LOCKHEED, NIGHT, \[STAYING\] OVER,, 
and PART. 
2. Words: \[U\] A'S, LEAVES, 
MEANINGS, MORNINGS, 
NINETEENTH, PRICES, SEATINGS, 
SERVICING, \[TO\] SERVICE, SPECIALS, 
STAYING \[OVER\], and THREE'S. 
3. Multi-word Idioms: TIME TABLE 
(only in Class X) 
With one exception, none of the five Class A 
queries with new morphemes were answered 
correctly. Query bp00kls, with "NIGHT' in it, 
was successfully answered by only the MIT 
system. Perhaps because "NIGHT' is in the 
knowledge database, it should have been 
counted as an "old" morpheme. 
Several of the ten queries with new complex 
words in the test set were answered correctly; 
primarily ones with new words that are 
regular morphological variants of other words 
that are in the training set (or the assumed 
pre-existing language model), e.g. "meanings", 
"times", or "nineteenth". 
A table showing the "new phenomena" subset 
of results is provided at the end of this paper 
(Table 2). 
It seems to us that a promising research topic 
would be further study of such training-test 
"fair generalization" or "learnability" 
constraints, with an eye to automating 
detection of their violation in the design of 
better tests. 
Preliminary Results 
Results were reported to NIST for a total of 
seven systems by June 19th: two systems from 
BBN, two from CMU and one each from MIT, 
SRI and Unisys. The system designated as 
cmu-spi Cspi" = > speech input) was the only 
one for which the input consisted of the 
speech waveform. For the other systems the 
input consisted of the SNOR transcriptions. 
Subsequently, reformatted results for three 
systems were accepted: "cmu-r", "cmu-spir", 
and "mit-r". 
116 
The C/S-format input provided for an answer 
of the form NO ANSWER to indicate that the 
system failed to provide an answer for any of 
several reasons (e.g., failure to recognize the 
words, failure to parse, failure to produce a 
valid database query, etc). Some sites made 
considerable use of this option, others (e.g., 
MIT) initially did not, partially due to 
miscommunication about the validity of this 
option. 
Some trivial fixes in the format used for 
submissions of results from some of the sites 
were made. One site initially omitted an 
answer for one of the queries, throwing 
subsequent REF-HYP alignment off; we 
inserted for them a NO,ANSWER response. 
Since there was miscommunication about the 
use of the "NOjkNSWER" response, we also 
changed one system's stock response meaning 
"system can't handle it" to "NO_ANSWER" for 
them, and allowed another site to submit 
revised results with "NOANSWER" in place of 
some of their responses. In the table of 
results, the revised systems are "cmu-r", "cmu- 
spir", and "mit-r". 
Responding to several complaints from sites 
about specific items in the test reference 
material, we corrected one reference answer 
(bd0071s) and changed the classification of 
three queries (bm0011s, bp0081s, and 
bw00sls) from Class A to Class X (in effect 
deleting these from the test set, reducing the 
test set size to 90 valid Class A queries). The 
classification disputes all centered on 
ambiguity, one of the hardest calls to make. 
If similar limitations on what is evaluable are 
made for the next round, we would like to 
have both an explicit principle for deciding 
when ambiguity is present and a procedure for 
adjudicating disputes agreed on early. The 
detailed results are given in Table l a for Class 
A queries with only lexical items that appear 
at least once in the training data, and in Table 
2a for Class A queries with "new" morphemes, 
words, or idioms. Table 3 presents a complete 
summary of the results for the entire 90 
sentence-utterance test set. 
Since the Class A test queries are not context- 
dependent, the ordering of these queries is not 
significant. As an aid in analysis, for the 
results presented in Tables la and 2a, queries 
have been (roughly) rank ordered in order of 
increasing apparent difficulty. Note that 
queries toward the top of both parts of the 
table resulted in more "T" answers than "F" or 
NA", while queries toward the bottom of the 
table resulted in more "F" and "NA" answers. 
Not surprisingly, there appears to be a general 
trend toward increasing apparent difficulty 
with increased length of the utterance 
(number of words). 
Table 3 shows that the number of correct 
answers from the various systems ranged from 
25 to 58. Note also that for the system for 
which speech waveform data was used as 
input, (cmu-spir), 35 of the queries were 
answered correctly. Comparing results from 
similar systems for the two subsets of the data 
(Tables lb and 2b), note that the ratios of the 
numbers of correctly recognized queries in the 
two subsets vary from 1.9 to 4.6, with better 
performance on the subset for which all 
lexical items occurred at least once in the 
training data, of course. 
Comparisons such as these are complicated, 
however, by the fact that different systems 
returned NO ANSWER for from 0 to 60 of the 
queries. Perhaps a more appropriate 
denominator to be used in computing the 
percentage of correct responses would have 
been the number of responses for which an 
answer was provided. 
Summary 
This paper has presented results of the first 
Spoken Language System tests conducted in 
the DARPA Air Travel Information System 
domain. 
117 
.S 
u~ 
I:I 
D 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
~ ~ ~ ~ ~ ~ ~ ~ ~ .~ ~ ~ .~ .~ .~ .~ ~ .~ .~ .~ .~ .~ .~ .~ .~ ~ ~ ~ ~ .~ ~ .~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ,~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Z Z Z ~ Z Z Z Z Z 
I I I I I t I I I I I I I I I I I I I I I I I I I 
~ .~.~ .~ .~ .~ .~ .~ .~ .~ .~ -~ .~ .~ .~ .~ .~ .~.~o 
m p~ ~p p p p p ~ ~NpNp pNp~ , 
o 
~ooo ~ ~ 
~.u m 
o ~ g ~ ~ ~ t~ tt ~ t ~ ~ o ~ ~ o o o~ ~ o 
~,~,~,~,~,~, , ,~,~, , ~, ,~o~o~o~o~o~o~ 
N ~ X ~IN ~ N ~ N 
,--I t ,-~ I ,--1 I ,-I I ,--~ I ,-~ I ,--I I ,--~ I ,-~ I ,--I I ,--~ I ,-I 1 ,-~ I ,--I I ,--I ,--I I ~--~ I 
.q o .q o ,.q o ,,q 0 .,q o .q o .Q o .Q o o,.q 
118 
~ ~ ~ ~ ~ .~ .~ ~ .~ ~ ~ ~ .~ ~ ~ .~ .~ .~ ~ .~ ~ .~ .~ ~ ~ ~ .~ .~ ~ .~ .~ ............. ....... 
Z Z Z ~ Z ~ Z Z ~ ~ ~ ~ Z Z Z Z Z Z Z Z ~ ~ Z Z Z Z ~ Z Z Z 
-C -~ .~ -~ -~ -~ -: -: -~ -~ -~ -C = -~ -~ -~ -: -~ -~ -C -~ -~ -~ -~ -~ -~ -~ .~ -~ -~ -~ 
~ ~ ~ ~ ~ ~ ~ e ~ ~ ~ ~ ~ ~ z ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ z 
k k k k k k ~n~k k k ~ k k k~k~k k k k ~ k~qk k k k *,4 ",4 Z ",~ -,4 -,-4 -~40-~4 ~-~ ",'4 ~ -,~ -,~ "'~ "'~ O "'-I O-~4 -,~ -~ ~-,-4 -,'4 -,-4 -,'4 -,~ O-~4 -~ ~ -,~ .,.4 .~ -,"4 ",4 -,~ -,"4 I 
mO~0Z ~ ~ mom ~O inca mZ ~ ~ m m ~z m~ ~0 m ~ ~o ,n m ~ ~o~ ~ m ~n 
~ ~ ~ ~~ ~~X:~~ ~ ~ ~ ~ ~.-, ~ ~ 8 ~ ~Z ~ Z ~Z ~ 
===== = = =~=~= =z= =~=o= =z=;=~= = = = =~= ===u= Z Z OE EZE OE~ ~ ~nEOE~E< E rnE EZE 
O H ~ ~ ~ O ~ ~ ~ Z Z 0 ~ 0 ~ ~ ~ ~ ,.a ~ ~ ~ ~ 0 
o o 
I ~ I ~ I ~I~ I I ~I~ I ~.~ I ~;~ I I ~.~ I ~ I ~ I ,~ I ~ I ~.~ I ~ I ;-~ I ~ I I~ I ~ I ~ I~I~ I I ~ =~.~I li-~ I~ I I I I ~ l 
~gz~,~g~oz~o~o~o~o~z~,~o~,~~ ~=z ~ z~~~o~ 
mMmXmX~MmMmM~MmXmX~MmX~MmX~MmM~XmXmX~X~M~M~MmM~M~M~H~MmXmMmM~X 
I ~--I I ~--I I ,-I I ,--I I ~--I I ,-I I ,-..I I ,-,I I ,-"I I ,--I t ~--I I ,--I I ,-4 1 f-~ I ~--I I ,-I I ,--I I ~--I I ,"i I ,-'I I ~-'I I ,-I I ,-"~ I ,--I I ,-.~ I m-I I ,-I l ,-~ I ,-I 
.~ 0 ~ 0 .~i 0 ,,~ 0 ~ 0 ,,~ 0 ,.~ 0 .Q 0 ~ 0 .,~I 0 .Q 0 ,,Q 0 .Q 0 ~ 0 ,.Ct 0 ,,0 0 ,.~ 0 .el 0 ~ 0 .Q 0 ,.~ 0 ~ 0 .QI 0.0 0 ,.el 0 ,.0 0 ,.~I 0 .~i 0 ,,~I 0 ,.CI 0 .~ 0 
ll9 
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ o o~ o o o o o o o o 
Z 
~q Z ~2 m 
Z~ Z~ Z Z Z ~ Z Z ~ Z Z ZZ ~Z~Z~Z Z Z Z Z 
~ ~ 8 ~ ~ ~ ~ ~ o 
4J - ~ 
• - - ~- O~ ....... ~ - ~- u~- - -~ ~..~-~ -~ - ~= ~ ~~ ~ ~ ~ ~ ~o~ ~z~8~ 
~ ~ ~ z~ ~z ~ ~z~z~ z 
"~ ~E .-.I ~ -~ -~ ~E~ -~ ~-~ ~ -~ -~ ~ -~ -~ 0 -~ -~ ~ -~ ~ -~ 0 "~ ~ -~ Z -~ ~-,.-t ~ -,-I ~ -~ I 
~ ~ ~ ~ ¢q ~ ~,-~; ~ Z ~ 0 ~ ~ ~ ~ ~ ~..~. ~ ,..~ ~ t~ ~ ~..~ ~ ~ ~ Z ~ 
E.~ ~ ~ Z 0 ~ ~ ~ u~ 
0 ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 0 ~ ~ O 
I ~ I 0 I I ~-,~1 ~ OI ~1 I I~1 ~1~-~ 
0 ,~ ,~ ~ 0 ~ Z ~ ~ 
~ ~ ~ ~ ~ ~ 0 
,~,~o ,~o ~,~o~ o ~z~o~,~o,~o,~o~o~=.~ ~ z ~. ~ ~ Z~Z ~Z ~Z~Z Z~O~Z Z~ 
I ~1~ I ,tT.~ I I :~ I I1~ I I ~1~ I I I ~,~ I ~,-.a I I I 1 I :~ I I ~, I ICI I 
z~z> zo~ z~z~zo~z~zozzz~z~z~z~zoz~zoz~z 
,.~ o Z.~ ~ ~ ,,q o ,q o,.~ O.Q 0 ,c~ o ,.~ O,,Q O,.Q ~,.t~ o,,~ o ~.Q o ,O O,.Q o ,.Q,-~ ,,Q o ~ o,Q o ,,~ ,J:~ o ~ ,,~,1 ~ ~ ,.q o ,Q o ,q o .Q o ,Q o .q o .Q o ,,Q o ,q o .Q o ~ ,,Q o ,,Q 0 ,~ o ,.O o .Q o ~ 0 ,.O o ,Q 
u 
N *' i 
I I 1 I I I ~ ~ ~ ~ ~ ~ ~,~-,d-,d 1 
m i 
,9, ~.,~ 
120 
8 ] 
! 
D 
Z 
m 
.~ ~ .~ ~ ~ ~ .~ ~ .~ ~ ~ .~ ~ .~ 
~ ~ Z Z Z Z~X Z X Z Z Z Z 
t~ 
m 
Z 
",-~ ~ -,-~ ~ -,-I ~ -~ ",'~ ~ -,"~ -,-~ .'4 ",~ ~ ",'~ ",'4 -~ -,-4 .-i ~Z ~X: 0.~ ~ ~:L~ ~.~ ~.~Z ~ ~.~ ~.~ ~X: ~ ~ ~.~ 
Z~ ~ ~,~ ~ ~0 ~ I Z I !:Z:~ I I~:~ I ~.~ 0 I ~ I Z I Z I I ~ I ~ ~,2 ~o ~,-.~ ~ 0 ~ wZ w 
' ' ' 
88~,8~8~8o8o 8~8~8~8 
z z~ 
~0~ ~ 
i ' ~ ~,~ ~ ~ ~ ~o ~ ~ ~ ~ ~o ~ ~ ~ 
~ #,'~ ~ ~E-~ ~ ~ = ~ ~,,:~ ~ Z ~ ¢,,,~ = I .q ,.~ ~1~ ,.~ ~ ,.q ,q t.9 ..q ,e,3 ,.q .q ~ ~'-~ ,,q I:) .q .q I 
#=~=~=~=~=~ I 
8~ ~Nz~z~z~zz~z~z 
o ,.~ ~ ..~ o m o ~ o,.~ ~ o m o ,~ o m ~ m 
i 
~ ~ i ~~ i! 
~ ~ ~ ~0~ 
II III I 
~~'~'~ ~'~1 
OOO 
Z 
0 ,--I e-I U'I ('q ~I~ O~ ~"-. ,-.I ,-.I 
m-- -- ~.l-J -i~ 
/ I I I I I ~ ~ ~ ~ ~,,U~-~-~ 0-,'~ I 
.~8~-~-~,~ °~,~= 
000 
-~ .-4 -,..~ 
121 
