1993 BENCHMARK TESTS 
FOR THE ARPA SPOKEN LANGUAGE PROGRAM 
David S. Pallett, Jonathan G. Fiscus, William M. Fisher, 
John S. Garofolo, Bruce A. Lund, and Mark A. Przybocki 
National Institute of Standards and Technology (NIST) 
Room A216, Building 225 (Technology) 
Gaithersburg, MD 20899 
ABSTRACT 
This paper reports results obtained in benchmark tests 
conducted within the ARPA Spoken Language program in 
November and December of 1993. In addition to ARPA 
contractors, participants included a number of "volunteers", 
including foreign participants from Canada, France, 
Germany, and the United Kingdom. The body of the paper 
is limited to an outline of the structure of the tests and 
presents highlights and discussion of selected results. 
Detailed tabulations of reported "official" results and 
additional explanatory text appear in the Appendix. 
1. INTRODUCTION 
Benchmark tests were implemented within the ARPA 
Human Language Technology research program during the 
period November 1993 - January 1994. As in tests conducted 
last year, the large-vocabulary continuous speech recognition 
technology tests made use of Wall Street Journal-based 
Continuous Speech Recognition (WSJ-CSR) corpus material 
which was collected at SRI International (SRI) under 
contract to the Linguistic Data Consortium (LDC). Spoken 
language understanding technology tests made use of ARPA 
Air Travel Information System (ATIS) material collected at 
several sites, processed at NIST, annotated at SRI, and 
provided to participating members of the LDC. 
2. WSJ-CSR TESTS 
2.1. New Conditions 
All sites participating in the WSJ-CSR tests were required to 
submit results for (at least) one of two "Hub" tests. The Hub 
tests were intended to measure basic speaker-independent 
performance on either a 64K-word (Hub 1) or 5K-word (Hub 
2) read-speech test set, and included required use of either 
a "standard" 20K trigram (Hub 1) or 5K bigram (Hub 2) 
grammar, and also required use of standard training sets. 
These requirements were intended to facilitate meaningful 
cross-site comparisons. 
The "Spoke" tests were intended to support a number of 
different challenges. 
Spokes 1, 3 and 4 supported problems in various types of 
adaptation: incremental supervised language model 
adaptation (Spoke 1), rapid enrollment speaker adaptation 
for "recognition outliers" (i.e., non-native speakers) (Spoke 
3), incremental speaker adaptation (Spoke 4). \[There were 
no participants in what had been planned as Spoke 2.\] 
Spokes 5 through 8 supported problems in noise and channel 
compensation: unsupervised channel compensation (Spoke 
5), "known microphone" adaptation for two different 
microphones (Spoke 6), unsupervised channel compensation 
for 2 different environments (Spoke 7), and use of a noise 
compensation algorithm with a known alternate microphone 
for data collected in environments when there is competing 
"calibrated" noise (radio talk shows or music) (Spoke 8). 
Spoke 9 included spontaneous "dictation-style" speech. 
Additional details are found in Kubala, et al. \[1\], on behalf 
of members of the ARPA Continuous Speech Recognition 
Corpus Coordinating Committee (CCCC). 
2.2. WSJ-CSR Summary Highlights 
The design of the "Hub and Spoke" test paradigm was such 
that opportunities abounded for informative contrasts (e.g., 
the use of bigram vs. trigram grammars, the 
enablement/disablement of supervised vs. unsupervised 
adaptation strategies, etc.). 
There were nine participating sites in the Hub 1 tests and 
five sites participating in the Hub 2 tests, and some sites 
reported results for more than one system or research team. 
The lowest word error rate in the Hub 1 baseline condition 
was achieved by the French CNRS-LIMSI group \[2,3\]. 
Application of statistical significance tests indicated that the 
performance differences between this system and a system 
developed by Cambridge University Engineering Department 
using the "HMM Toolkit" approach \[4-6\], were not 
significant. The Cambridge University HMM Toolkit 
approach also yielded excellent results for the smaller- 
vocabulary Hub 2 tests. The lowest word error rate for an 
ARPA contractor on the Hub 1 test data, for the C1 
condition permitting valid cross-site comparisons, was 
reported by the group at CMU \[7-9\]. The CMU results were 
not significantly different from the corresponding results for 
the Cambridge University HMM Toolkit system. The lowest 
word error rate for an ARPA contractor for the (less 
constrained) P0 condition was reported by the group at BBN. 
It is difficult to summarize results of the spoke tests, except 
to note that there were results reported for 8 different "spoke 
conditions", with from 1 to 3 participants and systems 
typically involved in each spoke. Details are presented in the 
Appendix. 
2.3. WSJ-CSR Discussion 
In NIST's analyses of the results, displays of the range of 
reported word error rates for each speaker across all systems 
are sometimes informative. These displays tend to draw 
attention to particularly problematic speakers or systems. 
Figure 1 shows data for the 10 speakers and 11 systems 
participating in the required Hub 1 C1 test. The speakers 
have been ordered from low error rate at the top of the 
figure to high error rate at the bottom. The length of the 
plotted line indicates the range in word error rate reported 
over all systems, and the one-standard-deviation points about 
the mean are indicated with a "+" symbol. 
Note that three speakers (40h, 40j, and 40f) have unusually 
high error rates relative to the other seven in this test set. 
In previous tests involving the Resource Management 
Corpus, it was noted that high error rates seemed to be 
correlated, at least indirectly, with unusually fast or slow rate 
of speech. To see if this was the case for the present test 
data, NIST obtained estimates of the average speaking rate 
(words/minute) for each of the test speakers. These estimates 
were based solely on the total number of words uttered and 
the total duration of the waveform files, and more 
sophisticated measures would be desirable. Figure 2 shows 
a plot of the word error rate vs. speaking rate for the 10 
speakers and 11 systems in the Hub 1 C1 test. 
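The crude words-per-minute estimate described here can be sketched as follows (the function name and the sample figures are illustrative, not drawn from the actual test data):

```python
def speaking_rate_wpm(num_words, duration_seconds):
    """Estimate average speaking rate in words/minute from the
    total number of words uttered and the total duration of the
    speaker's waveform files (the simple measure used here)."""
    return 60.0 * num_words / duration_seconds

# e.g., a speaker who utters 580 words across waveform files
# totalling 240 seconds averages 145 words/minute.
rate = speaking_rate_wpm(580, 240)
```

A more sophisticated measure would exclude inter-utterance silence from the duration, which is one reason these estimates are described as crude.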
This figure, like Figure 1, indicates that speakers 40h, 40j, 
and 40f not only have unusually high error rates relative to 
the other speakers in this test set, but also that their 
speaking rates are markedly higher than those of the other 
seven. Whereas the speaking rate for those seven speakers 
ranges from approximately 115 to 145 words/minute, for the 
three speakers with high error rates it ranges from 165 to 
175 words/minute. 
There are at least two factors that may contribute to higher 
error rates at these fast speaking rates: within-word and 
across-word coarticulatory effects (e.g., phone deletions) 
associated with fast (possibly better described as "careless" or 
"casual") speech, and possible under-representation of these 
effects in the training material. 
Chase, et al. \[9\], at CMU, noted that for the 4 speakers in 
Spoke 7 (40g, 40h, 40i, and 40j), two (40g and 40i) could be 
subjectively characterized as "careful speaker\[s\]", but that 40h 
was characterized as a "pretty fast speaker, \[with\] very low 
gain", and 40j as a "very, very fast speaker". These "fast 
speakers" appear in a number of the test sets. 
NIST's analyses of the distributions of rate of speech for two 
sets of training material for the Hub 1 test (each consisting 
of approximately 30,000 utterances: "short-term" and "long- 
term" speakers) indicate that the distributions are rather 
broad, with the short-term speakers' distribution peaking at 
130 words/minute, with a standard deviation of 30 
words/minute, and the long-term speakers' distribution 
peaking at 145 words/minute, with an associated standard 
deviation of 30 words/minute. Note that speaking rates for 
the 3 "fast-talking" speakers fall just outside the "plus one 
standard deviation region" range relative to the peak of the 
distribution for the "short-term speaker" training set, and just 
inside the corresponding region relative to the "long-term" 
training set. 
Because a number of the measured performance differences 
between systems were small, and the results of the paired- 
comparison significance tests validated the relevant null 
hypotheses, it has been observed that, in general, the use of 
larger test sets, especially for the Hub tests, would have been 
more informative, especially with regard to the results of 
significance tests requiring larger speaker populations (i.e., 
the Sign and Wilcoxon Signed-Rank tests). With larger 
populations of test speakers, it would be less likely to have 
such disproportionately large representation of "fast speakers" 
in the test sets. 
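The sign test named above can be illustrated with a minimal sketch (the per-speaker error rates below are invented for illustration):

```python
from math import comb

def sign_test_p(errors_a, errors_b):
    """Two-sided sign test on paired per-speaker error rates:
    under the null hypothesis that neither system is better,
    each non-tied pair favors system A with probability 1/2."""
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    wins_b = sum(b < a for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b                 # ties are discarded
    k = min(wins_a, wins_b)
    # P(at most k successes in n fair coin flips), doubled
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# With only 10 test speakers, even an 8-2 split is not
# significant at the 0.05 level (p ~ 0.11), which is one reason
# larger speaker populations would be more informative.
p = sign_test_p([12, 14, 9, 20, 25, 11, 13, 10, 33, 28],
                [13, 15, 11, 22, 24, 12, 15, 11, 30, 30])
```
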
Two spokes made use of microphones other than the 
"standard" Sennheiser close-talking microphone. (See, for 
example, the discussion in the Appendix of this paper for 
Spokes 5 and 6.) Two other spokes dealt with the issue of 
performance degradations that were presumably due to 
degradations in the signal-to-noise ratio. (See, for example, 
the discussion for Spokes 7 and 8.) 
For the test data of Spokes 5-7, subsequent to the 
completion of the tests, NIST performed signal-to-noise ratio 
(SNR) analyses, using three different bandwidth (signal pre- 
processing) conditions: broadband, A-weighted, and a 300 Hz - 
3000 Hz passband "telephone bandwidth". The filtered 
SNR's are generally higher than the broadband values. 
Figure 3 shows the results of these SNR analyses. 
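The measurement underlying these figures can be sketched in outline (a minimal illustration, not NIST's actual analysis code; the sample amplitudes are invented):

```python
from math import log10

def snr_db(speech_samples, noise_samples):
    """SNR in dB: ratio of mean power in speech-bearing samples
    to mean power in noise-only samples. For the A-weighted or
    telephone-bandwidth conditions, both segments would first be
    filtered, then the same power ratio taken."""
    power = lambda xs: sum(x * x for x in xs) / len(xs)
    return 10.0 * log10(power(speech_samples) / power(noise_samples))

# A signal whose amplitude is ~178x the noise floor has an SNR
# of about 45 dB, typical of the Sennheiser data reported here.
snr = snr_db([178.0, -178.0], [1.0, -1.0])
```

Because filtering removes out-of-band noise energy, the filtered SNRs come out higher than the broadband values, as the text notes.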
Figure 3 (a) indicates the SNRs measured for the data of 
Spoke 5, which includes 10 "unknown" microphones in 
addition to the simultaneously collected reference Sennheiser 
dose-talking microphone data for each data subset, collected 
in the normal data collection environment. SRrs "normal 
offices for recording" speech data have A-weighted sound 
level values in the 46.-48 dB range, There were 2 "tieelip" or 
lapel microphones, 5 stand-mounted microphones, a surface- 
effect microphone, a speakerphone, and a cordless telephone 
in this set of 10 test microphones. 
Note that the SNR values for the Sennheiser microphone are 
typically about 45 dB for both the broadband and A- 
weighted conditions, indicating that there is little low- 
frequency energy in the spectrum of the noise in the 
Sennheiser microphone data. Sennheiser microphone data 
typically yield values of 50 dB for the telephone-bandwidth 
condition. For the alternate microphones, the broadband 
SNR's range from about 23 dB (for the Audio-Technica 
stand-mounted microphone) to 45 dB (for the GE cordless 
telephone). With filtration the SNR's are higher, as 
expected. Note that nearly all of the microphones provide at 
least a 30 dB telephone-bandwidth SNR, and that the AT 
Pro 7a lapel-mounted microphone provides approximately 40 
dB. 
Figure 3 (b) indicates the measured SNR's for the data of 
Spoke 6, which includes 2 "known" alternate microphones in 
addition to the reference Sennheiser close-talking 
microphone, collected in the normal data collection 
environment. For the Sennheiser close-talking microphone, 
the broadband SNR's are, as for Spoke 5, 45-46 dB. There 
is a substantial difference between the broadband and A- 
weighted SNRs for the Audio-Technica stand-mounted 
microphone, corresponding to low frequency noise picked up 
by this microphone, and for the telephone-bandwidth 
condition the SNR is approximately 35 dB. With the 
telephone handset, SNRs are 38 to 40 dB, depending on 
bandwidth. 
The test set data for Spoke 7, shown in Figure 3 (c), involved 
use of two different microphones (an Audio-Technica stand- 
mounted microphone and a telephone handset in addition to 
the usual "reference" Sennheiser close-talking microphone), 
in two different noise environments, with background A- 
weighted noise levels of 58-68 dB. 
In the quieter of the two "noisy" environments, a computer 
laboratory with a reported A-weighted sound level in the 58- 
59 dB range, the broadband SNR was approximately 34-36 
dB for the Sennheiser microphone, and 35 dB for the 
telephone handset data, but only 17 dB for the Audio- 
Technica microphone. Spectral analyses of the Audio- 
Technica background noise data demonstrate the presence of 
significant low frequency energy as well as the presence of 
harmonic components with an approximately 70 Hz 
fundamental. These components may have originated in some 
rotating machinery (e.g., a cooling fan or disc drive). 
In the noisier environment, a room containing machinery 
with conveyor belts for sorting packages, with a reported A- 
weighted sound level in the 62-68 dB range, the broadband 
SNR for the Sennheiser data degraded to 27-29 dB (a 
decrease of approximately 7 dB), and to 27 dB for the 
telephone handset data, while the Audio-Technica degraded 
to 16 dB (a decrease of only 1 dB). With A-weighting, in the quieter 
environment, the SNR for the Sennheiser improved very 
slightly (less than 1 dB, relative to the broad band values), 
and for the Audio-Technica it was 25 dB, 8 dB higher than 
the broad band value. 
In the noisier environment, the A-weighted SNR for the 
Sennheiser data was approximately 29 dB, and for the Audio- 
Technica approximately 20 dB. 
For the telephone handset data, both the telephone- 
bandwidth-filtered and the A-weighted SNRs were higher 
than, but typically within one or two dB of, the unweighted 
values, as might be expected. 
In summary, for the quieter of the two environments used in 
collecting the data of Spoke 7, none of the data subsets in 
Spoke 7 had an average filtered SNR worse than about 25 
dB, and in the noisier environment, the worst average filtered 
SNR for any data subset was approximately 20 dB. These 
SNR values would not ordinarily be regarded as indicative of 
severe noise-degradation. 
Spoke 8 involved data collected in the presence of competing 
noise -- music and talk radio broadcasts. For the case of 
competing music, the broadband SNR for the reference 
Sennheiser microphone ranged from 44 dB for the so-called 
"20 dB" condition, to 36 dB for the "10 dB" condition, and 29 
dB for the "0 dB" condition. For the Audio-Technica 
microphone, corresponding measured values were 25, 17, and 
11 dB. NIST's measurements of SNR for the data containing 
competing speech were inconclusive because of the difficulty 
of distinguishing between the spoken test material and the 
competing talk radio. 
3. ATIS TESTS 
3.1. New Conditions 
Recent ATIS tests were similar in many respects to previous 
ATIS tests -- the primary difference consisting of expansion 
of the size of the relational air-travel-information database to 
46 cities, and use of a body of newly collected and annotated 
data using this relational database \[10\]. As in prior years, 
tests included spontaneous speech recognition (SPREC) 
tests, natural language understanding (NL) tests, and spoken 
language understanding (SLS) tests. For the first time, data 
collected at NIST was included in the test and training data. 
The NIST data was collected using systems provided to NIST 
by BBN and SRI. 
In previous years, results for NL and SLS tests were 
presented and discussed in terms of a "weighted error" 
percentage, which was computed as twice the percentage of 
incorrect answers plus the percentage of "No Answer" 
responses. The decision to weight 'kvrong answers" twice as 
heavily as "no answer" responses was reconsidered within the 
past year by the ARPA Program Manager, and this year only 
unweighted NL and SLS errors are reported (i.e., incorrect 
answers count the same as "No Answer n responses). For 
most system developers, this change of policy has appeared 
to result in changed strategies for system responses, so that 
in this year's reported results, little use was made of the "No 
Answer" response. 
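The old and new scoring rules can be stated compactly (a sketch; the percentages in the example are invented):

```python
def weighted_error(pct_incorrect, pct_no_answer):
    """Previous years' metric: wrong answers count double."""
    return 2.0 * pct_incorrect + pct_no_answer

def unweighted_error(pct_incorrect, pct_no_answer):
    """This year's metric: a wrong answer and a "No Answer"
    response are penalized equally."""
    return pct_incorrect + pct_no_answer

# Under the old metric a system could lower its score by
# answering "No Answer" when unsure; under the new metric
# there is no such incentive.
old = weighted_error(8.0, 4.0)    # 20.0
new = unweighted_error(8.0, 4.0)  # 12.0
```
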
3.2. Summary Highlights 
For the recent ATIS tests, results were reported for systems 
at seven sites. Lowest error rates were reported by the group 
at CMU \[11\]. The magnitude of the differences between 
systems is frequently small, and the significance of these 
small differences is not known. 
As in previous years, error rates for "volunteers" are generally 
higher than for ARPA contractors, possibly reflecting a lesser 
level-of-effort. 
Additional details about the test paradigm, and comments on 
some aspects by individual participants, are found in another 
paper in this Proceedings, by Dahl, et al., on behalf of 
members of the ARPA Multi-site ATIS Data Collection 
Working (MADCOW) Group \[10\]. Details about the 
technical approaches used by the participants, and their own 
analyses and comments, are to be found in references \[11,23- 
28\]. 
3.3. ATIS Discussion 
This year, 46% of the utterances were classified as Class A 
and 34% as Class D, so that 80% of the test utterances were 
"answerable" (i.e., Class A or D). Last year's test set had 
about the same percentage of Class A queries (43%), but 
somewhat fewer classified as Class D (i.e., 25%), so that last 
year only 67% were answerable. One possible reason for this 
change (other than the test-set-to-test-set fluctuations) may 
be that the Principles of Interpretation document is 
continually being extended to cover phenomena that would 
otherwise have resulted in categorization of some queries as 
"unanswerable", and therefore Class X. 
For text input (NL test), for last year's test material, the 
lowest unweighted NL error rate was 6.5% for the Class 
A+D subset, 6.5% for Class A, and 6.4% for Class D, in 
contrast with this year's corresponding figures of 9.3%, 6.0% 
and 13.8%. Note that this year's test set apparently had 
"more difficult" Class D queries, and that a larger fraction of 
the queries was classified as Class D than last year (34% vs. 
25%). 
For speech input (SLS test), and for last year's test material, 
the unweighted SLS error rate was 11.0% for 
the Class A+D subset, 10.2% for Class A, and 12.5% for 
Class D, in contrast with this year's corresponding figures of 
13.2%, 8.9% and 17.5%. 
Note that while the lowest error rate for Class A queries is 
smaller this year (i.e., 8.9% vs. 10.2%), this year's best Class 
D error rate was substantially higher than last year's. It may 
be the case that this is related to the extended coverage 
provided by the current Principles of Interpretation 
document, so that queries that in previous years would have 
been classified as unanswerable, are now judged to be 
answerable, although context-dependent. 
4. ACKNOWLEDGEMENTS 
The "Hub and Spokes" Test paradigm could not have been 
developed, specified, or implemented without the tireless and 
effective efforts of Francis Kubala, as Chair of the ARPA 
Continuous Speech Recognition Corpus Coordinating 
Committee (CCCC). The tests would also not 
have been possible without the dedicated efforts of Denise 
Danielson and her colleagues at SRI in collecting an 
exceptionally large and varied amount of CSR data for CSR 
system training and test purposes. In the ATIS community, 
Debbie Dahl served as Chair of the MADCOW group, and 
it is to her credit that new data was collected at several sites 
with the 46-city relational database and that participating 
sites reached agreement on the details of the current tests. 
Kate Hunicke-Smith and her colleagues at SRI International 
were again responsible for annotation of ATIS data and for 
assisting NIST in the adjudication process following 
preliminary scoring. It is a pleasure to acknowledge Kate's 
thoughtful and cheerful interactions with our group at NIST. 
As in previous years, the cooperation of many participants in 
the ARPA data and test infrastructure -- typically several 
individuals at each site -- is gratefully acknowledged. 
REFERENCES 
\[1\] Kubala, F., et al., "The Hub and Spoke Paradigm for CSR 
Evaluation", in Proceedings of the Human Language 
Technology Workshop, March 1994 (Weinstein, C.J., ed.). 
\[2\] Gauvain, J.L., Lamel, L.F., Adda, G. and Adda-Decker, 
M., "The LIMSI Continuous Speech Dictation System: 
Evaluation on the ARPA Wall Street Journal Task", in 
Proceedings of ICASSP'94. 
\[3\] Gauvain, J.L., Lamel, L.F., Adda, G. and Adda-Decker, 
M., "The LIMSI Continuous Speech Dictation System", in 
Proceedings of the Human Language Technology Workshop, 
March 1994 (Weinstein, C.J., ed.). 
\[4\] Woodland, P.C., Odell, J.J., Valtchev, V. and Young, S.J., 
"Large Vocabulary Continuous Speech Recognition Using 
HTK", in Proceedings of ICASSP'94. 
\[5\] Odell, J.J., Woodland, P.C., and Young, S.J., "Tree-based 
State Tying for High Accuracy Acoustic Modelling", in 
Proceedings of the Human Language Technology Workshop, 
March 1994, (Weinstein, C.J., ed.). 
\[6\] Odell, J.J., Valtchev, V., Woodland, P.C., and Young, 
S.J., "A One Pass Decoder Design for Large Vocabulary 
Recognition," in Proceedings of the Human Language 
Technology Workshop, March 1994 (Weinstein, C.J., ed.). 
\[7\] Hwang, M., Thayer, E. and Huang, X., "Semi-continuous 
HMMs with Phone-Dependent VQ Codebooks for 
Continuous Speech Recognition", in Proceedings of 
ICASSP'94. 
\[8\] Hwang, M., et al., "Improving Speech Recognition 
Performance via Phone-Dependent VQ codebooks and 
Adaptive Language Models in SPHINX-II", in Proceedings of 
ICASSP'94. 
\[9\] Hwang, M., Thayer, E., Mosur, R. and Chase, L., "Phone- 
Dependent Codebooks and Multiple Speaker Clusters in 
SPHINX-II", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[10\] Dahl, D., et al., "Expanding the Scope of the ATIS Task: 
The ATIS-3 Corpus", in Proceedings of the Human Language 
Technology Workshop, March 1994 (Weinstein, C.J., ed.). 
\[11\] (a) Ward, W. and Issar, S., "Recent Improvements in the 
CMU Spoken Language Understanding System", in 
Proceedings of the Human Language Technology Workshop, 
March 1994 (Weinstein, C.J., ed.), and (b) Issar, S., and Ward, 
W., "Flexible Parsing: CMU's Approach to Spoken Language 
Understanding", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[12\] Garofolo, J., Robinson, T. and Fiscus, J., "The 
Development of File Formats for Very Large Speech 
Corpora: SPHERE and Shorten", in Proceedings of 
ICASSP'94. 
\[13\] (a) Zavaliagkos, G., et al., "BBN Hub System and 
Results", (b) Lapre, C., et al., "Speaker Adaptation for Non- 
Native Speakers", (c) Anastasakos, A., et al., "Environmental 
Robustness: Adaptation to Known Alternate Microphones", 
and (d) Nguyen, L., et al., "Spoke 9: Spontaneous WSJ 
Dictation", Oral Presentations at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[14\] Ostendorf, M., et al., "Stochastic Segment Modelling for 
Continuous Speech Recognition: Wall Street Journal 
Benchmark Report", Oral Presentation at the Spoken 
Language Technology Workshop, March 6-8, 1994, Princeton, 
NJ. 
\[15\] (a) Scattone, F., et al., "Dragon's Large Vocabulary 
Speech Recognition System", (b) Orloff, J., et al., "Spoke S4: 
Speaker Adaptation", and (c) Orloff, J., et al., "Spoke S6: 
Microphone Adaptation", Oral Presentations at the Spoken 
Language Technology Workshop, March 6-8,1994, Princeton, 
NJ. 
\[16\] Morgan, N., et al., "Scaling a Hybrid HMM/MLP System 
for Large Vocabulary CSR", Oral Presentation at the Spoken 
Language Technology Workshop, March 6-8,1994, Princeton, 
NJ. 
\[17\] (a) Paul, D.B., "The Lincoln Large Vocabulary Stack- 
Decoder Based HMM CSR", in Proceedings of the Human 
Language Technology Workshop, March 1994 (Weinstein, 
C.J., ed.), and (b) Paul, D.B., "The Lincoln Large Vocabulary 
Stack-Decoder Based HMM CSR: Spoke S4 Incremental 
Speaker Adaptation", Oral Presentation at the Spoken 
Language Technology Workshop, March 6-8,1994, Princeton, 
NJ. 
\[18\] (a) Robinson, T., Hochberg, M. and Renals, S., "IPA: 
Improved Phone Modelling with Recurrent Neural 
Networks", in Proceedings of ICASSP'94, and (b) Hochberg, 
M., Robinson, T., and Renals, S., "ABBOT: The CUED Hybrid 
Connectionist-HMM Large-Vocabulary Recognition System", 
Oral Presentation at the Spoken Language Technology 
Workshop, March 6-8, 1994, Princeton, NJ. 
\[19\] Aubert, X., et al., "The Philips Large Vocabulary CSR 
System", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[20\] Aubert, X., Dugast, C., Ney, H. and Steinbiss, V., "Large 
Vocabulary Continuous Speech Recognition of Wall Street 
Journal Data", in Proceedings of ICASSP'94. 
\[21\] (a) Rosenfeld, R., "A Hybrid Approach to Adaptive 
Statistical Language Modelling", in Proceedings of the 
Human Language Technology Workshop, March 1994 
(Weinstein, C.J., ed.), and (b) Chase, L., Mosur, R., and 
Rosenfeld, R., "Language Model Adaptation in the CSR 
Evaluation", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[22\] (a) Liu, F.H., Moreno, P.J., Stern, R.M., and Acero, A., 
"Signal Processing for Robust Speech Recognition", in 
Proceedings of the Human Language Technology Workshop, 
March 1994 (Weinstein, C.J., ed.) and (b) Stern, R.M., Liu, 
F.H., and Moreno, P., "Robust Speech Recognition: 
Research at CMU", Oral Presentation at the Spoken 
Language Technology Workshop, March 6-8,1994, Princeton, 
NJ. 
\[23\] Bocchieri, E., "The AT&T ATIS System: March 94 
Report", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[24\] (a) Stallard, D., et al., "Recent Work in Spoken 
Language Understanding in the BBN SLS Project", and (b) 
Miller, S., et al., "Statistical Language Processing Using 
Hidden Understanding Models", Oral Presentations at the 
Spoken Language Technology Workshop, March 6-8, 1994, 
Princeton, NJ. 
\[25\] Normandin, Y., "CRIM's December 1993 ATIS System", 
Oral Presentation at the Spoken Language Technology 
Workshop, March 6-8, 1994, Princeton, NJ. 
\[26\] "The MIT ATIS System: March 1994 Progress Report", 
Oral Presentation at the Spoken Language Technology 
Workshop, March 6-8, 1994, Princeton, NJ. 
\[27\] Moore, R. and Cohen, M. et al., "SRI's Recent Progress 
on the ATIS Task", Oral Presentation at the Spoken 
Language Technology Workshop, March 6-8,1994, Princeton, 
NJ. 
\[28\] Dahl, D., Linebarger, M., Nguyen, N. and Norton, L., 
"Unisys Activities in Spoken Language Understanding", Oral 
Presentation at the Spoken Language Technology Workshop, 
March 6-8, 1994, Princeton, NJ. 
\[29\] Digalakis, V., et al., "SRI November 1993 CSR Hub 
Evaluation", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
\[30\] Weintraub, M., et al., "SRI November 1993 CSR Spoke 
Evaluation", Oral Presentation at the Spoken Language 
Technology Workshop, March 6-8, 1994, Princeton, NJ. 
NOTICE 
Throughout this paper, a number of references are provided 
in order to refer readers to relevant papers and oral 
presentations by researchers at the individual sites 
participating in the tests. In some of these papers, results are 
cited that differ by small amounts from those tabulated in 
this paper. In some cases the authors cite unofficial or 
preliminary, "pre-adjudication" results. In other cases, the 
authors cite other unofficial test results conducted after the 
"official" test period closed. 
The views expressed in this paper are those of the author(s). 
The results presented are for local, system-developer- 
implemented tests. NIST's role in the tests is one of 
selecting and distributing the test materials, implementing 
scoring software, and uniformly tabulating the results of the 
tests. The views of the author(s) and these results are not to 
be construed or represented as endorsements of any systems 
or official findings on the part of NIST, ARPA or the U.S. 
Government. 
APPENDIX: 
"BENCHMARK TEST RESULTS" 
A.1. WSJ-CSR November 1993 Test Material 
The 1993 WSJ-CSR tests make use of newly-collected 
training material, a new compressed waveform file format, 
new test paradigms, and new test sets. 
The new training material for the WSJ-CSR task includes a 
substantial amount of data (31 CD-ROMs containing training 
and developmental test data) collected at SRI International 
under contract to the Linguistic Data Consortium (LDC). 
In a collaborative effort involving NIST, Tony Robinson at 
Cambridge University's Engineering Department, and the 
LDC, the newly collected waveform data was processed with 
an "embedded" version (i.e., the file's SPHERE-format 
header is uncompressed, but the bulk of the file is 
compressed) of a lossless waveform compression algorithm 
("shorten") using the NIST SPHERE file header convention, 
to reduce the storage requirements for this data by 
approximately 50% \[12\]. The CSR test material was 
released in November. 
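The "embedded" layout can be illustrated with a small sketch (zlib stands in here for the actual "shorten" codec, which exploits waveform structure and is not reproduced; the 1024-byte header size is the usual SPHERE convention but is an assumption of this sketch):

```python
import zlib

HEADER_SIZE = 1024  # conventional fixed SPHERE header size

def compress_embedded(raw):
    """Embedded layout: the SPHERE-format header stays
    uncompressed and directly readable, while the bulk of the
    file (the waveform body) is losslessly compressed."""
    header, body = raw[:HEADER_SIZE], raw[HEADER_SIZE:]
    return header + zlib.compress(body)

def decompress_embedded(data):
    header, body = data[:HEADER_SIZE], data[HEADER_SIZE:]
    return header + zlib.decompress(body)

# Round-trip check: the compression is lossless.
raw = b"NIST_1A".ljust(HEADER_SIZE, b"\x00") + bytes(range(256)) * 64
restored = decompress_embedded(compress_embedded(raw))
```
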
A.2. WSJ-CSR Test Scoring and Adjudication 
The CSR tests were conducted in November and December. 
Test and scoring protocols were similar to last year. 
However, new to the CSR benchmark tests this year was the 
addition of an official adjudication period. Following a 
preliminary scoring of recognition results, sites participating 
in the tests were permitted to submit requests for 
adjudication to NIST. Adjudication requests in the CSR 
domain contained requests for transcription modifications 
due to transcription errors, alternative transcriptions, etc. 
A total of 22 bug reports were received from 6 sites. The 
bug reports contained requests for changes to 199 (151 
unique) utterance transcriptions in all WSJ-CSR test sets. 
The NIST adjudicators carefully evaluated each request and 
ultimately revised transcriptions of 83 utterances (55% of the 
ones in question). 
Of the transcriptions that were revised, most were the result 
of judgements by the adjudicators that the transcriptions 
contained words which could have multiple orthographic 
representations (e.g., compound words, variant orthographic 
representations, etc.) or which were lexically ambiguous. In 
many of these cases, both the original transcription and an 
alternative transcription were permitted. This was 
implemented by mapping alternate word forms to a single 
form in both the transcriptions and the recognized strings. 
The remaining revisions were the correction of simple 
transcription errors. 
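The mapping of alternate word forms can be sketched as follows (the table entries are hypothetical; the actual equivalences were decided case-by-case by the adjudicators):

```python
# Hypothetical equivalence table mapping variant orthographic
# forms to a single canonical form.
EQUIVALENT_FORMS = {
    "data-base": "database",
    "hard-ware": "hardware",
}

def normalize(words):
    """Map alternate word forms to a single form. Applied
    identically to reference transcriptions and recognized
    strings, so either variant scores as correct."""
    return " ".join(EQUIVALENT_FORMS.get(w, w) for w in words)

ref = normalize("the data-base query".split())
hyp = normalize("the database query".split())
# After normalization the variant no longer counts as an error.
```
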
A.3. WSJ-CSR Test Participants 
United States participants in the WSJ-CSR tests included: 
BBN Systems and Technologies (BBN) \[13\], Boston 
University (BU) \[14\], Carnegie Mellon University (CMU) \[7- 
9\], Dragon Systems \[15\], the International Computer Science 
Institute (ICSI) at Berkeley \[16\], Massachusetts Institute of 
Technology's Lincoln Laboratory (MIT/LL) \[17\], and SRI 
International (SRI) \[29,30\]. 
Foreign participants included two British groups at 
Cambridge University's Engineering Department, one 
pursuing connectionist approaches (CU-CON) \[18\], and 
another, developers of the HMM Toolkit (CU-HTK) \[4-6\], 
a French group at CNRS-LIMSI (LIMSI) \[2,3\], and a 
German group at the Philips GmbH Research Laboratories 
in Aachen \[20\]. 
BU collaborated with BBN, making use of the N-best outputs 
of a BBN system, using an N-best rescoring formalism, a 
stochastic segment modelling approach, and the use of both 
BU and BBN knowledge sources. 
A.4. WSJ-CSR Benchmark Test Results 
A.4.1. Hub 1: 64K Baseline. The intention of the two "Hub" 
tests was "to improve basic \[speaker independent\] 
performance on clean \[read speech\] data". For Hub 1, test 
data consisted of 200 utterances -- 20 from each of 10 
speakers, using the primary (Sennheiser series HMD 410) 
microphone as used in prior tests. 
All sites were required to provide results for a static (i.e.,
non-adaptive) Speaker-Independent (SI) baseline system
permitting cross-site comparisons, using the standard
20K-word trigram "open vocabulary" grammar and
standardized training sets.
The results of that baseline system are tabulated in the
column labelled "Contrast C1" in Table 1.
Results for (optional) use of the same system training, but 
with the 20K bigram grammar, are shown in the column 
labelled "Contrast C2". These 'contrastive' results were
intended for comparison with results for optional 'primary'
systems. The primary systems could use "any grammar or
acoustic training", and these results are shown in the column 
labelled "P0". 
In most cases, data from each site appear on a single line.
The three BU "C1" systems each represent different N-best 
rescoring formalisms using the BU stochastic segment model
recognition system in combination with the BBN Byblos 
system, using different knowledge sources to re-rank the N- 
best hypotheses. The two different CMU systems are 
different in many ways, so that comparisons are non-trivial. 
For the baseline "C1" systems, word error rates ranged from
19.0% to 11.7%, with the lowest error rate reported for the
LIMSI system.
In this table, and others of this sort in this paper, the results 
of contrastive comparisons are shown in the boxes labelled 
"COMPARISONS AND SIGNIFICANCE TESTS". The 
results of use of the NIST statistical significance tests that 
have been used in previous tests are also shown. 
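One of those significance tests, the sign test over matched speaker subsets, can be sketched as follows (stdlib only; the per-speaker error counts below are invented for illustration):

```python
from math import comb

def sign_test(errors_a, errors_b):
    """Two-sided sign test: do systems A and B differ when each speaker's
    test subset is scored by both?  Tied pairs are discarded, as usual."""
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    wins_b = sum(b < a for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Two-sided p-value under the null hypothesis P(A beats B) = 0.5.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical per-speaker word-error counts for two systems (10 speakers).
a = [12, 9, 15, 7, 11, 13, 8, 10, 14, 9]
b = [15, 11, 16, 9, 14, 13, 10, 12, 17, 11]
print(round(sign_test(a, b), 4))  # → 0.0039
```

A p-value below the chosen threshold corresponds to the table entries that name a winning system rather than "same".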
To illustrate interpretation of some of the tabulated results, 
note that BBN and MIT/LL achieved reductions in error rate 
of 13.9% and 9.8%, respectively, for their P0 systems when 
compared to the C1 baseline systems. In most cases, these 
reductions were shown to be significant. Refer to \[13\] and
\[17\] for discussion of factors contributing to these
reductions in error rate.
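The reductions quoted throughout this appendix are relative, not absolute; a minimal sketch of the computation, checked against the Spoke 3 figures reported later in the text (32.0% with adaptation disabled, 14.5% enabled):

```python
def relative_reduction(baseline_wer, new_wer):
    """Percent reduction in word error rate, relative to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Spoke 3 figures from the text: 32.0% disabled vs. 14.5% enabled.
print(round(relative_reduction(32.0, 14.5), 1))  # → 54.7
```

This reproduces the 54.7% reduction reported for the Spoke 3 P0:C1 contrast.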
When contrasting use of trigram and bigram grammars, a
number of sites achieved reductions in error rate of
approximately 12% to 23% when using the trigram
grammar.
Table 2 shows a matrix tabulation of the results of cross-site 
and, in some cases, within-site, paired comparison statistical
significance tests for the baseline H1-C1 systems. 
A.4.2. Hub 2: 5K Baseline. Because run times for full 20K 
systems were in some cases regarded as prohibitive, a second 
baseline Hub test, requiring only a 5K lexicon, was permitted. 
For Hub 2, the required static SI baseline C1 system made 
use of a standard 5K bigram closed vocabulary grammar and
either of two smaller training sets, consisting of 
approximately 7200 sentence utterances. 
As for Hub 1, the Hub 2 test data consisted of 200 utterances 
-- 20 from each of 10 speakers, using the primary 
microphone. 
Not surprisingly, error rates for the 5K systems were lower 
than for the 20K systems. 
Table 3 shows that for the baseline C1 systems, error rates 
ranged from 17.7% to 8.7%, with the lowest error rate 
reported by the Cambridge University's HTK research group 
\[4-6\]. For the P0 systems, for which "any grammar or acoustic 
training" were permissible, lower error rates were to be 
expected, and were achieved, typically with reductions in 
error rate of from 25% to almost 50%. In this case, also, 
one of the HTK configurations achieved the lowest word 
error rate: 4.9%. 
Table 4 shows a matrix tabulation of the results of cross-site 
and, in some cases, within-site, paired comparison statistical
significance tests for the baseline H2-C1 systems. 
A.4.3. Spoke 1: Language Model Adaptation. The stated 
goal for this language model adaptation spoke was "to 
evaluate an incremental supervised language model (LM) 
adaptation algorithm on a problem of sublanguage 
adaptation". The sole participant was Rosenfeld et al. at 
CMU \[21\]. Test data consisted of read speech data from 
four speakers, each reading 1 to 5 articles consisting of 
approximately 20-25 sentence utterances, with the Sennheiser 
microphone. NIST's scoring was done on four successive 5- 
sentence utterance blocks throughout the articles (i.e., 
utterances 1-5, 6-10, 11-15, and 16+). Use of the statistical 
significance tests was not thought to be appropriate since 
these tests assume independence of errors across sentences, 
and this assumption is probably not valid when using an 
adaptive language model. 
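The block scoring described above can be sketched as grouping per-utterance error and word counts into successive blocks, with the last block absorbing the remainder. The per-utterance counts below are invented for illustration:

```python
def block_error_rates(errors, words, block_size=5, n_blocks=4):
    """Word error rate (%) per successive block of utterances; the final
    block absorbs the remainder (NIST's blocks 1-5, 6-10, 11-15, 16+)."""
    rates = []
    for b in range(n_blocks):
        start = b * block_size
        end = start + block_size if b < n_blocks - 1 else len(errors)
        rates.append(100.0 * sum(errors[start:end]) / sum(words[start:end]))
    return rates

# Hypothetical per-utterance error and word counts for one 18-utterance article.
errs  = [3, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1]
words = [20] * 18
print([round(r, 1) for r in block_error_rates(errs, words)])  # → [10.0, 7.0, 5.0, 3.3]
```

A downward trend across blocks, as in this invented example, is the signature an adaptive language model is expected to produce.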
Table 5 presents the results for Spoke 1. The column labelled 
P0 shows results with incremental unsupervised adaptation
enabled: word error rates vary from 16.5% on the first block 
of 5 sentences to 18.2% on the last block. In contrast, with 
language model adaptation disabled, the word error rates 
correspondingly vary from 20.5% to 21.1%. Comparisons 
between P0 and C1, involving enabling/disabling of
supervised LM adaptation, indicate reductions in word error
rate of between 9.8% and 19.4%, with lesser reductions for
the P0:C2 comparisons involving unsupervised LM adaptation.
A.4.4. Spoke 3: SI Recognition Outliers. The stated goal
for this spoke was "to evaluate a rapid enrollment speaker 
adaptation algorithm on difficult speakers (e.g., non-native 
speakers of American English)". The sole participant was 
BBN \[13\]. Test data consisted of read speech from ten 
speakers, each reading 40 sentence utterances, with the 
Sennheiser microphone. For each speaker, the 40 "rapid 
enrollment" utterances were available for use with the "rapid 
enrollment" speaker adaptation. 
Table 6 presents the results for Spoke 3. The column labelled 
P0 shows results with rapid enrollment adaptation enabled: 
word error rate for the 400 utterance test set is 14.5%. In 
contrast, with adaptation disabled, the word error rate is
32.0%. The P0:C1 contrast thus indicates a reduction in
error rate of 54.7%, which was shown to be significant using
all of the significance tests applied by NIST. 
A.4.5. Spoke 4: Incremental Speaker Adaptation. The stated
goal for this spoke was "to evaluate an incremental speaker 
adaptation algorithm". Two sites participated: Dragon \[15\] 
and MIT/LL \[17\]. In this spoke, there were only four test 
speakers, with 100 sentence utterances for each. NIST's
scoring was done on four successive 25-sentence utterance 
blocks (i.e., utterances 1-25, 26-50, 51-75, and 76+). 
Table 7 presents the results for Spoke 4. 
For the Dragon results, word error rates for the P0 condition 
(with incremental unsupervised adaptation enabled) range 
from 15.5% to 14.3%. For MIT/LL, the corresponding 
variation is 10.9% to 11.1%. There is evidence of significant 
reductions in error of the order of 20% to 30% for the P0:C1 
contrasts for the Dragon results (e.g., note the reduction of 
from 19.4% to 15.5% for the first block of 25 utterances). 
For the corresponding MIT/LL results, the magnitudes of the 
reductions are not as large. For both sites, the incremental 
changes in error rates between the P0 and C2 cases, involving
unsupervised/supervised adaptation, in most cases are not
shown to be significant, and range from approximately 4% to 
16%. 
A.4.6. Spoke 5: "Microphone Independence". The stated 
goal of this spoke was to "evaluate an unsupervised channel 
compensation algorithm". The different "channels" in this 
case were different microphones -- each of the ten speakers
in this test set used a different (unknown) microphone. 
Similar, but not identical, microphones had been 
incorporated in training and development material. For the 
200 utterances in each portion of this test set, both the 
unknown microphone data (in "wv2" data files) and 
corresponding Sennheiser microphone data (in "wv1" files)
were available. 
Both CMU \[22\] and SRI \[30\] participated in this spoke. 
Table 8 presents the results for Spoke 5. 
With unsupervised channel compensation enabled, the CMU 
system achieved an error rate of 15.1%, in contrast to 20.9% 
with compensation disabled -- a 27.8% reduction in word 
error rate. SRI achieved a comparable reduction of 24.2%,
with slightly lower error rates. With compensation
enabled, the CMU system achieved 9.7% word error for the 
corresponding Sennheiser data, while the SRI system 
achieved 6.6% word error. Enabling/disabling the channel 
compensation made essentially no difference for the case of 
the Sennheiser data subset, as might be suspected. 
A.4.7. Spoke 6: Known Alternate Microphones. The stated 
goal of this spoke was to "evaluate a known microphone 
adaptation algorithm". There were two different microphones 
-- an Audio-Technica stand-mounted microphone, and a
telephone handset which was to be connected to the data 
collection apparatus "over external lines", in addition to the 
Sennheiser (wv1) data. Two-channel microphone adaptation
data -- for each of the two microphones and the (reference)
Sennheiser microphone -- was provided from "devtest data".
There were ten speakers for the data for each of the two 
microphones, with 20 sentence utterances per speaker. In 
NIST's analysis of the results, data are separately tabulated 
for the Audio-Technica (at) data, and for the telephone
handsets (th). 
Three sites participated: BBN \[13\], Dragon \[15\], and SRI 
\[30\].
Table 9 presents the results for Spoke 6. 
For the case of the microphone adaptation disabled (C1), for 
the Audio-Technica microphone's data, word error rates 
were 6.4% for the SRI system, 10.4% for the BBN system, 
and 18.5% for the Dragon results. For telephone handset 
data, the SRI system had 19.1%, the BBN system had 29.3% 
and Dragon 65.4%. These results for the telephone handset 
data were probably somewhat worse than might have been 
expected because of inadvertent channel differences between 
development test and evaluation test sets. 
Considering the adaptation enabled/disabled P0:C1 contrast,
BBN and Dragon achieved 9.4% and 11.7% reductions in
word error rate for the Audio-Technica microphone; for the
telephone handset data, the corresponding reductions were
57.4% (from 29.3% to 12.5% word error) for BBN and
11.7% for Dragon. On corresponding Sennheiser
data, the BBN and SRI systems with adaptation disabled 
achieved word error rates ranging from 5.9% to 8.4%, while 
the Dragon results were 13.8% and 14.6%. 
A.4.8. Spoke 7: "Noisy Environments". The stated goal of
this spoke was to "evaluate a noise compensation algorithm
with known alternate microphones" in two different data-
collection environments with background A-weighted sound
levels of 55 to 68 dB. Two different microphones were
used, the same microphones as in Spoke 6 (the
Audio-Technica and a telephone handset). Utterances for the
microphone/channel adaptation (Sennheiser to known 
alternate microphone) were available from development test 
data, and there were files with background noise (but no 
speech) for each microphone-noise-environment-speaker 
condition. The two noise environments consisted of a
computer laboratory ("e1") and a room with package
sortation machinery in operation ("e2").
The sole participant in this spoke was SRI \[30\]. 
Table 10 presents the results for Spoke 7. 
As might be expected, the word error rate was smallest for 
the lower of the two noise conditions with the alternate high- 
quality (but not close-talking) Audio-Technica microphone 
(8.5%) (for which the A-weighted S/N ratio was 
approximately 26 dB), and markedly higher for both alternate 
microphones in the higher noise environment (17.4% and 
28.8%). For corresponding data from the close-talking 
Sennheiser microphone, in the two different noise 
environments, error rates of 6.3% to 9.1% were
obtained. 
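The S/N figures quoted here and in Figures 3A-3C are decibel ratios of signal to noise power; a minimal sketch of the conversion, with an illustrative power ratio chosen to land near the ~26 dB A-weighted value reported above:

```python
from math import log10

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from average power measurements."""
    return 10.0 * log10(signal_power / noise_power)

# A signal with 400x the average power of the background noise sits at ~26 dB,
# roughly the A-weighted S/N reported for the quieter Spoke 7 environment.
print(round(snr_db(400.0, 1.0), 1))  # → 26.0
```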
A.4.9. Spoke 8: "Calibrated Noise Sources". The stated goal 
of this spoke was to "evaluate a noise compensation 
algorithm with a known alternate microphone on data 
corrupted with calibrated noise sources". Data was collected 
using the Audio-Technica microphone, which was also used 
in Spokes S6 and S7, in the presence of competing noise
(from a "boom box" radio-tape player situated nearby). The 
competing noise was either a variety of musical selections 
("mu") or talk radio ("tr"). The competing noise was 
"calibrated" in the sense that the level of the competing noise 
was intended to be set so as to be 20 or 10 dB less than the 
speech peak level, or equal to (or potentially greater than) 
the speech peak level, the "0 dB condition". Note, however,
that NIST's measurements of SNR do not agree well with
these desiderata, except in some qualitative sense, as
discussed in Section 2.3 of this paper.
CMU \[22\] was the sole participant in this spoke.
Table 11 presents the results for Spoke 8. 
Data were submitted for the 3 competing noise conditions, 
both microphones (Sennheiser and Audio-Technica), and
with noise compensation enabled and disabled -- a total of 24 
conditions, permitting many cross-comparisons. 
With compensation disabled, there were reductions in error 
rate with use of the close-talking, noise cancelling Sennheiser 
microphone when comparing results for the two different 
microphones (C3:C1). With compensation enabled, and 
again comparing the two different microphones (C3:P0), the 
differences in error rate are reduced, but are still significant 
in most cases. 
There is evidence of significant reductions in error rate when 
considering compensation enabled/disabled (P0:C1) for both 
music and talk radio at the 10 dB and 0 dB conditions. 
Further, enabling compensation appears to be beneficial for 
much of the data obtained with the close talking Sennheiser 
microphone (see, for example the C3:C2 comparisons). 
A.4.10. Spoke 9: Spontaneous WSJ Dictation. The stated
goal of this spoke was to "improve basic performance on 
spontaneous dictation-style speech". There were 10 speakers 
(all journalists, but with varying experience in dictation), each 
dictating 20 spontaneous Wall Street Journal-like sentence 
utterances, and using the Sennheiser microphone.
BBN \[13\] was the sole participant in this spoke. 
Table 12 presents the results for Spoke 9. 
Using the same system as used for the C1 condition in Hub 
1 (which achieved a word error rate of 14.2% on the Hub 1 
test data), a word error rate of 24.7% was achieved on the $9 
data, indicating that the spontaneous dictation $9 test set is 
substantially more challenging. BBN's $9 system achieved an 
error rate of 19.1% on the $9 data, a significant reduction in 
word error rate of 22.8% over the H1-C1 system. 
A.5. ATIS November 1993 Test Material
The final, adjudicated set of test material consisted of 965
test utterances and was collected at 5 sites -- BBN, CMU,
MIT, NIST, and SRI. As in previous years, it was selected by
NIST staff from set-aside material previously collected within
the MADCOW community \[10\]. The test set was selected so
as to balance the number of utterances per data collection
site (~200 utterances per site). Because of differences in
the scenarios and data collection systems used at the
different collection sites, it was not possible to balance the
test set for number of subjects or the difficulty of scenarios
per collection site. No "pre-filtering" of the test data was
performed except to attempt to exclude subject-scenarios
with mostly repetitive queries. The ATIS test material was
released in November, 1993.
A.6. ATIS Scoring and Adjudication
The ATIS scoring and adjudication process took place in
December and early January. ATIS test and scoring
protocols were similar to those of previous benchmark tests.
After the scored ATIS results were released in December
1993, approximately 140 adjudication requests ("bug reports")
were sent to NIST. NIST worked in conjunction with SRI to
resolve the requests, about 10 of which were duplicates.
The majority of the bug reports dealt with transcription
issues, in some cases pointing to limitations in our
community's procedures for transcribing ATIS-domain
spontaneous speech. One utterance in particular, which was
classified as Class X (and thus did not affect the NL or SLS
scores) but was included in the ATIS SPREC scoring,
included low-level remarks by the experimenter, as a result
of an inadvertent "open mike" condition. Originally, this
block of speech was transcribed as "unintelligible", but in
adjudication it was fully transcribed, partially because a
number of sites had objected to having been scored with
significant numbers of insertion errors. After adjudication,
most sites continued to do very poorly on this one utterance,
but were now penalized for substitutions and deletions as
well. It alone accounts for an increment of approximately
0.3% in the Class A+D+X word error for most sites, and a
substantially larger fraction of the Class X error rate. In
retrospect, it is clear that this problematic utterance (and the
entire subject-scenario) ought not to have been included in
the test set because of the "open mike" condition.
Besides the recurrent complaints of bad transcriptions, a
problem involving fare IDs or flight IDs not appearing in the
maximal reference answer files (the ".rf2s"), which came to
be known as "Joe's Fare Bug", was brought to our attention.
This bug affected about 21 of the test utterances before
scoring. The bug was fixed by SRI and new .rf2s were
generated prior to rescoring.
A.7. ATIS Test Participants 
United States participants in the ATIS tests included: AT&T 
Bell Laboratories (AT&T) \[23\], BBN Systems and 
Technologies (BBN) \[24\], Carnegie Mellon University 
(CMU) \[11\], Massachusetts Institute of Technology's 
Laboratory for Computer Science (MIT/LCS) \[26\], SRI
International (SRI) \[27\], and Unisys (UNISYS) \[28\]. There
was one foreign participant, CRIM \[25\], from Canada.
AT&T collaborated with CMU, using an AT&T-developed
ATIS-domain speech recognition system and the CMU ATIS
natural language system, and Unisys collaborated with BBN,
using a set of N-best outputs from a BBN ATIS-domain speech
recognition system as input for Unisys-developed natural
language technology.
A.8. ATIS Benchmark Test Results 
A.8.1. SPontaneous speech RECognition (SPREC) Tests. 
Table 13 presents the results for the SPREC tests for all 
systems and subsets of the ATIS test data, using the 
Sennheiser close-talking microphone. For the case of the 
subset of all answerable queries, Class A+D, the word error 
rates ranged from 3.3% to 9.0%. 
Table 14 presents a matrix tabulation of the ATIS SPREC 
results for the Class A+D subset. The overall word error rate 
across all tested systems for the data from the several 
collecting sites ("Overall Totals" row along the bottom of the 
Table) ranges from 3.6% for the CMU-collected data to
6.8% for the NIST-collected data, reflecting differences in
subject populations and other factors. 
Table 15 presents the results, in matrix form, of the 
application of 4 paired-comparison significance tests for the 
SPREC systems for the Class A+D subset. Among other 
things, note that the performance differences between the 
BBN and the CMU systems are not shown to be significant, 
and that the differences between the MIT, SRI and one of 
the Unisys systems are also not shown to be significant. Note 
also that significant differences are shown between the BBN 
results and those for the two Unisys systems, which make use 
of BBN-provided N-best results. 
A.8.2. Natural Language (NL) Understanding Tests. Table 
16 presents a tabulation of the results for the NL tests for all 
systems and all sets of "answerable" ATIS queries, Class 
A+D, Class A and Class D. 
For the set of all answerable queries, Class A+D, the 
unweighted error rate ("UW. Err.") ranges from 43.1% to 
9.3%. For Class A queries, the range is 28.6% to 6.0%, and 
for Class D, the range is 63.1% to 13.8%. In each case (and
as in last year's results), the lowest error rates were reported 
by the CMU system. 
As noted in Section A.7 of this paper, the AT&T NL system
was the result of a collaborative agreement with CMU, so
it is not surprising that its performance is nearly identical to
that of the CMU system.
In some cases, more than one set of results was submitted
by an individual site, corresponding to different systems.
The differences between systems were specified in
the "Systems Descriptions" provided to NIST at the time 
results were submitted. Space limitations prohibit discussion 
of these differences in this paper. 
After preliminary scoring had been completed, Moore at SRI 
advised NIST that a bug had been found in the code that 
produced results submitted to NIST for the SRI NL and SLS 
systems, with the effect of reporting results that were 
"essentially the output of \[the SRI\] system with the robust
processing component turned off", because a "No_Answer"
response over-wrote the answer produced by the robust
processing component (a "template matcher"). With the
permission of the ARPA Coordinating Committee, SRI later 
resubmitted results for the debugged systems, and these SRI 
results are shown as "late, debugged" results. 
Table 17 presents a matrix tabulation of the official NL 
results for the several subsets of test material. There is some 
indication of varying degrees of difficulty presented by the 
different subsets of data from the different sites, subject- 
scenarios, and subject populations: note that the unweighted 
error rates reported in the "Overall Totals" row range from
28.1% to 16.0%, but also note that both of these values were
obtained with BBN systems -- one at BBN, and the other at 
NIST. These differences are probably not significant, since
the number of speakers in each individual test set is small.
A.8.3. Spoken Language System (SLS) Understanding Tests. 
Table 18 presents a tabulation of the results for the SLS tests 
for all systems and all sets of "answerable" ATIS queries,
Class A+D, Class A and Class D. 
For the set of all answerable queries, Class A+D, the 
unweighted error rate ("UW. Err.") ranges from 46.8% to 
13.2%. For Class A queries, the range is 33.5% to 8.9%, and 
for Class D, the range is 65.2% to 17.5%. For the Class 
A+D and Class A results, the lowest error rates were 
obtained by the CMU system, but for the Class D results, the 
lowest error rates were obtained by the MIT/LCS system. 
Table 19 presents a matrix tabulation of the official SLS
results for the several subsets of Class A+D test material 
from different sites. Note that there is some evidence of 
"local adaptation" to locally collected data (e.g., error rates 
for the CMU system are substantially lower for the CMU- 
collected data). 
Note also that some sites (typically the "volunteers")
continued to use the "No_Answer" option more frequently
than others, which would be a beneficial strategy in a system 
in which "wrong answers" were penalized more heavily than 
"no answer". In some cases, use of this option was more
prevalent for data from some originating sites than others, 
perhaps reflecting differences between subject populations or 
subject-scenario subsets. 
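The advantage of the "No_Answer" strategy depends entirely on the error weighting; earlier ATIS evaluations used a weighted metric in which a wrong answer cost twice a "No_Answer". A minimal sketch under that assumed 2:1 weighting, with invented query counts:

```python
def weighted_error(n_wrong, n_no_answer, n_total, wrong_cost=2, na_cost=1):
    """Error metric (%) in which a wrong answer costs more than "No_Answer";
    the 2:1 weighting follows the convention of earlier ATIS tests."""
    return 100.0 * (wrong_cost * n_wrong + na_cost * n_no_answer) / n_total

# Hypothetical: out of 200 queries, answering the risky ones wrong vs.
# declining to answer 25 of them and getting only 10 wrong.
aggressive = weighted_error(n_wrong=30, n_no_answer=0, n_total=200)
cautious = weighted_error(n_wrong=10, n_no_answer=25, n_total=200)
print(aggressive, cautious)  # → 30.0 22.5
```

Under an unweighted metric the two strategies would score much closer, which is why the frequent-"No_Answer" sites would benefit only when wrong answers carry the heavier penalty.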
Figure 1. Range of word error rates (mean, plus or minus one standard deviation) across the 10 speakers of the Hub 1 C1 (November 1993 Hub 1, Contrast 1) test set, for 11 systems.
Figure 2. Word error rates for the Hub 1 C1 test speakers vs. speaking rate (words per minute).
Fig. 3A. SNR measurements for Spoke 5 (ten alternate microphones, each paired with the Sennheiser reference; broadband, A-weighted, and telephone-bandwidth measures).
Fig. 3B. SNR measurements for Spoke 6 (telephone handset and Audio-Technica microphones, each paired with the Sennheiser reference).
Fig. 3C. SNR measurements for Spoke 7 (telephone handset and Audio-Technica microphones, each paired with the Sennheiser reference, in the two noise environments).
Nov 93 Hub and Spoke CSR Evaluation
Hub 1: 64K Read WSJ Baseline
GOAL: Improve basic SI performance on clean data.
DATA: 10 speakers * 20 utts = 200 utts 64K-word read WSJ data, Sennheiser mic.
Primary and Contrast Conditions:
  P0 (opt) Any grammar or acoustic training; session boundaries and utterance order given as side information.
  C1 (req) Static SI test with standard 20K trigram open-vocab grammar and choice of either short-term or long-term speakers.
  C2 (opt) Static SI test with standard 20K bigram open-vocab grammar and choice of either short-term or long-term speakers.
SIDE INFO: Session boundaries and utterance order are known for H1-P0 only.

System     Primary P0     Contrast C1    Contrast C2
           Word Err. (%)  Word Err. (%)  Word Err. (%)
bu2                       14.3
bu3                       14.5
dragon1                   19.0
limsi1                    11.7           15.2
mit-ll1    16.8           18.6
philips2                  14.8           17.2
sri1                      14.4           16.5

COMPARISONS AND SIGNIFICANCE TESTS
System     Test Comp.  % Change W.E.  MAPSSWE  Sign  Wilcoxon  McN
bbn1       P0:C1       13.9%          P0       same  P0        P0
mit-ll1    P0:C1        9.8%          P0       P0    P0        P0
cu-htk1    C1:C2       11.7%          C1       C1    C1        same
limsi1     C1:C2       22.7%          C1       C1    C1        C1
philips2   C1:C2       14.0%          C1       C1    C1        same
sri1       C1:C2       13.0%          C1       C1    C1        C1

Table 1. Hub 1 Results
Table 2. Matrix of cross-site and within-site paired-comparison significance test results for the baseline H1-C1 systems.
Nov 93 Hub and Spoke CSR Evaluation
Hub 2: 5K Read WSJ Baseline
GOAL: Improve basic SI performance on clean data.
DATA: 10 speakers * 20 utts = 200 utts 5K-word read WSJ data, Sennheiser mic.
Primary and Contrast Conditions:
  P0 (opt) Any grammar or acoustic training; session boundaries and utterance order given as side information.
  C1 (req) Static SI test with standard 5K bigram closed-vocab grammar and choice of either short-term or long-term speakers from WSJ0 (7.2K utts).
SIDE INFO: Session boundaries and utterance order are known for H2-P0 only.

System     Primary P0     Contrast C1
           Word Err. (%)  Word Err. (%)
bu2        5.4            10.3
cu-con1                   13.5
philips1   9.2            12.3

COMPARISONS AND SIGNIFICANCE TESTS
System     Test Comp.  % Change W.E.  MAPSSWE  Sign  Wilcoxon  McN
bu1        P0:C1       42.4%          P0       P0    P0        P0
bu2        P0:C1       47.4%          P0       P0    P0        P0
bu3        P0:C1       46.6%          P0       P0    P0        P0
cu-htk2    P0:C1       43.4%          P0       P0    P0        P0
limsi2     P0:C1       43.7%          P0       P0    P0        P0
philips1   P0:C1       25.5%          P0       P0    P0        P0

Table 3. Hub 2 Results
Table 4. Matrix of cross-site and within-site paired-comparison significance test results for the baseline H2-C1 systems.
Table 5. Spoke 1 (Language Model Adaptation) results.
Table 6. Spoke 3 (SI Recognition Outliers) results.
Table 7. Spoke 4 (Incremental Speaker Adaptation) results.
Table 8. Spoke 5 ("Microphone Independence") results.
Table 9. Spoke 6 (Known Alternate Microphones) results.
Table 10. Spoke 7 ("Noisy Environments") results.
Table 11. Spoke 8 ("Calibrated Noise Sources") results.
Table 12. Spoke 9 (Spontaneous WSJ Dictation) results.
Dec93 ATIS SPREC Test Results, Class A+D Subset

[Matrix tabulation. Systems (rows): att2, bbn3, crim3, mit_lcs2, sri3, sri4, unisys2, among others. Originating sites of test data (columns): BBN (146 Utt.), CMU (163 Utt.), MIT (132 Utt.), NIST-BBN (89 Utt.), NIST-SRI (77 Utt.), SRI (166 Utt.); 773 utterances in all. Each cell reports %Sub %Del %Ins on a first line and %W.Err %Utt.Err on a second.]

Matrix tabulation of results for the Dec93 ATIS SPREC Test Results, for the Class A+D Subset.
Matrix columns present results for Test Data Subsets collected at several sites, and matrix rows present results for different systems.
Numbers printed at the top of the matrix columns indicate the number of utterances in the Test Data (sub)set from the corresponding site.
"Overall Totals" (column) present results for the entire Class A+D Subset for the system corresponding to that matrix row.
"Foreign Coll. Site Totals" present results for "foreign site" data (i.e., excluding locally collected data) for the Class A+D Subset.
"Overall Totals" (row) present results accumulated over all systems corresponding to the Test Data (sub)set corresponding to that matrix column. "Foreign System Totals" present results accumulated over "foreign systems" (i.e., excluding results for the system(s) developed at the site responsible for collection of that Test Data subset).

Table 14 ATIS SPREC Results: Class (A+D) by Collection Site
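The %W.Err figure in each SPREC cell is simply the sum of the three component error rates reported on the cell's first line; for example, the att2 "Overall Totals" entry of 5.4 + 2.1 + 1.5 yields the tabulated 9.0. A minimal sketch of that bookkeeping (illustrative only, not NIST's scoring software):

```python
def word_error(sub_pct: float, del_pct: float, ins_pct: float) -> float:
    """Word error rate (%): substitutions, deletions, and insertions all count."""
    return sub_pct + del_pct + ins_pct

# att2, "Overall Totals" column of Table 14: 5.4 %Sub, 2.1 %Del, 1.5 %Ins
print(round(word_error(5.4, 2.1, 1.5), 1))  # 9.0
```

Because insertions are counted as errors, %W.Err can in principle exceed 100%; the utterance error rate (%Utt.Err) instead counts the fraction of test utterances containing at least one word error.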
            Class A+D    Class A      Class D
            773 Utts.    448 Utts.    352 Utts.
system      UW Err.      UW Err.      UW Err.
att1          10.2          7.4         14.2
bbn1          14.7          9.6         21.8
bbn2          22.4         16.1         31.1
cmu1           9.3          6.0         13.8
crim1         36.4         21.7         56.6
crim2         20.8         14.7         29.2
mit_lcs1      12.5         10.0         16.0
sri1          21.9         14.3         32.3
sri5 **       18.2         10.5         28.9
unisys1       43.1         28.6         63.1

Table 16 ATIS NL Test Results
Dec 93 ATIS NL Test Results, Class (A+D) Set

[Matrix tabulation. Systems (rows): att1, bbn1, bbn2, cmu1, crim1, crim2, mit_lcs1, sri1, sri5 **, unisys1. Originating sites of test data (columns): BBN (146), CMU (163), MIT (132), NIST-SRI (77), NIST-BBN (89), SRI (166); 773 evaluable utterances in all. Each cell reports #T #F #NA on a first line, %T %F %NA on a second, and % Un-Weighted Err on a third.]

Matrix tabulation of results for the Dec 93 ATIS NL Test Results, using the Minimal/Maximal Scoring Criterion, for the Class (A+D) Subset.
Matrix columns present results for Test Data Subsets collected at several sites, and matrix rows present results for different systems.
Numbers printed at the top of the matrix columns indicate the number of evaluable utterances in the Test Data (sub)set from the corresponding site.
"Overall Totals" (column) present results for the entire Class (A+D) Subset for the system corresponding to that matrix row.
"Foreign Coll. Site Totals" present results for "foreign site" data (i.e., excluding locally collected data) for the Class (A+D) Subset.
"Overall Totals" (row) present results accumulated over all systems corresponding to the Test Data (sub)set corresponding to that matrix column. "Foreign System Totals" present results accumulated over "foreign systems" (i.e., excluding results for the system(s) developed at the site responsible for collection of that Test Data subset).
** Late and for a debugged system.

Table 17 ATIS NL Results: Class (A+D) by Collection Site
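The % Un-Weighted Err figures in the NL and SLS tabulations are consistent with counting every evaluable utterance equally and treating both a false answer (F) and a "No Answer" (NA) as errors. A minimal sketch under that reading (not NIST's scoring software):

```python
def unweighted_error(num_true: int, num_false: int, num_na: int) -> float:
    """Unweighted error (%): false answers and non-answers both count as errors."""
    total = num_true + num_false + num_na
    return 100.0 * (num_false + num_na) / total

# bbn1, "Overall Totals" column of Table 17: 659 T, 113 F, 1 NA (773 utterances)
print(round(unweighted_error(659, 113, 1), 1))  # 14.7
```

The computed 14.7% agrees with the bbn1 Class A+D entry in Table 16, and the same arithmetic reproduces the other per-system totals.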
            Class A+D    Class A      Class D
            773 Utts.    448 Utts.    352 Utts.
system      UW Err.      UW Err.      UW Err.
att1          24.6         22.1         28.0
bbn1          17.5         13.8         22.5
cmu1          13.2          8.9         19.1
crim1         43.3         28.6         63.7
crim2         28.2         23.7         34.5
mit_lcs1      14.2         11.8         17.5
sri1          24.8         16.5         36.3
sri2          25.4         18.5         34.8
sri5 **       20.7         14.1         29.8
sri6 **       21.2         13.8         31.4
unisys1       46.8         33.5         65.2

Table 18 ATIS SLS Test Results
Dec 93 ATIS SLS Test Results, Class (A+D) Set

[Matrix tabulation. Systems (rows): att1, bbn1, cmu1, crim1, crim2, mit_lcs1, sri1, sri2, sri5 **, sri6 **, unisys1. Originating sites of test data (columns): BBN (146), CMU (163), MIT (132), NIST-SRI (77), NIST-BBN (89), SRI (166); 773 evaluable utterances in all. Each cell reports #T #F #NA on a first line, %T %F %NA on a second, and % Un-Weighted Err on a third.]

Matrix tabulation of results for the Dec 93 ATIS SLS Test Results, using the Minimal/Maximal Scoring Criterion, for the Class (A+D) Subset.
Matrix columns present results for Test Data Subsets collected at several sites, and matrix rows present results for different systems.
Numbers printed at the top of the matrix columns indicate the number of evaluable utterances in the Test Data (sub)set from the corresponding site.
"Overall Totals" (column) present results for the entire Class (A+D) Subset for the system corresponding to that matrix row.
"Foreign Coll. Site Totals" present results for "foreign site" data (i.e., excluding locally collected data) for the Class (A+D) Subset.
"Overall Totals" (row) present results accumulated over all systems corresponding to the Test Data (sub)set corresponding to that matrix column. "Foreign System Totals" present results accumulated over "foreign systems" (i.e., excluding results for the system(s) developed at the site responsible for collection of that Test Data subset).
** Late and for a debugged system.

Table 19 ATIS SLS Results: Class (A+D) by Collection Site
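The "Foreign Coll. Site Totals" exclude the data collected at the system developer's own site. That accumulation can be sketched as follows (illustrative only; the helper name is ours), and checked against the cmu1 row of the SLS matrix, whose per-site T/F/NA counts are readable in the original tabulation:

```python
def foreign_site_totals(per_site_counts: dict, home_site: str):
    """per_site_counts maps collection site -> (num_true, num_false, num_na).
    Sum the counts over every site except the system's home (collection) site."""
    t = f = na = 0
    for site, (nt, nf, nna) in per_site_counts.items():
        if site == home_site:
            continue  # exclude locally collected data
        t, f, na = t + nt, f + nf, na + nna
    return t, f, na

# cmu1 per-collection-site counts from Table 19 (T, F, NA)
cmu1 = {
    "BBN": (127, 19, 0), "CMU": (152, 11, 0), "MIT": (114, 18, 0),
    "NIST-SRI": (64, 13, 0), "NIST-BBN": (76, 13, 0), "SRI": (138, 28, 0),
}
print(foreign_site_totals(cmu1, "CMU"))  # (519, 91, 0)
```

The result matches the tabulated cmu1 "Foreign Coll. Site Totals" of 519 T, 91 F, 0 NA, i.e. the overall 671/102/0 minus the locally collected CMU subset.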
