BENCHMARK TESTS FOR THE DARPA 
SPOKEN LANGUAGE PROGRAM 
David S. Pallett, Jonathan G. Fiscus, 
William M. Fisher, and John S. Garofolo 
National Institute of Standards and Technology 
Room A216, Building 225 (Technology) 
Gaithersburg, MD 20899 
1. INTRODUCTION 
This paper documents benchmark tests implemented within 
the DARPA Spoken Language Program during the period 
November, 1992 - January, 1993. Tests were conducted 
using the Wall Street Journal-based Continuous Speech 
Recognition (WSJ-CSR) corpus and the Air Travel Infor- 
mation System (ATIS) corpus collected by the Multi-site 
ATIS Data COllection Working (MADCOW) Group. The 
WSJ-CSR tests consist of tests of large vocabulary (lexi- 
cons of 5,000 to more than 20,000 words) continuous 
speech recognition systems. The ATIS tests consist of tests 
of (1) ATIS-domain spontaneous speech (lexicons typically 
less than 2,000 words), (2) natural language understanding, 
and (3) spoken language understanding. These tests were 
reported on and discussed in detail at the Spoken Language 
Systems Technology Workshop held at the Massachusetts 
Institute of Technology, January 20-22, 1993. 
Tests implemented during this period also included experi- 
mental or "dry run" implementation of two new tests. In the 
WSJ-CSR domain, a "stress test" was implemented, using 
test material that was drawn from unidentified sub-corpora. 
In the ATIS domain, an experimental "end-to-end" evalua- 
tion was conducted that included examination of the sub- 
ject-session "logfile". Following precedents established 
previously, the results of these dry-run tests are not 
included as part of the "official" NIST test results and are 
not discussed at length in this paper. 
Prior benchmark tests conducted within the DARPA Spo- 
ken Language Program are described in papers by Pallett, 
et al. in the several proceedings of the DARPA Speech and 
Natural Language Workshops from 1989 to 1992. Papers in 
the Proceedings of the February 1992 Speech and Natural 
Language Workshop describe the development of the WSJ- 
CSR corpus, collection procedures and initial experience in 
building systems for this domain. Initial use of the Pilot 
Corpus for a "dry run" of benchmark test procedures prior 
to the February 1992 Speech and Natural Language
Workshop is reported in \[1\]. ATIS-domain tests that were 
reported at the February 1992 meeting are documented in 
\[2\]. 
System descriptions were submitted to NIST by the benchmark
test participants and distributed at the Spoken Language
Systems Technology Workshop. Additional information
describing these systems can be found in references 5-23. 
Detailed information is not available (in published papers) 
for some systems. 
2. WSJ-CSR TESTS: NEW CONDITIONS 
2.1. Stress Test 
The established benchmark test protocols for speech recog- 
nition systems are such that system developers have prior 
knowledge of the nature of the test material, based on 
access to similar development test sets. Some developers 
have consistently declined to report results for material of 
particular interest to DARPA program management (e.g., 
for secondary microphone data). Concern has been 
expressed that the sensitivity or "robustness" of some 
DARPA-sponsored recognition algorithms has not been 
adequately probed or the systems "stressed". 
DARPA program management requested that NIST imple- 
ment, in early December, 1992, a "dry run" of a "stress 
test" in which the nature of the test material was unspeci- 
fied. Participating DARPA contractors were required to 
document and freeze the system configuration used to pro- 
cess the test material prior to implementing the test, and to 
provide data for a baseline test of this system using the 20K 
NVP test subset of the Nov.'92 test material, as well as for 
the stress test set. Test hypotheses were scored by NIST 
using "conditional scoring" -- partitioning and reporting 
test results for individual test subsets. 
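The idea of conditional scoring can be illustrated with a minimal sketch. It assumes that each scored utterance carries a subset label, a word error count, and a reference word count; the function name, the labels, and the counts below are hypothetical and are not taken from the NIST scoring software.

```python
from collections import defaultdict


def conditional_scores(utterances):
    """Return the word error rate (%) for each test subset.

    `utterances` is an iterable of (subset_label, word_errors, ref_words)
    tuples; labels and counts here are invented for illustration only.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for subset, n_err, n_ref in utterances:
        errors[subset] += n_err
        words[subset] += n_ref
    # One pooled error rate per partition rather than one overall figure.
    return {s: 100.0 * errors[s] / words[s] for s in errors}


scored = [
    ("read-5K/primary", 1, 10),
    ("read-5K/primary", 1, 10),
    ("spontaneous", 3, 10),
]
print(conditional_scores(scored))
```

Reporting one figure per partition keeps a weak result on one kind of material from being masked by a strong result on another.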
The stress test material consisted of a set of 320 utterance 
files, chosen from three components: (1) read 20K sen- 
tences, for 4 female speakers, (2) read 5K sentences, for 4 
female speakers, and (3) spontaneously dictated news arti- 
cles, for 2 male and 2 female speakers. The read speech 
included both primary and secondary microphones, so that 
there were 5 test subsets in all, each consisting of either 60 
or 80 utterances. 
Reactions to the stress test, as well as to the test results, 
were mixed. In general, as would be expected, systems with 
trigram language models did better than those with bi- 
grams. Degradations in performance for the secondary 
microphone data were relatively smaller for some systems 
than others -- particularly for those sites that had devoted 
special effort to the issue of "noise robustness". However, 
because the individual test subsets and the number of 
speakers were small, the results of many of the paired com- 
parison significance tests were inconclusive, suggesting 
that future applications of such a test procedure must 
involve larger test subsets. 
2.2. New Significance Tests 
For several years, NIST has implemented two tests of sta- 
tistical significance for the results of benchmark tests of 
speech recognition systems: the McNemar sentence error 
test (MN) and a Matched-Pair-Sentence-Segment-Word- 
Error (MAPSSWE) test, on the word error rate found in 
sentence segments. In more recent tests, NIST has also 
implemented two additional tests: a Signed-pair (SI) test, 
and the Wilcoxon signed rank (WI) test. These additional 
tests are relevant to the word error rates found for individ- 
ual speakers, and as such are particularly sensitive to the 
number of speakers in the test set. References to these tests 
can be found in the literature on nonparametric or
distribution-free statistics. 
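The two speaker-level tests can be sketched as follows. This is a textbook formulation under stated assumptions, not the NIST implementation: the inputs are paired per-speaker word error rates for two systems, speakers with identical error rates are dropped, and exact two-sided p-values are computed (the Wilcoxon test enumerates all sign assignments, which is practical only for small speaker counts such as the dozen or so used here). The function names and the example data are hypothetical.

```python
from itertools import product
from math import comb


def sign_test_p(a, b):
    """Two-sided exact sign test on paired per-speaker error rates."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # ties are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    k = min(sum(d > 0 for d in diffs), sum(d < 0 for d in diffs))
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_p(a, b):
    """Two-sided exact Wilcoxon signed-rank test (small n only)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences; tied magnitudes share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Enumerate all 2**n sign assignments for the exact null distribution.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-speaker word error rates (%) for 12 speakers.
sys_a = [20] * 12
sys_b = [21, 22, 23, 24, 15, 14, 13, 12, 11, 10, 9, 8]
print(sign_test_p(sys_a, sys_b), wilcoxon_p(sys_a, sys_b))
```

Because the sign test uses only the direction of each per-speaker difference while the Wilcoxon test also uses its rank, the latter can detect a difference that the former misses, as in this invented example.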
2.3. Uncertainty of Performance Measurement 
Results 
Increasing attention is being paid, at NIST, to evaluating 
and expressing the uncertainty of measurement results. This 
attention is motivated, in part, by the realization that "in 
general, it is not possible to know in detail all of the uses to 
which a particular NIST measurement result will be 
put."\[3\] Current NIST policy is that "all NIST measure- 
ment results are to be accompanied by quantitative mea- 
surements of uncertainty". In substance, the recommended 
approach to expressing measurement uncertainty is that rec- 
ommended by the International Committee for Weights and 
Measures (CIPM). 
The CIPM-recommended approach includes: (1) determin- 
ing and reporting the "standard uncertainty" or positive 
square root of the estimated variance for each component of 
uncertainty that contributes to the uncertainty of the mea- 
surement result, (2) combining the individual standard 
uncertainties into a determination of the "combined stan- 
dard uncertainty", (3) multiplying the combined standard 
uncertainty by a factor of 2 (a "coverage factor", that for 
normally distributed data corresponds to the 95% confi- 
dence interval), and specifying this quantity as the 
"expanded uncertainty". The expanded uncertainty, along 
with the coverage factor, or else the combined standard 
uncertainty, is to be reported. 
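The three steps can be sketched numerically. The sketch assumes that the uncertainty components are uncorrelated and already expressed in the units of the measurement result, in which case they are conventionally combined by root-sum-of-squares; the function names and the component values are hypothetical.

```python
from math import sqrt


def combined_standard_uncertainty(components):
    """Root-sum-of-squares of uncorrelated standard uncertainties."""
    return sqrt(sum(u * u for u in components))


def expanded_uncertainty(components, coverage_factor=2.0):
    """Combined standard uncertainty scaled by the coverage factor."""
    return coverage_factor * combined_standard_uncertainty(components)


# Hypothetical standard uncertainties (in percentage points of error rate).
components = [3.0, 4.0]
print(combined_standard_uncertainty(components))
print(expanded_uncertainty(components))
```

Correlated components would require covariance terms that this sketch omits.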
The paired-comparison significance tests outlined in the 
previous section represent specific instantiations of tests 
that evaluate the validity of null hypotheses regarding dif- 
ferences (in measured performance) between systems. In 
many cases, however, sufficiently detailed data is not avail- 
able to implement these tests. In these cases it is important 
to refer to explicit estimates of uncertainties. 
The case of evaluating the uncertainties associated with per- 
formance measurements for spoken language technology is 
particularly complex because of the number of known com- 
plicating factors. These factors include properties of the 
speaker population (e.g., gender, dialect region, speaking 
rate, vocal effort, etc.), properties of the training and test 
sets (e.g., vocabulary size, syntactic and semantic proper- 
ties, microphone/channel, etc.) and other factors \[4\]. 
Performance measures used to date within the DARPA spo- 
ken language research community (and included in this 
paper) do not conform to the recommended approach, since 
the scoring software, in general, generates a single measurement
for the ensemble of test data (e.g., one datum indicating
word or utterance error rate for the entire multi-speaker, 
multi-utterance, test subset, rather than the mean error rate 
for the ensemble of speakers). These single-measurement 
performance evaluation procedures do not yield estimates 
of the variances "for each component of uncertainty that 
contributes to the uncertainty of the measurement result" 
that are required in order to implement the CIPM-recom- 
mended practice. 
In future tests, revisions to the scoring software that would 
permit estimates of the variance across the speaker popula- 
tion (at the least) are in order. However, it would seem to be 
the case that identifying and obtaining quantitative esti- 
mates of "each component of uncertainty that contributes to 
the uncertainty of the measurement" will be difficult. 
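The distinction between the single pooled measurement and a mean over speakers can be made concrete. The sketch below is one possible form of such a revision, not a description of any planned NIST software: it takes hypothetical per-speaker error and word counts and reports the pooled error rate, the mean of the per-speaker error rates, and the standard uncertainty of that mean (sample standard deviation divided by the square root of the number of speakers).

```python
from math import sqrt
from statistics import mean, stdev


def pooled_error_rate(errors, words):
    """Single figure for the whole test subset, as currently reported."""
    return 100.0 * sum(errors) / sum(words)


def speaker_mean_and_uncertainty(errors, words):
    """Mean per-speaker error rate and its standard uncertainty."""
    rates = [100.0 * e / w for e, w in zip(errors, words)]
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Hypothetical counts for three speakers.
errors = [10, 20, 30]
words = [100, 100, 100]
print(pooled_error_rate(errors, words))
print(speaker_mean_and_uncertainty(errors, words))
```

When speakers contribute unequal numbers of words, the pooled rate and the speaker mean differ; only the latter comes with an across-speaker variance estimate.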
3. WSJ-CSR NOVEMBER 1992 TEST 
MATERIAL 
The test material, as distributed, included a total of 16 identified
test subsets. In general, these can be sub-categorized 
five ways: speaker dependent/independent (SD/SI), 5K/ 
20K reference vocabularies, the use of verbalized/non-ver- 
balized punctuation (VP/NVP), read/spontaneous speech, 
and primary (Sennheiser, close-talking)/secondary micro- 
phone. No one participant reported results on all subsets -- 
most reported results on only one or two, corresponding to 
conditions of particular local interest and/or algorithmic 
strength. 
All of the test material was drawn from the WSJ-CSR Pilot 
Corpus that was collected at MIT/LCS, SRI International, 
and TI. The "spontaneous dictation" data was collected only 
at SRI. 
Individual test set sizes varied from 72 utterances to (more 
typically) approximately 320 utterances. The number of 
speakers in each subset varied from 3 to 12 speakers. The 
actual number of sentence utterances per speaker varied 
somewhat, because the material was selected in paragraph 
blocks. A total of 8 secondary microphones was included in 
the various test subsets, including one speakerphone, a tele- 
phone handset, 3 boundary effect microphones (Crown 
PCC-160, PZM-6FS, and Shure SM91), two lavalier micro- 
phones, and a desk-stand mounted microphone. 
4. WSJ-CSR TEST PROTOCOLS 
Test protocols were similar to prior speech recognition 
benchmark tests. Test material was shipped to the partici- 
pating sites on October 20th, results were reported on Nov. 
23rd, and NIST reported scored results via ftp to the partici- 
pants on Dec. 2nd. The stress test was conducted between 
Nov. 30th and Dec. 15th. 
A "required baseline" test was defined for all participants. It 
consisted of processing the 5K word speaker independent, 
non-verbalized punctuation test set using a (common) bi- 
gram grammar. Six sites reported 5K baseline test results. 
5. WSJ-CSR TEST SCORING 
As for the test protocols, much of the scoring was routine, 
except for one new additional factor. Since previous "offi- 
cial" CSR benchmark tests had not included spontaneous 
speech, the community had not reviewed the adequacy of 
the transcription convention used for spontaneous speech, 
and several inconsistencies in the transcriptions were noted 
following release of the preliminary results. Some of these 
inconsistencies were resolved prior to releasing "official" 
results. 
6. WSJ-CSR TEST PARTICIPANTS 
Participants in these WSJ-CSR tests included the following 
DARPA contractors: BBN, CMU, Dragon Systems, MIT 
Lincoln Laboratory, and SRI International. A "volunteer" 
participant was the French CNRS LIMSI. LIMSI declined 
to participate in the "stress test". 
7. WSJ-CSR BENCHMARK TEST RESULTS 
AND DISCUSSION 
7.1. Test Results: Word and Utterance (Sen- 
tence) Error Rates 
Table 1 presents the results for the several test sets on which 
results were reported. Section I of that table includes results 
reported by Paul at MIT Lincoln Laboratory \[5\] for Longi- 
tudinal Speaker Dependent (LSD) technology. Section II 
includes results reported by BBN for Speaker Dependent 
(SD) technology. Section III includes the results of Speaker 
Independent (SI) technology, for a number of sites for (a) 
the 20K NVP test set for both baseline and non-baseline SI 
systems, (b) the 5K NVP test set for both baseline and non- 
baseline SI systems, (c) the 5K NVP test set "other micro- 
phone" test set data, and (d) the 5K VP test set (on which 
only LIMSI reported results \[6\]). Section IV of Table 1 
includes the results reported by BBN for the Spontaneous 
Dictation test set. 
For the test set on which the largest number of results were 
reported -- the 5K NVP set, using the close-talking micro- 
phone -- the lowest word error rates were reported by CMU 
\[7-9\]: 6.9% for the baseline, bigram language model, and 
5.3% using a trigram language model. The range of word 
error rates for the baseline condition for all systems tested 
was 6.9% to 15.0%, while for non-baseline conditions, the 
range was from 5.3% to 16.8%. 
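For readers unfamiliar with the two measures, the following sketch shows how a word error rate and an utterance error rate can be computed from reference and hypothesis transcriptions. It assumes a unit-cost minimum edit distance over whitespace-separated words; the alignment weights and text normalization used by the NIST scoring software may differ, and the function names and example sentences are hypothetical.

```python
def word_errors(ref, hyp):
    """Minimum substitutions + deletions + insertions turning ref into hyp."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # aligning an empty reference prefix
    for i, ref_word in enumerate(r, 1):
        cur = [i]
        for j, hyp_word in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1]


def score(pairs):
    """Return (word error rate %, utterance error rate %) for a test set."""
    n_words = sum(len(ref.split()) for ref, _ in pairs)
    n_err = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    n_bad_utts = sum(word_errors(ref, hyp) > 0 for ref, hyp in pairs)
    return 100.0 * n_err / n_words, 100.0 * n_bad_utts / len(pairs)


pairs = [
    ("the cat sat", "the cat sat"),
    ("show me flights to boston", "show flights to austin"),
]
print(score(pairs))
```

An utterance counts as an error if it contains at least one word error, which is why utterance error rates are so much higher than word error rates for long sentences.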
For the 5K NVP test set's secondary microphone data, as 
reported by CMU \[8\] and SRI \[10,11\], word error rates 
ranged from 17.7% to 38.9%. 
For the 20K NVP test set, on which other baseline data were 
reported, the word error rates range from 15.2% to 27.8%. 
The lowest error rate, reported by CMU, can be shown to be 
significantly different for all 4 significance tests when com- 
pared with the Dragon \[13\] and MIT Lincoln systems, but 
shown to be significantly different only for the MAPSSWE 
test when compared with the BBN system \[14\]. Thus the 
performance differences between the CMU and BBN sys- 
tems for this baseline condition test are very small. 
7.2. Significance Test Results 
Table 2 presents the results, in a matrix form, of 4 paired- 
comparison significance tests for the baseline tests for the 
5K NVP test set. The convention in this form of results tab- 
ulation is that if the result of a null-hypothesis test is valid, 
the word "same" is printed in the appropriate matrix ele- 
ment. If the null hypothesis is not valid, the identifier for the 
system with the lower (and significantly different) error rate 
is printed. 
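The tabulation convention can be sketched as follows. The significance level of 0.05, the system names, the error rates, and the p-values are all assumptions made for illustration; none of them is taken from Table 2.

```python
def comparison_matrix(error_rates, p_values, alpha=0.05):
    """Map each system pair to "same" or to the significantly better system.

    `error_rates` maps system id -> error rate (%); `p_values` maps a
    frozenset of two system ids -> p-value of one paired-comparison test.
    """
    systems = sorted(error_rates)
    matrix = {}
    for i, a in enumerate(systems):
        for b in systems[i + 1:]:
            if p_values[frozenset((a, b))] >= alpha:
                matrix[(a, b)] = "same"  # null hypothesis not rejected
            else:
                matrix[(a, b)] = a if error_rates[a] < error_rates[b] else b
    return matrix


# Hypothetical systems, error rates, and p-values.
error_rates = {"sysA": 6.0, "sysB": 9.0, "sysC": 9.5}
p_values = {
    frozenset(("sysA", "sysB")): 0.01,
    frozenset(("sysA", "sysC")): 0.001,
    frozenset(("sysB", "sysC")): 0.40,
}
for pair, cell in comparison_matrix(error_rates, p_values).items():
    print(pair, cell)
```

One such matrix is produced per significance test, so a composite report for four tests carries four entries per pair of systems.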
For this test set, recall that the CMU system (here identified 
as cmu1-a) had a word error rate of 6.9%. By comparing the 
results for the CMU system with the other 5 systems report- 
ing baseline results, note that the significance test results all 
indicate that the null hypothesis is not valid. In other words, 
the error rates for the CMU system are significantly differ- 
ent (lower) than those for the other 5 systems for this test set 
and baseline conditions. 
In general, for this test set, with 12 speakers and 310 utter- 
ances, the Wilcoxon signed rank test (WI) is more sensitive 
than the (ordinary) sign test (SI). As noted in previous tests, 
the McNemar test (MN), operating on the sentence error 
rate, is in general less sensitive than the matched-pair-sen- 
tence segment word error rate test (MAPSSWE). 
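A rough sketch of the two utterance- and segment-level tests may help explain the difference in sensitivity. It is a simplified textbook formulation, not the NIST implementation: the McNemar test is computed exactly from the utterances on which only one system is correct, and the MAPSSWE test is approximated by a normal test on per-segment differences in word error counts, assuming the segmentation has already been done and the number of segments is large. The function names and data are hypothetical.

```python
from math import comb, erfc, sqrt
from statistics import mean, stdev


def mcnemar_p(a_correct, b_correct):
    """Two-sided exact McNemar test on per-utterance correctness flags."""
    only_a = sum(a and not b for a, b in zip(a_correct, b_correct))
    only_b = sum(b and not a for a, b in zip(a_correct, b_correct))
    n = only_a + only_b  # utterances on which the systems disagree
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mapsswe_p(errors_a, errors_b):
    """Two-sided normal-approximation test on per-segment error differences."""
    diffs = [x - y for x, y in zip(errors_a, errors_b)]
    spread = stdev(diffs)
    if spread == 0:
        return 1.0 if mean(diffs) == 0 else 0.0
    z = mean(diffs) / (spread / sqrt(len(diffs)))
    return erfc(abs(z) / sqrt(2))


# Hypothetical data: system A is correct on 6 utterances that B gets wrong.
a_ok = [True] * 10
b_ok = [True] * 4 + [False] * 6
print(mcnemar_p(a_ok, b_ok))
# Hypothetical word error counts in 10 matched segments.
print(mapsswe_p([1, 2] * 5, [0, 0] * 5))
```

The McNemar test sees only whether an utterance is entirely correct, whereas the segment-level test counts every word error, which is consistent with the greater sensitivity noted above.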
8. ATIS TESTS: NEW CONDITIONS 
Within the community of ATIS system developers, there is 
a continuing search for evaluation methodologies to com- 
plement the current evaluation methodology. In particular 
there is a recognized need for evaluation methodologies 
that can be shown to correlate well with expected performance
of the technology in applications. Toward the end of 
1992, several sites participated in an experimental "end-to- 
end" evaluation to assess systems in an interactive form. 
The end-to-end evaluation included (1) objective measures 
such as timing information and time to task completion, (2) 
human-derived judgements on correctness of system 
answers and user solutions, and (3) a user satisfaction questionnaire.
The results of this "dry run" complementary evaluation
experiment are reported by Hirschman et al. in \[15\]. 
9. ATIS TEST MATERIAL 
Test material for the ATIS benchmark tests consisted of 
1002 queries, for 118 subject-scenarios, involving 37 sub- 
jects. It was selected by NIST from set-aside material 
drawn from data previously collected within the MAD- 
COW community at AT&T, BBN, CMU, MIT/LCS, and 
SRI. The selection and composition of this test material is 
described in more detail in \[15\]. 
As in previous years, queries were categorized into two categories
of "answerable" queries, Class A, which are context-independent,
and Class D, which are context-dependent;
and "unanswerable", or Class X queries. In the final 
adjudicated test set, there were a total of 427 Class A que- 
ries, 247 Class D queries, and 328 Class X queries. 
10. ATIS TEST PROTOCOLS 
As was the case for the speech recognition benchmark tests, 
ATIS test protocols were similar to prior ATIS benchmark 
tests. The test material was shipped to the participating sites 
on October 20th, results were reported on Nov. 16th, and 
NIST reported preliminary scored results via ftp to the par- 
ticipants on Nov. 20th. After the process of formal "adjudi- 
cation" had taken place, official results were reported on 
Dec. 20th. 
11. ATIS SCORING AND ADJUDICATION 
After the preliminary scoring results were distributed, the 
participating sites were invited to send requests for adjudi- 
cation ("bug reports") to NIST, asking for changes in the 
scoring of specific queries. A total of 146 of these bug 
reports were adjudicated by NIST and SRI jointly. Since 
many of these requests for adjudication were duplicates, the 
number of distinct problems reported was less than 100. A 
decision was made on each request for adjudication and the 
corrected reference material or procedure was used in a 
final adjudicated re-run of the evaluation. The judgment 
was in favor of the plaintiff in approximately 2/3 of the 
cases. 
A number of problems uncovered by this procedure were 
systematic, in that the same root problem affected several 
different queries. Most of these were simply human error, 
which can be made less likely in the future by working less 
hectically and making software to double-check the test 
material. 
The major problem that cannot be attributed to just human 
error is that of transcribing and scoring correctly speech that 
is difficult to hear and understand. Some of this speech was 
"sotto voce"; some was mispronounced; some was trun- 
cated; and in some cases the phonetic transcription would 
have been unproblematical but division into lexical words 
was unclear, as in some contractions and compound words. 
The short-term solution adopted was just to make our best 
judgement on orthographic transcription, considering both 
acoustics and higher-level language modeling. But a better 
long-term cure is to make and use transcriptions that can 
indicate alternatives when the word spoken is uncertain; 
proposals to this effect are being considered by relevant 
committees. 
12. ATIS TEST PARTICIPANTS 
Participants in these ATIS tests included the following 
DARPA contractors: BBN, CMU, MIT Laboratory for 
Computer Science (MIT/LCS), and SRI. There were sev- 
eral "volunteers": AT&T Bell Laboratories \[16\], who have 
participated in previous years; Paramax \[17\], not a DARPA 
contractor at the time of these tests, but who have also par- 
ticipated in prior years' tests; and two participants from 
Canada, CRIM and INRS. A total of 8 system developers 
participated in some of the tests (i.e., the NL tests). 
13. ATIS BENCHMARK TEST RESULTS 
13.1. ATIS SPontaneous speech RECognition 
Tests (SPREC) 
Table 3 presents the results for the SPREC tests for all sys- 
tems and all subsets of the data. For the interesting case of 
the subset of all answerable queries, Class A+D, the word 
error rate ranged from 4.3% to 100%. The lowest value was 
reported by BBN \[18,19\], and the value of 100% was 
reported by INRS, for an incomplete ATIS system that (in 
effect) rejected every utterance, resulting in a scored word 
deletion error of 100%. 
Table 4 presents a matrix tabulation of ATIS SPREC results 
for the set of answerable queries, Class A+D. This form of 
matrix tabulation is discussed in \[2\] for the February 1992 
test results. Considerable variability can be noted for the 
performance of some systems on "local data", and there are 
indications of varying degree of difficulty for the subsets 
collected at different sites. As in the Feb.'92 test set, participants
noted the presence of more disfluencies in the AT&T 
data than for other originating sites. 
Word error rates for the "volunteers" in these tests (AT&T, 
CRIM and INRS) are in general higher than for DARPA 
contractors, perhaps reflecting a reduced level-of-effort, rel- 
ative to "funded" efforts. 
Table 5 presents the results, in a matrix form, of 4 paired- 
comparison significance tests for the 7 SPREC systems for 
the Class A+D subset. 
For this test set, recall that the BBN system (here identified 
as bbn2a_d) had a word error rate of 4.3%. By comparing 
the results for this BBN system with the other 6 ATIS 
SPREC systems, note that the null hypothesis is not valid 
for all 4 significance tests for the comparisons with the 
AT&T, CRIM, INRS, MIT/LCS and SRI systems. In other 
words, the differences in performance are significant. How- 
ever, when comparing the BBN and CMU SPREC systems, 
the null hypothesis is valid for 3 of the 4 tests. Thus, as was 
the case for the WSJ-CSR data, the performance differ- 
ences, in this case for ATIS spontaneous speech, between 
the CMU and BBN speech recognition systems are very 
small. 
13.2. Natural Language Understanding Tests 
(NL) 
Table 6 presents a tabulation of the results for the NL tests 
for all systems and the "answerable" ATIS queries, Class 
A+D, as well as the subsets, Class A and Class D. 
For the set of answerable queries, Class A+D, the weighted 
error ranges from 101.5% to 12.3%. For the Class A que- 
ries, the range is from 79.9% to 12.2%. And for the Class D 
queries, the range is from 138.9% to 12.6%. In each case, 
the lowest weighted error rate was reported by the CMU 
system \[20\]. 
Note that in general performance is considerably worse for 
Class D than for Class A. However, for the CMU and MIT/ 
LCS \[21\] systems, performance for the Class D test mate- 
rial is comparable to that for Class A. These systems would 
appear to have superior procedures for handling context. 
Table 7 presents a matrix tabulation of the NL results for 
the several subsets of test material. Note, however, that 
since the differences in performance between DARPA-con- 
tractor-developed systems and those of "volunteers", in 
general, are significant, the column averages presented in 
this table are not very informative. 
Of the 3 CRIM systems, the best performing one (crim3) is 
one using neural networks to classify each query into 1 of 
10 classes based on relation names in the underlying ATIS 
relational database, with subsequent use of specific parsers 
built for each class and another parser that determines the 
constraints \[22\]. 
There are two SRI NL systems \[23\]. The SRI NL-TM system,
here designated sri1, uses template matching to generate
database queries. The other SRI system, termed the 
"Gemini+TM ATIS System" by SRI, and here designated 
sri2, is an integration of SRI's unification-based natural-lan- 
guage processing system and the Template Matcher. Differ- 
ences in performance do not appear to be pronounced. 
As in previous ATIS NL tests, it is important to note that 
appropriate tests of statistical significance have not yet been 
developed for ATIS NL tests. Small differences in weighted 
error rate are probably of no significance. However, large, 
systematic, differences are noteworthy, even if of unknown 
statistical significance. The weighted error rates for the 
CMU NL system, which are in many cases approximately 
one-half those of the next best systems, are certainly note- 
worthy. 
13.3. Spoken Language System Understanding 
(SLS) 
Table 8 presents a tabulation of the results for the SLS tests 
for all systems and the "answerable" ATIS queries, Class 
A+D, as well as the subsets, Class A and Class D. 
For the set of answerable queries, Class A+D, the weighted 
error ranges from 100% to 21.6%. For the Class A queries, 
the range is from 100% to 19.7%. And for the Class D que- 
ries, the range is from 140.1% to 23.9%. As in the case of 
the NL test results, and in each case, the lowest weighted 
error rate was reported for the CMU system. 
The INRS data signify 100% usage of the No_Answer 
option, since the INRS SPREC system provided null 
hypothesis strings, causing the NL component to return the 
No_Answer response. 
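The weighted error measure is not defined in this paper. The sketch below assumes the convention used in the ATIS evaluations, in which a false answer is penalized twice as heavily as a No_Answer response; under that assumption a system that declines to answer every query scores exactly 100%, and values above 100% are possible. The function name is hypothetical; the query count of 674 is the sum of the 427 Class A and 247 Class D queries in the adjudicated test set.

```python
def weighted_error(n_true, n_false, n_no_answer):
    """Weighted error (%) assuming false answers count double."""
    total = n_true + n_false + n_no_answer
    return 100.0 * (2 * n_false + n_no_answer) / total


# A system that returns No_Answer for all 674 answerable queries.
print(weighted_error(0, 0, 674))
# A system that answers every query incorrectly reaches the upper bound.
print(weighted_error(0, 674, 0))
```

The asymmetric penalty reflects the view that a wrong answer is worse for a user than no answer at all.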
Note again that the CMU and MIT/LCS systems both han- 
dle context sensitivity well. 
Table 9 presents a matrix tabulation of the SLS results for 
the several subsets of test material. 
For the ATIS SLS with lowest overall weighted error rate 
(21.6%), the cmu1 system, there is an almost ten-fold range 
in error rate over the several test subsets: from 37.1%, for 
the AT&T subset, to 3.9% for the SRI subset. The CMU 
SLS weighted error rates for Class A+D are approximately 
two-thirds those of the next-best-performing systems, 
although for the Class A subset, differences in performance 
between the CMU system and the BBN and SRI systems 
are less pronounced. 
14. ACKNOWLEDGEMENT 
At NIST, our colleague Nancy Dahlgren contributed signif- 
icantly to the DARPA ATIS community and had a major 
role in annotating data and implementing "bug fixes" in collaboration
with the SRI annotation group and others. Nancy 
was severely injured in an automobile accident in Novem- 
ber, 1992, and is undergoing rehabilitation therapy for treat- 
ment of head trauma. It is an understatement to say that we 
miss her very much. 
Brett Tjaden also assisted us at NIST in preparing test material
and in other ways. 
The cooperation of the many participants in the DARPA 
data and test infrastructure -- typically several individuals at 
each site -- is gratefully acknowledged. 
References 
1. Pallett, D.S., "DARPA February 1992 Pilot Corpus CSR 
'Dry Run' Benchmark Test Results", in Proceedings of 
Speech and Natural Language Workshop, February 1992 
(M. Marcus, ed.) ISBN 1-55860-272-0, Morgan Kaufmann 
Publishers, Inc., pp. 382-386. 
2. Pallett, D.S., et al., "DARPA February 1992 ATIS 
Benchmark Test Results", in Proceedings of Speech and 
Natural Language Workshop, February 1992 (M. Marcus, 
ed.) ISBN 1-55860-272-0, Morgan Kaufmann Publishers, 
Inc., pp. 15-27. 
3. Taylor, B.N. and Kuyatt, C.E., "Guidelines for Evaluat- 
ing and Expressing the Uncertainty of NIST Measurement 
Results", NIST Technical Note 1297, January 1993. 
4. Pallett, D.S. "Performance Assessment of Automatic 
Speech Recognizers", J. Res. National Bureau of Standards, 
Volume 90, #5, Sept.-Oct. 1985, pp. 371-387. 
5. Paul, D.B. and Necioglu, B.F., "The Lincoln Large- 
Vocabulary Stack-Decoder HMM CSR", Proceedings of 
ICASSP'93. 
6. Gauvain, J.L., et al., "LIMSI Nov92 Evaluation", Oral 
Presentation at the Spoken Language Systems Technology Workshop, January 
20-22, 1993, Cambridge, MA. 
7. Huang, X., et al., "The SPHINX-II Speech Recognition 
System: An Overview", Computer Speech and Language, 
in press (1993). 
8. Alleva, F., et al., "An Improved Search Algorithm for 
Continuous Speech Recognition", Proceedings of 
ICASSP'93. 
9. Hwang, M.Y., et al., "Predicting Unseen Triphones with 
Senones", Proceedings of ICASSP'93. 
10. Liu, F.-H., et al., "Efficient Cepstral Normalization for 
Robust Speech Recognition", in Proceedings of the Human 
Language Technology Workshop, March 1993 (M. Bates, 
ed.) Morgan Kaufmann Publishers, Inc. 
11. Murveit, H., et al., "Large-Vocabulary Dictation using 
SRI's DECIPHER (tm) Speech Recognition System: Pro- 
gressive Search Techniques", Proceedings of ICASSP'93. 
12. Murveit, H., et al., "Progressive-search Algorithms for 
Large Vocabulary Speech Recognition", in Proceedings of 
the Human Language Technology Workshop, March 1993 
(M. Bates, ed.) Morgan Kaufmann Publishers, Inc. 
13. Roth, R., et al., "Large Vocabulary Continuous Speech 
Recognition of Wall Street Journal Data", Proceedings of 
ICASSP'93. 
14. Schwartz, R., et al., "Comparative Experiments on 
Large Vocabulary Speech Recognition", in Proceedings of 
the Human Language Technology Workshop, March 1993 
(M. Bates, ed.) Morgan Kaufmann Publishers, Inc. 
15. Hirschman, L., et al., "Multi-Site Data Collection and 
Evaluation in Spoken Language Understanding", in Pro- 
ceedings of the Human Language Technology Workshop, 
March 1993 (M. Bates, ed.) Morgan Kaufmann Publishers, 
Inc. 
16. Tzoukermann, E., (Untitled) Oral Presentation at the 
Spoken Language Systems Technology Workshop, January 
20-22, 1993, Cambridge, MA. 
17. Linebarger, M.C., Norton, L.M. and Dahl, D.A., "A por- 
table approach to last resort parsing and interpretation", in 
Proceedings of the Human Language Technology Work- 
shop, March 1993 (M. Bates, ed.) Morgan Kaufmann Pub- 
lishers, Inc. 
18. Bates, M., et al., "Design and Performance of HARC, 
the BBN Spoken Language Understanding System", Pro- 
ceedings of ICSLP-92, Banff, Alberta, Canada, October, 
1992. 
19. Bates, M., et al., "The BBN/HARC Spoken Language 
Understanding System", Proceedings of ICASSP'93. 
20. Ward, W. and Issar, S., "CMU ATIS Benchmark Evalu- 
ation", Oral Presentation at the Spoken Language Systems 
Technology Workshop, January 20-22, 1993, Cambridge, 
MA. 
21. Glass, et al., "The MIT ATIS System: January 1993 
Progress Report", Oral Presentation at the Spoken Lan- 
guage Systems Technology Workshop, January 20-22, 
1993, Cambridge, MA. 
22. Cardin, R., et al., "CRIM's Speech Understanding Sys- 
tem for the ATIS Task", Oral Presentation at the Spoken 
Language Systems Technology Workshop, January 20-22, 
1993, Cambridge, MA. 
23. Dowding, J., et al., "Gemini: A Natural Language Sys- 
tem for Spoken-Language Understanding", in Proceedings 
of the Human Language Technology Workshop, March 
1993 (M. Bates, ed.) Morgan Kaufmann Publishers, Inc. 
I. Longitudinal Speaker Dependent Tests

a. LSD EVL 20K NVP Test Set
Systems     W.Err   U.Err   IDENTIFIER
mit_ll4-h    14.6    78.2   LL NOV92 CSR LSD 20K CLOSED NVP BIGRAM
mit_ll5-h    11.2    71.8   LL NOV92 CSR LSD 20K CLOSED NVP TRIGRAM

b. LSD EVL 20K VP Test Set
Systems     W.Err   U.Err   IDENTIFIER
mit_ll4-i    11.6    70.7   LL NOV92 CSR LSD 20K CLOSED VP BIGRAM
mit_ll5-i     7.6    56.0   LL NOV92 CSR LSD 20K CLOSED VP TRIGRAM

c. LSD EVL 5K NVP Test Set
Systems     W.Err   U.Err   IDENTIFIER
mit_ll4-f     8.3    62.5   LL NOV92 CSR LSD 5K CLOSED NVP BIGRAM
mit_ll5-f     5.6    48.8   LL NOV92 CSR LSD 5K CLOSED NVP TRIGRAM

d. LSD EVL 5K VP Test Set
Systems     W.Err   U.Err   IDENTIFIER
mit_ll4-g     6.7    68.1   LL NOV92 CSR LSD 5K CLOSED VP BIGRAM
mit_ll5-g     4.5    44.4   LL NOV92 CSR LSD 5K CLOSED VP TRIGRAM

II. Speaker Dependent Tests

a. SD EVL 5K NVP Test Set
Systems     W.Err   U.Err   IDENTIFIER
bbn2-e        8.2    54.5   BBN NOV92 CSR BYBLOS SD-600 5K BIGRAM
bbn3-e        6.1    44.5   BBN NOV92 CSR BYBLOS SD-600 5K TRIGRAM

III. Speaker Independent Tests: Read Speech

a. SI EVL 20K NVP Test Set (Baseline Tests)
Systems     W.Err   U.Err   IDENTIFIER
bbn1-d       16.7    81.1   BBN NOV92 CSR BYBLOS SI-12 20K BIGRAM BASELINE
cmu1-d       15.2    79.0   CMU NOV92 CSR SPHINX-II SI-84 20K BASELINE
dragon3-d    25.0    86.8   DRAGON NOV92 CSR MULTIPLE SI-12 20K NVP BASELINE
mit_ll1-d    25.2    88.0   LL NOV92 CSR SI-84 20K OPEN NVP BIGRAM BASELINE

SI EVL 20K NVP Test Set (Non-Baseline Tests)
Systems     W.Err   U.Err   IDENTIFIER
bbn3-d       14.8    75.7   BBN NOV92 CSR BYBLOS SI-12 20K TRIGRAM
cmu2-d       12.8    71.8   CMU NOV92 CSR SPHINX-II SI-84 20K TRIGRAM
dragon1-d    24.8    87.4   DRAGON NOV92 CSR GD SI-12 20K NVP
dragon2-d    27.8    87.4   DRAGON NOV92 CSR GI SI-12 20K NVP
mit_ll3-d    19.4    84.1   LL NOV92 CSR SI-84 20K OPEN NVP TRIGRAM ADAPTIVE

b. SI EVL 5K NVP Test Set (Baseline Tests)
Systems     W.Err   U.Err   IDENTIFIER
bbn1-a        8.7    63.6   BBN NOV92 CSR BYBLOS SI-12 5K BIGRAM BASELINE
cmu1-a        6.9    57.6   CMU NOV92 CSR SPHINX-II SI-84 5K BASELINE
dragon3-a    14.1    78.2   DRAGON NOV92 CSR MULTIPLE SI-12 5K NVP BASELINE
limsi1-a      9.7    64.5   LIMSI NOV92 CSR SI-84 5K-NVP BASELINE
mit_ll1-a    15.0    78.2   LL NOV92 CSR SI-84 5K CLOSED NVP BIGRAM BASELINE
sri1-a       13.0    73.9   SRI NOV92 CSR DECIPHER(TM) SI-84 BIGRAM BASELINE

SI EVL 5K NVP Test Set (Non-Baseline Tests)
Systems     W.Err   U.Err   IDENTIFIER
bbn3-a        7.3    53.0   BBN NOV92 CSR BYBLOS SI-12 5K TRIGRAM
cmu2-a        5.3    45.2   CMU NOV92 CSR SPHINX-II SI-84 5K TRIGRAM
cmu3-a        8.1    63.0   CMU NOV92 SPHINX-IIA MFCDCN W/O COMP CSR SI-84 5K NVP
cmu4-a        9.4    67.9   CMU NOV92 SPHINX-IIA MFCDCN W/ COMP CSR SI-84 5K NVP
cmu5-a        8.4    63.0   CMU NOV92 SPHINX-IIA CDCN W/O COMP CSR SI-84 5K NVP
cmu6-a        8.1    65.2   CMU NOV92 SPHINX-IIA CDCN W COMP CSR SI-84 5K NVP
dragon1-a    13.6    76.7   DRAGON NOV92 CSR GD SI-12 5K NVP
dragon2-a    16.8    76.4   DRAGON NOV92 CSR GI SI-12 5K NVP
mit_ll2-a    10.5    61.2   LL NOV92 CSR SI-84 5K CLOSED NVP TRIGRAM
mit_ll3-a     9.1    56.7   LL NOV92 CSR SI-84 5K CLOSED NVP TRIGRAM ADAPTIVE

c. SI EVL 5K NVP OTHER MICROPHONE Test Set
Systems     W.Err   U.Err   IDENTIFIER
cmu3-c       38.5    88.2   CMU NOV92 SPHINX-IIA MFCDCN W/O COMP CSR SI-84 5K NVP
cmu4-c       17.7    75.8   CMU NOV92 SPHINX-IIA MFCDCN W/ COMP CSR SI-84 5K NVP
cmu5-c       38.9    87.3   CMU NOV92 SPHINX-IIA CDCN W/O COMP CSR SI-84 5K NVP
cmu6-c       19.3    77.9   CMU NOV92 SPHINX-IIA CDCN W COMP CSR SI-84 5K NVP
sri1-c       27.3    87.6   SRI NOV92 CSR DECIPHER(TM) SI-84 BIGRAM BASELINE

d. SI EVL 5K VP Test Set
Systems     W.Err   U.Err   IDENTIFIER
limsi1-b      7.8    58.9   LIMSI NOV92 CSR SI-84 5K-VP

IV. Speaker Independent Tests: Spontaneous Speech

a. SI SPONTANEOUS DICTATION NVP Test Set
Systems     W.Err   U.Err   IDENTIFIER
bbn2-j       26.5    94.1   BBN NOV92 CSR BYBLOS SI-12 SPON BIGRAM
bbn3-j       24.9    93.4   BBN NOV92 CSR BYBLOS SI-12 SPON TRIGRAM

Table 1: WSJ-CSR Benchmark Test Results
Composite Report of All Significance Tests 
For the WSJ-CSR Nov 92 SI 5K NVP Baseline (Bigram) Test 
Test Name Abbrev. 
............................................................ 
Matched Pair Sentence Segment (Word Error) Test MP 
Signed Paired Comparison (Speaker Word Accuracy) Test SI 
Wilcoxon Signed Rank (Speaker Word Accuracy) Test WI 
McNemar (Sentence Error) Test MN 
          | bbn1-a | cmu1-a      | dragon3-a   | limsi1-a    | mit_ll1-a   | sri1-a
----------+--------+-------------+-------------+-------------+-------------+------------
bbn1-a    |        | MP cmu1-a   | MP bbn1-a   | MP same     | MP bbn1-a   | MP bbn1-a
          |        | SI cmu1-a   | SI bbn1-a   | SI same     | SI bbn1-a   | SI bbn1-a
          |        | WI cmu1-a   | WI bbn1-a   | WI same     | WI bbn1-a   | WI bbn1-a
          |        | MN cmu1-a   | MN bbn1-a   | MN same     | MN bbn1-a   | MN bbn1-a
----------+--------+-------------+-------------+-------------+-------------+------------
cmu1-a    |        |             | MP cmu1-a   | MP cmu1-a   | MP cmu1-a   | MP cmu1-a
          |        |             | SI cmu1-a   | SI cmu1-a   | SI cmu1-a   | SI cmu1-a
          |        |             | WI cmu1-a   | WI cmu1-a   | WI cmu1-a   | WI cmu1-a
          |        |             | MN cmu1-a   | MN cmu1-a   | MN cmu1-a   | MN cmu1-a
----------+--------+-------------+-------------+-------------+-------------+------------
dragon3-a |        |             |             | MP limsi1-a | MP same     | MP same
          |        |             |             | SI limsi1-a | SI same     | SI same
          |        |             |             | WI limsi1-a | WI same     | WI same
          |        |             |             | MN limsi1-a | MN same     | MN same
----------+--------+-------------+-------------+-------------+-------------+------------
limsi1-a  |        |             |             |             | MP limsi1-a | MP limsi1-a
          |        |             |             |             | SI limsi1-a | SI same
          |        |             |             |             | WI limsi1-a | WI limsi1-a
          |        |             |             |             | MN limsi1-a | MN limsi1-a
----------+--------+-------------+-------------+-------------+-------------+------------
mit_ll1-a |        |             |             |             |             | MP sri1-a
          |        |             |             |             |             | SI same
          |        |             |             |             |             | WI sri1-a
          |        |             |             |             |             | MN same
----------+--------+-------------+-------------+-------------+-------------+------------
sri1-a    |        |             |             |             |             |
Table 2: Significance Test Results: Baseline Tests Using the 5K NVP Test Set
(See text for explanation of format)
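To illustrate how a single MN entry in a matrix like Table 2 could be decided, the following is a minimal sketch of an exact two-sided McNemar test on sentence errors. It is not the NIST implementation: the function names, the 0.05 significance level, and the counts in the usage lines are all assumptions made for illustration. The test looks only at the sentences that exactly one of the two systems recognized incorrectly.

```python
from math import comb


def mcnemar_exact_p(n_only_a_wrong: int, n_only_b_wrong: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant counts.

    n_only_a_wrong: sentences system A got wrong and system B got right.
    n_only_b_wrong: sentences system B got wrong and system A got right.
    Under the null hypothesis each discordant sentence is a fair coin flip.
    """
    n = n_only_a_wrong + n_only_b_wrong
    if n == 0:
        return 1.0
    k = min(n_only_a_wrong, n_only_b_wrong)
    # Binomial(n, 0.5) probability of a split at least as lopsided as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mn_verdict(name_a: str, name_b: str,
               n_only_a_wrong: int, n_only_b_wrong: int,
               alpha: float = 0.05) -> str:
    """Return the better system's name, or "same" if not significant.

    alpha is a hypothetical threshold chosen for this sketch.
    """
    p = mcnemar_exact_p(n_only_a_wrong, n_only_b_wrong)
    if p >= alpha:
        return "same"
    return name_a if n_only_a_wrong < n_only_b_wrong else name_b


# Hypothetical discordant counts, not taken from the benchmark data.
print(mn_verdict("sysA", "sysB", 40, 15))
print(mn_verdict("sysA", "sysB", 20, 22))
```

The verdict strings mirror the cell contents of the matrix: the name of the system judged better, or "same" when the difference is not significant at the chosen level.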
Nov92 ATIS SPREC Test Results

Class A+D+X Subset
System        W.Err  Corr   Sub   Del   Ins U.Err # Utt.  Description
att2-adx       11.7  90.8   6.8   2.4   2.5  52.4    967  ATT Nov 92 SPREC Results
bbn2-adx        7.6  94.2   4.2   1.6   1.8  35.6    967  BBN Nov 92 SPREC Results
cmu2-adx        8.3  92.9   4.2   2.9   1.2  38.3    967  CMU Nov 92 SPREC Results
crim4-adx      19.3  84.1  12.1   3.8   3.4  64.1    967  CRIM Nov 92 SPREC Results
inrs2-adx     100.0   0.0   0.0 100.0   0.0 100.0    967  INRS Late Nov 92 SPREC Results
mit_lcs2-adx   12.6  89.8   7.3   2.9   2.4  47.8    967  MIT-LCS Nov 92 SPREC Results
sri3-adx        9.1  93.2   5.4   1.4   2.3  43.3    967  SRI Nov 92 SPREC Results

Class A+D Subset
System        W.Err  Corr   Sub   Del   Ins U.Err # Utt.  Description
att2-a_d        8.4  93.6   4.6   1.8   2.0  44.7    674  ATT Nov 92 SPREC Results Class A+D
bbn2-a_d        4.3  96.7   2.5   0.9   0.9  25.2    674  BBN Nov 92 SPREC Results Class A+D
cmu2-a_d        4.7  96.0   2.8   1.2   0.7  28.9    674  CMU Nov 92 SPREC Results Class A+D
crim4-a_d      14.1  88.7   8.4   2.9   2.8  56.4    674  CRIM Nov 92 SPREC Results Class A+D
inrs2-a_d     100.0   0.0   0.0 100.0   0.0 100.0    674  INRS Late Nov 92 SPREC Results Class A+D
mit_lcs2-a_d    8.1  93.3   4.5   2.2   1.4  37.8    674  MIT-LCS Nov 92 SPREC Results Class A+D
sri3-a_d        5.7  95.7   3.5   0.9   1.4  33.8    674  SRI Nov 92 SPREC Results Class A+D

Class A Subset
System        W.Err  Corr   Sub   Del   Ins U.Err # Utt.  Description
att2-a          8.0  93.8   4.4   1.8   1.8  45.4    427  ATT Nov 92 SPREC Results Class A
bbn2-a          4.0  96.7   2.3   1.0   0.8  25.3    427  BBN Nov 92 SPREC Results Class A
cmu2-a          4.4  96.1   2.7   1.2   0.5  30.7    427  CMU Nov 92 SPREC Results Class A
crim4-a        13.5  88.9   8.0   3.1   2.4  57.8    427  CRIM Nov 92 SPREC Results Class A
inrs2-a       100.0   0.0   0.0 100.0   0.0 100.0    427  INRS Late Nov 92 SPREC Results Class A
mit_lcs2-a      7.8  93.5   4.4   2.2   1.3  38.2    427  MIT-LCS Nov 92 SPREC Results Class A
sri3-a          5.2  96.0   3.2   0.9   1.1  34.2    427  SRI Nov 92 SPREC Results Class A

Class D Subset
System        W.Err  Corr   Sub   Del   Ins U.Err # Utt.  Description
att2-d          9.2  93.2   5.0   1.7   2.4  43.3    247  ATT Nov 92 SPREC Results Class D
bbn2-d          4.8  96.5   2.8   0.7   1.3  25.1    247  BBN Nov 92 SPREC Results Class D
cmu2-d          5.4  95.7   3.2   1.1   1.1  25.9    247  CMU Nov 92 SPREC Results Class D
crim4-d        15.4  88.2   9.4   2.4   3.6  53.8    247  CRIM Nov 92 SPREC Results Class D
inrs2-d       100.0   0.0   0.0 100.0   0.0 100.0    247  INRS Late Nov 92 SPREC Results Class D
mit_lcs2-d      8.9  92.9   5.0   2.1   1.8  37.2    247  MIT-LCS Nov 92 SPREC Results Class D
sri3-d          7.1  95.0   4.1   0.8   2.1  33.2    247  SRI Nov 92 SPREC Results Class D

Class X Subset
System        W.Err  Corr   Sub   Del   Ins U.Err # Utt.  Description
att2-x         18.5  85.1  11.3   3.6   3.5  70.3    293  ATT Nov 92 SPREC Results Class X
bbn2-x         14.5  89.2   7.8   3.0   3.7  59.0    293  BBN Nov 92 SPREC Results Class X
cmu2-x         15.6  86.6   7.0   6.5   2.2  59.7    293  CMU Nov 92 SPREC Results Class X
crim4-x        30.1  74.7  19.7   5.6   4.8  81.6    293  CRIM Nov 92 SPREC Results Class X
inrs2-x       100.0   0.0   0.0 100.0   0.0 100.0    293  INRS Late Nov 92 SPREC Results Class X
mit_lcs2-x     21.7  82.6  12.9   4.6   4.2  70.6    293  MIT-LCS Nov 92 SPREC Results Class X
sri3-x         15.8  88.1   9.4   2.4   4.0  64.8    293  SRI Nov 92 SPREC Results Class X

Table 3: ATIS SPREC Benchmark Test Results
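The W. Err figures in Table 3 are consistent, to within rounding, with the usual definition of word error as the sum of the substitution, deletion, and insertion rates. The short check below is a sketch under that assumed definition (this section does not state the formula); the function name `word_error` is hypothetical, and the rows are three Class A+D entries copied from the table.

```python
def word_error(sub: float, dele: float, ins: float) -> float:
    """Word error (%) as Sub + Del + Ins, rounded to one decimal place.

    The additive definition is an assumption that matches the tabulated
    values; it is not quoted from the scoring software.
    """
    return round(sub + dele + ins, 1)


# (Sub, Del, Ins, reported W.Err) for three Class A+D systems in Table 3.
rows = {
    "att2-a_d": (4.6, 1.8, 2.0, 8.4),
    "bbn2-a_d": (2.5, 0.9, 0.9, 4.3),
    "cmu2-a_d": (2.8, 1.2, 0.7, 4.7),
}

for name, (sub, dele, ins, reported) in rows.items():
    computed = word_error(sub, dele, ins)
    # Allow 0.1 of slack: the tabulated components are themselves rounded.
    status = "ok" if abs(computed - reported) <= 0.1 else "MISMATCH"
    print(f"{name}: computed {computed:.1f}, reported {reported:.1f} ({status})")
```

A tolerance of 0.1 is used because each published component is already rounded, so the sum of rounded components can differ from the rounded total in the last digit.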
Nov92 ATIS SPREC Test Results
Class A+D Subset: Originating Site of Test Data

System    | ATT (89 Utt.)     | BBN (124 Utt.)    | CMU (142 Utt.)    | MIT (167 Utt.)    | SRI (152 Utt.)    || Overall Totals    | Foreign Coll.
          |                   |                   |                   |                   |                   || (674 Utt.)        | Site Totals
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
att2      |   8.7   3.4   3.0 |   7.7   1.9   2.1 |   1.8   2.2   3.0 |   3.9   1.1   0.9 |   1.8   0.8   1.2 ||   4.6   1.8   2.0 |   3.9   1.5   1.8
          |  15.1  74.2       |  11.7  58.1       |   7.0  44.4       |   5.9  34.1       |   3.8  28.3       ||   8.4  44.7       |   7.1  40.2
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
bbn2      |   4.7   1.7   1.9 |   4.2   1.4   0.7 |   1.5   0.8   1.2 |   1.8   0.3   0.6 |   0.5   0.4   0.4 ||   2.5   0.9   0.9 |   2.0   0.7   1.0
          |   8.4  50.6       |   6.3  34.7       |   3.5  22.5       |   2.8  21.6       |   1.3   9.2       ||   4.3  25.2       |   3.7  23.1
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
cmu2      |   5.8   2.6   1.3 |   4.1   1.4   0.7 |   1.4   1.4   0.9 |   1.6   0.6   0.3 |   2.0   0.4   0.6 ||   2.8   1.2   0.7 |   3.2   1.1   0.6
          |   9.7  57.3       |   6.1  39.5       |   3.7  21.8       |   2.5  19.2       |   3.0  21.1       ||   4.7  28.9       |   5.0  30.8
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
crim4     |  14.1   4.4   5.5 |  12.9   5.1   3.1 |   4.7   1.5   2.1 |   6.8   2.2   1.5 |   4.9   1.4   2.5 ||   8.4   2.8   2.8 |   8.4   2.8   2.8
          |  24.0  86.5       |  21.1  74.2       |   8.4  38.7       |  10.5  55.1       |   8.8  42.1       ||  14.1  56.4       |  14.1  56.4
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
inrs2     |   0.0 100.0   0.0 |   0.0 100.0   0.0 |   0.0 100.0   0.0 |   0.0 100.0   0.0 |   0.0 100.0   0.0 ||   0.0 100.0   0.0 |   0.0 100.0   0.0
          | 100.0 100.0       | 100.0 100.0       | 100.0 100.0       | 100.0 100.0       | 100.0 100.0       || 100.0 100.0       | 100.0 100.0
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
mit_lcs2  |   8.9   3.5   3.5 |   6.8   2.8   1.8 |   4.4   2.3   1.5 |   1.7   1.3   0.3 |   2.3   1.2   0.8 ||   4.5   2.2   1.4 |   5.4   2.4   1.8
          |  15.9  57.3       |  11.4  54.0       |   8.2  43.0       |   3.3  19.2       |   4.2  28.9       ||   8.1  37.8       |   9.7  44.0
----------+-------------------+-------------------+-------------------+-------------------+-------------------++-------------------+-------------------
sri3      |   4.9   1.5   3.4 |   5.8   1.4   1.2 |   3.2   1.0   1.7 |   2.3   0.3   0.8 |   1.4   0.3   0.7 ||   3.5   0.8   1.4 |   3.8   1.0   1.6
          |   9.8  61.8       |   8.4  50.0       |   5.9  33.8       |   3.4  25.1       |   2.4  13.8       ||   5.7  33.8       |   6.5  39.7
==========+===================+===================+===================+===================+===================++===================+===================
Overall   |   6.7  16.7   2.7 |   5.9  16.3   1.4 |   2.4  15.6   1.5 |   2.6  15.1   0.6 |   1.8  14.9   0.9 ||
Totals    |  26.1  69.7       |  23.6  58.6       |  19.5  43.5       |  18.3  39.2       |  17.6  34.8       ||
----------+-------------------+-------------------+-------------------+-------------------+-------------------++
Foreign   |   6.4  18.9   2.6 |   6.2  18.8   1.5 |   2.6  18.0   1.6 |   2.7  17.4   0.7 |   1.9  17.4   0.9 ||
System    |  27.9  68.9       |  26.4  62.6       |  22.2  47.1       |  20.8  42.5       |  20.2  38.3       ||
Totals    |

Legend (each cell): line 1 = %Sub %Del %Ins; line 2 = %W.Err %Utt.Err

Matrix tabulation of results for the Nov92 ATIS SPREC Test Results, for the Class A+D Subset.
Matrix columns present results for Test Data Subsets collected at several sites, and matrix rows present results for different systems.
Numbers printed at the top of the matrix columns indicate the number of utterances in the Test Data (sub)set from the corresponding site.
"Overall Totals" (column) present results for the entire Class A+D Subset for the system corresponding to that matrix row.
"Foreign Coll. Site Totals" present results for "foreign site" data (i.e., excluding locally collected data) for the Class A+D Subset.
"Overall Totals" (row) present results accumulated over all systems corresponding to the Test Data (sub)set corresponding to that matrix column. "Foreign System Totals" present results accumulated over "foreign systems" (i.e., excluding results for the system(s) developed at the site responsible for collection of that Test Data subset).

Table 4: ATIS SPREC Results: Class (A+D) by Collection Site
Composite Report of All Significance Tests
For the Nov92 ATIS SPREC Class A+D Test Results Test
Test Name                                                 Abbrev.
------------------------------------------------------------
Matched Pair Sentence Segment (Word Error) Test           MP
Signed Paired Comparison (Speaker Word Accuracy) Test     SI
Wilcoxon Signed Rank (Speaker Word Accuracy) Test         WI
McNemar (Sentence Error) Test                             MN

             | att2-a_d | bbn2-a_d        | cmu2-a_d        | crim4-a_d       | inrs2-a_d       | mit_lcs2-a_d    | sri3-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
att2-a_d     |          | MP bbn2-a_d     | MP cmu2-a_d     | MP att2-a_d     | MP att2-a_d     | MP same         | MP sri3-a_d
             |          | SI bbn2-a_d     | SI cmu2-a_d     | SI att2-a_d     | SI att2-a_d     | SI same         | SI sri3-a_d
             |          | WI bbn2-a_d     | WI cmu2-a_d     | WI att2-a_d     | WI att2-a_d     | WI same         | WI sri3-a_d
             |          | MN bbn2-a_d     | MN cmu2-a_d     | MN att2-a_d     | MN att2-a_d     | MN mit_lcs2-a_d | MN sri3-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
bbn2-a_d     |          |                 | MP same         | MP bbn2-a_d     | MP bbn2-a_d     | MP bbn2-a_d     | MP bbn2-a_d
             |          |                 | SI same         | SI bbn2-a_d     | SI bbn2-a_d     | SI bbn2-a_d     | SI bbn2-a_d
             |          |                 | WI same         | WI bbn2-a_d     | WI bbn2-a_d     | WI bbn2-a_d     | WI bbn2-a_d
             |          |                 | MN bbn2-a_d     | MN bbn2-a_d     | MN bbn2-a_d     | MN bbn2-a_d     | MN bbn2-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
cmu2-a_d     |          |                 |                 | MP cmu2-a_d     | MP cmu2-a_d     | MP cmu2-a_d     | MP cmu2-a_d
             |          |                 |                 | SI cmu2-a_d     | SI cmu2-a_d     | SI cmu2-a_d     | SI same
             |          |                 |                 | WI cmu2-a_d     | WI cmu2-a_d     | WI cmu2-a_d     | WI same
             |          |                 |                 | MN cmu2-a_d     | MN cmu2-a_d     | MN cmu2-a_d     | MN cmu2-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
crim4-a_d    |          |                 |                 |                 | MP crim4-a_d    | MP mit_lcs2-a_d | MP sri3-a_d
             |          |                 |                 |                 | SI crim4-a_d    | SI mit_lcs2-a_d | SI sri3-a_d
             |          |                 |                 |                 | WI crim4-a_d    | WI mit_lcs2-a_d | WI sri3-a_d
             |          |                 |                 |                 | MN crim4-a_d    | MN mit_lcs2-a_d | MN sri3-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
inrs2-a_d    |          |                 |                 |                 |                 | MP mit_lcs2-a_d | MP sri3-a_d
             |          |                 |                 |                 |                 | SI mit_lcs2-a_d | SI sri3-a_d
             |          |                 |                 |                 |                 | WI mit_lcs2-a_d | WI sri3-a_d
             |          |                 |                 |                 |                 | MN mit_lcs2-a_d | MN sri3-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
mit_lcs2-a_d |          |                 |                 |                 |                 |                 | MP sri3-a_d
             |          |                 |                 |                 |                 |                 | SI sri3-a_d
             |          |                 |                 |                 |                 |                 | WI sri3-a_d
             |          |                 |                 |                 |                 |                 | MN sri3-a_d
-------------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------
sri3-a_d     |          |                 |                 |                 |                 |                 |
Table 5: Significance Test Results: ATIS SPREC Systems
system       Class A+D     Class A     Class D  Description
              674 Utt.    427 Utt.    247 Utt.
             W. Err(%)   W. Err(%)   W. Err(%)
att1              42.4        34.7        55.9  ATT1 Nov 92 ATIS NL Results
bbn1              22.0        15.7        32.8  BBN1 Nov 92 ATIS NL Results
cmu1              12.3        12.2        12.6  CMU1 Nov 92 ATIS NL Results
crim1             71.2        40.5       124.3  CRIM1 CHANEL Nov 92 ATIS NL Results
crim2             69.4        50.1       102.8  CRIM2 CHANEL CD Nov 92 ATIS NL Results
crim3             49.7        31.1        81.8  CRIM3 NEURON Nov 92 ATIS NL Results
inrs1            101.5        79.9       138.9  INRS Late Nov 92 ATIS NL Results
mit_lcs1          18.4        18.3        18.6  MIT LCS1 Nov 92 ATIS NL Results
paramax           55.6        44.0        75.7  PARAMAX Nov 92 ATIS NL Results
sri1              27.6        22.2        36.8  SRI1 TM Nov 92 ATIS NL Results
sri2              23.6        14.8        38.9  SRI2 GEMINI+TM Nov 92 ATIS NL Results
Table 6: ATIS NL Test Results
Class (A+D) Set: Originating Site of Test Data

System    | ATT (89)       | BBN (124)      | CMU (142)      | MIT (167)      | SRI (152)      || Overall Totals | Foreign Coll.
          |                |                |                |                |                || (674)          | Site Totals
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
att1      |   71   14    4 |   79   29   16 |   93   45    4 |  137   25    5 |  135   14    3 ||  515  127   32 |  444  113   28
          |   80   16    4 |   64   23   13 |   65   32    3 |   82   15    3 |   89    9    2 ||   76   19    5 |   76   19    5
          |       36.0     |       59.7     |       66.2     |       32.9     |       20.4     ||       42.4     |       43.4
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
bbn1      |   76    3   10 |   95   15   14 |  116   15   11 |  150    5   12 |  136    9    7 ||  573   47   54 |  478   32   40
          |   85    3   11 |   77   12   11 |   82   11    8 |   90    3    7 |   89    6    5 ||   85    7    8 |   87    6    7
          |       18.0     |       35.5     |       28.9     |       13.2     |       16.4     ||       22.0     |       18.9
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
cmu1      |   84    5    0 |  100   20    4 |  138    4    0 |  158    8    1 |  150    2    0 ||  630   39    5 |  492   35    5
          |   94    6    0 |   81   16    3 |   97    3    0 |   95    5    1 |   99    1    0 ||   93    6    1 |   92    7    1
          |       11.2     |       35.5     |        5.6     |       10.2     |        2.6     ||       12.3     |       14.1
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
crim1     |   36   17   36 |   67   24   33 |   65   41   36 |   77   28   62 |   91   32   29 ||  336  142  196 |  336  142  196
          |   40   19   40 |   54   19   27 |   46   29   25 |   46   17   37 |   60   21   19 ||   50   21   29 |   50   21   29
          |       78.7     |       65.3     |       83.1     |       70.7     |       61.2     ||       71.2     |       71.2
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
crim2     |   43   27   19 |   67   39   18 |   69   54   19 |   95   23   49 |  106   31   15 ||  380  174  120 |  380  174  120
          |   48   30   21 |   54   31   15 |   49   38   13 |   57   14   29 |   70   20   10 ||   56   26   18 |   56   26   18
          |       82.0     |       77.4     |       89.4     |       56.9     |       50.7     ||       69.4     |       69.4
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
crim3     |   63   21    5 |   88   32    4 |  101   39    2 |  119   40    8 |  126   26    0 ||  497  158   19 |  497  158   19
          |   71   24    6 |   71   26    3 |   71   27    1 |   71   24    5 |   83   17    0 ||   74   23    3 |   74   23    3
          |       52.8     |       54.8     |       56.3     |       52.7     |       34.2     ||       49.7     |       49.7
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
inrs1     |   38   47    4 |   51   65    8 |   56   83    3 |   74   79   14 |   98   53    1 ||  317  327   30 |  317  327   30
          |   43   53    4 |   41   52    6 |   39   58    2 |   44   47    8 |   64   35    1 ||   47   49    4 |   47   49    4
          |      110.1     |      111.3     |      119.0     |      103.0     |       70.4     ||      101.5     |      101.5
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
mit_lcs1  |   78    7    4 |   93   21   10 |  132    8    2 |  154    9    4 |  143    5    4 ||  600   50   24 |  446   41   20
          |   88    8    4 |   75   17    8 |   93    6    1 |   92    5    2 |   94    3    3 ||   89    7    4 |   88    8    4
          |       20.2     |       41.9     |       12.7     |       13.2     |        9.2     ||       18.4     |       20.1
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
paramax   |   33   10   46 |   59   17   48 |   65   37   40 |  110   11   46 |  121   14   17 ||  388   89  197 |  388   89  197
          |   37   11   52 |   48   14   39 |   46   26   28 |   66    7   28 |   80    9   11 ||   58   13   29 |   58   13   29
          |       74.2     |       66.1     |       80.3     |       40.7     |       29.6     ||       55.6     |       55.6
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
sri1      |   69   12    8 |   91   19   14 |  109   17   16 |  144    7   16 |  137    7    8 ||  550   62   62 |  413   55   54
          |   78   13    9 |   73   15   11 |   77   12   11 |   86    4   10 |   90    5    5 ||   82    9    9 |   79   11   10
          |       36.0     |       41.9     |       35.2     |       18.0     |       14.5     ||       27.6     |       31.4
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
sri2      |   74   11    4 |   93   16   15 |  108   19   15 |  150    5   12 |  146    5    1 ||  571   56   47 |  425   51   46
          |   83   12    4 |   75   13   12 |   76   13   11 |   90    3    7 |   96    3    1 ||   85    8    7 |   81   10    9
          |       29.2     |       37.9     |       37.3     |       13.2     |        7.2     ||       23.6     |       28.4
==========+================+================+================+================+================++================+================
Overall   |  665  174  140 |  883  297  184 | 1052  362  148 | 1368  240  229 | 1389  198   85 ||
Totals    |   68   18   14 |   65   22   13 |   67   23    9 |   74   13   12 |   83   12    5 ||
          |       49.8     |       57.0     |       55.8     |       38.6     |       28.8     ||
----------+----------------+----------------+----------------+----------------+----------------++
Foreign   |  594  160  136 |  788  282  170 |  914  358  148 | 1214  231  225 | 1106  186   76 ||
System    |   67   18   15 |   64   23   14 |   64   25   10 |   73   14   13 |   81   14    6 ||
Totals    |       51.2     |       59.2     |       60.8     |       41.1     |       32.7     ||

Legend (each cell): line 1 = #T #F #NA; line 2 = %T %F %NA; line 3 = % Weighted Error

Table 7: ATIS NL Results: Class (A+D) by Collection Site
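The "% Weighted Error" entries in Table 7 can be reproduced from the #T, #F, and #NA counts if a false answer is charged twice as heavily as a "no answer". The sketch below assumes that formula, 100 * (2 * #F + #NA) / (#T + #F + #NA), because it matches the tabulated values; the formula is not quoted from this section, and the function names are hypothetical. It also shows how the "Foreign Coll. Site Totals" for att1 follow from pooling every site except the system's own.

```python
from typing import Dict, Iterable, Tuple

Counts = Tuple[int, int, int]  # (#T, #F, #NA)


def weighted_error(n_true: int, n_false: int, n_na: int) -> float:
    """Percent weighted error, assuming a false answer counts double."""
    total = n_true + n_false + n_na
    return 100.0 * (2 * n_false + n_na) / total


def pooled(by_site: Dict[str, Counts], exclude: Iterable[str] = ()) -> Counts:
    """Sum (#T, #F, #NA) over all sites not listed in `exclude`."""
    skip = set(exclude)
    kept = [counts for site, counts in by_site.items() if site not in skip]
    n_true, n_false, n_na = (sum(column) for column in zip(*kept))
    return n_true, n_false, n_na


# att1 NL counts by originating site, copied from Table 7.
ATT1_NL: Dict[str, Counts] = {
    "ATT": (71, 14, 4),
    "BBN": (79, 29, 16),
    "CMU": (93, 45, 4),
    "MIT": (137, 25, 5),
    "SRI": (135, 14, 3),
}

overall = pooled(ATT1_NL)
foreign = pooled(ATT1_NL, exclude=["ATT"])  # drop locally collected data

print("overall", overall, round(weighted_error(*overall), 1))
print("foreign", foreign, round(weighted_error(*foreign), 1))
```

Because false answers are weighted double, the weighted error can exceed 100%, as it does for several entries in the table.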
system       Class A+D     Class A     Class D  Description
              674 Utt.    427 Utt.    247 Utt.
             W. Err(%)   W. Err(%)   W. Err(%)
att1              82.8        49.6       140.1  ATT1 Nov 92 ATIS SLS Results
bbn1              30.6        23.7        42.5  BBN1 Nov 92 ATIS SLS Results
cmu1              21.2        19.7        23.9  CMU1 Nov 92 ATIS SLS Results
crim1             82.3        56.9       126.3  CRIM1 CHANEL Nov 92 ATIS SLS Results
crim2             82.9        66.3       111.7  CRIM2 CHANEL CD Nov 92 ATIS SLS Results
crim3             75.2        57.1       106.5  CRIM3 NEURON Nov 92 ATIS SLS Results
inrs1            100.0       100.0       100.0  INRS1 LATE Nov 92 ATIS SLS Results
mit_lcs1          29.7        30.4        28.3  MIT LCS1 Nov 92 ATIS SLS Results
sri1              37.4        31.9        47.0  SRI1 TM Nov 92 ATIS SLS Results
sri2              33.2        26.5        44.9  SRI2 GEMINI+TM Nov 92 ATIS SLS Results
Table 8: ATIS SLS Test Results
Class (A+D) Set: Originating Site of Test Data

System    | ATT (89)       | BBN (124)      | CMU (142)      | MIT (167)      | SRI (152)      || Overall Totals | Foreign Coll.
          |                |                |                |                |                || (674)          | Site Totals
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
att1      |   35   41   13 |   62   42   20 |   61   76    5 |   98   56   13 |  110   35    7 ||  366  250   58 |  331  209   45
          |   39   46   15 |   50   34   16 |   43   54    4 |   59   34    8 |   72   23    5 ||   54   37    9 |   57   36    8
          |      106.7     |       83.9     |      110.6     |       74.9     |       50.7     ||       82.8     |       79.1
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
bbn1      |   60   14   15 |   88   17   19 |  112   22    8 |  147   14    6 |  139   11    2 ||  546   78   50 |  458   61   31
          |   67   16   17 |   71   14   15 |   79   15    6 |   88    8    4 |   91    7    1 ||   81   12    7 |   83   11    6
          |       48.3     |       42.7     |       36.6     |       20.4     |       15.8     ||       30.6     |       27.8
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
cmu1      |   72   16    1 |   92   27    5 |  129   13    0 |  157    9    1 |  149    3    0 ||  599   68    7 |  470   55    7
          |   81   18    1 |   74   22    4 |   91    9    0 |   94    5    1 |   98    2    0 ||   89   10    1 |   88   10    1
          |       37.1     |       47.6     |       18.3     |       11.4     |        3.9     ||       21.2     |       22.0
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
crim1     |   27   12   50 |   45   34   45 |   59   44   39 |   67   33   67 |   83   39   30 ||  281  162  231 |  281  162  231
          |   30   13   56 |   36   27   36 |   42   31   27 |   40   20   40 |   55   26   20 ||   42   24   34 |   42   24   34
          |       83.1     |       91.1     |       89.4     |       79.6     |       71.1     ||       82.3     |       82.3
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
crim2     |   36   18   35 |   43   43   38 |   66   54   22 |   74   31   62 |   89   47   16 ||  308  193  173 |  308  193  173
          |   40   20   39 |   35   35   31 |   46   38   15 |   44   19   37 |   59   31   11 ||   46   29   26 |   46   29   26
          |       79.8     |      100.0     |       91.5     |       74.3     |       72.4     ||       82.9     |       82.9
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
crim3     |   46   39    4 |   55   62    7 |   88   49    5 |   99   47   21 |  110   34    8 ||  398  231   45 |  398  231   45
          |   52   44    4 |   44   50    6 |   62   35    4 |   59   28   13 |   72   22    5 ||   59   34    7 |   59   34    7
          |       92.1     |      105.6     |       72.5     |       68.9     |       50.0     ||       75.2     |       75.2
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
inrs1     |    0    0   89 |    0    0  124 |    0    0  142 |    0    0  167 |    0    0  152 ||    0    0  674 |    0    0  674
          |    0    0  100 |    0    0  100 |    0    0  100 |    0    0  100 |    0    0  100 ||    0    0  100 |    0    0  100
          |      100.0     |      100.0     |      100.0     |      100.0     |      100.0     ||      100.0     |      100.0
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
mit_lcs1  |   57   12   20 |   79   28   17 |  120   12   10 |  149   11    7 |  140    8    4 ||  545   71   58 |  396   60   51
          |   64   13   22 |   64   23   14 |   85    8    7 |   89    7    4 |   92    5    3 ||   81   11    9 |   78   12   10
          |       49.4     |       58.9     |       23.9     |       17.4     |       13.2     ||       29.7     |       33.7
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
sri1      |   60   16   13 |   75   27   22 |  101   23   18 |  141    9   17 |  132   12    8 ||  509   87   78 |  377   75   70
          |   67   18   15 |   60   22   18 |   71   16   13 |   84    5   10 |   87    8    5 ||   76   13   12 |   72   14   13
          |       50.6     |       61.3     |       45.1     |       21.0     |       21.1     ||       37.4     |       42.1
----------+----------------+----------------+----------------+----------------+----------------++----------------+----------------
sri2      |   65   13   11 |   75   26   23 |  101   25   16 |  149    6   12 |  139    9    4 ||  529   79   66 |  390   70   62
          |   73   15   12 |   60   21   19 |   71   18   11 |   89    4    7 |   91    6    3 ||   78   12   10 |   75   13   12
          |       41.6     |       60.5     |       46.5     |       14.4     |       14.5     ||       33.2     |       38.7
==========+================+================+================+================+================++================+================
Overall   |  458  181  251 |  614  306  320 |  837  318  265 | 1081  216  373 | 1091  198  231 ||
Totals    |   51   20   28 |   50   25   26 |   59   22   19 |   65   13   22 |   72   13   15 ||
          |       68.9     |       75.2     |       63.5     |       48.2     |       41.2     ||
----------+----------------+----------------+----------------+----------------+----------------++
Foreign   |  423  140  238 |  526  289  301 |  708  305  265 |  932  205  366 |  820  177  219 ||
System    |   53   17   30 |   47   26   27 |   55   24   21 |   62   14   24 |   67   15   18 ||
Totals    |       64.7     |       78.8     |       68.5     |       51.6     |       47.1     ||

Legend (each cell): line 1 = #T #F #NA; line 2 = %T %F %NA; line 3 = % Weighted Error

Table 9: ATIS SLS Results: Class (A+D) by Collection Site
