<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1004">
  <Title>DARPA FEBRUARY 1992 ATIS BENCHMARK TEST RESULTS</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 OCTOBER 1991 &amp;quot;DRY RUN&amp;quot; TESTS
</SectionTitle>
    <Paragraph position="0"> The procedures for test set selection, testing, scoring, adjudication, and reporting for the February 1992 ATIS Benchmark Tests were developed and used for a &amp;quot;dry run&amp;quot; test in October 1991, with unpublished results. A somewhat smaller test set was used at that time, which did not include test data from AT&amp;T. The implementation of the tests was generally regarded as successful within the DARPA MADCOW Group and by the DARPA Spoken Language Program Coordinating Committee. null</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="15" type="metho">
    <SectionTitle>
3 NEW CONDITIONS FOR THESE
TESTS
</SectionTitle>
    <Paragraph position="0"> The structure (and scoring) of these ATIS domain tests differ in several ways from the tests reported at the June 1990 and February 1991 Workshops: * Following the February 1991 Workshop, minor revisions (e.g., to accommodate connecting flights, clarify terminology, revise headings and restructure tables, improve representation of fare structures, bug fixes, etc) were made to the relational air-travelinformation database. The MADCOW data collection effort, and systems developed with this data, made use of this revised relational database (Version 3.3).</Paragraph>
    <Paragraph position="1"> * The MADCOW data collection effort provided data from five sites (AT&amp;T, BBN, CMU, MIT/LCS, and SRI), rather than the single ATIS data collection site (TI) used for the June 1990 and February 1991 tests.</Paragraph>
    <Paragraph position="2"> * Some (but not all) of the collecting sites provided secondary (Crown PCC-160) microphone data in addition to the primary (close- talking Sennheiser) microphone. The use of the secondary microphone data was encouraged, but not required, for the February 1992 tests.</Paragraph>
    <Paragraph position="3"> * The definition of &amp;quot;Class D&amp;quot; queries was broadened to include &amp;quot;Class DI&amp;quot; queries.</Paragraph>
    <Paragraph position="4"> * The files indicating the &amp;quot;classification&amp;quot; (i.e., Class A, D or X) for each query were not provided along with the test queries (as they had been in previous tests), so'that each site had no extra information regarding the context-dependency or answerability of each query.</Paragraph>
    <Paragraph position="5"> * Similarly, &amp;quot;unanswerable&amp;quot; (Class X) queries were not identified when the test material was released.</Paragraph>
    <Paragraph position="6"> If system developers provided answers for these queries, they were not scored.</Paragraph>
    <Paragraph position="7"> * No utterances were to be treated differently on the grounds of the presence of disfluencies such as false starts or restarts. In the February 1991 tests, these utterances were regarded as &amp;quot;Optional&amp;quot;.</Paragraph>
    <Paragraph position="8"> * Concern had been expressed at the February 1991 meeting that some sites might have chosen to &amp;quot;overgenerate&amp;quot; (by providing verbose) NL and SLS answers rather than provide more succinct answers. It was argued that &amp;quot;correct&amp;quot; answers should have at  least the information in the &amp;quot;.ref&amp;quot; files previously used in scoring answers, but no more than in some specified maximal answer. Bob Moore and Eric Jackson, at SRI, proposed and implemented an algorithmic procedure for deriving maximal reference answers (&amp;quot;.rf2&amp;quot;) from the NLParse-generated SQL files used to generate the .ref files. Bill Fisher at NIST subsequently modified the NIST comparator (used in scoring the NL and SLS results) to implement the new &amp;quot;minimum~maximum&amp;quot; scoring procedure. The Principles of Interpretation document was modified to accommodate these changes.</Paragraph>
    <Paragraph position="9"> * Special reports were to be prepared by NIST to partition the tabulations of results according to the originating sites for the test data.</Paragraph>
    <Paragraph position="10"> * Following completion of each phase of scoring the results, NIST was to prepare and make available to all participants both detailed and summary reports via anonymous ftp.</Paragraph>
    <Paragraph position="11"> * Because there had been a recommendation to report results for all answerable queries in complete subject-scenarios (i.e., the material collected during one subject's working of one scenario), test material was to be provided to the testing sites in complete subject-scenarios. Emphasis was to be placed on analysis of the subset of &amp;quot;answerable&amp;quot; queries (i.e., Class A+D), rather than on the individual classes A and/or D. Further, the weighted error percentage (defined as twice the percentage of incorrect or &amp;quot;false&amp;quot; answers plus the percentage of &amp;quot;No_Answer&amp;quot; responses) was identified as preferable to the single-number &amp;quot;Score&amp;quot; reported at the February 1991 meeting (Score (%) = 100 (%)- \[Weighted Error (%)1)</Paragraph>
  </Section>
  <Section position="5" start_page="15" end_page="15" type="metho">
    <SectionTitle>
4 TEST MATERIAL SELECTION
AND DISTRIBUTION
</SectionTitle>
    <Paragraph position="0"> With the approval of the MADCOW Group, NIST had reserved approximately 20% of the pooled MADCOW data for test purposes. NIST screened this data for the occurrence of truncated utterances, rejected the subject-scenarios that included these phenomena, and determined that there was a sufficient quantity of reserved potential test material to permit release of a test set consisting of approximately 200 utterances from each of the five MADCOW sites contributing data. NIST did not monitor the audio quality of the .wav files nor review the accuracy of the transcriptions, since no criteria for acceptability based on these have been defined, although in retrospect this might have simplified the adjudication process.</Paragraph>
    <Paragraph position="1"> The test material, subsequent to deletion of some material during the adjudication process, consisted of 970 non-null (and 1 null) utterances in all classes. The number of distinct scenarios used by all subjects was 42, with a total of 37 subjects (&amp;quot;speakers&amp;quot;) completing 122 subject-scenarios. There were 17 male subjects, and 20 were female. Seven of the 122 subject-scenarios used the &amp;quot;Common-l&amp;quot; scenario; however, the test material selected from BBN and CMU did not include any instances of this scenario. The average number of queries per subject-scenario was 8. The MIT subject-scenarios had an average number of 4.6 queries, and SRI and CMU each had an average number of 12.1 queries per subjectscenario. There were 508 lexemes represented in the test material. The average number of words per utterance was about 11.</Paragraph>
    <Paragraph position="2"> After NIST selected the test material, it was produced on CD-ROM. The test disc (NIST Speech Disc T3-1.1) was distributed to the testing sites on Jan. 6, 1992.</Paragraph>
    <Paragraph position="3"> Concurrent with preparation of the CD-ROMs, NIST staff and the &amp;quot;Annotation Group&amp;quot; at SRI initiated preparation of the annotation files required to implement scoring.</Paragraph>
  </Section>
  <Section position="6" start_page="15" end_page="15" type="metho">
    <SectionTitle>
5 TEST PROCEDURE
</SectionTitle>
    <Paragraph position="0"> Following completion of locally administered single-passper-system tests, participating sites submitted results for (at least) three ATIS tests: the SPeech RECognition (SPREC), Natural Language (NL) and Spoken Language System (SLS) tests.</Paragraph>
    <Paragraph position="1"> The format for data submission via e-mail was specified by NIST and all &amp;quot;official&amp;quot; results were received at NIST by 6:00 AM on Jan. 20, 1992. As in previous ATIS tests, answer hypotheses were to be in the form of lexical SNOR (.lsn) files for the SPREC results and in Common Answer Specification (CAS) format files for the NL and SLS results. Each submission was to be accompanied by a text file for each system providing a system description following a suggested format.</Paragraph>
  </Section>
  <Section position="7" start_page="15" end_page="16" type="metho">
    <SectionTitle>
6 TEST SCORING, ADJUDICATION
AND REPORTING PROCEDURE
</SectionTitle>
    <Paragraph position="0"> Upon receipt of the test results, NIST implemented preliminary scoring with a reference answer set including .cat, .ref and .rff2 files developed at NIST and SRI for the NL and SLS tests, and the &amp;quot;lexical SNOR&amp;quot; (.lsn) files derived from the detailed (.sro) transcriptions provided by the collecting sites for the SPREC tests. On Jan. 24, 1992, upon completion of the preliminary scoring and  preparation of the required reports, NIST released the preliminary results by anonymous ftp.</Paragraph>
    <Paragraph position="1"> A detailed and formal procedure was established at NIST at the MADCOW group's request for handling requests for adjudication.</Paragraph>
    <Paragraph position="2"> The participating sites filed a total of 122 requests for adjudication, which were treated by NIST and the SRI Annotation group in a manner similar to that followed for the training data's bug reports. Some of these requests involved more than one utterance, or reported on more than one &amp;quot;bug&amp;quot; in an utterance, so that the number of unique utterances potentially affected by the requests for adjudication was 193, or approximately 19% of the test material.</Paragraph>
    <Paragraph position="3"> Of these utterances, the adjudicators determined that 99 (51%) actually required one or more changes. &amp;quot;No Action&amp;quot; decisions were made for the remaining 49%. NIST was advised by Francis Kubala at BBN during the adjudication period that some of the reference transcriptions used for scoring the SPREC test appeared to be inaccurate. NIST subsequently reviewed all of the transcriptions noted by Kubala and corrected them as deemed appropriate.</Paragraph>
    <Paragraph position="4"> In addition to the 99 utterances noted as part of the formal requests for adjudication requiring changes to the annotations, 26 test utterances were identified by the adjudicators as requiring changes.</Paragraph>
    <Paragraph position="5"> The final total of 125 utterances (12.9% of the entire test set) for which annotation changes were made includes the following breakdown (by category): * 42 with software problems related to annotations or scoring (e.g., NLParse, batching, or Comparator bugs), * 36 for which annotation errors had been made, * 27 involved problems with the transcriptions developed at the originating sites, and * 20 involved differences of opinion in applying the Principles of Interpretation or the use of context in interpreting the query.</Paragraph>
    <Paragraph position="6"> Following completion of the adjudication process, NIST released a set of &amp;quot;Official&amp;quot; ATIS Benchmark Test results to the community on Feb. 5, 1992.</Paragraph>
    <Paragraph position="7"> NIST was subsequently advised by Paramax that corrections to the reference answer set that were to have been made during the adjudication process did not appear to have been made. NIST and SRI determined that this had in fact been the case, and a total of 19 .rf2 files were corrected. The entire set of NL and SLS results were then re-scored, and a &amp;quot;Revised Official&amp;quot; set of results was made available to the community. Analysis of the differences between these two sets of &amp;quot;official&amp;quot; results shows that only 5 of Paramax's NL and 4 of their SLS answers were scored differently.</Paragraph>
    <Paragraph position="8"> Paramax also noted, following release of the &amp;quot;Revised Official&amp;quot; results, that 20 of their NL as well as another 20 of their SLS answers were scored as &amp;quot;False&amp;quot; because of known limitations in the NIST official scoring software. NIST had determined that the degree to which Paramax's answers were affected by this known limitation was approximately ten times more severe than for any other site, and declined to alter the scoring software to accomodate Paramax's unusual responses. NIST encouraged Paramax to develop and document &amp;quot;unofficial&amp;quot; results \[4\] with slightly modified scoring software.</Paragraph>
    <Paragraph position="9"> A &amp;quot;handout&amp;quot; was prepared for, and distributed at, the February 1992 Speech and Natural Language Workshop containing the System Descriptions provided by the participants and NIST's summaries of Benchmark Test results. null</Paragraph>
  </Section>
  <Section position="8" start_page="16" end_page="18" type="metho">
    <SectionTitle>
7 BENCHMARK TEST RESULTS AND DISCUSSION
7.1 ATIS SPeech RECognition (SPREC) Test Results
</SectionTitle>
    <Paragraph position="0"> Table 1 presents a tabulation of the February 1992 ATIS spontaneous Speech RECognition (SPREC) test results.</Paragraph>
    <Paragraph position="1"> Results are presented for a number of defined subsets of the utterances, with the utterance classes defined in the annotation process. The set Class ATDWX is the set of all utterances in all classes, consisting of 971 utterances.</Paragraph>
    <Paragraph position="2"> The set Class A+D includes all answerable utterances, 687 in all. Individual scores for the component subsets Class A, Class D, and Class X are also included. The utterances in Classes D and X tend to have a greater degree of disfluency than those in Class A. This factor may be reflected in the corresponding error rates, since the lowest subset error rates are to be found for Class A utterances, and the highest for Class X.</Paragraph>
    <Paragraph position="3"> In the set of answerable queries, Class A+D, the word error ranges from 6.2% to 13.8%, and the &amp;quot;Utterance  error rate&amp;quot; (corresponding approximately to &amp;quot;sentence error rate&amp;quot;, but acknowledging the fact that some utterances consist of more than one sentence) range from 34.6% to 60.1%.</Paragraph>
    <Paragraph position="4"> The lowest word error rate, in any of the subsets, 5.8%, is noted for the BBN system described in \[5\] for the subset of Class A utterances.</Paragraph>
    <Paragraph position="5"> Table 2 presents a matrix tabulation of ATIS SPREC results for the set of answerable queries, Class A+D. This matrix form of tabulation of results was developed at the MADCOW group's request to shed light on potential variabilities in the data for test set components from differing originating sites. The five columns of the matrix block correspond to the five originating sites for the MADCOW test data. In this case, the six rows of the matrix block correspond to the six sets of SPREC test results sent to NIST. The &amp;quot;Overall Totals&amp;quot; column at the right of the central block presents results corresponding to those cited for the Class A+D subset in Table 1. Note, for example, that the previously cited lowest Class A+D subset word error of 6.2% (for the BBN system) is shown in the second row entry of this column.</Paragraph>
    <Paragraph position="6"> The &amp;quot;Overall Totals&amp;quot; row presents results accumulated over all systems for which results were reported to NIST.</Paragraph>
    <Paragraph position="7"> Note that the Overall (subset) Total Word Error (&amp;quot;W.</Paragraph>
    <Paragraph position="8"> Error&amp;quot;) ranges from a low of 5.9%, for the data originating at MIT/LCS, to 14.6% for the AT&amp;T data subset. These data suggest that the MIT data subset is less challenging for ATIS SPREC systems than the data from other sites, but the reasons for this are not immediately evident.</Paragraph>
    <Paragraph position="9"> Analysis of the transcriptions suggests that the AT&amp;T data subset has a higher incidence of disfluencies than other subsets, partially explaining why it is more challenging than the other data subsets.</Paragraph>
    <Paragraph position="10"> For the &amp;quot;Class A+D&amp;quot; data, the lowest subset word error for any SPREC system is 3.2%, again for the BBN SPREC system and for the MIT data subset. Analysis of a similar matrix for the Class A data (not shown) indicates that the lowest subset word error (again for the MIT data subset) is 2.6% for the BBN system, with a corresponding utterance error of 20.7%.</Paragraph>
    <Paragraph position="11">  Three ATIS MADCOW sites provided data for both the Sennheiser close-talking microphone and the secondary (Crown PCC-160) microphone: CMU, MIT/LCS, and SRI. Two sites agreed to use the Crown microphone data with SPREC systems, using &amp;quot;robust&amp;quot; recognition algorithms: CMU and SRI. In some cases, results for other algorithms for comparable subsets of the data are available, and these have been excised from larger sets of data provided to NIST by CMU and SRI for the purposes of comparisons.</Paragraph>
    <Paragraph position="12"> Table 3 presents a matrix tabulation of the SPREC data for the Class A+D data from CMU, MIT/LCS and SRI for 5 systems (i.e., 3 from CMU and 2 from SRI). The &amp;quot;cmu4&amp;quot; system is the CMU Sphinx II system \[6\] processing the close-talking microphone data, the &amp;quot;cmu6&amp;quot; system is the CMU codeword-dependentcepstral-normalization (CDCN) system \[7\] processing the close-talking data, and the &amp;quot;cmu3&amp;quot; system is the CMU CDCN system processing the Crown microphone data. The &amp;quot;sri3&amp;quot; system (processing the close-talking microphone data) and &amp;quot;sri4&amp;quot; system (processing the Crown microphone data) are versions of the Sl:tI Decipher system incorporating the &amp;quot;I:tASTA&amp;quot; procedure for high-pass filtering of a log-spectral representation of speech \[8\]. For the close-talking microphone data subset, the lowest word error rate (7.0%) is for the sri4 system, which may be compared to the cmu4 system (10.4%) and the cmu6 system (13.7%). According to the system description provided by CMU, the two CMU systems differ in the amount of training material, among other factors.</Paragraph>
    <Paragraph position="13"> For the secondary microphone data subset, the word error rate for the cmu3 system is 17.8%, and for the sri4 system is 30.4%.</Paragraph>
    <Paragraph position="14"> There are indications of substantial variabilities due to originating site for the secondary microphone data, with both the SRI and CMU data secondary microphone data subsets giving rise to higher error rates than for the MIT data subsets.</Paragraph>
    <Paragraph position="15">  As in previous benchmark tests, two statistical significance tests are routinely implemented at NIST in analysis of speech recognition performance assessment tests. The utterance (sentence) error test is an application of McNemar's test, first suggested for use in this community by Gillick \[9\]. Another test consists of a MAtched- null nificance test, originally devised for use with the Resource Management corpora.</Paragraph>
    <Paragraph position="16"> Analysis of the tabulation of the word error test results  for the answerable query subset (Class A+D) shown in Table 4a indicates that for the BBN system \[5\], the word error rates are significantly different from (lower than) those for the other systems included in these tests. The sentence error McNemar test (Table 4b) indicates a similar result, but in this case, the sentence error rate for the Paramax SPREC system \[4\] does not differ significantly from the BBN system.</Paragraph>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
7.2 Natural Language (NL) Tests
</SectionTitle>
      <Paragraph position="0"> Table 5 presents a tabulation of the February 1992 ATIS Natural Language (NL) understanding tests results. Results are presented for the set of all &amp;quot;answerable utterances&amp;quot;, Class A+D, and for the individual Class A and Class D subsets. As was the case for the SPREC results, in general the error rates are higher for Class D than for Class A utterances.</Paragraph>
      <Paragraph position="1"> For the set of answerable queries, Class A+D, the weighted error ranges from 30.1% to 75.4%. Note that five of the systems have weighted error percentages between 30.1% and 33.9%.</Paragraph>
      <Paragraph position="2"> Table 6 presents a matrix tabulation for the NL test results for the set of answerable queries, Class A+D. There were a total of 687 queries in this set. The numbers tabulated for this set in Table 5 appear in the &amp;quot;Overall Totals&amp;quot; column, along with corresponding percentages. The &amp;quot;Overall Totals&amp;quot; row indicates the variability due to the test subsets' originating site.</Paragraph>
      <Paragraph position="3"> Of the 5 data subsets, the lower weighted error percentages in the &amp;quot;Overall Totals&amp;quot; row are to be found for the CMU and MIT data, with the SPd, AT&amp;T, and BBN data giving rise to higher weighted error percentages.</Paragraph>
      <Paragraph position="4"> Since the AT&amp;T data was collected using a significantly different collection paradigm- with the subject interfacing with the ATIS system simulation only over a phone line, rather than viewing a screen display of travel information \[10\] - the fact that the AT&amp;T data subset is more difficult than three other sites is perhaps not surprising.</Paragraph>
      <Paragraph position="5"> However, the BBN ATIS data collection effort also differed somewhat from that at other MADCOW sites in that - although information was presented using a screen display - the BBN scenarios &amp;quot;included not only trip planning scenarios, but also problem solving involving more general kinds of database access... This was done to try to elicit a richer range of language usage.\[3\]&amp;quot; This factor (&amp;quot;richer language usage&amp;quot;) may provide a partial explanation for the high NL error rates noted for the BBN data subset.</Paragraph>
      <Paragraph position="6"> For the CMU and MIT \[11\] systems, there appears to be some indication that the error percentages for &amp;quot;locallycollected&amp;quot; data are lower than for &amp;quot;foreign&amp;quot; data, perhaps because of greater familiarity with the local data-collection scenarios and environment, or use of a variant of the system under test when collecting the MADCOW data from which the test set was selected.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
7.3 Spoken Language Systems (SLS)
Tests
</SectionTitle>
      <Paragraph position="0"> Table 7 presents a tabulation of the February 1992 Spoken Language System understanding test results. As was the case for Table 5 (for the corresponding NL results), results are shown for several classes of the data, but emphasis in this material is placed on the answerable utterances, comprising Class A+D.</Paragraph>
      <Paragraph position="1"> For the Class A+D set, the seven SLS systems have weighted error ranging from 43.7% to 90.2%. Note that four systems (from three sites: BBN, MIT and SRI \[12\]) have weighted error percentages between 43.7% and 52.8%.</Paragraph>
      <Paragraph position="2"> Table 8 presents a matrix tabulation for the SLS test results for Class A+D, comparable in structure to that for the NL results of Table 6.</Paragraph>
      <Paragraph position="3"> Of the 5 data subsets corresponding to different collection sites, the range in weighted error is from 49.5% (for the MIT data) to 73.1% (for the AT&amp;T data).</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="18" end_page="19" type="metho">
    <SectionTitle>
8 ACKNOWLEDGEMENT
</SectionTitle>
    <Paragraph position="0"> The authors would like to acknowledge the help provided by the entire MADCOW community throughout the ordeal of collecting, annotating, distributing and using the MADCOW corpus for test purposes. A companion paper in this Proceedings provides a detailed acknowledgement, but special credit was earned by Lynette I-Iirschman as Chair of the MADCOW Group. It is to everyone's credit that the essential data was collected, annotated, and distributed and that &amp;quot;deadlines&amp;quot; were usually honored! Special thanks are also due to the group at MIT, particularly Michael Phillips and Christie Clark Winterton, for quick turn- around in producing recordable CD-ROM discs for distribution of the MADCOW training corpus from master tapes produced by Brett Tjaden at NIST.</Paragraph>
    <Paragraph position="1"> The Annotation Group at SRI, consisting of Kate Hunicke-Smith, Harry Bratt and Beth Bryson, was invaluable to the NIST effort to implement these tests.</Paragraph>
    <Paragraph position="2">  They participated actively and cheerfully in annotation of the test material and the adjudication process, in addition to &amp;quot;training&amp;quot; one of the authors (ND) in the use of the NLParse software and annotation techniques.</Paragraph>
    <Paragraph position="3"> Francis Kubala, at BBN, called NIST's attention to some problematic transcriptions for the SPREC tests. NIST reviewed and revised these as appropriate, and in the process noted 3 truncated utterances (in one subject-scenario collected at BBN). While the revised transcriptions were used in NIST's &amp;quot;revised official&amp;quot; scoring, NIST neglected to delete this subject-scenario from the NL and SLS tests, as specified by MADCOW protocols for handling data with truncated utterances. Analysis of performance on this particular subject-scenario indicates that most sites did well, nonetheless.</Paragraph>
    <Paragraph position="4">  9 References 1. Pallett, et al., &amp;quot;DARPA ATIS Test Results June 1990&amp;quot;, in Proc. Speech and Natural Language Workshop, June 1990, (R. Stern, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="5"> ISBN 1-55860-157-0, pp. 114-121.</Paragraph>
    <Paragraph position="6"> 2. Pallett, D.S., &amp;quot;Session 2: DARPA Resource Management and ATIS Benchmark Test Poster Session&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1991, (P. Price, ed.) Morgan Kaufmann Publishers, Inc. ISBN 1-55860-207-0, pp. 49-58.</Paragraph>
    <Paragraph position="7"> 3. MADCOW, &amp;quot;Multi-Site Data Collection for a Spoken Language Corpus&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="8"> 4. Norton, L.M., Dahl, D.A., and Linebarger, M.C., &amp;quot;Recent Improvements and Benchmark Results for the Paramax ATIS System&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="9"> 5. Kubala, F. et al., &amp;quot;BBN BYBLOS and HARC February 1992 ATIS Benchmark Results&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="10"> 6. Ward, W. et al., &amp;quot;Speech Understanding in Open Tasks&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="11"> 7. Stern, R. M., et al., &amp;quot;Multiple Approaches to Robust speech Recognition&amp;quot; in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="12"> 8. Murveit, H., Butzberger, J. and Weintraub, M., &amp;quot;Reduced Channel Dependence for Speech Recognition&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="13"> 9. Gfllick, L. and Cox, S.J., ~'Some Statistical Issues in the Comparison of speech Recognition Algorithms&amp;quot;, Proceedings of ICASSP-89, Glasgow, May 1989, pp.532-535. 10. Pieraccini, R. et al., &amp;quot;Progress Report on the Chronus System: ATIS Benchmark Results&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="14"> 11. Zue, V., et al., &amp;quot;The MIT ATIS System: February 1992 Progress Report&amp;quot;, in Proc. Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
    <Paragraph position="15"> 12. Appelt, D.E. and Jackson, E., &amp;quot;SRI International February 1992 ATIS Benchmark Test Results&amp;quot;, in Proc.  Speech and Natural Language Workshop, February 1992, (M. Marcus, ed.) Morgan Kaufmann Publishers, Inc.</Paragraph>
  </Section>
  <Section position="10" start_page="19" end_page="27" type="metho">
    <SectionTitle>
10 APPENDIX: &amp;quot;OFFICIAL&amp;quot; VS. &amp;quot;UNOFFICIAL&amp;quot; RESULTS
</SectionTitle>
    <Paragraph position="1"> Several sites expressed interest in having results for additional systems included in NIST's &amp;quot;official&amp;quot; summary, although these results typically were not available at the required time for &amp;quot;official&amp;quot; scoring. At least one site took exception to an idiosyncratic property of the &amp;quot;official&amp;quot; comparator's treatment of their system's responses to several queries, and requested permission to present &amp;quot;unofficial&amp;quot; results at the meeting. Another site noted that they had identified a &amp;quot;bug&amp;quot; in their CAS-answer- format software, and after it was fixed, they also requested permission to report unofficial results.</Paragraph>
    <Paragraph position="2"> It was subsequently decided that the results submitted to NIST by the specified deadline, and uniformly scored at NIST with the &amp;quot;official&amp;quot; comparator and the adjudicated final set of reference answers would comprise the only &amp;quot;official&amp;quot; results, and that locally scored results should be represented as &amp;quot;unofficial&amp;quot;, even if scored with the same scoring software and answer set as the &amp;quot;official&amp;quot; results.</Paragraph>
    <Paragraph position="3"> It should be noted that since the results are for locally implemented tests, and since NIST's role in the tests is principally one of selecting and distributing the test material, and implementing the scoring software and uniformly tabulating the results of the tests, the results are not to be construed or represented as endorsements of any systems or official findings on the part of NIST, DARPA or the U.S. Government.</Paragraph>
    <Paragraph position="5"/>
    <Paragraph position="7"/>
  </Section>
class="xml-element"></Paper>