<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1059">
  <Title>DARPA Resource Management Benchmark Test Results</Title>
  <Section position="2" start_page="0" end_page="299" type="abstr">
    <SectionTitle>
Tabulated Results
</SectionTitle>
    <Paragraph position="0"> Table 1 presents results of NIST scoring of the June 1990 RM2 Test Set results received by NIST as of June 21, 1990.</Paragraph>
    <Paragraph position="1"> For speaker-dependent systems, results are presented for systems from BBN and MIT/LL [4] for two conditions of training: the set of 600 sentence texts used in previous (e.g., RM1 corpus) tests, and another condition making use of an additional 1800 sentence utterances for each speaker, for a total of 2400 training utterances. For speaker-independent systems, results were reported from AT&amp;T [S], BBN [6], CMU [7], MIT/LL [4], SFU [8] and SSI [9]. Most sites made use of the 109-speaker system training condition used for previous tests and reported results on the RM2 test set. BBN's Speaker Independent and Speaker Adaptive results [6] were reported for the February 1989 Test sets, and are tabulated in Table 2. SRI also reported results for the case of having used the 12 speaker (7200 sentence utterance) training material from the speaker-dependent RM1 corpus in addition to the 109 speaker (3990 sentence utterance) speaker independent system training set, for a total of 11,190 sentence utterances for system training.</Paragraph>
    <Paragraph position="2"> Table 2 presents results of NIST scoring of other results reported by several sites on test sets other than the June 1990 (RM2) Test Set.</Paragraph>
    <Paragraph position="3"> In some cases (e.g., some of the &amp;quot;test-retest&amp;quot; cases) the results may reflect the benefits of having used these test sets for retest purposes more than one time.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Significance Test Results
</SectionTitle>
      <Paragraph position="0"> NIST has implemented some of the significance tests \[3\] contained on the RM series of CD-ROMs for some of the data sent for these tests. In general these tests serve to indicate that the differences in measured performance between many of these systems are small -- certainly for systems that are similarly trained and/or share similar algorithmic approaches to speech recognition.</Paragraph>
      <Paragraph position="1"> As a case in point, consider the sentence-level McNemar test results shown in Table 3, comparing the BBN and MIT/LL speaker dependent systems, when using the word-pair grammar. For the two systems that were trained on 2400 sentence utterances, the BBN system had 426 (out of 480) sentences correct, and the MIT/LL system had 427 correct. In comparing these systems with the McNemar test, there are subsets of 399 responses that were identically correct, and 26 identically incorrect. The two systems differed in the number of unique errors by only one sentence (i.e., 27 vs. 28). The significance test obviously results in a &amp;quot;same&amp;quot; judgement. A similar comparison shows that the two systems trained on 600 sentence utterances yield a &amp;quot;same&amp;quot; judgement. However, comparisons involving differently-trained systems do result in significant performance differences -- both within site, and across sites. Table 4 shows the results of implementation of the sentence-level McNemar test for speaker-independent systems trained on the 109 speaker/3990 sentence utterance training set, using the word-pair grammar, for the RM2 test set.</Paragraph>
      <Paragraph position="2"> For the no-grammar case for the speaker-independent systems, the sentence-level McNemar test indicates that the performance differences between these systems are not significant. However, when implementing the word-level matched-pair sentence-segment word error (MAPSSWE) test, the CMU system has significantly better performance than other systems in this category.</Paragraph>
      <Paragraph position="3"> Note that the data for the SRI system trained on 11,190 sentence utterances are not included in these comparisons, since the comparisons are limited to systems trained on 3990 sentence utterances.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="299" type="sub_section">
      <SectionTitle>
Other Analyses
</SectionTitle>
      <Paragraph position="0"> Since release of the &amp;quot;standard scoring software&amp;quot; used for the results reported at this meeting, NIST has developed additional scoring software tools. One of these tools performs an analysis of the results reported for each lexical item.</Paragraph>
      <Paragraph position="1"> By focussing on individual lexical items (&amp;quot;words&amp;quot;) we can investigate lexical coverage as well as performance for individual words for each individual test (such as the June 1990 test). In this RM2 test set there were occurrences of 226 mono-syllabic words and 503 polysyllabic words -- larger coverage of the lexicon than in previous test sets. The most frequently appearing word was &amp;quot;THE&amp;quot;, with 297 occurrences.</Paragraph>
      <Paragraph position="2"> In the case of the system we refer to as &amp;quot;BBN (2400 train)&amp;quot; with the word pair grammar, in the case of the word &amp;quot;THE&amp;quot; -- 97.6% of the occurrences of this word were correctly recognized, with 0.0% substitution errors, 2.4% deletions, and 0.7% &amp;quot;resultant  insertions&amp;quot;, for a total of 3.0% word error for this lexical item. What we term &amp;quot;resultant insertions&amp;quot; correspond to cases for which an insertion error of this lexical item occurred, but for which the cause is not known.</Paragraph>
      <Paragraph position="3"> The conventional scoring software provides data on a &amp;quot;weighted&amp;quot; frequency-of-occurrence basis. All errors are counted equally, and the more frequently occurring words -- such as the &amp;quot;function&amp;quot; words -- typically contribute more to the overall system performance measures.</Paragraph>
      <Paragraph position="4"> However, when comparing results from one test set to another it is sometimes desirable to look at measures that are not weighted by frequency of occurrence. Our recently developed scoring software permits us to do this, and, by looking at results for the subset of words that have appeared on all tests to date, some measures of progress over the past several years are provided, without the complications introduced by variable coverage and different frequencies-of-occurrence of lexical items in different tests. Further discussion of this is to appear in an SLS Note in preparation at NIST.</Paragraph>
      <Paragraph position="5"> By further partitioning the results of such an analysis into those for mono- and poly-syllabic word subsets, some insights can be gained into the state-of-the art as evidenced by the present tests.</Paragraph>
      <Paragraph position="6"> For the speaker-dependent systems trained on 2400 sentence utterances using the word-pair grammar, the unweighted total word error for mono-syllabic word subset is between 1.6% and 2.2% (with the MIT/LL system having a slightly (but not significantly) larger number of &amp;quot;resultant insertions&amp;quot;. For the corresponding case of poly-syllabic words, the unweighted total word error is 0.2% for each system.</Paragraph>
      <Paragraph position="7"> For the CMU speaker independent system, using the word-pair grammar, the unweighted total word error for mono-syllabic words is 5.6%, and for poly-syllabic words, 1.7%.</Paragraph>
      <Paragraph position="8"> By comparing the CMU speaker-independent system results to the best-trained speaker-dependent systems, one can observe that the error rates for mono-syllabic words are typically 3 to 4 times greater than for the speaker-dependent systems, and for poly-syllabic words, approximately 8 times larger. When making similar comparisons, using results for other speaker-independent systems and the best-trained speaker-dependent systems, the mono-syllabic word error rates are typically 4 to 6 times greater, and for poly-syllabic words, 12 times larger.</Paragraph>
      <Paragraph position="9"> It is clear from such comparisons that the well-trained speaker-dependent systems have achieved substantially greater success in modelling the poly-syllabic words than the speaker-independent systems.</Paragraph>
      <Paragraph position="10"> Comparisons With Other RM Test Sets Several sites have noted that the four speakers of the RM2 Corpus are significantly different from the speakers of the RM1 corpus. One speaker in particular appears to be a &amp;quot;goat&amp;quot;, and there may be two &amp;quot;sheep&amp;quot; -- to varying degrees for both speaker-dependent and speaker-independent systems. An ANOVA test should be implemented to address the significance of this effect.</Paragraph>
      <Paragraph position="11"> It has been noted that there appears to be a &amp;quot;within-session effect&amp;quot; -- with later sentence utterances being more difficult to recognize than earlier.</Paragraph>
      <Paragraph position="12"> It has been argued that overall performance is worse for this test set than for other recent test sets in the RM corpora, but this conclusion does not appear to be supported for all systems. Some sites have noted that performance for this test set is worse than for the RM2 Development Test Set, but the significance of this effect is unknown. Data for the current AT&amp;T system are available for both the Feb 89 and Oct 89 Speaker Independent Test Sets, and indicate total word errors of 5.2% and 4.7%, respectively (see Table 2) vs. 5.7% for the June 1990 RM2 test set (see Table 1), suggesting that the RM2 test set is more difficult. A similar comparison involving the current CMU data for the Feb 89 and Oct 89 Speaker Independent Test Sets indicates word error rates of 4.6% and 4.8%, respectively vs. 4.3% for the June 1990 test set, suggesting that for the current CMU system there is (probably insignificantly) better performance on the June 1990 test set.</Paragraph>
      <Paragraph position="13"> The significance of these differences is not known, but appears to vary from system to system.</Paragraph>
      <Paragraph position="14"> Summary This paper has presented NIST's tabulation and preliminary analysis of results reported for</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>