<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1040">
  <Title>Continuous Speech Recognition from a Phonetic Transcription</Title>
  <Section position="6" start_page="194" end_page="196" type="evalu">
    <SectionTitle>
7. Experimental Results
</SectionTitle>
    <Paragraph position="0"> All the tests described below, except for one informal listening test, were conducted on standard DARPA data that has been filtered and downsampled to a 4 kHz bandwidth. The training set consists of 3,267 sentences spoken by 109 different speakers, comprising about 4 hours of speech. Two test sets each consist of 300 sentences spoken by 10 speakers.</Paragraph>
    <Paragraph position="1"> The third test set comprises 54 sentences spoken by one of us (SEL), recorded using equipment similar to that used for the DARPA data. All four data sets are completely independent.</Paragraph>
    <Paragraph position="2"> The acoustic/phonetic model was trained as follows. The training data was segmented in terms of the 47 phonetic symbols by means of the segmental k-means algorithm [25]. All frames so assigned to each phonetic unit were collected, and sample statistics for the spectral means and covariances, $\mu_{ij}$ and $U_{ij}$, and the durational means and variances, $m_{ij}$ and $\sigma^2_{ij}$, were computed for $1 \le i,j \le 47$. If fewer than 500 samples were available for a particular value of i, then the samples for all values of i and fixed j were pooled and only a single statistic was computed and used for all values of i. The durational means and variances were then converted to the gamma-distribution parameters $\nu_{ij}$ and $\eta_{ij}$ according to $m_{ij} = \nu_{ij}/\eta_{ij}$ and $\sigma^2_{ij} = \nu_{ij}/\eta^2_{ij}$.</Paragraph>
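The moment-matching step above is easy to make concrete. The following minimal Python sketch (not the authors' code; the numeric example is made up) inverts the two relations to recover the gamma shape and rate from a duration sample mean and variance:

```python
def gamma_params_from_moments(mean, variance):
    """Method-of-moments fit of a gamma distribution to a duration sample.

    For a gamma density with shape nu and rate eta,
        mean     m  = nu / eta
        variance s2 = nu / eta**2,
    so eta = m / s2 and nu = m**2 / s2.
    """
    eta = mean / variance   # rate parameter
    nu = mean * eta         # shape parameter (= mean**2 / variance)
    return nu, eta

# Illustrative only: a phone whose duration averages 80 ms with variance 400 ms^2
nu, eta = gamma_params_from_moments(80.0, 400.0)
print(nu, eta)  # shape = 16.0, rate = 0.2
```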
    <Paragraph position="3"> The transition matrix was computed from the lexicon. All adjacent pairs of words allowed by the grammar were formed and all occurrences of phonetic units and bigrams were counted.</Paragraph>
    <Paragraph position="4"> These were then converted to transition probabilities according to
$$a_{ij} = \frac{N(i,j)}{N(i)} \qquad (14)$$
where $N(i,j)$ is the total number of occurrences of the bigram $q_i q_j$ and $N(i)$ is the total number of occurrences of the unit $q_i$.</Paragraph>
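A minimal sketch of how equation (14) can be computed from counts. The input format (a list of phone-symbol sequences, one per adjacent word pair allowed by the grammar) and the toy phone strings are assumptions of this sketch, not taken from the paper:

```python
from collections import Counter

def transition_matrix(phone_strings):
    """Estimate a_ij = N(i, j) / N(i) from a list of phone sequences."""
    unigrams = Counter()
    bigrams = Counter()
    for phones in phone_strings:
        for qi in phones:
            unigrams[qi] += 1                      # N(i)
        for qi, qj in zip(phones, phones[1:]):
            bigrams[(qi, qj)] += 1                 # N(i, j)
    return {(qi, qj): n / unigrams[qi] for (qi, qj), n in bigrams.items()}

# Tiny illustration with made-up phone symbols
a = transition_matrix([["dh", "ax", "sh", "ih", "p"],
                       ["dh", "ax", "t", "ae", "ng", "k"]])
print(a[("dh", "ax")])  # 1.0: "ax" always follows "dh" in this toy corpus
```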
    <Paragraph position="5"> Word recognition results are summarized in Table I. All results are for the perplexity 9 grammar.</Paragraph>
    <Paragraph position="6">  Data set train109 is a subset of the training data formed by taking two sentences at random from each of the training set speakers. This set was used for algorithm development. The three independent test sets were run only once. Recognition requires about 15 times real time on an 8 CE Alliant FX-80.</Paragraph>
    <Paragraph position="7"> Rather than try to measure the accuracy of the phonetic transcription directly, we tried to get an impression of its quality by listening to speech resynthesized from it. For this purpose we use the PRONOUNCE module of tts [26] with the decoded phonetic symbols, their durations, and a pitch contour computed by the harmonic sieve method [27]. The average data rate for these quantities is approximately 100 bps, pointing to the possible utility of the phonetic decoder as a very-low-bit-rate vocoder. Our informal test was made on six sentences recorded by one of us (SEL). An audio tape was made of the resynthesis and played for several listeners, from whose responses we judged that about 75% of the 91 words were intelligible. The speech recognition system gave a 96% word accuracy on these sentences. We have also recorded, decoded, and resynthesized several Harvard phonetically balanced sentences with nearly identical results. This is significant since these sentences have no vocabulary in common with the DARPA task.</Paragraph>
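The ~100 bps figure is consistent with a rough back-of-envelope budget. The sketch below uses assumed rates (a nominal 10 phones/s, a few bits per phone for quantized duration and pitch); none of these numbers come from the paper, and they serve only to show the order of magnitude:

```python
import math

phones_per_second = 10.0          # assumed speaking rate
symbol_bits = math.log2(47)       # ~5.55 bits to label one of 47 phones
duration_bits = 3.0               # assumed quantized duration per phone
pitch_bits = 2.0                  # assumed pitch-contour parameters per phone
print(phones_per_second * (symbol_bits + duration_bits + pitch_bits))  # ~105 bps
```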
    <Paragraph position="8"> 8. Interpretation of the Results
The results listed in Table I are approximately the same as those achieved by more conventional systems tested on the same data [13, 14, 15, 16] with the perplexity 60 grammar. Given the difficulty of the task and the early stage of development of this system, however, we consider these results quite respectable. Also, note that the performance on training data is not substantially different from that obtained on new test data, indicating a certain robustness of our method. Moreover, almost all of the insertions and deletions are of monosyllabic articles and prepositions which do not change the meaning of the sentence.</Paragraph>
    <Paragraph position="9"> It appears that there are two straightforward ways to improve performance. First, we need to improve the acoustic/phonetic model. A desirable structural change would be the incorporation of trigram phonotactics by making the underlying Markov chain second order [28]. This would allow us to associate the spectral distributions with three states rather than two, affording a better model of coarticulatory effects. Also, the spectral distributions can be made more faithful by using Gaussian mixtures rather than unimodal multivariate densities. Fidelity can be further improved by accounting for temporal correlations among observations. Finally, we need to make a global improvement in the model by optimizing it. We have repeatedly tried reestimation techniques but, thus far, they have actually degraded performance. We speculate that constraining the reestimation formulae by forcing the state sequence to be fixed will ameliorate the results of optimization. Second, we can improve the lexical access technique by rationalizing the insertion, deletion, substitution metric. One possible alternative is to replace the rhobar distance with error probabilities determined either analytically or empirically. Also, applying phonological rules to the fixed, citation-form pronunciations stored in the lexicon may eliminate some errors.</Paragraph>
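For concreteness, here is a minimal sketch of the kind of insertion/deletion/substitution alignment that lexical access relies on. The flat 0/1 substitution cost stands in for the rhobar distance or the error probabilities suggested above; all costs, phone symbols, and the input format are illustrative assumptions, not the authors' implementation:

```python
def weighted_edit_distance(decoded, lexical, ins=1.0, dele=1.0, sub=None):
    """Dynamic-programming alignment of a decoded phone string against a
    citation-form pronunciation, under per-operation costs."""
    if sub is None:
        sub = lambda a, b: 0.0 if a == b else 1.0   # placeholder substitution cost
    n, m = len(decoded), len(lexical)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + dele,       # delete a decoded phone
                          d[i][j - 1] + ins,        # insert a lexical phone
                          d[i - 1][j - 1] + sub(decoded[i - 1], lexical[j - 1]))
    return d[n][m]

# One substitution ("ih" vs "iy") separates the two toy pronunciations
print(weighted_edit_distance(["sh", "ih", "p"], ["sh", "iy", "p"]))  # 1.0
```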
  </Section>
</Paper>