<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-1042">
  <Title>SRI's DECIPHER System</Title>
  <Section position="2" start_page="0" end_page="238" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The hidden Markov model (HMM) formulation is a powerful statistical framework that is well-suited to the speech recognition problem. Systems based on this formulation have improved dramatically, however, as developers have learned how to modify them appropriately to take into account principles from speech research and from linguistics. Concepts arising from the study of the sound system of a language, i.e., phonology, are language-specific; they are done once for English and little additional labor is required when, for example, the vocabulary or the domain changes. Care should be taken, however, in modeling detailed linguistic structure since the practice can lead to models with many additional parameters to be estimated; unless this problem is addressed directly, performance gains will be compromised.</Paragraph>
    <Paragraph position="1"> SRI is not the first group to incorporate speech knowledge and concepts from linguistics in an HMM formulation for speech recognition. In fact, many improvements in I-IMM-based systems are implicitly related to principles from speech and linguistics, even though this was not their original motivation. We survey some of these modifications below.</Paragraph>
    <Paragraph position="2"> Phonetic Units. Not all recognizers are based on phonetic units. A number of HMM-based speech recognition systems have been based on word-level models \[1, 12, 9\]. The use of phonetic units allows for larger vocabularies by sharing training across sub-word units that repeat more frequently than words do. Phonetic-based units are now common to many HMM-based recognizers (e.g., \[7, 8, 10\]).</Paragraph>
    <Paragraph position="3"> Triphones. Triphones are phonetic models conditioned on the immediately surrounding phonetic units. BBN \[3\] was able to show significant performance gain (roughly halving the error rate compared to a similar system without the triphone models), provided the context-dependent models were averaged (&amp;quot;smoothed&amp;quot;) with the context-independent ones in order to deal with the large number of poorly trained triphone models. Triphones are used extensively now (e.g., at BBN, CMU, Lincoln Laboratories, SPd). Triphones can model major coarticulatory effects described in the speech research literature, as well as phonological variation conditioned on the immediately surrounding phones. In general, more detailed models will perform better than less detailed models, provided there is sufficient data to estimate the parameters. For 60 phones, there are 60 cubed triphones, which represents a significant increase in parameters to estimate. Triphones would not have shown a performance gain had they been introduced without a mechanism to take this into account, i.e., in this case, smoothing with more general models.</Paragraph>
    <Paragraph position="4"> Difference Parameters. The use of additional, independently trainable parameters means that more details can be included in the model without a dramatic increase in the amount of training material. In particular, recognition performance has been significantly increased \[8, 10\] through the use of codebooks that represent the difference between the current value of a parameter and its value several frames previously.</Paragraph>
    <Paragraph position="5"> Spectral and energy difference parameters are used in addition to their current values. This captures important dynamic patterns exhibited in speech as well as the standard static information.</Paragraph>
    <Paragraph position="6">  In this paper we describe SRI's recent work in incorporating linguistic concepts in the DECIPHER system: improved phonological modeling and modeling cross-word coarticulation.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML