<?xml version="1.0" standalone="yes"?>
<Paper uid="P79-1005">
  <Title>TOWARD A COMPUTATIONAL THEORY OF SPEECH PERCEPTION</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
TOWARD A COMPUTATIONAL THEORY OF SPEECH PERCEPTION
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="17" type="abstr">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> In recent years, a great deal of evidence has been collected which gives substantially increased insight into the nature of human speech perception. It is the author's belief that such data can be effectively used to infer much of the structure of a practical speech recognition system. This paper details a new view of the role of structural constraints within the several structural domains (e.g. articulation, phonetics, phonology, syntax, semantics) that must be utilized to infer the desired percept.</Paragraph>
    <Paragraph position="1"> Each of the structural domains mentioned above has a substantial &quot;internal theory&quot; describing the constraints within that domain, but there are also many interactions between structural domains which must be considered.</Paragraph>
    <Paragraph position="2"> Thus words like &quot;incline&quot; and &quot;survey&quot; shift stress with syntactic role, and there is a pragmatic bias for the ambiguous sentence &quot;John called the boy who has smashed his car up.&quot; to be interpreted under a strategy that reflects a tendency for local completion of syntactic structures. It is clear, then, that while analysis within a structural domain (e.g. syntactic parsing) can be performed up to a point, interaction with other domains and integration of constraint strengths across these domains is needed for correct perception. The various constraints have differing and changing strengths at different points in an utterance, so that no fixed metric can be used to determine their contribution to the well-formedness of the utterance.</Paragraph>
    <Paragraph position="3"> At the segmental level, many diverse cues for segmental features have been found. As many as 16 cues mark the voicing distinction, for example. We may think of each of these cues as also representing a constraint, and the strength of the constraint varies with the context. For example, stop closure duration must be interpreted in the context of the local rate of speech, and a given value of closure duration can signify either a voiced or an unvoiced stop depending on the surrounding vowel durations. Thus several cues must be integrated to obtain the perceived segmental feature, and the weights assigned to each cue vary with the local context.</Paragraph>
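The cue-integration idea above can be illustrated with a minimal sketch. This is an assumed toy implementation, not the paper's procedure: the cue names, threshold values, and weighting scheme are all invented for illustration. It shows the two points the paragraph makes: a cue (closure duration) is first normalized by local speech rate, and each cue's weight depends on the context.

```python
# Toy sketch of context-dependent cue integration for the
# voiced/unvoiced stop decision. All numbers are illustrative
# assumptions, not values from the paper.

def voicing_score(cues, context):
    """Combine acoustic cues into a single voicing score (> 0 = voiced).

    cues    -- dict with raw measurements, e.g. closure duration and
               voice-onset time in milliseconds
    context -- dict with local context, e.g. a speech-rate factor
               (> 1.0 means faster-than-normal speech)
    """
    # A given closure duration signifies different things at different
    # speaking rates, so normalize it by the local rate first.
    closure = cues["closure_ms"] / context["rate_factor"]

    # Context-dependent weights: at very fast rates, duration cues are
    # assumed less reliable, so shift weight onto voice-onset time.
    w_closure = 0.6 if context["rate_factor"] <= 1.5 else 0.3
    w_vot = 1.0 - w_closure

    # Short closure and short voice-onset time both favor "voiced".
    score = 0.0
    score += w_closure * (1.0 if closure < 70 else -1.0)
    score += w_vot * (1.0 if cues["vot_ms"] < 25 else -1.0)
    return score

def classify_stop(cues, context):
    return "voiced" if voicing_score(cues, context) > 0 else "unvoiced"
```

The same absolute closure duration can thus flip its interpretation with rate, and the weight given to each cue shifts with the context, mirroring the integration described above.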
    <Paragraph position="4"> From the preceding examples, it is seen that in order to model human speech perception, it is necessary to dynamically integrate a wide variety of constraints. The evidence argues strongly for an active focussed search, whereby the perceptual mechanism knows, as the utterance unfolds, where the strongest constraint strengths are, and uses this reliable information, while ignoring &quot;cues&quot; that are unreliable or non-determining in the immediate context. For example, shadowing experiments have shown that listeners (performing the shadowing task) can restore disrupted words to their original form by using semantic and syntactic context, thus demonstrating the integration process. Furthermore, techniques are now available for analytically finding that information in an input stimulus which can maximally discriminate between two candidate prototypes, so that the perceptual control structure can focus only on such information to make a choice between the candidates.</Paragraph>
    <Paragraph position="5"> In this paper, we develop a theory for speech recognition which contains the required dynamic integration capability coupled with the ability to focus on a restricted set of cues which has been contextually selected.</Paragraph>
    <Paragraph position="6"> The model of speech recognition which we have developed requires, of course, an initial low-level analysis of the speech waveform to get started. We argue from the recent psycholinguistic literature that stressed syllables provide the required entry points. Stressed syllable peaks can be readily located, and use of the phonotactics of segmental distribution within syllables, together with the relatively clear articulation of syllable-initial consonants, allows us to formulate a robust procedure for determining initial segmental &quot;islands&quot;, around which further analysis can proceed. In fact, there is evidence to indicate that the human lexicon is organized and accessed via these stressed syllables. The restriction of the original analysis to these stressed syllables can be regarded as another form of focussed search, which in turn leads to additional searches dictated by the relative constraint strengths of the various domains contributing to the percept. We argue that these views are not only consonant with the current knowledge of human speech perception, but form the proper basis for the design of high-performance speech recognition systems.</Paragraph>
  </Section>
</Paper>