<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2077">
  <Title>III. RESEARCH OPPORTUNITIES SCIENTIFIC OBJECTIVES</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ULTIMATE GOAL
</SectionTitle>
    <Paragraph position="0"> Spoken language is the most natural and common form of human-human communication, whether face to face, over the telephone, or through various communication media such as radio and television. In contrast, human-machine interaction is currently achieved largely through keyboard strokes, pointing, or other mechanical means, using highly stylized languages. Communication, whether human-human or human-machine, suffers greatly when the two communicating agents do not &quot;speak&quot; the same language. The ultimate goal of work on spoken language systems is to overcome this language barrier by building systems that provide the necessary interpretive function between various languages, thus establishing spoken language as a versatile and natural communication medium between humans and machines and among humans speaking different languages.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
GRAND CHALLENGES
</SectionTitle>
    <Paragraph position="0"> Spoken language systems differ widely in their capabilities and requirements.</Paragraph>
    <Paragraph position="1"> The three grand challenges for spoken language systems are:</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="463" type="metho">
    <SectionTitle>
* INTERACTIVE PROBLEM SOLVING -- interactive command, control, and
</SectionTitle>
    <Paragraph position="0"> information retrieval using voice input/output -- The system would require full integration of speech recognition and natural language understanding for input, and may require natural language generation and speech synthesis for output. Example applications include database query (e.g., airline reservations, library search, yellow pages with voice input), command and control, resource management (such as battle management workstation or logistics support), computer-assisted instruction, and aids for the handicapped.</Paragraph>
  </Section>
  <Section position="5" start_page="463" end_page="463" type="metho">
    <SectionTitle>
* AUTOMATIC DICTATION (transcription) -- The challenge lies in the system's
</SectionTitle>
    <Paragraph position="0"> ability to transcribe arbitrary spoken input with virtually unlimited vocabulary and types of sentence construction.</Paragraph>
  </Section>
  <Section position="6" start_page="463" end_page="463" type="metho">
    <SectionTitle>
* AUTOMATIC TRANSLATION -- multi-language voice input/output with
</SectionTitle>
    <Paragraph position="0"> automatic translation -- Example applications include automatic interpreter for multi-language speeches and meetings, translating telephone, and NATO field communications.</Paragraph>
    <Paragraph position="1"> While these challenges constitute long-term goals for spoken language systems, there are challenging but achievable shorter-term goals that would have significant economic impact. One near-term challenge would be to develop robust, operational voice-operated data entry and query systems with limited language understanding capabilities in actual applications.</Paragraph>
    <Paragraph position="2"> The grand challenges listed above require advances in speech processing (recognition and synthesis), natural language processing, and automatic translation. This report on spoken language systems has been written largely from a speech recognition point of view; the modeling of natural language, however, has such a great impact on the performance of a speech recognition system that language modeling becomes an important area for speech recognition research as well. The reader is referred to another document which covers the natural language processing areas in more detail. Also, natural speech synthesis is an important area that requires treatment beyond that allotted to it in this report.</Paragraph>
  </Section>
  <Section position="7" start_page="463" end_page="465" type="metho">
    <SectionTitle>
MISSING SCIENCE
</SectionTitle>
    <Paragraph position="0"> The areas of missing science fall into three general categories:
* Complete modeling of the speech signal and its variabilities to facilitate efficient information extraction for recognition and synthesis. These variabilities include phonetic and other linguistic effects, inter- and intra-speaker variabilities (including health condition and emotional state), and environmental acoustic variabilities.</Paragraph>
    <Paragraph position="1"> * Automatic acquisition and modeling of linguistic phenomena, including domain-dependent and domain-independent knowledge (lexicon, syntax, semantics, discourse, pragmatics, task structure), especially the modeling of actual spoken language.</Paragraph>
    <Paragraph position="2"> * Developing human factors methods for the design of user-friendly spoken language systems, including the use of clarification dialogues and the efficient training of users.</Paragraph>
    <Paragraph position="3"> Statistical methods capable of modeling signal variability in parameter space as well as time, such as hidden Markov models, have put the speech recognition problem on a solid theoretical basis and have resulted in significant advances in continuous speech recognition in the last decade. The performance of such systems, however, is still far from adequate for the ultimate goals stated above and far inferior to human performance. One can improve the performance of current systems somewhat through better signal processing and feature extraction, and through extensions of the existing theoretical paradigms.</Paragraph>
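The hidden Markov models credited above can be made concrete with a short sketch of the forward algorithm, which computes the probability a model assigns to an observation sequence. The two-state model, its probabilities, and the discrete observation symbols below are invented purely for illustration and are not drawn from any actual recognition system.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: probability of a discrete observation
    sequence under a hidden Markov model.

    pi  -- initial state distribution, shape (N,)
    A   -- state transition probabilities, shape (N, N)
    B   -- emission probabilities, shape (N, M)
    obs -- list of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]            # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate states, weight by emission
    return float(alpha.sum())

# Toy two-state, two-symbol model (all numbers invented for illustration).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

p = forward(pi, A, B, [0, 1, 0])         # likelihood of the sequence 0, 1, 0
```

In training, the same model parameters are re-estimated iteratively from data, which is the automatic-training property that put speech recognition on the solid theoretical basis described above.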
    <Paragraph position="4"> However, significant improvement in performance will require a more comprehensive modeling of the speech signal and its variabilities, including possibly the development of new theoretical paradigms. A prerequisite to developing improved speech models is acquiring the knowledge of how to extract the needed information from the speech signal and how to build appropriate recognition structures that can take advantage of this information. To be useful, this knowledge must be developed in the context of building advanced speech recognition systems. An important aspect of operational speech recognition systems will be their robustness to speaker and environmental variabilities. Of special interest to certain military applications, for example, is robustness to high levels of noise and stress. Methods that would adapt automatically and quickly to changes in speaker or in environment characteristics will need to be developed.</Paragraph>
    <Paragraph position="5"> More comprehensive models of the speech signal will also benefit the automatic synthesis of speech from text. Current commercial synthesis devices may be adequate for some applications, but their synthetic speech quality limits their wide use. The ability of a machine to produce natural speech quality will be an important output modality for an advanced interactive human-machine interface. Speech output with natural quality will require significant research into improved speech signal models, including proper modeling of prosody, many aspects of which depend on the linguistic constructs of the text to be synthesized.</Paragraph>
    <Paragraph position="6"> For humans, the speech understanding decisions depend on the acoustic content of the speech signal and the listener's expectation of what might be said. It is the purpose of the language model in the human to sharpen that expectation and to effect the understanding of the message. Similarly, in automatic speech understanding, it is the purpose of the language model to constrain the possible sequences of words, leading to improved recognition performance, and to interpret what was said if understanding of the message is desired. Even though much work remains to be done to solve the speech recognition problem as such, a major barrier to full realization of an advanced spoken language system arguably rests with the development of a mature natural language understanding technology. The speech recognition problem has benefited from the fact that the problem is well defined, where the input is the speech signal and the output is a set of words.</Paragraph>
    <Paragraph position="7"> Given the input and desired output, automatic methods have been developed for modeling various speech phenomena. These automatic methods have been crucial in advancing the state of the art in speech recognition. Furthermore, the performance of a speech recognition system can be evaluated by simply measuring the word error rate, for example, thus allowing for systematic and objective evaluations that can be used to improve system design. In contrast, the natural language understanding problem has not been as well defined. While the input here is taken to be a set of words, the output (i.e., the meaning of the utterance) is not well defined for all cases, nor is it well modeled computationally. One result has been the lack of rigorous evaluation of the performance of natural language systems. Partial theories for modeling meaning exist, and limited language understanding systems based on them have been built that may be useful in certain applications. The building of such systems has required the enormously labor-intensive process of developing grammars and semantic rules that map an input sentence into its meaning representation. The extent to which existing theories for modeling language are complete, however, has not been rigorously tested. While new linguistic theories may be needed to model a larger range of linguistic phenomena, there is a dire need to develop automatic or semi-automatic methods for the modeling of linguistic phenomena.</Paragraph>
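The word error rate invoked above is the minimum number of word substitutions, insertions, and deletions needed to turn the recognizer's hypothesis into the reference transcription, divided by the reference length. A minimal dynamic-programming sketch follows; the example sentences are invented, and this is not any official scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Minimum edit distance between word sequences, normalized by
    the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                           # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                           # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("show all flights to boston",
                      "show flights to austin")
```

Here one deletion ("all") and one substitution ("boston" recognized as "austin") give 2 errors over 5 reference words, i.e., a 40% word error rate.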
    <Paragraph position="8"> It is important to note that the language modeling problem is significant for speech recognition whether complete understanding of the input speech is required, as in the problem solving application, or merely transcription of what has been said, as in the dictation application. Statistical language models, for example, have been quite successful for the dictation application, without requiring understanding on the part of the machine. The translation problem, however, is directly affected by progress in modeling of syntax, semantics and discourse, especially in interactive applications over the telephone.</Paragraph>
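As an illustration of such a statistical language model, the sketch below estimates a bigram model from counts and uses it to score word sequences, with no understanding of meaning involved. The three-sentence corpus is invented, and real systems smooth these maximum-likelihood estimates to handle unseen word pairs.

```python
from collections import Counter

# Tiny invented corpus, purely for illustration.
corpus = [
    "show me the flights",
    "show me the fares",
    "list the flights",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()   # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev); unsmoothed.
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """Score a word sequence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p
```

Such a model can rank competing transcription hypotheses for dictation without any representation of what the words mean.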
    <Paragraph position="9"> One of the major obstacles to the fielding of spoken language systems is often the lack of an ergonomically sound design. It is therefore important to develop a human factors technology that is appropriate for the design of user-friendly spoken language systems. To compensate for possibly deficient language models, or for genuine ambiguity on the part of the user, it would be important to develop graceful and effective methods for machine generation of cooperative clarification dialogues with the user to resolve possible errors or ambiguities. It would also be useful to develop methods for training users of spoken language systems to learn the limitations of such systems in a relatively short period of time. The learning by users of the capabilities and limitations of complex systems is, of course, a generic problem in other fields as well. For some applications, spoken language input and output will need to be integrated with other input/output modalities such as typing, graphics, and pointing (by mouse or touch screen).</Paragraph>
  </Section>
  <Section position="8" start_page="465" end_page="466" type="metho">
    <SectionTitle>
BARRIERS TO PROGRESS
</SectionTitle>
    <Paragraph position="0"> The most important barriers to progress are the areas of missing science mentioned above. Our fundamental lack of understanding of spoken language must be overcome through concentrated and substantial research efforts. In addition, there are barriers to progress in the form of computing, data, human resources, and support:
* Lack of very fast computing and large on-line storage for research.</Paragraph>
    <Paragraph position="1"> * Unavailability of adequate and relevant speech and language databases and corpora.</Paragraph>
    <Paragraph position="2"> * No comprehensive educational and training programs for scientists and engineers in speech and natural language.</Paragraph>
    <Paragraph position="3"> * Lack of long-term research programs of sufficient size to build and test complete  experimental systems.</Paragraph>
    <Paragraph position="4"> Many of the advances in speech recognition in the last decade have benefited directly from the availability of faster computing. Significant additional advances can still be achieved simply by increasing computational power, which would allow experimentation with more compute-intensive ideas. It is estimated that future spoken language systems may require computing speeds of 100 gigaflops or more. However, the ready availability of machines for research which compute at rates of 100-1000 megaflops would advance the state of the art considerably. These should be general-purpose machines that are easily programmable in a higher-level language for research purposes. In addition, on-line storage capabilities of at least 100 gigabytes would be needed to store data for regularly performed experiments. (See the Appendix for a detailed analysis of computing and storage needs.)
It is rather difficult to model phenomena that one is not able to observe adequately. Another deficiency in resources, in fact, has been the dearth of speech and natural language corpora that manifest the various speech and linguistic variabilities. Efforts have begun to define some of the needed corpora. It is estimated that hundreds of hours of speech and billions of words may be required to represent all the natural language phenomena of interest for the different applications. Additionally, labor-intensive special linguistic labeling of large portions of the collected data will be needed for training and test purposes, especially if semi-automatic modeling of spoken linguistic phenomena is to be accomplished. (See the Appendix for a detailed analysis of the need for large corpora of speech and text.)
The third barrier to progress is another resource problem: the lack of properly trained scientists and engineers in spoken language research. 
First and foremost, a solid background is needed in a number of abstract and applied mathematical disciplines, including probability, statistics, linear systems, theory of computation and algorithms (including formal grammars, parsing, and search algorithms), logic, pattern recognition, and information theory. This background should be acquired by both speech scientists and computational linguists. In addition, speech scientists need to be trained in signal processing and speech communication, including phonetics, speech production and perception, speech analysis/synthesis, and speech recognition. Computational linguists need to be trained in phonology, morphology, syntax, semantics, and discourse, with special emphasis on data-driven, empirically-based linguistics. Courses in psycholinguistics, cognitive science, and human factors (ergonomics) would also benefit both groups. Currently, no programs offer the areas mentioned above in a coherent fashion. It is not typical for training in computational linguistics, for example, to include courses in probability and statistics. Nor are speech scientists usually properly trained in computer science. Many prominent schools do not even offer certain basic courses, such as pattern recognition.</Paragraph>
    <Paragraph position="5"> Finally, it is obvious that insufficient funding would be a serious barrier to progress. The funding would have to be sufficient to support several research groups with critical mass on a long-term basis, i.e., groups with sufficient size to be able to build and test complete systems. Support for smaller groups that concentrate on research issues in specific areas is also necessary; however, such groups will need to integrate their work into the larger systems. Support will also be required for the purchase of adequate computing facilities and for the specification and collection of massive corpora for the purpose of system training and testing.</Paragraph>
  </Section>
  <Section position="9" start_page="466" end_page="467" type="metho">
    <SectionTitle>
POTENTIAL BREAKTHROUGHS
</SectionTitle>
    <Paragraph position="0"> Within the next decade, a major potential breakthrough is that large-vocabulary continuous speech recognition systems will have improved sufficiently to allow them to be integrated in some everyday applications. To make this happen, the following component technical breakthroughs are needed and are likely:  for interactive applications.</Paragraph>
    <Paragraph position="1"> In speech synthesis, a potential breakthrough is synthesis of speech from text with more natural quality than currently possible, including more natural-sounding intonation and prosody. However, work in this area would have to be increased significantly for advances to take place.</Paragraph>
    <Paragraph position="2"> For breakthroughs in the natural language and machine translation areas, the reader is referred to another document on the topic.</Paragraph>
  </Section>
  <Section position="10" start_page="467" end_page="467" type="metho">
    <SectionTitle>
II. BACKGROUND
ASSESSMENT OF THE FIELD
</SectionTitle>
    <Paragraph position="0"> We will limit our discussion here to the assessment of speech recognition systems.</Paragraph>
    <Paragraph position="1"> A separate document will assess more thoroughly the natural language and translation areas. Systems that synthesize speech from arbitrary text currently have a rather stylized synthetic quality, with unnatural intonation patterns.</Paragraph>
    <Paragraph position="2"> The performance of continuous speech recognition systems is typically measured in terms of total word error rate, including insertions and deletions. For small vocabularies of less than 20 words, usually the digits plus some control words, speaker-independent performance (i.e., no special training needed for each speaker) has been measured at less than 1% word error rate. Systems with medium-size vocabularies of 100-200 words, using a constrained grammar with perplexity (average branching factor) on the order of 10, have achieved a word error rate of less than 1% in speaker-dependent mode. Both types of systems are available commercially.</Paragraph>
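The perplexity figures quoted here are computed from the probabilities a language model assigns to a test text: perplexity is 2 raised to the negated average per-word log2 probability. A minimal sketch, with the probability sequence invented for illustration:

```python
import math

def perplexity(word_probs):
    """word_probs: the model's probability for each successive word
    of a test text."""
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** (-avg_log2)

# A model that always faces a uniform choice among 10 equally likely
# next words has perplexity exactly 10 -- hence "average branching factor".
pp = perplexity([0.1] * 20)
```

Lower perplexity means the grammar constrains the word choices more tightly, which is why recognition error rates tend to rise with perplexity.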
    <Paragraph position="3"> Large-vocabulary systems of 1000 words, with grammars of perplexity 60, show continuous recognition performance of 5-10% word error rate. Systems with larger vocabularies typically are not operated in continuous mode, but rather the words are spoken in isolation, which tends to decrease the error rate. Very large-vocabulary systems of 20,000 words, spoken in isolation in speaker-dependent mode, perform at a 5% word error rate with a perplexity of 200. Large vocabulary systems are, for the most part, laboratory systems (some commercial isolated-word systems exist).</Paragraph>
    <Paragraph position="4"> The results mentioned above have been obtained largely in relatively controlled environments. Performance typically degrades under hostile acoustic conditions, especially in noisy military platforms; however, good performance has been obtained for restricted tasks over dialed-up telephone lines and with moderate amounts of background noise.</Paragraph>
    <Paragraph position="5"> Most of the grammars employed with speech recognition systems are quite constrained in their ability to model natural language. For interactive applications, where understanding of the input is necessary, the grammars utilized are typically small and treelike, with well-defined semantics. The integration of speech recognition with existing natural language understanding has started only recently.</Paragraph>
  </Section>
  <Section position="11" start_page="467" end_page="468" type="metho">
    <SectionTitle>
RELATIONSHIP TO OTHER FIELDS
</SectionTitle>
    <Paragraph position="0"> The design of spoken language systems depends on the integration of several of the following technologies, depending on the application: speech recognition and synthesis, natural language understanding and generation, automatic translation, and human factors engineering. It also depends on advances in computer architecture and accompanying software. Progress in all these areas is necessary for the reliable fielding of advanced systems.</Paragraph>
    <Paragraph position="1"> An area that is intimately related to speech recognition is that of machine learning and self-organizing systems. In fact, the recent advances in speech recognition rest almost exclusively on the development of computational models (e.g., hidden Markov models and others) for which automatic training (i.e., learning) methods have existed for some time and are often taken for granted. These learning algorithms estimate the values of the model parameters directly from data, typically in a few iterations, and are able to generalize the models to unseen data. Because the performance of speech recognition systems can be measured rigorously in terms of word error rate, the speech recognition problem could serve as a convenient testbed for the comparative testing of other learning algorithms, such as those associated with artificial neural networks.</Paragraph>
    <Paragraph position="2"> Another area that is closely related to speech recognition is that of speaker recognition, with its two branches: speaker verification (of a claimed identity) and speaker identification (of an unknown speaker) from a given speech utterance. The state of the art in these two areas already exceeds human performance doing the same task. It appears that humans are far better at recognizing what is being said, irrespective of who is talking, than at recognizing who is talking. As in speech recognition, the most successful approaches to speaker recognition have been those that characterize statistically the short-term spectral characteristics of a speaker. We expect that by working on recognizing speakers from their voices, we should be able to learn more about how to adapt a speech recognition system to the voice of a particular speaker.</Paragraph>
    <Paragraph position="3"> Speaker verification is typically used for secure access to information media (such as the telephone) or to physical locations. The performance of a speaker verification system is often measured by the average of the rate of false rejection of correct talkers (customers) and the rate of false acceptance of incorrect talkers (imposters). The state of the art is less than 1% average error rate; this number is relatively independent of the number of talkers that the system can handle. Relative to speech recognition, speaker verification technology is considered quite mature and commercial products already exist. Partly because of human factors issues, these products do not appear to be in widespread use as yet.</Paragraph>
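The averaged error measure described above is simple arithmetic over two kinds of mistakes; the following sketch uses invented trial counts, and the function name is an assumption for illustration:

```python
def average_error_rate(false_rejects, customer_trials,
                       false_accepts, imposter_trials):
    """Mean of the false-rejection rate (true customers wrongly
    turned away) and the false-acceptance rate (imposters wrongly
    let through)."""
    fr_rate = false_rejects / customer_trials
    fa_rate = false_accepts / imposter_trials
    return (fr_rate + fa_rate) / 2.0

# E.g., 2 rejections in 400 customer trials and 2 acceptances in
# 600 imposter trials give an average error rate below 1%.
err = average_error_rate(2, 400, 2, 600)
```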
    <Paragraph position="4"> In contradistinction to speaker verification, where the system can prompt the user to say a particular utterance, in speaker identification the identity of the speaker must be determined independent of what the speaker utters, which is inherently a more difficult task. Two possible applications of this technology are the automatic identification of speakers when transcribing the proceedings of a meeting or conference (human transcribers are good at transcription but not at identifying the speakers) and the identification of speakers for intelligence purposes. The performance of speaker identification systems depends on a large number of factors which include the number of speakers in the set to be identified, the class of communication channels being used, the total amount of speech and the number of conversations for each speaker used in training the speaker models, and the amount of data used in the identification process. If, for example, we wish to identify one of 20 speakers, with speaker models developed from a total of 60 s of speech from different conversations and 20 s used for identification, we expect to achieve at least 90% correct identification. With as little as 10 s for training and 2 s for identification and with communication taking place over highly variable radio channels, nearly 70% correct identification has been achieved.</Paragraph>
  </Section>
</Paper>