<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1002">
  <Title>AUTOMATIC SPEECH RECOGNITION AND ITS APPLICATION TO INFORMATION EXTRACTION</Title>
  <Section position="4" start_page="11" end_page="75" type="metho">
    <SectionTitle>
2. BROADCAST NEWS DICTATION AND
INFORMATION EXTRACTION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
2.1 DARPA Broadcast News Dictation Project
</SectionTitle>
      <Paragraph position="0"> With the introduction of the broadcast news test bed to the DARPA project in 1995, the research effort took a profound step forward. Many of the deficiencies of the WSJ domain were resolved in the broadcast news domain \[3\]. Most importantly, the fact that broadcast news is a real-</Paragraph>
    </Section>
    <Section position="2" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
2.2 Japanese Broadcast News
Dictation System
</SectionTitle>
      <Paragraph position="0"> We have been developing a large-vocabulary continuous-speech recognition (LVCSR) system for Japanese broadcast-news speech transcription \[4\]\[5\]. This is a part of a joint research with the NHK broadcast company whose goal is the closed-captioning of TV programs. The broadcast-news manuscripts that were used for constructing the language models were taken from the period between July 1992 * and May 1996, and comprised roughly 500k sentences and 22M words. To calculate word n-gram language models, we segmented the broadcast-news manuscripts into words by using a morphological analyzer since Japanese sentences are written without spaces between words. A word-frequency list was derived for the news manuscripts, and the 20k most frequently used words were selected as vocabulary words.</Paragraph>
      <Paragraph position="1"> This 20k vocabulary covers about 98% of the words in the broadcast-news manuscripts. We calculated bigrams and trigrams and estimated unseen n-grams using Katz's back-off smoothing method.</Paragraph>
      <Paragraph position="2"> Japanese text is written by a mixture of three kinds of characters: Chinese characters (Kanji)  and two kinds of Japanese characters (Hira-gana and Kata-kana). Most Kanji have multiple readings, and correct readings can only be decided according to context. Conventional language models usually assign equal probability to all possible readings of each word. This causes recognition errors because the assigned probability is sometimes very different from the true probability. We therefore constructed a language model that depends on the readings of words in order to take into account the frequency and context-dependency of the readings.</Paragraph>
      <Paragraph position="3"> Broadcast news speech includes filled pauses at the beginning and in the middle of sentences, which cause recognition errors in our language models that use news manuscripts written prior to broadcasting. To cope with this problem, we introduced filled-pause modeling into the language model.</Paragraph>
      <Paragraph position="4"> Table 1 - Experimental results of Japanese broadcast news dictation with various language models (word error rate \[%\])  News speech data, from TV broadcasts in July 1996, were divided into two parts, a clean part and a noisy part, and were separately evaluated. The clean part consisted of utterances with no background noise, and the noisy part consisted of utterances with background noise. The noisy part included spontaneous speech such as reports by correspondents. We extracted 50 male utterances and 50 female utterances for each part, yielding four evaluation sets; male-clean (m/c), male-noisy (m/n), female-clean (f/c), femalenoisy (fin). Each set included utterances by five or six speakers. All utterances were manually segmented into sentences. Table 1 shows the experimental results for the baseline language model (LM 1) and the new language models. LM2 is the reading-dependent language model, and LM3 is a modification of LM2 by filled-pause modeling. For clean speech, LM2 reduced the word error rate by 4.7 % relative to LM1, and LM3 model reduced the word error rate by 10.9 % relative to LM2 on average.</Paragraph>
    </Section>
    <Section position="3" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
2.3 Information Extraction in the DARPA
Project
</SectionTitle>
      <Paragraph position="0"> News is filled with events, people, and organizations and all manner of relations among them. The great richness of material and the naturally evolving content in broadcast news has leveraged its value into areas of research well beyond speech recognition. In the DARPA project, the Spoken Document Retrieval (SDR) of TREC and the Topic Detection and Tracking (TDT) program are supported by the same materials and systems that have been developed in the broadcast news dictation arena \[3\]. BBN'sRough'n'Reddy system extracts structural features of broadcast news. CMU's Informedia \[6\], MITRE's Broadcast Navigator, and SRI's Maestro have all exploited the multi-media features of news producing a wide range of capabilities for browsing news archives interactively. These systems integrate various diverse speech and language technologies including speech recognition, speaker change detection, speaker identification, name extaction, topic classification and information retrieval.</Paragraph>
    </Section>
    <Section position="4" start_page="12" end_page="75" type="sub_section">
      <SectionTitle>
2.4 Information Extraction from Japanese
Broadcast News
</SectionTitle>
      <Paragraph position="0"> Summarizing transcribed news speech is useful for retrieving or indexing broadcast news. We investigated a method for extracting topic words from nouns in the speech recognition results on the basis of a significance measure \[4\]\[5\]. The extracted topic words were compared with &amp;quot;true&amp;quot; topic words, which were given by three human subjects. The results are shown in Figure 2.</Paragraph>
      <Paragraph position="1">  When the top five topic words were chosen (recall=13%), 87% of them were correct on average.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="75" end_page="75" type="metho">
    <SectionTitle>
3. HUMAN-COMPUTER DIALOGUE
SYSTEMS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
3.1 Typical Systems in the US and Europe
</SectionTitle>
      <Paragraph position="0"> Recently, a number of sites have been working on human-computer dialogue systems. The following are typical examples.</Paragraph>
      <Paragraph position="1">  focuses its speech research on a content-addressable multimedia information retrieval system, under a multi-lingual environment, where queries and multimedia documents may appear in multiple languages \[7\]. The system is called &amp;quot;View4You&amp;quot; and their research is conducted  in cooperation with the Informedia project at CMU \[6\]. In the View4You system, German and Servocroatian public newscasts are recorded daily. The newscasts are automatically segmented and an index is created for each of the segments by means of automatic speech recognition. The user can query the system in natural language by keyboard or through a speech utterance. The system returns a list of segments which is sorted by relevance with respect to the user query. By selecting a segment, the user can watch the corresponding part of the news show on his/her computer screen. The system overview is shown in Fig. 3.  (b) The SCAN- speech content based audio navigator at AT&amp;T Labs SCAN (Speech Content based Audio Navigator) is a spoken document retrieval system developed at AT&amp;T Labs integrating speaker-independent, large-vocabulary speech recognition with  information-retrieval to support query-based retrieval of information from speech archives \[8\]. Initial development focused on the application of SCAN to the broadcast news domain. An overview of the system architecture is provided in Fig. 4. The system consists of three components: (1) a speaker-independent large-vocabulary speech recognition engine which  segments the speech archive and generates transcripts, (2) an information-retrieval engine which indexes the transcriptions and formulates hypotheses regarding document relevance to user-submitted queries and (3) a graphical-userinterface which supports search and local contextual navigation based on the machine-generated transcripts and graphical representations of query-keyword distribution in the retrieved speech transcripts. The speech recognition component of SCAN includes an intonational phrase boundary detection module and a classification module, These subcomponents preprocess the speech data before passing the speech to the recognizer itself.  technology at MIT for several years. Recently, they have initiated a significant redesign of the GALAXY architecture to make it easier for researchers to develop their own applications, using either exclusively their own servers or intermixing them with servers developed by others.</Paragraph>
      <Paragraph position="2"> This redesign was done in part due to the fact that GALAXY has been designed as the first reference architecture for the new DARPA Communicator program. The resulting configuration of the GALAXY-II architecture is shown in Fig. 5. The boxes in this figure represent various human language technology servers as well as information and domain servers. The label in italics next to each box identifies the corresponding MIT system component. Interactions between servers are mediated by the hub and managed in the hub script. A particular dialogue session is initiated by a user either through interaction with a graphical interface at a Web site, through direct telephone dialup, or through a desktop agent.  prototype telephone information services for train travel information in several European countries \[ 10\]. In collaboration with the Vecsys company and with the SNCF (the French Railways), LIMSI has developed a prototype telephone service providing timetables, simulated fares and reservations, and information on reductions and services for the main French intercity connections. A prototype French/English service for the high speed trains between Paris and London is also under development. The system is based on the spoken language systems developed for the RailTel project \[11\] and the ESPRIT Mask project \[12\]. Compared to the RailTel system, the main advances in ARISE are in dialogue management, confidence measures, inclusion of optional spell mode for ci, ty/station names, and barge-in capability to allow more natural interaction between the user and the machine.</Paragraph>
    </Section>
    <Section position="2" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
3.2 Designing a Multimodal Dialogue System
for Information Retrieval
</SectionTitle>
      <Paragraph position="0"> We have recently investigated a paradigm for designing multimodal dialogue systems \[ 13\]. An example task of the system was to retrieve particular information about different shops in the Tokyo Metropolitan area, such as their names, addresses and phone numbers. The system accepted speech and screen touching as input, and presented retrieved information on a screen display or by synthesized speech as shown in Fig.</Paragraph>
      <Paragraph position="1"> 6. The speech recognition part was modeled by the FSN (finite state network) consisting of keywords and fillers, both of which were implemented by the DAWG (directed acyclic word-graph) structure. The number ofkeywords was 306, consisting of district names and business names. The fillers accepted roughly 100,000 non-keywords/phrases occuring in spontaneous speech. A variety of dialogue strategies were designed and evaluated based on an objective cost function having a set of actions and states as parameters. Expected dialogue cost The speech recognizer uses n-gram backoff language models estimated on the transcriptions of spoken queries. Since the amount of language model training data is small, some grammatical classes, such as cities, days and months, are used to provide more robust estimates of the n- null empirically determined threshold, the hypothesized word is marked as uncertain. The uncertain words are ignored by the understanding component or used by the dialogue manager to start clarification subdialogues.</Paragraph>
      <Paragraph position="2"> was calculated for each strategy, and the best strategy was selected according to the keyword recognition accuracy.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="75" end_page="75" type="metho">
    <SectionTitle>
4. ROBUST SPEECH
RECOGNITION
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
4.1 Automatic Adaptation
</SectionTitle>
      <Paragraph position="0"> variation in speech \[14\]. ~.</Paragraph>
      <Paragraph position="1"> It is crucial to establish methods that are robust against voice variation due to individuality, the physical and psychological condition of the speaker, telephone sets, microphones, network characteristics, additive background noise, speaking styles, and so on.</Paragraph>
      <Paragraph position="2"> Figure 8 shows main methods for making speech recognition systems robust against voice variation. It is also important for the systems to impose few restrictions on tasks and vocabulary. To solve these problems, it is essential to develop automatic adaptation techniques.</Paragraph>
      <Paragraph position="3"> Extraction and normalization of.</Paragraph>
      <Paragraph position="4"> (adaptation to) voice individuality is one of the most important issues \[ 14\]. A small percentage of people occasionally cause systems to produce exceptionally low recognition rates* This is an example of the &amp;quot;sheep and goats&amp;quot; phenomenon. Speaker adaptation (normalization) methods can usually be classified into supervised (text-dependent) and  instantaneous/incremental adaptation is ideal, since the system works as if it were a speaker-independent system, and it performs increasingly better as it is used. However, since we have to adapt many phonemes using a limited size of utterances including only a limited number of phonemes, it is crucial to use reasonable modeling of speaker-to-speaker variablity or constraints. Modeling of the mechanism of speech production is expected to provide a useful modeling of speaker-to-speaker variability.</Paragraph>
    </Section>
    <Section position="2" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
4.2 On-line speaker adaptation in broadcast news dictation
</SectionTitle>
      <Paragraph position="0"> Since, in broadcast news, each speaker utters several sentences in succession, the recognition error rate can be reduced by adapting acoustic models incrementally within a segment that contains only one speaker. We applied on-line, unsupervised, instantaneous and incremental speaker adaptation combined with automatic detection of speaker changes [4]. The MLLR [15]-MAP [16] and VFS (vector-field smoothing) [17] methods were instantaneously and incrementally carried out for each utterance. The adaptation process is as follows. For the first input utterance, the speaker-independent model is used for both recognition and adaptation, and the first speaker-adapted model is created. For the second input utterance, the likelihood value of the utterance given the speaker-independent model and that given the speaker-adapted model are calculated and compared. If the former value is larger, the utterance is considered to be the beginning of a new speaker, and another speaker-adapted model is created. Otherwise, the existing speaker-adapted model is incrementally adapted.</Paragraph>
      <Paragraph position="1"> For the succeeding input utterances, speaker changes are detected in the same way by comparing the acoustic likelihood values of each utterance obtained from the speaker-independent model and some speaker-adapted models. If the speaker-independent model yields a larger likelihood than any of the speaker-adapted models, a speaker change is detected and a new speaker-adapted model is constructed.</Paragraph>
      <Paragraph position="2"> Experimental results show that the adaptation reduced the word error rate by 11.8 % relative to the speaker-independent models.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="75" end_page="75" type="metho">
    <SectionTitle>
5. PERSPECTIVES OF LANGUAGE
MODELING
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
5.1 Language modeling for spontaneous speech recognition
</SectionTitle>
      <Paragraph position="0"> One of the most important issues for speech recognition is how to create language models (rules) for spontaneous speech. When recognizing spontaneous speech in dialogues, it is necessary to deal with variations that are not encountered when recognizing speech that is read from texts. These variations include extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, and repetitions. It is crucial to develop robust and flexible parsing algorithms that match the characteristics of spontaneous speech. A paradigm shift from the present transcription-based approach to a detection-based approach will be important for solving such problems [2]. How to extract contextual information, predict users' responses, and focus on key words are very important issues.</Paragraph>
      <Paragraph position="1"> Stochastic language modeling, such as bigrams and trigrams, has been a very powerful tool, so it would be very effective to extend its utility by incorporating semantic knowledge. It would also be useful to integrate unification grammars and context-free grammars for efficient word prediction. Style shifting is also an important problem in spontaneous speech recognition. In typical laboratory experiments, speakers are reading lists of words rather than trying to accomplish a real task. Users actually trying to accomplish a task, however, use a different linguistic style. Adaptation of linguistic models according to tasks, topics and speaking styles is a very important issue, since collecting a large linguistic database for every new task is difficult and costly.</Paragraph>
    </Section>
    <Section position="2" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
5.2 Message-Driven Speech Recognition
</SectionTitle>
      <Paragraph position="0"> State-of-the-art automatic speech recognition systems employ the criterion of maximizing P(/4,qX), where W is a word sequence, and X is an acoustic observation sequence. This criterion is reasonable for dictating read speech. However, the ultimate goal of automatic speech recognition is to extract the underlying messages of the speaker from the speech signals. Hence we need to model the process of speech generation and recognition as shown in Fig. 9 \[ 18\], where M is the message (content) that a speaker intended to convey.</Paragraph>
      <Paragraph position="1"> models in the same way as in usual recognition processes. We assume that P(M) has a uniform probability for all M. Therefore, we only need to consider further the term P(~M). We assume that P(~M) can be expressed as follows.</Paragraph>
      <Paragraph position="2">  According to this model, the speech recognition process is represented as the maximization of the following a posteriori probability \[4\]\[5\], (4) where ~, 0&lt;-/1.&lt;1, is a weighting factor. P(W), the first term of the right hand side, represents a part of P(~M) that is independent of Mand can be given by a general statistical language model. P'(WIM), the second term of the right hand side, represents the part ofP(WIA D that depends on M. We consider that M is represented by a co-occurrence of words based on the distributional hypothesis by Harris \[ 19\]. Since this approach formulates P'(WIM) without explicitly representing M, it can use information about the speaker's message M without being affected by the quantization problem of topic classes. This new formulation of speech recognition was applied to the Japanese broadcast news dictation, and it was found that word error rates for the clean set were slightly reduced by this method.</Paragraph>
      <Paragraph position="3"> maxP(MIX) = max\]~ P(MIW)P(WIX). (1) M M W Using Bayes' rule, Eq. (1) can be expressed as</Paragraph>
      <Paragraph position="5"> For simplicity, we can approximate the equation as</Paragraph>
      <Paragraph position="7"/>
    </Section>
  </Section>
</Paper>