<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0117">
  <Title>A Natural Language Correction Model for Continuous Speech Recognition 1</Title>
  <Section position="7" start_page="173" end_page="174" type="concl">
    <SectionTitle>
6. OTHER APPROACHES
</SectionTitle>
    <Paragraph position="0"> Natural language processing techniques have been used before to assist automated speech recognition, either in constructing better language models or as post-processors. For example, part-of-speech information has been used to reduce the overall perplexity of the language model. More advanced linguistic methods include re-ranking N-best sentence hypotheses using syntactic and lexical well-formedness. A good example is an ongoing effort by Grishman and Sekine (1994, 1996) at using a syntactic parser to pinpoint the correct transcription among the N-best alternatives returned by an SRS. Despite the intuitive appeal of their approach, in which transcribed sentences are re-ranked using the likelihood or degree of syntactic correctness, they have thus far been unable to obtain a noticeable reduction in word error rates, partly due to, as they point out, a limited possible range of improvement: one can only improve the N-best ranking if there is a better transcription among the alternatives. Other work on NLP-motivated re-ranking of N-best alternative transcriptions includes Rayner et al. (1994), Ostendorf et al. (1991), and Norton et al. (1992), among others; see also (Hirschman, 1994) and (Moore, 1994) for summary overviews. Some alternative possibilities for taking advantage of linguistic information in speech recognition are described by, e.g., Kupiec (1992), Murveit and Moore (1990), Maltese and Mancini (1992), and Schwartz et al. (1994).</Paragraph>
    <Paragraph position="1"> The post-correction approach has been considered by Ringger and Allen (1996), for a different domain (train scheduling dialogs) and using a probabilistic modeling technique rather than correction rules. They report some interesting preliminary results; however, these are not directly comparable to ours for two reasons. First, the baseline SRS used in their experiments is much weaker (only about 58% accurate). Second, the reported improvements cover corrections of vocabulary deficiencies, not only true transcription mistakes due to weaknesses of the language model.</Paragraph>
    <Paragraph position="2"> It should also be noted here that while the above efforts usually attack the more general problem of speech understanding accuracy in an ad-hoc speech production situation (though limited to certain domains such as broadcast news), our solution has been specifically tailored to clinical dictation, and would not necessarily apply elsewhere. Some of the application characteristics that we take advantage of are a relatively limited vocabulary (hence smaller training samples), a limited number of speakers (hence speaker independence is less critical), and the relatively low perplexity of the radiology sublanguage (about 20 vs. about 250 for general English, as measured for trigram models (Roukos, 1995)). The C-Box approach does not preclude N-best techniques; in fact, we consider this a natural extension of the present method, since one of the obvious limitations of the present approach is the need for parallel training texts, which may be replaced by multiple alternatives. Overall, the C-Box approach is partly related to error-driven learning techniques as used for part-of-speech tagging (Brill, 1992, 1995) and spelling correction (e.g., Golding &amp; Schabes, 1996).</Paragraph>
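The error-driven post-correction idea above can be sketched in miniature. This is a hypothetical illustration only: the rule format and the example rule pairs below are invented for exposition and are not the paper's actual C-Box rule representation.

```python
# Illustrative post-correction: an ordered list of (error phrase, correction)
# pairs, assumed to have been learned from parallel transcripts.
# The medical phrases below are invented examples, not real learned rules.
CORRECTION_RULES = [
    ("new mown ya", "pneumonia"),
    ("plural effusion", "pleural effusion"),
]

def apply_corrections(transcript, rules=CORRECTION_RULES):
    """Apply ordered correction rules to an SRS transcript.

    Rules are applied in sequence, so later rules see the output of
    earlier ones (the ordering matters, as in error-driven learning).
    """
    for wrong, right in rules:
        transcript = transcript.replace(wrong, right)
    return transcript
```

Because rules apply in order, a later rule can both refine and (as discussed in the conclusions below) potentially undo an earlier one, which is why rule validation is needed.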
    <Paragraph position="3"> Text alignment methods have been discussed primarily in the context of machine translation (e.g., Brown et al., 1991; Church, 1993; Chen, 1993), and we draw on these here. Rule validation is based in part on the N-gram weighting method described in (Strzalkowski &amp; Brandow, 1996).</Paragraph>
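A minimal sketch of alignment-based rule postulation, assuming whitespace-tokenized parallel transcripts. Python's standard-library difflib stands in here for the MT-style alignment methods cited above; the function name and the example sentences are illustrative, not the paper's implementation.

```python
import difflib

def postulate_rules(hypothesis, reference):
    """Align an SRS word sequence with the correct parallel transcript
    and emit a (wrong phrase, right phrase) pair for each mismatched span.

    hypothesis, reference: lists of word tokens.
    """
    matcher = difflib.SequenceMatcher(a=hypothesis, b=reference)
    rules = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # replace / insert / delete spans become rules
            rules.append((" ".join(hypothesis[i1:i2]),
                          " ".join(reference[j1:j2])))
    return rules
```

Each emitted pair is only a candidate rule; in the paper's setting, candidates would still be validated (e.g., by N-gram weighting) before use.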
    <Paragraph position="4"> 7. CONCLUSIONS, LIMITATIONS, AND FUTURE DIRECTIONS While we are generally pleased with the initial progress of this work, it is still quite early to draw any definite conclusions as to whether the C-Box method will prove sufficiently robust and effective in practical applications. For one thing, we have thus far avoided the question of speaker dependence. As described, the C-Box method is clearly speaker dependent; that is, the correction rules need to be learned for each new speaker. It remains to be seen whether the present solution is acceptable, and whether a degree of speaker independence can be introduced through rule generalization. Further research and evaluation are required with context-sensitive correction rules and various sizes of training data. At this time it is also an open question how much improvement can be achieved using this method, i.e., whether there is an upper bound, and if so what it is. On the face of it, this seems to depend only on how good the rules we can obtain are. In practice, we face limits on rule learnability due to sparse data, as well as rule interference (i.e., one rule may undo another). We plan to study these issues in the near future.</Paragraph>
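The rule-interference problem mentioned above can be detected mechanically. The sketch below assumes the same illustrative (error phrase, correction) rule representation as before; it is not the paper's actual interference analysis.

```python
def find_interference(rules):
    """Flag pairs of correction rules where one rule's output contains
    another rule's trigger phrase, so the second rule could rewrite
    (and possibly undo) the first rule's correction.

    rules: list of (wrong phrase, right phrase) pairs.
    """
    conflicts = []
    for r1 in rules:
        for r2 in rules:
            if r1 != r2 and r2[0] in r1[1]:
                conflicts.append((r1, r2))
    return conflicts
```

A simple mutual-undo case, such as one rule mapping "a b" to "c d" while another maps "c d" back to "a b", is reported in both directions by this check.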
    <Paragraph position="5"> There are also other possibilities. Some SRSs produce ranked lists of alternative transcriptions (N-best), which can be used to further improve the chances of making only the right corrections.</Paragraph>
    <Paragraph position="6"> Using N-best sentence hypotheses may also alleviate the need for parallel correct transcriptions in the training sample, as multiple hypotheses can be aligned in order to postulate correction rules. Using multiple SRSs in parallel may also increase the likelihood of locating and correcting spurious transcription errors. Finally, we may consider an open-box solution, where the information encoded in C-Box rules is fed back into the SRS language model to improve its baseline performance. At this time, we have made no attempt to address any of the problems related to spontaneous speech, such as disfluencies and self-repairs (e.g., Oviatt, 1994; Heeman &amp; Allen, 1994). In dictation, where the speaker normally has the option of backing up and re-recording, such things are less of an issue than the word-for-word accuracy of the final transcription, since there are serious liability considerations to be reckoned with.</Paragraph>
    <Paragraph position="7"> Acknowledgements. This research is based upon work supported in part under a cooperative agreement between the National Institute of Standards and Technology Advanced Technology Program (under the HITECC contract, number 70NANB5H1195) and the Healthcare Open Systems and Trials, Inc. consortium. The authors would like to thank all members of the HITECC/IMS project at GE CR&amp;D, GEMS, SMS, SCRA, UMMC, Advanced Radiology, and CAMC for their invaluable help, particularly Glenn Fields, Skip Crane, Steve Fritz and Scott Cheney.</Paragraph>
  </Section>
</Paper>