
<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1076">
  <Title>An 86,000-Word Recognizer Based on Phonemic Models</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INRS-Télécommunications
3 Place du Commerce
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="391" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We have developed an algorithm for the automatic conversion of dictated English sentences to written text, with essentially no restriction on the nature of the material dictated. We require that speakers undergo a short training session so that the system can adapt to their individual speaking characteristics and that they leave brief pauses between words. We have tested our algorithm extensively on an 86,000 word vocabulary (the largest of any such system in the world) using nine speakers and obtained word recognition rates on the order of 93Uo.</Paragraph>
    <Paragraph position="1"> Introduction Most speech recognition systems, research and commercial, impose severe restrictions on the vocabulary that may be used. For a system that aims to do speech-to-text conversion, this is a serious limitation since the speaker may be unable to express himself in his own words without leaving the vocabulary. From the outset we have worked with a very large vocabulary, based on the 60,000 words in Merriam Webster's Seventh New Collegiate Dictionary. We have augmented this number by 26,000 so that at present the probability of encountering a word not in the vocabulary in a text chosen at random from a newspaper, magazine or novel is less than 2% \[25\]. (More than 80% of out-of-vocabulary words are proper names.) Our vocabulary is thus larger than that of any other English language speech-to-text system. IBM has a real-time isolated word recognizer with a vocabulary of 20,000 words \[1\] giving over 95% word recognition on an office correspondence task. The perplexity \[16\] of this task is about 200; the corresponding figure in our case is 700. There is only one speech recognition project in the world having a larger vocabulary than ours; it is being developed by IBM France \[20\] and it requires that the user speak in isolated syllable mode, a constraint which may be reasonable in French but which would be very unnatural in English.</Paragraph>
    <Paragraph position="2"> Briefly, our approach to the problem of speech recognition is to apply the principle of naaximum a posteriori probability (MAP) using a stochastic model for the speech data associated with an arbitrary string of words. The model has three components: (i) a language model which assigns prior probabilities to word strings, (ii) a phonological component which assigns phonetic transcriptions to words in the dictionary and (iii) an acoustic-phonetic model which calculates the likelihood of speech data for an arbitrary phonetic transcription.</Paragraph>
    <Section position="1" start_page="0" end_page="391" type="sub_section">
      <SectionTitle>
Language Modeling
</SectionTitle>
      <Paragraph position="0"> We have trained a trigram language model, which assigns a prior probability distribution to words in the vocabulary based on the previous two words uttered, on 60 million words of text consisting of 1 million words from the Brown Corpus \[11\], 14 million from Hansard (the record of House of Commons debates), 21 million from the Globe and Mail and 24 million from the Montreal Gazette. 1 Reliable estimation of trigram statistics for our vocabulary would require a corpus which is several orders of magnitude larger and drawn from much more heterogeneous sources but such a corpus is not available today. Nonetheless we have found that the trigram model is capable of correcting over 60% of the errors made by the acoustic component of our recognizer; in the case of words for which trigram statistics can be compiled from the training corpus, 90% of the errors are corrected.</Paragraph>
      <Paragraph position="1"> Perhaps the simplest way of increasing recognition performance would be to increase the amount of training data for the language model. Although we are fortunate to have had access to a very large amount of data, we are still a long way from having a representative sample of contemporary written English. IBM has trained their language model using 200 million words of text. It seems that at least one billion words drawn from diverse sources are needed.</Paragraph>
      <Paragraph position="2"> We have found that it is possible to compensate to some extent for the lack of training data by training  parts-of-speech trigrams rather than word trigrams \[10\]. One of our graduate students has produced a Master's thesis which uses Markov modeling and the very detailed parts-of-speech tags with which the Brown Corpus is annotated to annotate new text automatically.</Paragraph>
      <Paragraph position="3"> We have also developed a syntactic parser which is capable of identifying over 30% of the recognition errors which occur after the trigram model \[22\].</Paragraph>
    </Section>
    <Section position="2" start_page="391" end_page="391" type="sub_section">
      <SectionTitle>
The Phonological Component
</SectionTitle>
      <Paragraph position="0"> In most cases Merriam Webster's Seventh New Collegiate Dictionary indicates only one pronunciation for each word. The transcriptions do not provide for phenomena such as consonant cluster reduction or epenthetic stops. Guided by acoustic recognition 2 errors, we have devised a comprehensive collection of context-dependent production rules which we use to derive a set of possible pronunciations for each word. This work is described in \[26\].</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>