<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2021">
  <Title>Speech Recognition of Czech - Inclusion of Rare Words Helps</Title>
  <Section position="3" start_page="0" end_page="122" type="metho">
    <SectionTitle>
2 Language Model
</SectionTitle>
    <Paragraph position="0"> Language models used in a rst pass of current speech recognition systems are usually built in the following way. First, a text corpus is acquired.</Paragraph>
    <Paragraph position="1"> In case of broadcast news, a newspaper collection or news transcriptions are a good source. Second, most frequent words are picked out to form a dictionary. Dictionary size is typically in tens of thousand words. For English, for example, dictionaries of size  of 60k words suf ciently cover common domains.</Paragraph>
    <Paragraph position="2"> (Of course, for recognition of entries listed in the Yellow pages, such limited dictionaries are clearly inappropriate.) Third, an a2 -gram language model is estimated. In case of Katz back-off model, the conditional bigram word probability is estimated as</Paragraph>
    <Paragraph position="4"> where a22a3 represents a smoothed probability distribution, a32a34a33a34a6a37a16 stands for the back-off weight, and a25 a6a39a38a40a16 denotes the count of its argument. Back-off model can be also nicely viewed as a nite state automaton as depicted in Figure 1.</Paragraph>
    <Paragraph position="6"> represented as a nite-state automaton.</Paragraph>
    <Paragraph position="7"> To alleviate the problem of a high OOV, we suggest to gather supplementary words and add them into the model in the following way.</Paragraph>
    <Paragraph position="9"> (2) a3a5a4a7a6a37a16 refers to the regular back-off model, a59 denotes the regular dictionary from which the back-off model was estimated, a64 is the supplementary dictionary which does not overlap with a59 .</Paragraph>
    <Paragraph position="10"> Several sources can be exploited to obtain supplementary dictionaries. Morphology tools can derive words which are close to those observed in corpus. In such a case, a62a24a6a9a8a43a10a66a16 can be set as a constant function and estimated on held-out data to maximize recognition accuracy.</Paragraph>
    <Paragraph position="12"> Having prior domain knowledge, new words which are expected to appear in audio recordings might be collected and added into a64 . Consider an example of transcribing an ice-hockey tournament. Names of new players are desirably in the vocabulary. Another source of a64 are the words which fell below the selection threshold of a59 . In large corpora, there are hundreds of thousands words which are omitted from the estimated language model. We suggest to put them into a64 . As it turned out, unigram probability of these words is very low, so it is suitable to increase their score to make them competitive with other words in a59 during recognition. a62a24a6a9a8 a10 a16 is then computed as</Paragraph>
    <Paragraph position="14"> where a72a56a6a9a8a43a10a73a16 refers to the relative frequency of a8a74a10 in a given corpus, shift denotes a shifting factor which should be tuned on some held-out data.</Paragraph>
    <Paragraph position="16"> injected by a supplementary dictionary Note that the probability of a word given its history is no longer proper probability. It does not adds up to one. We decided not to normalize the model for two reasons. First, we used a decoder which searches for the best path using Viterbi criterion, so there's no need for normalization. Second, normalization would have involved recomputing all back-off model weights and could also enforce re-tuning of the language model scaling factor. To rule out any variation which the re-tuning of the scaling factor could bring, we decided not to normalize the new model.</Paragraph>
    <Paragraph position="17"> In nite-state representation, injection of a new dictionary was implemented as depicted in Figure 2. Supplementary words form a loop in the back-off state.</Paragraph>
  </Section>
  <Section position="4" start_page="122" end_page="124" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> We have evaluated our approach on two corpora, Czech Broadcast News and the Czech portion of MALACH data.</Paragraph>
    <Section position="1" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
3.1 Czech Broadcast News Data
</SectionTitle>
      <Paragraph position="0"> The Czech Broadcast News (Radov*a et al., 2004) is a collection of both radio and TV news in Czech.</Paragraph>
      <Paragraph position="1"> Weather forecast, traf c announcements and sport news were excluded from this corpus. Our training portion comprises 22 hours of speech. To tune the language model scaling factor and additional LM parameters, we set aside 100 sentences. The test set consists of 2500 sentences.</Paragraph>
      <Paragraph position="2"> We used the HTK toolkit (Young et al., 1999) to extract acoustic features from sampled signal and to estimate acoustic models. As acoustic features we used 12 Mel-Frequency Cepstral Coef cients plus energy and delta and delta-delta features. We trained a triphone acoustic model with tied mixtures of continuous density Gaussians.</Paragraph>
      <Paragraph position="3"> As a LM training corpus we exploited a collection of newspaper articles from the Lidov*e Noviny (LN) newspaper. This collection was published as a part of the Prague Dependency Treebank by LDC (Haji c et al., 2001). This corpus contains 33 million tokens. Its vocabulary contains more than 650k word forms.</Paragraph>
      <Paragraph position="4"> OOV rates are displayed in Table 1.</Paragraph>
      <Paragraph position="5"> Dict. size OOV  Dictionaries contain the most frequent words.</Paragraph>
      <Paragraph position="6"> As can be readily observed, moderate-size vocabularies don't suf ciently cover the test data transcriptions. Therefore they are one of the major sources of poor recognition performance.</Paragraph>
      <Paragraph position="7"> The baseline language model was estimated from 60k most frequent words. It was a bigram Katz back-off model with Knesser-Ney smoothing pruned by the entropy-based method (Stolcke, 1998).</Paragraph>
      <Paragraph position="8"> As the supplementary dictionary we took the rest of words from the LN corpus. To learn the impact of injection of infrequent words, we carried out two experiments.</Paragraph>
      <Paragraph position="9"> First, we built a uniform loop which was injected into the back-off model. The uniform distribution was tuned on the held-out data. Tuning of this constant is displayed in Table 2.</Paragraph>
      <Paragraph position="10">  out set. WER denotes the word error rate.</Paragraph>
      <Paragraph position="11"> Second, we took relative frequencies multiplied by a shift coef cient as the injected model scores. This shift coef cient was again tuned on held-out data as shown in Table 3.</Paragraph>
      <Paragraph position="12">  model on the held-out set.</Paragraph>
      <Paragraph position="13"> Then, we took the best parameters and used them for recognition of the test data. Recognition results are depicted in Figure 4. The injection of supplementary words helped decrease both recognition word error rate and oracle word error rate. By oracle WER is meant WER of the path, stored in the generated lattice, which best matches the utterance regardless the scores. In other words, oracle WER gives us a bound on how well can we get by tuning scores in a given lattice. Injection of shifted unigram model brought relative improvement of 13.6% in terms of WER over the 60k baseline model. Uniform injection brought also signi cant improvement despite its simplicity. Indeed, we observed more than 10% relative improvement in terms of WER. In terms of oracle WER, unigram injection brought more than 30% relative improvement.</Paragraph>
      <Paragraph position="14">  stands for the oracle error rate.</Paragraph>
      <Paragraph position="15"> It's worthwhile to mention the model size, since it could be argued that the improvement was achieved by an enormous increase of the model. We decided to measure the model size using two factors. The disk space occupied by the language model and the disk space taken up by the so-called CLG. By CLG we mean a transducer which maps triphones to words augmented with the model scores. This transducer represents the search space investigated during recognition. More details on transducers in speech recognition can be found in (Mohri et al., 2002). Table 5 summarizes the sizes of the evaluated models.</Paragraph>
      <Paragraph position="16">  space. G denotes a language model compiled as a nite-state automaton. CLG denotes transducer mapping triphones to words augmented with model scores.</Paragraph>
      <Paragraph position="17"> Injection of supplementary words increased the model size only slightly. To see the difference in the size of injected models and traditionally built ones, we constructed a model of 80k most frequent words and pruned with the same threshold as the 60k LM. Not only did this 80k model give worse recognition results, but it also proved to be bigger.</Paragraph>
    </Section>
    <Section position="2" start_page="123" end_page="124" type="sub_section">
      <SectionTitle>
3.2 MALACH Data
</SectionTitle>
      <Paragraph position="0"> The next data we tested our approach on was the Czech portion of the MALACH corpus (http://www.clsp.jhu.edu/research/malach).</Paragraph>
      <Paragraph position="1"> MALACH is a multilingual audio-visual corpus.</Paragraph>
      <Paragraph position="2"> It contains recordings of survivors of World War II talking about war events. 600 people spoke in Czech, but only 350 recordings had been digitized till end of 2003. The interviewer and the interviewee had separate microphones, and were recorded on separate stereo channels. Recordings were stored in the MPEG-1 format. Average length of a testimony is 1.9 hours.</Paragraph>
      <Paragraph position="3"> 30 minutes from each testimony were transcribed and used as training data. 10 testimonies were transcribed completely and used for testing. The acoustic model used 15-dimensional PLP cepstral features, sampled at 10 msec. Modeling was done using the HTK Toolkit.</Paragraph>
      <Paragraph position="4"> The baseline language model was estimated from transcriptions of the survivors' testimonies. We worked with the standardized version of the transcriptions. More details regarding the Czech portion of the MALACH data can be found in (Psutka et al., 2004). Transcriptions are 610k words long and the entire vocabulary comprises 41k words. We refer to this corpus as TR 41k.</Paragraph>
      <Paragraph position="5"> To obtain a supplementary vocabulary, we used Czech morphology tools (Haji c and Vidov*a-Hladk*a, 1998). Out of 41k words we generated 416k words which were the in ected forms of the observed words in the corpus. Note that we posed restrictions on the generation procedure to avoid obsolete, archaic and uncommon expressions. To do so, we ran a Czech tagger on the transcriptions and thus obtained a list of all morphological tags of observed forms. The morphological generation was then conned to this set of tags.</Paragraph>
      <Paragraph position="6"> Since there is no corpus to train unigram scores of generated words on, we set the LM score of the generated forms to a constant.</Paragraph>
      <Paragraph position="7"> The transcriptions are not the only source of text data in the MALACH project. (Psutka et al., 2004) searched the Czech National Corpus (CNC) for sentences which are similar to the transcriptions. This additional corpus contains almost 16 million words, 330k types. CNC vocabulary overlaps to a large extent with TR vocabulary. This fact is not surprising since the selection criterion was based on a lemma unigram probability. Table 6 summarizes OOV rates of several dictionaries.</Paragraph>
      <Paragraph position="8"> We estimated several language models. The base-line models are pruned bigram back-off models with Knesser-Ney smoothing. The baseline word error  note the transcriptions, the Czech National Corpus, respectively. Morph refers to the dictionary generated by the morphology tools from from TR. Numbers in the dictionary names represent the dictionary size.</Paragraph>
      <Paragraph position="9"> rate of the model built solely from transcriptions was 37.35%. We injected constant loop of morphological variants into this model. In terms of text coverage, this action reduced OOV from 5.07% to 2.74%. In terms of recognition word error rate, we observed a relative improvement of 3.5%.</Paragraph>
      <Paragraph position="10"> In the next experiment we took as the baseline LM a linear interpolation of the LM built from transcriptions and a model estimated from the CNC corpus. Into this model, we injected a unigram loop of all the available words. That is the rest of words from the CNC corpus with unigram scores and words provided by morphology which were not already in the model. Table 7 summarizes the achieved WER and oracle WER. Given the fact that the injection only slightly reduced the OOV rate, a small relative reduction of 2.3% matched our expectations.</Paragraph>
      <Paragraph position="11">  line and injected models. Uniform Morph refers to the constant uniform loop of the morphology-generated words. Inj denotes the loop of the rest of words of the CNC corpus and the morphology-generated words.</Paragraph>
      <Paragraph position="12"> To learn how the injection affected model size, we measured size of the language model automaton and the optimized triphone-to-word transducer. As in the case of the LN corpus, injection increased the model size only moderately. Sizes of the models are shown in Table 8.</Paragraph>
      <Paragraph position="13">  to a language model compiled into an automaton, CLG denotes triphone-to-word transducer. CNC and Morph refer to a LM estimated from transcriptions and the Czech National Corpus, respectively. Morph represents the loop of words generated by morphology. Inj is the loop of all words from CNC which were not included in CNC language model, moreover, Inj also contains words generated by the morphology. null</Paragraph>
    </Section>
  </Section>
</Paper>