<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1026"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 201-208, Vancouver, October 2005. c(c)2005 Association for Computational Linguistics Training Neural Network Language Models On Very Large Corpora [?]</Title> <Section position="3" start_page="203" end_page="206" type="metho"> <SectionTitle> 3 Application to Speech Recognition </SectionTitle> <Paragraph position="0"> In this paper the neural network LM is evaluated in a real-time speech recognizer for French Broadcast News. This is a very challenging task since the incorporation of the neural network LM into the speech recognizer must be very effective due to the time constraints. The speech recognizer itself runs in 0.95xRT2 and the neural network in less than 0.05xRT. The compute platform is an Intel Pentium 4 extreme (3.2GHz, 4GB RAM) running Fedora Core 2 with hyper-threading.</Paragraph> <Paragraph position="1"> The acoustic model uses tied-state position-dependent triphones trained on about 190 hours of Broadcast News data. The speech features consist of 39 cepstral parameters derived from a Mel frequency spectrum estimated on the 0-8kHz band (or 0-3.8kHz for telephone data) every 10ms. These cepstral coefficients are normalized on a segment cluster basis using cepstral mean removal and variance normalization. The feature vectors are linearly transformed (MLLT) to better fit the diagonal covariance Gaussians used for acoustic modeling.</Paragraph> <Paragraph position="2"> Decoding is performed in two passes. The first fast pass generates an initial hypothesis, followed by acoustic model adaptation (CMLLR and MLLR) and a second decode pass using the adapted models. Each pass generates a word lattice which is expanded with a 4-gram LM. The best solution is then extracted using pronunciation probabilities and consensus decoding. Both passes use very tight pruning thresholds, especially for the first pass, and fast Gaussian computation based on Gaussian short lists.</Paragraph> <Paragraph position="3"> For the final decoding pass, the acoustic models include 23k position-dependent triphones with 12k tied states, obtained using a divisive decision tree based clustering algorithm with a 35 base phone set.</Paragraph> <Paragraph position="4"> tiples of the length of the speech signal, the real time factor xRT. For a speech signal of 2h, a processing time of 0.5xRT corresponds to 1h of calculation.</Paragraph> <Paragraph position="5"> The system is described in more detail in (Gauvain et al., 2005).</Paragraph> <Paragraph position="6"> The neural network LM is used in the last pass to rescore the lattices. A short-list of length 8192 was used in order to fulfill the constraints on the processing time (the complexity of the neural network to calculate a LM probability is almost linear with the length of the short-list). This gives a coverage of about 85% when rescoring the lattices, i.e. the percentage of LM requests that are actually performed by the neural network.</Paragraph> <Section position="1" start_page="203" end_page="204" type="sub_section"> <SectionTitle> 3.1 Language model training data </SectionTitle> <Paragraph position="0"> The following resources have been used for lan- null First a language model was built for each corpus using modified Kneser-Ney smoothing as implemented in the SRI LM toolkit (Stolcke, 2002). The individual LMs were then interpolated and merged together. 
Table 1 summarizes the characteristics of the individual text corpora.</Paragraph> <Paragraph position="1"> (Table 1: number of words, perplexity on the development corpus and interpolation coefficients for each corpus.) Although the detailed transcriptions of the audio data represent only a small fraction of the available data, they get an interpolation coefficient of 0.43. This clearly shows that they are the most appropriate text source for the task. The commercial transcripts and the newspaper and WEB texts reflect the speaking style of broadcast news less well, but this is to some extent counterbalanced by the large amount of data. One could say that these texts help to learn the general grammar of the language. The word list includes 65301 words and the OOV rate is 0.95% on a development set of 158k words.</Paragraph> </Section> <Section position="2" start_page="204" end_page="204" type="sub_section"> <SectionTitle> 3.2 Training on in-domain data only </SectionTitle> <Paragraph position="0"> Following the above discussion, it seems natural to first train a neural network LM on the transcriptions of the acoustic data only. The architecture of the neural network is as follows: a continuous word representation of dimension 50, one hidden layer with 500 neurons and an output layer limited to the 8192 most frequent words. This results in 3.2M parameters for the continuous representation of the words and about 4.2M parameters for the second part of the neural network that estimates the probabilities. The network is trained using standard stochastic back-propagation, with the weights updated after each example. The learning rate was set to 0.005 with an exponential decay and the regularization term was weighted with 0.00003. Note that fast training of neural networks with more than 4M parameters on 4M examples is already a challenge.</Paragraph> <Paragraph position="1"> The same fast algorithms as described in (Schwenk, 2004) were used. Apparent convergence is obtained after about 40 epochs through the training data, each one taking 2h40 on a standard PC equipped with two Intel Xeon 2.8GHz CPUs.</Paragraph> <Paragraph position="2"> The neural network LM alone achieves a perplexity of 103.0, which is only a 4% relative reduction with respect to the back-off LM (107.4, see Table 1).</Paragraph> <Paragraph position="3"> If this neural network LM is interpolated with the back-off LM trained on the whole training set, the perplexity decreases from 70.2 to 67.6. Despite this small improvement in perplexity, a notable word error reduction was obtained, from 14.24% to 14.02%, with the lattice rescoring taking less than 0.05xRT.</Paragraph> <Paragraph position="4"> In the following sections, it is shown that larger improvements can be obtained by training the neural network on more data.</Paragraph> </Section> <Section position="3" start_page="204" end_page="205" type="sub_section"> <SectionTitle> 3.3 Adding selected data </SectionTitle> <Paragraph position="0"> Training the neural network LM with stochastic back-propagation on all the available text corpora would take quite a long time. The estimated time for one training epoch with the 88M words of commercial transcriptions is 58h, and more than 12 days if all the 508M words of newspaper texts were used.</Paragraph> <Paragraph position="1"> This is of course not very practical.
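To make the source of this cost concrete, the following back-of-the-envelope sketch reproduces the parameter counts quoted in Section 3.2; the constant names are ours, and the 4-gram context (three history words) is an assumption, since the order of the neural network LM is not restated here.

    # Rough parameter count for the neural network LM of Section 3.2
    # (variable names are ours; the three-word history is an assumption).
    vocab_size  = 65301   # size of the word list
    embed_dim   = 50      # continuous word representation
    context     = 3       # history words of a 4-gram LM (assumption)
    hidden_size = 500
    shortlist   = 8192    # output layer limited to the most frequent words

    projection = vocab_size * embed_dim                          # ~3.3M ("3.2M" in the text)
    hidden_in  = context * embed_dim * hidden_size + hidden_size # ~75k
    output     = hidden_size * shortlist + shortlist             # ~4.1M
    print(projection, hidden_in + output)                        # ~3.3M and ~4.2M parameters

    # Nearly all of the per-example work lies in the hidden-to-output
    # connections (500 x 8192), which is why the short-list length dominates
    # the cost of each forward and backward pass.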
One solution to this problem is to select a subset of the data that seems to be most useful for the task. This was done by selecting six months of the commercial transcriptions that minimize the perplexity on the development set. This gives a total of 22M words and the training time is about 14h per epoch.</Paragraph> <Paragraph position="3"> One can ask whether the capacity of the neural network should be increased in order to deal with the larger number of examples. Experiments with hidden layer sizes from 400 to 1000 neurons have been performed. (Figure: training time per epoch as a function of the size of the hidden layer; fixed 6-month subset of commercial transcripts.)</Paragraph> <Paragraph position="4"> Although there is a small decrease in perplexity and word error rate when increasing the dimension of the hidden layer, this comes at the expense of a higher processing time. The training and recognition times are in fact almost linear in the size of the hidden layer. An alternative way to increase the capacity of the neural network is to modify the dimension of the continuous representation of the words (in the range 50 to 150). The idea behind this is that the probability estimation may be easier in a higher dimensional space (instead of increasing the capacity of the non-linear probability estimator itself). This is similar in spirit to the theory behind support vector machines (Vapnik, 1998).</Paragraph> <Paragraph position="5"> Increasing the dimension of the projection layer has several advantages, as can be seen from Figure 2 (results as a function of the size of the continuous word representation; 500 hidden units, fixed 6-month subset of commercial transcripts). First, the perplexity and word error rates are lower than those obtained when the size of the hidden layer is increased. Second, convergence is faster: the best result is obtained after about 15 epochs while up to 40 are needed with large hidden layers. Finally, increasing the size of the continuous word representation has only a small effect on the training and recognition complexity of the neural network, since most of the calculation is done to propagate and learn the connections between the hidden and the output layer (see equation 6). The best result was obtained with a 120-dimensional continuous word representation. The perplexity is 67.9 after interpolation with the back-off LM and the word error rate is 13.88%.</Paragraph> </Section> <Section position="4" start_page="205" end_page="206" type="sub_section"> <SectionTitle> 3.4 Training on all available data </SectionTitle> <Paragraph position="0"> In this section an algorithm is proposed for training the neural network on arbitrarily large training corpora. The basic idea is quite simple: instead of performing several epochs over the whole training data, a different small random subset is used at each epoch. This procedure has several advantages, in particular the resampling adds some noise to the training procedure, which potentially increases the generalization performance.</Paragraph> <Paragraph position="1"> This algorithm is summarized in Figure 4. The parameters of this algorithm are the sizes of the random subsets that are used at each epoch. We chose to always use the full corpus of transcriptions of the acoustic data since this is the most appropriate data for the task. Experiments with different random subsets of the commercial transcriptions and the newspaper texts have been performed (see Figures 3 and 5; Figure 3 reports results for random subsets of the commercial transcriptions with a word representation of dimension 120 and 500 hidden units).
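Figure 4 itself is not reproduced here; the following is a minimal sketch of the resampling scheme as we read the description above. The function train_one_epoch, the data structures and the sampling fractions are placeholders, not the authors' implementation.

    import random

    # Sketch of the resampling training loop described above (placeholders, not
    # the authors' code). Each epoch uses the full set of acoustic transcriptions
    # plus a fresh random subset of every large corpus.
    def train_with_resampling(network, acoustic_transcripts, large_corpora,
                              fractions, train_one_epoch, n_epochs=20):
        """large_corpora: list of example lists (e.g. commercial transcripts,
           newspaper texts); fractions: portion of each corpus drawn per epoch,
           e.g. 0.10 and 0.01 as chosen in the experiments below."""
        for epoch in range(n_epochs):
            data = list(acoustic_transcripts)          # always keep the in-domain data
            for corpus, frac in zip(large_corpora, fractions):
                k = int(frac * len(corpus))
                data.extend(random.sample(corpus, k))  # new random subset each epoch
            random.shuffle(data)
            train_one_epoch(network, data)             # standard stochastic back-propagation
        return network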
In all cases the same neural network architecture was used, i.e. a 120-dimensional continuous word representation and 500 hidden units. Some experiments with larger hidden layers showed basically the same convergence behavior. The learning rate was again set to 0.005, but with a slower exponential decay. First of all, it can be seen from Figure 3 that the results are better when using random subsets instead of a fixed selection of 6 months, although each random subset is actually smaller (for instance a total of 12.5M examples for a subset of 10%). Best results were obtained when taking 10% of the commercial transcriptions. The perplexity is 66.7 after interpolation with the back-off LM and the word error rate is 13.81% (see the summary in Table 3; the perplexities there are given for the neural network LM alone and interpolated with the back-off LM trained on all the data, and the last column corresponds to three interpolated neural network LMs). Larger subsets of the commercial transcriptions lead to slower training, but don't give better results.</Paragraph> <Paragraph position="2"> Encouraged by these results, we also included the 508M words of newspaper texts in the training data.</Paragraph> <Paragraph position="3"> The sizes of the random subsets were chosen in order to use between 4 and 9M words of each corpus. Figure 5 (results for different random subsets of the commercial transcriptions and the newspaper texts) summarizes the results. There seems to be no obvious benefit from resampling large subsets of the individual corpora. We chose to resample 10% of the commercial transcriptions and 1% of the newspaper texts.</Paragraph> <Paragraph position="4"> Table 3 summarizes the results of the different neural network LMs. It can be clearly seen that the perplexity of the neural network LM alone decreases significantly with the amount of training data used. The perplexity after interpolation with the back-off LM changes only by a small amount, but there is a notable improvement in word error rate. This is further experimental evidence that the perplexity of an LM is not directly related to the word error rate.</Paragraph> <Paragraph position="5"> The best neural network LM achieves a word error reduction of 0.5% absolute with respect to the carefully tuned back-off LM (from 14.24% to 13.75%).</Paragraph> <Paragraph position="6"> The additional processing time needed to rescore the lattices is less than 0.05xRT. This is a significant improvement, in particular for a fast real-time continuous speech recognition system. When more processing time is available, a word error rate of 13.61% can be achieved by interpolating three neural networks together (in 0.14xRT).</Paragraph> <SectionTitle> 3.5 Using a better speech recognizer </SectionTitle> <Paragraph position="7"> The experimental results have also been validated using a second speech recognizer running in about 7xRT. This system differs from the real-time recognizer by a larger 200k word list, additional acoustic model adaptation passes and less pruning. Details are described in (Gauvain et al., 2005). The word error rate of the reference system using a back-off LM is 10.74%. This can be reduced to 10.51% using a neural network LM trained on the detailed transcriptions only, and to 10.20% when the neural network LM is trained on all data using the described resampling approach. Lattice rescoring takes about 0.2xRT.</Paragraph>
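All of these systems apply the neural network LM by answering LM requests during lattice rescoring. The sketch below illustrates one common way such requests can be served with a short-list model: words inside the short-list receive the (interpolated) neural network probability, the rest fall back to the 4-gram back-off LM. This is our own reading of the 85% coverage figure given earlier, not necessarily the exact normalization scheme used by the authors; nn_lm, backoff_lm and the interpolation weight are placeholders.

    # Hypothetical interface (our assumptions): answer a single lattice LM
    # request with a short-list neural network LM and a back-off LM fallback.
    def lm_probability(word, history, nn_lm, backoff_lm, shortlist, lam=0.5):
        """nn_lm.prob and backoff_lm.prob are assumed to return P(word | history);
           lam is the interpolation weight tuned on development data."""
        p_backoff = backoff_lm.prob(word, history)
        if word in shortlist:                    # about 85% of the requests
            p_nn = nn_lm.prob(word, history)     # one forward pass over the short-list
            return lam * p_nn + (1.0 - lam) * p_backoff
        return p_backoff                         # remaining ~15% of requests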
</Section> </Section> <Section position="4" start_page="206" end_page="207" type="metho"> <SectionTitle> 4 Conclusions and future work </SectionTitle> <Paragraph position="0"> Neural network language models are becoming a serious alternative to the widely used back-off language models. Consistent improvements in perplexity and word error rate have been reported (Bengio et al., 2003; Schwenk and Gauvain, 2004; Schwenk and Gauvain, 2005; Emami and Jelinek, 2004). In these works, however, the amount of training data was limited to a maximum of 20M words due to the high complexity of the training algorithm.</Paragraph> <Paragraph position="1"> In this paper new techniques have been described to train neural network language models on large text corpora (up to 600M words). The evaluation with a state-of-the-art speech recognition system for French Broadcast News showed a significant word error reduction of 0.5% absolute. The neural network LM is incorporated into the speech recognizer by rescoring lattices, which is done in less than 0.05xRT.</Paragraph> <Paragraph position="2"> Several extensions of the learning algorithm itself are promising. We are in particular interested in smarter ways to select different subsets from the large corpus at each epoch (instead of a random choice). One possibility would be to use active learning, i.e. focusing on the examples that are most useful for decreasing the perplexity. One could also imagine associating a probability with each training example and using these probabilities to weight the random sampling; these probabilities would be updated after each epoch. This is similar to boosting techniques (Freund, 1995), which sequentially build classifiers that focus on the examples wrongly classified by the preceding ones.</Paragraph> </Section> <Section position="5" start_page="207" end_page="207" type="metho"> <SectionTitle> 5 Acknowledgment </SectionTitle> <Paragraph position="0"> The authors would like to thank Yoshua Bengio for fruitful discussions and helpful comments. They would also like to recognize the contributions of G. Adda, M. Adda and L. Lamel for their involvement in the development of the speech recognition systems on which this work is based.</Paragraph> </Section> </Paper>