Training Neural Network Language Models On Very Large Corpora*

* This work was partially financed by the European Commission under the FP6 Integrated Project TC-STAR.

1 Introduction

Language models play an important role in many applications such as character and speech recognition, machine translation and information retrieval. Several approaches have been developed during the last decades, such as n-gram back-off word models (Katz, 1987), class models (Brown et al., 1992), structured language models (Chelba and Jelinek, 2000) and maximum entropy language models (Rosenfeld, 1996).

To the best of our knowledge, word and class n-gram back-off language models are still the dominant approach, at least in applications such as large vocabulary continuous speech recognition and statistical machine translation. Many publications have reported that modified Kneser-Ney smoothing (Chen and Goodman, 1999) achieves the best results. All the reference back-off language models (LMs) described in this paper are built with this technique, using the SRI LM toolkit (Stolcke, 2002).

The field of natural language processing has recently seen some changes through the introduction of new statistical techniques motivated by successful approaches from the machine learning community, in particular continuous space LMs using neural networks (Bengio and Ducharme, 2001; Bengio et al., 2003; Schwenk and Gauvain, 2002; Schwenk and Gauvain, 2004; Emami and Jelinek, 2004), random forest LMs (Xu and Jelinek, 2004) and random cluster LMs (Emami and Jelinek, 2005). New approaches are usually first verified on small tasks with a limited amount of LM training data. For instance, experiments have been performed on the Brown corpus (1.1M words), parts of the Wall Street Journal corpus (19M words) or transcriptions of acoustic training data (up to 22M words). It is much more challenging to compare the new statistical techniques to carefully optimized back-off LMs trained on large amounts of data (several hundred million words). Training may be difficult and very time consuming, and algorithms that are practical with several tens of millions of examples may be impracticable for larger amounts. Training back-off LMs on large amounts of data is not a problem, as long as powerful machines with enough memory are available to compute the word statistics. Practice has also shown that back-off LMs seem to perform very well when large amounts of training data are available, and it is not clear that the new approaches mentioned above are still of benefit in this situation.

In this paper we compare the neural network language model to an n-gram model with modified Kneser-Ney smoothing, using LM training corpora of up to 600M words. New algorithms are presented to effectively train the neural network on such amounts of data, and the necessary capacity is analyzed. The LMs are evaluated in a real-time state-of-the-art speech recognizer for French Broadcast News. Word error reductions of up to 0.5% absolute are reported.
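For reference, the back-off mechanism that these baseline models rely on can be sketched in a few lines. The sketch below is a simplified, hypothetical illustration (dictionary-based tables, log10 scores), not the SRI LM toolkit's actual implementation:

```python
def backoff_logprob(ngram, probs, bow):
    """Simplified Katz-style back-off lookup.

    ngram: tuple of words, e.g. ("the", "cat", "sat") -> P(sat | the cat)
    probs: dict mapping n-gram tuples to discounted log10 probabilities
    bow:   dict mapping context tuples to log10 back-off weights
    """
    if ngram in probs:                 # the full n-gram was observed in training
        return probs[ngram]
    if len(ngram) == 1:                # unigram floor (unknown-word handling omitted)
        return probs.get(ngram, -99.0)
    context = ngram[:-1]
    # Not observed: apply the back-off weight of the context and recurse on a shorter n-gram.
    return bow.get(context, 0.0) + backoff_logprob(ngram[1:], probs, bow)
```

The neural network LM described in the next section avoids this recursion by estimating a probability for every context of length n-1 directly.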
2 Architecture of the neural network LM

The basic idea of the neural network LM is to project the word indices onto a continuous space and to use a probability estimator operating on this space (Bengio and Ducharme, 2001; Bengio et al., 2003). Since the resulting probability functions are smooth functions of the word representation, better generalization to unknown n-grams can be expected. A neural network can be used to simultaneously learn the projection of the words onto the continuous space and to estimate the n-gram probabilities. This is still an n-gram approach, but the LM posterior probabilities are "interpolated" for any possible context of length n-1 instead of backing off to shorter contexts.

The architecture of the neural network n-gram LM is shown in Figure 1. A standard fully-connected multi-layer perceptron is used. The inputs to the neural network are the indices of the n-1 previous words in the vocabulary, $h_j = w_{j-n+1}, \dots, w_{j-2}, w_{j-1}$, and the outputs are the posterior probabilities of all words of the vocabulary:

$$P(w_j = i \,|\, h_j) \quad \forall i \in [1, N], \qquad (1)$$

where N is the size of the vocabulary. The input uses the so-called 1-of-n coding, i.e., the i-th word of the vocabulary is coded by setting the i-th element of the vector to 1 and all the other elements to 0.

[Figure 1: Architecture of the neural network language model. $h_j$ denotes the context $w_{j-n+1}, \dots, w_{j-1}$. P is the size of one projection, and H and N are the sizes of the hidden and output layer respectively. When shortlists are used, the size of the output layer is much smaller than the size of the vocabulary.]

The i-th row of the $N \times P$ dimensional projection matrix corresponds to the continuous representation of the i-th word. Let us denote these projections by $c_k$, the hidden layer activities by $d_j$, the outputs by $o_i$, their softmax normalization by $p_i$, and the hidden and output layer weights and the corresponding biases by $m_{jl}$, $b_j$, $v_{ij}$ and $k_i$. Using these notations, the neural network performs the following operations:

$$d_j = \tanh\Big(\sum_l m_{jl}\, c_l + b_j\Big) \qquad (2)$$
$$o_i = \sum_j v_{ij}\, d_j + k_i \qquad (3)$$
$$p_i = e^{o_i} \Big/ \sum_{r=1}^{N} e^{o_r} \qquad (4)$$

The value of the output neuron $p_i$ corresponds directly to the probability $P(w_j = i \,|\, h_j)$. Training is performed with the standard back-propagation algorithm, minimizing the following error function:

$$E = -\sum_{i=1}^{N} t_i \log p_i + \beta \Big(\sum_{jl} m_{jl}^2 + \sum_{ij} v_{ij}^2\Big) \qquad (5)$$

where $t_i$ denotes the desired output, i.e., the probability should be 1.0 for the next word in the training sentence and 0.0 for all the other ones. The first part of this equation is the cross-entropy between the output and the target probability distributions, and the second part is a regularization term that aims to prevent the neural network from overfitting the training data (weight decay). The parameter $\beta$ has to be determined experimentally.

It can be shown that the outputs of a neural network trained in this manner converge to the posterior probabilities. Therefore, the neural network directly minimizes the perplexity on the training data. Note also that the gradient is back-propagated through the projection layer, which means that the neural network learns the projection of the words onto the continuous space that is best for the probability estimation task.
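As an illustration, here is a minimal NumPy sketch of the forward pass in equations (2)-(4); the toy dimensions and variable names (proj, W_h, b_h, W_o, b_o) are assumptions for the example, not values from the paper:

```python
import numpy as np

# Toy dimensions; the paper uses e.g. N = 40k-200k, P = 50-200, H = 400-1000, n = 4.
N, P, H, n = 1000, 50, 200, 4
rng = np.random.default_rng(0)

proj = rng.normal(scale=0.1, size=(N, P))            # projection matrix; row i = code of word i
W_h  = rng.normal(scale=0.1, size=(H, (n - 1) * P))  # hidden layer weights m_jl
b_h  = np.zeros(H)                                   # hidden biases b_j
W_o  = rng.normal(scale=0.1, size=(N, H))            # output layer weights v_ij
b_o  = np.zeros(N)                                   # output biases k_i

def nnlm_probs(context):
    """Posterior P(w_j = i | h_j) for all i, given the n-1 context word indices."""
    c = proj[context].reshape(-1)        # look up and concatenate the n-1 projections
    d = np.tanh(W_h @ c + b_h)           # eq. (2): hidden layer
    o = W_o @ d + b_o                    # eq. (3): output layer
    o -= o.max()                         # shift for numerical stability
    p = np.exp(o) / np.exp(o).sum()      # eq. (4): softmax normalization
    return p

p = nnlm_probs([12, 7, 42])              # probabilities of all N words after a 3-word context
```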
The complexity of calculating one probability with this basic version of the neural network LM is quite high:

$$\big((n-1) \times P + 1\big) \times H + (H+1) \times N \qquad (6)$$

where P is the size of one projection, and H and N are the sizes of the hidden and output layer respectively. Usual values are n=4, P=50 to 200, H=400 to 1000 and N=40k to 200k. The complexity is dominated by the large size of the output layer. In this paper the improvements described in (Schwenk, 2004) have been used:

1. Lattice rescoring: speech recognition is done with a standard back-off LM and a word lattice is generated. The neural network LM is then used to rescore the lattice.

2. Shortlists: the neural network is only used to predict the LM probabilities of a subset of the whole vocabulary.

3. Regrouping: all LM probabilities needed for one lattice are collected and sorted. By these means, all LM probability requests with the same context $h_t$ lead to only one forward pass through the neural network.

4. Block mode: several examples are propagated at once through the neural network, allowing the use of faster matrix/matrix operations.

5. CPU optimization: machine-specific BLAS libraries are used for fast matrix and vector operations.

The idea behind shortlists is to use the neural network only to predict the s most frequent words, $s \ll |V|$, which drastically reduces the complexity. All words of the word list are still considered at the input of the neural network. The LM probabilities of words in the shortlist ($\hat{P}_N$) are calculated by the neural network, and the LM probabilities of the remaining words ($\hat{P}_B$) are obtained from a standard 4-gram back-off LM:

$$P(w_t|h_t) = \begin{cases} \hat{P}_N(w_t|h_t) \cdot P_S(h_t) & \text{if } w_t \in \text{shortlist} \\ \hat{P}_B(w_t|h_t) & \text{otherwise} \end{cases} \qquad (7)$$

$$\text{where} \quad P_S(h_t) = \sum_{w \in \text{shortlist}} \hat{P}_B(w|h_t).$$

It can be considered that the neural network redistributes the probability mass of all the words in the shortlist. This probability mass is precalculated and stored in the data structures of the back-off LM. A back-off technique is used if the probability mass for a requested input context is not directly available.

Normally, the output of a speech recognition system is the most likely word sequence given the acoustic signal, but it is often advantageous to preserve more information for subsequent processing steps. This is usually done by generating a lattice, a graph of possible solutions where each arc corresponds to a hypothesized word with its acoustic and language model scores. In this work, LIMSI's standard large vocabulary continuous speech recognition decoder is used to generate lattices using an n-gram back-off LM. These lattices are then processed by a separate tool, and all the LM probabilities on the arcs are replaced by those calculated by the neural network LM. During this lattice rescoring, LM probabilities with the same context $h_t$ are often requested several times, on potentially different nodes in the lattice. Collecting and regrouping all these calls prevents multiple forward passes, since all LM predictions for the same context are immediately available at the output.

Further improvements can be obtained by propagating several examples at once through the network, also known as bunch mode (Bilmes et al., 1997; Schwenk, 2004).
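To give an idea of the numbers: with n=4, P=50, H=500 and a shortlist of 2000 output words, equation (6) amounts to roughly 3·50·500 + 500·2000 ≈ 1.1M operations per probability, compared to about 100M with a full 200k-word output layer. The sketch below illustrates how regrouping by context and the shortlist combination of equation (7) fit together; the interfaces are hypothetical (nn_probs standing for the network of the previous sketch restricted to the shortlist, backoff_prob for the 4-gram back-off LM), and this is not the authors' implementation:

```python
from collections import defaultdict

SHORTLIST = 2000   # the s most frequent words, s << |V|, covered by the network's output layer

def rescore_requests(requests, nn_probs, backoff_prob):
    """Compute LM probabilities for (context, word) requests collected from one lattice.

    nn_probs(context)            -> array of shortlist posteriors from the neural network
    backoff_prob(word, context)  -> probability from the standard 4-gram back-off LM
    """
    by_context = defaultdict(list)
    for context, word in requests:            # regrouping: gather all words queried with the same context h_t
        by_context[tuple(context)].append(word)

    probs = {}
    for context, words in by_context.items():
        p_nn = nn_probs(list(context))        # one single forward pass for this context
        # Probability mass the back-off LM gives to the shortlist for this context;
        # the paper precalculates this mass and stores it in the back-off LM data structures.
        p_mass = sum(backoff_prob(w, context) for w in range(SHORTLIST))
        for w in words:
            # Assumes word indices are sorted by frequency, so indices below SHORTLIST are shortlist words.
            if w < SHORTLIST:                 # eq. (7): shortlist word, rescaled neural network probability
                probs[(context, w)] = p_nn[w] * p_mass
            else:                             # eq. (7): all other words come from the back-off LM
                probs[(context, w)] = backoff_prob(w, context)
    return probs
```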
In comparison to equations 2 and 3, this results in using matrix/matrix instead of matrix/vector operations, which can be aggressively optimized on current CPU architectures. The Intel Math Kernel Library was used. Bunch mode is also used for training the neural network. Training a typical network with a hidden layer of 500 nodes and a shortlist of length 2000 (about 1M parameters) takes less than one hour for one epoch through four million examples on a standard PC.
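As a rough sketch of bunch mode (again with illustrative variable names and without the paper's BLAS-level tuning), B examples are stacked so that the computations of equations (2) and (3) become matrix/matrix products, and one gradient step of the regularized cross-entropy of equation (5) is taken:

```python
import numpy as np

def bunch_step(proj, W_h, b_h, W_o, b_o, contexts, targets, lr=0.01, beta=1e-5):
    """One bunch-mode training step on B examples (a sketch, not the paper's code).

    contexts: (B, n-1) int array of context word indices
    targets:  (B,) int array of next-word indices
    """
    B = contexts.shape[0]
    C = proj[contexts].reshape(B, -1)       # (B, (n-1)P) matrix of concatenated projections
    D = np.tanh(C @ W_h.T + b_h)            # eq. (2) for the whole bunch: matrix/matrix product
    O = D @ W_o.T + b_o                     # eq. (3), again a matrix/matrix product
    O -= O.max(axis=1, keepdims=True)
    Pr = np.exp(O) / np.exp(O).sum(axis=1, keepdims=True)    # eq. (4)

    T = np.zeros_like(Pr)
    T[np.arange(B), targets] = 1.0          # targets t_i: 1.0 for the observed next word
    dO = (Pr - T) / B                       # gradient of the cross-entropy term of eq. (5)
    dD = (dO @ W_o) * (1.0 - D ** 2)        # back-propagate through the tanh hidden layer

    W_o -= lr * (dO.T @ D + 2 * beta * W_o) # updates include the weight-decay term of eq. (5)
    b_o -= lr * dO.sum(axis=0)
    W_h -= lr * (dD.T @ C + 2 * beta * W_h)
    b_h -= lr * dD.sum(axis=0)
    # In the full model the gradient is also propagated into `proj`; omitted here for brevity.
    return -np.log(Pr[np.arange(B), targets]).mean()   # average cross-entropy over the bunch
```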