
<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1007">
  <Title>Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Linear Models, the Perceptron
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Algorithm, and Conditional Random
Fields
</SectionTitle>
      <Paragraph position="0"> This section describes a general framework, global linear models, and two parameter estimation methods within the framework, the perceptron algorithm and a method based on conditional random fields. The linear models we describe are general enough to be applicable to a diverse range of NLP and speech tasks - this section gives a general description of the approach. In the next section of the paper we describe how global linear models can be applied to speech recognition. In particular, we focus on how the decoding and parameter estimation problems can be implemented over lattices using finite-state techniques. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Global linear models
</SectionTitle>
      <Paragraph position="0"> We follow the framework outlined in Collins (2002; 2004). The task is to learn a mapping from inputs x2X to outputs y 2 Y. We assume the following components: (1) Training examples (xi;yi) for i = 1:::N.</Paragraph>
      <Paragraph position="1"> (2) A function GEN which enumerates a set of candidates GEN(x) for an input x. (3) A representation mapping each (x;y) 2 X Y to a feature vector (x;y)2Rd. (4) A parameter vector 2Rd.</Paragraph>
      <Paragraph position="2"> The components GEN; and define a mapping from an input x to an output F(x) through</Paragraph>
      <Paragraph position="4"> where (x;y) is the inner product Ps s s(x;y).</Paragraph>
      <Paragraph position="5"> The learning task is to set the parameter values using the training examples as evidence. The decoding algorithm is a method for searching for the y that maximizes Eq. 1.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Perceptron algorithm
</SectionTitle>
      <Paragraph position="0"> We now turn to methods for training the parameters of the model, given a set of training examples Inputs: Training examples (xi;yi)  (x1;y1):::(xN;yN). This section describes the perceptron algorithm, which was previously applied to language modeling in Roark et al. (2004). The next section describes an alternative method, based on conditional random fields.</Paragraph>
      <Paragraph position="1"> The perceptron algorithm is shown in figure 1. At each training example (xi;yi), the current best-scoring hypothesis zi is found, and if it differs from the reference yi , then the cost of each feature2 is increased by the count of that feature in zi and decreased by the count of that feature in yi. The features in the model are updated, and the algorithm moves to the next utterance. After each pass over the training data, performance on a held-out data set is evaluated, and the parameterization with the best performance on the held out set is what is ultimately produced by the algorithm.</Paragraph>
      <Paragraph position="2"> Following Collins (2002), we used the averaged parameters from the training algorithm in decoding held-out and test examples in our experiments. Say ti is the parameter vector after the i'th example is processed on the t'th pass through the data in the algorithm in figure 1. Then the averaged parameters AVG are defined as AVG = Pi;t ti=NT. Freund and Schapire (1999) originally proposed the averaged parameter method; it was shown to give substantial improvements in accuracy for tagging tasks in Collins (2002).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Conditional Random Fields
</SectionTitle>
      <Paragraph position="0"> Conditional Random Fields have been applied to NLP tasks such as parsing (Ratnaparkhi et al., 1994; Johnson et al., 1999), and tagging or segmentation tasks (Lafferty et al., 2001; Sha and Pereira, 2003; McCallum and Li, 2003; Pinto et al., 2003). CRFs use the parameters to define a conditional distribution over the members of GEN(x) for a given input x:</Paragraph>
      <Paragraph position="2"> normalization constant that depends on x and .</Paragraph>
      <Paragraph position="3"> Given these definitions, the log-likelihood of the training data under parameters is</Paragraph>
      <Paragraph position="5"> 2Note that here lattice weights are interpreted as costs, which changes the sign in the algorithm presented in figure 1.</Paragraph>
      <Paragraph position="6"> Following Johnson et al. (1999) and Lafferty et al.</Paragraph>
      <Paragraph position="7"> (2001), we use a zero-mean Gaussian prior on the parameters resulting in the regularized objective function:</Paragraph>
      <Paragraph position="9"> The value dictates the relative influence of the log-likelihood term vs. the prior, and is typically estimated using held-out data. The optimal parameters under this criterion are = argmax LLR( ).</Paragraph>
      <Paragraph position="10"> We use a limited memory variable metric method (Benson and Mor'e, 2002) to optimize LLR. There is a general implementation of this method in the Tao/PETSc software libraries (Balay et al., 2002; Benson et al., 2002). This technique has been shown to be very effective in a variety of NLP tasks (Malouf, 2002; Wallach, 2002). The main interface between the optimizer and the training data is a procedure which takes a parameter vector as input, and in turn returns LLR( ) as well as the gradient of LLR at . The derivative of the objective function with respect to a parameter s at parameter  will find it. The use of the Gaussian prior termjj jj2=2 2 in the objective function has been found to be useful in several NLP settings. It effectively ensures that there is a large penalty for parameter values in the model becoming too large - as such, it tends to control over-training. The choice ofLLR as an objective function can be justified as maximum a-posteriori (MAP) training within a Bayesian approach. An alternative justification comes through a connection to support vector machines and other large margin approaches. SVM-based approaches use an optimization criterion that is closely related to LLR - see Collins (2004) for more discussion.</Paragraph>
      <Paragraph position="11"> 3 Linear models for speech recognition We now describe how the formalism and algorithms in section 2 can be applied to language modeling for speech recognition.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The basic approach
</SectionTitle>
      <Paragraph position="0"> As described in the previous section, linear models require definitions ofX,Y, xi, yi, GEN, and a parameter estimation method. In the language modeling setting we takeX to be the set of all possible acoustic inputs;Y is the set of all possible strings, , for some vocabulary . Each xi is an utterance (a sequence of acoustic feature-vectors), and GEN(xi) is the set of possible transcriptions under a first pass recognizer. (GEN(xi) is a huge set, but will be represented compactly using a lattice - we will discuss this in detail shortly). We take yi to be the member of GEN(xi) with lowest error rate with respect to the reference transcription of xi.</Paragraph>
      <Paragraph position="1"> All that remains is to define the feature-vector representation, (x;y). In the general case, each component i(x;y) could be essentially any function of the acoustic input x and the candidate transcription y. The first feature we define is 0(x;y) as the log-probability of y givenxunder the lattice produced by the baseline recognizer. Thus this feature will include contributions from the acoustic model and the original language model. The remaining features are restricted to be functions over the transcription y alone and they track all n-grams up to some length (say n = 3), for example: 1(x;y) = Number of times &amp;quot;the the of&amp;quot; is seen in y At an abstract level, features of this form are introduced for all n-grams up to length 3 seen in some training data lattice, i.e., n-grams seen in any word sequence within the lattices. In practice, we consider methods that search for sparse parameter vectors , thus assigning many n-grams 0 weight. This will lead to more efficient algorithms that avoid dealing explicitly with the entire set of n-grams seen in training data.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Implementation using WFA
</SectionTitle>
      <Paragraph position="0"> We now give a brief sketch of how weighted finite-state automata (WFA) can be used to implement linear models for speech recognition. There are several papers describing the use of weighted automata and transducers for speech in detail, e.g., Mohri et al. (2002), but for clarity and completeness this section gives a brief description of the operations which we use.</Paragraph>
      <Paragraph position="1"> For our purpose, a WFA A = ( ;Q;qs;F;E; ), where is the vocabulary, Q is a (finite) set of states, qs 2 Q is a unique start state, F Q is a set of final states, E is a (finite) set of transitions, and : F !R is a function from final states to final weights. Each transition e2E is a tuple e = (l[e];p[e];n[e];w[e]), where l[e] 2 is a label (in our case, words), p[e] 2Q is the origin state of e, n[e] 2Q is the destination state of e, and w[e] 2 R is the weight of the transition. A successful path = e1:::ej is a sequence of transitions, such that p[e1] = qs, n[ej] 2 F, and for 1 &lt; k j, n[ek 1] = p[ek]. Let A be the set of successful paths in a WFA A. For any = e1:::ej, l[ ] = l[e1]:::l[ej].</Paragraph>
      <Paragraph position="2"> The weights of the WFA in our case are always in the log semiring, which means that the weight of a path =</Paragraph>
      <Paragraph position="4"> By convention, we use negative log probabilities as weights, so lower weights are better. All WFA that we will discuss in this paper are deterministic, i.e. there are no transitions, and for any two transitions e;e0 2 E, if p[e] = p[e0], then l[e] 6= l[e0]. Thus, for any string w = w1:::wj, there is at most one successful path 2 A, such that = e1:::ej and for 1 k j, l[ek] = wk, i.e. l[ ] = w. The set of strings w such that there exists a 2 A with l[ ] = w define a regular language LA .</Paragraph>
      <Paragraph position="5"> We can now define some operations that will be used in this paper.</Paragraph>
      <Paragraph position="6"> A. For a set of transitions E and 2 R, define</Paragraph>
      <Paragraph position="8"> as follows: A = ( ;Q;qs;F; E; ).</Paragraph>
      <Paragraph position="9"> A A0. The intersection of two deterministic WFAs A A0 in the log semiring is a deterministic WFA such that LA A0 = LATLA0. For any 2 A A0,</Paragraph>
      <Paragraph position="11"> BestPath(A). This operation takes a WFA A, and returns the best scoring path ^ = argmin 2 AwA[ ].</Paragraph>
      <Paragraph position="12"> MinErr(A;y). Given a WFA A, a string y, and an error-function E(y;w), this operation returns ^ = argmin 2 AE(y;l[ ]). This operation will generally be used with y as the reference transcription for a particular training example, and E(y;w) as some measure of the number of errors in w when compared to y. In this case, the MinErr operation returns the path 2 A such l[ ] has the smallest number of errors when compared to y.</Paragraph>
      <Paragraph position="13"> Norm(A). Given a WFA A, this operation yields a WFA A0 such that LA = LA0 and for every 2 A there is a 02 A0 such that l[ ] = l[ 0] and</Paragraph>
      <Paragraph position="15"> In other words the weights define a probability distribution over the paths.</Paragraph>
      <Paragraph position="16"> ExpCount(A;w). Given a WFA A and an n-gram w, we define the expected count of w in A as</Paragraph>
      <Paragraph position="18"> where C(w;l[ ]) is defined to be the number of times the n-gram w appears in a string l[ ].</Paragraph>
      <Paragraph position="19"> Given an acoustic input x, let Lx be a deterministic word-lattice produced by the baseline recognizer. The latticeLx is an acyclic WFA, representing a weighted set of possible transcriptions of x under the baseline recognizer. The weights represent the combination of acoustic and language model scores in the original recognizer.</Paragraph>
      <Paragraph position="20"> The new, discriminative language model constructed during training consists of a deterministic WFA which we will denote D, together with a single parameter 0.</Paragraph>
      <Paragraph position="21"> The parameter 0 is the weight for the log probability feature 0 given by the baseline recognizer. The WFA Dis constructed so that LD = and for all 2 D</Paragraph>
      <Paragraph position="23"> Recall that j(x;w) for j &gt; 0 is the count of the j'th n-gram in w, and j is the parameter associated with that</Paragraph>
      <Paragraph position="25"> n-gram. Then, by definition, 0L D accepts the same set of strings asL, but</Paragraph>
      <Paragraph position="27"> Thus decoding under our new model involves first producing a latticeLfrom the baseline recognizer; second, scaling L with 0 and intersecting it with the discriminative language modelD; third, finding the best scoring path in the new WFA.</Paragraph>
      <Paragraph position="28"> We now turn to training a model, or more explicitly, deriving a discriminative language model (D; 0) from a set of training examples. Given a training set (xi;ri) for i = 1:::N, where xi is an acoustic sequence, and ri is a reference transcription, we can construct latticesLi for</Paragraph>
      <Paragraph position="30"> derive target transcriptions yi = MinErr(Li;ri). The training algorithm is then a mapping from (Li;yi) for i = 1:::N to a pair (D; 0). Note that the construction of the language model requires two choices. The first concerns the choice of the set of n-gram features i for i = 1:::d implemented by D. The second concerns the choice of parameters i for i = 0:::d which assign weights to the n-gram features as well as the baseline feature 0.</Paragraph>
      <Paragraph position="31"> Before describing methods for training a discriminative language model using perceptron and CRF algorithms, we give a little more detail about the structure of D, focusing on how n-gram language models can be implemented with finite-state techniques.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Representation of n-gram language models
</SectionTitle>
      <Paragraph position="0"> An n-gram model can be efficiently represented in a deterministic WFA, through the use of failure transitions (Allauzen et al., 2003). Every string accepted by such an automaton has a single path through the automaton, and the weight of the string is the sum of the weights of the transitions in that path. In such a representation, every state in the automaton represents an n-gram history h, e.g. wi 2wi 1, and there are transitions leaving the state for every wordwi such that the featurehwi has a weight.</Paragraph>
      <Paragraph position="1"> There is also a failure transition leaving the state, labeled with some reserved symbol , which can only be traversed if the next symbol in the input does not match any transition leaving the state. This failure transition points to the backoff state h0, i.e. the n-gram history h minus its initial word. Figure 2 shows how a trigram model can be represented in such an automaton. See Allauzen et al.</Paragraph>
      <Paragraph position="2"> (2003) for more details.</Paragraph>
      <Paragraph position="3"> Note that in such a deterministic representation, the entire weight of all features associated with the word wi following history h must be assigned to the transition labeled with wi leaving the state h in the automaton. For example, if h = wi 2wi 1, then the trigram wi 2wi 1wi is a feature, as is the bigram wi 1wi and the unigram wi. In this case, the weight on the transition wi leaving state h must be the sum of the trigram, bigram and unigram feature weights. If only the trigram feature weight were assigned to the transition, neither the unigram nor the bigram feature contribution would be included in the path weight. In order to ensure that the correct weights are assigned to each string, every transition encoding an order k n-gram must carry the sum of the weights for all n-gram features of orders k. To ensure that every string in receives the correct weight, for any n-gram hw represented explicitly in the automaton, h0w must also be represented explicitly in the automaton, even if its weight is 0.</Paragraph>
    </Section>
    <Section position="8" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 The perceptron algorithm
</SectionTitle>
      <Paragraph position="0"> The perceptron algorithm is incremental, meaning that the language model D is built one training example at a time, during several passes over the training set. Initially, we buildDto accept all strings in with weight 0. For the perceptron experiments, we chose the parameter 0 to be a fixed constant, chosen by optimization on the held-out set. The loop in the algorithm in figure 1 is implemented as:</Paragraph>
      <Paragraph position="2"> If zi 6= MinErr(Li;ri), then update the feature weights as in figure 1 (modulo the sign, because of the use of costs), and modifyD so as to assign the correct weight to all strings.</Paragraph>
      <Paragraph position="3"> In addition, averaged parameters need to be stored (see section 2.2). These parameters will replace the unaveraged parameters inDonce training is completed.</Paragraph>
      <Paragraph position="4"> Note that the only n-gram features to be included in D at the end of the training process are those that occur in either a best scoring path zi or a minimum error path yi at some point during training. Thus the perceptron algorithm is in effect doing feature selection as a by-product of training. Given N training examples, and T passes over the training set,O(NT) n-grams will have non-zero weight after training. Experiments in Roark et al. (2004) suggest that the perceptron reaches optimal performance after a small number of training iterations, for example T = 1 or T = 2. Thus O(NT) can be very small compared to the full number of n-grams seen in all training lattices. In our experiments, the perceptron method chose around 1.4 million n-grams with non-zero weight. This compares to 43.65 million possible n-grams seen in the training data.</Paragraph>
      <Paragraph position="5"> This is a key contrast with conditional random fields, which optimize the parameters of a fixed feature set. Feature selection can be critical in our domain, as training and applying a discriminative language model over all n-grams seen in the training data (in either correct or incorrect transcriptions) may be computationally very demanding. One training scenario that we will consider will be using the output of the perceptron algorithm (the averaged parameters) to provide the feature set and the initial feature weights for use in the CRF algorithm. This leads to a model which is reasonably sparse, but has the benefit of CRF training, which as we will see gives gains in performance.</Paragraph>
    </Section>
    <Section position="9" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Conditional Random Fields
</SectionTitle>
      <Paragraph position="0"> The CRF methods that we use assume a fixed definition of the n-gram features i for i = 1:::d in the model.</Paragraph>
      <Paragraph position="1"> In the experimental section we will describe a number of ways of defining the feature set. The optimization methods we use begin at some initial setting for , and then search for the parameters which maximize LLR( ) as defined in Eq. 3.</Paragraph>
      <Paragraph position="2"> The optimization method requires calculation of LLR( ) and the gradient of LLR( ) for a series of values for . The first step in calculating these quantities is to take the parameter values , and to construct an acceptorDwhich accepts all strings in , such that</Paragraph>
      <Paragraph position="4"> (in the log domain) the distribution p (yjxi) over strings y2GEN(xi). The value of logp (yijxi) for any i can be computed by simply taking the path weight of such that l[ ] = yi in the new latticeL0i. Hence computation of LLR( ) in Eq. 3 is straightforward.</Paragraph>
      <Paragraph position="5"> Calculating the n-gram feature gradients for the CRF optimization is also relatively simple, onceL0i has been constructed. From the derivative in Eq. 4, for each i =</Paragraph>
      <Paragraph position="7"> must be computed. The first term is simply the number of times the j'th n-gram feature is seen in yi. The second term is the expected number of times that the j'th n-gram is seen in the acceptor L0i. If the j'th n-gram is w1:::wn, then this can be computed as ExpCount(L0i;w1:::wn). The GRM library, which was presented in Allauzen et al. (2003), has a direct implementation of the function ExpCount, which simultaneously calculates the expected value of all n-grams of order less than or equal to a given n in a latticeL.</Paragraph>
      <Paragraph position="8"> The one non-ngram feature weight that is being estimated is the weight 0 given to the baseline ASR negative log probability. Calculation of the gradient of LLR with respect to this parameter again requires calculation of the term in Eq. 8 for j = 0 and i = 1:::N. Computation of Py2GEN(xi)p (yjxi) 0(xi;y) turns out to be not as straightforward as calculating n-gram expectations. To do so, we rely upon the fact that 0(xi;y), the negative log probability of the path, decomposes to the sum of negative log probabilities of each transition in the path. We index each transition in the lattice Li, and store its negative log probability under the baseline model. We can then calculate the required gradient from L0i, by calculating the expected value in L0i of each indexed transition inLi.</Paragraph>
      <Paragraph position="9"> We found that an approximation to the gradient of 0, however, performed nearly identically to this exact gradient, while requiring substantially less computation.</Paragraph>
      <Paragraph position="10"> Let wn1 be a string of n words, labeling a path in wordlatticeL0i. For brevity, let Pi(wn1 ) = p (wn1jxi) be the conditional probability under the current model, and let Qi(wn1 ) be the probability of wn1 in the normalized base-line ASR lattice Norm(Li). Let Li be the set of strings in the language defined byLi. Then we wish to compute</Paragraph>
      <Paragraph position="12"> The approximation is to make the following Markov assumption:</Paragraph>
      <Paragraph position="14"> where Si is the set of all trigrams seen in Li. The term log Qi(zjxy) can be calculated once before training for every lattice in the training set; the ExpCount term is calculated as before using the GRM library. We have found this approximation to be effective in practice, and it was used for the trials reported below.</Paragraph>
      <Paragraph position="15"> When the gradients and conditional likelihoods are collected from all of the utterances in the training set, the contributions from the regularizer are combined to give an overall gradient and objective function value. These values are provided to the parameter estimation routine, which then returns the parameters for use in the next iteration. The accumulation of gradients for the feature set is the most time consuming part of the approach, but this is parallelizable, so that the computation can be divided among many processors.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical Results
</SectionTitle>
    <Paragraph position="0"> We present empirical results on the Rich Transcription 2002 evaluation test set (rt02), which we used as our development set, as well as on the Rich Transcription 2003 Spring evaluation CTS test set (rt03). The rt02 set consists of 6081 sentences (63804 words) and has three subsets: Switchboard 1, Switchboard 2, Switchboard Cellular. The rt03 set consists of 9050 sentences (76083 words) and has two subsets: Switchboard and Fisher.</Paragraph>
    <Paragraph position="1"> We used the same training set as that used in Roark et al. (2004). The training set consists of 276726 transcribed utterances (3047805 words), with an additional  iterations for CRF trials, contrasted with baseline recognizer performance and perceptron performance. Points are at every 20 iterations. Each point (x,y) is the WER at the iteration with the best objective function value in the interval (x-20,x]. each utterance, a weighted word-lattice was produced, representing alternative transcriptions, from the ASR system. From each word-lattice, the oracle best path was extracted, which gives the best word-error rate from among all of the hypotheses in the lattice. The oracle word-error rate for the training set lattices was 12.2%.</Paragraph>
    <Paragraph position="2"> We also performed trials with 1000-best lists for the same training set, rather than lattices. The oracle score for the 1000-best lists was 16.7%.</Paragraph>
    <Paragraph position="3"> To produce the word-lattices, each training utterance was processed by the baseline ASR system. However, these same utterances are what the acoustic and language models are built from, which leads to better performance on the training utterances than can be expected when the ASR system processes unseen utterances. To somewhat control for this, the training set was partitioned into 28 sets, and baseline Katz backoff trigram models were built for each set by including only transcripts from the other 27 sets. Since language models are generally far more prone to overtrain than standard acoustic models, this goes a long way toward making the training conditions similar to testing conditions.</Paragraph>
    <Paragraph position="4"> There are three baselines against which we are comparing. The first is the ASR baseline, with no reweighting from a discriminatively trained n-gram model. The other two baselines are with perceptron-trained n-gram model re-weighting, and were reported in Roark et al.</Paragraph>
    <Paragraph position="5"> (2004). The first of these is for a pruned-lattice trained trigram model, which showed a reduction in word error rate (WER) of 1.3%, from 39.2% to 37.9% on rt02.</Paragraph>
    <Paragraph position="6"> The second is for a 1000-best list trained trigram model, which performed only marginally worse than the latticetrained perceptron, at 38.0% on rt02.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Perceptron feature set
</SectionTitle>
      <Paragraph position="0"> We use the perceptron-trained models as the starting point for our CRF algorithm: the feature set given to the CRF algorithm is the feature set selected by the perceptron algorithm; the feature weights are initialized to those of the averaged perceptron. Figure 3 shows the performance of our three baselines versus three trials of  iterations for CRF trials, contrasted with baseline recognizer performance and perceptron performance. Points are at every 20 iterations. Each point (x,y) is the WER at the iteration with the best objective function value in the interval (x-20,x]. the CRF algorithm. In the first two trials, the training set consists of the pruned lattices, and the feature set is from the perceptron algorithm trained on pruned lattices. There were 1.4 million features in this feature set. The first trial set the regularizer constant =1, so that the algorithm was optimizing raw conditional likelihood.</Paragraph>
      <Paragraph position="1"> The second trial is with the regularizer constant = 0:5, which we found empirically to be a good parameterization on the held-out set. As can be seen from these results, regularization is critical.</Paragraph>
      <Paragraph position="2"> The third trial in this set uses the feature set from the perceptron algorithm trained on 1000-best lists, and uses CRF optimization on these on these same 1000-best lists.</Paragraph>
      <Paragraph position="3"> There were 0.9 million features in this feature set. For this trial, we also used = 0:5. As with the perceptron baselines, the n-best trial performs nearly identically with the pruned lattices, here also resulting in 37.4% WER. This may be useful for techniques that would be more expensive to extend to lattices versus n-best lists (e.g. models with unbounded dependencies).</Paragraph>
      <Paragraph position="4"> These trials demonstrate that the CRF algorithm can do a better job of estimating feature weights than the perceptron algorithm for the same feature set. As mentioned in the earlier section, feature selection is a by-product of the perceptron algorithm, but the CRF algorithm is given a set of features. The next two trials looked at selecting feature sets other than those provided by the perceptron algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Other feature sets
</SectionTitle>
      <Paragraph position="0"> In order for the feature weights to be non-zero in this approach, they must be observed in the training set. The number of unigram, bigram and trigram features with non-zero observations in the training set lattices is 43.65 million, or roughly 30 times the size of the perceptron feature set. Many of these features occur only rarely with very low conditional probabilities, and hence cannot meaningfully impact system performance. We pruned this feature set to include all unigrams and bigrams, but only those trigrams with an expected count of greater than 0.01 in the training set. That is, to be included, a  various trials, on both Switchboard 2002 test set (rt02), which was used as the dev set, and Switchboard 2003 test set (rt03). trigram must occur in a set of paths, the sum of the conditional probabilities of which must be greater than our threshold = 0:01. This threshold resulted in a feature set of roughly 12 million features, nearly 10 times the size of the perceptron feature set. For better comparability with that feature set, we set our thresholds higher, so that trigrams were pruned if their expected count fell below = 0:9, and bigrams were pruned if their expected count fell below = 0:1. We were concerned that this may leave out some of the features on the oracle paths, so we added back in all bigram and trigram features that occurred on oracle paths, giving a feature set of 1.5 million features, roughly the same size as the perceptron feature set.</Paragraph>
      <Paragraph position="1"> Figure 4 shows the results for three CRF trials versus our ASR baseline and the perceptron algorithm baseline trained on lattices. First, the result using the perceptron feature set provides us with a WER of 37.4%, as previously shown. The WER at convergence for the big feature set (12 million features) is 37.6%; the WER at convergence for the smaller feature set (1.5 million features) is 37.5%. While both of these other feature sets converge to performance close to that using the perceptron features, the number of iterations over the training data that are required to reach that level of performance are many more than for the perceptron-initialized feature set.</Paragraph>
      <Paragraph position="2"> Table 1 shows the word-error rate at the convergence iteration for the various trials, on both rt02 and rt03. All of the CRF trials are significantly better than the perceptron performance, using the Matched Pair Sentence Segment test for WER included with SCTK (NIST, 2000).</Paragraph>
      <Paragraph position="3"> On rt02, the N-best and perceptron initialized CRF trials were were significantly better than the lattice perceptron at p&lt; 0:001; the other two CRF trials were significantly better than the lattice perceptron at p &lt; 0:01. On rt03, the N-best CRF trial was significantly better than the lattice perceptron at p &lt; 0:002; the other three CRF trials were significantly better than the lattice perceptron at p&lt; 0:001.</Paragraph>
      <Paragraph position="4"> Finally, we measured the time of a single iteration over the training data on a single machine for the perceptron algorithm, the CRF algorithm using the approximation to the gradient of 0, and the CRF algorithm using an exact gradient of 0. Table 2 shows these times in hours. Because of the frequent update of the weights in the model, the perceptron algorithm is more expensive than the CRF algorithm for a single iteration. Further, the CRF algorithm is parallelizable, so that most of the work of an  Xeon 2.4Ghz processor with 4GB RAM.</Paragraph>
      <Paragraph position="5"> iteration can be shared among multiple processors. Our most common training setup for the CRF algorithm was parallelized between 20 processors, using the approximation to the gradient. In that setup, using the 1.4M feature set, one iteration of the perceptron algorithm took the same amount of real time as approximately 80 iterations of CRF.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>