<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4006">
  <Title>Language model adaptation with MAP estimation and the perceptron algorithm</Title>
  <Section position="4" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
2.1 MAP language model adaptation
</SectionTitle>
    <Paragraph position="0"> To build an adapted n-gram model, we use a count merging approach, much as presented in (Bacchiani and Roark, 2003), which is shown to be a special case of maximum a posteriori (MAP) adaptation. Let wO be the out-of-domain corpus, and wI be the in-domain sample. Let h represent an n-gram history of zero or more words. Let ck(hw) denote the raw count of an n-gram hw in wk, for k [?] {O,I}. Let ^pk(hw) denote the standard Katz backoff model estimate of hw given wk. We define the corrected count of an n-gram hw as:</Paragraph>
    <Paragraph position="2"> where |wk |denotes the size of the sample wk. Then:</Paragraph>
    <Paragraph position="4"> (2) where th is a state dependent parameter that dictates how much the out-of-domain prior counts should be relied upon. The model is then defined as:</Paragraph>
    <Paragraph position="6"> where a is the backoff weight and hprime the backoff history for history h.</Paragraph>
    <Paragraph position="7"> The principal difficulty in MAP adaptation of this sort is determining the mixing parameters th in Eq. 2. Following (Bacchiani and Roark, 2003), we chose a single mixing parameter for each model that we built, i.e. th = t for all states h in the model.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Perceptron algorithm
</SectionTitle>
      <Paragraph position="0"> Our discriminative n-gram model training approach uses the perceptron algorithm, as presented in (Roark et al., 2004), which follows the general approach presented in (Collins, 2002). For brevity, we present the algorithm, not in full generality, but for the specific case of n-gram model training.</Paragraph>
      <Paragraph position="1"> The training set consists of N weighted word lattices produced by the baseline recognizer, and a gold-standard transcription for each of the N lattices. Following (Roark et al., 2004), we use the lowest WER hypothesis in the lattice as the gold-standard, rather than the reference transcription. The perceptron model is a linear model with k feature weights, all of which are initialized to 0. The algorithm is incremental, i.e. the parameters are updated at each example utterance in the training set in turn, and the updated parameters are used for the next utterance. After each pass over the training set, the model is evaluated on a held-out set, and the best performing model on this held-out set is the model used for testing.</Paragraph>
      <Paragraph position="2"> For a given path pi in a weighted word lattice L, let w[pi] be the cost of that path as given by the baseline recognizer. Let GL be the gold-standard transcription for L. Let Ph(pi) be the K-dimensional feature vector for pi, which contains the count within the path pi of each feature. In our case, these are unigram, bigram and trigram feature counts. Let -at [?] RK be the K-dimensional feature weight vector of the perceptron model at time t. The perceptron model feature weights are updated as follows  1. For the example lattice L at time t, find ^pit such that</Paragraph>
      <Paragraph position="4"> Note that if ^pit = GL, then the features are left unchanged. null As shown in (Roark et al., 2004), the perceptron feature weight vector can be encoded in a deterministic weighted finite state automaton (FSA), so that much of the feature weight update involves basic FSA operations, making the training relatively efficient in practice. As suggested in (Collins, 2002), we use the averaged perceptron when applying the model to held-out or test data. After each pass over the training data, the averaged perceptron model is output as a weighted FSA, which can be used by intersecting with a lattice output from the base-line system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>