<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1007">
  <Title>Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A crucial component of any speech recognizer is the language model (LM), which assigns scores or probabilities to candidate output strings. The language model is used in combination with an acoustic model to give an overall score to candidate word sequences that ranks them in order of probability or plausibility. A dominant approach in speech recognition has been to use a "source-channel", or "noisy-channel", model. In this approach, language modeling is effectively framed as density estimation: the language model's task is to define a distribution over the source, i.e., the possible strings in the language. Markov (n-gram) models, whose parameters are optimized to maximize the likelihood of a large amount of training text, are often used for this task. Recognition performance is a direct measure of the effectiveness of a language model; an indirect measure frequently proposed within these approaches is the perplexity of the LM (i.e., the log probability it assigns to some held-out data set).</Paragraph>
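    For reference, the source-channel decoding rule and the perplexity measure mentioned above can be written as follows; this is the conventional formulation rather than an equation reproduced from this paper:

    \[
      \hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\, P(W),
      \qquad
      \mathrm{PPL} \;=\; 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_1,\dots,w_{i-1})},
    \]

    where A is the acoustic input, W ranges over candidate word strings, and the perplexity is computed over the N words of a held-out data set.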
    <Paragraph position="1"> This paper explores alternative methods for language modeling, which complement the source-channel approach through discriminatively trained models. The language models we describe do not attempt to estimate a generative model P(w) over strings. Instead, they are trained on acoustic sequences with their transcriptions, in an attempt to directly optimize error-rate. Our work builds on previous work on language modeling using the perceptron algorithm, described in Roark et al. (2004).</Paragraph>
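    As a concrete illustration of the discriminative training referred to above, the following is a minimal sketch of perceptron reranking over n-best candidates with simple n-gram features. It is illustrative only: the feature extractor, the choice of oracle by word-error count, and the n-best (rather than lattice) setting are assumptions of the sketch, not the authors' implementation, and in practice an averaged variant of the weights is typically used.

from collections import defaultdict

def features(candidate_words, n=2):
    """Hypothetical n-gram feature extractor: counts of unigrams and bigrams."""
    feats = defaultdict(float)
    padded = ["<s>"] + list(candidate_words) + ["</s>"]
    for i in range(len(padded)):
        for k in range(1, n + 1):
            if i + k <= len(padded):
                feats[tuple(padded[i:i + k])] += 1.0
    return feats

def score(weights, feats, baseline_score, baseline_weight=1.0):
    """Linear score: scaled baseline recognizer score plus learned feature weights."""
    return baseline_weight * baseline_score + sum(weights[f] * v for f, v in feats.items())

def perceptron_train(nbest_lists, epochs=5):
    """nbest_lists: lists of (words, baseline_score, word_error_count) candidates."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for nbest in nbest_lists:
            oracle = min(nbest, key=lambda c: c[2])                  # lowest-error candidate
            best = max(nbest, key=lambda c: score(weights, features(c[0]), c[1]))
            if best[0] != oracle[0]:                                 # standard perceptron update
                for f, v in features(oracle[0]).items():
                    weights[f] += v
                for f, v in features(best[0]).items():
                    weights[f] -= v
    return weights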
    <Paragraph position="2"> In particular, we explore conditional random field (CRF) methods as an alternative to perceptron training. We describe how these models can be trained over lattices output by a baseline recognizer, and we give a number of experiments comparing the two approaches. The perceptron method gave a 1.3% absolute reduction in recognition error on the Switchboard domain; the CRF methods we describe give a further gain, for a final absolute improvement of 1.8%.</Paragraph>
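    The conditional random field alternative can be summarized by the usual regularized conditional log-likelihood objective over the candidate set produced by the baseline recognizer. The form below is the standard one for global log-linear models; the specific regularization settings used in the experiments are not reproduced here:

    \[
      P_\theta(W \mid A) \;=\; \frac{\exp\big(\theta \cdot \Phi(A, W)\big)}{\sum_{W' \in \mathrm{GEN}(A)} \exp\big(\theta \cdot \Phi(A, W')\big)},
      \qquad
      L(\theta) \;=\; \sum_{i} \log P_\theta(W_i \mid A_i) \;-\; \frac{\|\theta\|^2}{2\sigma^2},
    \]

    where GEN(A) is the set of candidate word sequences encoded in the lattice for acoustic input A and \Phi is the feature-vector representation of a candidate.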
    <Paragraph position="3"> A central issue we focus on concerns feature selection.</Paragraph>
    <Paragraph position="4"> The number of distinct n-grams in our training data is close to 45 million, and we show that CRF training converges very slowly even when trained with a subset (of size 12 million) of these features. Because of this, we explore methods for picking a small subset of the available features.1 [Footnote 1: A language model with fewer features is likely to be considerably more efficient when decoding new utterances.] The perceptron algorithm can be used as one method for feature selection, selecting around 1.5 million features in total. The CRF trained with this feature set, and initialized with parameters from perceptron training, converges much more quickly than other approaches, and also gives the optimal performance on the held-out set.</Paragraph>
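    A schematic version of the feature-selection recipe just described, reusing the perceptron sketch given earlier: keep only the features that received non-zero perceptron weights, then run conditional log-likelihood training restricted to (and initialized with) those weights. The trainer below operates over n-best lists with a toy batch gradient-ascent loop; it is a sketch of the idea, not the lattice-based CRF training used in the paper, and the function names are placeholders.

import math
from collections import defaultdict

def crf_nbest_train(nbest_lists, feature_fn, init_weights, steps=50, lr=0.1, sigma2=1.0):
    """Toy regularized conditional log-likelihood trainer over n-best lists."""
    weights = defaultdict(float, init_weights)
    allowed = set(init_weights)                            # perceptron-selected features only
    for _ in range(steps):
        grad = defaultdict(float)
        for nbest in nbest_lists:
            feats = [feature_fn(c[0]) for c in nbest]
            scores = [c[1] + sum(weights[f] * v for f, v in fs.items() if f in allowed)
                      for c, fs in zip(nbest, feats)]
            m = max(scores)
            probs = [math.exp(s - m) for s in scores]
            z = sum(probs)
            probs = [p / z for p in probs]
            oracle = min(range(len(nbest)), key=lambda i: nbest[i][2])
            for f, v in feats[oracle].items():             # observed feature counts
                if f in allowed:
                    grad[f] += v
            for p, fs in zip(probs, feats):                # expected feature counts
                for f, v in fs.items():
                    if f in allowed:
                        grad[f] -= p * v
        for f in allowed:                                  # gradient step with Gaussian prior
            weights[f] += lr * (grad[f] - weights[f] / sigma2)
    return dict(weights)

def perceptron_selected_crf(nbest_lists):
    """Select features with the perceptron, then initialize CRF training from them."""
    p_weights = perceptron_train(nbest_lists)              # from the earlier sketch
    selected = {f: w for f, w in p_weights.items() if w != 0.0}
    return crf_nbest_train(nbest_lists, features, selected)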
    <Paragraph position="5"> We explore other approaches to feature selection, but find that the perceptron-based approach gives the best results in our experiments.</Paragraph>
    <Paragraph position="6"> While we focus on n-gram models, we stress that our methods are applicable to more general language modeling features, for example the syntactic features explored in Khudanpur and Wu (2000). We intend to explore such features in future work. Experimental results with n-gram models on 1000-best lists show a very small drop in accuracy compared to the use of lattices. This is encouraging: it suggests that models with features more flexible than n-grams, which therefore cannot be used efficiently with lattices, may not be unduly harmed by a restriction to n-best lists.</Paragraph>
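    To illustrate the last point: once candidates are explicit strings in an n-best list, arbitrary non-local features can be computed per hypothesis and dropped into the reranking sketches above, whereas lattice-based training requires features that decompose over lattice arcs. The feature below is purely hypothetical and only meant to show the interface.

def sentence_level_features(candidate_words):
    """Hypothetical feature set mixing n-grams with a whole-string feature that
    would not decompose over lattice arcs."""
    feats = features(candidate_words)                      # n-gram features from the earlier sketch
    feats[("length_bucket", len(candidate_words) // 5)] = 1.0
    return feats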
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Related Work
</SectionTitle>
      <Paragraph position="0"> Large vocabulary ASR has benefitted from discriminative estimation of Hidden Markov Model (HMM) parameters in the form of Maximum Mutual Information Estimation (MMIE) or Conditional Maximum Likelihood Estimation (CMLE). Woodland and Povey (2000) have shown the effectiveness of lattice-based MMIE/CMLE in challenging large scale ASR tasks such as Switchboard.</Paragraph>
      <Paragraph position="1"> In fact, state-of-the-art acoustic modeling, as seen, for example, at annual Switchboard evaluations, invariably includes some kind of discriminative training.</Paragraph>
      <Paragraph position="2"> Discriminative estimation of language models has also been proposed in recent years. Jelinek (1995) suggested an acoustic sensitive language model whose parameters are estimated by minimizing H(W|A), the expected uncertainty of the spoken text W given the acoustic sequence A. Stolcke and Weintraub (1998) experimented with various discriminative approaches, including MMIE, with mixed results. This work was followed up with some success by Stolcke et al. (2000), where an "anti-LM", estimated from weighted N-best hypotheses of a baseline ASR system, was used with a negative weight in combination with the baseline LM. Chen et al. (2000) presented a method based on changing the trigram counts discriminatively, together with changing the lexicon to add new words. Kuo et al. (2002) used the generalized probabilistic descent algorithm to train relatively small language models which attempt to minimize string error rate on the DARPA Communicator task. Banerjee et al. (2003) used a language model modification algorithm in the context of a reading tutor that listens. Their algorithm first uses a classifier to predict what effect each parameter has on the error rate, and then modifies the parameters to reduce the error rate based on this prediction.</Paragraph>
    </Section>
  </Section>
</Paper>