<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1012">
  <Title>Reducing Weight Undertraining in Structured Discriminative Learning</Title>
  <Section position="2" start_page="0" end_page="89" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Discriminative methods for training probabilistic models have enjoyed wide popularity in natural language processing, such as in part-of-speech tagging (Toutanova et al., 2003), chunking (Sha and Pereira, 2003), named-entity recognition (Florian et al., 2003; Chieu and Ng, 2003), and most recently parsing (Taskar et al., 2004).</Paragraph>
    <Paragraph position="1"> A discriminative probabilistic model is trained to maximize the conditional probability p(y|x) of output labels y given input variables x, as opposed to modeling the joint probability p(y, x), as in generative models such as the Naive Bayes classifier and hidden Markov models.</Paragraph>
    <Paragraph position="2"> The popularity of discriminative models stems from the great flexibility they allow in defining features: because the distribution over input features p(x) is not modeled, it can contain rich, highly overlapping features without making the model intractable for training and inference.</Paragraph>
    <Paragraph position="3"> In NLP, for example, useful features include word bi-grams and trigrams, prefixes and suffixes, membership in domain-specific lexicons, and information from semantic databases such as WordNet. It is not uncommon to have hundreds of thousands or even millions of features.</Paragraph>
    <Paragraph position="4"> But not all features, even ones that are carefully engineered, improve performance. Adding more features to a model can hurt its accuracy on unseen testing data. One well-known reason for this is overfitting: a model with more features has more capacity to fit chance regularities in the training data. In this paper, however, we focus on another, more subtle effect: adding new features can cause existing ones to be underfit. Training of discriminative models, such as regularized logistic regression, involves complex trade-offs among weights. A few highly-indicative features can swamp the contribution of many individually weaker features, even if the weaker features, taken together, are just as indicative of the output. Such a model is less robust, for the few strong features may be noisy or missing in the test data.</Paragraph>
    <Paragraph position="5"> This effect was memorably observed by Dean Pomerleau (1995) when training neural networks to drive vehicles autonomously. Pomerleau reports one example when the system was learning to drive on a dirt road: The network had no problem learning and then driving autonomously in one direction, but when driving the other way, the network was erratic, swerving from one side of the road to the other. . . . It turned out that the network was basing most of its predictions on an easilyidentifiable ditch, which was always on the right in the training set, but was on the left when the vehicle turned around. (Pomerleau, 1995) The network had features to detect the sides of the road, and these features were active at training and test time, although weakly, because the dirt road was difficult to  detect. But the ditch was so highly indicative that the network did not learn the dependence between the road edge and the desired steering direction.</Paragraph>
    <Paragraph position="6"> A natural way of avoiding undertraining is to train separate models for groups of competing features--in the driving example, one model with the ditch features, and one with the side-of-the-road features--and then average them into a single model. This is same idea behind logarithmic opinion pools, used by Smith, Cohn, and Osborne (2005) to reduce overfitting in CRFs. In this paper, we tailor our ensemble to reduce undertraining rather than overfitting, and we introduce several new combination methods, based on whether the mixture is taken additively or geometrically, and on a per-sequence or per-transition basis. We call this general class of methods feature bagging, by analogy to the well-known bagging algorithm for ensemble learning.</Paragraph>
    <Paragraph position="7"> We test these methods on conditional random fields (CRFs) (Lafferty et al., 2001; Sutton and McCallum, 2006), which are discriminatively-trained undirected models. On two natural-language tasks, we show that feature bagging performs significantly better than training a single CRF with all available features.</Paragraph>
  </Section>
class="xml-element"></Paper>