<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3223">
  <Title>Incremental Feature Selection and ℓ1 Regularization for Relaxed Maximum-Entropy Modeling</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The maximum-entropy (ME) principle, which prescribes choosing the model that maximizes the entropy out of all models that satisfy given feature constraints, can be seen as a built-in regularization mechanism that avoids overfitting the training data.</Paragraph>
    <Paragraph position="1"> However, it is only a weak regularizer and cannot avoid overfitting in situations where the number of training examples is significantly smaller than the number of features. In such situations, some features occur zero times on the training set and receive negative-infinity weights, so that zero probability is assigned to any event that includes those features. A similar assignment of (negative) infinity weights happens to features that are pseudo-minimal (or pseudo-maximal) on the training set (see Johnson et al. (1999)), that is, features whose value on correct parses is always less than (or greater than) or equal to their value on all other parses. Moreover, if large feature sets are generated automatically from conjunctions of simple feature tests, many features will be redundant. Besides overfitting, large feature sets also create the problem of increased time and space complexity.</Paragraph>
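    <Paragraph> To make the negative-infinity case concrete, here is a brief sketch of the standard gradient argument, anticipating the notation introduced below (the symbol $LL(\lambda)$ for the conditional log-likelihood is ours): the partial derivative of $LL(\lambda) = \sum_{j=1}^m \log p_\lambda(y_j|x_j)$ with respect to a weight $\lambda_i$ is
$$\frac{\partial LL}{\partial \lambda_i} \;=\; \sum_{j=1}^m \Big( f_i(x_j,y_j) \;-\; \mathbb{E}_{p_\lambda(y|x_j)}\big[f_i(x_j,y)\big] \Big),$$
the empirical feature count minus its model expectation. If $f_i$ never fires on the correct training outputs but fires on some competing outputs, the first term is zero while the expectation stays positive, so the gradient is negative for every finite $\lambda_i$ and maximum-likelihood estimation drives $\lambda_i \to -\infty$, assigning probability zero to every output containing $f_i$.</Paragraph>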
    <Paragraph position="2"> Common techniques for dealing with these problems are regularization and feature selection. For ME models, the use of an ℓ2 regularizer, corresponding to imposing a Gaussian prior on the parameter values, has been proposed by Johnson et al. (1999) and Chen and Rosenfeld (1999). Feature selection for ME models has commonly used simple frequency-based cut-offs or likelihood-based feature induction as introduced by Della Pietra et al. (1997). Whereas ℓ2 regularization produces excellent generalization performance and effectively avoids numerical problems, parameter values almost never decrease to zero, leaving the problem of inefficient computation with the full feature set. In contrast, feature selection methods effectively decrease computational complexity by selecting a fraction of the feature set for computation; however, generalization performance suffers from the ad-hoc character of hard thresholds on feature counts or likelihood gains.</Paragraph>
    <Paragraph position="3"> Tibshirani (1996) proposed a technique based on ℓ1 regularization that embeds feature selection into regularization, so that both a precise assessment of the reliability of features and the decision about inclusion or deletion of features can be made in the same framework. Feature sparsity is produced by the polyhedral structure of the ℓ1 norm, which exhibits a gradient discontinuity at zero that tends to force a subset of parameter values to be exactly zero at the optimum. Since this discontinuity makes optimization a hard numerical problem, standard gradient-based estimation techniques cannot be applied directly. Tibshirani (1996) presents a specialized optimization algorithm for ℓ1-regularized linear least-squares regression called the Lasso algorithm. Goodman (2003) and Kazama and Tsujii (2003) employ standard iterative scaling and conjugate gradient techniques; however, for regularization they use a simplified one-sided exponential prior that is non-zero only for non-negative parameter values. In these approaches the full feature space is considered during estimation, so savings in computational complexity are gained only when the resulting sparse models are applied. Perkins et al. (2003) presented an approach that combines ℓ1-based regularization with incremental feature selection. Their basic idea is to start with a model in which almost all weights are zero, and to decide iteratively, by comparing regularized feature gradients, which weight should be adjusted away from zero in order to decrease the regularized objective function by the maximum amount. The ℓ1 regularizer is thus used directly for incremental feature selection, which on the one hand makes feature selection fast, and on the other hand avoids numerical problems with zero-valued weights, since only non-zero weights are included in the model. Besides the experimental evidence presented in these papers, a theoretical account of the superior sample complexity of ℓ1 over ℓ2 regularization has recently been given by Ng (2004), who shows that the number of training examples required grows only logarithmically in the number of irrelevant features for ℓ1-regularized logistic regression, versus linearly for ℓ2.</Paragraph>
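    <Paragraph> A minimal sketch of this gradient-based inclusion test, in the spirit of Perkins et al. (2003), is given below as illustrative Python; the function and variable names are ours, and grad is assumed to hold the gradient of the unregularized negative log-likelihood at the current weights.

import numpy as np

def select_next_feature(grad, weights, gamma):
    """Grafting-style test (sketch): among currently zero-valued weights,
    pick the one whose gradient magnitude most exceeds the l1 penalty gamma."""
    zero_idx = np.flatnonzero(weights == 0.0)
    # gain of moving weight i away from zero: |dL/dlambda_i| - gamma
    gains = np.abs(grad[zero_idx]) - gamma
    if not np.any(gains > 0.0):
        return None              # no zero-valued weight is worth adding
    return zero_idx[np.argmax(gains)]
    </Paragraph>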
    <Paragraph position="4"> In this paper, we apply ℓ1 regularization to log-linear models, and motivate our approach in terms of maximum-entropy estimation subject to relaxed constraints. We apply the gradient-based feature selection technique of Perkins et al. (2003) to our framework, and improve its computational complexity by an n-best feature inclusion technique.</Paragraph>
    <Paragraph position="5"> This extension is tailored to linguistically motivated feature sets in which the number of irrelevant features is moderate. In experiments on real-world data from maximum-entropy parsing, we show the advantage of n-best ℓ1 regularization over ℓ2, ℓ1, and ℓ0 regularization and over standard incremental feature selection, in terms of better computational complexity and improved generalization performance.</Paragraph>
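    <Paragraph> A sketch of the n-best extension mentioned above is shown below (illustrative Python only; the paper's actual procedure, stopping criteria, and choice of n may differ, and the default n=10 is a placeholder).

import numpy as np

def select_n_best_features(grad, weights, gamma, n=10):
    """n-best variant (sketch): instead of adding the single best feature
    per iteration, add up to n zero-valued weights whose gradient
    magnitude exceeds the l1 penalty gamma."""
    zero_idx = np.flatnonzero(weights == 0.0)
    gains = np.abs(grad[zero_idx]) - gamma
    keep = gains > 0.0
    candidates, cand_gains = zero_idx[keep], gains[keep]
    order = np.argsort(-cand_gains)           # rank by decreasing gain
    return candidates[order][:n]
    </Paragraph>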
    <Paragraph position="7"> Let $p_\lambda(y|x) = \exp(\lambda \cdot f(x,y)) / \sum_{y'} \exp(\lambda \cdot f(x,y'))$ be a conditional log-linear model defined by feature functions $f$ and log-parameters $\lambda$. For data $\{(x_j, y_j)\}_{j=1}^m$, the objective function to be minimized in ℓp regularization of the negative log-likelihood $L(\lambda) = -\sum_{j=1}^m \log p_\lambda(y_j|x_j)$ is $L(\lambda) + \gamma\, \Omega_p(\lambda)$.</Paragraph>
    <Paragraph position="9"> The regularizer family $\Omega_p(\lambda)$ is defined by the Minkowski ℓp norm of the parameter vector $\lambda$ raised to the $p$th power, i.e. $\|\lambda\|_p^p = \sum_{i=1}^n |\lambda_i|^p$. The essence of this regularizer family is to penalize overly large parameter values. If $p = 2$, the regularizer corresponds to a zero-mean Gaussian prior distribution on the parameters, with γ corresponding to the inverse variance of the Gaussian. If $p = 0$, the regularizer is equivalent to setting a limit on the maximum number of non-zero weights. In our experiments we replace ℓ0 regularization by the related technique of frequency-based feature cutoff.</Paragraph>
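    <Paragraph> The regularized objective and the special cases p = 0, 1, 2 can be summarized in a compact sketch (illustrative Python; the model is written as a generic conditional softmax over candidate outputs, e.g. candidate parses, and the array shapes and function name are assumptions made purely for illustration).

import numpy as np

def regularized_objective(weights, feats, gold, gamma, p=1):
    """L(lambda) + gamma * ||lambda||_p^p for a conditional log-linear model.
    feats: array of shape (examples, candidates, features);
    gold:  index of the correct candidate for each example."""
    scores = feats @ weights                                  # (examples, candidates)
    log_z = np.logaddexp.reduce(scores, axis=1)               # log normalizers
    nll = np.sum(log_z - scores[np.arange(len(gold)), gold])  # negative log-likelihood L
    if p == 0:
        penalty = np.count_nonzero(weights)     # l0: number of non-zero weights
    else:
        penalty = np.sum(np.abs(weights) ** p)  # l1 / l2: sum of |lambda_i|^p
    return nll + gamma * penalty
    </Paragraph>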
    <Paragraph position="10"> ℓ1 regularization is defined by the case $p = 1$. Here parameters are penalized by the sum of their absolute values, which corresponds to applying a zero-mean Laplacian or double-exponential prior distribution of the form $p(\lambda_i) = \frac{1}{2\tau}\, e^{-|\lambda_i|/\tau}$, with $\gamma = 1/\tau$ being proportional to the inverse of the standard deviation $\sqrt{2}\,\tau$. In contrast to the Gaussian, the Laplacian prior puts more mass near zero (and in the tails); thus tightening the prior by decreasing the standard deviation, i.e. decreasing τ, provides stronger regularization against overfitting and produces more zero-valued parameter estimates. In terms of ℓ1-norm regularization, feature sparsity can be explained by the following observation: since every non-zero parameter weight incurs a regularizer penalty of $\gamma |\lambda_i|$, its contribution to minimizing the negative log-likelihood has to outweigh this penalty. Thus parameter values for which the gradient at $\lambda_i = 0$ satisfies $\big|\frac{\partial L}{\partial \lambda_i}(0)\big| \le \gamma$ can be kept zero without changing the optimality of the solution.</Paragraph>
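    <Paragraph> This condition follows from a short subgradient argument (a sketch using standard convex-analysis facts, stated in our notation): at $\lambda_i = 0$ the subdifferential of the penalized objective in coordinate $i$ is
$$\partial_{\lambda_i}\big(L(\lambda) + \gamma\|\lambda\|_1\big)\Big|_{\lambda_i=0} \;=\; \frac{\partial L}{\partial \lambda_i}(0) \;+\; \gamma\,[-1,1],$$
which contains 0, and hence admits $\lambda_i = 0$ as an optimal choice for that coordinate, exactly when $\big|\frac{\partial L}{\partial \lambda_i}(0)\big| \le \gamma$.</Paragraph>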
  </Section>
</Paper>