Incremental Feature Selection and ℓ1 Regularization
for Relaxed Maximum-Entropy Modeling
Stefan Riezler and Alexander Vasserman
Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA 94304
Abstract
We present an approach to bounded constraint relaxation for entropy maximization that corresponds to using a double-exponential prior or ℓ1 regularizer in likelihood maximization for log-linear models. We show that a combined incremental feature selection and regularization method can be established for maximum entropy modeling by a natural incorporation of the regularizer into gradient-based feature selection, following Perkins et al. (2003). This provides an efficient alternative to standard ℓ1 regularization on the full feature set, and a mathematical justification for thresholding techniques used in likelihood-based feature selection. Also, we motivate an extension to n-best feature selection for linguistic feature sets with moderate redundancy, and present experimental results showing its advantage over ℓ0, 1-best ℓ1, and ℓ2 regularization and over standard incremental feature selection for the task of maximum-entropy parsing.¹
1 Introduction
The maximum-entropy (ME) principle, which pre-
scribes choosing the model that maximizes the en-
tropy out of all models that satisfy given feature
constraints, can be seen as a built-in regularization
mechanism that avoids overfitting the training data.
However, it is only a weak regularizer that cannot
avoid overfitting in situations where the number of
training examples is significantly smaller than the
number of features. In such situations some fea-
tures will occur zero times on the training set and
receive negative infinity weights, causing the as-
signment of zero probabilities for inputs including
those features. Similar assignment of (negative) infinity weights happens to features that are pseudo-minimal (or pseudo-maximal) on the training set (see Johnson et al. (1999)), that is, features whose value on correct parses is always less (or greater) than or equal to their value on all other parses. Also, if large feature sets are generated automatically from conjunctions of simple feature tests, many features will be redundant. Besides overfitting, large feature sets also create the problem of increased time and space complexity.

¹This research has been funded in part by contract MDA904-03-C-0404 of the Advanced Research and Development Activity, Novel Intelligence from Massive Data program.

Common techniques to deal with these problems
are regularization and feature selection. For ME
models, the use of an ℓ2 regularizer, corresponding
to imposing a Gaussian prior on the parameter val-
ues, has been proposed by Johnson et al. (1999) and
Chen and Rosenfeld (1999). Feature selection for
ME models has commonly used simple frequency-
based cut-off, or likelihood-based feature induction
as introduced by Della Pietra et al. (1997). Whereas
ℓ2 regularization produces excellent generalization
performance and effectively avoids numerical prob-
lems, parameter values almost never decrease to
zero, leaving the problem of inefficient computa-
tion with the full feature set. In contrast, feature se-
lection methods effectively decrease computational
complexity by selecting a fraction of the feature
set for computation; however, generalization per-
formance suffers from the ad-hoc character of hard
thresholds on feature counts or likelihood gains.
Tibshirani (1996) proposed a technique based on ℓ1 regularization that embeds feature selection into regularization such that both a precise assessment of the reliability of features and the decision about inclusion or deletion of features can be done in the same framework. Feature sparsity is produced by the polyhedral structure of the ℓ1 norm, which exhibits a gradient discontinuity at zero that tends to force a subset of parameter values to be exactly zero at the optimum. Since this discontinuity makes optimization a hard numerical problem, standard gradient-based techniques for estimation cannot be applied directly. Tibshirani (1996) presents a specialized optimization algorithm for ℓ1 regularization for linear least-squares regression called the Lasso algorithm. Goodman (2003) and Kazama and Tsujii (2003) employ standard iterative scaling and conjugate gradient techniques; for regularization, however, a simplified one-sided exponential prior is employed which is non-zero only for non-negative parameter values. In these approaches the full feature space is considered in estimation, so savings in computational complexity are gained only in applications of the resulting sparse models. Perkins et al. (2003) presented an approach that combines ℓ1-based regularization with incremental feature selection. Their basic idea is to start with a model in which almost all weights are zero, and iteratively decide, by comparing regularized feature gradients, which weight should be adjusted away from zero in order to decrease the regularized objective function by the maximum amount. The ℓ1 regularizer is thus used directly for incremental feature selection, which on the one hand makes feature selection fast, and on the other hand avoids numerical problems for zero-valued weights since only non-zero weights are included in the model. Besides the experimental evidence presented in these papers, a theoretical account of the superior sample complexity of ℓ1 over ℓ2 regularization has recently been presented by Ng (2004), showing sample complexity that grows logarithmically versus linearly in the number of irrelevant features for ℓ1 versus ℓ2 regularized logistic regression.
In this paper, we apply ℓ1 regularization to log-linear models, and motivate our approach in terms
of maximum entropy estimation subject to relaxed
constraints. We apply the gradient-based feature se-
lection technique of Perkins et al. (2003) to our
framework, and improve its computational com-
plexity by an n-best feature inclusion technique.
This extension is tailored to linguistically motivated
feature sets where the number of irrelevant features
is moderate. In experiments on real-world data from
maximum-entropy parsing, we show the advantage
of n-best ℓ1 regularization over ℓ2, ℓ1, and ℓ0 regularization and standard incremental feature selection in
terms of better computational complexity and im-
proved generalization performance.
2 ℓp Regularizers for Log-Linear Models
Let

$$p_\lambda(x|y) = \frac{e^{\sum_{i=1}^{n} \lambda_i f_i(x,y)}}{\sum_{x'} e^{\sum_{i=1}^{n} \lambda_i f_i(x',y)}}$$

be a conditional log-linear model defined by feature functions f and log-parameters λ. For data $\{(x_j, y_j)\}_{j=1}^{m}$, the objective function to be minimized in ℓp regularization of the negative log-likelihood L(λ) is

$$C(\lambda) = L(\lambda) + \Omega_p(\lambda) = -\frac{1}{m} \sum_{j=1}^{m} \ln p_\lambda(x_j|y_j) + \gamma \|\lambda\|_p^p$$

The regularizer family Ωp(λ) is defined by the Minkowski ℓp norm of the parameter vector λ raised to the pth power, i.e. $\|\lambda\|_p^p = \sum_{i=1}^{n} |\lambda_i|^p$. The essence of this regularizer family is to penalize overly large parameter values. If p = 2, the regularizer corresponds to a zero-mean Gaussian prior distribution on the parameters with γ corresponding to the inverse variance of the Gaussian. If p = 0, the regularizer is equivalent to setting a limit on the maximum number of non-zero weights. In our experiments we replace ℓ0 regularization by the related technique of frequency-based feature cutoff.
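A frequency-based cutoff of the kind mentioned above can be sketched in a few lines of Python. The exact counting rule is not spelled out in the text, so this sketch assumes that a feature is counted once for every training row (parse) in which it is non-zero, and that features are kept when this count reaches the threshold; the function name `frequency_cutoff` is hypothetical.

```python
from collections import Counter


def frequency_cutoff(feature_rows, threshold):
    """Return the indices of features active in at least `threshold` rows.

    Assumption: "frequency" means the number of rows (parses) in which
    the feature value is non-zero.
    """
    counts = Counter()
    for row in feature_rows:
        # Count each feature at most once per row.
        counts.update(i for i, value in enumerate(row) if value != 0)
    return sorted(i for i, c in counts.items() if c >= threshold)


# Hypothetical toy data: three rows, three features.
rows = [[1, 0, 2], [1, 0, 0], [1, 1, 0]]
kept = frequency_cutoff(rows, threshold=2)
```

Only the retained indices would then be passed on to estimation; all other weights stay fixed at zero.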
ℓ1 regularization is defined by the case where p = 1. Here parameters are penalized in the sum of their absolute values, which corresponds to applying a zero-mean Laplacian or double exponential prior distribution of the form

$$p(\lambda_i) = \frac{1}{2\tau} e^{-\frac{|\lambda_i|}{\tau}}$$

with γ = 1/τ being proportional to the inverse of the standard deviation √2 τ. In contrast to the Gaussian, the Laplacian prior puts more mass near zero (and in the tails), so tightening the prior by decreasing τ (and with it the standard deviation) provides stronger regularization against overfitting and produces more zero-valued parameter estimates. In terms of ℓ1-norm regularization, feature sparsity can be explained by the following observation: Since every non-zero parameter weight incurs a regularizer penalty of γ|λi|, its contribution to minimizing the negative log-likelihood has to outweigh this penalty. Thus parameter values where the gradient at λ = 0 is

$$\left| \frac{\partial L(\lambda)}{\partial \lambda_i} \right| \le \gamma \qquad (1)$$

can be kept zero without changing the optimality of the solution.
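To make the objective and condition (1) concrete, the following is a minimal pure-Python sketch, not the implementation used in the experiments. It assumes each training example is a pair of a candidate list (one feature vector per parse x of the input y) and the index of the correct parse; all function names and the toy data are hypothetical.

```python
import math


def log_probs(weights, candidates):
    """ln p_lambda(x|y) for every candidate x of one input y."""
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]


def neg_log_likelihood(weights, data):
    """L(lambda) = -(1/m) * sum_j ln p_lambda(x_j|y_j)."""
    total = sum(log_probs(weights, cands)[gold] for cands, gold in data)
    return -total / len(data)


def nll_gradient(weights, data):
    """dL/dlambda_i = (1/m) * sum_j (model expectation - observed value of f_i)."""
    grad = [0.0] * len(weights)
    for cands, gold in data:
        probs = [math.exp(lp) for lp in log_probs(weights, cands)]
        for i in range(len(weights)):
            expected = sum(p * feats[i] for p, feats in zip(probs, cands))
            grad[i] += (expected - cands[gold][i]) / len(data)
    return grad


def l1_objective(weights, data, gamma):
    """C(lambda) = L(lambda) + gamma * ||lambda||_1."""
    return neg_log_likelihood(weights, data) + gamma * sum(abs(w) for w in weights)


def can_stay_zero(weights, data, gamma):
    """Condition (1): True where |dL/dlambda_i| <= gamma."""
    return [abs(g) <= gamma for g in nll_gradient(weights, data)]


# Hypothetical toy data: two inputs, two candidate parses each, two features.
toy_data = [
    ([[1.0, 0.0], [0.0, 0.0]], 0),
    ([[1.0, 1.0], [0.0, 0.0]], 0),
]
zero = [0.0, 0.0]
mask = can_stay_zero(zero, toy_data, gamma=0.3)
```

On this toy data the gradient at λ = 0 is (−0.5, −0.25), so with γ = 0.3 only the second feature satisfies condition (1) and can be left at zero.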
3 Bounded Constraint Relaxation for
Maximum Entropy Estimation
As shown by Lebanon and Lafferty (2001), in terms of convex duality, a regularization term for the dual problem corresponds to a “potential” on the constraint values in the primal problem. For a dual problem of regularized likelihood estimation for log-linear models, the corresponding primal problem is a maximum entropy problem subject to relaxed constraints. Let H(p) denote the entropy with respect to probability function p, and $g: \mathbb{R}^n \to \mathbb{R}$ be a convex potential function, and $\tilde{p}[\cdot]$ and $p[\cdot]$ be expectations with respect to the empirical distribution $\tilde{p}(x,y) = \frac{1}{m} \sum_{j=1}^{m} \delta(x_j,x)\,\delta(y_j,y)$ and the model distribution $p(x|y)\,\tilde{p}(y)$. The primal problem can then be stated as

$$\text{Maximize } H(p) - g(c) \text{ subject to } p[f_i] - \tilde{p}[f_i] = c_i, \quad i = 1, \ldots, n$$

Constraint relaxation is achieved in that equality of the feature expectations is not enforced, but a certain amount of overshooting or undershooting is allowed by a parameter vector $c \in \mathbb{R}^n$ whose potential is determined by a convex function g(c) that is combined with the entropy term H(p).
In the case of ℓ2 regularization, the potential function for the primal problem is a quadratic penalty of the form $\frac{1}{2\gamma} \sum_i c_i^2$ for $\gamma = \frac{1}{\sigma_i^2}$, i = 1, ..., n (Lebanon and Lafferty, 2001). In order to recover the specific form of the primal problem for our case, we have to start from the given dual problem. Following Lebanon and Lafferty (2001), the dual function for regularized estimation can be expressed in terms of the dual function Λ(pλ, λ) for the unregularized case and the convex conjugate g∗(λ) of the potential function g(c). In our case the negative of Λ(pλ, λ) corresponds to the likelihood term L(λ), and the convex conjugate g∗(λ), which enters with negative sign, is the ℓ1 regularizer. Thus our dual problem can be stated as

$$\lambda^\star = \arg\max_\lambda \; \Lambda(p_\lambda, \lambda) - g^*(\lambda) = \arg\min_\lambda \; L(\lambda) + \gamma \|\lambda\|_1^1$$

Since for convex and closed functions, the conjugate of the conjugate is the original function, i.e. g∗∗ = g (Boyd and Vandenberghe, 2004), the potential function g(c) for the primal problem can be recovered by calculating the conjugate g∗∗ of the conjugate $g^*(\lambda) = \gamma \|\lambda\|_1^1$. In our case, we get

$$g^{**}(c) = g(c) = \begin{cases} 0 & \|c\|_\infty \le \gamma \\ \infty & \text{otherwise} \end{cases} \qquad (2)$$

where $\|c\|_\infty = \max\{|c_1|, \ldots, |c_n|\}$. A proof for this proposition is given in the Appendix. The resulting potential function g(c) is the indicator function on the interval [−γ, γ]. That is, it restricts the allowable amount of constraint relaxation to at most ±γ. From this perspective, increasing γ means allowing more slack in constraint satisfaction, which in turn allows fitting a more uniform, less overfitting distribution to the data. For features that are included in the model, the parameter values have to be adjusted away from zero to meet the constraints

$$|p[f_i] - \tilde{p}[f_i]| \le \gamma, \quad i = 1, \ldots, n \qquad (3)$$

Initialization: Initialize selected features S to ∅, and zero-weighted features Z to the full feature set, yielding the uniform distribution $p_{\lambda^{(0)}, S^{(0)}}$.
n-best grafting: For steps t = 1, ..., T,
  (1) for all features $f_i$ in $Z^{(t-1)}$, calculate whether $\left| \frac{\partial L(\lambda^{(t-1)}, S^{(t-1)})}{\partial \lambda_i} \right| > \gamma$,
  (2) $S^{(t)} := S^{(t-1)} \cup N^{(t)}$ and $Z^{(t)} := Z^{(t-1)} \setminus N^{(t)}$, where $N^{(t)}$ is the set of n-best features passing the test in (1),
  (3) perform conjugate gradient optimization to find the optimal model $p_{\lambda^\star, S^{(t)}}$, where λ is initialized at $\lambda^{(t-1)}$, and $\lambda^{(t)} := \lambda^\star = \arg\min_\lambda C(\lambda, S^{(t)})$.
Stopping condition: Stop if for all $f_i$ in $Z^{(t-1)}$: $\left| \frac{\partial L(\lambda^{(t-1)}, S^{(t-1)})}{\partial \lambda_i} \right| \le \gamma$
Figure 1: n-best gradient feature testing

For features that meet the constraints without pa-
rameter adjustment, parameter values can be kept at
zero, effectively discarding the features. Note that
equality of equations 3 and 1 connects the maxi-
mum entropy problem to likelihood regularization.
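
As a brief sanity check of equation (2), the conjugate of the ℓ1 regularizer can be worked out coordinate by coordinate. The following is only a sketch of the standard argument under the definitions above, not a reproduction of the proof referred to in the text.

```latex
\begin{align*}
g^{**}(c) &= \sup_{\lambda \in \mathbb{R}^n} \left( c^\top \lambda - \gamma \|\lambda\|_1 \right)
           = \sum_{i=1}^{n} \sup_{\lambda_i \in \mathbb{R}} \left( c_i \lambda_i - \gamma |\lambda_i| \right) \\
\text{If } |c_i| \le \gamma: \quad
  & c_i \lambda_i - \gamma |\lambda_i| \le (|c_i| - \gamma)\,|\lambda_i| \le 0,
    \text{ with equality at } \lambda_i = 0, \text{ so the supremum is } 0. \\
\text{If } |c_i| > \gamma: \quad
  & \text{for } \lambda_i = k\,\mathrm{sign}(c_i):\;
    c_i \lambda_i - \gamma |\lambda_i| = (|c_i| - \gamma)\,k \to \infty \text{ as } k \to \infty. \\
\text{Hence} \quad
g^{**}(c) &= \begin{cases} 0 & \|c\|_\infty \le \gamma \\ \infty & \text{otherwise.} \end{cases}
\end{align*}
```

The sum is finite exactly when every coordinate satisfies |ci| ≤ γ, which is the box constraint expressed by equation (3).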
4 Standardization
Note that the Ωp regularizer presented above penal-
izes the model parameters uniformly, correspond-
ing to imposing a uniform variance onto all model
parameters. This motivates a normalization of in-
put data to the same scale. A standard technique
to achieve this is to linearly rescale each feature
count to zero mean and standard deviation of one
over all training data. The same rescaling has to be
done for training and application of the model to un-
seen data. As we will see in the experimental evalua-
tion presented below, a standardization of input data
can also dramatically improve convergence behavior in unregularized optimization. Furthermore, parameter values estimated from standardized feature
counts are directly interpretable to humans. Com-
bined with feature selection, interpretable parame-
ter weights are particularly useful for error analysis
of the model’s feature design.
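
The rescaling described in this section can be sketched as follows. This is a minimal illustration, assuming feature counts are stored as one row per parse; the function names are hypothetical, and the treatment of constant features (leaving their scale at 1) is an assumption that the text does not address.

```python
import math


def fit_standardizer(rows):
    """Per-feature mean and standard deviation over all training rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(dims)]
    stds = []
    for i in range(dims):
        var = sum((r[i] - means[i]) ** 2 for r in rows) / n
        # Assumption: a constant feature keeps scale 1 to avoid division by zero.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return means, stds


def standardize(rows, means, stds):
    """Apply the training-set rescaling to any rows (training or unseen)."""
    return [[(v - mu) / sd for v, mu, sd in zip(r, means, stds)] for r in rows]


# Hypothetical toy counts: the second feature is constant.
train = [[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]]
means, stds = fit_standardizer(train)
train_std = standardize(train, means, stds)
unseen_std = standardize([[6.0, 2.0]], means, stds)  # reuse the training statistics
```

The statistics are fitted once on the training data and then reused unchanged for unseen data, which is the requirement stated above.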
5 Incremental n-best Feature Selection
The basic idea of the “grafting” (for “gradient feature testing”) algorithm presented by Perkins et al. (2003) is to assume a tendency of ℓ1 regularization to produce a large number of zero-valued parameters at the function’s optimum, thus to start with all-zero weights, and incrementally add features to the model only if adjusting their parameter weights away from zero sufficiently decreases the optimization criterion. This idea allows for efficient, incremental feature selection, and at the same time avoids numerical problems caused by the discontinuity of the gradient in ℓ1 regularization. Furthermore, the regularizer is incorporated directly into a criterion for feature selection, based on the observation made above: It only makes sense to add a feature to the model if the regularizer penalty is outweighed by the reduction in negative log-likelihood. Thus features considered for selection have to pass the following test:

$$\left| \frac{\partial L(\lambda)}{\partial \lambda_i} \right| > \gamma$$

In the grafting procedure suggested by Perkins et al. (2003), this gradient test is applied to each feature, and at each step the feature passing the test with maximum magnitude is added to the model. Adding one feature at a time effectively discards noisy and irrelevant features; however, the overhead introduced by grafting can outweigh the gain in efficiency if there is only a moderate number of noisy and truly redundant features. In such cases, it is beneficial to add a number of n > 1 features at each
step, where n is adjusted by cross-validation or on a
held-out data set. In the experiments on maximum-
entropy parsing presented below, a feature set of lin-
guistically motivated features is used that exhibits
only a moderate amount of redundancy. We will see
that for such cases, n-best feature selection consid-
erably improves computational complexity, and also
achieves slightly better generalization performance.
After adding n ≥ 1 features to the model in a grafting step, the model is optimized with respect to all parameters corresponding to currently included features. This optimization is done by calling a gradient-based general purpose optimization routine for the regularized objective function. We use a conjugate gradient routine for this purpose (Minka, 2001; Malouf, 2002).² The gradient of our criterion with respect to a parameter λi is:

$$\frac{\partial C(\lambda)}{\partial \lambda_i} = \frac{\partial L(\lambda)}{\partial \lambda_i} + \gamma\, \mathrm{sign}(\lambda_i) = -\frac{1}{m} \sum_{k=1}^{m} \frac{\partial \ln p_\lambda(x_k|y_k)}{\partial \lambda_i} + \gamma\, \mathrm{sign}(\lambda_i)$$

The sign of λi decides if γ is added or subtracted from the gradient for feature fi. For a feature that is newly added to the model and thus has weight λi = 0, we use the feature gradient test to determine the sign. If $\frac{\partial L(\lambda)}{\partial \lambda_i} > \gamma$, we know that $\frac{\partial C(\lambda)}{\partial \lambda_i} > 0$, thus we let sign(λi) = −1 in order to decrease C. Following the same rationale, if $\frac{\partial L(\lambda)}{\partial \lambda_i} < -\gamma$ we set sign(λi) = +1. An outline of an n-best grafting algorithm is given in Fig. 1.

²Note that despite gradient feature testing, the parameters for some features can be driven to zero in conjugate gradient optimization of the ℓ1-regularized objective function. Care has to be taken to catch those features and prune them explicitly to avoid numerical instability.
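
The n-best grafting loop of Fig. 1 can be sketched in pure Python as below. This is an illustration under stated assumptions rather than the system used in the experiments: the conjugate gradient step is replaced by plain proximal gradient descent (soft-thresholding) with a fixed step size, which handles the sign of newly added weights implicitly; features driven back to zero are not pruned; and all names and the toy data are hypothetical.

```python
import math


def model_probs(weights, candidates):
    """p_lambda(x|y) for every candidate parse x of one input y."""
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def nll_gradient(weights, data):
    """Gradient of L(lambda): mean of (expected - observed) feature values."""
    grad = [0.0] * len(weights)
    for cands, gold in data:
        probs = model_probs(weights, cands)
        for i in range(len(weights)):
            expected = sum(p * feats[i] for p, feats in zip(probs, cands))
            grad[i] += (expected - cands[gold][i]) / len(data)
    return grad


def soft_threshold(value, amount):
    """Shrink value towards zero by amount (proximal map of the l1 penalty)."""
    return math.copysign(max(abs(value) - amount, 0.0), value)


def optimize_selected(weights, selected, data, gamma, step=0.5, iters=500):
    """Stand-in for conjugate gradient: proximal gradient on selected weights only."""
    weights = list(weights)
    for _ in range(iters):
        grad = nll_gradient(weights, data)
        for i in selected:
            weights[i] = soft_threshold(weights[i] - step * grad[i], step * gamma)
    return weights


def n_best_grafting(data, num_features, gamma, n_best, max_steps=100):
    weights = [0.0] * num_features
    selected, zero = set(), set(range(num_features))
    for _ in range(max_steps):
        grad = nll_gradient(weights, data)
        # Step (1): gradient test on all zero-weighted features.
        passing = [i for i in sorted(zero) if abs(grad[i]) > gamma]
        if not passing:  # stopping condition of Fig. 1
            break
        # Step (2): move the n best passing features into the model.
        passing.sort(key=lambda i: abs(grad[i]), reverse=True)
        added = passing[:n_best]
        selected.update(added)
        zero.difference_update(added)
        # Step (3): re-optimize all currently selected weights.
        weights = optimize_selected(weights, selected, data, gamma)
    return weights, sorted(selected)


# Hypothetical toy data: feature 0 is informative, 1 is noise, 2 is never active.
toy_data = [
    ([[1, 0, 0], [0, 0, 0]], 0),
    ([[1, 1, 0], [0, 0, 0]], 0),
    ([[1, 0, 0], [0, 1, 0]], 0),
    ([[1, 0, 0], [0, 0, 0]], 0),
]
weights, selected = n_best_grafting(toy_data, num_features=3, gamma=0.1, n_best=1)
```

On this toy data only feature 0 passes the gradient test, and its weight settles where the remaining likelihood gradient equals γ; the other two features are never added to the model.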
6 Experiments
6.1 Train and Test Data
In the experiments presented in this paper, we evaluate ℓ2, ℓ1, and ℓ0 regularization on the task of stochastic parsing with maximum-entropy models. For our experiments, we used a stochastic parsing system for LFG that we trained on sections 02-21
of the UPenn Wall Street Journal treebank (Mar-
cus et al., 1993) by discriminative estimation of a
conditional maximum-entropy model from partially
labeled data (see Riezler et al. (2002)). For estimation and best-parse searching, efficient dynamic-programming techniques over feature forests are
employed (see Kaplan et al. (2004)). For the setup
of discriminative estimation from partially labeled
data, we found that a restriction of the training data
to sentences with a relatively low ambiguity rate
was possible at no loss in accuracy compared to
training from all sentences. Furthermore, data were
restricted to sentences of which a discriminative
learner can possibly take advantage, i.e. sentences
where the set of parses assigned to the labeled string
is a proper subset of the parses assigned to the un-
labeled string. Together with a restriction to exam-
ples that could be parsed by the full grammar and
did not have to use a backoff mechanism of frag-
ment parses, this resulted in a training set of 10,000
examples with at most 100 parses. Evaluation was
done on the PARC 700 dependency bank,³ which
is an LFG annotation of 700 examples randomly
extracted from section 23 of the UPenn WSJ tree-
bank. To tune regularization parameters, we split the
PARC 700 into a heldout and test set of equal size.
6.2 Feature Construction
Table 1 shows the 11 feature templates that were used in our experiments to create 60,109 features. On the roughly 300,000 parses for 10,000 sentences in our final training set, 10,986 features were active, resulting in a matrix of active features times parses that has 66 million non-zero entries. The scale of this experiment is comparable to experiments where much larger, but sparser feature sets are employed.⁴ The reason why the matrix of non-zeroes is less sparse in our case is that most of our feature templates are instantiated to linguistically motivated cases, and only a few feature templates encode all possible conjunctions of simple feature tests. Redundant features are introduced mostly by the latter templates, whereas the former features are generalizations over possible combinations of grammar constants. We conjecture that feature sets like this are typical for natural language applications.

³http://www2.parc.com/istl/groups/nltt/fsbank/

Table 1: Feature templates
name              parameters                        activation condition
Local Templates
cs label          label                             constituent label is present in parse
cs adj label      parent label, child label         constituent child label is child of constituent parent label
cs right branch                                     constituent has right child
cs conj nonpar    depth                             non-parallel conjuncts within depth levels
fs attrs          attrs                             f-structure attribute is one of attrs
fs attr value     attr, value                       attribute attr has value value
fs attr subsets   attr                              sum of cardinalities of subsets of attr
lex subcat        pred, args sets                   verb pred has one of args sets as arguments
Non-Local (Top-Down) Templates
cs embedded       label, size                       chain of size constituents labeled label embedded into one another
cs sub label      ancestor label, descendant label  constituent descendant label is descendant of ancestor label
fs aunt subattr   aunts, parents, descendants       one of descendants is descendant of one of parents which is a sister of one of aunts

Efficient feature detection is achieved by a combination of hashing and dynamic programming on the packed representation of c- and f-structures (Maxwell and Kaplan, 1993). Features can be described as local and non-local, depending on the size of the graph that has to be traversed in their computation. For each local template one of the parameters is selected as a key for hashing. Non-local features are treated as two (or more) local sub-features. Packed structures are traversed depth-first, visiting each node only once. Only the features keyed on the label of the current node are considered for matching. For each non-local feature, the contexts of matching sub-features are stored at the respective nodes, propagated upward in dynamic programming fashion, and conjoined with contexts of other sub-features of the feature. Fully matched features are associated with the corresponding contexts, resulting in a feature-annotated and/or-forest. This annotated and/or-forest is exploited for dynamic programming computation in estimation and best parse selection.

⁴For example, Malouf (2002) reports a matrix of non-zeroes that has 55 million entries for a shallow parsing experiment where 260,000 features were employed.
6.3 Experimental Results
Table 2 shows the results of an evaluation of five different systems on the test split of the PARC 700 dependency bank. The presented systems are unregularized maximum-likelihood estimation of a log-linear model including the full feature set (mle), standardized maximum-likelihood estimation as described in Sect. 4 (std), ℓ0 regularization using frequency-based cutoff, ℓ1 regularization using n-best grafting, and ℓ2 regularization using a Gaussian prior. All ℓp regularization runs use a standardization of the feature space. Special regularization parameters were adjusted on the heldout split, resulting in a cutoff threshold of 16, and penalization factors of 20 and 100 for ℓ1 and ℓ2 regularization respectively, with an optimal choice of 100 features to be added in each n-best grafting step. Performance of these systems is evaluated firstly with respect to F-score on matching dependency relations. Note that the F-score values on the PARC 700 dependency bank range between a lower bound of 68.0% for averaging over all parses and an upper bound of 83.6% for the parses producing the best possible matches. Furthermore, compression of the full feature set by feature selection, number of conjugate gradient iterations, and computation time (in hours:minutes of elapsed time) are reported.⁵

⁵All experiments were run on one CPU of a dual processor AMD Opteron 244 with 1.8GHz clock speed and 4GB of main memory.

Table 2: F-score, compression, number of iterations, and elapsed time for unregularized and standardized maximum-likelihood estimation, and ℓ0, ℓ1, and ℓ2 regularization on test split of PARC 700 dependency bank.
          mle      std     ℓ0      ℓ2     ℓ1
F-score   77.9     78.1    78.1    78.9   79.3
compr.    0        0       18.4    0      82.7
cg its.   761      371     372     34     226
time      129:12   66:41   60:47   6:19   5:25

Unregularized maximum-likelihood estimation using the full feature set exhibits severe overtraining problems, as the relation of F-score to the number of conjugate gradient iterations shows. Standardization of input data can alleviate this problem by improving convergence behavior to half the number of conjugate gradient iterations. ℓ0 regularization achieves its maximum on the heldout data for a threshold of 16, which results in an estimation run that is slightly faster than standardized estimation using all features, due to a compression of the full feature set by 18%. ℓ2 regularization benefits from a very tight prior (standard deviation of 0.1 corresponding to penalty 100) that was chosen on the heldout set. Despite the fact that no reduction of the full feature set is achieved, this estimation run increases the F-score to 78.9% and improves computation time by a factor of 20 compared to unregularized estimation using all features. ℓ1 regularization for n-best grafting, however, even improves upon this result by increasing the F-score to 79.3%, further decreasing computation time to 5:25 hours, at a compression of the full feature set of 83%.
[Figure 2 plot: x-axis "FeaturesAddedAtEachStep" (logarithmic, 1 to 10000); y-axes "F-score" (77.5 to 79.5) and "NumCGIterations" (logarithmic, 10 to 1000).]
Figure 2: n-best grafting with number n of features added at each step plotted against F-score on test set and conjugate gradient iterations.

As shown in Fig. 2, for feature selection from lin-
guistically motivated feature sets with only a mod-
erate amount of truly redundant features, it is crucial
to choose the right number n of features to be added
in each grafting step. The number of conjugate gra-
dient iterations decreases rapidly in the number of
features added at each step, whereas F-score evalu-
ated on the test set does not decrease (or increases
slightly) until more than 100 features are added in
each step. 100-best grafting thus reduces estimation
time by a factor of 10 at no loss in F-score compared
to 1-best grafting. Further increasing n results in a
significant drop in F-score, while smaller n is com-
putationally expensive, and also shows slight over-
training effects.
Table 3: F-score, compression, number of iterations, and elapsed time for gradient-based incremental feature selection without regularization, and with ℓ2 and ℓ1 regularization on test split of PARC 700 dependency bank.
          mle-ifs   ℓ2-ifs   ℓ1
F-score   78.8      79.1     79.3
compr.    88.1      81.7     82.7
cg its.   310       274      226
time      6:04      6:56     5:25

In another experiment we tried to assess the relative contribution of regularization and incremental feature selection to the ℓ1-grafting technique. Results of this experiment are shown in Table 3. In this experiment we applied incremental feature selection using the gradient test described above to unregularized maximum-likelihood estimation (mle-ifs) and ℓ2-regularized maximum-likelihood estimation (ℓ2-ifs). Threshold parameters γ are adjusted on the heldout set, in addition to and independent of regularization parameters such as the variance of the Gaussian prior. Results are compared to ℓ1-regularized grafting as presented above. For all runs, 100 features are added in each grafting step. The best result for the mle-ifs run is achieved at a threshold of 25, yielding an F-score of 78.8%. This shows that incremental feature selection is a powerful tool to avoid overfitting. A further improvement in F-score to 79.1% is achieved by combining incremental feature selection with the ℓ2 regularizer at a variance of 0.1 for the Gaussian prior and a threshold of 15. Both runs provide excellent compression rates and convergence times. However, they are still outperformed by the ℓ1 run that achieves a slight improvement in F-score to 79.3% and a slightly better runtime. Furthermore, by integrating regularization naturally into thresholding for feature selection, a separate thresholding parameter is avoided in ℓ1-based incremental feature selection.
A theoretical account of the savings in computational complexity that can be achieved by n-best grafting can be given as follows. Perkins et al. (2003) assess the computational complexity for standard gradient-based optimization with the full feature set by $\approx c\,m\,p^2\,\tau$, for a multiple c of p line minimizations for p derivatives over m data points, each of which has cost τ. In contrast, for grafting, the cost is assessed by adding up the costs for feature testing and optimization for s grafting steps as $\approx (m\,s\,p + \frac{1}{3} c\,m\,s^3)\,\tau$. For n-best grafting as proposed in this paper, the number of steps can be decomposed into s = n · t for n features added at each of t steps. This results in a cost of $\approx m\,t\,p$ for feature testing, and $\approx \frac{1}{3} c\,m\,n^2 t^3\,\tau$ for optimization. If we assume that t ≪ n ≪ s, this indicates considerable savings compared to both 1-best grafting and standard gradient-based optimization.
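
The three cost estimates can be compared numerically with a small sketch. The formulas are taken from the paragraph above, with the assumption that the per-point cost τ also multiplies the feature-testing term of n-best grafting; the values chosen for c, s, n, and t are purely illustrative and are not measurements from the experiments (m and p echo the training set size and feature count reported in Sect. 6).

```python
def full_cost(c, m, p, tau=1.0):
    """Standard gradient-based optimization on the full feature set: c*m*p^2*tau."""
    return c * m * p ** 2 * tau


def grafting_cost(c, m, p, s, tau=1.0):
    """1-best grafting with s steps: (m*s*p + (1/3)*c*m*s^3) * tau."""
    return (m * s * p + c * m * s ** 3 / 3.0) * tau


def n_best_grafting_cost(c, m, p, n, t, tau=1.0):
    """n-best grafting with t steps of n features each.

    Assumption: tau is applied to the feature-testing term m*t*p as well.
    """
    return (m * t * p + c * m * n ** 2 * t ** 3 / 3.0) * tau


# Illustrative values only: c, s, n, t are assumptions.
c, m, p = 1.0, 10_000, 60_109
n, t = 100, 10
s = n * t  # s = n * t selected features in total

costs = {
    "full": full_cost(c, m, p),
    "1-best grafting": grafting_cost(c, m, p, s),
    "n-best grafting": n_best_grafting_cost(c, m, p, n, t),
}
for name, cost in costs.items():
    print(f"{name:>16}: {cost:.3e}")
```

With these illustrative values both terms of the n-best estimate are smaller than the corresponding 1-best terms by the factor s/t = n, and the 1-best estimate is in turn well below the cost of optimizing over the full feature set.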
7 Discussion and Conclusion
A related approach to ℓ1 regularization and constraint relaxation for maximum-entropy modeling has been presented by Kazama and Tsujii (2003). In this approach, constraint relaxation is done by allowing two-sided inequality constraints

$$-B_i \le \tilde{p}[f_i] - p[f_i] \le A_i, \quad A_i, B_i > 0$$

in entropy maximization. The dual function is the regularized likelihood function

$$\frac{1}{m} \sum_{j=1}^{m} \ln p_{\alpha-\beta}(x_j|y_j) - \sum_{i=1}^{n} \alpha_i A_i - \sum_{i=1}^{n} \beta_i B_i$$

where the two parameter vectors α and β replace
our parameter vector λ, and αi,βi ≥ 0. This reg-
ularizer corresponds to a simplification of double-
sided exponentials to a one-sided exponential dis-
tribution which is non-zero only for non-negative
parameters. The use of one-sided exponential pri-
ors for log-linear models has also been proposed
by Goodman (2003), however, without a motiva-
tion in a maximum entropy framework. The fact that
Kazama and Tsujii (2003) allow for lower and up-
per bounds of different size requires the parameter
space to be doubled in their approach. Furthermore,
similar to Goodman (2003), the requirement to work
with a one-sided strictly positive exponential dis-
tribution makes it necessary to double the feature
space to account for (dis)preferences in terms of
strictly positive parameter values. These are consid-
erable computational and implementational disad-
vantages of these approaches. More importantly, an
integration of ℓ1 regularization into incremental feature selection was not considered.
Incremental feature selection was first proposed by Della Pietra et al. (1997) in a likelihood-based framework. In this approach, an approximate gain in likelihood for adding a feature to the model is used as feature selection criterion, and thresholds on this gain are used as stopping criterion. Maximization of approximate likelihood gains and gradient feature testing both are greedy approximations to the true gain in the objective function: grafting can be seen as applying one iteration of Newton’s method, where the weight of the newly added feature is initialized at 0, to calculate the approximate likelihood gain. Efficiency and accuracy of both approaches are comparable; however, the grafting framework provides a well-defined mathematical basis for feature selection and optimization by incorporating selection thresholds naturally as penalty factors of the regularizer. The idea of adding n-best features in each selection step has also been investigated earlier in the likelihood-based framework (see for example McCallum (2003)). However, the possible improvements in computational complexity and generalization performance due to n-best selection were not addressed explicitly. Further improvements of efficiency of grafting are possible by applying Zhou et al.’s (2003) technique of restricting feature selection in each step to the top-ranked features from previous stages.
In sum, we presented an application of ℓ1 regularization to likelihood maximization for log-linear models that has a simple interpretation as bounded constraint relaxation in terms of maximum entropy estimation. The presented n-best grafting method does not require specialized algorithms or simplifications of the prior, but allows for an efficient, mathematically well-defined combination of feature selection and regularization. In an experimental evaluation, we showed n-best grafting to outperform ℓ0, 1-best ℓ1, and ℓ2 regularization and standard incremental feature selection in terms of computational efficiency and generalization performance.

References
Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University, Pittsburgh, PA.
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393.
Joshua Goodman. 2003. Exponential priors for maximum entropy models. Unpublished Manuscript, Microsoft Research, Redmond, WA.
Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99), College Park, MD.
Ronald M. Kaplan, Stefan Riezler, Tracy H. King, John T. Maxwell III, and Alexander Vasserman. 2004. Speed and accuracy in shallow and deep stochastic parsing. In Proceedings of the Human Language Technology conference / North American chapter of the Association for Computational Linguistics annual meeting (HLT/NAACL’04), Boston, MA.
Jun’ichi Kazama and Jun’ichi Tsujii. 2003. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of EMNLP’03, Sapporo, Japan.
Guy Lebanon and John Lafferty. 2001. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems 14 (NIPS’01), Vancouver.
Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of Computational Natural Language Learning (CoNLL’02), Taipei, Taiwan.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.
John Maxwell and Ron Kaplan. 1993. The interface between phrasal and functional constraints. Computational Linguistics, 19(4):571–589.
Andrew McCallum. 2003. Efficiently inducing features of conditional random fields. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI’03), Acapulco, Mexico.
Thomas Minka. 2001. Algorithms for maximum-likelihood logistic regression. Department of Statistics, Carnegie Mellon University.
Andrew Y. Ng. 2004. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
Simon Perkins, Kevin Lacker, and James Theiler. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356.
Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell, and Mark Johnson. 2002. Parsing the Wall Street Journal using a Lexical-Functional Grammar and discriminative estimation techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, PA.
Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58(1):267–288.
Yaqian Zhou, Fuliang Weng, Lide Wu, and Hauke Schmidt. 2003. A fast algorithm for feature selection in conditional maximum entropy modeling. In Proceedings of EMNLP’03, Sapporo, Japan.
