Evaluation and Extension of Maximum Entropy Models
with Inequality Constraints
Jun’ichi Kazama†
kazama@is.s.u-tokyo.ac.jp
†Department of Computer Science
University of Tokyo
Hongo 7-3-1, Bunkyo-ku,
Tokyo 113-0033, Japan
Jun’ichi Tsujii†‡
tsujii@is.s.u-tokyo.ac.jp
‡CREST, JST
(Japan Science and Technology Corporation)
Honcho 4-1-8, Kawaguchi-shi,
Saitama 332-0012, Japan
Abstract
A maximum entropy (ME) model is usu-
ally estimated so that it conforms to equal-
ity constraints on feature expectations.
However, the equality constraint is inap-
propriate for sparse and therefore unre-
liable features. This study explores an
ME model with box-type inequality con-
straints, where the equality can be vio-
lated to reflect this unreliability. We eval-
uate the inequality ME model using text
categorization datasets. We also propose
an extension of the inequality ME model,
which results in a natural integration with
the Gaussian MAP estimation. Experi-
mental results demonstrate the advantage
of the inequality models and the proposed
extension.
1 Introduction
The maximum entropy model (Berger et al., 1996;
Pietra et al., 1997) has attained great popularity in
the NLP field due to its power, robustness, and suc-
cessful performance in various NLP tasks (Ratna-
parkhi, 1996; Nigam et al., 1999; Borthwick, 1999).
In the ME estimation, an event is decomposed
into features, which indicate the strength of certain
aspects in the event, and the most uniform model
among the models that satisfy:
E
˜p
[f
i
]=E
p
[f
i
], (1)
for each feature. E
˜p
[f
i
] represents the expectation
of feature f
i
in the training data (empirical expec-
tation), and E
p
[f
i
] is the expectation with respect
to the model being estimated. A powerful and ro-
bust estimation is possible since the features can be
as specific or general as required and does not need
to be independent of each other, and since the most
uniform model avoids overfitting the training data.
In spite of these advantages, the ME model still
suffers from a lack of data as long as it imposes the
equality constraint (1), since the empirical expecta-
tion calculated from the training data of limited size
is inevitably unreliable. A careful treatment is re-
quired especially in NLP applications since the fea-
tures are usually very sparse. In this study, text cat-
egorization is used as an example of such tasks with
sparse features.
Previous work on NLP proposed several solutions
for this unreliability such as the cut-off, which sim-
ply omits rare features, the MAP estimation with
the Gaussian prior (Chen and Rosenfeld, 2000), the
fuzzy maximum entropy model (Lau, 1994), and fat
constraints (Khudanpur, 1995; Newman, 1977).
Currently, the Gaussian MAP estimation (com-
bined with the cut-off) seems to be the most promis-
ing method from the empirical results. It succeeded
in language modeling (Chen and Rosenfeld, 2000)
and text categorization (Nigam et al., 1999). As
described later, it relaxes constraints like E
˜p
[f
i
] −
E
p
[f
i
]=
λ
i
σ
2
, where λ
i
is the model’s parameter.
This study follows this line, but explores the fol-
lowing box-type inequality constraints:
A
i
≥ E
˜p
[f
i
] − E
p
[f
i
] ≥−B
i
,A
i
,B
i
> 0. (2)
Here, the equality can be violated by the widths A
i
and B
i
. We refer to the ME model with the above
inequality constraints as the inequality ME model.
This inequality constraint falls into a type of fat con-
straints, a
i
≤ E
p
[f
i
] ≤ b
i
, as suggested by (Khudan-
pur, 1995). However, as noted in (Chen and Rosen-
feld, 2000), this type of constraint has not yet been
applied nor evaluated for NLPs.
The inequality ME model differs from the Gaus-
sian MAP estimation in that its solution becomes
sparse (i.e., many parameters become zero) as a re-
sult of optimization with inequality constraints. The
features with a zero parameter can be removed from
the model without changing its prediction behavior.
Therefore, we can consider that the inequality ME
model embeds feature selection in its estimation.
Recently, the sparseness of the solution has been rec-
ognized as an important concept in constructing ro-
bust classifiers such as SVMs (Vapnik, 1995). We
believe that the sparse solution improves the robust-
ness of the ME model as well.
We also extend the inequality ME model so that
the constraint widths can move using slack vari-
ables. If we penalize the slack variables by their 2-
norm, we obtain a natural integration of the inequal-
ity ME model and the Gaussian MAP estimation.
While it incorporates the quadratic stabilization of
the parameters as in the Gaussian MAP estimation,
the sparseness of the solution is preserved.
We evaluate the inequality ME models empiri-
cally, using two text categorization datasets. The
results show that the inequality ME models outper-
form the cut-off and the Gaussian MAP estimation.
Such high accuracies are achieved with a fairly small
number of active features, indicating that the sparse
solution can effectively enhance the performance. In
addition, the 2-norm extended model is shown to be
more robust in several situations.
2 The Maximum Entropy Model
The ME estimation of a conditional model p(y|x)
from the training examples {(x
i
,y
i
)} is formulated
as the following optimization problem.
1
maximize
p
H(p)=
summationdisplay
x
˜p(x)
summationdisplay
y
p(y|x)logp(y|x)
subject to E
˜p
[f
i
]− E
p
[f
i
]=0 1≤ i ≤ F. (3)
1
To be precise, we have also the constraints
P
y
p(y|x) −
1=0 x ∈X. Note that although we explain using a condi-
tional model throughout the paper, the discussion can be applied
easily to a joint model by considering the condition x is fixed.
The empirical expectations and model expectations
in the equality constraints are defined as follows.
E
˜p
[f
i
]=
summationtext
x
˜p(x)
summationtext
y
˜p(y|x)f
i
(x,y), (4)
E
p
[f
i
]=
summationtext
x
˜p(x)
summationtext
y
p(y|x)f
i
(x,y), (5)
˜p(x)=c(x)/L, ˜p(y|x)=c(x,y)/c(x), (6)
where c(·) indicates the number of times · occurred
in the training data, and L is the number of training
examples.
By the Lagrange method, p(y|x) is found to have
the following parametric form:
p
λ
(y|x)=
1
Z(x)
exp(
summationdisplay
i
λ
i
f
i
(x,y)), (7)
where Z(x)=
summationtext
y
exp(
summationtext
i
λ
i
f
i
(x,y)). The dual
objective function becomes:
L(λ)=
summationtext
x
˜p(x)
summationtext
y
˜p(y|x)
summationtext
i
λ
i
f
i
(x,y) (8)
−
summationtext
x
˜p(x)log
summationtext
y
exp(
summationtext
i
λ
i
f
i
(x,y)).
The ME estimation becomes the maximization of
L(λ). And it is equivalent to the maximization of the
log-likelihood: LL(λ)=log
producttext
x,y
p
λ
(y|x)
˜p(x,y)
.
This optimization can be solved using algo-
rithms such as the GIS algorithm (Darroch and Rat-
cliff, 1972) and the IIS algorithm (Pietra et al.,
1997). In addition, gradient-based algorithms can
be applied since the objective function is concave.
Malouf (2002) compares several algorithms for the
ME estimation including GIS, IIS, and the limited-
memory variable metric (LMVM) method, which is
a gradient-based method, and shows that the LMVM
method requires much less time to converge for real
NLP datasets. We also observed that the LMVM
method converges very quickly for the text catego-
rization datasets with an improvement in accuracy.
Therefore, we use the LMVM method (and its vari-
ant for the inequality models) throughout the exper-
iments. Thus, we only show the gradient when men-
tioning the training. The gradient of the objective
function (8) is computed as:
∂L(λ)
∂λ
i
= E
˜p
[f
i
]− E
p
[f
i
]. (9)
3 The Inequality ME Model
The maximum entropy model with the box-type in-
equality constraints (2) can be formulated as the fol-
lowing optimization problem:
maximize
p
summationdisplay
x
˜p(x)
summationdisplay
y
p(y|x)logp(y|x),
subject to E
˜p
[f
i
]− E
p
[f
i
]− A
i
≤ 0, (10)
E
p
[f
i
]− E
˜p
[f
i
]− B
i
≤ 0. (11)
By using the Lagrange method for optimization
problems with inequality constraints, the following
parametric form is derived.
p
α,β
(y|x)=
1
Z(x)
exp(
summationdisplay
i
(α
i
− β
i
)f
i
(x,y)),
α
i
≥ 0,β
i
≥ 0, (12)
where parameters α
i
and β
i
are the Lagrange mul-
tipliers corresponding to constraints (10) and (11).
The Karush-Kuhn-Tucker conditions state that, at
the optimal point,
α
i
(E
˜p
[f
i
]− E
p
[f
i
]− A
i
)=0,
β
i
(E
p
[f
i
]− E
˜p
[f
i
]− B
i
)=0.
These conditions mean that the equality constraint is
maximally violated when the parameter is non-zero,
and if the violation is strictly within the widths, the
parameter becomes zero. We call a feature upper
active when α
i
> 0, and lower active when β
i
> 0.
When α
i
−β
i
negationslash=0, we call that feature active.
2
Inac-
tive features can be removed from the model without
changing its behavior. Since A
i
>0 and B
i
>0,any
feature should not be upper active and lower active
at the same time.
3
The inequality constraints together with the con-
straints
summationtext
y
p(y|x)−1=0define the feasible re-
gion in the original probability space, on which the
entropy varies and can be maximized. The larger
the widths, the more the feasible region is enlarged.
Therefore, it can be implied that the possibility of a
feature becoming inactive (the global maximal point
is strictly within the feasible region with respect
to that feature’s constraints) increases if the corre-
sponding widths become large.
2
The term ’active’ may be confusing since in the ME re-
search, a feature is called active when f
i
(x,y) > 0 for an
event. However, we follow the terminology in the constrained
optimization.
3
This is only achieved with some tolerance in practice.
The solution for the inequality ME model would
become sparse if the optimization determines many
features as inactive with given widths. The relation
between the widths and the sparseness of the solu-
tion is shown in the experiment.
The dual objective function becomes:
L(α,β)=
summationtext
x
˜p(x)
summationtext
y
˜p(y|x)
summationtext
i
(α
i
− β
i
)f
i
(x,y)
−
summationtext
x
˜p(x)log
summationtext
y
exp(
summationtext
i
(α
i
− β
i
)f
i
(x,y))
−
summationtext
i
α
i
A
i
−
summationtext
i
β
i
B
i
. (13)
Thus, the estimation is formulated as:
maximize
α
i
≥0,β
i
≥0
L(α,β).
Unlike the optimization in the standard maximum
entropy estimation, we now have bound constraints
on parameters which state that parameters must be
non-negative. In addition, maximizing L(α,β) is no
longer equivalent to maximizing the log-likelihood
LL(α,β). Instead, we maximize:
LL(α,β) −
summationtext
i
α
i
A
i
−
summationtext
i
β
i
B
i
. (14)
Although we can use many optimization algorithms
to solve this dual problem since the objective func-
tion is still concave, a method that supports bounded
parameters must be used. In this study, we use the
BLMVM algorithm (Benson and Mor´e, ), a variant
of the limited-memory variable metric (LMVM) al-
gorithm, which supports bound constraints.
4
The gradient of the objective function is:
∂L(α,β)
∂α
i
= E
˜p
[f
i
] − E
p
[f
i
] − A
i
,
∂L(α,β)
∂β
i
= E
p
[f
i
] − E
˜p
[f
i
] − B
i
. (15)
4 Soft Width Extension
In this section, we present an extension of the in-
equality ME model, which we call soft width. The
soft width allows the widths to move as A
i
+ δ
i
and −B
i
− γ
i
using slack variables, but with some
penalties in the objective function. This soft width
extension is analogous to the soft margin extension
of the SVMs, and in fact, the mathematical discus-
sion is similar. If we penalize the slack variables
4
Although we consider only the gradient-based method here
as noted earlier, an extension of GIS or IIS to support bounded
parameters would also be possible.
by their 2-norm, we obtain a natural combination of
the inequality ME model and the Gaussian MAP es-
timation. We refer to this extension using 2-norm
penalty as the 2-norm inequality ME model. As the
Gaussian MAP estimation has been shown to be suc-
cessful in several tasks, it should be interesting em-
pirically, as well as theoretically, to incorporate the
Gaussian MAP estimation into the inequality model.
We first review the Gaussian MAP estimation in the
following, and then we describe our extension.
4.1 The Gaussian MAP estimation
In the Gaussian MAP ME estimation (Chen and
Rosenfeld, 2000), the objective function is:
LL(λ) −
summationtext
i
(
1
2σ
2
i
)λ
2
i
, (16)
which is derived as a consequence of maximizing
the log-likelihood of the posterior probability, using
a Gaussian distribution centered around zero with
the variance σ
2
i
as a prior on parameters. The gra-
dient becomes:
∂L(λ)
∂λ
i
= E
˜p
[f
i
]− E
p
[f
i
]−
λ
i
σ
2
i
. (17)
At the optimal point, E
˜p
[f
i
] − E
p
[f
i
] −
λ
i
σ
2
i
=0.
Therefore, the Gaussian MAP estimation can also be
considered as relaxing the equality constraints. The
significant difference between the inequality ME
model and the Gaussian MAP estimation is that the
parameters are stabilized quadratically in the Gaus-
sian MAP estimation (16), while they are stabilized
linearly in the inequality ME model (14).
4.2 2-norm penalty extension
Our 2-norm extension to the inequality ME model is
as follows.
5
maximize
p,δ,γ
H(p)− C
1
summationtext
i
δ
i
2
− C
2
summationtext
i
γ
2
i
,
subject to E
˜p
[f
i
] − E
p
[f
i
] − A
i
≤ δ
i
, (18)
E
p
[f
i
] − E
˜p
[f
i
] − B
i
≤ γ
i
, (19)
5
It is also possible to impose 1-norm penalties in the objec-
tive function. It yields an optimization problem which is iden-
tical to the inequality ME model except that the parameters are
upper-bounded as 0 ≤ α
i
≤ C
1
and 0 ≤ β
i
≤ C
2
. We will not
investigate this 1-norm extension in this paper and leave it for
future research.
where C
1
and C
2
is the penalty constants. The para-
metric form is identical to the inequality ME model
(12). However, the dual objective function becomes:
LL(α,β) −
summationtext
i
parenleftBig
α
i
A
i
+
α
2
i
4C
1
parenrightBig
−
summationtext
i
parenleftBig
β
i
B
i
+
β
2
i
4C
2
parenrightBig
.
Accordingly, the gradient becomes:
∂L(α,β)
∂α
i
= E
˜p
[f
i
] − E
p
[f
i
] −
parenleftBig
A
i
+
α
i
2C
1
parenrightBig
,
∂L(α,β)
∂β
i
= E
p
[f
i
]− E
˜p
[f
i
]−
parenleftBig
B
i
+
β
i
2C
2
parenrightBig
. (20)
It can be seen that this model is a natural combina-
tion of the inequality ME model and the Gaussian
MAP estimation. It is important to note that the so-
lution sparseness is preserved in the above model.
5 Calculation of the Constraint Width
The widths, A
i
and B
i
, in the inequality constraints
are desirably widened according to the unreliability
of the feature (i.e., the unreliability of the calculated
empirical expectation). In this paper, we examine
two methods to determine the widths.
The first is to use a common width for all features
fixed by the following formula.
A
i
= B
i
= W ×
1
L
, (21)
where W is a constant, width factor, to control the
widths. This method can only capture the global re-
liability of all the features. That is, only the reli-
ability of the training examples as a whole can be
captured. We call this method single.
The second, which we call bayes, is a method that
determines the widths based on the Bayesian frame-
work to differentiate between the features depending
on their reliabilities.
For many NLP applications including text catego-
rization, we use the following type of features.
f
j,i
(x,y)=h
i
(x) if y = y
j
, 0 otherwise. (22)
In this case, if we assume the approximation,
˜p(y|x) ≈ ˜p(y|h
i
(x) > 0), the empirical expectation
can be interpreted as follows.
6
E
˜p
[f
j,i
]=
summationdisplay
x: h
i
(x)>0
˜p(x)˜p(y = y
j
|h
i
(x)>0)h
i
(x).
6
This is only for estimating the unreliability, and is not used
to calculate the actual empirical expectations in the constraints.
Here, a source of unreliability is ˜p(y|h
i
(x)>0).We
consider ˜p(y|h
i
(x) > 0) as the parameter θ of the
Bernoulli trials. That is, p(y|h
i
(x) > 0) = θ and
p(¯y|h
i
(x)>0) = 1 − θ. Then, we estimate the pos-
terior distribution of θ from the training examples
by Bayesian estimation and utilize the variance of
the distribution. With the uniform distribution as the
prior, k times out of n trials give the posterior distri-
bution: p(θ)=Be(1+k,1+n−k), where Be(α,β)
is the beta distribution. The variance is calculated as
follows.
V [θ]=
(1+k)(1+n−k)
(2+n)
2
(n+3)
. (23)
Letting k = c(f
j,i
(x,y)>0) and n = c(h
i
(x)>0),
we obtain fine-grained variances narrowed accord-
ing to c(h
i
(x) > 0) instead of a single value, which
just captures the global reliability. Assuming the in-
dependence of training examples, the variance of the
empirical expectation becomes:
V
bracketleftbig
E
˜p
[f
j,i
]
bracketrightbig
=
bracketleftBig
summationtext
x: h
i
(x)>0
{˜p(x)h
i
(x)}
2
bracketrightBig
V [θ
j,i
].
Then, we calculate the widths as follows:
A
i
= B
i
= W ×
radicalBig
V
bracketleftbig
E
˜p
[f
j,i
]
bracketrightbig
. (24)
6 Experiments
For the evaluation, we use the “Reuters-21578, Dis-
tribution 1.0” dataset and the “OHSUMED” dataset.
The Reuters dataset developed by David D. Lewis
is a collection of labeled newswire articles.
7
We
adopted “ModApte” split to split the collection,
and we obtained 7,048 documents for training, and
2,991 documents for testing. We used 112 “TOP-
ICS” that actually occurred in the training set as the
target categories.
The OHSUMED dataset (Hersh et al., 1994) is a
collection of clinical paper abstracts from the MED-
LINE database. Each abstract is manually assigned
MeSH terms. We simplified a MeSH term, like
“A/B/C mapsto→ A”, and used the most frequent 100
simplified terms as the target categories. We ex-
tracted 9,947 abstracts for training, and 9,948 ab-
stracts for testing from the file “ohsumed.91.”
A documents is converted to a bag-of-words vec-
tor representation with TFIDF values, after the stop
7
Available from http://www.daviddlewis.com/resources/
words are removed and all the words are downcased.
Since the text categorization task requires that mul-
tiple categories are assigned if appropriate, we con-
structed a binary categorizer, p
c
(y ∈{+1,−1}|d),
for each category c. If the probability p
c
(+1|d) is
greater than 0.5, the category is assigned. To con-
struct a conditional maximum entropy model, we
used the feature function of the form (22), where
h
i
(d) returns the TFIDF value of the i-th word of
the document vector.
We implemented the estimation algorithms as an
extension of an ME estimation tool, Amis,
8
using
the Toolkit for Advanced Optimization (TAO) (Ben-
son et al., 2002), which provides the LMVM and the
BLMVM optimization modules. For the inequal-
ity ME estimation, we added a hook that checks the
KKT conditions after the normal convergence test.
9
We compared the following models:
• ME models only with cut-off (cut-off ),
• ME models with cut-off and the Gaussian MAP
estimation (gaussian),
• Inequality ME models (ineq),
• Inequality ME models with 2-norm extension
described in Section 4 (2-norm),
10
For the inequality ME models, we compared the two
methods to determine the widths, single and bayes,
as described in Section 5. Although the Gaussian
MAP estimation can use different σ
i
for each fea-
ture, we used a common variance σ for gaussian.
Thus, gaussian roughly corresponds to single in the
way of dealing with the unreliability of features.
Note that, for inequality models, we started with
all possible features and rely on their ability to re-
move unnecessary features automatically by solu-
tion sparseness. The average maximum number of
features in a categorizer is 63,150.0 for the Reuters
dataset and 116,452.0 for the OHSUMED dataset.
8
Developed by Yusuke Miyao so as to support various
ME estimations such as the efficient estimation with compli-
cated event structures (Miyao and Tsujii, 2002). Available at
http://www-tsujii.is.s.u-tokyo.ac.jp/
˜yusuke/amis
9
The tolerance for the normal convergence test (relative im-
provement) and the KKT check is 10
−4
. We stop the training if
the KKT check has been failed many times and the ratio of the
bad (upper and lower active) features among the active features
is lower than 0.01.
10
Here, we fix the penalty constants C
1
= C
2
=10
16
.
 0.8
 0.805
 0.81
 0.815
 0.82
 0.825
 0.83
 0.835
 0.84
 0.845
 0.85
 1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 0.01 1
Accuracy (F-score)
Width Factor
A
B
C
D
A: ineq + single
B: 2-norm + single
C: ineq + bayes
D: 2-norm + bayes
cut-off best
gaussian best
(a) Reuters
 0.54
 0.55
 0.56
 0.57
 0.58
 0.59
 0.6
 0.61
 0.62
 1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 0.01 1 100
Accuracy (F-score)
Width Factor
A
B
C
D
A: ineq + single
B: 2-norm + single
C: ineq + bayes
D: 2-norm + bayes
cut-off best
gaussian best
(b) OHSUMED
Figure 1: Accuracies as a function of the width factor W for the development sets.
 0
 10000
 20000
 30000
 40000
 50000
 60000
 70000
 1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 0.01 1
# of Active Features
Width Factor
A
B
C
D
A: ineq + single
B: 2-norm + single
C: ineq + bayes
D: 2-norm + bayes
(a) Reuters
 0
 20000
 40000
 60000
 80000
 100000
 120000
 1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 0.01 1 100
# of Active Features
Width Factor
A
B
C
D
A: ineq + single
B: 2-norm + single
C: ineq + bayes
D: 2-norm + bayes
(b) OHSUMED
Figure 2: The average number of active features as a function of width factor W.
6.1 Results
We first found the best values for the control param-
eters of each model, W, σ, and the cut-off threshold,
by using the development set. We show that the in-
equality models outperform the other methods in the
development set. We then show that these values are
valid for the evaluation set. We used the first half of
the test set as the development set, and the second
half as the evaluation set.
Figure 1 shows the accuracies of the inequality
ME models for various width factors. The accura-
cies are presented by the “micro averaged” F-score.
The horizontal lines show the highest accuracies of
cut-off and gaussian models found by exhaustive
search. For cut-off, we varied the cut-off thresh-
old and found the best threshold. For gaussian,we
varied σ with each cut-off threshold, and found the
best σ and cut-off combination. We can see that
the inequality models outperform the cut-off method
and the Gaussian MAP estimation with an appro-
priate value for W in both datasets. Although the
OHSUMED dataset seems harder than the Reuters
dataset, the improvement in the OHSUMED dataset
is greater than that in the Reuters dataset. This may
be because the OHSUMED dataset is more sparse
than the Reuters dataset. The 2-norm extension
boosts the accuracies, especially for bayes, at the
moderate Ws (i.e., with the moderate numbers of
active features). However, we can not observe the
apparent advantage of the 2-norm extension in terms
of the highest accuracy here.
Figure 2 shows the average number of active fea-
tures of each inequality ME model for various width
factors. We can see that active features increase
 0.79
 0.8
 0.81
 0.82
 0.83
 0.84
 0.85
 100  1000  10000
Accuracy (F-score)
# of Active Features
B
D
F
E
B: 2-norm + single
D: 2-norm + bayes
E: cut-off
F: gaussian
(a) Reuters
 0.54
 0.55
 0.56
 0.57
 0.58
 0.59
 0.6
 0.61
 0.62
 1000  10000  100000
Accuracy (F-score)
# of Active Features
B
D
F E
B: 2-norm + single
D: 2-norm + bayes
E: cut-off
F: gaussian
(b) OHSUMED
Figure 3: Accuracies as a function of the average number of active features for the development sets. For
gaussian, the accuracy with the best σ found by exhaustive search is shown for each cut-off threshold.
when the widths become small as expected.
Figure 3 shows the accuracy of each model as a
function of the number of active features. We can
see that the inequality ME models achieve the high-
est accuracy with a fairly small number of active fea-
tures, removing unnecessary features on their own.
Besides, they consistently achieve much higher ac-
curacies than the cut-off and the Gaussian MAP es-
timation with a small number of features.
Table 1 summarizes the above results including
the best control parameters for the development set,
and shows how well each method performs for the
evaluation set with these parameters. We can see that
the best parameters are valid for the evaluation sets,
and the inequality ME models outperform the other
methods in the evaluation set as well. This means
that the inequality ME model is generally superior
to the cut-off method and the Gaussian MAP estima-
tion. At this point, the 2-norm extension shows the
advantage of being robust, especially for the Reuters
dataset. That is, the 2-norm models outperform the
normal inequality models in the evaluation set. To
see the reason for this, we show the average cross
entropy of each inequality model as a function of
the width factor in Figure 4. The average cross en-
tropy was calculated as −
1
C
summationtext
c
1
L
summationtext
i
log p
c
(y
i
|d
i
),
where C is the number of categories. The cross en-
tropy of the 2-norm model is consistently more sta-
ble than that of the normal inequality model. Al-
though there is no simple relation between the abso-
lute accuracy and the cross entropy, this consistent
difference can be one explanation for the advantage
of the 2-norm extension. Besides, it is possible that
the effect of 2-norm extension appears more clearly
in the Reuters dataset because the robustness is more
important in the Reuters dataset since the develop-
ment set is rather small and easy to overfit.
Lastly, we could not observe the advantage of
bayes method in these experiments. However, since
our method is still in development, it is premature
to conclude that the idea of using different widths
according to its unreliability is not successful. It is
possible that the uncertainty of ˜p(x), which were not
concerned about, is needed to be modeled, or the
Bernoulli trial assumption is inappropriate. Further
investigation on these points must be done.
7 Conclusion and Future Work
We have shown that the inequality ME models
outperform the cut-off method and the Gaussian
MAP estimation, using the two text categoriza-
tion datasets. Besides, the inequality ME models
achieved high accuracies with a small number of
features due to the sparseness of the solution. How-
ever, it is an open question how the inequality ME
model differs from other sophisticated methods of
feature selection based on other criteria.
Future work will investigate the details of the in-
equality model including the effect of the penalty
constants of the 2-norm extension. Evaluations on
other NLP tasks are also planned. In addition, we
need to analyze the inequality ME model further to
Table 1: The summary of the experiments.
Reuters OHSUMED
best setting # active feats acc (dev) acc (eval) best setting # active feats acc (dev) acc (eval)
cut-off cthr=2 16,961.9 83.24 86.38 cthr=0 116,452.0 58.83 58.35
gaussian cthr=3, σ=4.22E3 12,326.6 84.01 87.04 cthr=8, σ=2.55E3 10,154.7 59.53 59.08
ineq+single W =1.78E−11 9,479.9 84.47 87.41 W =4.22E−2 1,375.5 61.23 61.10
2-norm+single W =5.62E−11 6,611.1 84.35 87.59 W =4.50E−2 1,316.5 61.26 61.23
ineq+bayes W =3.16E−15 63,150.0 84.21 87.37 W =9.46 1,136.6 60.65 60.31
2-norm+bayes W =3.16E−9 10,022.3 84.01 87.57 W =9.46 1,154.5 60.67 60.32
 0
 0.02
 0.04
 0.06
 0.08
 0.1
 0.12
 0.14
 0.16
 0.18
 1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 0.01 1
Avg. Entropy
Width Factor
A
B
C
D
A: ineq + single
B: 2-norm + single
C: ineq + bayes
D: 2-norm + bayes
(a) Reuters
 0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
 1.4
 1.6
 1.8
 2
 1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 0.01 1 100
Avg. Entropy
Width Factor
A
B
C
D
A: ineq + single
B: 2-norm + single
C: ineq + bayes
D: 2-norm + bayes
(b) OHSUMED
Figure 4: W vs. the average cross entropy for the development sets.
clarify the reasons for its success.
Acknowledgments We would like to thank
Yusuke Miyao, Yoshimasa Tsuruoka, and the
anonymous reviewers for many helpful comments.

References
S. J. Benson and J. J. Mor´e. A limited memory variable metric
method for bound constraint minimization. Technical Re-
port ANL/MCS-P909-0901, Argonne National Laboratory.
S. Benson, L. C. McInnes, J. J. Mor´e, and J. Sarich. 2002.
TAO users manual. Technical Report ANL/MCS-TM-242-
Revision 1.4, Argonne National Laboratory.
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. 1996. A
maximum entropy approach to natural language processing.
Computational Linguistics, 22(1):39–71.
A. Borthwick. 1999. A maximum entropy approach to named
entity recognition. Ph.D. Thesis. New York University.
S. F. Chen and R. Rosenfeld. 2000. A survey of smoothing
techniques for ME models. IEEE Trans. on Speech and Au-
dio Processing, 8(1):37–50.
J. N. Darroch and D. Ratcliff. 1972. Generalized iterative
scaling for log-linear models. The Annals of Mathematical
Statistics, 43:1470–1480.
W. Hersh, C. Buckley, T.J. Leone, and D. Hickam. 1994.
OHSUMED: An interactive retrieval evaluation and new
large test collection for research. In Proc. of the 17th An-
nual ACM SIGIR Conference, pages 192–201.
S. Khudanpur. 1995. A method of ME estimation with re-
laxed constraints. In Johns Hopkins Univ. Language Model-
ing Workshop, pages 1–17.
R. Lau. 1994. Adaptive statistical language modeling. A Mas-
ter’s Thesis. MIT.
R. Malouf. 2002. A comparison of algorithms for maximum
entropy parameter estimation. In Proc. of the sixth CoNLL.
Y. Miyao and J. Tsujii. 2002. Maximum entropy estimation for
feature forests. In Proc. of HLT 2002.
W. Newman. 1977. Extension to the ME method. In IEEE
Trans. on Information Theory, volume IT-23, pages 89–93.
K. Nigam, J. Lafferty, and A. McCallum. 1999. Using maxi-
mum entropy for text classification. In IJCAI-99 Workshop
on Machine Learning for Information Filtering, pages 61–
67.
S. Pietra, V. Pietra, and J. Lafferty. 1997. Inducing features of
random fields. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 19(4):380–393.
A. Ratnaparkhi. 1996. A maximum entropy model for part-of-
speech tagging. In Proc. of the EMNLP, pages 133–142.
V. Vapnik. 1995. The Nature of Statistical Learning Theory.
Springer Verlag.
