Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 692–699, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Maximum Expected F-Measure Training of Logistic Regression Models
Martin Jansche
Center for Computational Learning Systems
Columbia University
New York, NY 10027, USA
jansche@acm.org
Abstract
We consider the problem of training logis-
tic regression models for binary classifi-
cation in information extraction and infor-
mation retrieval tasks. Fitting probabilis-
tic models for use with such tasks should
take into account the demands of the task-
specific utility function, in this case the
well-known F-measure, which combines
recall and precision into a global measure
of utility. We develop a training proce-
dure based on empirical risk minimiza-
tion / utility maximization and evaluate it
on a simple extraction task.
1 Introduction
Log-linear models have been used in many areas of
Natural Language Processing (NLP) and Information
Retrieval (IR). Scenarios in which log-linear models
have been applied often involve simple binary clas-
sification decisions or probability assignments, as
in the following three examples: Ratnaparkhi et al.
(1994) consider a restricted form of the prepositional
phrase attachment problem where attachment deci-
sions are binary; Ittycheriah et al. (2003) reduce en-
tity mention tracking to the problem of modeling
the probability of two mentions being linked; and
Greiff and Ponte (2000) develop models of proba-
bilistic information retrieval that involve binary de-
cisions of relevance. What is common to all three
approaches is the application of log-linear models to
binary classification tasks.¹ As Ratnaparkhi (1998, p. 27f.) points out, log-linear models of binary response variables are equivalent to, and in fact mere notational variants of, logistic regression models.
In this paper we focus on binary classification
tasks, and in particular on the loss or utility associ-
ated with classification decisions. The three prob-
lems mentioned before – prepositional phrase at-
tachment, entity mention linkage, and relevance of
a document to a query – differ in one crucial aspect:
The first is evaluated in terms of accuracy or, equiva-
lently, symmetric zero–one loss; but the second and
third are treated as information extraction/retrieval
problems and evaluated in terms of recall and preci-
sion. Recall and precision are combined into a single
overall utility function, the well-known F-measure.
It may be desirable to estimate the parameters of a
logistic regression model by maximizing F-measure
during training. This is analogous, and in a cer-
tain sense equivalent, to empirical risk minimiza-
tion, which has been used successfully in related
areas, such as speech recognition (Rahim and Lee,
1997), language modeling (Paciorek and Rosenfeld,
2000), and machine translation (Och, 2003).
The novel contribution of this paper is a training
procedure for (approximately) maximizing the ex-
pected F-measure of a probabilistic classifier based
on a logistic regression model. We formulate a
vector-valued utility function which has a well-
defined expected value; F-measure is then a rational
function of this expectation and can be maximized
numerically under certain conventional regularizing
assumptions.
¹These kinds of log-linear models are also known in the NLP community as "maximum entropy models" (Berger et al., 1996; Ratnaparkhi, 1998). This is an unfortunate choice of terminology, because the term "maximum entropy" does not uniquely determine a family of models unless the constraints subject to which entropy is being maximized are specified.
We begin with a review of logistic regression
(Section 2) and then discuss the use of F-measure
for evaluation (Section 3). We reformulate F-
measure as a function of an expected utility (Sec-
tion 4) which is maximized during training (Sec-
tion 5). We discuss the differences between our pa-
rameter estimation technique and maximum likeli-
hood training on a toy example (Section 6) as well
as on a real extraction task (Section 7). We conclude
with a discussion of further applications and gener-
alizations (Section 8).
2 Review of Logistic Regression
Bernoulli regression models are conditional probability models of a binary response variable Y given a vector X of k explanatory variables (X1, ..., Xk). We will use the convention² that Y takes on a value y ∈ {−1, +1}.
Logistic regression models (Cox, 1958) are per-
haps best viewed as instances of generalized linear
models (Nelder and Wedderburn, 1972; McCullagh
and Nelder, 1989) where the response variable
follows a Bernoulli distribution and the link func-
tion is the logit function. Let us summarize this first,
before expanding the relevant definitions:
Y ∼ Bernoulli(p)
logit(p) = θ0 + x1 θ1 + x2 θ2 + ··· + xk θk

What this means is that there is an unobserved quantity p, the success probability of the Bernoulli distribution, which we interpret as the probability that Y will take on the value +1:

Pr(Y = +1 | X = (x1, x2, ..., xk), θ) = p
The logit (log odds) function is defined as follows:
logit(p) = ln( p / (1 − p) )
The logit function is used to transform a probabil-
ity, constrained to fall within the interval (0,1), into
a real number ranging over (−∞,+∞). The inverse
function of the logit is the cumulative distribution
²The natural choice may seem to be for Y to range over the set {0, 1}, but the convention adopted here is more common for classification problems and has certain advantages which will become clear soon.
function of the standard logistic distribution (also
known as the sigmoid or logistic function), which
we call g:
g(z) = 1 / (1 + exp(−z))
This allows us to write
p = g(θ0 +x1 θ1 +x2 θ2 +···+xk θk).
We also adopt the usual convention that x = (1, x1, x2, ..., xk), which is a (k+1)-dimensional vector whose first component is always 1 and whose remaining k components are the values of the k explanatory variables. So the Bernoulli probability can be expressed as

p = g( ∑_{j=0}^{k} xj θj ) = g(x · θ).
The conditional probability model then takes the
following abbreviated form, which will be used
throughout the rest of this paper:
Pr(+1 | x, θ) = 1 / (1 + exp(−x · θ))   (1)
A classifier can be constructed from this probability model using the MAP decision rule. This means predicting the label +1 if Pr(+1 | x, θ) exceeds 1/2, which amounts to the following:

ymap(x) = argmax_y Pr(y | x, θ) = sgn(x · θ)

This illustrates the well-known result that a MAP classifier derived from a logistic regression model is equivalent to a (single-layer) perceptron (Rosenblatt, 1958) or linear threshold unit.
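The model (1) and the MAP decision rule can be sketched in a few lines of NumPy. This is an illustration, not the paper's implementation; the names `prob_positive` and `y_map` are ours, and x is assumed to already carry the leading 1 for the intercept θ0:

```python
import numpy as np

def sigmoid(z):
    """Inverse of the logit: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(x, theta):
    """Pr(+1 | x, theta) for a logistic regression model, as in Eq. (1)."""
    return sigmoid(np.dot(x, theta))

def y_map(x, theta):
    """MAP decision rule: +1 iff Pr(+1 | x, theta) > 1/2, i.e. sgn(x . theta)."""
    return 1 if np.dot(x, theta) > 0 else -1
```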
3 F-Measure
Suppose the parameter vector θ of a logistic regres-
sion model is known. The performance of the re-
sulting classifier can then be evaluated in terms of
the recall (or sensitivity) and precision of the classi-
fier on an evaluation dataset. Recall (R) and preci-
sion (P) are defined in terms of the number of true
positives (A), misses (B), and false alarms (C) of the
classifier (cf. Table 1):
R = A / (A + B)        P = A / (A + C)
                 predicted
               +1      −1    total
true    +1      A       B     npos
        −1      C       D     nneg
     total    mpos    mneg       n

Table 1: Schema for a 2×2 contingency table
The Fα measure – familiar from Information Retrieval – combines recall and precision into a single utility criterion by taking their α-weighted harmonic mean:

Fα(R, P) = ( α (1/R) + (1−α) (1/P) )⁻¹

The Fα measure can be expressed in terms of the triple (A, B, C) as follows:

Fα(A, B, C) = A / (A + α B + (1−α) C)   (2)
In order to define A, B, and C formally, we use the notation ⟦π⟧ to denote a variant of the Kronecker delta, defined like this, where π is a Boolean expression:

⟦π⟧ = 1 if π;  0 if ¬π
Given an evaluation dataset (x1, y1), ..., (xn, yn), the counts of hits (true positives), misses, and false alarms are, respectively:

A = ∑_{i=1}^{n} ⟦ymap(xi) = +1⟧ ⟦yi = +1⟧

B = ∑_{i=1}^{n} ⟦ymap(xi) = −1⟧ ⟦yi = +1⟧

C = ∑_{i=1}^{n} ⟦ymap(xi) = +1⟧ ⟦yi = −1⟧
Note that F-measure is seemingly a global measure
of utility that applies to an evaluation dataset as a
whole: while the F-measure of a classifier evaluated
on a single supervised instance is well defined, the
overall F-measure on a larger dataset is not a func-
tion of the F-measure evaluated on each instance
in the dataset. This is in contrast to ordinary loss/
utility, whose grand total (or average) on a dataset
can be computed by direct summation.
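The count-based formulation of Eq. (2) can be stated directly in code. A minimal sketch (the helper names `counts` and `f_alpha` are ours), tallying hits, misses, and false alarms over predicted and true labels in {−1, +1}:

```python
def counts(y_pred, y_true):
    """Tally hits A, misses B, and false alarms C over an evaluation set."""
    A = sum(1 for p, t in zip(y_pred, y_true) if p == +1 and t == +1)
    B = sum(1 for p, t in zip(y_pred, y_true) if p == -1 and t == +1)
    C = sum(1 for p, t in zip(y_pred, y_true) if p == +1 and t == -1)
    return A, B, C

def f_alpha(A, B, C, alpha=0.5):
    """F_alpha from hits, misses, and false alarms, as in Eq. (2)."""
    return A / (A + alpha * B + (1 - alpha) * C)
```

Note that, as the text observes, F-measure is computed from the dataset-wide counts (A, B, C); it does not decompose into a sum of per-instance scores.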
4 Relation to Expected Utility
We reformulate F-measure as a scalar-valued ratio-
nal function composed with a vector-valued utility
function. This allows us to define notions of ex-
pected and average utility, setting up the discussion
of parameter estimation in terms of empirical risk
minimization (or rather, utility maximization).
Define the following vector-valued utility function u, where u(ỹ | y) is the utility of choosing the label ỹ if the true label is y:

u(+1 | +1) = (1, 0, 0)
u(−1 | +1) = (0, 1, 0)
u(+1 | −1) = (0, 0, 1)
u(−1 | −1) = (0, 0, 0)
This function indicates whether a classification deci-
sion is a hit, miss, or false alarm. Correct rejections
are not counted.
Expected values are, of course, well-defined for vector-valued functions. For example, the expected utility is

E[u] = ∑_{(x,y)} u(ymap(x) | y) Pr(x, y).

In empirical risk minimization we approximate the expected utility of a classifier by its average utility US on a given dataset S = (x1, y1), ..., (xn, yn):

E[u] ≈ US = (1/n) ∑_{i=1}^{n} u(ymap(xi) | yi)
          = (1/n) ∑_{i=1}^{n} [ u(+1 | yi) ⟦ymap(xi) = +1⟧ + u(−1 | yi) ⟦ymap(xi) = −1⟧ ]
Now it is easy to see that US is the following vector:

US = (1/n) ( ∑_{i=1}^{n} ⟦ymap(xi) = +1⟧ ⟦yi = +1⟧ ,
             ∑_{i=1}^{n} ⟦ymap(xi) = −1⟧ ⟦yi = +1⟧ ,
             ∑_{i=1}^{n} ⟦ymap(xi) = +1⟧ ⟦yi = −1⟧ )   (3)

So US = n⁻¹ (A, B, C) where A, B, and C are as defined before. This means that we can interpret the
F-measure of a classifier as a simple rational func-
tion of its empirical average utility (the scaling fac-
tor 1/n in (3) can in fact be omitted). This allows
us to approach the parameter estimation task as an
empirical risk minimization or utility maximization
problem.
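The vector-valued utility and its empirical average (3) translate directly into code. A small sketch (names are ours), showing that averaging the utility vectors recovers (A, B, C)/n:

```python
import numpy as np

# Vector-valued utility u(y_pred | y_true): components count (hit, miss, false alarm).
U = {(+1, +1): np.array([1, 0, 0]),
     (-1, +1): np.array([0, 1, 0]),
     (+1, -1): np.array([0, 0, 1]),
     (-1, -1): np.array([0, 0, 0])}

def average_utility(y_pred, y_true):
    """Empirical average utility U_S = (A, B, C) / n, as in Eq. (3)."""
    n = len(y_true)
    return sum(U[(p, t)] for p, t in zip(y_pred, y_true)) / n
```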
5 Discriminative Parameter Estimation
In the preceding two sections we assumed that the parameter vector θ was known. Now we turn to the problem of estimating θ by maximizing the F-measure formulated in terms of expected utility. We make the dependence on θ explicit in the formulation of the optimization task:

θ* = argmax_θ Fα(A(θ), B(θ), C(θ)),

where (A(θ), B(θ), C(θ)) = US(θ) as defined in (3).
We encounter the usual problem: the basic quantities involved are integers (counts of hits, misses, and false alarms), and the optimization objective is a piecewise-constant function of the parameter vector θ, due to the fact that θ occurs exclusively inside Kronecker deltas. For example:

⟦ymap(x) = +1⟧ = ⟦Pr(+1 | x, θ) > 0.5⟧

In general, we can set

⟦Pr(+1 | x, θ) > 0.5⟧ ≈ Pr(+1 | x, θ),   (4)
and in the case of logistic regression this arises as a special case of approximating the limit

⟦Pr(+1 | x, θ) > 0.5⟧ = lim_{γ→∞} g(γ x · θ)

with a fixed value of γ = 1. The choice of γ does not matter much. The important point is that we are now dealing with approximate quantities which depend continuously on θ. In particular A(θ) ≈ Ã(θ), where

Ã(θ) = ∑_{i : yi = +1} g(γ xi · θ).   (5)
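The role of γ can be seen numerically: as γ grows, the soft decision g(γ z) approaches the hard indicator ⟦z > 0⟧, and γ = 1 recovers the approximation (4). A small illustration, with g as defined in Section 2:

```python
import math

def g(z):
    """Standard logistic function, the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-z))

# For a fixed positive score z = x . theta, increasing gamma sharpens
# the soft decision toward the hard indicator [[z > 0]] = 1.
soft = g(1 * 0.3)     # gamma = 1: an uncertain decision, close to 1/2
hard = g(100 * 0.3)   # large gamma: effectively a hard decision, close to 1
```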
Since the marginal total of positive instances npos (cf. Table 1) does not depend on θ, we use the identities B̃(θ) = npos − Ã(θ) and m̃pos(θ) = Ã(θ) + C̃(θ) to rewrite the optimization objective as F̃α:

F̃α(θ) = Ã(θ) / ( α npos + (1−α) m̃pos(θ) ),   (6)

where Ã(θ) is given by (5) and m̃pos(θ) is

m̃pos(θ) = ∑_{i=1}^{n} g(γ xi · θ).
Maximization of F̃ as defined in (6) can be carried out numerically using multidimensional optimization techniques like conjugate gradient search (Fletcher and Reeves, 1964) or quasi-Newton methods such as the BFGS algorithm (Broyden, 1967; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970). This requires the evaluation of partial derivatives. The jth partial derivative of F̃ is as follows:

∂F̃(θ)/∂θj = h ∂Ã(θ)/∂θj − h² Ã(θ) (1−α) ∂m̃pos(θ)/∂θj

h = 1 / ( α npos + (1−α) m̃pos(θ) )

∂Ã(θ)/∂θj = ∑_{i : yi = +1} g′(γ xi · θ) γ xij

∂m̃pos(θ)/∂θj = ∑_{i=1}^{n} g′(γ xi · θ) γ xij

g′(z) = g(z) (1 − g(z))
One can compute the value of F̃(θ) and its gradient ∇F̃(θ) simultaneously at a given point θ in O(nk) time and O(k) space. Pseudo-code for such an algorithm appears in Figure 1. In practice, the inner loops on lines 8–9 and 14–18 can be made more efficient by using a sparse representation of the row vectors x[i]. A concrete implementation of this algorithm can then be used as a callback to a multidimensional optimization routine. We use the BFGS minimizer provided by the GNU Scientific Library (Galassi et al., 2003). Important caveat: the function F̃ is generally not concave. We deal with this problem by taking the maximum across several runs of the optimization algorithm starting from random initial values. The next section illustrates this point further.
 x    y
 0   +1
 1   −1
 2   +1
 3   +1

Table 2: Toy dataset
6 Comparison with Maximum Likelihood
A comparison with the method of maximum like-
lihood illustrates two important properties of dis-
criminative parameter estimation. Consider the toy
dataset in Table 2 consisting of four supervised in-
stances with a single explanatory variable. Thus the
logistic regression model has two parameters and
takes the following form:
Prtoy(+1 | x, θ0, θ1) = 1 / (1 + exp(−θ0 − x θ1))

The log-likelihood function L is simply

L(θ0, θ1) = log Prtoy(+1 | 0, θ0, θ1) + log Prtoy(−1 | 1, θ0, θ1) + log Prtoy(+1 | 2, θ0, θ1) + log Prtoy(+1 | 3, θ0, θ1).
A surface plot of L is shown in Figure 2. Ob-
serve that L is concave; its global maximum occurs
near (θ0,θ1)≈(0.35,0.57), and its value is always
strictly negative because the toy dataset is not lin-
early separable. The classifier resulting from maxi-
mum likelihood training predicts the label +1 for all
training instances and thus achieves a recall of 3/3
and precision 3/4 on its training data. The Fα=0.5
measure is 6/7.
Contrast the shape of the log-likelihood function L with the function F̃α. Surface plots of F̃α=0.5 and F̃α=0.25 appear in Figure 3. The figures clearly illustrate the first important (but undesirable) property of F̃, namely the lack of concavity. They also illustrate a desirable property, namely the ability to take into account certain properties of the loss function during training. The F̃α=0.5 surface in the left panel of Figure 3 achieves its maximum in the right corner for (θ0, θ1) → (+∞, +∞). If we choose (θ0, θ1) = (20, 15) the classifier labels every instance of the training data with +1.
fdf(θ):
 1:  m ← 0
 2:  A ← 0
 3:  for j ← 0 to k do
 4:    dm[j] ← 0
 5:    dA[j] ← 0
 6:  for i ← 1 to n do
 7:    p ← 0
 8:    for j ← 0 to k do
 9:      p ← p + x[i][j] × θ[j]
10:    p ← 1/(1 + exp(−p))
11:    m ← m + p
12:    if y[i] = +1 then
13:      A ← A + p
14:    for j ← 0 to k do
15:      t ← p × (1 − p) × x[i][j]
16:      dm[j] ← dm[j] + t
17:      if y[i] = +1 then
18:        dA[j] ← dA[j] + t
19:  h ← 1/(α × npos + (1 − α) × m)
20:  F ← h × A
21:  t ← F × (1 − α)
22:  for j ← 0 to k do
23:    dF[j] ← h × (dA[j] − t × dm[j])
24:  return (F, dF)

Figure 1: Algorithm for computing F̃ and ∇F̃
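The algorithm in Figure 1 can be transcribed into vectorized NumPy form. The sketch below is ours, not the paper's GSL-based implementation; it exposes γ as a parameter (Figure 1 fixes γ = 1) and returns the objective (6) together with its gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fdf(theta, X, y, alpha=0.5, gamma=1.0):
    """Soft F-measure (Eq. 6) and its gradient, mirroring Figure 1.

    X is an n x (k+1) design matrix whose first column is all ones;
    y holds labels in {-1, +1}. Runs in O(nk) time and O(k) space.
    """
    p = sigmoid(gamma * (X @ theta))   # soft decisions g(gamma * x_i . theta)
    pos = (y == +1)
    n_pos = pos.sum()
    A = p[pos].sum()                   # soft hit count        ~A(theta)
    m = p.sum()                        # soft positive count   ~m_pos(theta)
    h = 1.0 / (alpha * n_pos + (1.0 - alpha) * m)
    F = h * A
    t = gamma * p * (1.0 - p)          # g'(gamma * x_i . theta) * gamma
    dA = X[pos].T @ t[pos]             # gradient of ~A
    dm = X.T @ t                       # gradient of ~m_pos
    dF = h * (dA - F * (1.0 - alpha) * dm)
    return F, dF
```

A function of this shape can serve as the callback the text describes, handed to any gradient-based minimizer (negated, since minimizers descend).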
[Figure 2: Surface plot of L(θ0, θ1) on the toy dataset]
Observe the difference between the F̃α=0.5 surface and the F̃α=0.25 surface in the right hand panel of Figure 3: F̃α=0.25 achieves its maximum in the back corner for (θ0, θ1) → (−∞, +∞). If we set (θ0, θ1) = (−20, 15) the resulting classifier labels the first two
[Figure 3: Surface plots of F̃α=0.5 (left) and F̃α=0.25 (right) on the toy dataset]
instances (x = 0 and x = 1) as −1 and the last two
instances (x = 2 and x = 3) as +1.
The classifier trained according to the F̃α=0.5 criterion achieves an Fα=0.5 measure of 6/7 ≈ 0.86, compared with 4/5 = 0.80 for the classifier trained according to the F̃α=0.25 criterion. Conversely, that classifier achieves an Fα=0.25 measure of 8/9 ≈ 0.89, compared with 4/5 = 0.80 for the classifier trained according to the F̃α=0.5 criterion. This demonstrates
that the training procedure can effectively take infor-
mation from the utility function into account, pro-
ducing a classifier that performs well under a given
evaluation criterion. This is the result of optimizing
a task-specific utility function during training, not
simply a matter of adjusting the decision threshold
of a trained classifier.
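The cross-comparison on the toy dataset can be checked directly from the counts by restating Eq. (2) (a sketch; the variable names are ours):

```python
def f_alpha(A, B, C, alpha):
    """F_alpha(A, B, C) = A / (A + alpha*B + (1 - alpha)*C), Eq. (2)."""
    return A / (A + alpha * B + (1 - alpha) * C)

# Classifier trained on ~F_{0.5}: labels all four toy instances +1,
# giving A = 3 hits, B = 0 misses, C = 1 false alarm.
f_all_pos = {a: f_alpha(3, 0, 1, a) for a in (0.5, 0.25)}

# Classifier trained on ~F_{0.25}: labels x in {0, 1} as -1 and
# x in {2, 3} as +1, giving A = 2, B = 1, C = 0.
f_split = {a: f_alpha(2, 1, 0, a) for a in (0.5, 0.25)}
```

Each classifier wins under the criterion it was trained for (6/7 vs. 4/5 at α = 0.5; 8/9 vs. 4/5 at α = 0.25), matching the figures quoted in the text.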
7 Evaluation on an Extraction Problem
We evaluated our discriminative training procedure
on a real extraction problem arising in broadcast
news summarization. The overall task is to summa-
rize the stories in an audio news broadcast (or in the
audio portion of an A/V broadcast). We assume that
story boundaries have been identified and that each
story has been broken up into sentence-like units. A
simple way of summarizing a story is then to classify
each sentence as either belonging into a summary or
not, so that a relevant subset of sentences can be ex-
tracted to form the basis of a summary. What makes
the classification task hard, and therefore interesting,
is the fact that reliable features are hard to come by.
Existing approaches such as Maskey and Hirschberg (2005) do well only when combining diverse features such as lexical cues, acoustic properties, structural/positional features, etc.
The task has another property which renders it
problematic, and which prompted us to develop
the discriminative training procedure described in
this paper. Summarization, by definition, aims for
brevity. This means that in any dataset the number
of positive instances will be much smaller than the
number of negative instances. Given enough data,
balance could be restored by discarding negative in-
stances. This, however, was not an option in our
case: a moderate amount of manually labeled data
had been produced and about one third would have
had to be discarded to achieve a balance in the dis-
tribution of class labels. This would have eliminated
precious supervised training data, which we were
not prepared to do.
The training and test data were prepared by
Maskey and Hirschberg (2005), who performed the
feature engineering, imputation of missing values,
and the training–test split. We used the data un-
changed in order to allow for a comparison between
approaches. The dataset comprises 30 variables: one binary response variable, one binary explanatory variable, and 28 integer- and real-valued explanatory variables. The training portion consists of 3,535 instances, the test portion of 408 instances.
We fitted logistic regression models in three different ways: by maximum likelihood (ML), by F̃α=0.5 maximization, and by F̃α=0.75 maximization. Each
Method      R       P        Fα=0.5   Fα=0.75
ML          24/99   24/33    0.3636   0.2909
ML†         85/99   85/229   0.5183   0.6464
F̃α=0.5      85/99   85/211   0.5484   0.6693
F̃α=0.75     95/99   95/330   0.4429   0.6061

Table 3: Evaluation results
classifier was evaluated on the test dataset and its re-
call (R), precision (P), Fα=0.5 measure, and Fα=0.75
measure recorded. The results appear in Table 3.
The row labeled ML† is special: the classifier used here is the logistic regression model fitted by maximum likelihood; what is different is that the threshold for positive predictions was adjusted post hoc to match the number of true positives of the first discriminatively trained classifier. This has the same effect as manually adjusting the threshold parameter θ0 based on partial knowledge of the test data (via the performance of another classifier) and is thus not permissible. It is interesting to note, however, that the ML-trained classifier performs worse than the F̃α=0.5-trained classifier even when one parameter is adjusted by an oracle with knowledge of the test data and the performance of the other classifier.
Fitting a model based on F̃α=0.75, which gives increased weight to recall compared with F̃α=0.5, led to higher recall as expected. However, we also expected that the Fα=0.75 score of the F̃α=0.75-trained classifier would be higher than the Fα=0.75 score of the F̃α=0.5-trained classifier. This is not the case, and could be due to the optimization getting stuck in a local maximum, or it may have been an unreasonable expectation to begin with.
8 Conclusions
We have presented a novel estimation procedure for probabilistic classifiers which we call, by a slight abuse of terminology, maximum expected F-measure training. We made use of the fact that expected utility computations can be carried out in a vector space, and that, for purposes of maximization, an ordering of vectors can be imposed by auxiliary functions like the F-measure (2).
This technique is quite general and well suited for
working with other quantities that can be expressed
in terms of hits, misses, false alarms, correct rejec-
tions, etc. In particular, it could be used to find a
point estimate which provides a certain tradeoff be-
tween specificity and sensitivity, or operating point.
A more general method would try to optimize sev-
eral such operating points simultaneously, an issue
which we will leave for future research.
The classifiers discussed in this paper are logistic
regression models. However, this choice is not cru-
cial. The approximation (4) is reasonable for binary
decisions in general, and one can use it in conjunc-
tion with any well-behaved conditional Bernoulli
model or related classifier. For Support Vector Ma-
chines, approximate F-measure maximization was
introduced by Musicant et al. (2003).
Maximizing F-measure during training seems es-
pecially well suited for dealing with skewed classes.
This can happen by accident, because of the nature
of the problem as in our summarization example
above, or by design: for example, one can expect
skewed binary classes as the result of the one-vs-all
reduction of multi-class classification to binary clas-
sification; and in multi-stage classification one may
want to alternate between classifiers with high recall
and classifiers with high precision.
Finally, the ability to incorporate non-standard
tradeoffs between precision and recall at training
time is useful in many information extraction and
retrieval applications. Human end-users often create
asymmetries between precision and recall, for good
reasons: they may prefer to err on the side of caution
(e.g., it is less of a problem to let an unwanted spam
email reach a user than it is to hold back a legitimate
message), or they may be better at some tasks than
others (e.g., search engine users are good at filtering
out irrelevant documents returned by a query, but are
not equipped to crawl the web in order to look for
relevant information that was not retrieved). In the
absence of methods that work well for a wide range
of operating points, we need training procedures that
can be made sensitive to rare cases depending on the
particular demands of the application.
Acknowledgements
I would like to thank Julia Hirschberg, Sameer
Maskey, and the three anonymous reviewers for
helpful comments. I am especially grateful to
Sameer Maskey for allowing me to use his speech
summarization dataset for the evaluation in Sec-
tion 7. The usual disclaimers apply.
References
Adam L. Berger, Vincent J. Della Pietra, and
Stephen A. Della Pietra. 1996. A maximum
entropy approach to natural language processing.
Computational Linguistics, 22(1):39–71.
C. G. Broyden. 1967. Quasi-Newton methods and
their application to function minimisation. Math-
ematics of Computation, 21(99):368–381.
D. R. Cox. 1958. The regression analysis of binary
sequences. Journal of the Royal Statistical Soci-
ety, Series B (Methodological), 20(2):215–242.
R. Fletcher. 1970. A new approach to variable
metric algorithms. The Computer Journal,
13(3):317–322. doi:10.1093/comjnl/13.3.317.
R. Fletcher and C. M. Reeves. 1964. Func-
tion minimization by conjugate gradients.
The Computer Journal, 7(2):149–154.
doi:10.1093/comjnl/7.2.149.
Mark Galassi, Jim Davies, James Theiler, Brian
Gough, Gerard Jungman, Michael Booth, and
Fabrice Rossi. 2003. GNU Scientific Library
Reference Manual. Network Theory, Bristol,
UK, second edition.
Donald Goldfarb. 1970. A family of variable-metric
methods derived by variational means. Mathe-
matics of Computation, 24(109):23–26.
Warren R. Greiff and Jay M. Ponte. 2000.
The maximum entropy approach and prob-
abilistic IR models. ACM Transactions
on Information Systems, 18(3):246–287.
doi:10.1145/352595.352597.
Abraham Ittycheriah, Lucian Lita, Nanda Kamb-
hatla, Nicolas Nicolov, Salim Roukos, and
Margo Stys. 2003. Identifying and tracking
entity mentions in a maximum entropy frame-
work. In HLT/NAACL 2003. ACL Anthology
N03-2014.
Sameer Maskey and Julia Hirschberg. 2005. Com-
paring lexical, acoustic/prosodic, structural and
discourse features for speech summarization. In
Interspeech 2005 (Eurospeech).
P. McCullagh and J. A. Nelder. 1989. Generalized
Linear Models. Chapman & Hall/CRC, Boca
Raton, FL, second edition.
David R. Musicant, Vipin Kumar, and Aysel Ozgur.
2003. Optimizing F-measure with Support Vec-
tor Machines. In FLAIRS 16, pages 356–360.
J. A. Nelder and R. W. M. Wedderburn. 1972. Gen-
eralized linear models. Journal of the Royal Sta-
tistical Society, Series A (General), 135(3):370–
384.
Franz Josef Och. 2003. Minimum error rate train-
ing in statistical machine translation. In ACL 41.
ACL Anthology P03-1021.
Chris Paciorek and Roni Rosenfeld. 2000. Mini-
mum classification error training in exponential
language models. In NIST/DARPA Speech Tran-
scription Workshop.
Mazin Rahim and Chin-Hui Lee. 1997. String-
based minimum verification error (SB-MVE)
training for speech recognition. Computer,
Speech and Language, 11(2):147–160.
Adwait Ratnaparkhi. 1998. Maximum Entropy
Models for Natural Language Ambiguity Reso-
lution. Ph.D. thesis, University of Pennsylvania,
Computer and Information Science.
Adwait Ratnaparkhi, Jeff Reynar, and Salim
Roukos. 1994. A maximum entropy model
for prepositional phrase attachment. In ARPA
Human Language Technology Workshop, pages
250–255. ACL Anthology H94-1048.
Frank Rosenblatt. 1958. The perceptron: A prob-
abilistic model for information storage and or-
ganization in the brain. Psychological Review,
65(6):386–408.
D. F. Shanno. 1970. Conditioning of quasi-Newton
methods for function minimization. Mathematics
of Computation, 24(111):647–656.
